public void nrtOpenDir() {
    try {
        Document doc = new Document();
        Field f = new Field("f", "test", Store.YES, Index.ANALYZED);
        doc.add(f);
        for (int i = 0; i < 20; i++) {
            w.addDocument(doc);
            w.commit(); // a reader opened on the directory only sees committed changes
            IndexReader r = IndexReader.open(dir); // opens a brand-new reader on every pass
            System.out.println(r.numDocs());
            r.close(); // release the reader so file handles do not leak
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
1. With a large amount of data, indexWriter.commit() is very time-consuming! (This may be a costly operation, so you should test the cost in your application and do it only when really necessary.)
/**
 * reopen -> openIfChanged
 */
public void nrtReopen() {
    try {
        Document doc = new Document();
        Field f = new Field("f", "test", Store.YES, Index.ANALYZED);
        doc.add(f);
        IndexReader r = IndexReader.open(dir);
        for (int i = 0; i < 20; i++) {
            w.addDocument(doc);
            w.commit();
            // r = r.reopen();
            IndexReader nr = IndexReader.openIfChanged(r); // returns null if the index has not changed
            if (nr != null) {
                r.close(); // the old reader is no longer needed
                r = nr;
            }
            System.out.println(r.numDocs());
        }
        r.close();
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
open(dir) really is a time-consuming operation. openIfChanged is cheaper than open because it only refreshes the newly added part of the index. (Opening an IndexReader is an expensive operation. This method can be used to refresh an existing IndexReader to reduce these costs. This method tries to only load segments that have changed or were created after the IndexReader was (re)opened.)
public void nrtNRT() {
    try {
        Document doc = new Document();
        Field f = new Field("f", "test", Store.YES, Index.ANALYZED);
        doc.add(f);
        for (int i = 0; i < 20; i++) {
            w.addDocument(doc); // no commit needed for a near-real-time reader
            // IndexReader r = w.getReader();
            IndexReader r = IndexReader.open(w, false); // reader obtained from the writer
            System.out.println(r.numDocs());
            r.close(); // release the reader so file handles do not leak
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
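// Time readDocCount refreshes through openIfChanged
long bt = System.currentTimeMillis();
IndexReader r = IndexReader.open(dir);
for (int i = 0; i < readDocCount; i++) {
    IndexReader nr = IndexReader.openIfChanged(r); // null when nothing has changed
    if (nr != null) {
        r.close(); // close the stale reader before switching
        r = nr;
    }
}
long et = System.currentTimeMillis();
System.out.println("reopen:" + (et - bt) + "ms");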
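// Time readDocCount near-real-time readers obtained from the writer
long bt = System.currentTimeMillis();
IndexReader r = null;
for (int i = 0; i < readDocCount; i++) {
    if (r != null) {
        r.close(); // close the previous reader before opening the next one
    }
    r = IndexReader.open(w, false);
}
long et = System.currentTimeMillis();
System.out.println("nrt:" + (et - bt) + "ms");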
When you ask for the IndexReader from the IndexWriter, the IndexWriter will be flushed (docs accumulated in RAM will be written to disk) but not committed (fsync files, write new segments file, etc). The returned IndexReader will search over previously committed segments, as well as the new, flushed but not committed segment. Because flushing will likely be processor rather than IO bound, this should be a process that can be attacked with more processor power if found to be too slow.
Also, deletes are carried in RAM, rather than flushed to disk, which may help in eking out a bit more speed. The result is that you can add and remove documents from a Lucene index in 'near' real time by continuously asking for a new Reader from the IndexWriter every second or every couple of seconds. I haven't seen a non-synthetic test yet, but it looks like it has been tested at around 50 document updates per second without heavy slowdown (e.g. the results are visible every second).
The patch takes advantage of LUCENE-1483, which keys FieldCaches and Filters at the individual segment level rather than at the index level. This allows you to reload caches only for the segments that changed rather than for the whole index, which is essential for real-time search with filter/cache use.
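The per-segment keying described above can be pictured with a small toy model. This is only a sketch in plain Java and not Lucene's FieldCache API: the class SegmentKeyedCache, its methods, and the segment names "_0", "_1", "_2" are all hypothetical, and copying an int array stands in for an expensive cache load. It shows that when a refresh exposes one new segment, only that segment is loaded, while entries for unchanged segments are reused.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a cache keyed per segment. Hypothetical; not Lucene's FieldCache. */
final class SegmentKeyedCache {
    // segment name -> cached values for that segment
    private final Map<String, int[]> bySegment = new LinkedHashMap<>();
    // total number of (simulated) expensive per-segment loads
    private int loads = 0;

    /** Syncs the cache with the current segments; returns how many were newly loaded. */
    int refresh(Map<String, int[]> currentSegments) {
        // Drop entries for segments that no longer exist (for example, merged away).
        bySegment.keySet().retainAll(currentSegments.keySet());
        int loadedNow = 0;
        for (Map.Entry<String, int[]> e : currentSegments.entrySet()) {
            if (!bySegment.containsKey(e.getKey())) {
                // Stand-in for an expensive load of one segment's data.
                bySegment.put(e.getKey(), e.getValue().clone());
                loadedNow++;
            }
        }
        loads += loadedNow;
        return loadedNow;
    }

    int totalLoads() {
        return loads;
    }

    int cachedSegments() {
        return bySegment.size();
    }

    public static void main(String[] args) {
        SegmentKeyedCache cache = new SegmentKeyedCache();
        Map<String, int[]> segments = new LinkedHashMap<>();
        segments.put("_0", new int[] {1, 2, 3});
        segments.put("_1", new int[] {4, 5});
        System.out.println(cache.refresh(segments)); // prints 2: both segments loaded
        segments.put("_2", new int[] {6});           // a refresh exposes one new segment
        System.out.println(cache.refresh(segments)); // prints 1: only the new segment loaded
    }
}
```

With a cache keyed at the index level, the second refresh would have had to reload all three segments; here the cost is proportional to what changed.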
Near realtime search means that documents are available for search almost immediately after being indexed: additions and updates to documents are seen in 'near' realtime.
Near realtime search will be added to Solr in version 4.0 and is currently available on trunk.
You can now modify a commit command to be a 'soft' commit. A soft commit will avoid parts of the standard commit that can be costly. You still will want to do normal commits to ensure that documents are on stable storage, but soft commits allow users to see a very near realtime view of the index in the meantime. Be sure to pay special attention to cache and autowarm settings as they can have a significant impact on NRT performance.
You can read about soft commits here: http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22
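As a rough illustration, a soft commit can be requested by adding a softCommit attribute to the commit message posted to the update handler. The attribute name follows the UpdateXmlMessages page linked above; treat the fragment as a sketch and check it against the Solr version you run.

```xml
<!-- soft commit: make recent changes searchable without the full cost of a hard commit -->
<commit softCommit="true"/>
```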
You can see how to auto soft commit here: http://wiki.apache.org/solr/SolrConfigXml?#Update_Handler_Section
A common configuration might be to 'hard' auto commit every 1-10 minutes and 'soft' auto commit every second. With this configuration, new documents will show up within about a second of being added, and if the power goes out, you will be certain to have a consistent index up to the last 'hard' commit.
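The configuration described above might look roughly like the following solrconfig.xml fragment. The intervals are illustrative assumptions (a hard commit every 5 minutes, a soft commit every second), not recommended values; maxTime is given in milliseconds.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: write to stable storage every 5 minutes (assumed value) -->
  <autoCommit>
    <maxTime>300000</maxTime>
  </autoCommit>
  <!-- soft commit: make new documents visible about once per second -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

Shorter soft commit intervals make documents visible sooner but open new searchers more often, which is why the cache and autowarm settings mentioned earlier matter.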