Solr 4.8.0 Source Code Analysis (13): Lucene Core Index Repair

Preface: Today at work I was digging into Elasticsearch when I came across a blog post claiming that Elasticsearch has an index-repair feature. Curious, I clicked through and found that it is actually functionality built into Lucene Core. To be honest, back when I was studying the Lucene file formats I wanted to write a tool to parse and verify index files, and I even got partway through one; I did not expect to find that such a tool already exists, and it makes a perfect reference to study against.

     

Index repair is implemented mainly by the CheckIndex.java class; a good way to get oriented is to read the class's main() method.
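Besides the command line shown in the next section, the same checks can be driven from Java code. Here is a minimal sketch of my own (not from the post's source), assuming the 4.8 API; the index path is hypothetical, and the boolean passed to setInfoStream plays the role of -verbose:

    import java.io.File;

    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CheckIndexDemo {
      public static void main(String[] args) throws Exception {
        // Hypothetical path -- point this at a real index directory.
        Directory dir = FSDirectory.open(new File("test/data/index"));
        try {
          CheckIndex checker = new CheckIndex(dir);
          checker.setInfoStream(System.out, false); // true behaves like -verbose
          // Read-only check, like running the CLI without -fix.
          CheckIndex.Status status = checker.checkIndex();
          System.out.println(status.clean ? "No problems detected" : "Index has problems");
        } finally {
          dir.close();
        }
      }
    }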

1. Using CheckIndex

First, run the following command to see how to use the lucene-core jar:

    192:lib rcf$ java -cp lucene-core-4.8-SNAPSHOT.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex

    ERROR: index path not specified

    Usage: java org.apache.lucene.index.CheckIndex pathToIndex [-fix] [-crossCheckTermVectors] [-segment X] [-segment Y] [-dir-impl X]

      -fix: actually write a new segments_N file, removing any problematic segments
      -crossCheckTermVectors: verifies that term vectors match postings; THIS IS VERY SLOW!
      -codec X: when fixing, codec to write the new segments_N file with
      -verbose: print additional details
      -segment X: only check the specified segments.  This can be specified multiple
                  times, to check more than one segment, eg '-segment _2 -segment _a'.
                  You can't use this with the -fix option
      -dir-impl X: use a specific FSDirectory implementation. If no package is specified the org.apache.lucene.store package will be used.

    **WARNING**: -fix should only be used on an emergency basis as it will cause
    documents (perhaps many) to be permanently removed from the index.  Always make
    a backup copy of your index before running this!  Do not run this tool on an index
    that is actively being written to.  You have been warned!

    Run without -fix, this tool will open the index, report version information
    and report any exceptions it hits and what action it would take if -fix were
    specified.  With -fix, this tool will remove any segments that have issues and
    write a new segments_N file.  This means all documents contained in the affected
    segments will be removed.

    This tool exits with exit code 1 if the index cannot be opened or has any
    corruption, else 0.

Typing java -cp lucene-core-4.8-SNAPSHOT.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex prints what is effectively the help message. But why such an odd-looking command? Checking java -help shows that -cp is the same as -classpath, the search path for classes and jars, and -ea is the same as -enableassertions, which turns assertions on. So the command above can be simplified to java -cp lucene-core-4.8-SNAPSHOT.jar org.apache.lucene.index.CheckIndex (dropping -ea merely leaves assertions disabled).

First, let's check a healthy index. As you can see, the report is quite clear:

    userdeMacBook-Pro:lib rcf$ java -cp lucene-core-4.8-SNAPSHOT.jar -ea:org.apache.lucene.index...  org.apache.lucene.index.CheckIndex ../../../../../../solr/Solr/test/data/index

    Opening index @ ../../../../../../solr/Solr/test/data/index

    Segments file=segments_r numSegments=7 version=4.8 format= userData={commitTimeMSec=1411221019854}
      1 of 7: name=_k docCount=18001
        codec=Lucene46
        compound=false
        numFiles=10
        size (MB)=0.493
        diagnostics = {timestamp=1411221019346, os=Mac OS X, os.version=10.9.4, mergeFactor=10, source=merge, lucene.version=4.8-SNAPSHOT Unversioned directory - rcf - 2014-09-20 21:11:36, os.arch=x86_64, mergeMaxNumSegments=-1, java.version=1.7.0_60, java.vendor=Oracle Corporation}
        no deletions
        test: open reader.........OK
        test: check integrity.....OK
        test: check live docs.....OK
        test: fields..............OK [3 fields]
        test: field norms.........OK [1 fields]
        test: terms, freq, prox...OK [36091 terms; 54003 terms/docs pairs; 18001 tokens]
        test: stored fields.......OK [54003 total field count; avg 3 fields per doc]
        test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
        test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]

      2 of 7: name=_l docCount=1000
        codec=Lucene46
        compound=false
        numFiles=10
        size (MB)=0.028
        diagnostics = {timestamp=1411221019406, os=Mac OS X, os.version=10.9.4, source=flush, lucene.version=4.8-SNAPSHOT Unversioned directory - rcf - 2014-09-20 21:11:36, os.arch=x86_64, java.version=1.7.0_60, java.vendor=Oracle Corporation}
        no deletions
        test: open reader.........OK
        test: check integrity.....OK
        test: check live docs.....OK
        test: fields..............OK [3 fields]
        test: field norms.........OK [1 fields]
        test: terms, freq, prox...OK [2002 terms; 3000 terms/docs pairs; 1000 tokens]
        test: stored fields.......OK [3000 total field count; avg 3 fields per doc]
        test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
        test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]

      3 of 7: name=_m docCount=1000
        codec=Lucene46
        compound=false
        numFiles=10
        size (MB)=0.028
        diagnostics = {timestamp=1411221019478, os=Mac OS X, os.version=10.9.4, source=flush, lucene.version=4.8-SNAPSHOT Unversioned directory - rcf - 2014-09-20 21:11:36, os.arch=x86_64, java.version=1.7.0_60, java.vendor=Oracle Corporation}
        no deletions
        test: open reader.........OK
        test: check integrity.....OK
        test: check live docs.....OK
        test: fields..............OK [3 fields]
        test: field norms.........OK [1 fields]
        test: terms, freq, prox...OK [2002 terms; 3000 terms/docs pairs; 1000 tokens]
        test: stored fields.......OK [3000 total field count; avg 3 fields per doc]
        test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
        test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]

      4 of 7: name=_n docCount=1000
        codec=Lucene46
        compound=false
        numFiles=10
        size (MB)=0.028
        diagnostics = {timestamp=1411221019552, os=Mac OS X, os.version=10.9.4, source=flush, lucene.version=4.8-SNAPSHOT Unversioned directory - rcf - 2014-09-20 21:11:36, os.arch=x86_64, java.version=1.7.0_60, java.vendor=Oracle Corporation}
        no deletions
        test: open reader.........OK
        test: check integrity.....OK
        test: check live docs.....OK
        test: fields..............OK [3 fields]
        test: field norms.........OK [1 fields]
        test: terms, freq, prox...OK [2002 terms; 3000 terms/docs pairs; 1000 tokens]
        test: stored fields.......OK [3000 total field count; avg 3 fields per doc]
        test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
        test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]

      5 of 7: name=_o docCount=1000
        codec=Lucene46
        compound=false
        numFiles=10
        size (MB)=0.028
        diagnostics = {timestamp=1411221019629, os=Mac OS X, os.version=10.9.4, source=flush, lucene.version=4.8-SNAPSHOT Unversioned directory - rcf - 2014-09-20 21:11:36, os.arch=x86_64, java.version=1.7.0_60, java.vendor=Oracle Corporation}
        no deletions
        test: open reader.........OK
        test: check integrity.....OK
        test: check live docs.....OK
        test: fields..............OK [3 fields]
        test: field norms.........OK [1 fields]
        test: terms, freq, prox...OK [2002 terms; 3000 terms/docs pairs; 1000 tokens]
        test: stored fields.......OK [3000 total field count; avg 3 fields per doc]
        test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
        test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]

      6 of 7: name=_p docCount=1000
        codec=Lucene46
        compound=false
        numFiles=10
        size (MB)=0.028
        diagnostics = {timestamp=1411221019739, os=Mac OS X, os.version=10.9.4, source=flush, lucene.version=4.8-SNAPSHOT Unversioned directory - rcf - 2014-09-20 21:11:36, os.arch=x86_64, java.version=1.7.0_60, java.vendor=Oracle Corporation}
        no deletions
        test: open reader.........OK
        test: check integrity.....OK
        test: check live docs.....OK
        test: fields..............OK [3 fields]
        test: field norms.........OK [1 fields]
        test: terms, freq, prox...OK [2002 terms; 3000 terms/docs pairs; 1000 tokens]
        test: stored fields.......OK [3000 total field count; avg 3 fields per doc]
        test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
        test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]

      7 of 7: name=_q docCount=1000
        codec=Lucene46
        compound=false
        numFiles=10
        size (MB)=0.027
        diagnostics = {timestamp=1411221019863, os=Mac OS X, os.version=10.9.4, source=flush, lucene.version=4.8-SNAPSHOT Unversioned directory - rcf - 2014-09-20 21:11:36, os.arch=x86_64, java.version=1.7.0_60, java.vendor=Oracle Corporation}
        no deletions
        test: open reader.........OK
        test: check integrity.....OK
        test: check live docs.....OK
        test: fields..............OK [3 fields]
        test: field norms.........OK [1 fields]
        test: terms, freq, prox...OK [2001 terms; 3000 terms/docs pairs; 1000 tokens]
        test: stored fields.......OK [3000 total field count; avg 3 fields per doc]
        test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
        test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]

    No problems were detected with this index.

Since my index files are intact, let's use an example from the web to see what the output looks like for a corrupted index, and what -fix actually does:

From: http://blog.csdn.net/laigood/article/details/8296678

    Segments file=segments_2cg numSegments=26 version=3.6.1 format=FORMAT_3_1 [Lucene 3.1+] userData={translog_id=1347536741715}
      1 of 26: name=_59ct docCount=4711242
        compound=false
        hasProx=true
        numFiles=9
        size (MB)=6,233.694
        diagnostics = {mergeFactor=13, os.version=2.6.32-71.el6.x86_64, os=Linux, lucene.version=3.6.1 1362471 - thetaphi - 2012-07-17 12:40:12, source=merge, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.6.0_24, java.vendor=Sun Microsystems Inc.}
        has deletions [delFileName=_59ct_1b.del]
        test: open reader.........OK [3107 deleted docs]
        test: fields..............OK [25 fields]
        test: field norms.........OK [10 fields]
        test: terms, freq, prox...OK [36504908 terms; 617641081 terms/docs pairs; 742052507 tokens]
        test: stored fields.......ERROR [read past EOF: MMapIndexInput(path="/usr/local/sas/escluster/data/cluster/nodes/0/indices/index/5/index/_59ct.fdt")]
    java.io.EOFException: read past EOF: MMapIndexInput(path="/usr/local/sas/escluster/data/cluster/nodes/0/indices/index/5/index/_59ct.fdt")
            at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(MMapDirectory.java:307)
            at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:400)
            at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:253)
            at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:492)
            at org.apache.lucene.index.IndexReader.document(IndexReader.java:1138)
            at org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:852)
            at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:581)
            at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1064)
        test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
    FAILED
        WARNING: fixIndex() would remove reference to this segment; full exception:
    java.lang.RuntimeException: Stored Field test failed
            at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:593)
            at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1064)


    WARNING: 1 broken segments (containing 4708135 documents) detected
    WARNING: 4708135 documents will be lost
The check result shows that the _59ct.fdt index file in shard 5 is corrupted. The .fdt file stores the index's stored fields, which is why the error surfaces in the "test: stored fields" step. The warning below it says that one broken segment, containing 4708135 documents, was detected. A simplified sketch of what this stored-fields pass does is shown below.
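Roughly speaking (this is my own simplification, not the exact CheckIndex code), the stored-fields test loads the stored fields of every live document, which is exactly how a truncated .fdt ends up as the "read past EOF" above:

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.util.Bits;

    class StoredFieldsCheckSketch {
      // Visit every live document and load its stored fields; a truncated
      // .fdt file surfaces here as an EOFException, as in the log above.
      static long visitStoredFields(AtomicReader reader) throws IOException {
        Bits liveDocs = reader.getLiveDocs(); // null when the segment has no deletions
        long totalFields = 0;
        for (int j = 0; j < reader.maxDoc(); j++) {
          if (liveDocs != null && !liveDocs.get(j)) {
            continue; // skip deleted documents
          }
          totalFields += reader.document(j).getFields().size();
        }
        return totalFields;
      }
    }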
Adding the -fix parameter to the original command repairs the index (note: back up the index before repairing, and never run the repair on an index that is actively being written to):
    java -cp lucene-core-3.6.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/local/sas/escluster/data/cluster/nodes/0/indices/index/5/index/ -fix
    NOTE: will write new segments file in 5 seconds; this will remove 4708135 docs from the index. THIS IS YOUR LAST CHANCE TO CTRL+C!
      5...
      4...
      3...
      2...
      1...
    Writing...
    OK
    Wrote new segments file "segments_2ch"
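For completeness, here is a sketch of the programmatic equivalent (my own code, assuming the 4.8 API where fixIndex takes the Status returned by a previous full check); the same backup warning applies:

    import java.io.IOException;

    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;

    class FixIndexSketch {
      // Programmatic -fix: run a FULL check first (the Status of a partial,
      // -segment style check cannot be fixed), then rewrite segments_N.
      static void checkAndFix(Directory dir) throws IOException {
        CheckIndex checker = new CheckIndex(dir);
        CheckIndex.Status status = checker.checkIndex(); // full check of all segments
        if (!status.clean) {
          // Drops every segment that failed a test -- its documents are gone for good.
          checker.fixIndex(status);
        }
      }
    }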

You can also check a single segment:

    userdeMacBook-Pro:lib rcf$ java -cp lucene-core-4.8-SNAPSHOT.jar -ea:org.apache.lucene.index...  org.apache.lucene.index.CheckIndex ../../../../../../solr/Solr/test/data/index -segment _9

    Opening index @ ../../../../../../solr/Solr/test/data/index

    Segments file=segments_r numSegments=7 version=4.8 format= userData={commitTimeMSec=1411221019854}

    Checking only these segments: _9:
    No problems were detected with this index.
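Programmatically, this maps to the checkIndex(List&lt;String&gt;) overload we will see in the source below. A small sketch of mine (again assuming the 4.8 API); note that such a partial check cannot be combined with a fix:

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;

    class CheckOneSegmentSketch {
      // Equivalent of "-segment _9": pass the segment names to check.
      static void checkSegment(Directory dir) throws IOException {
        CheckIndex checker = new CheckIndex(dir);
        CheckIndex.Status status = checker.checkIndex(Arrays.asList("_9"));
        // status.partial is true for such a run, so it cannot be fed to fixIndex().
        System.out.println(status.clean ? "segment _9 is clean" : "segment _9 has problems");
      }
    }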

You can also pass -verbose for more detail; I won't go into that here.

2. The CheckIndex source code

Next, let's study how the CheckIndex source implements the behavior above. The checking logic is concentrated in the checkIndex() method:

    public Status checkIndex(List<String> onlySegments) throws IOException {
      ...
      final int numSegments = sis.size();                        // number of segments
      final String segmentsFileName = sis.getSegmentsFileName(); // name of the segments_N file
      // note: we only read the format byte (required preamble) here!
      IndexInput input = null;
      try {
        input = dir.openInput(segmentsFileName, IOContext.READONCE); // open the segments_N file
      } catch (Throwable t) {
        msg(infoStream, "ERROR: could not open segments file in directory");
        if (infoStream != null)
          t.printStackTrace(infoStream);
        result.cantOpenSegments = true;
        return result;
      }
      int format = 0;
      try {
        format = input.readInt();  // read the segments_N format
      } catch (Throwable t) {
        msg(infoStream, "ERROR: could not read segment file version in directory");
        if (infoStream != null)
          t.printStackTrace(infoStream);
        result.missingSegmentVersion = true;
        return result;
      } finally {
        if (input != null)
          input.close();
      }

      String sFormat = "";
      boolean skip = false;

      result.segmentsFileName = segmentsFileName; // segments_N file name
      result.numSegments = numSegments;           // number of segments
      result.userData = sis.getUserData();        // user data, e.g. userData={commitTimeMSec=1411221019854}
      String userDataString;
      if (sis.getUserData().size() > 0) {
        userDataString = " userData=" + sis.getUserData();
      } else {
        userDataString = "";
      }
      // version information, e.g. version=4.8
      String versionString = null;
      if (oldSegs != null) {
        if (foundNonNullVersion) {
          versionString = "versions=[" + oldSegs + " .. " + newest + "]";
        } else {
          versionString = "version=" + oldSegs;
        }
      } else {
        versionString = oldest.equals(newest) ? ( "version=" + oldest ) : ("versions=[" + oldest + " .. " + newest + "]");
      }

      msg(infoStream, "Segments file=" + segmentsFileName + " numSegments=" + numSegments
          + " " + versionString + " format=" + sFormat + userDataString);

      if (onlySegments != null) {
        result.partial = true;
        if (infoStream != null) {
          infoStream.print("\nChecking only these segments:");
          for (String s : onlySegments) {
            infoStream.print(" " + s);
          }
        }
        result.segmentsChecked.addAll(onlySegments);
        msg(infoStream, ":");
      }

      if (skip) {
        msg(infoStream, "\nERROR: this index appears to be created by a newer version of Lucene than this tool was compiled on; please re-compile this tool on the matching version of Lucene; exiting");
        result.toolOutOfDate = true;
        return result;
      }

      result.newSegments = sis.clone();
      result.newSegments.clear();
      result.maxSegmentName = -1;
      // iterate over the segments, checking each in turn
      for (int i = 0; i < numSegments; i++) {
        final SegmentCommitInfo info = sis.info(i); // per-segment metadata
        // segment names are the segment counter in base 36, e.g. "_k" -> 20
        int segmentName = Integer.parseInt(info.info.name.substring(1), Character.MAX_RADIX);
        if (segmentName > result.maxSegmentName) {
          result.maxSegmentName = segmentName;
        }
        if (onlySegments != null && !onlySegments.contains(info.info.name)) {
          continue;
        }
        Status.SegmentInfoStatus segInfoStat = new Status.SegmentInfoStatus();
        result.segmentInfos.add(segInfoStat);
        // segment ordinal, name and document count, e.g. "1 of 7: name=_k docCount=18001"
        msg(infoStream, "  " + (1+i) + " of " + numSegments + ": name=" + info.info.name + " docCount=" + info.info.getDocCount());
        segInfoStat.name = info.info.name;
        segInfoStat.docCount = info.info.getDocCount();

        final String version = info.info.getVersion();
        if (info.info.getDocCount() <= 0 && version != null && versionComparator.compare(version, "4.5") >= 0) {
          throw new RuntimeException("illegal number of documents: maxDoc=" + info.info.getDocCount());
        }

        int toLoseDocCount = info.info.getDocCount();

        AtomicReader reader = null;

        try {
          final Codec codec = info.info.getCodec(); // codec, e.g. codec=Lucene46
          msg(infoStream, "    codec=" + codec);
          segInfoStat.codec = codec;
          msg(infoStream, "    compound=" + info.info.getUseCompoundFile()); // compound-file flag, e.g. compound=false
          segInfoStat.compound = info.info.getUseCompoundFile();
          msg(infoStream, "    numFiles=" + info.files().size());
          segInfoStat.numFiles = info.files().size();            // number of files in the segment, e.g. numFiles=10
          segInfoStat.sizeMB = info.sizeInBytes()/(1024.*1024.); // segment size, e.g. size (MB)=0.493
          if (info.info.getAttribute(Lucene3xSegmentInfoFormat.DS_OFFSET_KEY) == null) {
            // don't print size in bytes if its a 3.0 segment with shared docstores
            msg(infoStream, "    size (MB)=" + nf.format(segInfoStat.sizeMB));
          }
          // diagnostics, e.g. diagnostics = {timestamp=1411221019346, os=Mac OS X, os.version=10.9.4,
          // mergeFactor=10, source=merge, lucene.version=4.8-SNAPSHOT ..., os.arch=x86_64,
          // mergeMaxNumSegments=-1, java.version=1.7.0_60, java.vendor=Oracle Corporation}
          Map<String,String> diagnostics = info.info.getDiagnostics();
          segInfoStat.diagnostics = diagnostics;
          if (diagnostics.size() > 0) {
            msg(infoStream, "    diagnostics = " + diagnostics);
          }
          // report deletions, e.g. "no deletions" or "has deletions [delGen=...]"
          if (!info.hasDeletions()) {
            msg(infoStream, "    no deletions");
            segInfoStat.hasDeletions = false;
          } else {
            msg(infoStream, "    has deletions [delGen=" + info.getDelGen() + "]");
            segInfoStat.hasDeletions = true;
            segInfoStat.deletionsGen = info.getDelGen();
          }

          // open a SegmentReader to prove the segment is readable at all; a throw
          // here means it cannot be opened, e.g. "test: open reader.........OK"
          if (infoStream != null)
            infoStream.print("    test: open reader.........");
          reader = new SegmentReader(info, DirectoryReader.DEFAULT_TERMS_INDEX_DIVISOR, IOContext.DEFAULT);
          msg(infoStream, "OK");

          segInfoStat.openReaderPassed = true;
          // verify file integrity, e.g. "test: check integrity.....OK";
          // checkIntegrity() validates each file's checksum via CodecUtil.checksumEntireFile()
          if (infoStream != null)
            infoStream.print("    test: check integrity.....");
          reader.checkIntegrity();
          msg(infoStream, "OK");

          // Verify the document counts, e.g. "test: check live docs.....OK".
          // With deletions: reader.numDocs() must equal info.info.getDocCount() - info.getDelCount(),
          // and the number of set bits in liveDocs must match reader.numDocs().
          // Without deletions: liveDocs is null and reader.maxDoc() must equal info.info.getDocCount().
          // (Solr's admin UI obtains its current/deleted/total document counts the same way.)
          if (infoStream != null)
            infoStream.print("    test: check live docs.....");
          final int numDocs = reader.numDocs();
          toLoseDocCount = numDocs;
          if (reader.hasDeletions()) {
            if (reader.numDocs() != info.info.getDocCount() - info.getDelCount()) {
              throw new RuntimeException("delete count mismatch: info=" + (info.info.getDocCount() - info.getDelCount()) + " vs reader=" + reader.numDocs());
            }
            if ((info.info.getDocCount()-reader.numDocs()) > reader.maxDoc()) {
              throw new RuntimeException("too many deleted docs: maxDoc()=" + reader.maxDoc() + " vs del count=" + (info.info.getDocCount()-reader.numDocs()));
            }
            if (info.info.getDocCount() - numDocs != info.getDelCount()) {
              throw new RuntimeException("delete count mismatch: info=" + info.getDelCount() + " vs reader=" + (info.info.getDocCount() - numDocs));
            }
            Bits liveDocs = reader.getLiveDocs();
            if (liveDocs == null) {
              throw new RuntimeException("segment should have deletions, but liveDocs is null");
            } else {
              int numLive = 0;
              for (int j = 0; j < liveDocs.length(); j++) {
                if (liveDocs.get(j)) {
                  numLive++;
                }
              }
              if (numLive != numDocs) {
                throw new RuntimeException("liveDocs count mismatch: info=" + numDocs + ", vs bits=" + numLive);
              }
            }

            segInfoStat.numDeleted = info.info.getDocCount() - numDocs;
            msg(infoStream, "OK [" + (segInfoStat.numDeleted) + " deleted docs]");
          } else {
            if (info.getDelCount() != 0) {
              throw new RuntimeException("delete count mismatch: info=" + info.getDelCount() + " vs reader=" + (info.info.getDocCount() - numDocs));
            }
            Bits liveDocs = reader.getLiveDocs();
            if (liveDocs != null) {
              // its ok for it to be non-null here, as long as none are set right?
              // (This looks a bit questionable: with no deleted documents, liveDocs is normally null.)
              for (int j = 0; j < liveDocs.length(); j++) {
                if (!liveDocs.get(j)) {
                  throw new RuntimeException("liveDocs mismatch: info says no deletions but doc " + j + " is deleted.");
                }
              }
            }
            msg(infoStream, "OK");
          }
          if (reader.maxDoc() != info.info.getDocCount()) {
            throw new RuntimeException("SegmentReader.maxDoc() " + reader.maxDoc() + " != SegmentInfos.docCount " + info.info.getDocCount());
          }

          // Test getFieldInfos(): field status and count,
          // e.g. "test: fields..............OK [3 fields]"
          if (infoStream != null) {
            infoStream.print("    test: fields..............");
          }
          FieldInfos fieldInfos = reader.getFieldInfos();
          msg(infoStream, "OK [" + fieldInfos.size() + " fields]");
          segInfoStat.numFields = fieldInfos.size();

          // Test Field Norms, e.g. "test: field norms.........OK [1 fields]"
          segInfoStat.fieldNormStatus = testFieldNorms(reader, infoStream);

          // Test the Term Index, e.g. "test: terms, freq, prox...OK [36091 terms; 54003 terms/docs pairs; 18001 tokens]"
          segInfoStat.termIndexStatus = testPostings(reader, infoStream, verbose);

          // Test Stored Fields, e.g. "test: stored fields.......OK [54003 total field count; avg 3 fields per doc]"
          segInfoStat.storedFieldStatus = testStoredFields(reader, infoStream);

          // Test Term Vectors, e.g. "test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]"
          segInfoStat.termVectorStatus = testTermVectors(reader, infoStream, verbose, crossCheckTermVectors);

          // Test DocValues, e.g. "test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]"
          segInfoStat.docValuesStatus = testDocValues(reader, infoStream);

          // Rethrow the first exception we encountered
          //  This will cause stats for failed segments to be incremented properly
          if (segInfoStat.fieldNormStatus.error != null) {
            throw new RuntimeException("Field Norm test failed");
          } else if (segInfoStat.termIndexStatus.error != null) {
            throw new RuntimeException("Term Index test failed");
          } else if (segInfoStat.storedFieldStatus.error != null) {
            throw new RuntimeException("Stored Field test failed");
          } else if (segInfoStat.termVectorStatus.error != null) {
            throw new RuntimeException("Term Vector test failed");
          } else if (segInfoStat.docValuesStatus.error != null) {
            throw new RuntimeException("DocValues test failed");
          }

          msg(infoStream, "");

        } catch (Throwable t) {
          msg(infoStream, "FAILED");
          String comment;
          comment = "fixIndex() would remove reference to this segment";
          msg(infoStream, "    WARNING: " + comment + "; full exception:");
          if (infoStream != null)
            t.printStackTrace(infoStream);
          msg(infoStream, "");
          result.totLoseDocCount += toLoseDocCount;
          result.numBadSegments++;
          continue;
        } finally {
          if (reader != null)
            reader.close();
        }

        // Keeper
        result.newSegments.add(info.clone());
      }

      if (0 == result.numBadSegments) {
        result.clean = true;
      } else
        msg(infoStream, "WARNING: " + result.numBadSegments + " broken segments (containing " + result.totLoseDocCount + " documents) detected");

      if ( ! (result.validCounter = (result.maxSegmentName < sis.counter))) {
        result.clean = false;
        result.newSegments.counter = result.maxSegmentName + 1;
        msg(infoStream, "ERROR: Next segment name counter " + sis.counter + " is not greater than max segment name " + result.maxSegmentName);
      }

      if (result.clean) {
        msg(infoStream, "No problems were detected with this index.\n");
      }

      return result;
    }
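One detail worth calling out from the loop above: segment names such as _k are just the segment counter printed in base 36, which is why the code parses them with Character.MAX_RADIX before comparing against SegmentInfos.counter. A tiny worked example:

    class SegmentNameDemo {
      public static void main(String[] args) {
        // Strip the leading "_" and parse in base 36 (Character.MAX_RADIX == 36).
        int n = Integer.parseInt("_k".substring(1), Character.MAX_RADIX);
        System.out.println(n); // prints 20; SegmentInfos.counter must stay above this
      }
    }

This is also why CheckIndex reports "Next segment name counter ... is not greater than max segment name" as an error and bumps the counter when fixing.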

I'll continue with the source of testFieldNorms and the other per-test methods tomorrow.

Original post: https://www.cnblogs.com/rcfeng/p/4044763.html