  • Lucene Source Code Analysis: The Inverted Index (Part 3)

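    The previous article located the collect(…) method, whose parameter is the ID of a matching document. From the surrounding code, doc is obtained from iterator.nextDoc(). So when is DefaultBulkScorer.iterator assigned? The code is as follows.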

    public abstract class Weight implements SegmentCacheable {
        protected static class DefaultBulkScorer extends BulkScorer {
            // ...
            public DefaultBulkScorer(Scorer scorer) {
                // ...
                this.scorer = scorer;
                this.iterator = scorer.iterator();
                this.twoPhase = scorer.twoPhaseIterator();
            }
            // ...
        }
    }
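
    The loop that consumes such an iterator can be sketched as follows. This is a minimal stand-alone model, not Lucene code: ArrayDocIterator and BulkScoringSketch are hypothetical names, and the list append stands in for collector.collect(doc). The only detail taken from Lucene is the end-of-iteration sentinel, which is Integer.MAX_VALUE (DocIdSetIterator.NO_MORE_DOCS).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a doc-ID iterator (hypothetical, not the Lucene class). */
final class ArrayDocIterator {
    /** Sentinel returned once the iterator is exhausted. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // doc IDs in ascending order
    private int pos = -1;

    ArrayDocIterator(int... docs) {
        this.docs = docs;
    }

    /** Advances to the next doc ID, or returns NO_MORE_DOCS at the end. */
    int nextDoc() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

final class BulkScoringSketch {
    /** Drains the iterator and records every doc ID it yields. */
    static List<Integer> scoreAll(ArrayDocIterator iterator) {
        List<Integer> collected = new ArrayList<>();
        for (int doc = iterator.nextDoc();
             doc != ArrayDocIterator.NO_MORE_DOCS;
             doc = iterator.nextDoc()) {
            collected.add(doc); // stands in for collector.collect(doc)
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(scoreAll(new ArrayDocIterator(2, 5, 9))); // [2, 5, 9]
    }
}
```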
    

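    In the constructor, scorer.iterator() returns the iterator over the matching document IDs. So where does scorer come from? Recall the Weight.bulkScorer(…) method, shown below. As established in the previous article, scorer(context) is implemented by TermWeight.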

    public abstract class Weight implements SegmentCacheable {
        public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
            Scorer scorer = scorer(context);
            // ...
            return new DefaultBulkScorer(scorer);
        }
    }
    
    public class TermQuery extends Query {
        final class TermWeight extends Weight {
            @Override
            public Scorer scorer(LeafReaderContext context) throws IOException {    
                final TermsEnum termsEnum = getTermsEnum(context);
                if (termsEnum == null) {
                    return null;
                }
                PostingsEnum docs = termsEnum.postings(null, needsScores ? PostingsEnum.FREQS : PostingsEnum.NONE);
                assert docs != null;
                return new TermScorer(this, docs, similarity.simScorer(stats, context));
            }
        }
    }
    
    final class TermScorer extends Scorer {
        private final PostingsEnum postingsEnum;
        TermScorer(Weight weight, PostingsEnum td, Similarity.SimScorer docScorer) {
            super(weight);
            this.docScorer = docScorer;
            this.postingsEnum = td;
        }
        @Override
        public DocIdSetIterator iterator() {
            return postingsEnum;
        }
    }
    

    At this point it is clear that scorer.iterator() comes from termsEnum.postings(...). The inverted index is starting to come into view.
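
    To make the idea of "postings" concrete, here is a minimal in-memory inverted index. It is a sketch under simplifying assumptions (whitespace tokenization, lowercase terms, no frequencies or positions); ToyInvertedIndex and its methods are hypothetical and do not reflect Lucene's on-disk format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical toy index: maps each term to the doc IDs that contain it. */
final class ToyInvertedIndex {
    // term -> ascending list of doc IDs (the term's postings)
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Indexes one whitespace-tokenized document and returns its doc ID. */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Doc IDs only grow, so appending keeps each list sorted;
            // skip the append when the term repeats inside the same document.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        return docId;
    }

    /** Returns the doc IDs for a term, or an empty list for an unknown term. */
    List<Integer> postings(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Lucene in Action");
        index.addDocument("Lucene source analysis");
        index.addDocument("inverted index analysis");
        System.out.println(index.postings("lucene"));   // [0, 1]
        System.out.println(index.postings("analysis")); // [1, 2]
    }
}
```

    Looking up a term and iterating its list is the toy equivalent of what termsEnum.postings(...) followed by nextDoc() does for a single segment.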

    The next step is to determine the concrete type of termsEnum and what its postings(...) method does.

    As shown above, termsEnum comes from TermQuery.getTermsEnum(...). The code is as follows.

    public class TermQuery extends Query {
        private TermsEnum getTermsEnum(LeafReaderContext context) throws IOException {
            final TermState state = termStates.get(context.ord);
            final TermsEnum termsEnum = context.reader().terms(term.field()).iterator();
            termsEnum.seekExact(term.bytes(), state);
            return termsEnum;
        }
    }
    
    public final class LeafReaderContext extends IndexReaderContext {
        private final LeafReader reader;
    }
    

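    LeafReader is an abstract class and does not itself implement terms(...), so context.reader() is not a bare LeafReader but an instance of one of its subclasses. From the previous article, each LeafReaderContext is an element of IndexSearcher.leafContexts, so finding where IndexSearcher.leafContexts is assigned reveals the concrete type of context.reader().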

    public class IndexSearcher {
        public IndexSearcher(IndexReader r) {
            this(r, null);
        }
        
        public IndexSearcher(IndexReader r, ExecutorService executor) {
            this(r.getContext(), executor);
        }
        
        public IndexSearcher(IndexReaderContext context, ExecutorService executor) {
            // ...
            leafContexts = context.leaves();
            // ...
        }
    }
    

    This code shows that IndexSearcher.leafContexts comes from IndexReader.getContext().leaves(). Typically this IndexReader is the StandardDirectoryReader instance returned by DirectoryReader.open(...). The code is as follows.

    public abstract class DirectoryReader extends BaseCompositeReader<LeafReader> {
        public static DirectoryReader open(final Directory directory) throws IOException {
            return StandardDirectoryReader.open(directory, null);
        }
    }
    

    So IndexSearcher.leafContexts actually comes from StandardDirectoryReader.getContext().leaves(). StandardDirectoryReader inherits getContext() through the class hierarchy below.

    public final class StandardDirectoryReader extends DirectoryReader {
        // ...
    }
    
    public abstract class DirectoryReader extends BaseCompositeReader<LeafReader> {
        // ...
    }
    
    public abstract class BaseCompositeReader<R extends IndexReader> extends CompositeReader {
        // ...
    }
    
    public abstract class CompositeReader extends IndexReader {
        @Override
        public final CompositeReaderContext getContext() {
            // ...
            readerContext = CompositeReaderContext.create(this);
            return readerContext;
        }
    }
    
    // leaves() and the leaves field belong to the context, not to the reader
    public final class CompositeReaderContext extends IndexReaderContext {
        private final List<LeafReaderContext> leaves;
        
        @Override
        public List<LeafReaderContext> leaves() throws UnsupportedOperationException {
            // ...
            return leaves;
        }
    }
    

    How does CompositeReaderContext.create(…) build the context?

    public final class CompositeReaderContext extends IndexReaderContext {   
        static CompositeReaderContext create(CompositeReader reader) {
            return new Builder(reader).build();
        }
    
        private static final class Builder {
            public Builder(CompositeReader reader) {
                this.reader = reader;
            }
    
            public CompositeReaderContext build() {
                return (CompositeReaderContext) build(null, reader, 0, 0);
            }
    
            private IndexReaderContext build(CompositeReaderContext parent, IndexReader reader, int ord, int docBase) {
                if (reader instanceof LeafReader) {
                    final LeafReader ar = (LeafReader) reader;
                    final LeafReaderContext atomic = new LeafReaderContext(parent, ar, ord, docBase, leaves.size(), leafDocBase);
                    leaves.add(atomic);
                    leafDocBase += reader.maxDoc();
                    return atomic;
                } else {
                    final CompositeReader cr = (CompositeReader) reader;
                    final List<? extends IndexReader> sequentialSubReaders = cr.getSequentialSubReaders();
                    final List<IndexReaderContext> children = Arrays.asList(new IndexReaderContext[sequentialSubReaders.size()]);
                    final CompositeReaderContext newParent;
                    if (parent == null) {
                        newParent = new CompositeReaderContext(cr, children, leaves);
                    } else {
                        newParent = new CompositeReaderContext(parent, cr, ord, docBase, children);
                    }
                    int newDocBase = 0;
                    for (int i = 0, c = sequentialSubReaders.size(); i < c; i++) {
                        final IndexReader r = sequentialSubReaders.get(i);
                        children.set(i, build(newParent, r, i, newDocBase));
                        newDocBase += r.maxDoc();
                    }
                    assert newDocBase == cr.maxDoc();
                    return newParent;
                }
            }
        }
        
        private CompositeReaderContext(CompositeReaderContext parent, CompositeReader reader, int ordInParent, int docbaseInParent, List<IndexReaderContext> children, List<LeafReaderContext> leaves) {
            this.leaves = leaves == null ? null : Collections.unmodifiableList(leaves);
            // ...
        }
    }
    

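    When build(...) is first called, the reader passed in is a StandardDirectoryReader, which is a CompositeReader, so the else branch runs: getSequentialSubReaders() returns all of its sub-readers, and build(...) is called recursively on each of them. Each sub-reader is a LeafReader, so it is wrapped in a new LeafReaderContext (with the sub-reader as its reader field), and that LeafReaderContext is added to leaves.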

    Therefore the reader of each LeafReaderContext in IndexSearcher.leafContexts is one of the sub-readers returned by StandardDirectoryReader.getSequentialSubReaders().
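
    The recursive flattening performed by build(...) can be modeled with a small stand-alone sketch. ToyReader, ToyLeafContext, and ToyLeavesBuilder are hypothetical names; the sketch keeps only the parts relevant here (the leaf ordinal and the running leafDocBase) and drops parent links and child contexts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader tree node: either a leaf (one segment) or a composite. */
final class ToyReader {
    final int maxDoc;               // number of documents under this node
    final List<ToyReader> children; // null for a leaf

    private ToyReader(int maxDoc, List<ToyReader> children) {
        this.maxDoc = maxDoc;
        this.children = children;
    }

    static ToyReader leaf(int maxDoc) {
        return new ToyReader(maxDoc, null);
    }

    static ToyReader composite(ToyReader... subReaders) {
        int total = 0;
        for (ToyReader r : subReaders) {
            total += r.maxDoc;
        }
        return new ToyReader(total, Arrays.asList(subReaders));
    }

    boolean isLeaf() {
        return children == null;
    }
}

/** A leaf plus its position in the flattened list and its global doc ID offset. */
final class ToyLeafContext {
    final ToyReader reader;
    final int ord;     // index of this leaf in the leaves list
    final int docBase; // global doc ID of this leaf's first document

    ToyLeafContext(ToyReader reader, int ord, int docBase) {
        this.reader = reader;
        this.ord = ord;
        this.docBase = docBase;
    }
}

final class ToyLeavesBuilder {
    private final List<ToyLeafContext> leaves = new ArrayList<>();
    private int leafDocBase = 0;

    /** Walks the tree depth-first and returns the leaves in order. */
    static List<ToyLeafContext> build(ToyReader top) {
        ToyLeavesBuilder builder = new ToyLeavesBuilder();
        builder.visit(top);
        return builder.leaves;
    }

    private void visit(ToyReader reader) {
        if (reader.isLeaf()) {
            leaves.add(new ToyLeafContext(reader, leaves.size(), leafDocBase));
            leafDocBase += reader.maxDoc;
        } else {
            for (ToyReader child : reader.children) {
                visit(child);
            }
        }
    }

    public static void main(String[] args) {
        ToyReader top = ToyReader.composite(ToyReader.leaf(3), ToyReader.leaf(4), ToyReader.leaf(2));
        for (ToyLeafContext ctx : build(top)) {
            System.out.println("ord=" + ctx.ord + " docBase=" + ctx.docBase);
        }
    }
}
```

    With three leaves of 3, 4, and 2 documents, the leaves get docBase 0, 3, and 7, which is how a segment-local doc ID is later mapped back to an index-wide doc ID.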

    public final class StandardDirectoryReader extends DirectoryReader {
        static DirectoryReader open(final Directory directory, final IndexCommit commit) throws IOException {
            return new SegmentInfos.FindSegmentsFile<DirectoryReader>(directory) {
                @Override
                protected DirectoryReader doBody(String segmentFileName) throws IOException {
                    SegmentInfos sis = SegmentInfos.readCommit(directory, segmentFileName);
                    final SegmentReader[] readers = new SegmentReader[sis.size()];
                    boolean success = false;
                    try {
                        for (int i = sis.size()-1; i >= 0; i--) {
                            readers[i] = new SegmentReader(sis.info(i), sis.getIndexCreatedVersionMajor(), IOContext.READ);
                        }
    
                        DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, false, false);
                        success = true;
    
                        return reader;
                    }
                    // ...
                }
            }.run(commit);
        }
    
        StandardDirectoryReader(Directory directory, LeafReader[] readers, IndexWriter writer, SegmentInfos sis, boolean applyAllDeletes, boolean writeAllDeletes) throws IOException {
            super(directory, readers);
            this.writer = writer;
            this.segmentInfos = sis;
            this.applyAllDeletes = applyAllDeletes;
            this.writeAllDeletes = writeAllDeletes;
        }
    }
    
    public abstract class DirectoryReader extends BaseCompositeReader<LeafReader> {
        protected DirectoryReader(Directory directory, LeafReader[] segmentReaders) throws IOException {
            super(segmentReaders);
            this.directory = directory;
        }
    }
    
    public abstract class BaseCompositeReader<R extends IndexReader> extends CompositeReader {
        protected BaseCompositeReader(R[] subReaders) throws IOException {
            this.subReaders = subReaders;
            // ...
        }
    }
    

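    From this we can conclude that reader is a SegmentReader, and that class (strictly speaking, its parent class CodecReader) does implement terms(…). The code is as follows.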

    public final class SegmentReader extends CodecReader {
        // ...
        final SegmentCoreReaders core;
    
        @Override
        public FieldsProducer getPostingsReader() {
            return core.fields;
        }
    }
    
    public abstract class CodecReader extends LeafReader implements Accountable {
        @Override
        public final Terms terms(String field) throws IOException {
            return getPostingsReader().terms(field);
        }
    }
    
    final class SegmentCoreReaders {
        final FieldsProducer fields;
        
        SegmentCoreReaders(Directory dir, SegmentCommitInfo si, IOContext context) throws IOException {
            // ...
            final Codec codec = si.info.getCodec();
            final PostingsFormat format = codec.postingsFormat();
            fields = format.fieldsProducer(segmentReadState);
            // ...
        }
    }
    

    In lucene-7.3.0 the default codec is Lucene70Codec and the default postingsFormat is Lucene50PostingsFormat. That analysis is left to Lucene Source Code Analysis: segment (to be added later).

    So SegmentReader.terms(…) actually calls terms(…) on the FieldsProducer returned by Lucene50PostingsFormat.fieldsProducer(…).

    public final class Lucene50PostingsFormat extends PostingsFormat {
        @Override
        public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
            PostingsReaderBase postingsReader = new Lucene50PostingsReader(state);
            FieldsProducer ret = new BlockTreeTermsReader(postingsReader, state);
            return ret;
        }
    }
    

    Ultimately, SegmentReader.terms(…) calls BlockTreeTermsReader.terms(…).

    public final class BlockTreeTermsReader extends FieldsProducer {
        @Override
        public Terms terms(String field) throws IOException {
            return fields.get(field);
        }
        
        private final TreeMap<String,FieldReader> fields = new TreeMap<>();
        
        public BlockTreeTermsReader(PostingsReaderBase postingsReader, SegmentReadState state) throws IOException {
            this.postingsReader = postingsReader;
            fields.put(fieldInfo.name, new FieldReader(...));
        }
    }
    

    So BlockTreeTermsReader.terms(…) actually returns a FieldReader.
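
    The two-level lookup (field name first, then term) can be sketched with nested sorted maps. ToyFieldsProducer is a hypothetical class; it mirrors only the TreeMap keyed by field name seen in the excerpt above and says nothing about how BlockTreeTermsReader stores terms on disk. As in Lucene, terms(field) here returns null when the field has no indexed terms, so callers must check for that.

```java
import java.util.TreeMap;

/** Hypothetical per-segment lookup: field name -> term -> doc IDs. */
final class ToyFieldsProducer {
    private final TreeMap<String, TreeMap<String, int[]>> fields = new TreeMap<>();

    /** Registers the postings of one term in one field. */
    void put(String field, String term, int... docs) {
        fields.computeIfAbsent(field, f -> new TreeMap<>()).put(term, docs);
    }

    /** Returns the term dictionary of a field, or null if the field is unknown. */
    TreeMap<String, int[]> terms(String field) {
        return fields.get(field);
    }

    /** Resolves field then term; returns an empty array if either is missing. */
    int[] postings(String field, String term) {
        TreeMap<String, int[]> terms = terms(field);
        if (terms == null) {
            return new int[0];
        }
        int[] docs = terms.get(term);
        return docs == null ? new int[0] : docs;
    }

    public static void main(String[] args) {
        ToyFieldsProducer producer = new ToyFieldsProducer();
        producer.put("title", "lucene", 0, 2);
        producer.put("body", "lucene", 1);
        System.out.println(producer.postings("title", "lucene").length);  // 2
        System.out.println(producer.postings("author", "lucene").length); // 0
    }
}
```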

    Now revisit the core code from earlier.

    public class TermQuery extends Query {
        final class TermWeight extends Weight {
            @Override
            public Scorer scorer(LeafReaderContext context) throws IOException {    
                final TermsEnum termsEnum = getTermsEnum(context);
                if (termsEnum == null) {
                    return null;
                }
                PostingsEnum docs = termsEnum.postings(null, needsScores ? PostingsEnum.FREQS : PostingsEnum.NONE);
                assert docs != null;
                return new TermScorer(this, docs, similarity.simScorer(stats, context));
            }
        }
        
        private TermsEnum getTermsEnum(LeafReaderContext context) throws IOException {
            final TermState state = termStates.get(context.ord);
            final TermsEnum termsEnum = context.reader().terms(term.field()).iterator();
            termsEnum.seekExact(term.bytes(), state);
            return termsEnum;
        }
    }
    

    So termsEnum is the result of FieldReader.iterator(), which is a SegmentTermsEnum.

    public final class FieldReader extends Terms implements Accountable {
        @Override
        public TermsEnum iterator() throws IOException {
            return new SegmentTermsEnum(this);
        }
    }
    

    So termsEnum.postings(…) is SegmentTermsEnum.postings(…).

    final class SegmentTermsEnum extends TermsEnum {
        @Override
        public PostingsEnum postings(PostingsEnum reuse, int flags) throws IOException {   
            currentFrame.decodeMetaData();
            return fr.parent.postingsReader.postings(fr.fieldInfo, currentFrame.state, reuse, flags);
        }
    
        final FieldReader fr;
    }
    
    public final class FieldReader extends Terms implements Accountable {
        final BlockTreeTermsReader parent;
    }
    
    public final class BlockTreeTermsReader extends FieldsProducer {
        final PostingsReaderBase postingsReader;
    }
    

    fr is assigned in the SegmentTermsEnum constructor.

    final class SegmentTermsEnum extends TermsEnum {
        public SegmentTermsEnum(FieldReader fr) throws IOException {
            this.fr = fr;
        }
    }
    

    And this FieldReader is constructed in the BlockTreeTermsReader constructor.

    public final class BlockTreeTermsReader extends FieldsProducer {   
        public BlockTreeTermsReader(PostingsReaderBase postingsReader, SegmentReadState state) throws IOException {
            // ...
            fields.put(fieldInfo.name, new FieldReader(this,...));
        }
    }
    
    public final class FieldReader extends Terms implements Accountable {
        FieldReader(BlockTreeTermsReader parent,...) throws IOException {
            this.parent = parent;
        }
    }
    

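    So fr.parent is a BlockTreeTermsReader, and fr.parent.postingsReader is the Lucene50PostingsReader that Lucene50PostingsFormat.fieldsProducer(…) passed to its constructor. This is the core class for reading the inverted index.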

  • Original article: https://www.cnblogs.com/studyhs/p/9092928.html