The org.apache.lucene.demo.IndexFiles class indexes files recursively. Once an IndexWriter has been constructed, Documents can be added to it, which is where the actual index building happens. Each directory is traversed; since a directory may itself contain directories, a depth-first recursion is used to reach the leaf files (ordinary files with an extension, such as my.txt), at which point the code below is invoked:
static void indexDocs(IndexWriter writer, File file)
    throws IOException {
  // only try to index readable files
  if (file.canRead()) {
    if (file.isDirectory()) {
      // file is a directory; it may contain files, subdirectories, or nothing
      String[] files = file.list();
      // list() returns null on an I/O error
      if (files != null) {
        for (int i = 0; i < files.length; i++) {
          // recurse depth-first into every entry under this directory
          indexDocs(writer, new File(file, files[i]));
        }
      }
    } else {
      // a leaf: an ordinary file rather than a directory, so index it
      System.out.println("adding " + file);
      try {
        writer.addDocument(FileDocument.Document(file));
      } catch (FileNotFoundException fnfe) {
        // the file may have vanished or become unreadable since listing; skip it
      }
    }
  }
}
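The traversal itself is independent of Lucene. The sketch below isolates that depth-first walk in plain Java (the class name WalkDemo and the collecting-into-a-list variation are mine, for illustration; the demo passes each leaf to the IndexWriter instead):

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WalkDemo {
    // Depth-first traversal mirroring IndexFiles.indexDocs:
    // recurse into directories, collect the leaf files.
    static void collectFiles(File file, List<File> out) {
        if (!file.canRead()) return;
        if (file.isDirectory()) {
            String[] children = file.list();   // null on I/O error
            if (children != null) {
                for (String child : children) {
                    collectFiles(new File(file, child), out);
                }
            }
        } else {
            out.add(file);                     // a leaf: an ordinary file
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small temporary tree: root/a.txt and root/sub/b.txt
        File root = java.nio.file.Files.createTempDirectory("walk").toFile();
        File sub = new File(root, "sub");
        sub.mkdir();
        new File(root, "a.txt").createNewFile();
        new File(sub, "b.txt").createNewFile();

        List<File> leaves = new ArrayList<>();
        collectFiles(root, leaves);
        System.out.println(leaves.size());  // the two .txt leaves, not the directory
    }
}
```

Replacing `out.add(file)` with `writer.addDocument(FileDocument.Document(file))` recovers the shape of the original method.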
The single line

writer.addDocument(FileDocument.Document(file));

actually does a lot of work. Whenever the recursion reaches a leaf it has an ordinary file rather than a directory, say myWorld.txt, and a series of steps follows: from the File object f built for myWorld.txt, details such as its path and last-modified time are read and wrapped in several Field objects; those Fields are then aggregated into a single Document; finally the Document is added to the IndexWriter, which tokenizes and filters the information held in the Document's Fields so that it can be searched.
The source of the org.apache.lucene.demo.FileDocument class is as follows:
package org.apache.lucene.demo;

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FileDocument {
  public static Document Document(File f)
      throws java.io.FileNotFoundException {
    // instantiate a Document
    Document doc = new Document();
    // from the given File f, build several Field objects and add them all to the Document.
    // First, build a Field from f's path and set its properties:
    // "path" is the Field's name, by which it can later be retrieved;
    // Field.Store.YES means the value is stored in the index;
    // Field.Index.UN_TOKENIZED means the value is indexed (hence searchable) but not tokenized
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    // a Field carrying the last-modified time, at minute resolution
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    // a Field read from a character stream; the FileReader built from f must remain open
    doc.add(new Field("contents", new FileReader(f)));
    return doc;
  }

  private FileDocument() {}
}
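The "modified" field deserves a closer look: DateTools.timeToString with Resolution.MINUTE encodes the timestamp as a GMT-based, lexicographically sortable string, which is why it can be indexed untokenized and still support range queries. A stdlib-only sketch of the equivalent formatting (assumption: DateTools uses the pattern yyyyMMddHHmm in GMT at minute resolution; the class name MinuteStamp is mine):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class MinuteStamp {
    // Mimics DateTools.timeToString(millis, Resolution.MINUTE):
    // a GMT timestamp truncated to the minute, sortable as a plain string.
    static String timeToMinuteString(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        // epoch start in GMT: 1970-01-01 00:00
        System.out.println(timeToMinuteString(0L));          // 197001010000
        System.out.println(timeToMinuteString(86_400_000L)); // one day later, compares greater
    }
}
```

Because earlier instants always produce lexicographically smaller strings, ordinary term-range comparison on this field behaves like a date comparison.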
This code shows just how central Field is; it needs to be understood thoroughly. The Field class defines two very useful enums, Store and Index, which set the properties used when a Field is indexed.
/** Specifies whether and how a field should be stored. */
public static enum Store {
  /** Store the original field value in the index. This is useful for short texts
   * like a document's title which should be displayed with the results. The
   * value is stored in its original form, i.e. no analyzer is used before it is
   * stored.
   */
  YES {
    @Override
    public boolean isStored() { return true; }
  },

  /** Do not store the field value in the index. */
  NO {
    @Override
    public boolean isStored() { return false; }
  };

  public abstract boolean isStored();
}
/** Specifies whether and how a field should be indexed. */
public static enum Index {
  /** Do not index the field value. This field can thus not be searched,
   * but one can still access its contents provided it is
   * {@link Field.Store stored}. */
  NO {
    @Override
    public boolean isIndexed() { return false; }
    @Override
    public boolean isAnalyzed() { return false; }
    @Override
    public boolean omitNorms() { return true; }
  },

  /** Index the tokens produced by running the field's
   * value through an Analyzer. This is useful for
   * common text. */
  ANALYZED {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return true; }
    @Override
    public boolean omitNorms() { return false; }
  },

  /** Index the field's value without using an Analyzer, so it can be searched.
   * As no analyzer is used the value will be stored as a single term. This is
   * useful for unique Ids like product numbers.
   */
  NOT_ANALYZED {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return false; }
    @Override
    public boolean omitNorms() { return false; }
  },

  /** Expert: Index the field's value without an Analyzer,
   * and also disable the storing of norms. Note that you
   * can also separately enable/disable norms by calling
   * {@link Field#setOmitNorms}. No norms means that
   * index-time field and document boosting and field
   * length normalization are disabled. The benefit is
   * less memory usage as norms take up one byte of RAM
   * per indexed field for every document in the index,
   * during searching. Note that once you index a given
   * field <i>with</i> norms enabled, disabling norms will
   * have no effect. In other words, for this to have the
   * above described effect on a field, all instances of
   * that field must be indexed with NOT_ANALYZED_NO_NORMS
   * from the beginning. */
  NOT_ANALYZED_NO_NORMS {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return false; }
    @Override
    public boolean omitNorms() { return true; }
  },

  /** Expert: Index the tokens produced by running the
   * field's value through an Analyzer, and also
   * separately disable the storing of norms. See
   * {@link #NOT_ANALYZED_NO_NORMS} for what norms are
   * and why you may want to disable them. */
  ANALYZED_NO_NORMS {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return true; }
    @Override
    public boolean omitNorms() { return true; }
  };

  public abstract boolean isIndexed();
  public abstract boolean isAnalyzed();
  public abstract boolean omitNorms();
}
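Spread across five constant bodies, the three properties are easy to lose track of. Restated as a self-contained truth table in plain Java (no Lucene dependency; the wrapper class IndexOptions and the field-based encoding are mine, for comparison only):

```java
public class IndexOptions {
    // Mirrors Field.Index: each constant fixes three boolean properties.
    enum Index {
        NO(false, false, true),                  // not searchable at all
        ANALYZED(true, true, false),             // tokenized text, with norms
        NOT_ANALYZED(true, false, false),        // single term, e.g. an ID
        NOT_ANALYZED_NO_NORMS(true, false, true),// single term, norms disabled
        ANALYZED_NO_NORMS(true, true, true);     // tokenized, norms disabled

        final boolean indexed, analyzed, omitNorms;

        Index(boolean indexed, boolean analyzed, boolean omitNorms) {
            this.indexed = indexed;
            this.analyzed = analyzed;
            this.omitNorms = omitNorms;
        }
    }

    public static void main(String[] args) {
        for (Index i : Index.values()) {
            System.out.printf("%-22s indexed=%-5b analyzed=%-5b omitNorms=%b%n",
                    i, i.indexed, i.analyzed, i.omitNorms);
        }
    }
}
```

Reading the table row by row: only NO is unsearchable; the ANALYZED variants run the value through an Analyzer while the NOT_ANALYZED variants index it as one term; the _NO_NORMS variants trade boosting and length normalization for one byte of RAM saved per field per document.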
The Field class contains one more nested type, declared as follows:
public static enum TermVector {
  /** Do not store term vectors. */
  NO {
    @Override
    public boolean isStored() { return false; }
    @Override
    public boolean withPositions() { return false; }
    @Override
    public boolean withOffsets() { return false; }
  },

  /** Store the term vectors of each document. A term vector is a list
   * of the document's terms and their number of occurrences in that document. */
  YES {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return false; }
    @Override
    public boolean withOffsets() { return false; }
  },

  /**
   * Store the term vector + token position information
   *
   * @see #YES
   */
  WITH_POSITIONS {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return true; }
    @Override
    public boolean withOffsets() { return false; }
  },

  /**
   * Store the term vector + Token offset information
   *
   * @see #YES
   */
  WITH_OFFSETS {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return false; }
    @Override
    public boolean withOffsets() { return true; }
  },

  /**
   * Store the term vector + Token position and offset information
   *
   * @see #YES
   * @see #WITH_POSITIONS
   * @see #WITH_OFFSETS
   */
  WITH_POSITIONS_OFFSETS {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return true; }
    @Override
    public boolean withOffsets() { return true; }
  };

  public abstract boolean isStored();
  public abstract boolean withPositions();
  public abstract boolean withOffsets();
}
TermVector is the enum concerned with term vectors: per-document lists of terms, optionally carrying position and offset information. In Lucene before 3.0, Store, Index and TermVector were typically implemented as static inner classes; from 3.0 onward they are enums.
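To see what changed in that migration, here is a minimal sketch of the pre-3.0 type-safe-constant idiom (simplified; the real pre-3.0 classes extended a Parameter base class and handled serialization, which this hypothetical PreEnumStyle.Store omits):

```java
public class PreEnumStyle {
    // Pre-3.0 style: a final class with a private constructor and
    // public static final instances, so only YES and NO can ever exist.
    static final class Store {
        private final boolean stored;

        private Store(boolean stored) { this.stored = stored; }

        public boolean isStored() { return stored; }

        public static final Store YES = new Store(true);
        public static final Store NO = new Store(false);
    }

    public static void main(String[] args) {
        System.out.println(Store.YES.isStored()); // true
        System.out.println(Store.NO.isStored());  // false
    }
}
```

Enums give the same closed set of instances for free, plus switch support, values(), and valueOf(), which is why the 3.0 API adopted them.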
A Field's value can be supplied in several forms; the Field class handles four of them: String, Reader, byte[], and TokenStream.
Next comes the construction of Field objects: the class offers nine constructors, covering these value types and the option combinations above.
Note also Field's class declaration:

public final class Field extends AbstractField implements Fieldable, Serializable

so its superclass AbstractField is worth understanding as well. AbstractField declares the following fields:
protected String name = "body";
protected boolean storeTermVector = false;
protected boolean storeOffsetWithTermVector = false;
protected boolean storePositionWithTermVector = false;
protected boolean omitNorms = false;
protected boolean isStored = false;
protected boolean isIndexed = true;
protected boolean isTokenized = true;
protected boolean isBinary = false;
protected boolean isCompressed = false;
protected boolean lazy = false;
protected float boost = 1.0f;
protected Object fieldsData = null;
Field also implements the Fieldable interface, which adds the methods used to query and manage a Field's state within its Document.
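The defaults above can be read off directly: a freshly constructed field is indexed and tokenized, but not stored, not binary, with norms enabled and a boost of 1.0. A minimal sketch of how Fieldable-style read accessors sit over those defaults (the class name FieldState is mine, not Lucene's; it only mirrors the listed initializers and a few accessor signatures):

```java
public class FieldState {
    // Defaults mirror AbstractField's field initializers.
    protected String name = "body";
    protected boolean isStored = false;
    protected boolean isIndexed = true;
    protected boolean isTokenized = true;
    protected boolean omitNorms = false;
    protected float boost = 1.0f;
    protected Object fieldsData = null; // holds the String/Reader/byte[] value

    // Fieldable-style accessors: callers query a field's configuration
    // rather than touching the flags directly.
    public String name() { return name; }
    public boolean isStored() { return isStored; }
    public boolean isIndexed() { return isIndexed; }
    public boolean isTokenized() { return isTokenized; }
    public float getBoost() { return boost; }
    public void setBoost(float boost) { this.boost = boost; }

    public static void main(String[] args) {
        FieldState f = new FieldState();
        System.out.println(f.name() + " indexed=" + f.isIndexed()
                + " stored=" + f.isStored() + " boost=" + f.getBoost());
    }
}
```

This is the shape the indexing chain relies on: IndexWriter never inspects a Field's internals, only these boolean accessors, to decide whether to store, analyze, or skip each field.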