The org.apache.lucene.demo.IndexFiles class indexes files recursively. Once an IndexWriter has been constructed, Documents can be added to it, which is where the actual index building happens. Each directory is traversed; since a directory may itself contain directories, a depth-first recursion is used to reach the leaf files (ordinary files with an extension, such as my.txt), at which point the code below is invoked:
static void indexDocs(IndexWriter writer, File file)
    throws IOException {
  // only try to index readable files
  if (file.canRead()) {
    if (file.isDirectory()) {
      // file is a directory; it may contain files, subdirectories, or nothing
      String[] files = file.list();
      // list() returns null on an I/O error
      if (files != null) {
        for (int i = 0; i < files.length; i++) {
          // recurse depth-first into every entry under this directory
          indexDocs(writer, new File(file, files[i]));
        }
      }
    } else {
      // a leaf: an ordinary file rather than a directory, so index it
      System.out.println("adding " + file);
      try {
        writer.addDocument(FileDocument.Document(file));
      } catch (FileNotFoundException fnfe) {
        // the file may have vanished or become unreadable since listing; skip it
      }
    }
  }
}
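The traversal itself is independent of Lucene. The sketch below isolates that depth-first walk in plain Java (the class name WalkDemo and the collecting-into-a-list variation are mine, for illustration; the demo passes each leaf to the IndexWriter instead):

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WalkDemo {
    // Depth-first traversal mirroring IndexFiles.indexDocs:
    // recurse into directories, collect the leaf files.
    static void collectFiles(File file, List<File> out) {
        if (!file.canRead()) return;
        if (file.isDirectory()) {
            String[] children = file.list();   // null on I/O error
            if (children != null) {
                for (String child : children) {
                    collectFiles(new File(file, child), out);
                }
            }
        } else {
            out.add(file);                     // a leaf: an ordinary file
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small temporary tree: root/a.txt and root/sub/b.txt
        File root = java.nio.file.Files.createTempDirectory("walk").toFile();
        File sub = new File(root, "sub");
        sub.mkdir();
        new File(root, "a.txt").createNewFile();
        new File(sub, "b.txt").createNewFile();

        List<File> leaves = new ArrayList<>();
        collectFiles(root, leaves);
        System.out.println(leaves.size());  // the two .txt leaves, not the directory
    }
}
```

Replacing `out.add(file)` with `writer.addDocument(FileDocument.Document(file))` recovers the shape of the original method.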
The single line

writer.addDocument(FileDocument.Document(file));

actually does a lot of work. Whenever the recursion reaches a leaf it has an ordinary file rather than a directory, say myWorld.txt, and a series of steps follows: from the File object f built for myWorld.txt, details such as its path and last-modified time are read and wrapped in several Field objects; those Fields are then aggregated into a single Document; finally the Document is added to the IndexWriter, which tokenizes and filters the information held in the Document's Fields so that it can be searched.
The source of the org.apache.lucene.demo.FileDocument class is as follows:
package org.apache.lucene.demo;

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FileDocument {
  public static Document Document(File f)
      throws java.io.FileNotFoundException {
    // instantiate a Document
    Document doc = new Document();
    // from the given File f, build several Field objects and add them all to the Document.
    // First, build a Field from f's path and set its properties:
    // "path" is the Field's name, by which it can later be retrieved;
    // Field.Store.YES means the value is stored in the index;
    // Field.Index.UN_TOKENIZED means the value is indexed (hence searchable) but not tokenized
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    // a Field carrying the last-modified time, at minute resolution
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    // a Field read from a character stream; the FileReader built from f must remain open
    doc.add(new Field("contents", new FileReader(f)));
    return doc;
  }

  private FileDocument() {}
}
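The "modified" field deserves a closer look: DateTools.timeToString with Resolution.MINUTE encodes the timestamp as a GMT-based, lexicographically sortable string, which is why it can be indexed untokenized and still support range queries. A stdlib-only sketch of the equivalent formatting (assumption: DateTools uses the pattern yyyyMMddHHmm in GMT at minute resolution; the class name MinuteStamp is mine):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class MinuteStamp {
    // Mimics DateTools.timeToString(millis, Resolution.MINUTE):
    // a GMT timestamp truncated to the minute, sortable as a plain string.
    static String timeToMinuteString(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmm", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        // epoch start in GMT: 1970-01-01 00:00
        System.out.println(timeToMinuteString(0L));          // 197001010000
        System.out.println(timeToMinuteString(86_400_000L)); // one day later, compares greater
    }
}
```

Because earlier instants always produce lexicographically smaller strings, ordinary term-range comparison on this field behaves like a date comparison.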
This code shows just how central Field is; it needs to be understood thoroughly. The Field class defines two very useful enums, Store and Index, which set the properties used when a Field is indexed.
/** Specifies whether and how a field should be stored. */
public static enum Store {
  /** Store the original field value in the index. This is useful for short texts
   * like a document's title which should be displayed with the results. The
   * value is stored in its original form, i.e. no analyzer is used before it is
   * stored.
   */
  YES {
    @Override
    public boolean isStored() { return true; }
  },

  /** Do not store the field value in the index. */
  NO {
    @Override
    public boolean isStored() { return false; }
  };

  public abstract boolean isStored();
}
/** Specifies whether and how a field should be indexed. */
public static enum Index {
  /** Do not index the field value. This field can thus not be searched,
   * but one can still access its contents provided it is
   * {@link Field.Store stored}. */
  NO {
    @Override
    public boolean isIndexed() { return false; }
    @Override
    public boolean isAnalyzed() { return false; }
    @Override
    public boolean omitNorms() { return true; }
  },

  /** Index the tokens produced by running the field's
   * value through an Analyzer. This is useful for
   * common text. */
  ANALYZED {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return true; }
    @Override
    public boolean omitNorms() { return false; }
  },

  /** Index the field's value without using an Analyzer, so it can be searched.
   * As no analyzer is used the value will be stored as a single term. This is
   * useful for unique Ids like product numbers.
   */
  NOT_ANALYZED {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return false; }
    @Override
    public boolean omitNorms() { return false; }
  },

  /** Expert: Index the field's value without an Analyzer,
   * and also disable the storing of norms. Note that you
   * can also separately enable/disable norms by calling
   * {@link Field#setOmitNorms}. No norms means that
   * index-time field and document boosting and field
   * length normalization are disabled. The benefit is
   * less memory usage as norms take up one byte of RAM
   * per indexed field for every document in the index,
   * during searching. Note that once you index a given
   * field <i>with</i> norms enabled, disabling norms will
   * have no effect. In other words, for this to have the
   * above described effect on a field, all instances of
   * that field must be indexed with NOT_ANALYZED_NO_NORMS
   * from the beginning. */
  NOT_ANALYZED_NO_NORMS {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return false; }
    @Override
    public boolean omitNorms() { return true; }
  },

  /** Expert: Index the tokens produced by running the
   * field's value through an Analyzer, and also
   * separately disable the storing of norms. See
   * {@link #NOT_ANALYZED_NO_NORMS} for what norms are
   * and why you may want to disable them. */
  ANALYZED_NO_NORMS {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return true; }
    @Override
    public boolean omitNorms() { return true; }
  };

  public abstract boolean isIndexed();
  public abstract boolean isAnalyzed();
  public abstract boolean omitNorms();
}
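Spread across five constant bodies, the three properties are easy to lose track of. Restated as a self-contained truth table in plain Java (no Lucene dependency; the wrapper class IndexOptions and the field-based encoding are mine, for comparison only):

```java
public class IndexOptions {
    // Mirrors Field.Index: each constant fixes three boolean properties.
    enum Index {
        NO(false, false, true),                  // not searchable at all
        ANALYZED(true, true, false),             // tokenized text, with norms
        NOT_ANALYZED(true, false, false),        // single term, e.g. an ID
        NOT_ANALYZED_NO_NORMS(true, false, true),// single term, norms disabled
        ANALYZED_NO_NORMS(true, true, true);     // tokenized, norms disabled

        final boolean indexed, analyzed, omitNorms;

        Index(boolean indexed, boolean analyzed, boolean omitNorms) {
            this.indexed = indexed;
            this.analyzed = analyzed;
            this.omitNorms = omitNorms;
        }
    }

    public static void main(String[] args) {
        for (Index i : Index.values()) {
            System.out.printf("%-22s indexed=%-5b analyzed=%-5b omitNorms=%b%n",
                    i, i.indexed, i.analyzed, i.omitNorms);
        }
    }
}
```

Reading the table row by row: only NO is unsearchable; the ANALYZED variants run the value through an Analyzer while the NOT_ANALYZED variants index it as one term; the _NO_NORMS variants trade boosting and length normalization for one byte of RAM saved per field per document.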
The Field class contains one more nested type, declared as follows:
public static enum TermVector {
  /** Do not store term vectors. */
  NO {
    @Override
    public boolean isStored() { return false; }
    @Override
    public boolean withPositions() { return false; }
    @Override
    public boolean withOffsets() { return false; }
  },

  /** Store the term vectors of each document. A term vector is a list
   * of the document's terms and their number of occurrences in that document. */
  YES {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return false; }
    @Override
    public boolean withOffsets() { return false; }
  },

  /**
   * Store the term vector + token position information
   *
   * @see #YES
   */
  WITH_POSITIONS {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return true; }
    @Override
    public boolean withOffsets() { return false; }
  },

  /**
   * Store the term vector + Token offset information
   *
   * @see #YES
   */
  WITH_OFFSETS {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return false; }
    @Override
    public boolean withOffsets() { return true; }
  },

  /**
   * Store the term vector + Token position and offset information
   *
   * @see #YES
   * @see #WITH_POSITIONS
   * @see #WITH_OFFSETS
   */
  WITH_POSITIONS_OFFSETS {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return true; }
    @Override
    public boolean withOffsets() { return true; }
  };

  public abstract boolean isStored();
  public abstract boolean withPositions();
  public abstract boolean withOffsets();
}
TermVector is the enum concerned with term vectors: per-document lists of terms, optionally carrying position and offset information. In Lucene before 3.0, Store, Index and TermVector were typically implemented as static inner classes; from 3.0 onward they are enums.
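To see what changed in that migration, here is a minimal sketch of the pre-3.0 type-safe-constant idiom (simplified; the real pre-3.0 classes extended a Parameter base class and handled serialization, which this hypothetical PreEnumStyle.Store omits):

```java
public class PreEnumStyle {
    // Pre-3.0 style: a final class with a private constructor and
    // public static final instances, so only YES and NO can ever exist.
    static final class Store {
        private final boolean stored;

        private Store(boolean stored) { this.stored = stored; }

        public boolean isStored() { return stored; }

        public static final Store YES = new Store(true);
        public static final Store NO = new Store(false);
    }

    public static void main(String[] args) {
        System.out.println(Store.YES.isStored()); // true
        System.out.println(Store.NO.isStored());  // false
    }
}
```

Enums give the same closed set of instances for free, plus switch support, values(), and valueOf(), which is why the 3.0 API adopted them.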
A Field's value can be supplied in several forms; the Field class handles four of them: String, Reader, byte[], and TokenStream.
Next comes the construction of Field objects: the class offers nine constructors, covering these value types and the option combinations above.
Note also Field's class declaration:

public final class Field extends AbstractField implements Fieldable, Serializable

so its superclass AbstractField is worth understanding as well. AbstractField declares the following fields:
protected String name = "body";
protected boolean storeTermVector = false;
protected boolean storeOffsetWithTermVector = false;
protected boolean storePositionWithTermVector = false;
protected boolean omitNorms = false;
protected boolean isStored = false;
protected boolean isIndexed = true;
protected boolean isTokenized = true;
protected boolean isBinary = false;
protected boolean isCompressed = false;
protected boolean lazy = false;
protected float boost = 1.0f;
protected Object fieldsData = null;
Field also implements the Fieldable interface, which adds the methods used to query and manage a Field's state within its Document.
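The defaults above can be read off directly: a freshly constructed field is indexed and tokenized, but not stored, not binary, with norms enabled and a boost of 1.0. A minimal sketch of how Fieldable-style read accessors sit over those defaults (the class name FieldState is mine, not Lucene's; it only mirrors the listed initializers and a few accessor signatures):

```java
public class FieldState {
    // Defaults mirror AbstractField's field initializers.
    protected String name = "body";
    protected boolean isStored = false;
    protected boolean isIndexed = true;
    protected boolean isTokenized = true;
    protected boolean omitNorms = false;
    protected float boost = 1.0f;
    protected Object fieldsData = null; // holds the String/Reader/byte[] value

    // Fieldable-style accessors: callers query a field's configuration
    // rather than touching the flags directly.
    public String name() { return name; }
    public boolean isStored() { return isStored; }
    public boolean isIndexed() { return isIndexed; }
    public boolean isTokenized() { return isTokenized; }
    public float getBoost() { return boost; }
    public void setBoost(float boost) { this.boost = boost; }

    public static void main(String[] args) {
        FieldState f = new FieldState();
        System.out.println(f.name() + " indexed=" + f.isIndexed()
                + " stored=" + f.isStored() + " boost=" + f.getBoost());
    }
}
```

This is the shape the indexing chain relies on: IndexWriter never inspects a Field's internals, only these boolean accessors, to decide whether to store, analyze, or skip each field.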