  • [Hands-on Full-Text Search] Lucene Index Operations: Add, Delete, Update, Query

    Preface

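      Anyone who works on search has probably come across Lucene. It is open source and easy to pick up, and the official API documentation is enough to write small demos. It relies on an inverted index to make retrieval fast. This post walks through a simple implementation of incrementally adding to an index, deleting from it, querying by keyword, and updating it.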
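
      To make the inverted-index idea concrete, here is a minimal sketch in plain Java. It is a toy model only, not how Lucene stores its index: the class name ToyInvertedIndex and the crude tokenizer (lowercase, split on anything that is not a letter or digit) are my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical, for illustration): maps each term to the
// sorted set of ids of the documents that contain it.
class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Crude stand-in for an analyzer: lowercase, split on non-letter/non-digit runs.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Adds a document and returns the id assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // A lookup is a single map access; no document text is scanned at query time.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("hello,man!");     // doc 0
        index.add("goodbye,man!");   // doc 1
        index.add("hello,woman!");   // doc 2
        index.add("goodbye,woman!"); // doc 3
        System.out.println(index.search("man"));   // [0, 1]
        System.out.println(index.search("hello")); // [0, 2]
    }
}
```

      The point is that the work happens at indexing time: a query only looks up the term, which is why retrieval stays fast as the number of documents grows.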

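      What I currently find inconvenient is that, to run full-text search over file contents, you have to write the file-reading code yourself (Solr gives you this for free). Index creation is also fairly slow and leaves a lot of room for optimization, which is something to study carefully later.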
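
      As a rough idea of what that hand-written reading step can look like, here is a small sketch using only the JDK. The class name TextFileLoader, the "*.txt" filter, and the UTF-8 assumption are mine, not something from Lucene or Solr; each map entry would then become one document with a filename field and a content field.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: loads plain-text files so their content can be indexed.
class TextFileLoader {
    // Reads every regular *.txt file directly under dir (no recursion)
    // into a filename -> content map, assuming the files are UTF-8.
    static Map<String, String> load(Path dir) throws IOException {
        Map<String, String> result = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    String content = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                    result.put(p.getFileName().toString(), content);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("docs");
        Files.write(dir.resolve("text1.txt"), "hello,man!".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("text2.txt"), "goodbye,man!".getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, String> e : load(dir).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```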

      Creating the index

      The previous post already covered the overall flow of creating an index with Lucene, so here is just a brief recap:

    // Lucene 4.x-style API, as used throughout this post; `analyzer` is created beforehand.
    Directory directory = FSDirectory.open(new File("/tmp/testindex"));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc); // without this call the document never reaches the index
    iwriter.close();

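      1 Create a Directory that points at the index directory

      2 Create the analyzer and the IndexWriter object

      3 Create Document objects to hold the data and add them to the writer

      4 Close the IndexWriter, which commits the changes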

    /**
     * Build the index.
     *
     * @throws Exception if the index cannot be opened or written
     */
    public static void index() throws Exception {

        String text1 = "hello,man!";
        String text2 = "goodbye,man!";
        String text3 = "hello,woman!";
        String text4 = "goodbye,woman!";

        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        // One Document per text; both fields are tokenized and stored.
        Document doc1 = new Document();
        doc1.add(new TextField("filename", "text1", Store.YES));
        doc1.add(new TextField("content", text1, Store.YES));
        indexWriter.addDocument(doc1);

        Document doc2 = new Document();
        doc2.add(new TextField("filename", "text2", Store.YES));
        doc2.add(new TextField("content", text2, Store.YES));
        indexWriter.addDocument(doc2);

        Document doc3 = new Document();
        doc3.add(new TextField("filename", "text3", Store.YES));
        doc3.add(new TextField("content", text3, Store.YES));
        indexWriter.addDocument(doc3);

        Document doc4 = new Document();
        doc4.add(new TextField("filename", "text4", Store.YES));
        doc4.add(new TextField("content", text4, Store.YES));
        indexWriter.addDocument(doc4);

        indexWriter.commit();
        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Index creation took: " + (date2.getTime() - date1.getTime()) + "ms");
    }

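      Adding to the index incrementally

      Lucene supports incremental indexing: new documents can be added without affecting what is already indexed, and Lucene merges the index files automatically at an appropriate time.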

    /**
     * Add a document to the existing index.
     *
     * @throws Exception if the index cannot be opened or written
     */
    public static void insert() throws Exception {
        String text5 = "hello,goodbye,man,woman";
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        Document doc1 = new Document();
        doc1.add(new TextField("filename", "text5", Store.YES));
        doc1.add(new TextField("content", text5, Store.YES));
        indexWriter.addDocument(doc1);

        indexWriter.commit();
        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Adding to the index took: " + (date2.getTime() - date1.getTime()) + "ms");
    }
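
      To picture why adding does not disturb the existing index, and what "merging at an appropriate time" means, here is a toy model in plain Java. Everything in it is an assumption for illustration (the class ToySegmentedIndex, the maxSegments threshold, merging everything into one segment); Lucene's real merge policy is considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical): each commit writes a new segment next to the old
// ones, and segments are merged once there are too many of them.
class ToySegmentedIndex {
    private final List<List<String>> segments = new ArrayList<>();
    private final List<String> buffer = new ArrayList<>();
    private final int maxSegments;

    ToySegmentedIndex(int maxSegments) {
        this.maxSegments = maxSegments;
    }

    // Buffers a document in memory until the next commit.
    void addDocument(String doc) {
        buffer.add(doc);
    }

    // Flushes buffered documents as a NEW segment; existing segments are not rewritten.
    void commit() {
        if (buffer.isEmpty()) {
            return;
        }
        segments.add(new ArrayList<>(buffer));
        buffer.clear();
        if (segments.size() > maxSegments) {
            merge();
        }
    }

    // Simplest possible merge: combine all segments into one.
    private void merge() {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        segments.clear();
        segments.add(merged);
    }

    int segmentCount() {
        return segments.size();
    }

    // Number of committed documents across all segments.
    int docCount() {
        int n = 0;
        for (List<String> segment : segments) {
            n += segment.size();
        }
        return n;
    }

    public static void main(String[] args) {
        ToySegmentedIndex index = new ToySegmentedIndex(2);
        index.addDocument("hello,man!");
        index.addDocument("goodbye,man!");
        index.commit();                           // 1 segment
        index.addDocument("hello,goodbye,man,woman");
        index.commit();                           // 2 segments, the first one untouched
        index.addDocument("one more");
        index.commit();                           // 3 > 2, so everything is merged
        System.out.println(index.segmentCount()); // 1
        System.out.println(index.docCount());     // 4
    }
}
```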

      

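      Deleting from the index

      Deletion also goes through IndexWriter, via its deleteDocuments method. You can delete by keyword (a Term), which removes every document that contains that term. If you only want to delete a single document, it is best to define a unique ID field and delete by that field.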

    /**
     * Delete documents from the index.
     *
     * @param str the term to delete by (matched against the "filename" field)
     * @throws Exception if the index cannot be opened or written
     */
    public static void delete(String str) throws Exception {
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        // Removes every document whose "filename" field contains the term str.
        indexWriter.deleteDocuments(new Term("filename", str));

        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Deleting from the index took: " + (date2.getTime() - date1.getTime()) + "ms");
    }
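
      The difference between deleting by an ordinary keyword and deleting by a unique ID is easy to see in a toy model. The sketch below uses plain Java collections, not Lucene; the class ToyDelete, its deleteDocuments helper, and the "id" field are hypothetical names I made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical): documents are field maps, and a delete removes
// every document whose field contains the given term as a whole token.
class ToyDelete {
    static Map<String, String> doc(String id, String content) {
        Map<String, String> d = new HashMap<>();
        d.put("id", id);
        d.put("content", content);
        return d;
    }

    // Returns how many documents were removed.
    static int deleteDocuments(List<Map<String, String>> docs, String field, String term) {
        int before = docs.size();
        docs.removeIf(d -> {
            String value = d.get(field);
            return value != null
                    && Arrays.asList(value.toLowerCase().split("[^\\p{L}\\p{N}]+")).contains(term);
        });
        return before - docs.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("1", "hello,man!"));
        docs.add(doc("2", "goodbye,man!"));
        docs.add(doc("3", "hello,woman!"));
        docs.add(doc("4", "goodbye,woman!"));

        // A content keyword can match many documents at once.
        System.out.println(deleteDocuments(docs, "content", "man")); // 2
        // A unique id matches exactly one.
        System.out.println(deleteDocuments(docs, "id", "3"));        // 1
        System.out.println(docs.size());                             // 1
    }
}
```

      In Lucene itself, an ID like this is usually indexed as a single untokenized term (StringField does that), so the Term passed to deleteDocuments matches the stored value exactly.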

      

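      Updating the index

      Lucene has no true in-place update. updateDocument takes a Term on some field plus a new document, and what it actually does is delete the documents matching that term and then index the new document.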

    /**
     * Update a document in the index.
     *
     * @throws Exception if the index cannot be opened or written
     */
    public static void update() throws Exception {
        String text1 = "update,hello,man!";
        Date date1 = new Date();
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        directory = FSDirectory.open(new File(INDEX_DIR));

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_CURRENT, analyzer);
        indexWriter = new IndexWriter(directory, config);

        Document doc1 = new Document();
        doc1.add(new TextField("filename", "text1", Store.YES));
        doc1.add(new TextField("content", text1, Store.YES));

        // Deletes the documents matching filename:text1, then adds doc1.
        indexWriter.updateDocument(new Term("filename", "text1"), doc1);

        indexWriter.close();

        Date date2 = new Date();
        System.out.println("Updating the index took: " + (date2.getTime() - date1.getTime()) + "ms");
    }
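
      The "delete, then add" behaviour has two consequences worth seeing in a small model: an update whose term matches nothing simply adds the document, and the replacement is a brand-new entry rather than the old one modified. The sketch below is plain Java with hypothetical names (ToyUpdate, exact matching on a field value), not Lucene code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical) of update-as-delete-then-add.
class ToyUpdate {
    static Map<String, String> doc(String filename, String content) {
        Map<String, String> d = new HashMap<>();
        d.put("filename", filename);
        d.put("content", content);
        return d;
    }

    static void updateDocument(List<Map<String, String>> docs, String field,
                               String value, Map<String, String> newDoc) {
        docs.removeIf(d -> value.equals(d.get(field))); // step 1: delete all matches
        docs.add(newDoc);                               // step 2: add the replacement
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("text1", "hello,man!"));
        docs.add(doc("text2", "goodbye,man!"));

        updateDocument(docs, "filename", "text1", doc("text1", "update,hello,man!"));
        System.out.println(docs.size());                 // 2
        System.out.println(docs.get(1).get("content"));  // update,hello,man!

        // No document matches text9, so this behaves like a plain add.
        updateDocument(docs, "filename", "text9", doc("text9", "brand new"));
        System.out.println(docs.size());                 // 3
    }
}
```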

      

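      Querying the index by keyword

      Lucene offers many kinds of queries, which I won't cover in detail here. A search returns a set of ScoreDoc hits (via TopDocs.scoreDocs), somewhat like a ResultSet; for each hit we load the document and read the values we want by field name.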

    /**
     * Keyword query.
     *
     * @param str the query string, parsed against the "content" field
     * @throws Exception if the index cannot be read or the query cannot be parsed
     */
    public static void search(String str) throws Exception {
        directory = FSDirectory.open(new File(INDEX_DIR));
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);

        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "content", analyzer);
        Query query = parser.parse(str);

        // Take at most 1000 hits, no filter.
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            // A hit only carries a document number; load the stored fields from it.
            Document hitDoc = isearcher.doc(hits[i].doc);
            System.out.println(hitDoc.get("filename"));
            System.out.println(hitDoc.get("content"));
        }
        ireader.close();
        directory.close();
    }
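
      The two-step shape of the search() method above (first get lightweight hits, then fetch the stored fields by document number) can be modelled in a few lines of plain Java. This is only a sketch under my own assumptions: the class ToySearch, its Hit type, and scoring by raw term count are made up for illustration and are not how Lucene scores documents.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical) of "hits first, stored fields second".
class ToySearch {
    // A hit holds only a document number and a score, not the document itself.
    static final class Hit {
        final int doc;
        final int score;

        Hit(int doc, int score) {
            this.doc = doc;
            this.score = score;
        }
    }

    static Map<String, String> doc(String filename, String content) {
        Map<String, String> d = new HashMap<>();
        d.put("filename", filename);
        d.put("content", content);
        return d;
    }

    // Scores each document by how often the term occurs in the field and
    // returns at most n hits, best score first (ties broken by document number).
    static List<Hit> search(List<Map<String, String>> docs, String field, String term, int n) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            String value = docs.get(i).getOrDefault(field, "");
            int count = 0;
            for (String token : value.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (token.equals(term)) {
                    count++;
                }
            }
            if (count > 0) {
                hits.add(new Hit(i, count));
            }
        }
        hits.sort(Comparator.comparingInt((Hit h) -> -h.score).thenComparingInt(h -> h.doc));
        return hits.subList(0, Math.min(n, hits.size()));
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("text1", "hello,man!"));
        docs.add(doc("text2", "goodbye,man!"));
        docs.add(doc("text3", "hello,woman!"));
        docs.add(doc("text5", "man,man,woman"));

        for (Hit hit : search(docs, "content", "man", 1000)) {
            Map<String, String> hitDoc = docs.get(hit.doc); // second step: load by number
            System.out.println(hitDoc.get("filename") + " score=" + hit.score);
        }
    }
}
```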

      Full code

    package test;

    import java.io.File;
    import java.util.Date;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TestLucene {
        // Where the index is stored (the backslash must be escaped in a Java string)
        private static String INDEX_DIR = "D:\\luceneIndex";
        private static Analyzer analyzer = null;
        private static Directory directory = null;
        private static IndexWriter indexWriter = null;

        public static void main(String[] args) {
            try {
    //            index();
                search("man");
    //            insert();
    //            delete("text5");
    //            update();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * Update a document in the index.
         *
         * @throws Exception if the index cannot be opened or written
         */
        public static void update() throws Exception {
            String text1 = "update,hello,man!";
            Date date1 = new Date();
            analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
            directory = FSDirectory.open(new File(INDEX_DIR));

            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_CURRENT, analyzer);
            indexWriter = new IndexWriter(directory, config);

            Document doc1 = new Document();
            doc1.add(new TextField("filename", "text1", Store.YES));
            doc1.add(new TextField("content", text1, Store.YES));

            // Deletes the documents matching filename:text1, then adds doc1.
            indexWriter.updateDocument(new Term("filename", "text1"), doc1);

            indexWriter.close();

            Date date2 = new Date();
            System.out.println("Updating the index took: " + (date2.getTime() - date1.getTime()) + "ms");
        }

        /**
         * Delete documents from the index.
         *
         * @param str the term to delete by (matched against the "filename" field)
         * @throws Exception if the index cannot be opened or written
         */
        public static void delete(String str) throws Exception {
            Date date1 = new Date();
            analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
            directory = FSDirectory.open(new File(INDEX_DIR));

            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_CURRENT, analyzer);
            indexWriter = new IndexWriter(directory, config);

            indexWriter.deleteDocuments(new Term("filename", str));

            indexWriter.close();

            Date date2 = new Date();
            System.out.println("Deleting from the index took: " + (date2.getTime() - date1.getTime()) + "ms");
        }

        /**
         * Add a document to the existing index.
         *
         * @throws Exception if the index cannot be opened or written
         */
        public static void insert() throws Exception {
            String text5 = "hello,goodbye,man,woman";
            Date date1 = new Date();
            analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
            directory = FSDirectory.open(new File(INDEX_DIR));

            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_CURRENT, analyzer);
            indexWriter = new IndexWriter(directory, config);

            Document doc1 = new Document();
            doc1.add(new TextField("filename", "text5", Store.YES));
            doc1.add(new TextField("content", text5, Store.YES));
            indexWriter.addDocument(doc1);

            indexWriter.commit();
            indexWriter.close();

            Date date2 = new Date();
            System.out.println("Adding to the index took: " + (date2.getTime() - date1.getTime()) + "ms");
        }

        /**
         * Build the index.
         *
         * @throws Exception if the index cannot be opened or written
         */
        public static void index() throws Exception {

            String text1 = "hello,man!";
            String text2 = "goodbye,man!";
            String text3 = "hello,woman!";
            String text4 = "goodbye,woman!";

            Date date1 = new Date();
            analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
            directory = FSDirectory.open(new File(INDEX_DIR));

            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_CURRENT, analyzer);
            indexWriter = new IndexWriter(directory, config);

            Document doc1 = new Document();
            doc1.add(new TextField("filename", "text1", Store.YES));
            doc1.add(new TextField("content", text1, Store.YES));
            indexWriter.addDocument(doc1);

            Document doc2 = new Document();
            doc2.add(new TextField("filename", "text2", Store.YES));
            doc2.add(new TextField("content", text2, Store.YES));
            indexWriter.addDocument(doc2);

            Document doc3 = new Document();
            doc3.add(new TextField("filename", "text3", Store.YES));
            doc3.add(new TextField("content", text3, Store.YES));
            indexWriter.addDocument(doc3);

            Document doc4 = new Document();
            doc4.add(new TextField("filename", "text4", Store.YES));
            doc4.add(new TextField("content", text4, Store.YES));
            indexWriter.addDocument(doc4);

            indexWriter.commit();
            indexWriter.close();

            Date date2 = new Date();
            System.out.println("Index creation took: " + (date2.getTime() - date1.getTime()) + "ms");
        }

        /**
         * Keyword query.
         *
         * @param str the query string, parsed against the "content" field
         * @throws Exception if the index cannot be read or the query cannot be parsed
         */
        public static void search(String str) throws Exception {
            directory = FSDirectory.open(new File(INDEX_DIR));
            analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
            DirectoryReader ireader = DirectoryReader.open(directory);
            IndexSearcher isearcher = new IndexSearcher(ireader);

            QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "content", analyzer);
            Query query = parser.parse(str);

            ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
            for (int i = 0; i < hits.length; i++) {
                Document hitDoc = isearcher.doc(hits[i].doc);
                System.out.println(hitDoc.get("filename"));
                System.out.println(hitDoc.get("content"));
            }
            ireader.close();
            directory.close();
        }
    }

      References

      http://www.cnblogs.com/xing901022/p/3933675.html

  • Original article: https://www.cnblogs.com/xing901022/p/3940243.html