  • Lucene 2.9.2 + PanGu Segmenter 2.3.1 (Part 1) Getting Started: Building a Simple Index and Searching (Original)

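    A picture is worth a thousand words:

    [Screenshot: QQ截图20110826113830]

    P.S. The screenshot shows that Chinese word segmentation succeeded and that the search returned a hit.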

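    Note: if you want to learn Lucene properly, I recommend reading Lucene in Action, 2nd edition. Also, 2.9.2 deprecates many methods from earlier versions, so don't bother reading old code.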

    Here is the code:

    Building the index
    1. public static void IndexFile(this IndexWriter writer, IO.FileInfo file) // extension method: must live in a static class; IO assumes an alias such as "using IO = System.IO;"
    2. {
    3.     var watch = new Stopwatch();
    4.     var startTime = DateTime.Now;
    5.     watch.Start();
    6.     Console.WriteLine("Indexing  {0}", file.Name);
    7.     writer.AddDocument(file.GetDocument()); // build the Document and add it to the index
    8.     watch.Stop();
    9.     var timeSpan = DateTime.Now - startTime;
    10.     Console.WriteLine("Indexing Completed! Cost time {0}[{1}]", timeSpan.ToString("c"), watch.ElapsedMilliseconds);
    11.
    12. }
    13.
    14. public static Document GetDocument(this IO.FileInfo file)
    15. {
    16.     var doc = new Document();
    17.     doc.Add(new Field("contents", new IO.StreamReader(file.FullName))); // tokenized and indexed, not stored
    18.     doc.Add(new Field("filename", file.Name,
    19.         Field.Store.YES, Field.Index.ANALYZED));     // stored; segmented by the analyzer
    20.     doc.Add(new Field("fullpath", file.FullName,
    21.         Field.Store.YES, Field.Index.NOT_ANALYZED)); // stored; indexed as one whole term
    22.     return doc;
    23. }

    Output

    Indexing Scott.txt
    Indexing Completed! Cost time 00:00:02.4231386[2423]
    Indexing 黄金瞳.txt
    Indexing Completed! Cost time 00:00:00.0860049[85]
    There are 2 doc Indexed!
    Index Exit!

    Code walkthrough:

    Line 14: GetDocument builds the Document for a file. Document is one of Lucene's core objects; here is its definition:

    The Document class represents a collection of fields. Think of it as a virtual document—
    a chunk of data, such as a web page, an email message, or a text file—that you
    want to make retrievable at a later time. Fields of a document represent the document
    or metadata associated with that document. The original source (such as a database
    record, a Microsoft Word document, a chapter from a book, and so on) of
    document data is irrelevant to Lucene. It’s the text that you extract from such binary
    documents, and add as a Field instance, that Lucene processes. The metadata (such
    as author, title, subject and date modified) is indexed and stored separately as fields
    of a document.

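    If you don't care about the details, think of a Document as one row in a database table. What you get back from a search is also a set of Document objects, that is, rows.

    Building an index, then, simply means adding many such rows to Lucene.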
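    The post shows the per-file indexing methods but not the loop that calls them. As a rough illustration of "adding many rows", here is a minimal sketch of such a driver. It is not the author's actual driver: the class name `IndexDriver`, the method `BuildIndex`, and both directory parameters are hypothetical, and it assumes Lucene.Net 2.9.2 and PanGu4Lucene are referenced and that PanGu's dictionary and configuration are deployed where `PanGu.Segment.Init()` expects them.

    ```csharp
    using System;
    using System.IO;
    using Lucene.Net.Analysis.PanGu;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    public static class IndexDriver
    {
        // Hypothetical helper: indexes every .txt file in sourceDir into indexDir
        // and returns the number of documents added.
        public static int BuildIndex(string sourceDir, string indexDir)
        {
            // Assumption: PanGu's dictionary and settings are in their default location.
            PanGu.Segment.Init();

            var directory = FSDirectory.Open(new DirectoryInfo(indexDir));
            // create: true replaces any index that already exists in indexDir.
            var writer = new IndexWriter(directory, new PanGuAnalyzer(), true,
                                         IndexWriter.MaxFieldLength.UNLIMITED);
            var count = 0;
            try
            {
                foreach (var file in new DirectoryInfo(sourceDir).GetFiles("*.txt"))
                {
                    // Same three fields as GetDocument in the post.
                    var doc = new Document();
                    doc.Add(new Field("contents", new StreamReader(file.FullName)));
                    doc.Add(new Field("filename", file.Name,
                        Field.Store.YES, Field.Index.ANALYZED));
                    doc.Add(new Field("fullpath", file.FullName,
                        Field.Store.YES, Field.Index.NOT_ANALYZED));

                    writer.AddDocument(doc); // one "row" per file
                    count++;
                }
                writer.Optimize();
            }
            finally
            {
                // Close even if a file fails, so the index lock is released.
                writer.Close();
            }
            return count;
        }
    }
    ```

    A caller would invoke it as `IndexDriver.BuildIndex(sourcePath, indexPath)` and print the returned count, much like the "There are 2 doc Indexed!" line in the output above.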

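    Line 21: the first argument (Field.Store.YES) needs no explanation. The second, NOT_ANALYZED, does not mean the field cannot be searched. It means the field is indexed as a single term without being segmented, so it can only be matched as a whole value.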
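    The difference can be illustrated with a toy inverted index in plain C#. This is only a sketch of the idea, not how Lucene stores terms: `ToyIndex`, `AddField`, and `Matches` are made-up names, and lowercasing plus splitting on spaces stands in for a real analyzer such as PanGu.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Toy model only: a dictionary from "field:term" to document ids.
    public static class ToyIndex
    {
        private static readonly Dictionary<string, HashSet<int>> Postings =
            new Dictionary<string, HashSet<int>>();

        public static void AddField(int docId, string field, string value, bool analyzed)
        {
            // Analyzed: break the value into lowercase, space-separated tokens.
            // Not analyzed: index the entire value, unchanged, as one term.
            string[] terms = analyzed
                ? value.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                : new[] { value };

            foreach (var term in terms)
            {
                var key = field + ":" + term;
                HashSet<int> docs;
                if (!Postings.TryGetValue(key, out docs))
                {
                    docs = new HashSet<int>();
                    Postings[key] = docs;
                }
                docs.Add(docId);
            }
        }

        // Exact term lookup: true only if this precise term was indexed for the field.
        public static bool Matches(string field, string term)
        {
            return Postings.ContainsKey(field + ":" + term);
        }
    }
    ```

    After `ToyIndex.AddField(1, "filename", "Scott Notes", true)` and `ToyIndex.AddField(1, "fullpath", @"C:\docs\Scott Notes.txt", false)`, looking up `scott` in `filename` succeeds, while `fullpath` matches only the complete path string and not a fragment such as `Scott`.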

    Searching
    public ActionResult Index(string keyWord)
    {
        var originalKeyWords = keyWord;
        ViewBag.TotalResult = 0;
        ViewBag.Results = new List<KeyValuePair<string, string>>();
        if (string.IsNullOrEmpty(keyWord))
        {
            ViewBag.Message = "Welcome Today!";
            return View("Index");
        }

        var q = keyWord;

        // Open the index read-only.
        var search = new IndexSearcher(_indexDir, true);
        // q = GetKeyWordsSplitBySpace(q, new PanGuTokenizer());

        // Parse the query against the "contents" field using the PanGu analyzer.
        var queryParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "contents", new PanGuAnalyzer(false));
        var query = queryParser.Parse(q);
        var hits = search.Search(query, 100); // search.Search(bq, 100);

        var recCount = hits.totalHits;
        ViewBag.TotalResult = recCount;

        // Debug output: the score explanation for every document in the index.
        // Note that the full term list is rebuilt and appended once per document.
        for (int d = 0; d < search.MaxDoc(); d++)
        {
            ViewBag.Explain += search.Explain(query, d).ToHtml();

            var termReader = search.GetIndexReader().Terms();
            ViewBag.Explain += "<ul >";
            do
            {
                // Term() is null until Next() has been called for the first time.
                if (termReader.Term() != null)
                    ViewBag.Explain += string.Format("<li>{0}</li>", termReader.Term().Text());
            } while (termReader.Next());
            ViewBag.Explain += "</ul>";
        }

        // Collect the stored file name of each hit.
        foreach (var hit in hits.scoreDocs)
        {
            try
            {
                var doc = search.Doc(hit.doc);
                var fileName = doc.Get("filename");
                // fileName = highlighter.GetBestFragment(originalKeyWords, fileName);
                //var contents = GetBestFragment(originalKeyWords, new StreamReader(doc.Get("fullpath"), Encoding.GetEncoding("gb2312")));
                (ViewBag.Results as List<KeyValuePair<string, string>>)
                    .Add(new KeyValuePair<string, string>(fileName, string.Empty));
            }
            catch (Exception exc)
            {
                Response.Write(exc.Message);
                throw;
            }
        }

        search.Close();

        // The "????" prefix is what survives of a label lost to an encoding error.
        ViewBag.Message = string.Format("????{0}", keyWord);
        return View("Index");
    }

    Follow-up posts will continue with this code and add comments. I am writing this away from home, a long way out, and it is tiring.

  • Original URL: https://www.cnblogs.com/jinzhao/p/2154229.html