  • [.NET Core Basics] Word-segmentation search in .NET Core with Lucene.Net and jieba.NET

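    The business requirement is fuzzy search over product titles.

    For example, if a user types 【我想查询下雅思托福考试】 ("I'd like to look up IELTS and TOEFL exams"), we first segment the sentence into 【查询】 (look up), 【雅思】 (IELTS), 【托福】 (TOEFL), and 【考试】 (exam), then search for products whose titles contain any of those words.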

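    The approach is as follows.

    First, all products in the database are synced automatically into a Lucene index directory, which acts as a cache of the segmented titles.

    This uses the Hangfire scheduled jobs I wrote about earlier; see this post for reference:

    https://www.cnblogs.com/jhli/p/10027074.html

    Once the cache is refreshed on a schedule, segmented search becomes possible. The index update code is as follows: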

            public void UpdateMerchIndex()
            {
                try
                {
                    Console.WriteLine($"[{DateTime.Now}] UpdateMerchIndex job begin...");
    
                    var indexDir = Path.Combine(System.IO.Directory.GetCurrentDirectory(), "temp", "lucene", "merchs");
                    if (System.IO.Directory.Exists(indexDir) == false)
                    {
                        System.IO.Directory.CreateDirectory(indexDir);
                    }
    
                    var VERSION = Lucene.Net.Util.LuceneVersion.LUCENE_48;
                    var director = FSDirectory.Open(new DirectoryInfo(indexDir));
                    var analyzer = new JieBaAnalyzer(TokenizerMode.Search);
                    var indexWriterConfig = new IndexWriterConfig(VERSION, analyzer);
    
                    using (var indexWriter = new IndexWriter(director, indexWriterConfig))
                    {
                        // Rebuild from scratch on every run. DeleteAll() is safe on an empty
                        // index, so there is no need to probe for segment files first.
                        indexWriter.DeleteAll();
    
                        var query = _merchService.Where(t => t.IsDel == false);
    
                        var index = 1;
                        var size = 200;
    
                        var count = query.Count();
    
                        if (count > 0)
                        {
                            while (true)
                            {
                                var rs = query.OrderBy(t => t.CreateTime)
                                .Skip((index - 1) * size)
                                .Take(size).ToList();
    
                                if (rs.Count == 0)
                                {
                                    break;
                                }
    
                                var addDocs = new List<Document>();
    
                                foreach (var item in rs)
                                {
                                    var merchid = item.IdentityId.ToLowerString();
    
                                    var doc = new Document();
                                    var field1 = new StringField("merchid", merchid, Field.Store.YES);
                                    // TextField rejects a null value, so fall back to an empty string.
                                    var field2 = new TextField("name", (item.Name ?? string.Empty).ToLower(), Field.Store.YES);
                                    doc.Add(field1);
                                    doc.Add(field2);
                                    addDocs.Add(doc); // queue the document for the batch write
    
                                }
                                
                                if (addDocs.Count > 0)
                                {
                                    indexWriter.AddDocuments(addDocs);
                                }
    
                                index = index + 1;
                            }
    
                        }
    
                    }
    
                    Console.WriteLine($"[{DateTime.Now}] UpdateMerchIndex job end!");
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"UpdateMerchIndex ex={ex}");
                }
            }
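
    To run the rebuild on a schedule, the method above can be registered as a Hangfire recurring job. This is a minimal sketch, assuming the method lives in a class called `MerchIndexJob` (a hypothetical name, stubbed here so the snippet stands alone) and that Hangfire storage and the server are already configured as in the linked post. The job id and the hourly schedule are illustrative choices, not values from the original project.

    ```csharp
    using Hangfire;

    // Hypothetical host class; in the real project this is whichever class declares UpdateMerchIndex().
    public class MerchIndexJob
    {
        public void UpdateMerchIndex()
        {
            // The index rebuild shown above goes here.
        }
    }

    public static class MerchIndexSchedule
    {
        // Call once at startup, after Hangfire storage has been configured.
        public static void Register()
        {
            // "merch-index-rebuild" is an arbitrary job id; Cron.Hourly() fires at minute 0 of every hour.
            RecurringJob.AddOrUpdate<MerchIndexJob>(
                "merch-index-rebuild",
                job => job.UpdateMerchIndex(),
                Cron.Hourly());
        }
    }
    ```

    Because the job id is fixed, calling `Register()` again on the next application start updates the existing job rather than creating a duplicate.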

    What remains is to query the index, collect the matching ids, and then load the corresponding rows from the database.

    Search code:

            protected List<Guid> SearchMerchs(string key)
            {
                if (string.IsNullOrEmpty(key))
                {
                    return null;
                }
                key = key.Trim().ToLower();
    
                var rs = new List<Guid>();
    
                try
                {
                    var indexDir = Path.Combine(System.IO.Directory.GetCurrentDirectory(), "temp", "lucene", "merchs");
    
                    if (System.IO.Directory.Exists(indexDir) == true)
                    {
                        // Open the index once, without locking (read-only use), and dispose
                        // the directory and reader when the search is finished.
                        using (var directory = FSDirectory.Open(new DirectoryInfo(indexDir), NoLockFactory.GetNoLockFactory()))
                        using (var reader = DirectoryReader.Open(directory))
                        {
                            var searcher = new IndexSearcher(reader);

                            // OR together one TermQuery per segmented keyword.
                            var booleanQuery = new BooleanQuery();
                            foreach (var word in CutKeyWord(key))
                            {
                                booleanQuery.Add(new TermQuery(new Term("name", word)), Occur.SHOULD);
                            }

                            // Take at most the 1000 best-scoring hits.
                            var docs = searcher.Search(booleanQuery, 1000).ScoreDocs;

                            foreach (var d in docs)
                            {
                                var document = searcher.Doc(d.Doc); // load the stored fields of this hit
                                var merchid = document.Get("merchid");

                                if (Guid.TryParse(merchid, out Guid mid) == true)
                                {
                                    rs.Add(mid);
                                }
                            }
                        }
                    }
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"SearchMerchs ex={ex}");
                }
    
                return rs;
            }
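
    The ids come back in Lucene's relevance order, but a database lookup by a set of ids generally does not preserve that order. Below is a small sketch of restoring it after the lookup. The `Merch` class and the `OrderByRank` helper are hypothetical names for illustration, and `IdentityId` is assumed to be a `Guid` because the search code parses it as one.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical stand-in for the product entity.
    public class Merch
    {
        public Guid IdentityId { get; set; }
        public string Name { get; set; }
    }

    public static class MerchRanking
    {
        // Reorders rows fetched from the database to follow the order of the ranked id list.
        public static List<Merch> OrderByRank(IEnumerable<Merch> rows, IList<Guid> rankedIds)
        {
            // Map each id to its position in the search result; keep the first position if an id repeats.
            var rank = new Dictionary<Guid, int>();
            for (var i = 0; i < rankedIds.Count; i++)
            {
                if (!rank.ContainsKey(rankedIds[i]))
                {
                    rank[rankedIds[i]] = i;
                }
            }

            return rows
                .Where(m => rank.ContainsKey(m.IdentityId)) // ignore rows that were not search hits
                .OrderBy(m => rank[m.IdentityId])
                .ToList();
        }
    }
    ```

    In the project this would be called with the rows loaded for the ids returned by `SearchMerchs` and with that same id list as `rankedIds`.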

    Code that segments the user's input into keywords with JiebaNet:

            protected List<string> CutKeyWord(string key)
            {
                var rs = new List<string>();
                var segmenter = new JiebaSegmenter();
                var list = segmenter.Cut(key);
                if (list != null && list.Count() > 0)
                {
                    foreach (var item in list)
                    {
                        if (string.IsNullOrEmpty(item) || item.Length <= 1)
                        {
                            continue;
                        }
    
                        rs.Add(item);
                    }
                }
                return rs;
            }
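
    One refinement worth considering: if the user repeats a word, `CutKeyWord` returns it twice, so the boolean query gets a duplicate clause and that word is counted twice in the score. The sketch below filters and de-duplicates an already segmented token sequence. `KeywordFilter.Clean` is a hypothetical helper, not part of JiebaNet; it takes `IEnumerable<string>` so the result of `segmenter.Cut(key)` can be passed in directly. Unlike the original, it also trims whitespace from each token.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class KeywordFilter
    {
        // Drops blank and single-character tokens (the same length rule as CutKeyWord)
        // and removes repeated words.
        public static List<string> Clean(IEnumerable<string> tokens)
        {
            if (tokens == null)
            {
                return new List<string>();
            }

            return tokens
                .Where(t => !string.IsNullOrWhiteSpace(t))
                .Select(t => t.Trim())
                .Where(t => t.Length > 1)
                .Distinct()
                .ToList();
        }
    }
    ```

    Inside `CutKeyWord`, the loop could then be replaced by `return KeywordFilter.Clean(segmenter.Cut(key));`.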

    NuGet packages to add, with the versions used:

    Hangfire 1.7.0-beta1

    Lucene.Net 4.8.0-beta00005

    Lucene.Net.Analysis.Common 4.8.0-beta00005

    Lucene.Net.QueryParser 4.8.0-beta00005

    DLL that must be referenced separately:

    JiebaNet.Segmenter.dll

    Download link:

    https://pan.baidu.com/s/1D7mQnow0FmoqedNYzugfKw

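    If everything works when debugging locally but the scheduled job fails after publishing to the server, you may be hitting this problem: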

    https://stackoverflow.com/questions/47746582/hangfire-job-throws-system-typeloadexception

     
    System.TypeLoadException
    
    Could not load type ‘***’ from assembly ‘***, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null’.

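    This error message is misleading; it is not the root cause. Printing the full exception shows what is actually wrong.

    The cause was that the dictionary file dict.txt under the Resources folder had not been published to the server.

    This pitfall cost me half a day...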
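
    A startup check can surface this kind of deployment gap immediately instead of inside a background job. The sketch below assumes the segmenter looks for `Resources/dict.txt` under the application's base directory; that matches the folder named above but should be verified against the jieba.NET build in use. `DictionaryCheck` is a hypothetical helper.

    ```csharp
    using System;
    using System.IO;

    public static class DictionaryCheck
    {
        // Returns the expected dictionary path under the given base directory.
        public static string DictPath(string baseDir)
        {
            return Path.Combine(baseDir, "Resources", "dict.txt");
        }

        // Throws a descriptive exception if the dictionary was not deployed.
        public static void EnsureDictionary(string baseDir)
        {
            var path = DictPath(baseDir);
            if (!File.Exists(path))
            {
                throw new FileNotFoundException(
                    "jieba dictionary not found; check that the Resources folder is included in the publish output.",
                    path);
            }
        }
    }
    ```

    Calling `DictionaryCheck.EnsureDictionary(AppContext.BaseDirectory);` during application startup would make a missing dictionary fail the deployment right away, with a message that names the file.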

  • Original post: https://www.cnblogs.com/jhli/p/10027396.html