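  • [.NET Core Basics] Word-segmented search in .NET Core with Lucene.Net and jieba.NET

    The business requirement is fuzzy search over product titles.

    For example, if a user enters 【我想查询下雅思托福考试】 ("I'd like to look up IELTS and TOEFL exams"), we first segment the sentence into 【查询】 (look up), 【雅思】 (IELTS), 【托福】 (TOEFL), and 【考试】 (exam), then search for products whose titles contain any of those words.

    The approach is as follows.

    First, every product in the database is synced automatically into a Lucene index directory, which serves as a cache.

    This relies on the Hangfire scheduled jobs I wrote about earlier; see the following post:

    https://www.cnblogs.com/jhli/p/10027074.html

    The job refreshes the cache on a schedule, after which word-segmented search works. The index update code is as follows: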

            public void UpdateMerchIndex()
            {
                try
                {
                    Console.WriteLine($"[{DateTime.Now}] UpdateMerchIndex job begin...");
    
                    var indexDir = Path.Combine(System.IO.Directory.GetCurrentDirectory(), "temp", "lucene", "merchs");
                    if (System.IO.Directory.Exists(indexDir) == false)
                    {
                        System.IO.Directory.CreateDirectory(indexDir);
                    }
    
                    var VERSION = Lucene.Net.Util.LuceneVersion.LUCENE_48;
                    var director = FSDirectory.Open(new DirectoryInfo(indexDir));
                    var analyzer = new JieBaAnalyzer(TokenizerMode.Search);
                    var indexWriterConfig = new IndexWriterConfig(VERSION, analyzer);
    
                    using (var indexWriter = new IndexWriter(director, indexWriterConfig))
                    {
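                        // If an index already exists (segments.gen is present), clear it so the rebuild starts from scratch.
                        if (File.Exists(Path.Combine(indexDir, "segments.gen")) == true)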
                        {
                            indexWriter.DeleteAll();
                        }
    
                        var query = _merchService.Where(t => t.IsDel == false);
    
                        var index = 1;
                        var size = 200;
    
                        var count = query.Count();
    
                        if (count > 0)
                        {
                            while (true)
                            {
                                var rs = query.OrderBy(t => t.CreateTime)
                                .Skip((index - 1) * size)
                                .Take(size).ToList();
    
                                if (rs.Count == 0)
                                {
                                    break;
                                }
    
                                var addDocs = new List<Document>();
    
                                foreach (var item in rs)
                                {
                                    var merchid = item.IdentityId.ToLowerString();
    
                                    var doc = new Document();
                                    var field1 = new StringField("merchid", merchid, Field.Store.YES);
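                                    // TextField rejects a null value, so fall back to an empty string.
                                    var field2 = new TextField("name", (item.Name ?? string.Empty).ToLower(), Field.Store.YES);
                                    doc.Add(field1);
                                    doc.Add(field2);
                                    addDocs.Add(doc); // queue the document for indexing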
    
                                }
                                
                                if (addDocs.Count > 0)
                                {
                                    indexWriter.AddDocuments(addDocs);
                                }
    
                                index = index + 1;
                            }
    
                        }
    
                    }
    
                    Console.WriteLine($"[{DateTime.Now}] UpdateMerchIndex job end!");
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"UpdateMerchIndex ex={ex}");
                }
            }
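
    The job above still has to be registered with Hangfire before it runs on a schedule. A minimal sketch, assuming the method lives in a class called `MerchIndexJob` (a hypothetical name), that Hangfire storage is already configured as described in the earlier post, and an hourly interval chosen purely for illustration:

    ```csharp
    using Hangfire;

    // Hypothetical holder for the job; in the real project this is whatever
    // class contains UpdateMerchIndex().
    public class MerchIndexJob
    {
        public void UpdateMerchIndex()
        {
            // index rebuild logic from the listing above
        }
    }

    public static class JobRegistration
    {
        // Call once at startup, after Hangfire storage has been configured.
        public static void Register()
        {
            // "update-merch-index" is an arbitrary job id; Cron.Hourly() is an assumed schedule.
            RecurringJob.AddOrUpdate<MerchIndexJob>(
                "update-merch-index",
                job => job.UpdateMerchIndex(),
                Cron.Hourly());
        }
    }
    ```

    Registering under a fixed job id means that calling `Register()` again on the next deployment updates the existing schedule instead of adding a second job.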

    What remains is to query the index, collect the matching IDs, and then load the corresponding rows from the database.

    Search code:

            protected List<Guid> SearchMerchs(string key)
            {
                if (string.IsNullOrEmpty(key))
                {
                    return null;
                }
                key = key.Trim().ToLower();
    
                var rs = new List<Guid>();
    
                try
                {
                    var indexDir = Path.Combine(System.IO.Directory.GetCurrentDirectory(), "temp", "lucene", "merchs");
    
                    if (System.IO.Directory.Exists(indexDir) == true)
                    {
                        // Open the index once, read-only and without a lock, and dispose it when done.
                        using (var directory = FSDirectory.Open(new DirectoryInfo(indexDir), NoLockFactory.GetNoLockFactory()))
                        using (var reader = DirectoryReader.Open(directory))
                        {
                            var searcher = new IndexSearcher(reader);
                            var booleanQuery = new BooleanQuery();

                            // OR together one TermQuery per segmented word.
                            var list = CutKeyWord(key);
                            foreach (var word in list)
                            {
                                var query1 = new TermQuery(new Term("name", word));
                                booleanQuery.Add(query1, Occur.SHOULD);
                            }

                            // Keep at most the 1000 best-scoring hits.
                            var collector = TopScoreDocCollector.Create(1000, true);
                            searcher.Search(booleanQuery, null, collector);
                            var docs = collector.GetTopDocs(0, collector.TotalHits).ScoreDocs;

                            foreach (var d in docs)
                            {
                                var document = searcher.Doc(d.Doc); // fetch the stored document

                                var merchid = document.Get("merchid");

                                if (Guid.TryParse(merchid, out Guid mid) == true)
                                {
                                    rs.Add(mid);
                                }
                            }
                        }
                    }
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"SearchMerchs ex={ex}");
                }
    
                return rs;
            }
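
    `SearchMerchs` returns IDs in descending score order, but loading rows by a set of IDs from the database generally does not preserve that order. A minimal sketch of restoring the ranking after the rows are loaded, using an in-memory list in place of the real table; the `Merch` class and the `OrderByRank` helper are hypothetical names, not part of the original project:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical stand-in for the product entity.
    public class Merch
    {
        public Guid IdentityId { get; set; }
        public string Name { get; set; }
    }

    public static class SearchResultOrdering
    {
        // Reorders rows loaded from the database so they follow the order of rankedIds.
        public static List<Merch> OrderByRank(IEnumerable<Merch> rows, IList<Guid> rankedIds)
        {
            // Map each id to its position in the search result.
            var rank = new Dictionary<Guid, int>();
            for (var i = 0; i < rankedIds.Count; i++)
            {
                rank[rankedIds[i]] = i;
            }

            return rows
                .Where(m => rank.ContainsKey(m.IdentityId)) // ignore rows the search did not return
                .OrderBy(m => rank[m.IdentityId])
                .ToList();
        }

        public static void Main()
        {
            var a = Guid.NewGuid();
            var b = Guid.NewGuid();

            // Pretend the database returned the rows in an arbitrary order.
            var rows = new List<Merch>
            {
                new Merch { IdentityId = a, Name = "toefl prep course" },
                new Merch { IdentityId = b, Name = "ielts exam guide" },
            };

            // Pretend the index ranked b above a.
            var ordered = OrderByRank(rows, new List<Guid> { b, a });

            foreach (var m in ordered)
            {
                Console.WriteLine(m.Name);
            }
        }
    }
    ```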

    The code that segments the user's input into words, using JiebaNet:

            protected List<string> CutKeyWord(string key)
            {
                var rs = new List<string>();
                var segmenter = new JiebaSegmenter();
                var list = segmenter.Cut(key);
                if (list != null && list.Count() > 0)
                {
                    foreach (var item in list)
                    {
                        if (string.IsNullOrEmpty(item) || item.Length <= 1)
                        {
                            continue;
                        }
    
                        rs.Add(item);
                    }
                }
                return rs;
            }

    NuGet packages to add, with their versions:

    Hangfire 1.7.0-beta1

    Lucene.Net 4.8.0-beta00005

    Lucene.Net.Analysis.Common 4.8.0-beta00005

    Lucene.Net.QueryParser 4.8.0-beta00005

    DLL file that must be referenced separately:

    JiebaNet.Segmenter.dll 

    Download link:

    https://pan.baidu.com/s/1D7mQnow0FmoqedNYzugfKw

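    If everything works when debugging locally but the scheduled job fails after publishing to the server, you may be hitting this problem: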

    https://stackoverflow.com/questions/47746582/hangfire-job-throws-system-typeloadexception

     
    System.TypeLoadException
    
    Could not load type ‘***’ from assembly ‘***, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null’.

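    This error is not the real cause, though; printing the full exception shows what actually went wrong.

    The cause was that the dictionary file dict.txt under the Resources folder had not been published to the server.

    This pitfall cost me half a day...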
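
    One way to surface this kind of problem earlier is to check for the dictionary file at startup and fail with an explicit message. A minimal sketch, assuming the dictionary is expected at `Resources/dict.txt` under the application's base directory; the actual lookup path depends on how jieba.NET is configured, and `DictionaryCheck` is a hypothetical helper:

    ```csharp
    using System;
    using System.IO;

    public static class DictionaryCheck
    {
        // Throws if the dictionary file is missing under baseDir/Resources.
        public static void EnsureDictionaryExists(string baseDir)
        {
            // Assumed location; adjust to match the deployed layout.
            var dictPath = Path.Combine(baseDir, "Resources", "dict.txt");
            if (!File.Exists(dictPath))
            {
                throw new FileNotFoundException(
                    "jieba dictionary not found; was the Resources folder published?", dictPath);
            }
        }

        public static void Main()
        {
            try
            {
                EnsureDictionaryExists(AppContext.BaseDirectory);
                Console.WriteLine("dict.txt found");
            }
            catch (FileNotFoundException ex)
            {
                Console.WriteLine($"{ex.Message} ({ex.FileName})");
            }
        }
    }
    ```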

  • Original article: https://www.cnblogs.com/jhli/p/10027396.html