  • Sphinx Chinese Guide (2): Chinese Word Segmentation with Coreseek

    Before reading this article, please read the previous part, the introductory guide to using Sphinx with Chinese.

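    As far as I know, there are currently three ways to get Chinese word segmentation working with Sphinx:

    1. Coreseek

    2. Sphinx-for-chinese

    3. Segment the text on the client side first, then search the resulting terms directly against a Sphinx single-character index (see the original installation article)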

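    Installing Coreseek

    The previous part covered the prerequisites for installing Sphinx, so they are not repeated here. This article builds on that one.

    Download Coreseek: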

    [root@localhost ~]# cd /usr/local/src
    [root@localhost src]# wget http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz   # Coreseek source
    [root@localhost src]# wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz  # mmseg, the segmenter and dictionary used by Coreseek
    [root@localhost src]# tar zxvf csft-3.1.tar.gz
    [root@localhost src]# tar zxvf mmseg-3.1.tar.gz

    # mmseg must be installed before Coreseek
    [root@localhost src]# cd mmseg-3.1
    [root@localhost mmseg-3.1]# ./configure --prefix=/usr/local/mmseg
    [root@localhost mmseg-3.1]# make && make install

    ######## Install Coreseek ########
    # The Python data source is not used here; add --with-python if you need it.
    # The mmseg paths must match the prefix mmseg was installed under.

    [root@localhost mmseg-3.1]# cd ../csft-3.1
    [root@localhost csft-3.1]# ./configure --prefix=/usr/local/coreseek \
                                    --with-mmseg-includes=/usr/local/mmseg/include/mmseg \
                                    --with-mmseg-libs=/usr/local/mmseg/lib --without-iconv
    [root@localhost csft-3.1]# make && make install

    If nothing goes wrong, the install creates the coreseek directory and its files under /usr/local/.
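A quick way to confirm the install is to check that the binaries used later in this article are in place. The sketch below assumes the two --prefix values used above. `check_files` is a hypothetical helper name, not part of Coreseek or mmseg, and the `searchd` path is assumed from the usual Sphinx install layout.

```shell
#!/bin/sh
# check_files FILE...: print the status of each path and return non-zero
# if any of them is not an executable file.
check_files() {
    missing=0
    for f in "$@"; do
        if [ -x "$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            missing=1
        fi
    done
    return "$missing"
}

# Paths follow the --prefix values used in this article (an assumption if
# you installed somewhere else).
check_files \
    /usr/local/mmseg/bin/mmseg \
    /usr/local/coreseek/bin/indexer \
    /usr/local/coreseek/bin/search \
    /usr/local/coreseek/bin/searchd \
    || echo "Some files are missing; re-check the configure and make output."
```

If anything is reported as MISSING, the most likely cause is a configure or make error further up in the terminal output.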

    Next, generate the mmseg dictionary and its configuration file:

    [root@localhost csft-3.1]# cd /usr/local/mmseg
    [root@localhost mmseg]# bin/mmseg -u /usr/local/src/mmseg-3.1/data/unigram.txt   # unigram.txt is the dictionary source; this generates unigram.txt.uni
    [root@localhost mmseg]# cd ../coreseek
    [root@localhost coreseek]# mkdir dict   # create the dictionary directory
    [root@localhost coreseek]# cp /usr/local/src/mmseg-3.1/data/unigram.txt.uni dict/uni.lib   # copy the generated dictionary into dict
    [root@localhost coreseek]# vim dict/mmseg.ini   # create the mmseg config file; the Windows build of Coreseek already ships with one

    mmseg.ini:
    [mmseg]
    merge_number_and_ascii=1;
    number_and_ascii_joint=-;
    compress_space=0;
    seperate_number_ascii=1;
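If you would rather not edit the file interactively, the same content can be written with a here-document. This is only a minimal sketch: `write_mmseg_ini` is a hypothetical helper name, and the example call assumes the current directory is /usr/local/coreseek so that `dict` is the dictionary directory created earlier.

```shell
#!/bin/sh
# write_mmseg_ini DIR: create DIR if needed and write the mmseg.ini
# settings listed in this article into it.
write_mmseg_ini() {
    mkdir -p "$1" || return 1
    cat > "$1/mmseg.ini" <<'EOF'
[mmseg]
merge_number_and_ascii=1;
number_and_ascii_joint=-;
compress_space=0;
seperate_number_ascii=1;
EOF
}

# Assumes the current directory is /usr/local/coreseek.
write_mmseg_ini dict && cat dict/mmseg.ini
```

The key name `seperate_number_ascii` is spelled exactly as in the listing above; it is copied as-is rather than corrected.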
    mmseg is now configured. The next step is csft.conf, the Coreseek configuration file.

    My example configuration:
    source article_src
    {
            type                                    = mysql
            sql_host                                = 192.168.1.10
            sql_user                                = root
            sql_pass                                = pwd
            sql_db                                  = test
            sql_port                                = 3306  # optional, default is 3306

            sql_query_pre                           = SET NAMES utf8
            #sql_query_pre                          = SET SESSION query_cache_type=OFF  # uncomment to turn off the SQL query cache
            sql_query                               = SELECT id,title,cat_id,member_id,content,created FROM sphinx_article

            sql_attr_uint                           = cat_id
            sql_attr_uint                           = member_id
            sql_attr_timestamp                      = created
            sql_query_info                          = select * from sphinx_article where id=$id
    }

    index article
    {
            source                                  = article_src
            path                                    = /usr/local/coreseek/var/data/article
            docinfo                                 = extern
            charset_type                            = zh_cn.utf-8  # charset used by Coreseek
            charset_dictpath                        = /usr/local/coreseek/dict  # directory holding the Coreseek dictionary files

            min_prefix_len                          = 0
            min_infix_len                           = 0
            min_word_len                            = 2
            ngram_len                               = 1
            ngram_chars                             = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \
                    U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \
                    U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \
                    U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
            html_strip                              = 0
    }

    indexer
    {
            mem_limit                               = 256M
    }

    searchd
    {
            # address                               = 0.0.0.0
            log                                     = /usr/local/coreseek/var/log/searchd.log
            query_log                               = /usr/local/coreseek/var/log/query.log
            read_timeout                            = 5
            max_children                            = 30
            pid_file                                = /usr/local/coreseek/var/log/searchd.pid
            max_matches                             = 1000
            seamless_rotate                         = 1
    }
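One easy mistake in a file like this is an index whose `source` line names a source block that does not exist. The following sketch checks for that with awk. `check_sources` is a hypothetical helper (not a Coreseek tool), it only looks at `source` lines and ignores everything else, and the example call assumes the config is saved as csft.conf in the current directory.

```shell
#!/bin/sh
# check_sources CONF: report index "source = NAME" lines whose NAME has no
# matching "source NAME" block. Returns non-zero if any are found.
check_sources() {
    awk '
        {
            sub(/#.*/, "")         # drop comments
            gsub(/[=:]/, " & ")    # make "=" and ":" separate fields
        }
        $1 == "source" && NF >= 2 && $2 != "=" { defined[$2] = 1 }  # block header
        $1 == "source" && $2 == "="            { used[$3] = 1 }     # reference
        END {
            bad = 0
            for (name in used)
                if (!(name in defined)) {
                    print "undefined source: " name
                    bad = 1
                }
            exit bad
        }
    ' "$1"
}

# Assumes csft.conf is in the current directory.
if [ -f csft.conf ] && check_sources csft.conf; then
    echo "all index sources are defined"
fi
```

This is a lint for one specific typo, not a config validator; running the indexer remains the real test of the file.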
     

    Build the index:
    [root@localhost coreseek]# bin/indexer article
    Coreseek Full Text Server 3.1
     Copyright (c) 2006-2008 coreseek.com
    using config file './csft.conf'...
    indexing index 'article'...
    collected 1000 docs, 0.0 MB
    sorted 0.0 Mhits, 100.0% done
    total 1000 docs, 21460 bytes
    total 3.244 sec, 6614.99 bytes/sec, 30.82 docs/sec
    total 2 reads, 0.0 sec, 26.8 kb/read avg, 0.4 msec/read avg
    total 5 writes, 0.0 sec, 11.0 kb/write avg, 0.1 msec/write avg
    [root@localhost coreseek]#

    Test it with the command-line search tool:

    [root@localhost coreseek]# bin/search -c csft.conf -i article 建筑材料租赁
    Coreseek Full Text Server 3.1
     Copyright (c) 2006-2008 coreseek.com
    using config file 'csft.conf'...
    index 'article': query '建筑材料租赁 ': returned 1 matches of 1 total in 0.035 sec

    displaying matches:
    1. document=14, weight=3
            id=14
            title=???????????????
            cat_id=1
            member_id=2
            content=??????????????????????????????????????????????????????????
            created=1264244709
    words:
    1. '建筑': 3 documents, 3 hits
    2. '材料': 4 documents, 4 hits
    3. '租赁': 2 documents, 2 hits
    [root@localhost coreseek]#

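    As the output shows, the query was split into Chinese words correctly, and the matching row was fetched from the SQL database.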

    That concludes the Coreseek part of this Sphinx Chinese word segmentation guide. Next: Chinese word segmentation with Sphinx-for-chinese.
    Last modified January 24, 2010

  • Original article: https://www.cnblogs.com/Jerry-blog/p/5044631.html