zoukankan      html  css  js  c++  java
  • Sphinx全文索引安装教程

    关键字: sphinx, 全文索引, 安装 首先了解一下sphinx全文索引的相关知识
    官方网站http://www.sphinxsearch.com/
    官方文档:http://www.sphinxsearch.com/docs/
    中文支持:http://www.coreseek.cn/
    中文使用手册下载http://www.coreseek.cn/uploads/pdf/sphinx_doc_zhcn_0.9.pdf

    基本上看看上面的官方教程和中文使用手册,你应该会安装和使用Sphix全文索引,当然,还有一些细节,需要不断的google和baidu,那为了节省大家的时间,就出一个完整的Sphinx安装教程和结合PHPWIND程序的使用教程(PHPWIND7.5版本支持)。

    接下来开始Sphinx的技术之旅吧!

    考虑到Sphinx全文索引使用的实际需要,主要介绍Sphinx全文索引中文方面的支持。
    这里需要感谢李沫南同学对Sphinx全文索引中文支持的贡献!

    一,Windows下安装Sphinx

    1,开始前的准备工作
    来源:http://www.coreseek.cn/products/ft_down/
    下载csft3.1:http://www.coreseek.cn/uploads/csft/3.1/win32/csft3.1.bin.zip
    下载标准词库:http://www.coreseek.cn/uploads/csft/3.1/data.zip
    解压:csft3.1.bin.zip 如下目录,解压在C:\csft3.1目录下
    解压:data.zip,解压在C:\csft3.1\data目录下 [分词包]


    需要新建log文件夹

    (1)复制    C:\csft3.1\conf\csft.conf.in    文件到    C:\csft3.1\bin\    目录下,并重命名为csft.conf
    注意csft.conf文件里的类似:path = @CONFDIR@/data/test1
    把@CONFDIR@替换为C:\csft3.1\ 如上更改为:path = C:\csft3.1\ data\test1

    (2)把测试数据    C:\csft3.1\conf\example.sql    导入数据库 [这个基本都会吧!]

    (3)建立索引,在DOC界面下运行:indexer.exe --all 如下图,


    建立索引过程需要仔细检查csft.conf数据库配置是否正确。如下:
    sql_host               = localhost    #数据库主机地址
    sql_user               = test  #数据库用户名,拥有数据库所有权限
    sql_pass               =
    sql_db                  = test   #数据库名
    sql_port                = 3306 #可用端口,一般不需要更改

    其它配置使用默认,先体验下sphinx全文索引功能。

    (4)测试搜索是否正常,运行:search.exe test 如下图



    测试正常将返回

    (5)开启搜索进程服务,运行:searchd.exe 如下图

    这样就能提供sphinx全文索引的搜索服务了,以上就是一个简单的操作过程,如果需要支持中文索引,就需要配置相应的参数,具体请查看中文使用手册。为了便于大家了解相关配置,可查看PHPWind程序支持Sphinx全文索引的配置文件,大家可边对照手册边了解[中文支持具体请看linux安装部分]。

    附:PHPWind程序支持Sphinx全文索引的配置。

    Windows下安装Sphix使用csft非常简单,如果大家有兴趣可从sphinx[www.sphinxsearch.com]官方下载安装,不过有点复杂,这里就不介绍了,高手们慢慢体验。

    二,linux下安装Sphinx全文索引,以CentOS 5.3为例

    只能说windows下安装sphinx只是为了体验,因为linux下安装sphinx才是正道。
    为了详细体验Centos下安装Sphinx,重新安装Centos系统,完整体验Sphinx安装过程。
    Coreseek 全文检索服务器版本已经集成sphinx和中文分词补丁,只需要下载MMSeg和Coreseek Fulltext Server(源代码),就能实现Sphinx服务支持。
    下载地址:http://www.coreseek.cn/products/ft_down/

    推荐源代码安装

    1,开始前的准备工作 [如果已经安装就不需要,如果下面列表没有还有其它的请补上]
    1)安装mysql
    2)安装php
    3)安装apache
    4)安装python
    5)安装libiconv
    6)安装gcc-c++
    7)下载Coreseek Fulltext Server(源代码):http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz
    8)下载Coreseek Mmseg(源代码):http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz

    执行如下命令
    yum install python python-dev

    2,安装步骤
    (1)下载CSFT与MMseg
    #wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz
    #wget http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz

    (2)安装MMseg中文分词
    # pwd
    /usr/local [知道当前的安装目录]
    # wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz
    # tar xzvf mmseg-3.1.tar.gz
    # mkdir /usr/local/mmseg
    # cd mmseg-3.1
    # ./configure --prefix=/usr/local/mmseg
    # make
    # make install

    运行如下,看看mmseg是否安装成功
    # /usr/local/mmseg/bin/mmseg
    Coreseek COS(tm) MM Segment 1.0
    Copyright By Coreseek.com All Right Reserved.
    Usage: /usr/local/mmseg/bin/mmseg <option> <file>
    -u <unidict>           Unigram Dictionary
    -r           Combine with -u, used a plain text build Unigram Dictionary, default Off
    -b <Synonyms>           Synonyms Dictionary
    -h            print this help and exit


    (3)安装csft-3.1
    # pwd
    /usr/local
    # wget http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz
    # tar xzvf csft-3.1.tar.gz
    # mkdir /usr/local/csft
    # cd csft-3.1
    # ./configure --prefix=/usr/local/csft --with-mmseg=/usr/local/mmseg/bin/mmseg --with-mmseg-includes=/usr/local/mmseg/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg/lib/
    # make
    # make install

    这里make的时候可能出错,解决如下:
    1,检查环境是否安装如下软件
    # yum install mysql mysql-devel php-mysql qt4-mysql   [mysql环境要首先安装]
    # yum install python python-dev

    2,是否安装libiconv
    下载地址:http://savannah.gnu.org/projects/libiconv/

    3,如果还有错误,打开src/Makefile文件,进行修改
    # vi src/Makefile 找到182行



    LIBS = -lm -lz -lexpat  -L/usr/local/lib -lpthread
    LIBS = -lm -lz -lexpat -liconv -L/usr/local/lib -lpthread

    这样,如果一切顺利,就开始配置你的sphinx全文索引服务器吧[如果安装有什么问题,欢迎在PHPWind官方提问]!

    3,按下来就是配置
    #cp /usr/local/csft/etc/sphinx-min.conf.dist /usr/local/csft/etc/sphinx.conf
    修改sphinx.conf文件中的数据库参数配置,方法同windows下一样
    sql_host                = localhost
    sql_user                = root
    sql_pass               =
    sql_db                  = test

    4,把体验数据/usr/local/csft/etc/example.sql 导入到数据库 [这一步应该都会]
    5,新建索引
    # /usr/local/csft/bin/indexer --all

    6,测试搜索
    # /usr/local/csft/bin/search test
    如果测试有返回,恭喜你的sphinx全文索引服务器配置成功

    7,接下来就是支持中文的配置和实现

    UTF8编码实例 [如果已经存在utf8的数据库就不需要新建,这里只是举例]
    1)创建一个新的数据库,注意编码为utf8_general_ci,如phpwind
    2)导入部分现有的GBK数据,如pw_threads
    3)配置csft.conf如下
    source数据源部分
    sql_host                = localhost
    sql_user                  = root
    sql_pass                 =
    sql_db                     = phpwind
    sql_query_pre         = SET NAMES utf8
    sql_query_pre         = SET SESSION query_cache_type=OFF
    sql_query                = SELECT tid,fid,authorid,subject FROM pw_threads
    sql_attr_uint            = fid
    sql_attr_uint            = authorid

    索引部分
    charset_type            = zh_cn.utf-8
    charset_dictpath       = /usr/local/csft/
    min_prefix_len           = 0
    min_infix_len             = 0
    min_word_len            = 2

    4)创建数据词典
    #pwd
    /usr/local/mmseg-3.1/data   [这是你解压mmseg的目录下的data]
    运行如下命令
    # mmseg -u unigram.txt
    # ll
    总计 10152
    -rwxr-xr-x 1 root root     715 06-06 18:40 build_unigram.py
    -rwxr-xr-x 1 root root   32674 06-06 18:40 char.stat.txt
    -rwxr-xr-x 1 root root 1051268 06-06 18:40 Lexicon_full_words.txt
    -rwxr-xr-x 1 root root 1826251 06-06 18:40 unigram.txt
    -rw-r--r-- 1 root root 3729280 09-16 20:20 unigram.txt.uni

    将会生成 unigram.txt.uni  文件
    # mv unigram.txt.uni  uni.lib
    # cp uni.lib /usr/local/csft/  [这就是上面我们在配置索引中用的charset_dictpath]

    其它的默认不变,如上方法创建索引
    # /usr/local/csft/bin/indexer --all

    测试是否成功
    # /usr/local/csft/bin/search 测试

    以上就是utf8编码的全文索引实现过程

    GBK编码实例

    与utf8一样,区别在于数据库和数据表使用gbk编码
    同时只需要修改如下配置部分[csft.conf]

    source数据源部分
    sql_query_pre     = SET NAMES gbk

    索引部分
    charset_type            = zh_cn.gbk

    这里需要注意一下,如果要想测试支持gbk,可以写一个PHP文件,调用sphinx提供的api接口,注意要开启searchd进程

    # /usr/local/csft/bin/searchd

    编写如下代码 [注意要与sphinxapi.php目录存放在一个目录]
    sphinxapi.php目录在# /usr/local/csft-3.1/api/下
    也可以直接使用api目录下的test.php直接测试
    <?php
    require_once 'sphinxapi.php';
    $sc = new SphinxClient();
    $sc->SetServer('127.0.0.1',3312);
    $sc->SetConnectTimeout(1);
    $sc->SetWeights(array(100,1));
    $sc->SetMatchMode(SPH_MATCH_ALL);
    $sc->SetArrayResult(TRUE);
    $res = $sc->query("简单");
    var_dump($res);
    ?>

    也可以直接运行search工具[utf8版],如下



    [root@localhost ~]# /usr/local/csft/bin/search 便宜
    Coreseek Full Text Server 3.1
     Copyright (c) 2006-2008 coreseek.com
    using config file '/usr/local/csft/etc/csft.conf'...
    index 'test1': query '便宜 ': returned 4 matches of 4 total in 0.015 sec

    displaying matches:
    1. document=3, weight=1, fid=7, authorid=1
    2. document=97, weight=1, fid=35, authorid=1
    3. document=108, weight=1, fid=32, authorid=1
    4. document=146, weight=1, fid=7, authorid=1

    words:
    1. '便宜': 4 documents, 4 hits

    如果返回false,请检查searchd进程是否开启,如果返回成功,恭喜,你已经成为sphinx的使用者,向下一个高层次进军吧!

    三,后记
    其实很想制作一个安装视频教程,但由于时间有限,在安装过程中肯定会存在一些细节上的问题,只要大家按照上面的步骤一步一步安装,相信能把sphinx拿下,如果有什么问题
    大家可查看http://www.sphinxsearch.com/http://www.coreseek.cn/网站获取更多帮助,同时也可以查看中文手册。

    同时也可以在phpwind官方网站www.phpwind.net提问和分享你的安装过程,把一个细节都亮出来,帮助别人也帮助自己。BY liuhui.php@gmail.com 2009-9-17

    其它链接
    用 PHP 构建自定义搜索引擎
    http://www.ibm.com/developerworks/cn/opensource/os-php-sphinxsearch/index.html

    MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm
    http://technology.chtsai.org/mmseg/

    附phpwind配置实例[gbk版]
    PHPWind搜索sphinx配置实例 [修改部分参数就可直接应用于phpwind程序]

    部分解读:
    如下全文索引使用的是主索引+增量索引的方式,具体大家结合手册了解相关知识

    需要创建一张表 [编码自己定,如下是gbk]
    CREATE TABLE IF NOT EXISTS `search_counter` (
      `counterid` int(11) NOT NULL DEFAULT '0',
      `max_doc_id` int(11) NOT NULL DEFAULT '0',
      `min_doc_id` int(10) NOT NULL DEFAULT '0',
      PRIMARY KEY (`counterid`)
    ) ENGINE=MyISAM DEFAULT CHARSET=gbk;


    csft.conf配置文件

    source tmsgs
    {
        type                                    = mysql
        sql_host                                = localhost
        sql_user                                = root
        sql_pass                                = xxxx
        sql_db                                  = phpwind
        sql_port                                = 3307  # optional, default is 3306
        sql_sock                                = /tmp/mysql3307.sock
        sql_query_pre                           = SET NAMES gbk
        sql_query_pre                           = SET SESSION query_cache_type=OFF
        sql_query_pre                           = REPLACE INTO search_counter SELECT 1,MAX(tid),MIN(tid) FROM pw_tmsgs
        sql_query_range                    = SELECT min_doc_id, max_doc_id FROM search_counter WHERE counterid = 1
        sql_range_step                          = 1000
        sql_query                               = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies,t.content FROM pw_threads th  LEFT JOIN pw_tmsgs t USING(tid) WHERE th.tid > $start AND th.tid <= $end
            
        sql_attr_uint                           = authorid
        sql_attr_uint                           = hits
        sql_attr_uint                           = replies
        sql_attr_uint                           = fid
        sql_attr_timestamp                      = postdate
        sql_attr_timestamp                      = lastpost
        sql_attr_uint                           = digest
    }

    source addtmsgs
    {
        type                                    = mysql
        sql_host                                = localhost
        sql_user                                = root
        sql_pass                                = xxxx
        sql_db                                  = phpwind
        sql_port                                = 3307  # optional, default is 3306
        sql_sock                                = /tmp/mysql3307.sock
        sql_query_pre                           = SET NAMES gbk
        sql_query_pre                           = SET SESSION query_cache_type=OFF
        sql_query_range                    = SELECT max_doc_id, max_doc_id+100000 FROM search_counter WHERE counterid = 1
        sql_range_step                          = 100000
        sql_query                               = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies,t.content FROM pw_threads th  LEFT JOIN pw_tmsgs t USING(tid) WHERE th.tid > $start AND th.tid <= $end
            
        sql_attr_uint                           = authorid
        sql_attr_uint                           = hits
        sql_attr_uint                           = replies
        sql_attr_uint                           = fid
        sql_attr_timestamp                      = postdate
        sql_attr_timestamp                      = lastpost
        sql_attr_uint                           = digest
        sql_query_post                         = REPLACE INTO search_counter SELECT 1,MAX(tid),MIN(tid) FROM pw_tmsgs
        #sql_attr_uint                          = tid
    }

    source threads
    {
        type                                    = mysql
        sql_host                                = localhost
        sql_user                                = root
        sql_pass                                = xxxxxxx
        sql_db                                  = phpwind
        sql_port                                = 3307  # optional, default is 3306
        sql_sock                                = /tmp/mysql3307.sock
        sql_query_pre                           = SET NAMES gbk
        sql_query_pre                           = SET SESSION query_cache_type=OFF
        sql_query_pre                           = REPLACE INTO search_counter SELECT 3,MAX(tid),MIN(tid) FROM pw_threads
        sql_query_range                    = SELECT min_doc_id, max_doc_id FROM search_counter WHERE counterid = 3
        sql_range_step                          = 1000
        sql_query                               = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies FROM pw_threads th  WHERE th.tid > $start AND th.tid <= $end
        sql_attr_uint                           = authorid
        sql_attr_uint                           = hits
        sql_attr_uint                           = replies
        sql_attr_uint                           = fid
        sql_attr_timestamp                      = postdate
        sql_attr_timestamp                      = lastpost
        sql_attr_uint                           = digest
    }

    source addthreads
    {
            type                                    = mysql
            sql_host                                = localhost
            sql_user                                = root
            sql_pass                                = xxx
            sql_db                                  = phpwind
            sql_port                                = 3307  # optional, default is 3306
        sql_sock                                = /tmp/mysql3307.sock
        sql_query_pre                           = SET NAMES gbk
        sql_query_pre                           = SET SESSION query_cache_type=OFF
        sql_query_range                    = SELECT max_doc_id, max_doc_id+100000 FROM search_counter WHERE counterid = 3
        sql_range_step                          = 100000
        sql_query                               = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies FROM pw_threads th  WHERE th.tid > $start AND th.tid <= $end
            
        sql_attr_uint                           = authorid
        sql_attr_uint                           = hits
        sql_attr_uint                           = replies
        sql_attr_uint                           = fid
        sql_attr_timestamp                      = postdate
        sql_attr_timestamp                      = lastpost
        sql_attr_uint                           = digest
        sql_query_post                         = REPLACE INTO search_counter SELECT 3,MAX(tid),MIN(tid) FROM pw_threads
        #sql_attr_uint                          = tid
    }

    index tmsgsindex
    {
            source                                  = tmsgs
            path                                    = /usr/local/csft/var/data/tmsgs
            docinfo                                 = extern
            charset_type                            = zh_cn.gbk
            #min_prefix_len  = 0
            #min_infix_len  = 2
            #ngram_len = 2
            charset_dictpath                        = /usr/local/csft/
            min_prefix_len                          = 0
            min_infix_len                           = 0
            min_word_len                            = 2
    }

    index addtmsgsindex
    {
            source                                  = addtmsgs
            path                                    = /usr/local/csft/var/data/addtmsgs
            docinfo                                 = extern
            charset_type  = zh_cn.gbk
            #min_infix_len  = 2
            #ngram_len = 2
            charset_dictpath                    = /usr/local/csft/
            min_prefix_len                        = 0
            min_infix_len                          = 0
            min_word_len                         = 2
    }
    index threadsindex
    {
            source                                  = threads
            path                                    = /usr/local/csft/var/data/threads
            docinfo                                 = extern
            charset_type                            = zh_cn.gbk
            #min_prefix_len  = 0
            #min_infix_len  = 2
            #ngram_len = 2
            charset_dictpath                        = /usr/local/csft/
            min_prefix_len                          = 0
            min_infix_len                           = 0
            min_word_len                            = 2
    }

    index addthreadsindex
    {
            source                                  = addthreads
            path                                    = /usr/local/csft/var/data/addthreads
            docinfo                                 = extern
            charset_type  = zh_cn.gbk
            #min_infix_len  = 2
            #ngram_len = 2
            charset_dictpath                    = /usr/local/csft/
            min_prefix_len                        = 0
            min_infix_len                          = 0
            min_word_len                         = 2
    }
    indexer
    {
            mem_limit                               = 128M
    }

    searchd
    {
            port                                = 3312
            log                                 = /usr/local/csft/var/log/searchd.log
            query_log                           = /usr/local/csft/var/log/query.log
            read_timeout                        = 5
            max_children                        = 30
            pid_file                                = /usr/local/csft/var/log/searchd.pid
            max_matches                         = 1000
            seamless_rotate                     = 1
            preopen_indexes                     = 0
            unlink_old                          = 1
    }

  • 相关阅读:
    121. Best Time to Buy and Sell Stock
    70. Climbing Stairs
    647. Palindromic Substrings
    609. Find Duplicate File in System
    583. Delete Operation for Two Strings
    556 Next Greater Element III
    553. Optimal Division
    539. Minimum Time Difference
    537. Complex Number Multiplication
    227. Basic Calculator II
  • 原文地址:https://www.cnblogs.com/myphoebe/p/2144946.html
Copyright © 2011-2022 走看看