  • Sphinx (Coreseek), Part 2: A Ranged Query Example

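    • First, the definition of a ranged query:

    The indexer has to fetch every document through the main query. A simple implementation would read the whole table into memory, but that can lock the entire table and block other operations (for example, INSERTs on MyISAM tables), and it also wastes a large amount of memory holding the query result. To avoid this, CoreSeek/Sphinx supports a technique called ranged queries. CoreSeek/Sphinx first fetches the minimum and maximum document IDs from the database, splits the integer interval they define into chunks, and then fetches and indexes the data one chunk at a time. For example:

    Example 3.1. Ranged query usage example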

    # in sphinx.conf
    
    sql_query_range    = SELECT MIN(id),MAX(id) FROM documents
    sql_range_step = 1000
    sql_query = SELECT * FROM documents WHERE id>=$start AND id<=$end

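    If the documents table has a minimum id of 1 and a maximum id of 2345, sql_query will run three times:

    1. with $start replaced by 1 and $end replaced by 1000;
    2. with $start replaced by 1001 and $end replaced by 2000;
    3. with $start replaced by 2001 and $end replaced by 2345.

    Obviously, for a table of only about 2,000 rows, ranged queries make little difference compared with reading everything at once, but once the table grows to tens of millions of rows (especially a MyISAM table), ranged queries will help.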
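The three substitutions listed above can be reproduced with a few lines of Python. This is only a sketch of the splitting rule as the documentation describes it, not Sphinx's actual implementation; the function name `range_steps` is made up for illustration.

```python
def range_steps(min_id, max_id, step):
    """Yield inclusive (start, end) pairs covering [min_id, max_id] in chunks of `step` ids."""
    start = min_id
    while start <= max_id:
        # The last chunk is clamped to max_id, as in step 3 of the example.
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


# Values from Example 3.1: MIN(id)=1, MAX(id)=2345, sql_range_step=1000.
ranges = list(range_steps(1, 2345, 1000))
for start, end in ranges:
    print(f"SELECT * FROM documents WHERE id>={start} AND id<={end}")
```

Each printed statement corresponds to one execution of sql_query with $start and $end filled in.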

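    • That was the definition from the Coreseek documentation. Here is a test run on my own machine:

    While building a domain MX lookup system, I collected the www page titles of several million domains, so the test below searches that website title data.

    Edit the Coreseek configuration file used for the test, csft.range.conf: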

    source src
    {
            type                    = mysql
            # some straightforward parameters for SQL source types
            sql_host                = localhost
            sql_user                = root
            sql_pass                = xxxxxxxxxxxxx
            sql_db                  = whomx
            sql_port                = 3306  # optional, default is 3306
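            sql_query_pre           = SET NAMES utf8
            sql_query_pre           = SET SESSION query_cache_type=OFF
            # a multi-line value needs a trailing backslash on every line but the last
            sql_query               = \
                    SELECT i.id, title \
                    FROM mx_domain_wwwinfo i \
                    WHERE id>=$start AND id<=$end
            sql_query_range         = SELECT MIN(id),MAX(id) FROM mx_domain_wwwinfo
            # sql_range_step is not set here, so the default step applies (1024 per the Sphinx documentation)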
    }
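To make the effect of this source block concrete, here is a rough simulation of the fetch loop it asks the indexer to run. It is a sketch under several assumptions: an in-memory SQLite table stands in for the MySQL table mx_domain_wwwinfo, the 2,500 sample rows are invented, the helper `fetch_in_ranges` is hypothetical, and the step of 1024 is the default given in the Sphinx documentation.

```python
import sqlite3


def fetch_in_ranges(conn, table, step):
    """Fetch (id, title) rows one id range at a time instead of in a single query."""
    # Equivalent of sql_query_range: find the id bounds first.
    min_id, max_id = conn.execute(f"SELECT MIN(id), MAX(id) FROM {table}").fetchone()
    if min_id is None:  # empty table, nothing to fetch
        return
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        # Equivalent of sql_query with $start and $end substituted.
        rows = conn.execute(
            f"SELECT id, title FROM {table} WHERE id>=? AND id<=? ORDER BY id",
            (start, end),
        ).fetchall()
        yield (start, end), rows
        start = end + 1


# Stand-in data: 2,500 made-up rows in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mx_domain_wwwinfo (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO mx_domain_wwwinfo (id, title) VALUES (?, ?)",
    [(i, f"title {i}") for i in range(1, 2501)],
)

batches = list(fetch_in_ranges(conn, "mx_domain_wwwinfo", step=1024))
for (start, end), rows in batches:
    print(start, end, len(rows))
```

Only one batch of rows is held in memory at a time, which is the point of the technique.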

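    The index section only needs the Chinese character encoding and the path to the Chinese dictionary configured.

    The indexer and searchd sections need no changes.

    Now for the test.

    • Build the index: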
    root@timeless-HP-Pavilion-g4-Notebook-PC:/usr/local/coreseek/etc# /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.range.conf   --all --rotate
    Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
    Copyright (c) 2007-2011,
    Beijing Choice Software Technologies Inc (http://www.coreseek.com)
    
     using config file '/usr/local/coreseek/etc/csft.range.conf'...
    WARNING: failed to open pid_file '/usr/local/coreseek/var/log/searchd.pid'.
    indexing index 'src'...
    WARNING: Attribute count is 0: switching to none docinfo
    collected 1500837 docs, 92.2 MB
    sorted 17.5 Mhits, 100.0% done
    total 1500837 docs, 92186221 bytes
    total 34.680 sec, 2658122 bytes/sec, 43275.54 docs/sec
    total 16 reads, 0.023 sec, 3631.5 kb/call avg, 1.4 msec/call avg
    total 143 writes, 0.105 sec, 912.1 kb/call avg, 0.7 msec/call avg
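As a quick sanity check on the summary lines above, the throughput can be recomputed from the totals. This sketch only parses the two lines copied from the log; the regular expressions are written for this output format and are not part of any Sphinx tooling.

```python
import re

# The two summary lines copied from the indexer output above.
log = """\
total 1500837 docs, 92186221 bytes
total 34.680 sec, 2658122 bytes/sec, 43275.54 docs/sec
"""

docs, size = map(int, re.search(r"total (\d+) docs, (\d+) bytes", log).groups())
secs = float(re.search(r"total ([\d.]+) sec", log).group(1))

# Recompute the rates; small differences from the log come from the rounded time.
docs_per_sec = docs / secs
bytes_per_sec = size / secs
print(f"{docs_per_sec:.0f} docs/sec, {bytes_per_sec / 1e6:.2f} MB/sec")
```

The recomputed figures agree with the reported ones to within the rounding of the elapsed time, so roughly 1.5 million titles were indexed in about 35 seconds.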
    • Search for a keyword:
    root@timeless-HP-Pavilion-g4-Notebook-PC:/usr/local/coreseek/bin# ./search -c /usr/local/coreseek/etc/csft.range.conf  济南
      Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
      Copyright (c) 2007-2011,
      Beijing Choice Software Technologies Inc (http://www.coreseek.com)
      
       using config file '/usr/local/coreseek/etc/csft.range.conf'...
      index 'src': query '济南 ': returned 1000 matches of 11040 total in 0.005 sec
      
      displaying matches:
      1. document=53592, weight=1664
          id=53592
          domain_id=75937
          title=?????????????,??????,??????,?????????????????????????????????????,??????,??????,????,????,??????,??????,??????,??????,????????,????,??????,??????,??????,??????,????????,????????,?????????,?????????
          addtime=1419001556
      2. document=156494, weight=1663
          id=156494
          domain_id=320070
          title=??--??????,?????,????????,????????,??????,??????,?????,?????,???,?????,?????,?????,?????,????,????,??????,??????,??????,?????,?????,?????,???,???????,
          addtime=1419041933
      3. document=53624, weight=1661
          id=53624
          domain_id=74960
          title=???????-???.??.???.???????/?????/?????/?????????/?????/?????/???? ????? ????? ????? ????? ????? ????? ??POS??? ????? ????? ????? ??POS?
          addtime=1419001559
      4. document=908267, weight=1661
          id=908267
          domain_id=3482035
          title=???????-???.??.???.???????/?????/?????/?????????/?????/?????/???? ????? ????? ????? ????? ????? ????? ??POS??? ????? ????? ????? ??POS?
          addtime=1421983846
      5. document=1074259, weight=1659
          id=1074259
          domain_id=2805964
          title=?????? - ???? | ????? | ????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ?????? | ??????
          addtime=1421998317
      6. document=628662, weight=1658
          id=628662
          domain_id=1603934
          title=????????|????????|?????????|?????????|??????|?????|????????????????|????????|?????????|?????????|??????|?????|????????
          addtime=1420628500
      7. document=82498, weight=1656
          id=82498
          domain_id=75205
          title=???????????????????????????????????????????????????????????????????????????????????????????????????????
          addtime=1419030813
      8. document=373234, weight=1656
          id=373234
          domain_id=75953
          title=????|??????|????????|??????|??????|??????|??????|??????|??????|??????|??????-????????
          addtime=1419481313
      9. document=97657, weight=1655
          id=97657
          domain_id=75152
          title=????????????????????????????????????????????????????????????????????
          addtime=1419032238
      10. document=108426, weight=1655
          id=108426
          domain_id=76651
          title=??????|??????|??SKF??|??NSK??|??FAG??|??NTN??|??KOYO??|??TIMKEN??|??FAG??|????|????????|????????|??????|-??
          addtime=1419033228
      11. document=184337, weight=1655
          id=184337
          domain_id=75654
          title=???????|??????????|???????|??????????|??????|???????|?????????|????????|?????????|???????|??????????
          addtime=1419043496
      12. document=246303, weight=1655
          id=246303
          domain_id=262037
          title=???? ?????? ?????? ?????? ????? ?????? ???? ???? ?????? ???? ??????
          addtime=1419046975
      13. document=261372, weight=1655
          id=261372
          domain_id=544595
          title=??????|????|?????|?????????|???????????|??????|??????|????|?????|?????|?????|?????
          addtime=1419215630
      14. document=1163692, weight=1655
          id=1163692
          domain_id=2514244
          title=??????????????????????????????????????????????????????????????????????????????_??????????????
          addtime=1422005290
      15. document=1163740, weight=1655
          id=1163740
          domain_id=2514240
          title=?????????????????????????????????????????????????????????????????????????_??????????????
          addtime=1422005293
      16. document=1163762, weight=1655
          id=1163762
          domain_id=2514239
          title=????????????????????????????????????????????????????????????????????????????_??????????????
          addtime=1422005295
      17. document=10694, weight=1653
          id=10694
          domain_id=454049
          title=??????|??????|??????|??????|??????|?????|??????|????????|???????????|????????????|???400-070-3005
          addtime=1418996572
      18. document=15876, weight=1653
          id=15876
          domain_id=66098
          title=????????? ???????????? ??????? ?????????? ??????? ??????? ????????? ??????? ??????? ???????_?????????
          addtime=1418997101
      19. document=23385, weight=1653
          id=23385
          domain_id=421622
          title=????0531-82825553|??????|??????????????|????????|???????|??????|????T1|????T3|????T6|????U8
          addtime=1418997836
      20. document=34628, weight=1653
          id=34628
          domain_id=320077
          title=????|?????|?????|?????|?????|?????|?????|?????|????|??????????
          addtime=1418998927
      
      words:
      1. '济南': 11040 documents, 22214 hits

    The words summary shows 1. '济南': 11040 documents, 22214 hits. The question marks in the titles above are only a display encoding issue.
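The pattern of question marks is what you get when Chinese text passes through a single-byte character set somewhere between the database and the terminal. The sketch below assumes latin-1 is the charset involved, which the output does not confirm; the function name and the sample string are made up for illustration.

```python
def as_seen_through_latin1(text):
    """Encode to latin-1, replacing every character that charset cannot represent with '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


# ASCII characters survive; each Chinese character becomes a single '?'.
print(as_seen_through_latin1("济南POS机"))
```

This matches fragments such as ??POS? in the titles above: the indexed data and the match counts are intact, only the displayed text is degraded.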

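    One more question remains: what if incremental indexing and ranged queries need to be used together? The next article works through that with Coreseek, based on a document found on Baidu Wenku, 《千万级Discuz!数据全文检索方案(Sphinx)》 (a Sphinx full-text search solution for Discuz! data at the ten-million-row scale).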

  • Original article: https://www.cnblogs.com/timelesszhuang/p/4771106.html