Applying complex filter conditions in Sphinx

I. The problem

In Sphinx applications we often need to filter data with complex conditions in order to select the records we want. This step is the equivalent of the WHERE clause in a MySQL query.

However, Sphinx's built-in filters cannot express complex logical combinations: individual filters can only be ANDed together, and even OR is not supported.

Previously we worked around this at indexing time: when pulling records from the source database to build the index, we used the MySQL query itself to flatten the logic as the data was imported.
This approach causes many problems:
1) the Sphinx configuration file becomes complex and inflexible; whenever the business rules change, the backend configuration must be edited and the index rebuilt;
2) business presentation and data cannot be separated, so frontend and backend are tightly coupled;
3) the SQL that fetches data from the source DB grows complicated, is hard to maintain and adjust, and lengthens index builds;
4) the Sphinx developers and operators must understand the business in detail to implement anything, which hinders division of labor;
5) when the logic is too complex to be flattened at import time, the requirement simply cannot be implemented this way.
     
     
II. A new solution
The new solution uses the SetSelect() and SetFilter() APIs to perform logical arithmetic on Sphinx attributes.
The examples below illustrate the approach.
     
1. Prepare the test environment and data

Create the table and insert data:
CREATE TABLE `t_test_filter` (
  `Fcommodity_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  `FUserId` int(10) unsigned NOT NULL DEFAULT '0',
  `FUserName` varchar(32) NOT NULL DEFAULT '',
  `Fshopid` bigint(20) unsigned NOT NULL DEFAULT '0',
  `Ftitle` varchar(60) NOT NULL DEFAULT '',
  `FProperty` int(10) unsigned NOT NULL DEFAULT '0',
  `FTime` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`Fcommodity_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
     
     
Create the sphinxSE table:
CREATE TABLE `t_test_filter_sphinxSE` (
  `id` bigint(20) unsigned NOT NULL,
  `weight` int(11) DEFAULT '1',
  `query` varchar(3072) NOT NULL,
  `total_found` int(11) DEFAULT '0',
  `total_match` int(11) DEFAULT '0',
  `query_time` int(11) DEFAULT '0',
  `FUserId` int(10) unsigned DEFAULT '0',
  `Fshopid` bigint(20) unsigned NOT NULL DEFAULT '0',
  `FProperty` int(10) unsigned NOT NULL DEFAULT '0',
  `FTime` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `query` (`query`(1536))
) ENGINE=SPHINX DEFAULT CHARSET=gbk CONNECTION='sphinx://localhost:9412';
     
Insert test data:
    insert into t_test_filter values(4567,123,"test qq nick", 1234567,"title name",2324354,9876);
    insert into t_test_filter values(2,34,"nick name", 4545,"title name",67,40);
    insert into t_test_filter values(66,55,"test3 qq3 nick3", 22,"title3 name3 ",88,77);
    insert into t_test_filter values(767,234,"test3 qq nick", 3433,"title3 name3 ",342342,89340);
    insert into t_test_filter values(4545,123,"test qq nick", 1234567,"title name",2324354,9876);
    insert into t_test_filter values(4546,778,"test qq nick", 1234567,"title name",2324354,9876);
     
2. Usage examples

Here are two examples, using the sphinxSE interface, that illustrate the method:

(1) The query word is nick, with the condition FUserId=34 || Fshopid=1234567.
Test the query through sphinxSE:
select * from t_test_filter_sphinxSE where query="nick;select=IF(FUserId=34,1,0) + IF(Fshopid=1234567,10,0) as match_qq; filter=match_qq,1,10,11";
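The arithmetic behind this query can be checked offline. The following standalone Python sketch (no Sphinx involved; the rows mirror the test data above) reproduces the select/filter computation for condition (1):

```python
# Reproduce the select/filter arithmetic for condition (1):
# match_qq = IF(FUserId=34,1,0) + IF(Fshopid=1234567,10,0), and a row is
# kept when match_qq is in {1, 10, 11} -- i.e. FUserId=34 OR Fshopid=1234567.
rows = [  # (Fcommodity_id, FUserId, Fshopid) from the test data above
    (4567, 123, 1234567),
    (2,     34,    4545),
    (66,    55,      22),
    (767,  234,    3433),
    (4545, 123, 1234567),
    (4546, 778, 1234567),
]

def match_qq(fuserid, fshopid):
    # Each condition contributes to a different decimal digit, so the sum
    # uniquely records which of the conditions held.
    return (1 if fuserid == 34 else 0) + (10 if fshopid == 1234567 else 0)

accepted = {1, 10, 11}  # every digit pattern where at least one condition holds
hits = [cid for cid, uid, shop in rows if match_qq(uid, shop) in accepted]
print(hits)  # → [4567, 2, 4545, 4546]
```

Note that the real query additionally requires the full-text match on nick; the sketch only covers the attribute-filtering part.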
     

(2) The query word is nick, with the condition FUserId=34 || (Fshopid=1234567 && FUserId=778).
select * from t_test_filter_sphinxSE where query="nick;select=IF(FUserId=34,1,0) + IF(Fshopid=1234567,10,0) + IF(FUserId=778,100,0) as match_qq; filter=match_qq,1,11,110,111";
     
To keep the set of accepted filter values from exploding as conditions are combined, a logical expression such as the following can simplify things:

select * from t_test_filter_sphinxSE where query="nick;select=IF(FUserId=34,1,0) OR (IF(Fshopid=1234567,1,0) AND IF(FUserId=778,1,0)) as match_qq; filter=match_qq,1";
     

These two statements produce the same result.
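That the two encodings agree can be verified exhaustively with a few lines of standalone Python (attribute names mirror the table columns; no Sphinx involved). Since FUserId cannot equal 34 and 778 at the same time, the reachable sums in the first encoding are 0, 1, 10, 11, 100 and 110, and the condition FUserId=34 || (Fshopid=1234567 && FUserId=778) holds for exactly 1, 11 and 110:

```python
# Encoding (a): per-condition weights 1/10/100, filter on the accepted sums.
# Encoding (b): boolean expression inside select, filter on match_qq = 1.
ACCEPTED = {1, 11, 110, 111}   # 111 is unreachable, so listing it is harmless

def enc_a(uid, shop):
    s = ((1 if uid == 34 else 0)
         + (10 if shop == 1234567 else 0)
         + (100 if uid == 778 else 0))
    return s in ACCEPTED

def enc_b(uid, shop):
    return uid == 34 or (shop == 1234567 and uid == 778)

# Exhaustive check over representative attribute values (matching / non-matching).
for uid in (34, 778, 999):
    for shop in (1234567, 4545):
        assert enc_a(uid, shop) == enc_b(uid, shop)
print("encodings agree")
```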

     
Other complex queries can be composed following the same principle.

     

     
The principle:

1. The IF() function takes three arguments: it checks whether the first argument is non-zero (compared against 0.0), and returns the second argument if it is non-zero, or the third if it is zero.
2. The syntax above lives mainly in the query string:
(1) The first token, nick, is the keyword to search for.
(2) The select clause is the equivalent of SetSelect() in the API. It computes over the attributes declared with sql_attr_xxxx in the configuration file, and stores the result under an alias (match_qq here). The computation supports the operators +, -, *, /, <, >, <=, >=, =, <> and so on.
(3) The filter clause is the equivalent of SetFilter() in the API. It filters results by the value computed in the select clause. Different accepted-value combinations express different conditions, which makes very complex query filters possible.

Other interfaces work the same way; for example, the C API and the PHP API can achieve this with the SetSelect() and SetFilter() calls.
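When many such conditions are needed, the query string can be generated mechanically. The helper below is a hypothetical sketch (build_query and its parameters are my own names, not part of Sphinx or sphinxSE): it assigns each equality test a distinct power-of-ten weight and enumerates the accepted sums from a boolean predicate.

```python
from itertools import product

def build_query(keyword, conds, predicate):
    """Build a sphinxSE query string for an arbitrary boolean combination.

    conds:     list of (attribute, value) equality tests
    predicate: function taking one bool per condition, returning True
               when that combination of outcomes should pass the filter
    """
    # Give condition i the weight 10**i so the sum records which tests held.
    terms = ["IF(%s=%s,%d,0)" % (attr, val, 10 ** i)
             for i, (attr, val) in enumerate(conds)]
    # Enumerate every outcome combination and keep the sums the predicate accepts.
    accepted = [sum(10 ** i for i, bit in enumerate(bits) if bit)
                for bits in product([False, True], repeat=len(conds))
                if predicate(*bits)]
    return "%s;select=%s as match_qq; filter=match_qq,%s" % (
        keyword, " + ".join(terms), ",".join(str(v) for v in sorted(accepted)))

# Condition (1): FUserId=34 OR Fshopid=1234567
q = build_query("nick",
                [("FUserId", 34), ("Fshopid", 1234567)],
                lambda a, b: a or b)
print(q)
# nick;select=IF(FUserId=34,1,0) + IF(Fshopid=1234567,10,0) as match_qq; filter=match_qq,1,10,11
```

Enumerating outcomes this way is exponential in the number of conditions, so for more than a handful of tests the logical-expression form shown earlier is preferable.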
     
3. Advantages and limitations
1. SetSelect() supports fairly complex arithmetic and some logical combinations, so all kinds of logic can be implemented flexibly;
2. Business developers and Sphinx operators can work independently of each other;
3. Version requirements:
(1) The sphinxSE interface requires coreseek 4.0.1 or sphinx 0.9.9-rc1 or later.
(2) The C API and PHP API should already work from coreseek 3.2.14, which provides the SetSelect() and SetFilter() calls, but I have not tested this myself.
     
     
     
III. Appendix
The Sphinx configuration file is attached here for easier analysis and testing.
     
    # file: t_test_filter.conf
     
    source t_test_filter
    {
            type                                    = mysql
     
            sql_host                                = localhost
            sql_user                                = xxxx
            sql_pass                                = xxx
            sql_db                                  = test
            sql_port                                = 3306
     
            sql_query    =  SELECT Fcommodity_id, FUserId, FUserName,Fshopid, Ftitle, FProperty,FTime from t_test_filter
     
            sql_attr_uint                   = FUserId
            sql_attr_bigint                   = Fshopid
            sql_attr_uint                   = FProperty
            sql_attr_uint                   = FTime
    }
     

    index t_test_filter
    {
            source                  = t_test_filter
            path                    = /usr/local/coreseek/var/data/t_test_filter
            docinfo                 = extern
            mlock                   = 0
            morphology              = none
            exceptions              = /usr/local/coreseek/var/exceptions.txt
            min_word_len            = 1
            charset_type = zh_cn.gbk
            charset_dictpath = /usr/local/coreseek/dict
            ngram_len                               = 1
            ngram_chars                     = U+3000..U+2FA1F
            html_strip                              = 0
            blend_mode              = trim_none
    }
     
     
     

    #############################################################################
    ## indexer settings
    #############################################################################
     
    indexer
    {
        # memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
            # optional, default is 32M, max is 2047M, recommended is 256M to 1024M
            mem_limit                       = 40M
     
            # maximum IO calls per second (for I/O throttling)
            # optional, default is 0 (unlimited)
            #
            # max_iops                      = 40
     

            # maximum IO call size, bytes (for I/O throttling)
            # optional, default is 0 (unlimited)
            #
            # max_iosize            = 1048576
     

            # maximum xmlpipe2 field length, bytes
            # optional, default is 2M
            #
            # max_xmlpipe2_field    = 4M
     

            # write buffer size, bytes
            # several (currently up to 4) buffers will be allocated
            # write buffers are allocated in addition to mem_limit
            # optional, default is 1M
            #
            write_buffer            = 1M
    }
     
    #############################################################################
    ## searchd settings
    #############################################################################
     
    searchd
    {
            # hostname, port, or hostname:port, or /unix/socket/path to listen on
            # multi-value, multiple listen points are allowed
            # optional, default is 0.0.0.0:9312 (listen on all interfaces, port 9312)
            #
            # listen                                = 127.0.0.1
            # listen                                = 192.168.0.1:9312
            listen                          = 9412
            # listen                                = /var/run/searchd.sock
            listen = localhost:9406:mysql41
     

            binlog_path = /usr/local/coreseek/var/data
     
            # log file, searchd run info is logged here
            # optional, default is 'searchd.log'
            log                                     = /usr/local/coreseek/var/log/searchd_9412.log
     
            # query log file, all search queries are logged here
            # optional, default is empty (do not log queries)
            query_log                       = /usr/local/coreseek/var/log/query_9412.log
     
            # client read timeout, seconds
            # optional, default is 5
            read_timeout            = 5
     
            # request timeout, seconds
            # optional, default is 5 minutes
            client_timeout          = 300
     
            # maximum amount of children to fork (concurrent searches to run)
            # optional, default is 0 (unlimited)
            max_children            = 30
     
            # PID file, searchd process ID file name
            # mandatory
            pid_file                        = /usr/local/coreseek/var/log/searchd_9412.pid
     
            # max amount of matches the daemon ever keeps in RAM, per-index
            # WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
            # default is 1000 (just like Google)
            max_matches                     = 1000000
     
            # seamless rotate, prevents rotate stalls if precaching huge datasets
            # optional, default is 1
            seamless_rotate         = 1
     
            # whether to forcibly preopen all indexes on startup
            # optional, default is 0 (do not preopen)
            preopen_indexes         = 0
     
        # whether to unlink .old index copies on successful rotation.
            # optional, default is 1 (do unlink)
            unlink_old                      = 1
     
            # attribute updates periodic flush timeout, seconds
            # updates will be automatically dumped to disk this frequently
            # optional, default is 0 (disable periodic flush)
            #
            # attr_flush_period     = 900
     

            # instance-wide ondisk_dict defaults (per-index value take precedence)
            # optional, default is 0 (precache all dictionaries in RAM)
            #
            # ondisk_dict_default   = 1
     

            # MVA updates pool size
            # shared between all instances of searchd, disables attr flushes!
            # optional, default size is 1M
            mva_updates_pool        = 1M
     
            # max allowed network packet size
            # limits both query packets from clients, and responses from agents
            # optional, default size is 8M
            max_packet_size         = 8M
     
            # crash log path
            # searchd will (try to) log crashed query to 'crash_log_path.PID' file
            # optional, default is empty (do not create crash logs)
            #
            # crash_log_path                = /usr/local/coreseek/var/log/crash
     

            # max allowed per-query filter count
            # optional, default is 256
            max_filters                     = 256
     
            # max allowed per-filter values count
            # optional, default is 4096
            max_filter_values       = 4096
     

            # socket listen queue length
            # optional, default is 5
            #
            # listen_backlog                = 5
     

            # per-keyword read buffer size
            # optional, default is 256K
            #
            # read_buffer                   = 256K
     

            # unhinted read size (currently used when reading hits)
            # optional, default is 32K
            #
            # read_unhinted         = 32K
    }
     
    # --eof--
Original article: https://www.cnblogs.com/littlehb/p/3631000.html