solr

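Solr is a standalone enterprise search server that exposes an API similar to a web service.
Solr is a high-performance full-text search server written in Java 5 and built on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and ships with a full-featured administration interface, which makes it an excellent full-text search engine.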

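Advantages of Solr:
(1) Solr can index almost any object, even one that is not an ActiveRecord model, whereas Sphinx is too tightly coupled to the RDBMS.
(2) The ID of an object indexed by Solr only has to be non-empty and may be a string, whereas Sphinx requires every indexed object to have a non-zero integer ID.
(3) Solr supports Boolean values as search conditions, which is more convenient. Solr also supports facets, which take more work to achieve with Sphinx.
(4) Solr is a wrapper around Lucene, so it benefits from every Lucene upgrade.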

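1. Basic commands:
Rails 3 installation:
(1) Add gem 'sunspot_rails', '~> 1.2.1' to the Gemfile
(2) Run bundle install
(3) Run rails g sunspot_rails:install to generate the configuration file (config/sunspot.yml)

Starting the Solr server:
rake sunspot:solr:start (runs as a background process), or rake sunspot:solr:run (runs in the foreground)
Stopping Solr: rake sunspot:solr:stop
Building the index: rake sunspot:reindex
Reindexing a single model: rake sunspot:reindex[1000,Company] (rake task arguments are separated by commas: batch size first, then the model name)
Example: nohup rake sunspot:reindex[1000,Company] --trace RAILS_ENV=production &
Checking the Solr service: http://railsserver:{sunspot.yml[RAILS_ENV][port]}/solr (for example http://127.0.0.1:8983/solr)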
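
As a rough illustration of how the service URL above is derived from config/sunspot.yml, here is a minimal sketch in plain Ruby. The environment -> solr -> hostname/port nesting of the sample file is an assumption about what the generator produces, and solr_admin_url is a hypothetical helper, not part of Sunspot.

```ruby
require "yaml"

# Sample config/sunspot.yml content. The nesting (environment -> solr ->
# hostname/port) is an assumption, not taken from the original article.
SAMPLE_SUNSPOT_YML = <<~YAML
  production:
    solr:
      hostname: 127.0.0.1
      port: 8983
  development:
    solr:
      hostname: localhost
      port: 8982
YAML

# Hypothetical helper: builds the Solr URL for one Rails environment.
def solr_admin_url(yaml_text, rails_env)
  solr = YAML.safe_load(yaml_text).fetch(rails_env).fetch("solr")
  "http://#{solr.fetch('hostname')}:#{solr.fetch('port')}/solr"
end

solr_admin_url(SAMPLE_SUNSPOT_YML, "production")
# => "http://127.0.0.1:8983/solr"
```

Using fetch instead of [] makes a missing environment or key fail loudly rather than producing a URL with an empty host or port.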

2. Configuration:
schema.xml configuration:
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
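The configuration above defines one analyzer for indexing (index) and one for querying (query). Each processes text in five stages. Solr runs char filters first, then the tokenizer, then the token filters in the order they are declared, regardless of where the <tokenizer> element appears inside <analyzer>:
(1) <charFilter class="solr.HTMLStripCharFilterFactory"/> strips HTML markup;
(2) <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/> performs IK word segmentation; isMaxWordLength controls whether maximum-word-length segmentation is used;
(3) <filter class="solr.ASCIIFoldingFilterFactory"/> converts accented and other non-ASCII Latin characters to their ASCII equivalents;
(4) <filter class="solr.LowerCaseFilterFactory"/> converts tokens to lower case;
(5) <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/> applies the synonym dictionary;
    synonyms.txt format: pvc => 聚氯乙烯
For more filters and analysis options, see: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters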
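
To make the order of the stages concrete, the following is a toy model of the chain in plain Ruby. It is not Solr's or IK's implementation: a whitespace split stands in for IK segmentation, a regular expression stands in for the HTML char filter, and the SYNONYMS hash and analyze method are hypothetical names used only for this sketch.

```ruby
# Toy model of the analysis chain; NOT how Solr or IK are implemented.
# Mirrors the "pvc => 聚氯乙烯" rule from synonyms.txt.
SYNONYMS = { "pvc" => "聚氯乙烯" }.freeze

# Stage 1 (char filter): replace HTML tags with spaces.
def strip_html(text)
  text.gsub(/<[^>]*>/, " ")
end

# Stage 2 (tokenizer): whitespace split as a stand-in for IK segmentation.
def tokenize(text)
  text.split(/\s+/).reject(&:empty?)
end

# Stage 3 (token filter): decompose accented characters, drop the accents.
def fold_ascii(token)
  token.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Stages 4 and 5 (token filters): lower-case, then replace by synonym.
def analyze(text)
  tokenize(strip_html(text))
    .map { |token| fold_ascii(token) }
    .map(&:downcase)
    .map { |token| SYNONYMS.fetch(token, token) }
end

analyze("<b>PVC</b> Caf\u00e9")
# => ["聚氯乙烯", "cafe"]
```

Because the same chain is configured for both index and query, a search for "pvc" and a document containing "PVC" end up as the same token.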

IKAnalyzer.cfg.xml configuration:
File path (on the test machine): /opt/ruby-enterprise-1.8.7/lib/ruby/gems/1.8/gems/sunspot-1.2.1/solr/webapps/solr/WEB-INF/classes
Documentation: http://ik-analyzer.googlecode.com/files/IKAnalyzer%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D%E5%99%A8V3.2.8%E4%BD%BF%E7%94%A8%E6%89%8B%E5%86%8C.pdf
http://linliangyi2007.iteye.com/blog/501228
Custom dictionaries:
<properties>
  <entry key="ext_dict">/mydict.dic;/com/mycompany/dic/mydict2.dic;</entry>
</properties>
Note: mydict.dic is in the same directory as IKAnalyzer.cfg.xml
Stop-word configuration:
<properties>
  <entry key="ext_stopwords">/ext_stopword.dic</entry>
</properties>
Note: ext_stopword.dic is in the same directory as IKAnalyzer.cfg.xml

3. Indexing:
Index definition in the model (using Price as the example):
searchable(:auto_index => true, :auto_remove => true) do
  text :full_name, :stored => true
  string :company_name, :stored => true
  string :brand_name, :stored => true do
    self.try(:product).try(:brand).try(:name)
  end
  integer :brand_id, :references => Brand, :stored => true
  integer :company_id, :references => Company, :stored => true
  integer :attr_value_ids, :references => BaseMaterialTypeAttrValue, :multiple => true, :stored => true
  time :updated_at, :trie => true, :stored => true
end

Here full_name, company_name, and so on are custom methods on the model that return the corresponding values. Inside custom index methods, use try for every step of a call chain, as brand_name does; otherwise a nil intermediate object raises an error during indexing.
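
The nil-safety point can be sketched without Rails. The example below uses Ruby's safe navigation operator (&.), which behaves like the chained try calls above when an intermediate object is nil. The Sketch* structs are hypothetical stand-ins for the real models.

```ruby
# Hypothetical stand-ins for the ActiveRecord models; not the real classes.
SketchBrand   = Struct.new(:name)
SketchProduct = Struct.new(:brand)
SketchPrice   = Struct.new(:product) do
  # Unsafe: raises NoMethodError when product or brand is nil.
  def brand_name_unsafe
    product.brand.name
  end

  # Nil-safe: returns nil as soon as any link in the chain is nil.
  def brand_name
    product&.brand&.name
  end
end

complete = SketchPrice.new(SketchProduct.new(SketchBrand.new("Acme")))
orphan   = SketchPrice.new(nil)

complete.brand_name # => "Acme"
orphan.brand_name   # => nil, so a record like this no longer aborts a reindex
```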

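Index field types:
(1) text: supports partial matching; the value is tokenized at index time
(2) string: exact match; the value is indexed as a whole
(3) integer, long: integers
(4) float, double: floating-point numbers
(5) time: date/time, corresponding to Ruby's Time class
(6) boolean: true or false
Field options:
(1) multiple: the field holds multiple values
(2) references: the class the value refers to
(3) stored: whether the value is stored in the index
(4) trie: Boolean, for numeric and time fields only; makes range searches faster.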

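Indexing in real time:
(1) Sunspot.index(objects)
     Sunspot.commit
(2) Sunspot.index!(objects), which indexes and commits in one call
(3) Remove from the index with Sunspot.remove(*objects), then call Sunspot.commit
(4) Remove from the index with Sunspot.remove!(*objects), which removes and commits in one call
(5) Declaring the index with searchable(:auto_index => true, :auto_remove => true) do also keeps the index updated in real time;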

4. Searching:
There are two ways to write a search, for example:
(1) @results = Sunspot.new_search(Price)
@results.build do
  keywords(params[:search][:value],
           :query_phrase_slop => 1000, :phrase_slop => 1000, :highlight => true,
           :exclude_fields => [:search_key, :notes],
           :boost_fields => [:full_name => 5, :description => 4, :base_material_type_description => 3])
end
@results.build { with(:status).equal_to("released") }
@results.build { without(:product_id).equal_to(nil) }
@results.build { order_by(:score, :desc) }
@results.build { order_by(:updated_at, :desc) }
@results.build { facet(:brand_id) }
@results.build { facet(:attr_value_ids) }
@results.execute!
(2) Sunspot.search(Price) do
  keywords(params[:search][:value], :fields => [:full_name, :description], :highlight => true, :minimum_match => 0)
  with(:updated_at).greater_than(30.days.ago.to_time)
  paginate(:page => params[:page] || 1, :per_page => session[:per_page])
  facet(:attr_value_ids)
end

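Restriction methods:
(1) with
(2) without
(3) order_by
(4) facet: faceted search
(5) paginate: takes the will_paginate pagination parameters used in Rails
Restriction conditions:
(1) equal_to
(2) less_than
(3) greater_than
(4) between
(5) any_of: matches any one of the values; takes an array, for example with(:attr_value_ids).any_of(params[:search][:attr_value_ids])
(6) all_of: matches all of the values; takes an array
    any_of and all_of also express OR and AND logic respectively as blocks, and can be nested inside each other, for example:
    any_of do
      with(:expired_at).greater_than(Time.now)
      with(:expired_at, nil)
      all_of do
        with(:published_at).less_than(Time.now)
        with(:author_id).equal_to(999)
      end
    end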
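
The nested block above reads as "not yet expired, OR no expiry date, OR (already published AND written by author 999)". The following plain-Ruby predicate spells out that logic over in-memory objects. It only illustrates the boolean structure and does not show how Sunspot builds the Solr query; SketchDoc and matches? are hypothetical names.

```ruby
# Hypothetical in-memory document; field names follow the example above.
SketchDoc = Struct.new(:expired_at, :published_at, :author_id)

# any_of  -> the outer || chain
# all_of  -> the inner && group
def matches?(doc, now)
  not_expired = !doc.expired_at.nil? && doc.expired_at > now
  no_expiry   = doc.expired_at.nil?
  published_by_999 = !doc.published_at.nil? &&
                     doc.published_at < now &&
                     doc.author_id == 999

  not_expired || no_expiry || published_by_999
end

now = Time.at(1_000)
matches?(SketchDoc.new(Time.at(2_000), nil, 1), now)            # => true  (not expired)
matches?(SketchDoc.new(Time.at(500), Time.at(500), 999), now)   # => true  (inner all_of)
matches?(SketchDoc.new(Time.at(500), Time.at(500), 1), now)     # => false
```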

For the other options accepted by keywords, see: http://outoftime.github.com/sunspot/docs/classes/Sunspot/DSL/Fulltext.html

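5. Displaying results in the view:
Working with the search results:
    (1) @search.hits: the hits, with data read directly from the index documents
    (2) @search.results: the result records, loaded back from the database
    (3) @search.each_hit_with_result do |hit, result|: iterates over each hit together with its database record
    (4) hit.result: loads the object for a hit from the database
    (5) hit.score: the relevance score; searches can be ordered by score
    (6) hit.stored(:name): reads a stored field value directly from the index document
    (7) Integration with will_paginate: will_paginate(@search.hits)
    (8) Highlighting keywords:
        In the search, add :highlight => true to the keywords call
        In the view: hit.highlight(:body).format { |fragment| content_tag(:em, fragment) }
        Problem encountered: when the body field does not contain the search keyword, hit.highlight(:body) is nil and calling format on it raises an error. Check the result of hit.highlight(:body) first, and fall back to hit.stored(:body) when it is nil.
    (9) Faceted search, used as follows:
        @search.facet(:attr_value_ids).rows.each do |row|
          row.value   # the facet value
          row.count   # the number of matching records
        end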
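
The highlight fallback described in item (8) can be sketched as a small helper. The FakeHit and FakeHighlight structs below are hypothetical stand-ins that only mimic the highlight, stored, and format calls used above, so the example runs without Solr; display_text is likewise an illustrative name, and in a Rails view content_tag(:em, fragment) would take the place of the string interpolation.

```ruby
# Hypothetical stand-in for a highlight object: fragments is a list of
# [text, matched?] pairs, and format wraps only the matched pieces.
FakeHighlight = Struct.new(:fragments) do
  def format
    fragments.map { |text, matched| matched ? yield(text) : text }.join
  end
end

# Hypothetical stand-in for a search hit.
FakeHit = Struct.new(:highlights, :stored_values) do
  def highlight(field)
    highlights[field] # nil when the field has no highlighted fragment
  end

  def stored(field)
    stored_values[field]
  end
end

# Use the highlighted text when there is one, else the stored value.
def display_text(hit, field)
  highlight = hit.highlight(field)
  return hit.stored(field).to_s if highlight.nil?

  highlight.format { |fragment| "<em>#{fragment}</em>" }
end

with_match = FakeHit.new(
  { :body => FakeHighlight.new([["cheap ", false], ["pvc", true], [" pipe", false]]) },
  { :body => "cheap pvc pipe" }
)
without_match = FakeHit.new({}, { :body => "steel pipe" })

display_text(with_match, :body)    # => "cheap <em>pvc</em> pipe"
display_text(without_match, :body) # => "steel pipe"
```

The fallback relies on the field being declared with :stored => true; otherwise there is no stored value to show.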

Solr filter documentation:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Sunspot:
http://outoftime.github.com/sunspot/docs/classes/Sunspot/DSL/Fulltext.html
http://outoftime.github.com/sunspot/docs/classes/Sunspot/Search/AbstractSearch.html

Original article: https://www.cnblogs.com/qinyan20/p/3781371.html