  • Solr + jieba (结巴) Chinese word segmentation

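    Why jieba

    • It segments text efficiently.
    • Our corpus was built with jieba (Python), so using jieba in Solr keeps tokenization consistent with the corpus.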

    The Java version of jieba

    • Download
    git clone https://github.com/huaban/jieba-analysis
    
    • Build
    cd jieba-analysis
    mvn install
    
    • Note
    If you are using a newer version of Maven, you need to edit pom.xml and add the following before plugins:
    

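    The Solr tokenizer

    It supports Solr 6, Solr 7, and later.

    If, like me, you run a fairly recent Solr, the code needs a few small changes. Fixing whatever the compiler reports as an error is enough.

    Diff for build.gradle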

    diff --git a/build.gradle b/build.gradle
    index 2a87525..06c5cc3 100644
    --- a/build.gradle
    +++ b/build.gradle
    @@ -1,4 +1,4 @@
    -group = 'analyzer.solr5'
    +group = 'analyzer.solr7'
    version = '1.0'
    apply plugin: 'java'
    apply plugin: "eclipse"
    @@ -14,15 +14,14 @@ repositories {
    dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
    
    - compile("org.apache.lucene:lucene-core:5.0.0")
    - compile("org.apache.lucene:lucene-queryparser:5.0.0")
    - compile("org.apache.lucene:lucene-analyzers-common:5.0.0")
    - compile('com.huaban:jieba-analysis:1.0.0')
    -// compile("org.fnlp:fnlp-core:2.0-SNAPSHOT")
    + compile("org.apache.lucene:lucene-core:7.1.0")
    + compile("org.apache.lucene:lucene-queryparser:7.1.0")
    + compile("org.apache.lucene:lucene-analyzers-common:7.1.0")
    + compile files('libs/jieba-analysis-1.0.3.jar')
    compile("edu.stanford.nlp:stanford-corenlp:3.5.1")
    }
    
    task "create-dirs" << {
    sourceSets*.java.srcDirs*.each { it.mkdirs() }
    sourceSets*.resources.srcDirs*.each { it.mkdirs() }
    -}
    \ No newline at end of file
    +}
    

    Build

    ./gradlew build
    

    Integrating with Solr

    Copy the jar files into this directory under the Solr installation: server/solr-webapp/webapp/WEB-INF/lib
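
    The copy step can be scripted. The following is a minimal sketch in Python, not part of the original setup. The function name install_jars and both of its arguments are hypothetical. It assumes the built jars sit together in one directory (Gradle's java plugin writes to build/libs by default) and that the jieba-analysis jar, if Solr needs it alongside the tokenizer jar, has been placed in that same directory.

```python
import shutil
from pathlib import Path


def install_jars(build_dir, solr_home):
    """Copy every *.jar in build_dir into Solr's webapp lib directory.

    build_dir and solr_home are illustrative parameters: the directory
    holding the built jars and the root of the Solr installation.
    """
    # Relative path taken from the article.
    lib_dir = (Path(solr_home) / "server" / "solr-webapp"
               / "webapp" / "WEB-INF" / "lib")
    if not lib_dir.is_dir():
        raise FileNotFoundError(f"Solr lib directory not found: {lib_dir}")

    copied = []
    for jar in sorted(Path(build_dir).glob("*.jar")):
        # copy2 keeps the file's timestamps along with its contents.
        shutil.copy2(jar, lib_dir / jar.name)
        copied.append(jar.name)
    return copied
```

    Calling install_jars("build/libs", "/path/to/solr") returns the names of the jars it copied. Solr has to be restarted before it picks up the new classes.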
    

    Schema changes

        <fieldType name="text_jieba" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="analyzer.solr7.jieba.JiebaTokenizerFactory"  segMode="SEARCH"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ch.txt" />
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="English"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="analyzer.solr7.jieba.JiebaTokenizerFactory"  segMode="SEARCH"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ch.txt" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="English"/>
          </analyzer>
        </fieldType>
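
    One way to check that the text_jieba field type is wired up is to ask Solr's field analysis handler (/analysis/field) how it tokenizes a sample string. The following is a minimal Python sketch and was not part of the original setup. The base URL and the core name "mycore" are assumptions, and the helper names analysis_url and fetch_analysis are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def analysis_url(base_url, core, text, field_type="text_jieba"):
    """Build a URL for Solr's field analysis handler."""
    params = urlencode({
        "analysis.fieldtype": field_type,  # field type defined in the schema
        "analysis.fieldvalue": text,       # text to run through the index analyzer
        "wt": "json",
    })
    return f"{base_url.rstrip('/')}/{core}/analysis/field?{params}"


def fetch_analysis(base_url, core, text, field_type="text_jieba"):
    """Return the parsed JSON analysis response from a running Solr."""
    with urlopen(analysis_url(base_url, core, text, field_type)) as resp:
        return json.load(resp)


# Example, assuming Solr on its default port and a core named "mycore":
# result = fetch_analysis("http://localhost:8983/solr", "mycore", "结巴分词")
# print(json.dumps(result, ensure_ascii=False, indent=2))
```

    The response lists the tokens after each stage of the analyzer chain, so a missing jar or a wrong tokenizer class name shows up right away as an error instead of a token list.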
    
  • Original article: https://www.cnblogs.com/lotushy/p/8404603.html