  • Solr + jieba Chinese word segmentation

    Why choose jieba

    • Segmentation is fast
    • The word corpus was built with jieba (Python), so using the same segmenter on the Java side keeps the results consistent

    The Java port of jieba

    • Download
    git clone https://github.com/huaban/jieba-analysis
    
    • Build (see the quick sanity check after this list)
    cd jieba-analysis
    mvn install
    
    • Note
    If your mvn version is relatively new, you need to modify pom.xml and add an entry in front of the plugins section
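
    After mvn install succeeds, a quick way to confirm the library works is to segment a sentence directly. The sketch below assumes the usual huaban/jieba-analysis API (JiebaSegmenter, SegMode, SegToken); the class name and sample sentence are only illustrative.

    // JiebaSmokeTest.java -- hypothetical test class, not part of jieba-analysis
    import com.huaban.analysis.jieba.JiebaSegmenter;
    import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
    import com.huaban.analysis.jieba.SegToken;

    public class JiebaSmokeTest {
        public static void main(String[] args) {
            JiebaSegmenter segmenter = new JiebaSegmenter();
            // SEARCH mode matches the segMode="SEARCH" used in the Solr schema below.
            for (SegToken token : segmenter.process("这是一个伸手不见五指的黑夜。", SegMode.SEARCH)) {
                System.out.println(token.word + " [" + token.startOffset + ", " + token.endOffset + "]");
            }
        }
    }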
    

    Solr tokenizer version

    Supports Solr 6, 7, or newer

    If your Solr is fairly new like mine, the code needs a few small changes, but they are minor (just fix whatever errors the compiler reports); a rough sketch of what the tokenizer looks like against the Lucene 7 API follows.
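
    The sketch below is not the project's actual source; it is a minimal outline assuming the standard Lucene 7.x Tokenizer/TokenizerFactory and jieba-analysis APIs, with class names that simply mirror the analyzer.solr7.jieba.JiebaTokenizerFactory referenced in the schema further down.

    // JiebaTokenizer.java -- segments the whole field value with jieba
    package analyzer.solr7.jieba;

    import com.huaban.analysis.jieba.JiebaSegmenter;
    import com.huaban.analysis.jieba.SegToken;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.AttributeFactory;
    import java.io.IOException;
    import java.util.Iterator;

    public final class JiebaTokenizer extends Tokenizer {
        private final JiebaSegmenter segmenter = new JiebaSegmenter();
        private final JiebaSegmenter.SegMode segMode;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
        private Iterator<SegToken> tokens;

        public JiebaTokenizer(AttributeFactory factory, JiebaSegmenter.SegMode segMode) {
            super(factory);
            this.segMode = segMode;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            // Read the whole field value, then segment it up front.
            StringBuilder text = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = input.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
            tokens = segmenter.process(text.toString(), segMode).iterator();
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (tokens == null || !tokens.hasNext()) {
                return false;
            }
            clearAttributes();
            SegToken token = tokens.next();
            termAtt.append(token.word);
            offsetAtt.setOffset(correctOffset(token.startOffset), correctOffset(token.endOffset));
            return true;
        }
    }

    // JiebaTokenizerFactory.java -- reads the segMode attribute set in the schema
    package analyzer.solr7.jieba;

    import com.huaban.analysis.jieba.JiebaSegmenter;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.util.TokenizerFactory;
    import org.apache.lucene.util.AttributeFactory;
    import java.util.Map;

    public class JiebaTokenizerFactory extends TokenizerFactory {
        private final JiebaSegmenter.SegMode segMode;

        public JiebaTokenizerFactory(Map<String, String> args) {
            super(args);
            segMode = JiebaSegmenter.SegMode.valueOf(get(args, "segMode", "SEARCH"));
            if (!args.isEmpty()) {
                throw new IllegalArgumentException("Unknown parameters: " + args);
            }
        }

        @Override
        public Tokenizer create(AttributeFactory factory) {
            return new JiebaTokenizer(factory, segMode);
        }
    }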

    diff of build.gradle

    diff --git a/build.gradle b/build.gradle
    index 2a87525..06c5cc3 100644
    --- a/build.gradle
    +++ b/build.gradle
    @@ -1,4 +1,4 @@
    -group = 'analyzer.solr5'
    +group = 'analyzer.solr7'
    version = '1.0'
    apply plugin: 'java'
    apply plugin: "eclipse"
    @@ -14,15 +14,14 @@ repositories {
    dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
    
    - compile("org.apache.lucene:lucene-core:5.0.0")
    - compile("org.apache.lucene:lucene-queryparser:5.0.0")
    - compile("org.apache.lucene:lucene-analyzers-common:5.0.0")
    - compile('com.huaban:jieba-analysis:1.0.0')
    -// compile("org.fnlp:fnlp-core:2.0-SNAPSHOT")
    + compile("org.apache.lucene:lucene-core:7.1.0")
    + compile("org.apache.lucene:lucene-queryparser:7.1.0")
    + compile("org.apache.lucene:lucene-analyzers-common:7.1.0")
    + compile files('libs/jieba-analysis-1.0.3.jar')
    compile("edu.stanford.nlp:stanford-corenlp:3.5.1")
    }
    
    task "create-dirs" << {
    sourceSets*.java.srcDirs*.each { it.mkdirs() }
    sourceSets*.resources.srcDirs*.each { it.mkdirs() }
    -}
    \ No newline at end of file
    +}
    

    Build

    ./gradlew build
    

    Integrating into Solr

    Copy the jar into Solr's directory: server/solr-webapp/webapp/WEB-INF/lib
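
    With the Gradle java plugin the built jar ends up under build/libs/ by default; the exact file name depends on the project name and version, so the command below is only illustrative (and, depending on how the jar is packaged, the jieba-analysis jar may need to be copied alongside it):

    cp build/libs/*.jar /path/to/solr/server/solr-webapp/webapp/WEB-INF/lib/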
    

    Schema changes

        <fieldType name="text_jieba" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="analyzer.solr7.jieba.JiebaTokenizerFactory"  segMode="SEARCH"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ch.txt" />
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="English"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="analyzer.solr7.jieba.JiebaTokenizerFactory"  segMode="SEARCH"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ch.txt" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="English"/>
          </analyzer>
        </fieldType>
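
    With the fieldType in place, point a field at it; the field name below is just an example. Also make sure lang/stopwords_ch.txt and synonyms.txt actually exist in the core's conf directory, otherwise the core will not load.

        <field name="content" type="text_jieba" indexed="true" stored="true"/>

    After reloading the core (or restarting Solr), the Analysis screen in the admin UI is a quick way to check that Chinese text is being segmented by jieba.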
    
  • Original post: https://www.cnblogs.com/lotushy/p/8404603.html