  • 1.6.7 Detecting Languages During Indexing

    1. Detecting Languages During Indexing

       At index time, Solr can identify languages with the langid UpdateRequestProcessor and then map text to language-specific fields. Solr supports two implementations of this feature:

    1. Tika's language detection feature: http://tika.apache.org/0.10/detection.html
    2. LangDetect language detection: http://code.google.com/p/language-detection/

      A comparison of the two is available at http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, LangDetect supports more languages and offers better performance.

      See http://wiki.apache.org/solr/LanguageDetection for more information about the langid UpdateRequestProcessor.

     1.1 Configuring Language Detection

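      The langid UpdateRequestProcessor is configured in solrconfig.xml. Both implementations take the same parameters. At a minimum, you must specify the fields to run language identification on and the field that receives the resulting language code.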
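
      As a rough sketch of where the processor sits, the fragment below places it inside an update request processor chain in solrconfig.xml. The chain name "langid" and the field names are assumptions chosen for illustration. LogUpdateProcessorFactory and RunUpdateProcessorFactory are the standard processors that usually close a chain.

```xml
<!-- Sketch only: the chain name "langid" and the field names are assumptions. -->
<updateRequestProcessorChain name="langid">
  <!-- Language detection has to run before the document reaches the index. -->
  <processor
      class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,text</str>
      <str name="langid.langField">language_s</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <!-- RunUpdateProcessorFactory performs the actual update, so it comes last. -->
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

      An update request would then select this chain with the update.chain request parameter, for example update.chain=langid.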

     1.2 Configuring Tika Language Detection

      Here is a minimal Tika langid UpdateRequestProcessor configuration in solrconfig.xml.

    <processor
        class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
        <lst name="defaults">
            <str name="langid.fl">title,subject,text,keywords</str>
            <str name="langid.langField">language_s</str>
        </lst>
    </processor>

     1.3 Configuring LangDetect Language Detection

      Here is a minimal LangDetect langid configuration in solrconfig.xml.

    <processor
        class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
        <lst name="defaults">
            <str name="langid.fl">title,subject,text,keywords</str>
            <str name="langid.langField">language_s</str>
        </lst>
    </processor>

     1.4 langid Parameters

      As mentioned above, both implementations of the langid UpdateRequestProcessor take the same parameters:

    | Parameter | Type | Default | Required | Description |
    |---|---|---|---|---|
    | langid | Boolean | true | no | Enables or disables language detection. |
    | langid.fl | string | none | yes | A comma- or space-delimited list of fields to be processed by langid. |
    | langid.langField | string | none | yes | Specifies the field for the returned language code. |
    | langid.langsField | multivalued string | none | no | Specifies the field for a list of returned language codes. If you use langid.map.individual, each detected language is added to this field. |
    | langid.overwrite | Boolean | false | no | Specifies whether the content of the langField and langsField fields is overwritten if they already contain values. |
    | langid.lcmap | string | none | no | A space-separated list of colon-delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common cjk code, and map both American and British English to a single en code, by using langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. This affects both the values put into the langField and langsField fields. |
    | langid.threshold | float | 0.5 | no | Specifies a threshold value between 0 and 1 that the language identification score must reach before langid accepts it. With longer text fields, a high threshold such as 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results. |
    | langid.whitelist | string | none | no | Specifies a list of allowed language identification codes. Use this in combination with langid.map to ensure that you only index documents into fields that are in your schema. |
    | langid.map | Boolean | false | no | Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl. |
    | langid.map.fl | string | none | no | A comma-separated list of fields for langid.map that is different than the fields specified in langid.fl. |
    | langid.map.keepOrig | Boolean | false | no | If true, Solr will copy the field during the field name mapping process, leaving the original field in place. |
    | langid.map.individual | Boolean | false | no | If true, Solr will detect and map languages for each field individually. |
    | langid.map.individual.fl | string | none | no | A comma-separated list of fields for use with langid.map.individual that is different than the fields specified in langid.fl. |
    | langid.fallbackFields | string | none | no | If no language is detected that meets the langid.threshold score, or if the detected language is not on the langid.whitelist, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in langid.fallback. |
    | langid.fallback | string | none | no | Specifies a language code to use if no language is detected or specified in langid.fallbackFields. |
    | langid.map.lcmap | string | determined by langid.lcmap | no | A space-separated list of colon-delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common *_cjk suffix, and map both American and British English fields to a single *_en, by using langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. |
    | langid.map.pattern | Java regular expression | none | no | By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter. |
    | langid.map.replace | Java replace | none | no | By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter. |
    | langid.enforceSchema | Boolean | true | no | If false, the langid processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain. |
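
    Several of the parameters in the table above can be combined, as in the following sketch. It is only an illustration: the field names, the whitelist of language codes, the fallback code, and the threshold value are all assumptions rather than recommended settings.

```xml
<!-- Sketch only: field names, language codes, and values are illustrative assumptions. -->
<processor
    class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <!-- Fields to run detection on, and the field that receives the language code. -->
    <str name="langid.fl">title,text</str>
    <str name="langid.langField">language_s</str>
    <!-- Accept a detection only if its score reaches 0.8 (the default is 0.5). -->
    <str name="langid.threshold">0.8</str>
    <!-- Allow only these codes; use "en" when detection fails or is not allowed. -->
    <str name="langid.whitelist">en,de,fr</str>
    <str name="langid.fallback">en</str>
    <!-- Map title and text to <field>_<language>, for example title_en and text_en. -->
    <bool name="langid.map">true</bool>
    <!-- Keep the original title and text fields alongside the mapped copies. -->
    <bool name="langid.map.keepOrig">true</bool>
  </lst>
</processor>
```

    With langid.enforceSchema left at its default of true, the mapped field names are validated against the schema, so the schema needs to define the language-specific fields that the whitelist can produce.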
  • Original article: https://www.cnblogs.com/a198720/p/4323051.html