  • The fastest Chinese tokenizer for Lucene 3

    The fastest Chinese tokenizer for Lucene 3 - zzzhc's Blog



     

    Jun 19th, 2011

    BaoBao tokenizer (包包分词器) - a fast, dictionary-based Chinese tokenizer

    source code

    features

    • Simple: 1,000 LOC
    • Fast: 7M+ chars/second
    • Supports Chinese, English, and numbers
    • Automatically recognizes out-of-vocabulary words
    • Supports OffsetAttribute
    • Supports TypeAttribute
    • Supports PositionIncrementAttribute
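
To make "dictionary-based" concrete, here is a minimal sketch of one common approach, forward maximum matching. This is an assumption for illustration only: the post does not say which algorithm BaoBao uses, and the class `MaxMatchSketch`, its `Token` type, and the type labels are hypothetical names, not part of the project. Each token carries start/end offsets and a type, loosely mirroring what OffsetAttribute and TypeAttribute expose in Lucene. Characters not covered by the dictionary are simply emitted as single-character "unknown" tokens, which is much cruder than real out-of-vocabulary recognition. The Unicode escapes spell the dictionary words 中文 and 分词器 so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative forward maximum matching; not the BaoBao implementation. */
final class MaxMatchSketch {

    /** A token with its text, character offsets, and a coarse type label. */
    static final class Token {
        final String text;
        final int start;   // inclusive start offset into the input
        final int end;     // exclusive end offset into the input
        final String type; // "word", "unknown", "latin", or "number"

        Token(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + "):" + type;
        }
    }

    private final Set<String> words = new HashSet<>();
    private int maxWordLength = 1;

    MaxMatchSketch(Iterable<String> dictionary) {
        for (String word : dictionary) {
            words.add(word);
            maxWordLength = Math.max(maxWordLength, word.length());
        }
    }

    private static boolean isAsciiAlnum(char c) {
        return c < 128 && Character.isLetterOrDigit(c);
    }

    List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (!Character.isLetterOrDigit(c)) { // skip whitespace and punctuation
                i++;
                continue;
            }
            if (isAsciiAlnum(c)) { // group a run of ASCII letters and digits
                int j = i + 1;
                while (j < text.length() && isAsciiAlnum(text.charAt(j))) {
                    j++;
                }
                String run = text.substring(i, j);
                boolean digitsOnly = run.chars().allMatch(Character::isDigit);
                tokens.add(new Token(run, i, j, digitsOnly ? "number" : "latin"));
                i = j;
                continue;
            }
            // Try the longest dictionary candidate first, then shrink by one.
            int len = Math.min(maxWordLength, text.length() - i);
            while (len > 1 && !words.contains(text.substring(i, i + len))) {
                len--;
            }
            String piece = text.substring(i, i + len);
            tokens.add(new Token(piece, i, i + len,
                    words.contains(piece) ? "word" : "unknown"));
            i += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two-entry toy dictionary; a real dictionary would be loaded from a file.
        MaxMatchSketch segmenter = new MaxMatchSketch(
                Arrays.asList("\u4e2d\u6587", "\u5206\u8bcd\u5668"));
        for (Token token : segmenter.tokenize(
                "lucene 3\u4e2d\u6587\u5206\u8bcd\u5668\u5feb")) {
            System.out.println(token);
        }
    }
}
```

The longest-first loop costs up to `maxWordLength` hash lookups per position; production tokenizers typically replace it with a trie or similar structure, which is one plausible way to reach the throughput quoted above.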

    usage

     
    
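    // Requires java.io.BufferedReader, java.io.FileInputStream, java.io.InputStreamReader.
    Dict dict = new Dict();
    dict.addAllSpecialTypes();
    // Load the dictionary entries from a UTF-8 encoded text file.
    BufferedReader dictReader = new BufferedReader(new InputStreamReader(
        new FileInputStream("dict.txt"), "UTF-8"));
    try {
        dict.load(dictReader);
    } finally {
        // Close the reader even if loading fails.
        dictReader.close();
    }
    dict.optimize();
    DictAnalyzer dictAnalyzer = new DictAnalyzer(dict);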
    

    benchmark

    ant benchmark

     
    
    supported features:
                      CharTerm  Offset  PositionIncrement  Term  Type
          IKAnalyzer         Y       Y                  N     Y     N
       MMSegAnalyzer         Y       Y                  N     Y     Y
     PaodingAnalyzer         Y       Y                  N     Y     Y
    StandardAnalyzer         Y       Y                  Y     Y     Y
      BaoBaoAnalyzer         Y       Y                  Y     Y     Y
    
    test 1, sample length=26265
                name          chars           time         tokens speed(chars/second)
     PaodingAnalyzer          26265          0.610          12542            43036.87
       MMSegAnalyzer          26265          0.314          14007            83566.52
          IKAnalyzer          26265          0.262          16016           100177.91
    StandardAnalyzer          26265          0.141          22366           185727.87
      BaoBaoAnalyzer          26265          0.038          18185           695682.16
    
    test 2, sample length=262650
                name          chars           time         tokens speed(chars/second)
     PaodingAnalyzer         262650          0.187         125420          1402139.61
          IKAnalyzer         262650          0.163         160160          1613693.16
       MMSegAnalyzer         262650          0.158         140070          1664009.53
      BaoBaoAnalyzer         262650          0.041         181850          6362134.44
    StandardAnalyzer         262650          0.020         223660         12905789.80
    
    test 3, sample length=2626500
                name          chars           time         tokens speed(chars/second)
          IKAnalyzer        2626500          2.251        1601600          1166564.72
     PaodingAnalyzer        2626500          1.462        1254200          1796381.55
       MMSegAnalyzer        2626500          1.043        1400700          2519010.94
      BaoBaoAnalyzer        2626500          0.352        1818500          7458959.20
    StandardAnalyzer        2626500          0.202        2236600         13015280.16
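
The speed column above is consistent with chars divided by time in seconds. For example, 262650 / 0.041 is about 6,406,098, close to the reported 6362134.44; the gap is what one would expect if the displayed time is rounded to milliseconds. The sketch below shows how such a measurement could be taken. It is an assumption about the general shape of a harness, not the project's actual `ant benchmark` code: `ThroughputSketch`, `TokenCounter`, and the whitespace counter are hypothetical stand-ins for a real analyzer.

```java
import java.util.Locale;

/** Illustrative throughput measurement; not the project's benchmark code. */
final class ThroughputSketch {

    /** Hypothetical stand-in for an analyzer: returns the token count for a text. */
    interface TokenCounter {
        long countTokens(String text);
    }

    static double charsPerSecond(long chars, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return chars / seconds;
    }

    /** Builds a larger sample by repeating a base text. */
    static String repeat(String base, int times) {
        StringBuilder sb = new StringBuilder(base.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(base);
        }
        return sb.toString();
    }

    /** Times one pass over the sample and formats a row like the tables above. */
    static String measure(String name, TokenCounter counter, String sample) {
        long startNanos = System.nanoTime();
        long tokens = counter.countTokens(sample);
        // Clamp to one nanosecond so a very fast run cannot divide by zero.
        double seconds = Math.max((System.nanoTime() - startNanos) / 1e9, 1e-9);
        return String.format(Locale.ROOT, "%20s %14d %14.3f %14d %20.2f",
                name, sample.length(), seconds, tokens,
                charsPerSecond(sample.length(), seconds));
    }

    public static void main(String[] args) {
        TokenCounter whitespace = text -> text.trim().split("\\s+").length;
        String base = "lucene tokenizer benchmark sample ";
        for (int times : new int[] {1000, 10000, 100000}) {
            System.out.println(measure("WhitespaceCounter", whitespace, repeat(base, times)));
        }
    }
}
```

A single timed pass like this is sensitive to JVM warm-up, so small samples tend to report lower speeds than large ones; that may be part of why test 1 is slower for every analyzer than tests 2 and 3.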
    

     


  • Original URL: https://www.cnblogs.com/lexus/p/2402830.html