  • Trying out the Stanford Word Segmenter

    Download the segmenter first (download link: click here).

    After downloading, the folder looks like this:

    [Screenshot: contents of the downloaded folder]

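    Next, open Eclipse, create a new project, copy the source file SegDemo.java into it, and add all of the jar files to the build path (right-click the project, then Properties, Java Build Path, Add External JARs).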
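A quick way to confirm that the jars really ended up on the build path is to try loading one of the classes the demo imports. The following is a minimal sketch; `ClasspathCheck` and `isOnClasspath` are hypothetical names introduced here for illustration, and only the class name `edu.stanford.nlp.ie.crf.CRFClassifier` comes from the demo's import list.

```java
/** Hypothetical helper, not part of the Stanford distribution. */
class ClasspathCheck {

  /** Returns true if the named class can be found on the current classpath. */
  static boolean isOnClasspath(String className) {
    try {
      // initialize=false: only locate the class, do not run its static initializers
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Class name taken from the import statements in SegDemo.java
    String name = "edu.stanford.nlp.ie.crf.CRFClassifier";
    if (isOnClasspath(name)) {
      System.out.println("Found " + name + "; the jars appear to be on the build path.");
    } else {
      System.out.println("Could not find " + name + "; check Java Build Path in Eclipse.");
    }
  }
}
```

If this reports that the class is missing, the jars were most likely not added under Java Build Path, and SegDemo.java will not compile either.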

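    Import the data package into the project and change the path in the source code so that it points to it:

    [Screenshot: the data path in the source code]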
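Since a wrong path is the most likely cause of a failed model load, it can help to check the data directory before running the demo. The sketch below assumes the same lookup the demo uses (system property `SegDemo`, defaulting to `data`) and the two file names that appear in the demo source, `dict-chris6.ser.gz` and `ctb.gz`. The class `SegDataCheck` and its method are hypothetical helpers, not part of the Stanford distribution.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the Stanford distribution. */
class SegDataCheck {

  // File names taken from SegDemo.java: the serialized dictionary and the CTB model
  static final String[] REQUIRED = {"dict-chris6.ser.gz", "ctb.gz"};

  /** Returns the names of required files that are missing from dataDir. */
  static List<String> missingFiles(Path dataDir) {
    List<String> missing = new ArrayList<>();
    for (String name : REQUIRED) {
      if (!Files.isRegularFile(dataDir.resolve(name))) {
        missing.add(name);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Same lookup as SegDemo: system property "SegDemo", falling back to "data"
    Path dataDir = Paths.get(System.getProperty("SegDemo", "data")).toAbsolutePath();
    List<String> missing = missingFiles(dataDir);
    if (missing.isEmpty()) {
      System.out.println("All model files found in " + dataDir);
    } else {
      System.out.println("Missing in " + dataDir + ": " + missing);
    }
  }
}
```

When run from Eclipse, a relative path such as `data` is resolved against the working directory, which is normally the project root, so printing the absolute path shows where the demo will actually look.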

    Then modify SegDemo.java and test it:

    package test;
    import java.io.*;
    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;


    /** This is a very simple demo of calling the Chinese Word Segmenter
     *  programmatically.  It assumes an input file in UTF8.
     *  <p/>
     *  <code>
     *  Usage: java -mx1g -cp seg.jar SegDemo fileName
     *  </code>
     *  This will run correctly in the distribution home directory.  To
     *  run in general, the properties for where to find dictionaries or
     *  normalizations have to be set.
     *
     *  @author Christopher Manning
     */

    public class SegDemo {

      // Directory holding the model and dictionary files; override with -DSegDemo=<path>
      private static final String basedir = System.getProperty("SegDemo", "data");

      public static void main(String[] args) throws Exception {
        // Print results as UTF-8 so that Chinese characters are not garbled
        System.setOut(new PrintStream(System.out, true, "utf-8"));

        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", basedir);
        // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
        // props.setProperty("normTableEncoding", "UTF-8");
        // below is needed because CTBSegDocumentIteratorFactory accesses it
        props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
        if (args.length > 0) {
          props.setProperty("testFile", args[0]);
        }
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        // Create the segmenter and load the CTB model from the data directory
        CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
        segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);
        // Segment every file given on the command line
        for (String filename : args) {
          segmenter.classifyAndWriteAnswers(filename);
        }

        // Segment a single sentence held in memory
        String sample = "我住在美国。";
        List<String> segmented = segmenter.segmentString(sample);
        System.out.println(segmented);
      }

    }

    Output: [我, 住在, 美国, 。]
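The bracketed output is simply the default `toString()` of the `List<String>` returned by `segmentString`. If space-separated text is more convenient for later processing, the tokens can be joined with plain Java. This is a minimal sketch: `SegOutputFormat` and `joinTokens` are hypothetical names, and the token list is hard-coded from the output above so that the example runs without the Stanford jars.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the Stanford distribution. */
class SegOutputFormat {

  /** Joins tokens with single spaces, e.g. [我, 住在, 美国, 。] becomes "我 住在 美国 。". */
  static String joinTokens(List<String> tokens) {
    return String.join(" ", tokens);
  }

  public static void main(String[] args) {
    // Stand-in for the result of segmenter.segmentString(sample)
    List<String> segmented = Arrays.asList("我", "住在", "美国", "。");
    System.out.println(joinTokens(segmented));
  }
}
```

In the demo itself, the same call would be applied to the `segmented` variable in place of the hard-coded list.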

    From here, feel free to experiment on your own.
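One thing to try is the file-based mode: the demo passes each command-line argument to `classifyAndWriteAnswers`, and its Javadoc says the input file is assumed to be UTF-8. The sketch below writes such a file using only the standard library. The class `WriteUtf8Input` and the file name `input.txt` are hypothetical choices made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the Stanford distribution. */
class WriteUtf8Input {

  /** Writes one sentence per line to target, encoded as UTF-8, and returns the path. */
  static Path writeLines(Path target, List<String> lines) throws IOException {
    return Files.write(target, lines, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // "input.txt" is an arbitrary file name chosen for this example
    Path out = writeLines(Paths.get("input.txt"), Arrays.asList("我住在美国。"));
    System.out.println("Wrote " + out.toAbsolutePath());
  }
}
```

The resulting path can then be given to SegDemo as a program argument (in Eclipse, under Run Configurations, Arguments).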

  • Original article: https://www.cnblogs.com/kuqs/p/5435574.html