  • Trying out the Stanford Word Segmenter

    Download link: click here

    After downloading and unpacking, the folder looks like this:

    [screenshot of the unpacked folder]

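    Next, open Eclipse, create a new project, copy the source file SegDemo.java into it, and add all of the jar files to the build path (right-click the project, Properties, Java Build Path, Add External JARs).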
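
    If you are not sure the jars actually ended up on the build path, a quick check is to try loading one of the classes the demo imports. This is only a minimal sketch: the class name `ClasspathCheck` and the method `isOnClasspath` are made up for illustration, and the only Stanford name used is `edu.stanford.nlp.ie.crf.CRFClassifier`, taken from the demo's imports.

```java
/** Hypothetical helper: checks whether a class can be loaded from the current classpath. */
class ClasspathCheck {

  /** Returns true if the named class can be located by this class's class loader. */
  static boolean isOnClasspath(String className) {
    try {
      // initialize=false: only locate the class, do not run its static initializers.
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Class name copied from the import list of SegDemo.java.
    String name = "edu.stanford.nlp.ie.crf.CRFClassifier";
    if (isOnClasspath(name)) {
      System.out.println(name + " found");
    } else {
      System.out.println(name + " NOT found; check Java Build Path");
    }
  }
}
```

    Run it from inside the same Eclipse project so that it sees the same build path as the demo.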

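    Import the data folder into the project and change the path in the source code to point at it, as in this screenshot:

    [screenshot of the modified path]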
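
    Since a wrong path is the easiest thing to get wrong here, it may be worth confirming that the data directory contains the two files the demo source refers to, `dict-chris6.ser.gz` and `ctb.gz`. The sketch below assumes the same lookup as the demo (the `SegDemo` system property, defaulting to `data`); the class `DataDirCheck` and the method `missingFiles` are hypothetical names, not part of the distribution.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which model files are missing from the data directory. */
class DataDirCheck {

  // File names taken from the demo source: the serialized dictionary and the model.
  static final String[] REQUIRED = {"dict-chris6.ser.gz", "ctb.gz"};

  /** Returns the required file names that do not exist as regular files under baseDir. */
  static List<String> missingFiles(Path baseDir) {
    List<String> missing = new ArrayList<>();
    for (String name : REQUIRED) {
      if (!Files.isRegularFile(baseDir.resolve(name))) {
        missing.add(name);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Same lookup as the demo: -DSegDemo=/path/to/data, defaulting to "data".
    Path baseDir = Paths.get(System.getProperty("SegDemo", "data"));
    List<String> missing = missingFiles(baseDir);
    if (missing.isEmpty()) {
      System.out.println("All model files found in " + baseDir.toAbsolutePath());
    } else {
      System.out.println("Missing from " + baseDir.toAbsolutePath() + ": " + missing);
    }
  }
}
```

    In Eclipse a relative path such as `data` is normally resolved against the project root, which is why the folder is imported at the top level of the project.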

    Then edit SegDemo.java and test it:

    package test;
    import java.io.*;
    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;


    /** This is a very simple demo of calling the Chinese Word Segmenter
     *  programmatically.  It assumes an input file in UTF8.
     *  <p/>
     *  <code>
     *  Usage: java -mx1g -cp seg.jar SegDemo fileName
     *  </code>
     *  This will run correctly in the distribution home directory.  To
     *  run in general, the properties for where to find dictionaries or
     *  normalizations have to be set.
     *
     *  @author Christopher Manning
     */

    public class SegDemo {

      // Data directory: override with -DSegDemo=<path>; defaults to "data".
      private static final String basedir = System.getProperty("SegDemo", "data");

      public static void main(String[] args) throws Exception {
        // Print as UTF-8 so the Chinese output is not garbled.
        System.setOut(new PrintStream(System.out, true, "utf-8"));

        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", basedir);
        // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
        // props.setProperty("normTableEncoding", "UTF-8");
        // below is needed because CTBSegDocumentIteratorFactory accesses it
        props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
        if (args.length > 0) {
          props.setProperty("testFile", args[0]);
        }
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        // Load the segmentation model (ctb.gz) from the data directory.
        CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
        segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);
        // Segment each file passed on the command line.
        for (String filename : args) {
          segmenter.classifyAndWriteAnswers(filename);
        }

        // Segment a single sentence held in memory.
        String sample = "我住在美国。";
        List<String> segmented = segmenter.segmentString(sample);
        System.out.println(segmented);
      }

    }

    Output: [我, 住在, 美国, 。]

    From here, feel free to experiment.
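
    One possible starting point is to post-process the `List<String>` that `segmentString` returns, for example joining the tokens with spaces or counting them. The sketch below does not call the segmenter at all: it uses a hard-coded list copied from the output shown above, and the class `SegmentedTokens` with its two methods is a hypothetical helper, not part of the Stanford library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for working with a list of segmented tokens. */
class SegmentedTokens {

  /** Joins tokens with single spaces, a common plain-text form for segmented Chinese. */
  static String joinWithSpaces(List<String> tokens) {
    return String.join(" ", tokens);
  }

  /** Counts how often each token occurs, keeping first-seen order. */
  static Map<String, Integer> countTokens(List<String> tokens) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String token : tokens) {
      counts.merge(token, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Hard-coded stand-in for segmenter.segmentString("我住在美国。"),
    // copied from the demo's printed output.
    List<String> segmented = Arrays.asList("我", "住在", "美国", "。");
    System.out.println(joinWithSpaces(segmented));
    System.out.println(countTokens(segmented));
  }
}
```

    In the real demo you would pass the `segmented` variable from `SegDemo.main` to these methods instead of the hard-coded list.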

  • Original article: https://www.cnblogs.com/kuqs/p/5435574.html