Using Stanford NLP for Chinese

The Stanford NLP tools include tools for processing Chinese; the two covered here are the word segmenter and the parser. For details, see:

http://nlp.stanford.edu/software/parser-faq.shtml#o

1. Word segmentation: Chinese segmenter

Download: http://nlp.stanford.edu/software/

Stanford Chinese Word Segmenter: a Java implementation of a CRF-based Chinese word segmenter.

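This package is fairly large and needs a lot of memory at run time, so when running it from Eclipse you need to raise the JVM's maximum heap size:

Run -> Arguments -> VM arguments -> -Xmx800m (maximum heap size of 800 MB)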
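
To confirm that the -Xmx setting actually reached the JVM, a quick check like the following can be run from the same Eclipse launch configuration. This is a minimal sketch using only the standard library; the class name HeapCheck is made up for illustration.

```java
public class HeapCheck {
    /** Returns the JVM's maximum heap size in megabytes. */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // With -Xmx800m this should report a value in the region of 800;
        // the exact figure depends on the JVM and garbage collector.
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```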

Demo code (modified, not tested):

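    // imports needed by this fragment
    import java.util.Properties;
    import edu.stanford.nlp.ie.crf.CRFClassifier;

    Properties props = new Properties();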
    props.setProperty("sighanCorporaDict", "data");
    // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
    // props.setProperty("normTableEncoding", "UTF-8");
    // below is needed because CTBSegDocumentIteratorFactory accesses it
    props.setProperty("serDictionary","data/dict-chris6.ser.gz");
    //props.setProperty("testFile", args[0]);
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier classifier = new CRFClassifier(props);
    classifier.loadClassifierNoExceptions("data/ctb.gz", props);
    // flags must be re-set after data is loaded
    classifier.flags.setProperties(props);
    //classifier.writeAnswers(classifier.test(args[0]));
    //classifier.testAndWriteAnswers(args[0]);

    String result = classifier.testString("我是中国人！");
    System.out.println(result);
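
The parser in the next section expects a sentence that has already been split into words. Assuming the segmenter returns the words separated by whitespace (an assumption worth checking against your version), the result string can be turned into a token list as sketched below. The class name and the sample string are illustrative; the sample is hand-written, not actual segmenter output.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentedTokens {
    /** Splits a whitespace-separated segmentation result into tokens. */
    public static List<String> toTokens(String segmented) {
        List<String> tokens = new ArrayList<>();
        for (String t : segmented.trim().split("\\s+")) {
            if (!t.isEmpty()) {   // an empty or blank input yields one empty piece
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hand-written example of what a segmented sentence might look like.
        String segmented = "我 是 中国 人 ！";
        System.out.println(toTokens(segmented));
    }
}
```

The resulting list can be passed on in the same way as the Arrays.asList(sent) call in the parser demos.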

2. Stanford Parser

See http://nlp.stanford.edu/software/parser-faq.shtml#o

http://blog.csdn.net/leeharry/archive/2008/03/06/2153583.aspx

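Depending on which trained model is loaded, the parser can handle English or Chinese. The input is an already-segmented sentence; the output is the part-of-speech tags and the parse tree of the sentence (with its dependency relations).

English demo (included in the downloaded archive):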

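    // imports needed by this fragment
    import java.util.*;
    import edu.stanford.nlp.trees.*;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

    LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");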
    lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

    String[] sent = { "This", "is", "an", "easy", "sentence", "." };
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
    parse.pennPrint();
    System.out.println();

    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    System.out.println(tdl);
    System.out.println();

    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);

Chinese is slightly different:

    //LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
    LexicalizedParser lp = new LexicalizedParser("xinhuaFactored.ser.gz");
    //lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

    //    String[] sent = { "This", "is", "an", "easy", "sentence", "." };
    String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
    String sentence = "他和我在学校里常打桌球。";
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
    //Tree parse = (Tree) lp.apply(sentence);

    parse.pennPrint();

    System.out.println();
    /*
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    System.out.println(tdl);
    System.out.println();
    */
    //only for English
    //TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    //chinese
    TreePrint tp = new TreePrint("wordsAndTags,penn,typedDependenciesCollapsed",new ChineseTreebankLanguagePack());
    tp.printTree(parse);
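
The penn format printed above is plain bracketed text, so when only the printed output is available (in a log file, for example), word/tag pairs can be recovered with a regular expression. The following is a minimal sketch using only the standard library; the class name is hypothetical, and the tree string is hand-written for illustration rather than actual parser output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PennLeaves {
    // A leaf looks like "(TAG word)": one tag, one word, no nested brackets.
    private static final Pattern LEAF =
        Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    /** Returns "word/TAG" strings for every leaf in a bracketed tree. */
    public static List<String> wordsAndTags(String pennTree) {
        List<String> out = new ArrayList<>();
        Matcher m = LEAF.matcher(pennTree);
        while (m.find()) {
            out.add(m.group(2) + "/" + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative tree only; a real parse may bracket the sentence differently.
        String tree = "(ROOT (IP (NP (PN 他)) (VP (VV 打) (NP (NN 桌球))) (PU 。)))";
        System.out.println(wordsAndTags(tree));
    }
}
```

Inside a program that already holds the Tree object, working with the object directly is the better choice; the text route is only useful for post-processing saved output.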

Sometimes, however, we want more than the printed dependency relations: we want the grammatical structure (tree/graph) itself as objects. In that case, use code like the following:

    String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
    // ParserSentence is not defined in this post; presumably a helper class
    // that wraps the parser calls shown above
    ParserSentence ps = new ParserSentence();
    Tree parse = ps.parserSentence(sent);
    parse.pennPrint();
    TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    System.out.println(tdl);
    System.out.println();
    // iterate over the dependencies one by one
    for (Object o : tdl) {
        // TypedDependency(GrammaticalRelation reln, TreeGraphNode gov, TreeGraphNode dep)
        TypedDependency td = (TypedDependency) o;
        System.out.println(td.toString());
    }

The GrammaticalStructure method getGrammaticalRelation(TreeGraphNode gov, TreeGraphNode dep) returns the grammatical dependency relation between two words.
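
To make the idea of looking up the relation between two words concrete without depending on a particular parser version, the sketch below builds such a lookup from dependency lines in the printed reln(gov-index, dep-index) form. It is an illustration using only the standard library, not the library's implementation; the class name, method names and sample lines are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DependencyLookup {
    // Matches lines such as "nsubj(sentence-5, This-1)".
    private static final Pattern DEP =
        Pattern.compile("([^(]+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    private final Map<String, String> relations = new HashMap<>();

    public DependencyLookup(List<String> lines) {
        for (String line : lines) {
            Matcher m = DEP.matcher(line.trim());
            if (m.matches()) {
                // Key on the 1-based word positions of governor and dependent.
                relations.put(m.group(3) + ">" + m.group(5), m.group(1));
            }
        }
    }

    /** Returns the relation name, or null if the pair is not directly linked. */
    public String relation(int govIndex, int depIndex) {
        return relations.get(govIndex + ">" + depIndex);
    }

    public static void main(String[] args) {
        // Sample lines written by hand in the printed dependency format.
        DependencyLookup lookup = new DependencyLookup(Arrays.asList(
            "nsubj(sentence-5, This-1)",
            "cop(sentence-5, is-2)"));
        System.out.println(lookup.relation(5, 1));
        System.out.println(lookup.relation(1, 5));
    }
}
```

The lookup is directional: the governor comes first, so swapping the two indices finds nothing unless a dependency was also recorded in that direction.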