  • Using Stanford NLP for Chinese

    Using Stanford NLP for Chinese (twenz for higher, Baidu Space)

    The Stanford NLP tools include tools for processing Chinese. The two covered here are the word segmenter and the parser. For details, see:

    http://nlp.stanford.edu/software/parser-faq.shtml#o

    1. Word segmentation: Chinese segmenter

    Download: http://nlp.stanford.edu/software/

    Stanford Chinese Word Segmenter A Java implementation of a CRF-based Chinese Word Segmenter

    This package is fairly large and needs a lot of memory at run time, so when running it from Eclipse you have to raise the JVM's maximum heap size:

    Run -> Arguments -> VM arguments -> -Xmx800m (maximum heap size of 800 MB)
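
    To confirm that the -Xmx setting actually reached the JVM, a minimal sketch like the one below can be run with the same VM arguments. The class name HeapCheck and the 700 MB threshold are illustrative assumptions and are not part of the Stanford distribution; only Runtime.maxMemory() is standard Java. The reported value can be somewhat below the -Xmx figure on some JVMs, which is why the threshold is set lower than 800.

```java
final class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True if the configured maximum heap is at least requiredMb megabytes. */
    static boolean hasAtLeastMb(long requiredMb) {
        return maxHeapMb() >= requiredMb;
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
        // 700 is an assumed tolerance below the 800 MB requested with -Xmx800m.
        if (!hasAtLeastMb(700)) {
            System.out.println("Heap looks too small; check the VM arguments.");
        }
    }
}
```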

    Demo code (modified, not tested):

        import java.util.Properties;
        import edu.stanford.nlp.ie.crf.CRFClassifier;

        // The statements below belong inside a method such as main().
        Properties props = new Properties();
        // Directory that holds the segmenter's data files.
        props.setProperty("sighanCorporaDict", "data");
        // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
        // props.setProperty("normTableEncoding", "UTF-8");
        // below is needed because CTBSegDocumentIteratorFactory accesses it
        props.setProperty("serDictionary","data/dict-chris6.ser.gz");
        //props.setProperty("testFile", args[0]);
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        
        // Load the CRF model trained on the Chinese Treebank (CTB).
        CRFClassifier classifier = new CRFClassifier(props);
        classifier.loadClassifierNoExceptions("data/ctb.gz", props);
        // flags must be re-set after data is loaded
        classifier.flags.setProperties(props);
        //classifier.writeAnswers(classifier.test(args[0]));
        //classifier.testAndWriteAnswers(args[0]);
        
        // Segment one sentence and print the result.
        String result = classifier.testString("我是中国人!");
        System.out.println(result);

    2. Stanford Parser

    See http://nlp.stanford.edu/software/parser-faq.shtml#o

    http://blog.csdn.net/leeharry/archive/2008/03/06/2153583.aspx

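    Depending on which trained model is loaded, the parser can handle English or Chinese. The input is a sentence that has already been segmented into words; the output is the part-of-speech tags and the sentence's parse tree (with dependency relations).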
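
    Because the parser expects pre-segmented input, a whitespace-delimited segmented string has to be turned into a token list first. The following is a minimal sketch of that step using only standard Java. The class and method names (SegmentedInput, toTokens) are hypothetical, and the sample string is segmented by hand for illustration; it is not actual segmenter output.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentedInput {
    /**
     * Splits a segmented sentence on runs of whitespace, including the
     * full-width space U+3000, and drops empty pieces.
     */
    static List<String> toTokens(String segmented) {
        List<String> tokens = new ArrayList<>();
        for (String piece : segmented.split("[\\s\\u3000]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hand-segmented sample, not real segmenter output.
        List<String> tokens = toTokens("他 和 我 在 学校 里 常 打 桌球 。");
        System.out.println(tokens.size() + " tokens: " + tokens);
    }
}
```

    The resulting list can be passed wherever the examples in this section use Arrays.asList(sent).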

    English demo (included in the downloaded archive):

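        import java.util.Arrays;
        import java.util.Collection;
        import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
        import edu.stanford.nlp.trees.*;

        // Load the serialized English grammar.
        LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
        lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

        // Parse a pre-tokenized sentence and print the tree in Penn Treebank format.
        String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        Tree parse = (Tree) lp.apply(Arrays.asList(sent));
        parse.pennPrint();
        System.out.println();

        // Derive the collapsed typed dependencies from the parse tree.
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection tdl = gs.typedDependenciesCollapsed();
        System.out.println(tdl);
        System.out.println();

        // Print the tree and the dependencies in one step.
        TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);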

    Chinese is slightly different:

        // Differences from the English demo: load the Chinese grammar and pass a
        // ChineseTreebankLanguagePack
        // (package edu.stanford.nlp.trees.international.pennchinese) to TreePrint.
        //LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
        LexicalizedParser lp = new LexicalizedParser("xinhuaFactored.ser.gz");
        //lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

        // The input must already be segmented into words.
        //String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
        // Unsegmented form of the same sentence (only used by the disabled call below).
        String sentence = "他和我在学校里常打桌球。";
        Tree parse = (Tree) lp.apply(Arrays.asList(sent));
        //Tree parse = (Tree) lp.apply(sentence);

        parse.pennPrint();

        System.out.println();
        /*
        // English language pack; disabled for Chinese.
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection tdl = gs.typedDependenciesCollapsed();
        System.out.println(tdl);
        System.out.println();
        */
        // English only:
        //TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        // Chinese:
        TreePrint tp = new TreePrint("wordsAndTags,penn,typedDependenciesCollapsed", new ChineseTreebankLanguagePack());
        tp.printTree(parse);
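
    When only the printed Penn-format text is at hand (for example, saved console output), the word and tag pairs can be recovered from it with a small amount of standard Java. This is a rough sketch, not a Stanford API: the class PennLeaves and its method are hypothetical names, and the bracketed string in main is written by hand for illustration rather than copied from real parser output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PennLeaves {
    // Matches a preterminal such as "(NN 桌球)": a tag followed by a word,
    // with no nested bracket inside.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    /** Returns the leaves of a Penn-style bracketed tree as "word/TAG" strings. */
    static List<String> wordsAndTags(String pennTree) {
        List<String> out = new ArrayList<>();
        Matcher m = PRETERMINAL.matcher(pennTree);
        while (m.find()) {
            out.add(m.group(2) + "/" + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written illustrative bracketing, not real parser output.
        String tree = "(ROOT (IP (NP (PN 他)) (VP (VV 打) (NP (NN 桌球))) (PU 。)))";
        System.out.println(wordsAndTags(tree));
    }
}
```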

    Sometimes, however, printing the dependency relations is not enough and we want the parse tree (graph) itself as objects. In that case, use a program like the following:

        // ParserSentence is the author's own helper class (not shown here); it is
        // assumed to wrap the parser call from the previous example.
        String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
        ParserSentence ps = new ParserSentence();
        Tree parse = ps.parserSentence(sent);
        parse.pennPrint();
        TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection tdl = gs.typedDependenciesCollapsed();
        System.out.println(tdl);
        System.out.println();
        // Iterate over the collection directly instead of calling toArray() on every pass.
        for (Object o : tdl)
        {
            // TypedDependency(GrammaticalRelation reln, TreeGraphNode gov, TreeGraphNode dep)
            TypedDependency td = (TypedDependency) o;
            System.out.println(td.toString());
        }

    // The GrammaticalStructure method getGrammaticalRelation(TreeGraphNode gov, TreeGraphNode dep) returns the grammatical dependency relation between two words.
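
    The idea of looking up the relation between a governor and a dependent can be illustrated without the Stanford classes. The sketch below uses a plain stand-in class, Dep, in place of TypedDependency; Dep, DependencyLookup, and both methods are hypothetical, and the triples in main are made up for illustration rather than taken from parser output. Real dependency nodes also carry a word index, which this sketch omits, so repeated words in one sentence would be conflated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class DependencyLookup {
    /** Plain stand-in for a typed dependency: relation(governor, dependent). */
    static final class Dep {
        final String reln;
        final String gov;
        final String dep;

        Dep(String reln, String gov, String dep) {
            this.reln = reln;
            this.gov = gov;
            this.dep = dep;
        }
    }

    /** Groups dependencies by governor word, keeping first-seen order. */
    static Map<String, List<Dep>> byGovernor(List<Dep> deps) {
        Map<String, List<Dep>> index = new LinkedHashMap<>();
        for (Dep d : deps) {
            index.computeIfAbsent(d.gov, k -> new ArrayList<>()).add(d);
        }
        return index;
    }

    /** Returns the relation name linking gov to dep, if such a dependency exists. */
    static Optional<String> relationBetween(List<Dep> deps, String gov, String dep) {
        for (Dep d : deps) {
            if (d.gov.equals(gov) && d.dep.equals(dep)) {
                return Optional.of(d.reln);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up triples for illustration only.
        List<Dep> deps = Arrays.asList(
                new Dep("nsubj", "打", "他"),
                new Dep("advmod", "打", "常"),
                new Dep("dobj", "打", "桌球"));
        System.out.println(relationBetween(deps, "打", "桌球").orElse("none"));
        System.out.println(byGovernor(deps).get("打").size() + " dependents of 打");
    }
}
```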

  • Original article: https://www.cnblogs.com/lexus/p/2756801.html