  • Using Stanford NLP for Chinese

    The Stanford NLP tools provide components for processing Chinese; the two covered here are the word segmenter and the parser. For details, see:

    http://nlp.stanford.edu/software/parser-faq.shtml#o

    1. Word segmentation: the Chinese Segmenter

    Download: http://nlp.stanford.edu/software/

    Stanford Chinese Word Segmenter A Java implementation of a CRF-based Chinese Word Segmenter

    This package is fairly large and needs a lot of memory at run time, so when running it from Eclipse you need to increase the JVM heap size:

    Run Configurations -> Arguments -> VM arguments -> -Xmx800m (maximum heap 800 MB)

    Demo code (modified, not verified):

        // requires: import java.util.Properties;
        // requires: import edu.stanford.nlp.ie.crf.CRFClassifier;

        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
        // props.setProperty("normTableEncoding", "UTF-8");
        // below is needed because CTBSegDocumentIteratorFactory accesses it
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        //props.setProperty("testFile", args[0]);
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        // load the CTB (Chinese Treebank) model shipped in the data/ directory
        CRFClassifier classifier = new CRFClassifier(props);
        classifier.loadClassifierNoExceptions("data/ctb.gz", props);
        // flags must be re-set after data is loaded
        classifier.flags.setProperties(props);
        //classifier.writeAnswers(classifier.test(args[0]));
        //classifier.testAndWriteAnswers(args[0]);

        // segment a sentence and print the space-delimited result
        String result = classifier.testString("我是中国人!");
        System.out.println(result);
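
    For reference, more recent releases of the segmenter expose a segmentString method (used by the bundled SegDemo) that returns the tokens as a List instead of a single string. A minimal sketch, assuming a recent segmenter jar and the same data/ directory; the property names are unchanged from the demo above, and the class name SegmentSketch is just a placeholder:

        import edu.stanford.nlp.ie.crf.CRFClassifier;
        import edu.stanford.nlp.ling.CoreLabel;
        import java.util.List;
        import java.util.Properties;

        public class SegmentSketch {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.setProperty("sighanCorporaDict", "data");
                props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
                props.setProperty("inputEncoding", "UTF-8");
                props.setProperty("sighanPostProcessing", "true");

                // same CTB model as in the demo above
                CRFClassifier<CoreLabel> classifier = new CRFClassifier<CoreLabel>(props);
                classifier.loadClassifierNoExceptions("data/ctb.gz", props);
                classifier.flags.setProperties(props);

                // segmentString returns one String per token (newer segmenter API,
                // an assumption for the older version used in this post)
                List<String> tokens = classifier.segmentString("我是中国人。");
                System.out.println(tokens);
            }
        }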

    2. Stanford Parser

    References: http://nlp.stanford.edu/software/parser-faq.shtml#o

    http://blog.csdn.net/leeharry/archive/2008/03/06/2153583.aspx

    Depending on which trained grammar you load, the parser can handle either English or Chinese. Its input is an already-segmented sentence; its output is the part-of-speech tags and the sentence's parse tree (along with dependency relations).

    English demo (included in the downloaded archive):

        // requires: import java.util.*;
        // requires: import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
        // requires: import edu.stanford.nlp.trees.*;

        LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
        lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

        String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        Tree parse = (Tree) lp.apply(Arrays.asList(sent));
        parse.pennPrint();
        System.out.println();

        // derive collapsed typed dependencies from the constituency tree
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection tdl = gs.typedDependenciesCollapsed();
        System.out.println(tdl);
        System.out.println();

        // print the Penn Treebank tree followed by the collapsed dependencies
        TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
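
    Note that the String constructor and setOptionFlags call above come from an older parser release; later versions replace them with a static loadModel factory that takes the option flags directly. A hedged sketch of the same demo against the newer API, assuming the models jar is on the classpath (the class name NewApiParseSketch is a placeholder):

        import edu.stanford.nlp.ling.CoreLabel;
        import edu.stanford.nlp.ling.Sentence;
        import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
        import edu.stanford.nlp.trees.Tree;
        import java.util.List;

        public class NewApiParseSketch {
            public static void main(String[] args) {
                // loadModel replaces "new LexicalizedParser(path)"; flags follow the model path
                LexicalizedParser lp = LexicalizedParser.loadModel(
                        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
                        "-maxLength", "80", "-retainTmpSubcategories");

                String[] sent = { "This", "is", "an", "easy", "sentence", "." };
                List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);  // wrap tokens as HasWord items
                Tree parse = lp.apply(rawWords);
                parse.pennPrint();
            }
        }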

    For Chinese it is slightly different:

        // additionally requires: import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

        //LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
        LexicalizedParser lp = new LexicalizedParser("xinhuaFactored.ser.gz");
        //lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

        //String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
        String sentence = "他和我在学校里常打桌球。";
        Tree parse = (Tree) lp.apply(Arrays.asList(sent));
        //Tree parse = (Tree) lp.apply(sentence);

        parse.pennPrint();
        System.out.println();

        /*
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection tdl = gs.typedDependenciesCollapsed();
        System.out.println(tdl);
        System.out.println();
        */

        //only for English:
        //TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        //for Chinese, pass a ChineseTreebankLanguagePack so the output uses the Chinese grammatical relations
        TreePrint tp = new TreePrint("wordsAndTags,penn,typedDependenciesCollapsed", new ChineseTreebankLanguagePack());
        tp.printTree(parse);
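
    Because the parser expects pre-segmented input, the segmenter from section 1 can feed it directly. A small sketch that reuses the classifier from section 1 and the lp above; it assumes the segmenter's output is a space-delimited string (its default) and reuses the unverified testString call from the earlier demo:

        // classifier comes from section 1, lp is the Chinese LexicalizedParser above
        String segmented = classifier.testString("他和我在学校里常打桌球。");
        String[] words = segmented.trim().split("\\s+");   // tokens are separated by spaces
        Tree parse2 = (Tree) lp.apply(Arrays.asList(words));
        parse2.pennPrint();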

    Sometimes, however, we don't just want the dependency relations printed out; we want to work with the parse tree (graph) itself. In that case use a program like the following (ParserSentence appears to be the author's own wrapper class around LexicalizedParser, not part of the Stanford distribution):
            // ParserSentence is apparently a user-defined wrapper that runs LexicalizedParser
            // on a pre-segmented sentence and returns the resulting Tree
            String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
            ParserSentence ps = new ParserSentence();
            Tree parse = ps.parserSentence(sent);
            parse.pennPrint();

            TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
            GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
            GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
            Collection tdl = gs.typedDependenciesCollapsed();
            System.out.println(tdl);
            System.out.println();

            // TypedDependency(GrammaticalRelation reln, TreeGraphNode gov, TreeGraphNode dep)
            Object[] deps = tdl.toArray();
            for (int i = 0; i < deps.length; i++) {
                TypedDependency td = (TypedDependency) deps[i];
                System.out.println(td.toString());
            }

    The GrammaticalStructure method getGrammaticalRelation(TreeGraphNode gov, TreeGraphNode dep) can be used to obtain the grammatical dependency relation between two words.
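
    Each TypedDependency also exposes its relation, governor, and dependent directly. A minimal sketch iterating over the tdl collection from the program above; the accessors reln(), gov(), and dep() are from the TypedDependency class of this parser generation, where gov() and dep() return TreeGraphNode (newer releases return IndexedWord instead):

        // unpack each typed dependency produced above
        for (Object o : tdl) {
            TypedDependency td = (TypedDependency) o;
            GrammaticalRelation reln = td.reln();   // relation type, e.g. nsubj, conj
            TreeGraphNode gov = td.gov();           // governor (head) word
            TreeGraphNode dep = td.dep();           // dependent word
            System.out.println(reln.getShortName() + "(" + gov + ", " + dep + ")");
        }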
