  • Using Stanford CoreNLP for Chinese Named Entity Recognition

    I recently evaluated Stanford CoreNLP's named entity recognition for a work project.

    Stanford CoreNLP is a powerful natural language processing toolkit, with models ranging from classic CRFs to ones trained with deep learning.

    First, the official links:

    • https://stanfordnlp.github.io/CoreNLP/index.html
    • https://nlp.stanford.edu/nlp/javadoc/javanlp/
    • https://github.com/stanfordnlp/CoreNLP

    This post focuses on how to use Stanford CoreNLP in a Java project.

    1. Environment setup

    Versions after 3.5 require Java 8 or later. Chinese processing is memory-hungry: expect roughly 3 GB of heap.
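
    Since loading the Chinese models with too little heap will typically fail with an OutOfMemoryError, here is a quick illustrative check (plain JDK, nothing CoreNLP-specific) that the JVM was actually started with enough memory, e.g. -Xmx3g:

    // Illustrative sanity check, run before building the pipeline:
    long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
    System.out.println("Max heap: " + maxHeapMb + " MB");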

    I pull the dependencies in with Maven, using version 3.9.1.

    Add the following dependencies to the pom file:

            <dependency>
                <groupId>edu.stanford.nlp</groupId>
                <artifactId>stanford-corenlp</artifactId>
                <version>3.9.1</version>
            </dependency>
            <dependency>
                <groupId>edu.stanford.nlp</groupId>
                <artifactId>stanford-corenlp</artifactId>
                <version>3.9.1</version>
                <classifier>models</classifier>
            </dependency>
            <dependency>
                <groupId>edu.stanford.nlp</groupId>
                <artifactId>stanford-corenlp</artifactId>
                <version>3.9.1</version>
                <classifier>models-chinese</classifier>
            </dependency>

    The three artifacts are the CoreNLP algorithm jar, the English models jar, and the Chinese models jar; together they come to about 1.43 GB. Maven's default mirrors are hosted overseas and these jars are large, so it is worth trying a domestic mirror that carries all three. I used my company's internal Maven repository.
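
    Once the download finishes, you can confirm the models-chinese jar really made it onto the classpath. The check below is my own convenience snippet, relying only on the fact that the jar ships a StanfordCoreNLP-chinese.properties file at its root:

    // Hypothetical classpath check: resolving the bundled Chinese properties
    // file proves the models-chinese jar is visible to the class loader.
    boolean found = Thread.currentThread().getContextClassLoader()
            .getResource("StanfordCoreNLP-chinese.properties") != null;
    System.out.println("models-chinese on classpath: " + found);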

    2. Calling it from code

    Note that because I need Chinese named entity recognition, the pipeline must use the Chinese word segmenter and the Chinese models. If you open up the structure of the imported models jar, you will find a StanfordCoreNLP-chinese.properties file.

    This file sets the parameters for Chinese processing: mainly the pipeline's annotator steps and the paths of the corresponding model files. In practice you may not need every step, or you may want to use different models, in which case you can write a custom properties file and load that instead. In my project I simply read the bundled properties file.

    Note: I only need the ner functionality and wanted to drop the other annotators. However, Stanford CoreNLP has a limitation: before ner can run, the pipeline must include

    tokenize, ssplit, pos, lemma

    which adds considerable processing time. Dropping everything after ner (parse, coref) still helps, though; see the sketch below.
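
    A minimal sketch of such a trimmed pipeline, assuming you otherwise keep the bundled Chinese defaults: load the packaged properties, then override the annotators list with the shortest chain that ner still accepts.

    // Sketch (inside an instance method that declares throws Exception):
    // start from the bundled Chinese defaults, keep only what ner needs,
    // and drop parse/coref to save startup and processing time.
    Properties props = new Properties();
    props.load(getClass().getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);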

    Let's first walk through this properties file:

    # Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
    annotators = tokenize, ssplit, pos, lemma, ner, parse, coref
    
    # segment
    tokenize.language = zh
    segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
    segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
    segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
    segment.sighanPostProcessing = true
    
    # sentence split
    ssplit.boundaryTokenRegex = [.。]|[!?!?]+
    
    # pos
    pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger
    
    # ner: sets the NER language and (CRF) model; SUTime currently supports only English, not Chinese, so it is set to false.
    ner.language = chinese
    ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
    ner.applyNumericClassifiers = true
    ner.useSUTime = false
    
    # regexner
    ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
    ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE
    
    # parse
    parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz
    
    # depparse
    depparse.model    = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
    depparse.language = chinese
    
    # coref
    coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
    coref.input.type = raw
    coref.postprocessing = true
    coref.calculateFeatureImportance = false
    coref.useConstituencyTree = true
    coref.useSemantics = false
    coref.algorithm = hybrid
    coref.path.word2vec =
    coref.language = zh
    coref.defaultPronounAgreement = true
    coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
    coref.print.md.log = false
    coref.md.type = RULE
    coref.md.liberalChineseMD = false
    
    # kbp
    kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
    kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
    kbp.language = zh
    kbp.model = none
    
    # entitylink
    entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz

    We can then load this properties file directly in our code. Reference code below:

    package com.baidu.corenlp;
    
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    
    import edu.stanford.nlp.coref.CorefCoreAnnotations;
    import edu.stanford.nlp.coref.data.CorefChain;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.semgraph.SemanticGraph;
    import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
    import edu.stanford.nlp.trees.Tree;
    import edu.stanford.nlp.trees.TreeCoreAnnotations;
    import edu.stanford.nlp.util.CoreMap;
    
    /**
     * Created by sonofelice on 2018/3/27.
     */
    public class TestNLP {
        public void test() throws Exception {
            // Build a StanfordCoreNLP pipeline; the properties configure features
            // such as lemma (lemmatization) and ner (named entity recognition)
            Properties props = new Properties();
            props.load(this.getClass().getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            String text = "袁隆平是中国科学院的院士,他于2009年10月到中国山东省东营市东营区永乐机场附近承包了一千亩盐碱地,"
                    + "开始种植棉花, 年产量达到一万吨, 哈哈, 反正棣琦说的是假的,逗你玩儿,明天下午2点来我家吃饭吧。"
                    + "棣琦是山东大学毕业的,目前在百度做java开发,位置是东北旺东路102号院,手机号14366778890";
    
            long startTime = System.currentTimeMillis();
            // create an Annotation just with the given text
            Annotation document = new Annotation(text);
    
            // run all the configured annotators on the text
            pipeline.annotate(document);
    
            // retrieve the processed sentences
            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
            for (CoreMap sentence : sentences) {
                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // the token's text (here, a segmented Chinese word)
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    System.out.println(word);
                    // part-of-speech tag
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    System.out.println(pos);
                    // named entity recognition: raw and normalized tags
                    String ne = token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class);
                    String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                    System.out.println(word + " | analysis : {  original : " + ner + "," + " normalized : "
                            + ne + "}");
                    // lemma (a no-op for Chinese)
                    String lema = token.get(CoreAnnotations.LemmaAnnotation.class);
                    System.out.println(lema);
                }
    
            // constituency parse tree of the sentence
                Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
                System.out.println("句子的解析树:");
                tree.pennPrint();
    
            // dependency graph of the sentence
                SemanticGraph graph =
                        sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
                System.out.println("句子的依赖图");
                System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
    
            }
    
            long endTime = System.currentTimeMillis();
            long time = endTime - startTime;
            System.out.println("The analysis lasts " + time + " seconds * 1000");
    
            // coreference chains: each chain holds the set of mentions that
            // refer to the same entity; sentence numbers and offsets are 1-based
            Map<Integer, CorefChain> corefChains = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
            if (corefChains == null) {
                return;
            }
            for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
                System.out.println("Chain " + entry.getKey() + " ");
                for (CorefChain.CorefMention m : entry.getValue().getMentionsInTextualOrder()) {
                    // We need to subtract one since the indices count from 1 but the Lists start from 0
                    List<CoreLabel> tokens = sentences.get(m.sentNum - 1).get(CoreAnnotations.TokensAnnotation.class);
                    // We subtract two for end: one for 0-based indexing, and one because we want last token of mention 
                    // not one following.
                    System.out.println(
                            "  " + m + ", i.e., 0-based character offsets [" + tokens.get(m.startIndex - 1).beginPosition()
                                    +
                                    ", " + tokens.get(m.endIndex - 2).endPosition() + ")");
                }
            }
        }
        public static void main(String[] args) throws Exception {
            TestNLP nlp = new TestNLP();
            nlp.test();
        }
    }

    Of course, in my own runs I kept only the NER-related analysis and commented everything else out. The output looks like this:

    19:46:16.000 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
    19:46:19.387 [main] INFO  e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [3.4 sec].
    19:46:19.388 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
    19:46:19.389 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
    19:46:21.938 [main] INFO  e.s.n.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [2.5 sec].
    19:46:22.099 [main] WARN  e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Entry has multiple types for ner: 巴伐利亚 STATE_OR_PROVINCE    MISC,GPE,LOCATION    1.  Taking type to be MISC
    19:46:22.100 [main] WARN  e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Entry has multiple types for ner: 巴伐利亚 州 STATE_OR_PROVINCE    MISC,GPE,LOCATION    1.  Taking type to be MISC
    19:46:22.100 [main] INFO  e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Read 21238 unique entries out of 21249 from edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab, 0 TokensRegex patterns.
    19:46:22.532 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
    19:46:35.855 [main] INFO  e.s.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/chineseSR.ser.gz ... done [13.3 sec].
    19:46:35.859 [main] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
    19:46:43.139 [main] INFO  e.s.n.pipeline.CorefMentionAnnotator - Using mention detector type: rule
    19:46:43.148 [main] INFO  e.s.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
    19:46:43.148 [main] INFO  e.s.nlp.wordseg.ChineseDictionary -   edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
    19:46:43.329 [main] INFO  e.s.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
    19:46:43.379 [main] INFO  edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
    19:46:43.380 [main] INFO  e.s.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].
    袁隆平 | analysis : {  original : PERSON, normalized : null}
    是 | analysis : {  original : O, normalized : null}
    中国 | analysis : {  original : ORGANIZATION, normalized : null}
    科学院 | analysis : {  original : ORGANIZATION, normalized : null}
    的 | analysis : {  original : O, normalized : null}
    院士 | analysis : {  original : TITLE, normalized : null}
    , | analysis : {  original : O, normalized : null}
    他 | analysis : {  original : O, normalized : null}
    于 | analysis : {  original : O, normalized : null}
    2009年 | analysis : {  original : DATE, normalized : 2009-10-XX}
    10月 | analysis : {  original : DATE, normalized : 2009-10-XX}
    到 | analysis : {  original : O, normalized : null}
    中国 | analysis : {  original : COUNTRY, normalized : null}
    山东省 | analysis : {  original : STATE_OR_PROVINCE, normalized : null}
    东营市 | analysis : {  original : CITY, normalized : null}
    东营区 | analysis : {  original : FACILITY, normalized : null}
    永乐 | analysis : {  original : FACILITY, normalized : null}
    机场 | analysis : {  original : FACILITY, normalized : null}
    附近 | analysis : {  original : O, normalized : null}
    承包 | analysis : {  original : O, normalized : null}
    了 | analysis : {  original : O, normalized : null}
    一千 | analysis : {  original : NUMBER, normalized : 1000}
    亩 | analysis : {  original : O, normalized : null}
    盐 | analysis : {  original : O, normalized : null}
    碱地 | analysis : {  original : O, normalized : null}
    , | analysis : {  original : O, normalized : null}
    开始 | analysis : {  original : O, normalized : null}
    种植 | analysis : {  original : O, normalized : null}
    棉花 | analysis : {  original : O, normalized : null}
    , | analysis : {  original : O, normalized : null}
    年产量 | analysis : {  original : O, normalized : null}
    达到 | analysis : {  original : O, normalized : null}
    一万 | analysis : {  original : NUMBER, normalized : 10000}
    吨 | analysis : {  original : O, normalized : null}
    , | analysis : {  original : O, normalized : null}
    哈哈 | analysis : {  original : O, normalized : null}
    , | analysis : {  original : O, normalized : null}
    反正 | analysis : {  original : O, normalized : null}
    棣琦 | analysis : {  original : PERSON, normalized : null}
    说 | analysis : {  original : O, normalized : null}
    的 | analysis : {  original : O, normalized : null}
    是 | analysis : {  original : O, normalized : null}
    假 | analysis : {  original : O, normalized : null}
    的 | analysis : {  original : O, normalized : null}
    , | analysis : {  original : O, normalized : null}
    逗 | analysis : {  original : O, normalized : null}
    你 | analysis : {  original : O, normalized : null}
    玩儿 | analysis : {  original : O, normalized : null}
    , | analysis : {  original : O, normalized : null}
    明天 | analysis : {  original : DATE, normalized : XXXX-XX-XX}
    下午 | analysis : {  original : TIME, normalized : null}
    2点 | analysis : {  original : TIME, normalized : null}
    来 | analysis : {  original : O, normalized : null}
    我 | analysis : {  original : O, normalized : null}
    家 | analysis : {  original : O, normalized : null}
    吃饭 | analysis : {  original : O, normalized : null}
    吧 | analysis : {  original : O, normalized : null}
    。 | analysis : {  original : O, normalized : null}
    棣琦 | analysis : {  original : PERSON, normalized : null}
    是 | analysis : {  original : O, normalized : null}
    山东 | analysis : {  original : ORGANIZATION, normalized : null}
    大学 | analysis : {  original : ORGANIZATION, normalized : null}
    毕业 | analysis : {  original : O, normalized : null}
    的 | analysis : {  original : O, normalized : null}
    , | analysis : {  original : O, normalized : null}
    目前 | analysis : {  original : DATE, normalized : null}
    在 | analysis : {  original : O, normalized : null}
    百度 | analysis : {  original : ORGANIZATION, normalized : null}
    做 | analysis : {  original : O, normalized : null}
    java | analysis : {  original : O, normalized : null}
    开发 | analysis : {  original : O, normalized : null}
    , | analysis : {  original : O, normalized : null}
    位置 | analysis : {  original : O, normalized : null}
    是 | analysis : {  original : O, normalized : null}
    东北 | analysis : {  original : LOCATION, normalized : null}
    旺 | analysis : {  original : O, normalized : null}
    东路 | analysis : {  original : O, normalized : null}
    102 | analysis : {  original : NUMBER, normalized : 102}
    号院 | analysis : {  original : O, normalized : null}
    , | analysis : {  original : O, normalized : null}
    手机号 | analysis : {  original : O, normalized : null}
    143667788 | analysis : {  original : NUMBER, normalized : 14366778890}
    90 | analysis : {  original : NUMBER, normalized : 14366778890}
    The analysis took 819 ms
    
    Process finished with exit code 0

    As you can see, starting the whole pipeline takes quite a while, and the analysis itself is not cheap either: 819 ms for this text.
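
    Most of that cost is one-time model loading, so in a real service you would build the pipeline once and reuse it across requests; only annotate() belongs on the hot path. A minimal sketch of that pattern (the NlpHolder class is my own naming, not part of CoreNLP):

    import java.io.IOException;
    import java.util.Properties;

    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    // Sketch: load the expensive pipeline once in a static holder and reuse it.
    public final class NlpHolder {
        public static final StanfordCoreNLP PIPELINE;

        static {
            Properties props = new Properties();
            try {
                props.load(NlpHolder.class.getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
            } catch (IOException e) {
                throw new ExceptionInInitializerError(e);
            }
            PIPELINE = new StanfordCoreNLP(props);
        }

        private NlpHolder() {
        }
    }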

    The results are also not entirely accurate, and they differ somewhat from what the official online demo returns for the same input.
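
    One last note: if what you ultimately want is whole entity spans rather than per-token tags (e.g. 中国科学院 as a single ORGANIZATION instead of two tokens), the ner annotator in recent CoreNLP versions also builds entity mentions. A hedged sketch, reusing the document object and imports from the code above; if your version leaves the list null, fall back to the token loop:

    // Sketch: read whole entity mentions instead of token-level tags.
    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
        List<CoreMap> mentions = sentence.get(CoreAnnotations.MentionsAnnotation.class);
        if (mentions == null) {
            continue; // this CoreNLP version did not build entity mentions
        }
        for (CoreMap mention : mentions) {
            System.out.println(mention.get(CoreAnnotations.TextAnnotation.class)
                    + " -> " + mention.get(CoreAnnotations.NamedEntityTagAnnotation.class));
        }
    }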
