
TokensRegex in Stanford CoreNLP

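I have recently been working on natural language understanding for music and reading content, so I evaluated Stanford CoreNLP for the job. These are my notes.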

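Features

Stanford CoreNLP is a suite of natural language analysis tools. It includes:

POS (part-of-speech tagger): tags the part of speech of each word
NER (named entity recognizer): recognizes entity names
Parser: analyzes the grammatical structure of a sentence, e.g. identifying phrases and the subject, verb, and object
Coreference Resolution: finds the words in a text that refer to the same entity; for example, I/my and Nader/he each refer to the same person
Sentiment Analysis
Bootstrapped pattern learning: as far as I understand, it learns extraction patterns with little or no supervision, for example to extract entity names
Open IE (Information Extraction): extracts structured relation tuples from plain text, e.g. "Barack Obama was born in Hawaii" => (Barack Obama; was born in; Hawaii)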

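Requirements

Voice-interaction applications (voice assistants, smart speakers such as the Echo) usually receive colloquial natural language, for example "我想听一个段子" (I want to hear a joke) or "给我来个牛郎织女的故事" (give me the story of the Cowherd and the Weaver Girl). To return precise results, we need to extract the useful topic words: 段子 / 牛郎织女 / 故事. After looking around, I decided to try CoreNLP's TokensRegex, which applies regular expressions over token sequences. It makes several tools available at once: regular expressions, word segmentation, POS tags, and entity classes. You can also define your own entity classes, for instance marking 牛郎织女 as an entity of class READ.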
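To make the goal concrete, here is a rough baseline that pulls topic words out of an utterance with plain java.util.regex. It is not TokensRegex and uses no segmentation, POS tags, or NER; the class name TopicWordBaseline and the short word list are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical baseline: character-level regex only, no CoreNLP involved.
final class TopicWordBaseline {
    // Illustrative word list; a real system would need a much larger one.
    private static final Pattern TOPIC = Pattern.compile("段子|故事|笑话|牛郎织女");

    // Returns every topic word found in the utterance, in order of appearance.
    static List<String> extract(String utterance) {
        List<String> topics = new ArrayList<>();
        Matcher m = TOPIC.matcher(utterance);
        while (m.find()) {
            topics.add(m.group());
        }
        return topics;
    }

    public static void main(String[] args) {
        System.out.println(extract("我想听一个段子"));
        System.out.println(extract("给我来个牛郎织女的故事"));
    }
}
```

A character-level pattern like this cannot tell a whole token from a substring and knows nothing about entity classes, which is the gap TokensRegex is meant to fill.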

Pattern syntax

Rule format

{
  // ruleType is "text", "tokens", "composite", or "filter"
  ruleType: "tokens", // "tokens" matches over the segmented token sequence, "text" matches over the raw string; I have not yet worked out "composite" and "filter"
  
  // pattern to be matched  
  pattern: ( ( [ { ner:PERSON } ]) /was/ /born/ /on/ ([ { ner:DATE } ]) ),

  // value associated with the expression for which the pattern was matched
  // matched expressions are returned with "DATE_OF_BIRTH" as the value
  // (as part of the MatchedExpression class)
  result: "DATE_OF_BIRTH"
}

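Besides the fields above there are also action, name, stage, active, priority, and others; see the references at the end of this post.

When ruleType is tokens, the basic element of the pattern is a token: the whole pattern is wrapped in ( ), each token is written as [<expression>], and each expression takes the form {tag:xx; ner:xx}.

When ruleType is text, the pattern is an ordinary regular expression whose basic element is a character, and the whole pattern is delimited by / /.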
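The difference between the two rule types may be easier to see in a toy model. The sketch below treats a tokens pattern as a list of per-token predicates that must match consecutive tokens. It is an illustration of the idea only, not how CoreNLP implements TokensRegex; the class TokenPatternSketch, the Tok type, and the sample sentence are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy illustration only: this is NOT how CoreNLP implements TokensRegex.
final class TokenPatternSketch {
    // Hypothetical minimal token: surface word plus NER tag.
    static final class Tok {
        final String word;
        final String ner;
        Tok(String word, String ner) { this.word = word; this.ner = ner; }
    }

    // Analogue of [ { ner:X } ]: matches on the token's NER attribute.
    static Predicate<Tok> ner(String tag) { return t -> t.ner.equals(tag); }

    // Analogue of /word/: matches on the token's text.
    static Predicate<Tok> word(String w) { return t -> t.word.equals(w); }

    // Returns the start index of the first position where every predicate
    // matches one consecutive token, or -1 if there is no such position.
    static int find(List<Tok> tokens, List<Predicate<Tok>> pattern) {
        for (int start = 0; start + pattern.size() <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < pattern.size() && ok; i++) {
                ok = pattern.get(i).test(tokens.get(start + i));
            }
            if (ok) return start;
        }
        return -1;
    }

    // Builds a made-up tagged sentence and the PERSON /was/ /born/ /on/ DATE pattern.
    static int demo() {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("Reportedly", "O"));
        tokens.add(new Tok("Alice", "PERSON"));
        tokens.add(new Tok("was", "O"));
        tokens.add(new Tok("born", "O"));
        tokens.add(new Tok("on", "O"));
        tokens.add(new Tok("Monday", "DATE"));

        List<Predicate<Tok>> pattern = new ArrayList<>();
        pattern.add(ner("PERSON"));
        pattern.add(word("was"));
        pattern.add(word("born"));
        pattern.add(word("on"));
        pattern.add(ner("DATE"));
        return find(tokens, pattern);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // index of the PERSON token
    }
}
```

A text rule, by contrast, would see only the character string and could not ask whether a token is tagged PERSON or DATE.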

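Example

CoreNLP can extract with a single regular expression or with several. This post shows how to load rules from a file, catch the text we care about, and extract topic words from it.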

Dependencies

<dependency>
     <groupId>edu.stanford.nlp</groupId>
     <artifactId>stanford-corenlp</artifactId>
     <version>3.4.1</version>
</dependency>
<dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.4.1</version>
      <classifier>models</classifier>
</dependency>
<!-- Chinese language support -->
<dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.6.0</version>
      <classifier>models-chinese</classifier>
</dependency>

Properties file CoreNLP-chinese.properties (the configuration shipped in stanford-corenlp-models-chinese is a useful reference)

annotators = segment, ssplit, pos, ner, regexner, parse
# custom entity mapping file used by regexner
regexner.mapping = regexner.txt

customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator

segment.model = edu/stanford/nlp/models/segmenter/chinese/pku.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# tokens that mark a sentence boundary
ssplit.boundaryTokenRegex = [.]|[!?]+|[。]|[！？]+

pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = false
ner.useSUTime = false

parse.model = edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz

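In CoreNLP, one pass over a text is called a pipeline, and each entry in annotators is one processing stage: segment does word segmentation, ssplit splits a passage into sentences, pos tags parts of speech, ner recognizes named entities, regexner labels entity types using the custom mapping file, and parse analyzes sentence structure. The remaining lines are the properties of the individual annotators.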
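The following sketch reads such a configuration with java.util.Properties alone (CoreNLP is not involved) and lists the annotators in pipeline order. It assumes the file is read with standard Properties semantics, under which a comment is only recognized at the start of a line, so anything trailing a value becomes part of that value. The class name PipelinePropsSketch is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Sketch using only java.util.Properties; CoreNLP itself is not involved.
final class PipelinePropsSketch {
    // Parses properties text held in a string.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the annotator names in the order the pipeline would run them.
    static List<String> annotators(Properties props) {
        return Arrays.asList(props.getProperty("annotators").trim().split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        Properties props = parse(
            "annotators = segment, ssplit, pos, ner, regexner, parse\n"
            + "# a comment must sit on its own line\n"
            + "regexner.mapping = regexner.txt // not a comment\n");
        System.out.println(annotators(props));
        // The trailing text is kept as part of the value.
        System.out.println(props.getProperty("regexner.mapping"));
    }
}
```

For this reason it is safer to put explanatory notes on their own lines starting with #.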

Custom rule files

regexner.txt (labels 牛郎织女 as an entity of class READ)

牛郎织女	READ
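As a rough picture of what this mapping does, the sketch below parses tab-separated lines into a lookup table and labels a token list with it. It is a toy stand-in, not RegexNER's actual implementation (which matches token sequences against regular expressions). The class MappingFileSketch and the sample segmentation, which assumes the segmenter keeps 牛郎织女 as one token, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the mapping lookup; not RegexNER's actual implementation.
final class MappingFileSketch {
    // Parses "text<TAB>TYPE" lines into a lookup table, skipping malformed lines.
    static Map<String, String> parse(String fileContent) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String line : fileContent.split("\n")) {
            String[] cols = line.split("\t");
            if (cols.length >= 2) {
                mapping.put(cols[0].trim(), cols[1].trim());
            }
        }
        return mapping;
    }

    // Labels each token with its mapped type, or "O" when no entry matches.
    static List<String> tag(List<String> tokens, Map<String, String> mapping) {
        List<String> tags = new ArrayList<>();
        for (String token : tokens) {
            tags.add(mapping.getOrDefault(token, "O"));
        }
        return tags;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = parse("牛郎织女\tREAD\n");
        // Assumed segmentation of "来个牛郎织女的故事".
        List<String> tokens = Arrays.asList("来", "个", "牛郎织女", "的", "故事");
        System.out.println(tag(tokens, mapping));
    }
}
```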

rule.txt (TokensRegex rules)

$TYPE="/笑话|故事|段子|口技|谜语|寓言|评书|相声|小品|唐诗|古诗|宋词|绕口令|故事|小说/ | /脑筋/ /急转弯/"
// single type
{
	ruleType: "tokens",
	pattern: ((?$type $TYPE)),
	result: Format("%s;%s;%s", "", $$type.text.replace(" ",""), "")
}

(?$type xx) defines a named group called type. The rule extracts the text of that group and assembles the result in the form xx;xx;xx.
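The shape of the result string can be reproduced in plain Java. The sketch below mirrors the Format call with String.format and then splits the value back into its three fields. It assumes that a multi-token match such as 脑筋 急转弯 arrives with a space between tokens, which is presumably why the rule calls replace. The class name ResultFormatSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java mirror of the rule's result expression; CoreNLP is not involved.
final class ResultFormatSketch {
    // Mirrors Format("%s;%s;%s", "", <group text without spaces>, "").
    static String format(String groupText) {
        return String.format("%s;%s;%s", "", groupText.replace(" ", ""), "");
    }

    // Splits the value back into its three fields; the -1 limit keeps
    // trailing empty fields that String.split would otherwise drop.
    static List<String> fields(String value) {
        return Arrays.asList(value.split(";", -1));
    }

    public static void main(String[] args) {
        System.out.println(format("故事"));          // ;故事;
        System.out.println(format("脑筋 急转弯"));    // ;脑筋急转弯;
        System.out.println(fields(format("故事")).size()); // 3
    }
}
```

With only one group in this rule, the first and third fields are always empty and the topic word is the middle field.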

Code

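import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.List;

// Load the TokensRegex rules
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFile(TokenSequencePattern.getNewEnv(), "rule.txt");
// Create the pipeline
StanfordCoreNLP coreNLP = new StanfordCoreNLP("CoreNLP-chinese.properties");
// Process the text
Annotation annotation = coreNLP.process("听个故事");
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
CoreMap sentence = sentences.get(0); // analysis result for the first sentence
// Run the TokensRegex rules over the sentence
List<MatchedExpression> matchedExpressions = extractor.extractExpressions(sentence);
for (MatchedExpression match : matchedExpressions) {
    System.out.println("Matched expression: " + match.getText() + " with value " + match.getValue());
}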

To inspect the analysis results, such as segmentation, POS tags, and entity labels, you can use the function below.

    private void debug(CoreMap sentence) {
        // Take the CoreLabel list from the CoreMap and print each token
        List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
        System.out.println("Token" + "\t " + "POS" + "\t " + "NER");
        System.out.println("-----------------------------");
        for (CoreLabel token : tokens) {
            String word = token.getString(CoreAnnotations.TextAnnotation.class);
            String pos = token.getString(CoreAnnotations.PartOfSpeechAnnotation.class);
            String ner = token.getString(CoreAnnotations.NamedEntityTagAnnotation.class);
            System.out.println(word + "\t " + pos + "\t " + ner);
        }
    }

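The toolkit is quite powerful: with more building blocks available, there are more ways to approach a problem when one comes up.

References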

TokensRegex: http://nlp.stanford.edu/software/tokensregex.shtml

SequenceMatchRules: http://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html

Regexner: http://nlp.stanford.edu/software/regexner.html


Original article: https://www.cnblogs.com/whuqin/p/5741706.html