Sentence Segmentation
The first step is to obtain some data that has already been segmented into sentences, and convert it into a form suitable for feature extraction.
import nltk

# tokens is the merged list of word tokens, and boundaries records the
# indexes of sentence-final tokens.
sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)

# Feature extraction function for a candidate punctuation token at index i.
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

# Build the feature sets: one example per '.', '?' or '!' token,
# labeled True if that token index is a sentence boundary.
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens) - 1)
               if tokens[i] in '.?!']
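To make the feature representation concrete, here is a hypothetical example; the token list below is invented for illustration and is not taken from the corpus:

toy = ['The', 'dog', 'barked', '.', 'It', 'ran', 'away', '.']
print(punct_features(toy, 3))
# {'next-word-capitalized': True, 'prevword': 'barked', 'punct': '.', 'prev-word-is-one-char': False}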
The second step is to split this set and use it to train a punctuation classifier:
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.97419354838709682
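To inspect what the model has learned, the Naive Bayes classifier can list its most informative features (output omitted here, since it depends on the trained model):

>>> classifier.show_most_informative_features(5)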
We can use this classifier to build a sentence segmenter:
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # i < len(words)-1: punct_features needs to look at the next token
        if word in '.?!' and i < len(words) - 1 and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])
    return sents
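Hypothetical usage (the word list is invented; the actual split depends on what the trained classifier predicts):

words = ['He', 'left', '.', 'Mr', '.', 'Brown', 'stayed', '.']
print(segment_sentences(words))
# ideally something like [['He', 'left', '.'], ['Mr', '.', 'Brown', 'stayed', '.']]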
Identifying Dialogue Act Types (recognizing the dialogue act underlying an utterance)
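This task treats each utterance as an instance to be labeled with a dialogue act such as Statement, Greet, or Question. A minimal sketch, assuming NLTK's nps_chat corpus, whose posts are annotated with dialogue act classes:

posts = nltk.corpus.nps_chat.xml_posts()[:10000]

# Bag-of-words features: which words the post contains.
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))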
Recognizing Textual Entailment (RTE)
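RTE asks whether a short "text" entails a "hypothesis", and can be cast as classification over overlap features between the two. A minimal sketch, assuming NLTK's built-in RTEFeatureExtractor and RTE corpus reader:

def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    # word/named-entity overlap, and hypothesis material missing from the text
    return {'word_overlap':   len(extractor.overlap('word')),
            'word_hyp_extra': len(extractor.hyp_extra('word')),
            'ne_overlap':     len(extractor.overlap('ne')),
            'ne_hyp_extra':   len(extractor.hyp_extra('ne'))}

rte_pairs = nltk.corpus.rte.pairs(['rte3_dev.xml'])
featuresets = [(rte_features(pair), pair.value) for pair in rte_pairs]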
Scaling Up to Large Datasets
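With large training sets, NLTK's pure-Python classifiers can become slow. One option (an assumption of this sketch, not something the original notes specify) is to wrap a scikit-learn estimator with nltk.classify.SklearnClassifier, reusing the same (featureset, label) pairs:

from nltk.classify import SklearnClassifier
from sklearn.linear_model import LogisticRegression

# Train a scikit-learn model on NLTK-style featuresets.
sk_classifier = SklearnClassifier(LogisticRegression(max_iter=1000)).train(train_set)
print(nltk.classify.accuracy(sk_classifier, test_set))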