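Reading IOB Format and the CoNLL 2000 Chunking Corpus
The CoNLL 2000 corpus is text that has already been part-of-speech tagged and chunked using IOB notation.
The corpus provides three chunk types: NP, VP, and PP.
For example: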
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
...
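To make the IOB notation concrete, here is a minimal sketch in plain Python (no NLTK required) that groups IOB-tagged tokens into chunks. The helper name `iob_to_chunks` and the tuple layout are assumptions made for illustration; they are not part of NLTK.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs.

    B-X begins a chunk of type X, I-X continues it, O is outside any chunk.
    """
    chunks = []
    current = None  # the (chunk_type, words) pair being built, or None
    for word, pos, iob in tagged:
        if iob.startswith("B-"):
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith("I-") and current is not None and current[0] == iob[2:]:
            current[1].append(word)
        else:
            # An O tag, or a stray I- tag with nothing to continue;
            # this sketch simply treats both as outside any chunk.
            current = None
    return chunks


sentence = [
    ("he", "PRP", "B-NP"),
    ("accepted", "VBD", "B-VP"),
    ("the", "DT", "B-NP"),
    ("position", "NN", "I-NP"),
]
chunks = iob_to_chunks(sentence)
print(chunks)
# [('NP', ['he']), ('VP', ['accepted']), ('NP', ['the', 'position'])]
```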
The function nltk.chunk.conllstr2tree() builds a tree representation from a string in this format.
For example:
>>> import nltk
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
The result is shown in the figure:
We can work with the CoNLL 2000 chunking corpus as follows:
# Access the chunked corpus files
>>> from nltk.corpus import conll2000
>>> print(conll2000.chunked_sents('train.txt')[99])
(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)
# If we are only interested in NP chunks, we can write
>>> print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)
Simple Evaluation and Baselines
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> grammar = r"NP: {<[CDJNP].*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%
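The precision, recall, and F-measure figures are computed over chunks rather than over individual tokens. As a rough sketch in plain Python 3 (the function name `chunk_scores` and the chunk spans are made up for illustration, and this is not NLTK's own implementation), each chunk can be represented as a (start, end) token span and scored by set overlap:

```python
def chunk_scores(gold, guessed):
    """Return (precision, recall, f_measure) for two sets of chunk spans."""
    correct = len(gold & guessed)  # chunks with exactly matching spans
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical chunk spans, as (start, end) token offsets
gold = {(0, 1), (2, 4), (5, 6), (7, 9)}
guessed = {(0, 1), (2, 3), (5, 6), (7, 9), (10, 11)}

p, r, f = chunk_scores(gold, guessed)
print(p, r, round(f, 3))

# The F-measure reported above is the harmonic mean of the reported
# precision (70.6%) and recall (67.8%):
f_reported = round(2 * 70.6 * 67.8 / (70.6 + 67.8), 1)
print(f_reported)  # 69.2
```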
We can build a chunker on top of a unigram tagger.
Example 7-4. Noun phrase chunking with a unigram tagger.
# We define a chunker class with a constructor and a parse method
# that is used to chunk new sentences.
class UnigramChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs; the words themselves are dropped
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
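Note the parse method, which works as follows:
1. It takes a POS-tagged sentence as input.
2. It begins by extracting the part-of-speech tags from that sentence.
3. It uses self.tagger, the tagger trained in the constructor, to assign an IOB chunk tag to each part-of-speech tag.
4. It extracts the chunk tags and combines them with the original sentence.
5. It converts the result into a chunk tree.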
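The steps above can be sketched without NLTK. The following is only an illustration under stated assumptions: a plain dictionary from each POS tag to its most frequent chunk tag stands in for the unigram tagger, unseen POS tags default to "O", and the final tree-building step is left out. The names `train_unigram` and `parse` and the toy training data are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(train_sents):
    """Map each POS tag to its most frequent chunk tag in the training data."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for word, pos, chunktag in sent:
            counts[pos][chunktag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def parse(model, sentence):
    # Step 2: extract the POS tags from the tagged sentence
    pos_tags = [pos for word, pos in sentence]
    # Step 3: look up a chunk tag for each POS tag ("O" if unseen: an assumption)
    chunktags = [model.get(pos, "O") for pos in pos_tags]
    # Step 4: recombine the chunk tags with the original (word, pos) pairs
    return [(word, pos, tag) for (word, pos), tag in zip(sentence, chunktags)]


# Toy training data in (word, pos, chunk tag) form
train = [
    [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "O")],
    [("a", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")],
    [("dogs", "NNS", "B-NP"), ("bark", "VBP", "O")],
]
model = train_unigram(train)
result = parse(model, [("the", "DT"), ("dog", "NN"), ("sat", "VBD")])
print(result)
# [('the', 'DT', 'B-NP'), ('dog', 'NN', 'I-NP'), ('sat', 'VBD', 'O')]
```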
Once the chunker is defined, we train it on the chunking corpus.
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
>>> unigram_chunker = UnigramChunker(train_sents)
>>> print(unigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.8%
    F-Measure:     83.2%
# The following code shows what the chunker has learned, by having its
# unigram tagger assign a chunk tag to every POS tag in the corpus
>>> postags = sorted(set(pos for sent in train_sents
...                      for (word, pos) in sent.leaves()))
>>> print(unigram_chunker.tagger.tag(postags))
[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'),
 (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'),
 ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'),
 ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'),
 ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'),
 ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'),
 ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'),
 ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'),
 ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'),
 ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]
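Similarly, we can build a bigram chunker: define a BigramChunker class that is identical to UnigramChunker except that its constructor creates an nltk.BigramTagger instead of an nltk.UnigramTagger.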
>>> bigram_chunker = BigramChunker(train_sents)
>>> print(bigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.3%
    Recall:        86.8%
    F-Measure:     84.5%
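Training Classifier-Based Chunkers
The chunkers discussed so far, regular-expression chunkers and n-gram chunkers, decide what chunks to create based entirely on part-of-speech tags. However, part-of-speech tags are sometimes not enough to determine how a sentence should be chunked.
For example: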
(3) a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
    b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.
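Although the two sentences carry identical tags, they are clearly chunked differently: in (a), "the farmer" and "rice" are separate chunks, whereas in (b) "my computer monitor" is a single chunk.
So we need to use information about the content of the words, in addition to their part-of-speech tags.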
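A small check in plain Python makes the problem explicit. The gold IOB tags below are an assumption that follows the intended reading of the two sentences (NP chunks only); they are not taken from a corpus.

```python
sent_a = [("Joey", "NN"), ("sold", "VBD"), ("the", "DT"),
          ("farmer", "NN"), ("rice", "NN"), (".", ".")]
sent_b = [("Nick", "NN"), ("broke", "VBD"), ("my", "DT"),
          ("computer", "NN"), ("monitor", "NN"), (".", ".")]

# Assumed gold chunk tags: "rice" starts a new NP, "monitor" continues one
gold_a = ["B-NP", "O", "B-NP", "I-NP", "B-NP", "O"]
gold_b = ["B-NP", "O", "B-NP", "I-NP", "I-NP", "O"]

pos_a = [pos for word, pos in sent_a]
pos_b = [pos for word, pos in sent_b]

same_input = pos_a == pos_b    # what a tag-only chunker sees
same_gold = gold_a == gold_b   # what it would need to produce

print(same_input, same_gold)  # True False
```

A deterministic chunker that looks only at part-of-speech tags receives the same input for both sentences and must therefore return the same chunking, so it cannot be correct on both.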
One way to make use of word content is to chunk sentences with a classifier-based tagger.
The basic code for a classifier-based NP chunker is shown below:
# The second class, ConsecutiveNPChunker, is basically a wrapper around the
# tagger class that turns it into a chunker. During training, it maps the
# chunk trees in the training corpus to tag sequences; in the parse method,
# it converts the tag sequence provided by the tagger back into a chunk tree.
class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # algorithm='megam' requires the external megam binary to be installed
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))

class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
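The tag method above classifies tokens one at a time, from left to right, and records each decision in history so that later decisions can see earlier ones. The sketch below illustrates that loop in plain Python. `StubClassifier` is a hypothetical hand-written stand-in for a trained classifier, and the "prevtag" feature is an assumption added for illustration; the feature extractors in this section do not use history.

```python
def features(sentence, i, history):
    """Features for token i; history holds the chunk tags assigned so far."""
    word, pos = sentence[i]
    prevtag = history[-1] if history else "<START>"
    return {"pos": pos, "prevtag": prevtag}


class StubClassifier:
    """Hypothetical stand-in for a trained classifier (hand-written rules)."""

    def classify(self, featureset):
        pos = featureset["pos"]
        if pos in ("DT", "PRP", "PRP$"):
            return "B-NP"
        if pos.startswith("NN") or pos == "JJ":
            # Continue an open NP chunk, otherwise start a new one
            inside = featureset["prevtag"] in ("B-NP", "I-NP")
            return "I-NP" if inside else "B-NP"
        return "O"


def tag(classifier, sentence):
    history = []
    for i in range(len(sentence)):
        history.append(classifier.classify(features(sentence, i, history)))
    return list(zip(sentence, history))


sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("loudly", "RB")]
tagged = tag(StubClassifier(), sentence)
print([chunktag for token, chunktag in tagged])
# ['B-NP', 'I-NP', 'I-NP', 'O', 'O']
```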
Next, we define a feature extractor function:
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.7%
    F-Measure:     83.2%
We can improve this classifier-based tagger by adding a feature for the previous part-of-speech tag.
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.6%
    Precision:     81.9%
    Recall:        87.1%
    F-Measure:     84.4%
Rather than using only the two part-of-speech tags as features, we can also add the content of the current word.
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  94.2%
    Precision:     83.4%
    Recall:        88.6%
    F-Measure:     85.9%
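We can try adding further features to improve the chunker's performance. For example, the code below adds lookahead features, paired features, and complex contextual features. The last feature, tags-since-dt, builds a string describing the set of all part-of-speech tags encountered since the most recent determiner.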
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = "<END>", "<END>"
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,
...             "prevpos+pos": "%s+%s" % (prevpos, pos),
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  95.9%
    Precision:     88.3%
    Recall:        90.7%
    F-Measure:     89.5%
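To see what the tags-since-dt feature looks like in practice, the sketch below repeats the tags_since_dt function so that it runs on its own and applies it to every position of a made-up tagged sentence. Because the tags are collected in a set, repeated tags appear only once in the resulting string.

```python
def tags_since_dt(sentence, i):
    """Join the distinct POS tags seen before position i since the last DT."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == "DT":
            tags = set()  # a determiner resets the collected tags
        else:
            tags.add(pos)
    return "+".join(sorted(tags))


# Made-up example sentence
sentence = [("she", "PRP"), ("saw", "VBD"), ("the", "DT"),
            ("big", "JJ"), ("red", "JJ"), ("barn", "NN")]

values = [tags_since_dt(sentence, i) for i in range(len(sentence))]
for (word, pos), value in zip(sentence, values):
    print("%-4s %-4s %r" % (word, pos, value))
# she  PRP  ''
# saw  VBD  'PRP'
# the  DT   'PRP+VBD'
# big  JJ   ''
# red  JJ   'JJ'
# barn NN   'JJ'
```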