  • [Natural Language Processing with Python] 7.3 Developing and Evaluating Chunkers

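    Reading IOB Format and the CoNLL-2000 Chunking Corpus

    The CoNLL-2000 corpus contains text that has already been part-of-speech tagged and chunked using IOB notation.

    The corpus provides three chunk types: NP, VP, and PP.

    For example:

    he PRP B-NP
    accepted VBD B-VP
    the DT B-NP
    position NN I-NP
    ...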
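
    To make the IOB notation concrete, the following sketch groups (word, POS, IOB tag) triples into chunks in plain Python. The helper name iob_to_chunks is hypothetical and not part of NLTK; it only illustrates that a B- tag opens a chunk, an I- tag continues it, and an O tag lies outside any chunk.

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob_tag) triples into (chunk_type, [words]) pairs.

    Hypothetical helper for illustration; tokens tagged O are skipped.
    """
    chunks = []
    current = None  # the chunk being built, or None when outside a chunk
    for word, pos, tag in triples:
        if tag.startswith("B-"):
            # B- always starts a new chunk of the named type
            current = (tag[2:], [word])
            chunks.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # I- extends the chunk that is currently open
            current[1].append(word)
        else:
            # O tag, or a stray I- tag with no matching open chunk
            # (this sketch simply drops such tokens)
            current = None
    return chunks


triples = [
    ("he", "PRP", "B-NP"),
    ("accepted", "VBD", "B-VP"),
    ("the", "DT", "B-NP"),
    ("position", "NN", "I-NP"),
]
print(iob_to_chunks(triples))
```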

    The function nltk.chunk.conllstr2tree() builds a tree representation from a string in this format.

    For example:

    >>> import nltk
    >>> text = '''
    ... he PRP B-NP
    ... accepted VBD B-VP
    ... the DT B-NP
    ... position NN I-NP
    ... of IN B-PP
    ... vice NN B-NP
    ... chairman NN I-NP
    ... of IN B-PP
    ... Carlyle NNP B-NP
    ... Group NNP I-NP
    ... , , O
    ... a DT B-NP
    ... merchant NN I-NP
    ... banking NN I-NP
    ... concern NN I-NP
    ... . . O
    ... '''
    >>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

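    Running this displays the following tree:

    [Figure: chunk tree for the sentence above, with only the NP chunks grouped]

    We can work with the CoNLL-2000 chunking corpus as follows: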

    # Access the chunked corpus files
    >>> from nltk.corpus import conll2000
    >>> print(conll2000.chunked_sents('train.txt')[99])
    (S
        (PP Over/IN)
        (NP a/DT cup/NN)
        (PP of/IN)
        (NP coffee/NN)
        ,/,
        (NP Mr./NNP Stone/NNP)
        (VP told/VBD)
        (NP his/PRP$ story/NN)
        ./.)
    # If you are only interested in NP chunks, select them with chunk_types
    >>> print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
    (S
        Over/IN
        (NP a/DT cup/NN)
        of/IN
        (NP coffee/NN)
        ,/,
        (NP Mr./NNP Stone/NNP)
        told/VBD
        (NP his/PRP$ story/NN)
        ./.)

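    Simple Evaluation and Baselines

    # A regular-expression chunker that groups tags starting with C, D, J, N, or P
    # (for example CD, DT, JJ, NN) into NP chunks, evaluated on the test set
    >>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    >>> grammar = r"NP: {<[CDJNP].*>+}"
    >>> cp = nltk.RegexpParser(grammar)
    >>> print(cp.evaluate(test_sents))
    ChunkParse score:
    IOB Accuracy: 87.7%
    Precision: 70.6%
    Recall: 67.8%
    F-Measure: 69.2%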
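
    The Precision, Recall, and F-Measure lines compare the chunks the parser guessed with the chunks in the gold standard. Below is a minimal sketch of that arithmetic, assuming chunks are represented as sets of (start, end) token spans. The function name chunk_scores and the toy spans are illustrative only and are not NLTK's API.

```python
def chunk_scores(guessed, correct):
    """Return (precision, recall, f_measure) for two sets of chunk spans."""
    hits = len(guessed & correct)  # chunks found with exactly the right span
    precision = hits / len(guessed) if guessed else 0.0
    recall = hits / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data (assumed for illustration): (start, end) offsets of NP chunks.
gold = {(0, 1), (2, 4), (5, 7), (8, 10)}
guess = {(0, 1), (2, 4), (5, 6)}
print(chunk_scores(guess, gold))

# The same formula applied to the precision and recall reported above
print(round(2 * 0.706 * 0.678 / (0.706 + 0.678), 3))
```

    Plugging the reported precision (70.6%) and recall (67.8%) into the harmonic mean gives about 69.2%, which agrees with the F-Measure line.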

    We can build a chunker from a unigram tagger.

    # We define a chunker class with a constructor and a parse method that chunks new sentences
    Example 7-4. Noun phrase chunking with a unigram tagger.
    class UnigramChunker(nltk.ChunkParserI):
        def __init__(self, train_sents):
            # Train on (POS tag, chunk tag) pairs extracted from each chunk tree
            train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                          for sent in train_sents]
            self.tagger = nltk.UnigramTagger(train_data)

        def parse(self, sentence):
            pos_tags = [pos for (word, pos) in sentence]
            tagged_pos_tags = self.tagger.tag(pos_tags)
            chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
            conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                         in zip(sentence, chunktags)]
            return nltk.chunk.conlltags2tree(conlltags)

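    Note how the parse method works:

    1. It takes a part-of-speech tagged sentence as input.

    2. It begins by extracting the part-of-speech tags from that sentence.

    3. It uses self.tagger, the tagger trained in the constructor, to label the part-of-speech tags with IOB chunk tags.

    4. It extracts the chunk tags and combines them with the original sentence.

    5. It converts the result into a chunk tree.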
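
    The first four steps can be sketched in plain Python without NLTK. This is only an illustration under stated assumptions: train_unigram is a hypothetical stand-in for nltk.UnigramTagger that maps each part-of-speech tag to its most frequent chunk tag, unseen tags fall back to "O" (an assumption of this sketch), and step 5, building the tree, is left out.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each POS tag to its most frequent chunk tag (toy unigram model)."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for _word, pos, chunk in sent:
            counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def parse(model, sentence):
    """Tag a POS-tagged sentence (step 1) with IOB chunk tags."""
    pos_tags = [pos for (_word, pos) in sentence]            # step 2
    # step 3: look up a chunk tag per POS tag; "O" for unseen tags (assumption)
    chunk_tags = [model.get(pos, "O") for pos in pos_tags]
    # step 4: recombine the chunk tags with the original words
    return [(word, pos, chunk)
            for (word, pos), chunk in zip(sentence, chunk_tags)]


# Toy training data, invented for illustration
train = [[("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")]]
model = train_unigram(train)
print(parse(model, [("a", "DT"), ("cat", "NN"), ("slept", "VBD")]))
```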

    Once the chunker is defined, train it on the chunking corpus.

    >>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    >>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
    >>> unigram_chunker = UnigramChunker(train_sents)
    >>> print(unigram_chunker.evaluate(test_sents))
    ChunkParse score:
    IOB Accuracy: 92.9%
    Precision: 79.9%
    Recall: 86.8%
    F-Measure: 83.2%
    # To see what it has learned, tag every part-of-speech tag found in the corpus
    >>> postags = sorted(set(pos for sent in train_sents
    ...                      for (word, pos) in sent.leaves()))
    >>> print(unigram_chunker.tagger.tag(postags))
    [('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'),
    (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'),
    ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'),
    ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'),
    ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'),
    ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'),
    ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'),
    ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'),
    ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'),
    ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]

    Similarly, we can build a chunker from a bigram tagger (nltk.BigramTagger).
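
    The BigramChunker class is not defined in this section. A minimal sketch, assuming it mirrors UnigramChunker with nltk.BigramTagger swapped in for nltk.UnigramTagger:

```python
import nltk


class BigramChunker(nltk.ChunkParserI):
    """Assumed to have the same structure as UnigramChunker, but with a bigram tagger."""

    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs extracted from each chunk tree.
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
```

    A bigram tagger conditions each chunk tag on the previous chunk tag as well as the current part-of-speech tag, which is why it can do slightly better than the unigram version.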

    >>> bigram_chunker = BigramChunker(train_sents)
    >>> print(bigram_chunker.evaluate(test_sents))
    ChunkParse score:
    IOB Accuracy: 93.3%
    Precision: 82.3%
    Recall: 86.8%
    F-Measure: 84.5%

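    Training Classifier-Based Chunkers

    The chunkers discussed so far, the regular-expression chunker and the n-gram chunkers, decide what chunks to create entirely on the basis of part-of-speech tags. However, part-of-speech tags are sometimes not enough to determine how a sentence should be chunked.

    For example:

    (3) a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
    b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

    Although the two sentences have identical tag sequences, they are clearly chunked differently.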
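
    A quick way to check this is to strip the words and compare what remains. The sketch below uses a hypothetical helper, pos_sequence, that is not part of NLTK:

```python
def pos_sequence(tagged_text):
    """Return the POS tags from a 'word/TAG word/TAG ...' string."""
    return [token.rsplit("/", 1)[1] for token in tagged_text.split()]


a = "Joey/NN sold/VBD the/DT farmer/NN rice/NN ./."
b = "Nick/NN broke/VBD my/DT computer/NN monitor/NN ./."

# Both sentences reduce to the same tag sequence, so a chunker that
# sees only tags must assign them the same chunk structure.
print(pos_sequence(a))
print(pos_sequence(a) == pos_sequence(b))
```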

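    So we need to use information about the content of the words in addition to the part-of-speech tags.

    One way to incorporate information about the content of words is to use a classifier-based tagger to chunk the sentence.

    The basic code for a classifier-based NP chunker is shown below: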

    # The second class is basically a wrapper around the tagger that turns it into a chunker.
    # During training, it maps the chunk trees in the training corpus to tag sequences;
    # in the parse method, it converts the tag sequence provided by the tagger back into a chunk tree.
    class ConsecutiveNPChunkTagger(nltk.TaggerI):
        def __init__(self, train_sents):
            train_set = []
            for tagged_sent in train_sents:
                untagged_sent = nltk.tag.untag(tagged_sent)
                history = []
                for i, (word, tag) in enumerate(tagged_sent):
                    featureset = npchunk_features(untagged_sent, i, history)
                    train_set.append((featureset, tag))
                    history.append(tag)
            # algorithm='megam' requires the external MEGAM binary to be installed
            self.classifier = nltk.MaxentClassifier.train(
                train_set, algorithm='megam', trace=0)

        def tag(self, sentence):
            history = []
            for i, word in enumerate(sentence):
                featureset = npchunk_features(sentence, i, history)
                tag = self.classifier.classify(featureset)
                history.append(tag)
            return list(zip(sentence, history))

    class ConsecutiveNPChunker(nltk.ChunkParserI):
        def __init__(self, train_sents):
            tagged_sents = [[((w, t), c) for (w, t, c) in
                             nltk.chunk.tree2conlltags(sent)]
                            for sent in train_sents]
            self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

        def parse(self, sentence):
            tagged_sents = self.tagger.tag(sentence)
            conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
            return nltk.chunk.conlltags2tree(conlltags)

    Next, define a feature extraction function:

    >>> def npchunk_features(sentence, i, history):
    ...     word, pos = sentence[i]
    ...     return {"pos": pos}
    >>> chunker = ConsecutiveNPChunker(train_sents)
    >>> print(chunker.evaluate(test_sents))
    ChunkParse score:
    IOB Accuracy: 92.9%
    Precision: 79.9%
    Recall: 86.7%
    F-Measure: 83.2%

    We can improve this classifier-based chunker by adding a feature for the previous part-of-speech tag.

    >>> def npchunk_features(sentence, i, history):
    ...     word, pos = sentence[i]
    ...     if i == 0:
    ...         prevword, prevpos = "<START>", "<START>"
    ...     else:
    ...         prevword, prevpos = sentence[i-1]
    ...     return {"pos": pos, "prevpos": prevpos}
    >>> chunker = ConsecutiveNPChunker(train_sents)
    >>> print(chunker.evaluate(test_sents))
    ChunkParse score:
    IOB Accuracy: 93.6%
    Precision: 81.9%
    Recall: 87.1%
    F-Measure: 84.4%

    Rather than using only the two part-of-speech tags as features, we can also add the content of the current word.

    >>> def npchunk_features(sentence, i, history):
    ...     word, pos = sentence[i]
    ...     if i == 0:
    ...         prevword, prevpos = "<START>", "<START>"
    ...     else:
    ...         prevword, prevpos = sentence[i-1]
    ...     return {"pos": pos, "word": word, "prevpos": prevpos}
    >>> chunker = ConsecutiveNPChunker(train_sents)
    >>> print(chunker.evaluate(test_sents))
    ChunkParse score:
    IOB Accuracy: 94.2%
    Precision: 83.4%
    Recall: 88.6%
    F-Measure: 85.9%

    We can try adding more features to improve the chunker's performance. For example, the code below adds lookahead features, paired features, and a complex contextual feature. The last feature, tags-since-dt, builds a string describing the set of all part-of-speech tags encountered since the most recent determiner.

    >>> def npchunk_features(sentence, i, history):
    ...     word, pos = sentence[i]
    ...     if i == 0:
    ...         prevword, prevpos = "<START>", "<START>"
    ...     else:
    ...         prevword, prevpos = sentence[i-1]
    ...     if i == len(sentence)-1:
    ...         nextword, nextpos = "<END>", "<END>"
    ...     else:
    ...         nextword, nextpos = sentence[i+1]
    ...     return {"pos": pos,
    ...             "word": word,
    ...             "prevpos": prevpos,
    ...             "nextpos": nextpos,
    ...             "prevpos+pos": "%s+%s" % (prevpos, pos),
    ...             "pos+nextpos": "%s+%s" % (pos, nextpos),
    ...             "tags-since-dt": tags_since_dt(sentence, i)}
    >>> def tags_since_dt(sentence, i):
    ...     tags = set()
    ...     for word, pos in sentence[:i]:
    ...         if pos == 'DT':
    ...             tags = set()
    ...         else:
    ...             tags.add(pos)
    ...     return '+'.join(sorted(tags))
    >>> chunker = ConsecutiveNPChunker(train_sents)
    >>> print(chunker.evaluate(test_sents))
    ChunkParse score:
    IOB Accuracy: 95.9%
    Precision: 88.3%
    Recall: 90.7%
    F-Measure: 89.5%
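
    To see what the tags-since-dt feature actually produces, the function can be run on its own over a short sentence. The toy sentence below is invented for illustration; the function body follows the definition given in this section.

```python
def tags_since_dt(sentence, i):
    """Join the POS tags seen since the most recent determiner before position i."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == "DT":
            tags = set()  # a determiner resets the context
        else:
            tags.add(pos)
    return "+".join(sorted(tags))


# Toy sentence (assumed for illustration)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD")]
for i, (word, pos) in enumerate(sentence):
    print(word, repr(tags_since_dt(sentence, i)))
```

    Because the tags are collected in a set, the two adjectives contribute a single JJ: the feature value is 'JJ' at "dog" and 'JJ+NN' at "barked".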
  • Original article: https://www.cnblogs.com/createMoMo/p/3109333.html