  • Extracting Information from Text With NLTK

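    Most real-world data is either "unstructured", such as plain txt documents, or "semi-structured", such as HTML, and extracting useful information from it requires dedicated techniques. If all data were "structured", such as XML or a relational database, no special extraction would be needed: you could retrieve whatever you want directly through the metadata.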


    First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.
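    The first three steps (sentence segmentation, tokenization, part-of-speech tagging) can be sketched with NLTK's top-level helpers `nltk.sent_tokenize`, `nltk.word_tokenize` and `nltk.pos_tag`. This is only a minimal sketch: the function name `ie_preprocess` is an illustrative choice, and it assumes the tokenizer and tagger models have already been installed with `nltk.download()`.

```python
import nltk

def ie_preprocess(document):
    """Turn raw text into a list of sentences, each a list of (word, tag) pairs."""
    sentences = nltk.sent_tokenize(document)                # sentence segmentation
    sentences = [nltk.word_tokenize(s) for s in sentences]  # tokenization
    sentences = [nltk.pos_tag(s) for s in sentences]        # part-of-speech tagging
    return sentences
```

    Each element of the returned list has the same shape as the hand-written tagged sentences used in the chunking examples, so it can be passed straight to a chunk parser.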

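    As the passage shows, the information extraction process described here consists of four steps: tokenization, part-of-speech tagging, named entity recognition, and relation recognition. Tokenization and part-of-speech tagging were covered earlier, so let's look closely at how named entity recognition is implemented.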


    The basic technique we will use for entity recognition is chunking, which segments and labels multitoken sequences.


    Noun Phrase Chunking


    One of the most useful sources of information for NP-chunking is part-of-speech tags.

    >>> import nltk
    >>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    >>> grammar = "NP: {<DT>?<JJ>*<NN>}"  # tag pattern: optional determiner, any number of adjectives, one noun
    >>> cp = nltk.RegexpParser(grammar)
    >>> result = cp.parse(sentence)
    >>> print(result)
    (S
      (NP the/DT little/JJ yellow/JJ dog/NN)   # NP chunk: the little yellow dog
      barked/VBD
      at/IN
      (NP the/DT cat/NN))   # NP chunk: the cat
    The approach above uses regular expressions to express a tag pattern, and the parser uses that pattern to find the NP chunks.
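    To make the idea concrete, here is a minimal pure-Python sketch of how a tag pattern can be matched with an ordinary regular expression. It is an illustration of the principle, not NLTK's actual implementation, and the helper names `tag_pattern_to_regex` and `chunk` are hypothetical. The sentence is encoded as one string of `<TAG>` items, the tag pattern is turned into a regex over that string, and each match is mapped back to the words it covers.

```python
import re

def tag_pattern_to_regex(pattern):
    """Convert a tag pattern such as '<DT>?<JJ>*<NN>' into a regex string."""
    pattern = re.sub(r"\s+", "", pattern)  # whitespace in the pattern is ignored

    def convert(match):
        # '.' must not run across tag boundaries, so restrict it to non-bracket chars.
        inner = match.group(1).replace(".", "[^<>]")
        # Group the whole <TAG> item so a trailing ?, * or + applies to all of it.
        return "(?:<(?:%s)>)" % inner

    return re.sub(r"<([^>]+)>", convert, pattern)

def chunk(tagged_sentence, pattern):
    """Return the word sequences whose tags match the tag pattern."""
    regex = re.compile(tag_pattern_to_regex(pattern))
    tag_string = ""
    starts = {}  # character offset in tag_string -> token index
    for i, (_, tag) in enumerate(tagged_sentence):
        starts[len(tag_string)] = i
        tag_string += "<" + tag + ">"
    starts[len(tag_string)] = len(tagged_sentence)

    chunks = []
    for m in regex.finditer(tag_string):
        if m.start() == m.end():
            continue  # skip empty matches produced by fully optional patterns
        first, last = starts[m.start()], starts[m.end()]
        chunks.append([word for word, _ in tagged_sentence[first:last]])
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
np_chunks = chunk(sentence, "<DT>?<JJ>*<NN>")
print(np_chunks)  # [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```

    Unlike `nltk.RegexpParser`, this sketch returns flat word lists rather than a tree and handles only a single pattern.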

    Here is another example. A grammar can contain several tag patterns, and the patterns themselves can be more complex.

    >>> grammar = r"""
    ...   NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
    ...       {<NNP>+}                # chunk sequences of proper nouns
    ... """
    >>> cp = nltk.RegexpParser(grammar)
    >>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
    >>> print(cp.parse(sentence))
    (S
      (NP Rapunzel/NNP)   # NP chunk: Rapunzel
      let/VBD
      down/RP
      (NP her/PP$ long/JJ golden/JJ hair/NN))   # NP chunk: her long golden hair
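    In practice you usually want the chunks themselves rather than the printed tree. The sketch below collects the text of every NP subtree, assuming NLTK 3, where a subtree's label is read with `label()`. The helper name `chunk_phrases` is a hypothetical choice for illustration.

```python
import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and noun
      {<NNP>+}                # sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)

sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
tree = cp.parse(sentence)

def chunk_phrases(tree, label="NP"):
    """Return the text of every subtree that carries the given label."""
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == label)]

noun_phrases = chunk_phrases(tree)
print(noun_phrases)  # ['Rapunzel', 'her long golden hair']
```

    Filtering on the label skips the root `S` node, so only the chunked phrases are returned.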


    >>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')  # find "verb to verb" sequences
    >>> brown = nltk.corpus.brown
    >>> for sent in brown.tagged_sents():
    ...     tree = cp.parse(sent)
    ...     for subtree in tree.subtrees():
    ...         if subtree.label() == 'CHUNK': print(subtree)
    ...
    (CHUNK combined/VBN to/TO achieve/VB)
    (CHUNK continue/VB to/TO place/VB)
    (CHUNK serve/VB to/TO protect/VB)
    (CHUNK wanted/VBD to/TO wait/VB)
    (CHUNK allowed/VBN to/TO place/VB)
    (CHUNK expected/VBN to/TO become/VB)
    ...

