  • Python Natural Language Processing Study Notes (41): 5.2 Tagged Corpora

    5.2   Tagged Corpora

    Representing Tagged Tokens

    By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token, using the function str2tuple():

    >>> tagged_token = nltk.tag.str2tuple('fly/NN')

    >>> tagged_token

    ('fly', 'NN')

    >>> tagged_token[0]

    'fly'

    >>> tagged_token[1]

    'NN'

    We can construct a list of tagged tokens directly from a string. The first step is to tokenize the string to access the individual word/tag strings, and then to convert each of these into a tuple (using str2tuple()).

    >>> sent = '''

    ... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN

    ... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC

    ... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS

    ... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB

    ... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT

    ... interest/NN of/IN both/ABX governments/NNS ''/'' ./.

    ... '''

    >>> [nltk.tag.str2tuple(t) for t in sent.split()]

    [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),

    ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
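    For the reverse direction there is a companion helper, tuple2str() (assumed here to sit alongside str2tuple() in nltk.tag), which joins a (word, tag) pair back into the word/tag string form. A minimal Python 2-style sketch:

    import nltk

    tagged_token = nltk.tag.str2tuple('fly/NN')
    # tuple2str() undoes str2tuple(): it glues the token and its tag back
    # together with the default '/' separator.
    print nltk.tag.tuple2str(tagged_token)   # prints fly/NN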

    Reading Tagged Corpora

    Several of the corpora included with NLTK have been tagged for their part-of-speech. Here's an example of what you might see if you opened a file from the Brown Corpus with a text editor:

    The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

    Other corpora use a variety of formats for storing part-of-speech tags. NLTK's corpus readers provide a uniform interface so that you don't have to be concerned with the different file formats. In contrast with the file extract shown above, the corpus reader for the Brown Corpus represents the data as shown below. Note that part-of-speech tags have been converted to uppercase, since this has become standard practice since the Brown Corpus was published.

    >>> nltk.corpus.brown.tagged_words()

    [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]

    >>> nltk.corpus.brown.tagged_words(simplify_tags=True)

    [('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]
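    These notes follow the first edition of the NLTK book, where tagged_words() accepts simplify_tags=True. In NLTK 3 that keyword was removed and a tagset argument plays a similar role, though its simplified labels differ (e.g. NOUN, VERB, DET in the Universal Tagset). A hedged sketch for newer releases:

    import nltk

    # NLTK 3.x: request the Universal Tagset instead of simplify_tags=True.
    # The mapping is coarser and uses different labels than the older
    # simplified tagset shown in this chapter.
    universal = nltk.corpus.brown.tagged_words(tagset='universal')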

    Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method. Here are some more examples, again using the output format illustrated for the Brown Corpus:

    >>> print nltk.corpus.nps_chat.tagged_words()

    [('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]

    >>> nltk.corpus.conll2000.tagged_words()

    [('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]

    >>> nltk.corpus.treebank.tagged_words()

    [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

    Not all corpora employ the same set of tags; see the tagset help functionality and the readme() methods mentioned above for documentation. Initially we want to avoid the complications of these tagsets, so we use a built-in mapping to a simplified tagset:

    >>> nltk.corpus.brown.tagged_words(simplify_tags=True)

    [('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]

    >>> nltk.corpus.treebank.tagged_words(simplify_tags=True)

    [('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]
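    As mentioned above, each corpus documents its own tagset. A short sketch of the help facilities (assuming a reasonably recent NLTK with the tagsets data package installed):

    import nltk

    nltk.help.upenn_tagset('NNP')        # documentation for a Penn Treebank tag
    nltk.help.brown_tagset('NN.*')       # Brown tags matching a regular expression
    print nltk.corpus.treebank.readme()  # the corpus's own README file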

    Tagged corpora for several other languages are distributed with NLTK, including Chinese, Hindi, Portuguese, Spanish, Dutch and Catalan. These usually contain non-ASCII text, and Python always displays this in hexadecimal when printing a larger structure such as a list.

    >>> nltk.corpus.sinica_treebank.tagged_words()

    [('\xe4\xb8\x80', 'Neu'), ('\xe5\x8f\x8b\xe6\x83\x85', 'Nad'), ...]

    >>> nltk.corpus.indian.tagged_words()

    [('\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0', 'NN'),

    ('\xe0\xa6\xb8\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xa4\xe0\xa6\xbe\xe0\xa6\xa8', 'NN'),

    ...]

    >>> nltk.corpus.mac_morpho.tagged_words()

    [('Jersei', 'N'), ('atinge', 'V'), ('m\xe9dia', 'N'), ...]

    >>> nltk.corpus.conll2002.tagged_words()

    [('Sao', 'NC'), ('Paulo', 'VMI'), ('(', 'Fpa'), ...]

    >>> nltk.corpus.cess_cat.tagged_words()

    [('El', 'da0ms0'), ('Tribunal_Suprem', 'np0000o'), ...]

    If your environment is set up correctly, with appropriate editors and fonts, you should be able to display individual strings in a human-readable way. For example, Figure 5.1 shows data accessed using nltk.corpus.indian.

    Figure 5.1: POS-Tagged Data from Four Indian Languages: Bangla, Hindi, Marathi, and Telugu
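    As a quick check of your setup, you can print individual word strings rather than the whole list; with a UTF-8 terminal and suitable fonts they render in the original script. A minimal Python 2-style sketch, matching the sessions above:

    import nltk

    # Print a few tagged words from the Indian-languages corpus; in this
    # Python 2-era setup the words are UTF-8 byte strings, so a UTF-8
    # terminal displays them readably.
    for word, tag in nltk.corpus.indian.tagged_words()[:5]:
        print word, tag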

    If the corpus is also segmented into sentences, it will have a tagged_sents() method that divides up the tagged words into sentences rather than presenting them as one big list. This will be useful when we come to developing automatic taggers, as they are trained and tested on lists of sentences, not words.
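    For example, the following sketch (using the Brown Corpus, in the same style as the sessions above) retrieves the first tagged sentence of the news category as a single list of (word, tag) tuples:

    import nltk
    from nltk.corpus import brown

    # Each item of tagged_sents() is one sentence: a list of (word, tag) pairs.
    brown_tagged_sents = brown.tagged_sents(categories='news')
    print brown_tagged_sents[0]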

    A Simplified Part-of-Speech Tagset

    Tagged corpora use many different conventions for tagging words. To help us get started, we will be looking at a simplified tagset (shown in Table 5.1).

    Tag   Meaning             Examples
    ADJ   adjective           new, good, high, special, big, local
    ADV   adverb              really, already, still, early, now
    CNJ   conjunction         and, or, but, if, while, although
    DET   determiner          the, a, some, most, every, no
    EX    existential         there, there's
    FW    foreign word        dolce, ersatz, esprit, quo, maitre
    MOD   modal verb          will, can, would, may, must, should
    N     noun                year, home, costs, time, education
    NP    proper noun         Alison, Africa, April, Washington
    NUM   number              twenty-four, fourth, 1991, 14:24
    PRO   pronoun             he, their, her, its, my, I, us
    P     preposition         on, of, at, with, by, into, under
    TO    the word to         to
    UH    interjection        ah, bang, ha, whee, hmpf, oops
    V     verb                is, has, get, do, make, see, run
    VD    past tense          said, took, told, made, asked
    VG    present participle  making, going, playing, working
    VN    past participle     given, taken, begun, sung
    WH    wh determiner       who, which, when, what, where, how

    Table 5.1: Simplified Part-of-Speech Tagset

    Let's see which of these tags are the most common in the news category of the Brown corpus:

    >>> from nltk.corpus import brown

    >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)

    >>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)

    >>> tag_fd.keys()

    ['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]

    Note

    Your Turn: Plot the above frequency distribution using tag_fd.plot(cumulative=True). What percentage of words are tagged using the first five tags of the above list? (In the news category, roughly 60%.)
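    A minimal sketch answering the question above, assuming the older NLTK interface used throughout these notes (where FreqDist.keys() is ordered by decreasing frequency):

    import nltk
    from nltk.corpus import brown

    brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
    tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
    top_five = tag_fd.keys()[:5]   # frequency-ordered in this NLTK release
    # Fraction of tokens covered by the five most frequent tags;
    # it should come out around 0.6 for the news category.
    print sum(tag_fd[t] for t in top_five) / float(tag_fd.N())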

    We can use these tags to do powerful searches using a graphical POS-concordance tool nltk.app.concordance(). Use it to search for any combination of words and POS tags, e.g. N N N N, hit/VD, hit/VN, or the ADJ man.

    Nouns

    Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb, as shown in Table 5.2.

    Word          After a determiner                            Subject of the verb
    woman         the woman who I saw yesterday ...             the woman sat down
    Scotland      the Scotland I remember as a child ...        Scotland has five million people
    book          the book I bought yesterday ...               this book recounts the colonization of Australia
    intelligence  the intelligence displayed by the child ...   Mary's intelligence impressed her teachers

    Table 5.2: Syntactic Patterns involving some Nouns

    The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland.

    Let's inspect some tagged text to see what parts of speech occur before a noun, with the most frequent ones first. To begin with, we construct a list of bigrams whose members are themselves word-tag pairs such as (('The', 'DET'), ('Fulton', 'NP')) and (('Fulton', 'NP'), ('County', 'N')). Then we construct a FreqDist from the tag parts of the bigrams.

    >>> word_tag_pairs = nltk.bigrams(brown_news_tagged)

    >>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))

    ['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]

     

    In other words, each (a, b) is a bigram of tagged words such as (('The', 'DET'), ('Fulton', 'NP')); whenever b[1] == 'N', we record the tag of the preceding word, a[1].

    This confirms our assertion that nouns occur after determiners and adjectives, including numeral adjectives (tagged as NUM).
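    To quantify this, a short sketch (in the same older-NLTK style as the session above) tabulates how often each tag immediately precedes a noun:

    import nltk
    from nltk.corpus import brown

    brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
    # Count the tag of whatever immediately precedes a noun; DET and ADJ
    # should dominate the table.
    noun_preceders = nltk.FreqDist(a[1] for (a, b) in nltk.bigrams(brown_news_tagged)
                                   if b[1] == 'N')
    noun_preceders.tabulate()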

    Verbs

    Verbs are words that describe events and actions, e.g. fall, eat in Table 5.3. In the context of a sentence, verbs typically express a relation involving the referents of one or more noun phrases.

    Word  Simple           With modifiers and adjuncts (italicized)
    fall  Rome fell        Dot com stocks suddenly fell like a stone
    eat   Mice eat cheese  John ate the pizza with gusto

    Table 5.3: Syntactic Patterns involving some Verbs

    What are the most common verbs in news text? Let's sort all the verbs by frequency:

    >>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)

    >>> word_tag_fd = nltk.FreqDist(wsj)

    >>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]

    ['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V',

    'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD',

    'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V',

    'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]

    Note that the items being counted in the frequency distribution are word-tag pairs. Since words and tags are paired, we can treat the word as a condition and the tag as an event, and initialize a conditional frequency distribution with a list of condition-event pairs. This lets us see a frequency-ordered list of tags given a word:

    >>> cfd1 = nltk.ConditionalFreqDist(wsj)

    >>> cfd1['yield'].keys()

    ['V', 'N']

    >>> cfd1['cut'].keys()

    ['V', 'VD', 'N', 'VN']

    We can reverse the order of the pairs, so that the tags are the conditions, and the words are the events. Now we can see likely words for a given tag:

    >>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)

    >>> cfd2['VN'].keys()

    ['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold',

    'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]

    To clarify the distinction between VD (past tense) and VN (past participle), let's find words which can be both VD and VN, and see some surrounding text:

    >>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]

    ['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...]

    >>> idx1 = wsj.index(('kicked', 'VD'))

    >>> wsj[idx1-4:idx1+1]

    [('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', 'ADV'),

    ('kicked', 'VD')]

    >>> idx2 = wsj.index(('kicked', 'VN'))

    >>> wsj[idx2-4:idx2+1]

    [('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', 'VN')]

    In this case, we see that the past participle of kicked is preceded by a form of the auxiliary verb have. Is this generally true?

    Note

    Your Turn: Given the list of past participles specified by cfd2['VN'].keys(), try to collect a list of all the word-tag pairs that immediately precede items in that list.
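    A possible sketch for this exercise, again assuming the older NLTK interface used in the session above (frequency-ordered keys()):

    import nltk

    wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
    cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
    participles = set(cfd2['VN'].keys())

    # Collect the word-tag pairs that immediately precede a word from the
    # VN list, and inspect the most frequent ones.
    preceders = nltk.FreqDist(a for (a, b) in nltk.bigrams(wsj)
                              if b[0] in participles)
    print preceders.keys()[:10]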

    Adjectives and Adverbs

    Two other important word classes are adjectives and adverbs. Adjectives describe nouns, and can be used as modifiers (e.g. large in the large pizza), or in predicates (e.g. the pizza is large). English adjectives can have internal structure (e.g. fall+ing in the falling stocks). Adverbs modify verbs to specify the time, manner, place or direction of the event described by the verb (e.g. quickly in the stocks fell quickly). Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).

    English has several categories of closed class words in addition to prepositions, such as articles (also often called determiners) (e.g., the, a), modals (e.g., should, may), and personal pronouns (e.g., she, they). Each dictionary and grammar classifies these words differently.

    Note

    Your Turn: If you are uncertain about some of these parts of speech, study them using nltk.app.concordance(), or watch some of the Schoolhouse Rock! grammar videos available at YouTube, or consult the Further Reading section at the end of this chapter.

    Unsimplified Tags

    Let's find the most frequent nouns of each noun part-of-speech type. The program in Example 5.2 finds all tags starting with NN, and provides a few example words for each one. You will see that there are many variants of NN; the most important contain $ for possessive nouns, S for plural nouns (since plural nouns typically end in s) and P for proper nouns. In addition, most of the tags have suffix modifiers: -NC for citations, -HL for words in headlines and -TL for titles (a feature of the Brown tagset).

    def findtags(tag_prefix, tagged_text):

        cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text

                                      if tag.startswith(tag_prefix))

        return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

    >>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))

    >>> for tag in sorted(tagdict):

    ...     print tag, tagdict[tag]

    ...

    NN ['year', 'time', 'state', 'week', 'man']

    NN$ ["year's", "world's", "state's", "nation's", "company's"]

    NN$-HL ["Golf's", "Navy's"]

    NN$-TL ["President's", "University's", "League's", "Gallery's", "Army's"]

    NN-HL ['cut', 'Salary', 'condition', 'Question', 'business']

    NN-NC ['eva', 'ova', 'aya']

    NN-TL ['President', 'House', 'State', 'University', 'City']

    NN-TL-HL ['Fort', 'City', 'Commissioner', 'Grove', 'House']

    NNS ['years', 'members', 'people', 'sales', 'men']

    NNS$ ["children's", "women's", "men's", "janitors'", "taxpayers'"]

    NNS$-HL ["Dealers'", "Idols'"]

    NNS$-TL ["Women's", "States'", "Giants'", "Officers'", "Bombers'"]

    NNS-HL ['years', 'idols', 'Creations', 'thanks', 'centers']

    NNS-TL ['States', 'Nations', 'Masters', 'Rules', 'Communists']

    NNS-TL-HL ['Nations']

    Example 5.2 (code_findtags.py):  Program to Find the Most Frequent Noun Tags

    When we come to constructing part-of-speech taggers later in this chapter, we will use the unsimplified tags.

    Exploring Tagged Corpora

    Let's briefly return to the kinds of exploration of corpora we saw in previous chapters, this time exploiting POS tags.

    Suppose we're studying the word often and want to see how it is used in text. We could ask to see the words that follow often:

    >>> brown_learned_text = brown.words(categories='learned')

    >>> sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))

    [',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming',

    'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', ...]

    However, it's probably more instructive to use the tagged_words() method to look at the part-of-speech tag of the following words:

    >>> brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)

    >>> tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']

    >>> fd = nltk.FreqDist(tags)

    >>> fd.tabulate()

     VN    V   VD DET ADJ ADV    P CNJ    ,   TO   VG   WH VBZ    .

     15   12    8    5    5    4    4    3    3    1    1    1    1    1

    Notice that the highest-frequency parts of speech following often are verbs. Nouns never appear in this position (in this particular corpus).

    Next, let's look at some larger context, and find words involving particular sequences of tags and words (in this case "<Verb> to <Verb>"). In Example 5.3 we consider each three-word window in the sentence and check whether it meets our criterion. If the tags match, we print the corresponding words.

    from nltk.corpus import brown

    def process(sentence):

        for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(sentence):

            if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):

                print w1, w2, w3

    >>> for tagged_sent in brown.tagged_sents():

    ...     process(tagged_sent)

    ...

    combined to achieve

    continue to place

    serve to protect

    wanted to wait

    allowed to place

    expected to become

    ...

    Example 5.3 (code_three_word_phrase.py): Searching for Three-Word Phrases Using POS Tags

    Finally, let's look for words that are highly ambiguous as to their part of speech tag. Understanding why such words are tagged as they are in each context can help us clarify the distinctions between the tags.

    >>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)

    >>> data = nltk.ConditionalFreqDist((word.lower(), tag)

    ...                                 for (word, tag) in brown_news_tagged)

    >>> for word in data.conditions():

    ...     if len(data[word]) > 3:

    ...         tags = data[word].keys()

    ...         print word, ' '.join(tags)

    ...

    best ADJ ADV NP V

    better ADJ ADV V DET

    close ADV ADJ V N

    cut V N VN VD

    even ADV DET ADJ V

    grant NP N V -

    hit V VD VN N

    lay ADJ V NP VD

    left VD ADJ N VN

    like CNJ V ADJ P -

    near P ADV ADJ DET

    open ADJ V N ADV

    past N ADJ DET P

    present ADJ ADV V N

    read V VN VD NP

    right ADJ N DET ADV

    second NUM ADV DET N

    set VN V VD N -

    that CNJ V WH DET

    Note

    Your Turn: Open the POS concordance tool nltk.app.concordance() and load the complete Brown Corpus (simplified tagset). Now pick some of the above words and see how the tag of the word correlates with the context of the word. E.g. search for near to see all forms mixed together, near/ADJ to see it used as an adjective, near N to see just those cases where a noun follows, and so forth.
