  • Python Natural Language Processing Study Notes (22): 3.6 Normalizing Text

    3.6 Normalizing Text

    In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g., set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want to go further than this and strip off any affixes, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. We discuss each of these in turn. First, we need to define the data we will use in this section:

    >>> import nltk
    >>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
    ... is no basis for a system of government. Supreme executive power derives from
    ... a mandate from the masses, not from some farcical aquatic ceremony."""
    >>> tokens = nltk.word_tokenize(raw)
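
    The lowercase normalization mentioned at the start of this section can be tried on its own, without NLTK. The short word list below is made up for illustration; it shows how lower() collapses case variants into a single vocabulary entry.

```python
# Toy word list standing in for a tokenized text (illustrative only).
words = ["The", "cat", "saw", "the", "Cat", "and", "THE", "dog"]

# Without normalization, case variants count as different vocabulary items.
raw_vocab = set(words)

# With lower(), "The", "the" and "THE" collapse into a single entry.
normalized_vocab = set(w.lower() for w in words)

print(len(raw_vocab), sorted(raw_vocab))
print(len(normalized_vocab), sorted(normalized_vocab))
```

    The vocabulary shrinks because the three spellings of "the" and the two spellings of "cat" are each counted once.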

     

    Stemmers

    NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer, you should use one of these in preference to crafting your own using regular expressions, since NLTK’s stemmers handle a wide range of irregular cases. The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), whereas the Lancaster stemmer does not. (Note: in my view, neither stemmer handles basis well; Porter produces basi and Lancaster produces bas.)

    >>> porter = nltk.PorterStemmer()
    >>> lancaster = nltk.LancasterStemmer()
    >>> [porter.stem(t) for t in tokens]
    ['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
    'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
    '.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
    'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
    >>> [lancaster.stem(t) for t in tokens]
    ['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
    'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
    'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
    'from', 'som', 'farc', 'aqu', 'ceremony', '.']
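
    To see why a hand-crafted stemmer tends to fall short, here is a minimal sketch of a regular-expression suffix stripper. The function name naive_stem and its suffix list are assumptions made for this illustration; they are not part of NLTK.

```python
import re

# Hypothetical suffix list; a real stemmer uses many ordered, conditional rules.
SUFFIX_PATTERN = re.compile(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$')

def naive_stem(word):
    """Strip one common suffix, with no knowledge of irregular forms."""
    stem, _suffix = SUFFIX_PATTERN.match(word).groups()
    return stem

for word in ['lying', 'women', 'basis', 'swords']:
    print(word, '->', naive_stem(word))
```

    This sketch reduces lying to ly rather than lie and leaves women untouched, which is the kind of irregular case the off-the-shelf stemmers are designed to cope with.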

     

    Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. The Porter Stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words (illustrated in Example 3-1, which uses object-oriented programming techniques that are outside the scope of this book, string formatting techniques to be covered in Section 3.9, and the enumerate() function to be explained in Section 4.2).

     

    Example 3-1. Indexing a text using a stemmer.

    class IndexedText(object):

        def __init__(self, stemmer, text):
            self._text = text
            self._stemmer = stemmer
            # Map each lowercased stem to the positions where it occurs.
            self._index = nltk.Index((self._stem(word), i)
                                     for (i, word) in enumerate(text))

        def concordance(self, word, width=40):
            key = self._stem(word)
            wc = width // 4              # words of context
            for i in self._index[key]:
                lcontext = ' '.join(self._text[i-wc:i])
                rcontext = ' '.join(self._text[i:i+wc])
                ldisplay = '%*s' % (width, lcontext[-width:])
                rdisplay = '%-*s' % (width, rcontext[:width])
                print(ldisplay, rdisplay)

        def _stem(self, word):
            return self._stemmer.stem(word).lower()

    >>> porter = nltk.PorterStemmer()
    >>> grail = nltk.corpus.webtext.words('grail.txt')
    >>> text = IndexedText(porter, grail)
    >>> text.concordance('lie')
    r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
     beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
           Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
    doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
    ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
       you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
    h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
    not stop our fight ' til each one of you lies dead , and the Holy Grail returns t
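
    The indexing idea behind Example 3-1 can also be sketched without NLTK. nltk.Index behaves like a dictionary that maps each key to a list of values, so the sketch below uses collections.defaultdict for the same purpose. toy_stem is a hypothetical stand-in for a real stemmer and only knows the few forms used in this made-up sentence.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer such as Porter."""
    word = word.lower()
    irregular = {'lying': 'lie', 'lies': 'lie'}  # hand-written for this demo
    return irregular.get(word, word)

def build_index(tokens, stem=toy_stem):
    """Map each stem to the list of token positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[stem(token)].append(position)
    return index

tokens = 'You may lie here while the cave lies north of women lying in ponds'.split()
index = build_index(tokens)

# Searching for any form of the word finds every form in the text.
for position in index[toy_stem('Lies')]:
    print(position, tokens[position])
```

    Because both the indexed words and the query are stemmed with the same function, a search for Lies also retrieves lie and lying.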

     

    Lemmatization

    The WordNet lemmatizer removes affixes only if the resulting word is in its dictionary. This additional checking process makes the lemmatizer slower than the stemmers just mentioned. Notice that it doesn’t handle lying, but it converts women to woman.

    >>> wnl = nltk.WordNetLemmatizer()
    >>> [wnl.lemmatize(t) for t in tokens]
    ['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
    'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
    'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
    'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
    'aquatic', 'ceremony', '.']

    The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas (or lexicon headwords).
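
    The dictionary check described in this section can be illustrated with a toy lemmatizer. This is only a sketch under stated assumptions: the lexicon, the exception table, and the suffix rules are all hypothetical and tiny, and the code does not reproduce how WordNet actually looks words up.

```python
# Hypothetical miniature lexicon; WordNet's real dictionary is far larger.
LEXICON = {'woman', 'sword', 'pond', 'mass', 'basis', 'ceremony'}

# Irregular plurals that suffix stripping cannot recover (illustrative).
EXCEPTIONS = {'women': 'woman'}

# (suffix, replacement) pairs tried in order; an assumption for this sketch.
SUFFIX_RULES = [('ies', 'y'), ('es', ''), ('s', '')]

def toy_lemmatize(word):
    """Return a dictionary form if one is found, else the word unchanged."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in LEXICON:
        return word  # already a known headword, e.g. 'basis'
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in LEXICON:  # accept only known words
                return candidate
    return word

for word in ['women', 'swords', 'masses', 'basis', 'ceremonies', 'lying']:
    print(word, '->', toy_lemmatize(word))
```

    Unlike a stemmer, this sketch never emits a form such as basi, because a candidate is accepted only when it appears in the lexicon. Words it cannot resolve, such as lying here, are returned unchanged.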

     

    Another normalization task involves identifying non-standard words, including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks.
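
    A minimal sketch of this kind of mapping, using regular expressions, might look as follows. The patterns and the function name normalize_token are assumptions for illustration; in particular, treating any run of two or more capital letters as an acronym is a crude heuristic.

```python
import re

DECIMAL = re.compile(r'\d+\.\d+')   # e.g. 3.14
ACRONYM = re.compile(r'[A-Z]{2,}')  # e.g. NASA; two or more capital letters

def normalize_token(token):
    """Map decimal numbers to '0.0' and acronyms to 'AAA'."""
    if DECIMAL.fullmatch(token):
        return '0.0'
    if ACRONYM.fullmatch(token):
        return 'AAA'
    return token

tokens = ['NASA', 'paid', '3.14', 'and', '12.5', 'to', 'IBM', 'and', 'the', 'UN']
normalized = [normalize_token(t) for t in tokens]

print(normalized)
print(len(set(tokens)), '->', len(set(normalized)))
```

    Note that this acronym pattern would also rewrite ordinary words written in capitals, such as the speaker name DENNIS in the sample text, so a practical version would need a more careful test.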

     

  • Original article: https://www.cnblogs.com/yuxc/p/2129696.html