    Natural Language Processing (2): Text Corpora

    1. Accessing text corpora

    This chapter starts with a concrete text corpus, nltk.corpus.gutenberg, and uses that instance to learn how text corpora work. We use help() to check its type:

      1 >>> import nltk
      2 >>> help(nltk.corpus.gutenberg)
      3 Help on PlaintextCorpusReader in module nltk.corpus.reader.plaintext object:
      4 
      5 class PlaintextCorpusReader(nltk.corpus.reader.api.CorpusReader)
      6  |  Reader for corpora that consist of plaintext documents.  Paragraphs
      7  |  are assumed to be split using blank lines.  Sentences and words can
      8  |  be tokenized using the default tokenizers, or by custom tokenizers
      9  |  specificed as parameters to the constructor.
     10  |  
     11  |  This corpus reader can be customized (e.g., to skip preface
     12  |  sections of specific document formats) by creating a subclass and
     13  |  overriding the ``CorpusView`` class variable.
     14  |  
     15  |  Method resolution order:
     16  |      PlaintextCorpusReader
     17  |      nltk.corpus.reader.api.CorpusReader
     18  |      __builtin__.object
     19  |  
     20  |  Methods defined here:
     21  |  
     22  |  __init__(self, root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding=None)
     24  |      Construct a new plaintext corpus reader for a set of documents
     25  |      located at the given root directory.  Example usage:
     26  |      
     27  |          >>> root = '/usr/local/share/nltk_data/corpora/webtext/'
     28  |          >>> reader = PlaintextCorpusReader(root, '.*.txt')
     29  |      
     30  |      :param root: The root directory for this corpus.
     31  |      :param fileids: A list or regexp specifying the fileids in this corpus.
     32  |      :param word_tokenizer: Tokenizer for breaking sentences or
     33  |          paragraphs into words.
     34  |      :param sent_tokenizer: Tokenizer for breaking paragraphs
     35  |          into words.
     36  |      :param para_block_reader: The block reader used to divide the
     37  |          corpus into paragraph blocks.
     38  |  
     39  |  paras(self, fileids=None, sourced=False)
     40  |      :return: the given file(s) as a list of
     41  |          paragraphs, each encoded as a list of sentences, which are
     42  |          in turn encoded as lists of word strings.
     43  |      :rtype: list(list(list(str)))
     44  |  
     45  |  raw(self, fileids=None, sourced=False)
     46  |      :return: the given file(s) as a single string.
     47  |      :rtype: str
     48  |  
     49  |  sents(self, fileids=None, sourced=False)
     50  |      :return: the given file(s) as a list of
     51  |          sentences or utterances, each encoded as a list of word
     52  |          strings.
     53  |      :rtype: list(list(str))
     54  |  
     55  |  words(self, fileids=None, sourced=False)
     56  |      :return: the given file(s) as a list of words
     57  |          and punctuation symbols.
     58  |      :rtype: list(str)
     59  |  
     60  |  ----------------------------------------------------------------------
     61  |  Data and other attributes defined here:
     62  |  
     63  |  CorpusView = <class 'nltk.corpus.reader.util.StreamBackedCorpusView'>
     64  |      A 'view' of a corpus file, which acts like a sequence of tokens:
     65  |      it can be accessed by index, iterated over, etc.  However, the
     66  |      tokens are only constructed as-needed -- the entire corpus is
     67  |      never stored in memory at once.
     68  |      
     69  |      The constructor to ``StreamBackedCorpusView`` takes two arguments:
     70  |      a corpus fileid (specified as a string or as a ``PathPointer``);
     71  |      and a block reader.  A "block reader" is a function that reads
     72  |      zero or more tokens from a stream, and returns them as a list.  A
     73  |      very simple example of a block reader is:
     74  |      
     75  |          >>> def simple_block_reader(stream):
     76  |          ...     return stream.readline().split()
     77  |      
     78  |      This simple block reader reads a single line at a time, and
     79  |      returns a single token (consisting of a string) for each
     80  |      whitespace-separated substring on the line.
     81  |      
     82  |      When deciding how to define the block reader for a given
     83  |      corpus, careful consideration should be given to the size of
     84  |      blocks handled by the block reader.  Smaller block sizes will
     85  |      increase the memory requirements of the corpus view's internal
     86  |      data structures (by 2 integers per block).  On the other hand,
     87  |      larger block sizes may decrease performance for random access to
     88  |      the corpus.  (But note that larger block sizes will *not*
     89  |      decrease performance for iteration.)
     90  |      
     91  |      Internally, ``CorpusView`` maintains a partial mapping from token
     92  |      index to file position, with one entry per block.  When a token
     93  |      with a given index *i* is requested, the ``CorpusView`` constructs
     94  |      it as follows:
     95  |      
     96  |        1. First, it searches the toknum/filepos mapping for the token
     97  |           index closest to (but less than or equal to) *i*.
     98  |      
     99  |        2. Then, starting at the file position corresponding to that
    100  |           index, it reads one block at a time using the block reader
    101  |           until it reaches the requested token.
    102  |      
    103  |      The toknum/filepos mapping is created lazily: it is initially
    104  |      empty, but every time a new block is read, the block's
    105  |      initial token is added to the mapping.  (Thus, the toknum/filepos
    106  |      map has one entry per block.)
    107  |      
    108  |      In order to increase efficiency for random access patterns that
    109  |      have high degrees of locality, the corpus view may cache one or
    111  |      more blocks.
    112  |      
    113  |      :note: Each ``CorpusView`` object internally maintains an open file
    114  |          object for its underlying corpus file.  This file should be
    115  |          automatically closed when the ``CorpusView`` is garbage collected,
    116  |          but if you wish to close it manually, use the ``close()``
    117  |          method.  If you access a ``CorpusView``'s items after it has been
    118  |          closed, the file object will be automatically re-opened.
    119  |      
    120  |      :warning: If the contents of the file are modified during the
    121  |          lifetime of the ``CorpusView``, then the ``CorpusView``'s behavior
    122  |          is undefined.
    123  |      
    124  |      :warning: If a unicode encoding is specified when constructing a
    125  |          ``CorpusView``, then the block reader may only call
    126  |          ``stream.seek()`` with offsets that have been returned by
    127  |          ``stream.tell()``; in particular, calling ``stream.seek()`` with
    128  |          relative offsets, or with offsets based on string lengths, may
    129  |          lead to incorrect behavior.
    130  |      
    131  |      :ivar _block_reader: The function used to read
    132  |          a single block from the underlying file stream.
    133  |      :ivar _toknum: A list containing the token index of each block
    134  |          that has been processed.  In particular, ``_toknum[i]`` is the
    135  |          token index of the first token in block ``i``.  Together
    136  |          with ``_filepos``, this forms a partial mapping between token
    137  |          indices and file positions.
    138  |      :ivar _filepos: A list containing the file position of each block
    139  |          that has been processed.  In particular, ``_toknum[i]`` is the
    140  |          file position of the first character in block ``i``.  Together
    141  |          with ``_toknum``, this forms a partial mapping between token
    142  |          indices and file positions.
    143  |      :ivar _stream: The stream used to access the underlying corpus file.
    144  |      :ivar _len: The total number of tokens in the corpus, if known;
    145  |          or None, if the number of tokens is not yet known.
    146  |      :ivar _eofpos: The character position of the last character in the
    147  |          file.  This is calculated when the corpus view is initialized,
    148  |          and is used to decide when the end of file has been reached.
    149  |      :ivar _cache: A cache of the most recently read block.  It
    150  |         is encoded as a tuple (start_toknum, end_toknum, tokens), where
    151  |         start_toknum is the token index of the first token in the block;
    152  |         end_toknum is the token index of the first token not in the
    153  |         block; and tokens is a list of the tokens in the block.
    154  |  
    155  |  ----------------------------------------------------------------------
    156  |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
    157  |  
    158  |  __repr__(self)
    159  |  
    160  |  abspath(self, fileid)
    161  |      Return the absolute path for the given file.
    162  |      
    163  |      :type file: str
    164  |      :param file: The file identifier for the file whose path
    166  |          should be returned.
    167  |      :rtype: PathPointer
    168  |  
    169  |  abspaths(self, fileids=None, include_encoding=False, include_fileid=False)
    170  |      Return a list of the absolute paths for all fileids in this corpus;
    171  |      or for the given list of fileids, if specified.
    172  |      
    173  |      :type fileids: None or str or list
    174  |      :param fileids: Specifies the set of fileids for which paths should
    175  |          be returned.  Can be None, for all fileids; a list of
    176  |          file identifiers, for a specified set of fileids; or a single
    177  |          file identifier, for a single file.  Note that the return
    178  |          value is always a list of paths, even if ``fileids`` is a
    179  |          single file identifier.
    180  |      
    181  |      :param include_encoding: If true, then return a list of
    182  |          ``(path_pointer, encoding)`` tuples.
    183  |      
    184  |      :rtype: list(PathPointer)
    185  |  
    186  |  encoding(self, file)
    187  |      Return the unicode encoding for the given corpus file, if known.
    188  |      If the encoding is unknown, or if the given file should be
    189  |      processed using byte strings (str), then return None.
    190  |  
    191  |  fileids(self)
    192  |      Return a list of file identifiers for the fileids that make up
    193  |      this corpus.
    194  |  
    195  |  open(self, file, sourced=False)
    196  |      Return an open stream that can be used to read the given file.
    197  |      If the file's encoding is not None, then the stream will
    198  |      automatically decode the file's contents into unicode.
    199  |      
    200  |      :param file: The file identifier of the file to read.
    201  |  
    202  |  readme(self)
    203  |      Return the contents of the corpus README file, if it exists.
    204  |  
    205  |  ----------------------------------------------------------------------
    206  |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
    207  |  
    208  |  __dict__
    209  |      dictionary for instance variables (if defined)
    210  |  
    211  |  __weakref__
    212  |      list of weak references to the object (if defined)
    213  |  
    214  |  root
    215  |      The directory where this corpus is stored.
    216  |      
    217  |      :type: PathPointer

    PlaintextCorpusReader provides many of the methods used in this chapter's examples, such as fileids() and words().
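
    The help text above describes a "block reader": a function that reads zero or more tokens from a stream and returns them as a list. The sketch below reuses the simple_block_reader shown in that help text and drives it over an in-memory stream. The helper read_all_tokens and the sample text are assumptions made for illustration; they are not part of NLTK, and the real StreamBackedCorpusView does considerably more (lazy indexing, caching).

```python
import io


def simple_block_reader(stream):
    # Same idea as the example in the help text: read one line and
    # return its whitespace-separated substrings as tokens.
    return stream.readline().split()


def read_all_tokens(stream, block_reader):
    """Hypothetical helper: call the block reader until the stream is exhausted."""
    tokens = []
    while True:
        start = stream.tell()
        block = block_reader(stream)
        if stream.tell() == start:
            # Nothing was consumed, so we have reached the end of the stream.
            break
        tokens.extend(block)  # a blank line contributes an empty block
    return tokens


sample = io.StringIO(u"Emma by Jane Austen\n\nVOLUME I\n")
tokens = read_all_tokens(sample, simple_block_reader)
print(tokens)
```

    Each call to the block reader yields one block, which is why the help text relates block size to the size of the token-index/file-position mapping.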

    1.1 fileids() returns the file identifiers of the corpus

    >>> from nltk.corpus import gutenberg
    >>> gutenberg.fileids()
    ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

    1.2 words() returns the list of words in a file

    >>> from nltk.corpus import gutenberg
    >>> gutenberg.words('austen-emma.txt')
    ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]
    >>> len(gutenberg.words('austen-emma.txt'))
    192427
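
    The help output for PlaintextCorpusReader lists the default word tokenizer as WordPunctTokenizer(pattern='\w+|[^\w\s]+'). As a rough sketch, the same pattern can be applied with the standard re module to see why brackets and punctuation come out as separate tokens. The function name word_punct_tokenize is an assumption for this example, and the regex flags shown in the help output are left out.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Hypothetical helper: split text with the pattern from the help output."""
    return WORD_PUNCT.findall(text)


tokens = word_punct_tokenize("[Emma by Jane Austen 1816]")
print(tokens)
print(word_punct_tokenize("don't stop."))
```

    The first call reproduces the opening tokens of austen-emma.txt shown above; the second shows that an apostrophe splits a word into three tokens under this pattern.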

    Use concordance() to search for a word in the text:

    >>> emma = nltk.Text(gutenberg.words('austen-emma.txt'))
    >>> emma
    <Text: Emma by Jane Austen 1816>
    >>> emma.concordance('surperize')
    Building index...
    No matches
    >>> emma.concordance('surprize')
    Displaying 25 of 37 matches:
    er father , was sometimes taken by surprize at his being still able to pity `
    hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
    Knightley actually looked red with surprize and displeasure , as he stood up ,
    r . Elton , and found to his great surprize , that Mr . Elton was actually on
    d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
    father was quite taken up with the surprize of so sudden a journey , and his f
    y , in all the favouring warmth of surprize and conjecture . She was , moreove
    he appeared , to have her share of surprize , introduction , and pleasure . Th
    ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
    talking aunt had taken me quite by surprize , it must have been the death of m
    f all the dialogue which ensued of surprize , and inquiry , and congratulation
     the present . They might chuse to surprize her ." Mrs . Cole had many to agre
    the mode of it , the mystery , the surprize , is more like a young woman ' s s
     to her song took her agreeably by surprize -- a second , slightly but correct
    " " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
    t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
    of your admiration may take you by surprize some day or other ." Mr . Knightle
    ation for her will ever take me by surprize .-- I never had a thought of her i
     expected by the best judges , for surprize -- but there was great joy . Mr .
     sound of at first , without great surprize . " So unreasonably early !" she w
    d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
    ; and Emma could imagine with what surprize and mortification she must be retu
    tled that Jane should go . Quite a surprize to me ! I had not the least idea !
     . It is impossible to express our surprize . He came to speak to his father o
    g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai
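
    To make the idea of a concordance concrete, here is a minimal pure-Python sketch: it scans a token list for a word (ignoring case, as NLTK's help text for concordance() says the real method does) and returns each hit with a few tokens of context on either side. The function name, the context parameter, and the sample sentence are assumptions for illustration; NLTK's own implementation builds an index and formats by character width instead.

```python
def concordance(tokens, word, context=3):
    """Hypothetical helper: one line per match, with `context` tokens per side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive comparison
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append("%s [%s] %s" % (left, tok, right))
    return lines


tokens = "You surprize me ! Emma must do Harriet good . What a Surprize !".split()
for line in concordance(tokens, "surprize"):
    print(line)

# A misspelled query simply finds nothing.
print(concordance(tokens, "surperize"))
```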

    This example uses the nltk.Text class. Checking it with help() as well, the list of methods shows how useful this class is.

      1 class Text(__builtin__.object)
      2  |  A wrapper around a sequence of simple (string) tokens, which is
      3  |  intended to support initial exploration of texts (via the
      4  |  interactive console).  Its methods perform a variety of analyses
      5  |  on the text's contexts (e.g., counting, concordancing, collocation
      6  |  discovery), and display the results.  If you wish to write a
      7  |  program which makes use of these analyses, then you should bypass
      8  |  the ``Text`` class, and use the appropriate analysis function or
      9  |  class directly instead.
     10  |  
     11  |  A ``Text`` is typically initialized from a given document or
     12  |  corpus.  E.g.:
     13  |  
     14  |  >>> import nltk.corpus
     15  |  >>> from nltk.text import Text
     16  |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
     17  |  
     18  |  Methods defined here:
     19  |  
     20  |  __getitem__(self, i)
     21  |  
     22  |  __init__(self, tokens, name=None)
     23  |      Create a Text object.
     24  |      
     25  |      :param tokens: The source text.
     26  |      :type tokens: sequence of str
     27  |  
     28  |  __len__(self)
     29  |  
     30  |  __repr__(self)
     31  |      :return: A string representation of this FreqDist.
     32  |      :rtype: string
     33  |  
     34  |  collocations(self, num=20, window_size=2)
     35  |      Print collocations derived from the text, ignoring stopwords.
     36  |      
     37  |      :seealso: find_collocations
     38  |      :param num: The maximum number of collocations to print.
     39  |      :type num: int
     40  |      :param window_size: The number of tokens spanned by a collocation (default=2)
     41  |      :type window_size: int
     42  |  
     43  |  common_contexts(self, words, num=20)
     44  |      Find contexts where the specified words appear; list
     45  |      most frequent common contexts first.
     46  |      
     47  |      :param word: The word used to seed the similarity search
     48  |      :type word: str
     49  |      :param num: The number of words to generate (default=20)
     50  |      :type num: int
     51  |      :seealso: ContextIndex.common_contexts()
     52  |  
     53  |  concordance(self, word, width=79, lines=25)
     54  |      Print a concordance for ``word`` with the specified context window.
     55  |      Word matching is not case-sensitive.
     56  |      :seealso: ``ConcordanceIndex``
     57  |  
     58  |  count(self, word)
     59  |      Count the number of times this word appears in the text.
     60  |  
     61  |  dispersion_plot(self, words)
     62  |      Produce a plot showing the distribution of the words through the text.
     63  |      Requires pylab to be installed.
     64  |      
     65  |      :param words: The words to be plotted
     66  |      :type word: str
     67  |      :seealso: nltk.draw.dispersion_plot()
     68  |  
     69  |  findall(self, regexp)
     70  |      Find instances of the regular expression in the text.
     71  |      The text is a list of tokens, and a regexp pattern to match
     72  |      a single token must be surrounded by angle brackets.  E.g.
     73  |      
     74  |      >>> from nltk.book import text1, text5, text9
     75  |      >>> text5.findall("<.*><.*><bro>")
     76  |      you rule bro; telling you bro; u twizted bro
     77  |      >>> text1.findall("<a>(<.*>)<man>")
     78  |      monied; nervous; dangerous; white; white; white; pious; queer; good;
     79  |      mature; white; Cape; great; wise; wise; butterless; white; fiendish;
     80  |      pale; furious; better; certain; complete; dismasted; younger; brave;
     81  |      brave; brave; brave
     82  |      >>> text9.findall("<th.*>{3,}")
     83  |      thread through those; the thought that; that the thing; the thing
     84  |      that; that that thing; through these than through; them that the;
     85  |      through the thick; them that they; thought that the
     86  |      
     87  |      :param regexp: A regular expression
     88  |      :type regexp: str
     89  |  
     90  |  generate(self, length=100)
     91  |      Print random text, generated using a trigram language model.
     92  |      
     93  |      :param length: The length of text to generate (default=100)
     94  |      :type length: int
     95  |      :seealso: NgramModel
     96  |  
     97  |  index(self, word)
     98  |      Find the index of the first occurrence of the word in the text.
     99  |  
    100  |  plot(self, *args)
    101  |      See documentation for FreqDist.plot()
    102  |      :seealso: nltk.prob.FreqDist.plot()
    103  |  
    104  |  readability(self, method)
    105  |  
    106  |  similar(self, word, num=20)
    107  |      Distributional similarity: find other words which appear in the
    108  |      same contexts as the specified word; list most similar words first.
    109  |      
    110  |      :param word: The word used to seed the similarity search
    111  |      :type word: str
    112  |      :param num: The number of words to generate (default=20)
    113  |      :type num: int
    114  |      :seealso: ContextIndex.similar_words()
    115  |  
    116  |  vocab(self)
    117  |      :seealso: nltk.prob.FreqDist
    118  |  
    119  |  ----------------------------------------------------------------------
    120  |  Data descriptors defined here:
    121  |  
    122  |  __dict__
    123  |      dictionary for instance variables (if defined)
    124  |  
    125  |  __weakref__
    126  |      list of weak references to the object (if defined)

    1.3 Differences between raw, sents, and words

    The following example shows how raw(), sents(), and words() differ:

    #!/usr/bin/env python
    from nltk.corpus import gutenberg

    for fileid in gutenberg.fileids():
        num_chars = len(gutenberg.raw(fileid))      # number of characters in the raw text (whitespace included)
        num_words = len(gutenberg.words(fileid))    # number of word tokens
        num_sents = len(gutenberg.sents(fileid))    # number of sentences
        num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # number of distinct words, case-folded
        print(int(num_chars / num_words), int(num_words / num_sents), int(num_words / num_vocab), fileid)

    Output (columns: average word length, average number of words per sentence, average number of occurrences of each distinct word, file id):

    4 21 26 austen-emma.txt
    4 23 16 austen-persuasion.txt
    4 23 22 austen-sense.txt
    4 33 79 bible-kjv.txt
    4 18 5 blake-poems.txt
    4 17 14 bryant-stories.txt
    4 17 12 burgess-busterbrown.txt
    4 16 12 carroll-alice.txt
    4 17 11 chesterton-ball.txt
    4 19 11 chesterton-brown.txt
    4 16 10 chesterton-thursday.txt
    4 17 24 edgeworth-parents.txt
    4 24 15 melville-moby_dick.txt
    4 52 10 milton-paradise.txt
    4 11 8 shakespeare-caesar.txt
    4 12 7 shakespeare-hamlet.txt
    4 12 6 shakespeare-macbeth.txt
    4 35 12 whitman-leaves.txt
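
    The three ratios above depend only on the three views of the same text: the raw string, the flat word list, and the list of sentences. As a sketch that runs without any corpus data, the function below takes those three views as plain Python values. The name corpus_stats and the toy text are assumptions for illustration.

```python
def corpus_stats(raw, words, sents):
    """Hypothetical helper: return the three ratios as truncated integers."""
    num_chars = len(raw)                            # characters, whitespace included
    num_words = len(words)                          # word and punctuation tokens
    num_sents = len(sents)                          # sentences
    num_vocab = len(set(w.lower() for w in words))  # distinct tokens, case-folded
    return (num_chars // num_words,   # average token length
            num_words // num_sents,   # average sentence length in tokens
            num_words // num_vocab)   # average occurrences per distinct token


raw = "The cat sat. The cat ran."
sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]
words = [w for s in sents for w in s]
print(corpus_stats(raw, words, sents))
```

    Because the character count comes from the raw string, it includes spaces and newlines, so the first ratio slightly overstates the length of the words themselves.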

    Find and display the longest sentence in shakespeare-macbeth.txt:

    #!/usr/bin/env python
    from nltk.corpus import gutenberg

    macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')   # list of sentences, each a list of tokens
    print(macbeth_sentences)
    print(macbeth_sentences[1037])
    longest_len = max(len(s) for s in macbeth_sentences)             # length of the longest sentence
    print([s for s in macbeth_sentences if len(s) == longest_len])   # the longest sentence(s)
    
    [['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
    
    ['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']
    
    [['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]
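
    The same "longest sentence" logic can be tried on a small hand-made list of tokenized sentences. This is only a sketch with made-up data; longest_sentences is a hypothetical helper name. It also shows the alternative max(..., key=len), which returns a single sentence rather than every sentence tied for the maximum.

```python
def longest_sentences(sents):
    """Hypothetical helper: return every sentence whose length equals the maximum."""
    longest_len = max(len(s) for s in sents)
    return [s for s in sents if len(s) == longest_len]


sents = [
    ["Actus", "Primus", "."],
    ["When", "shall", "we", "three", "meet", "againe", "?"],
    ["Exeunt", "."],
]
print(longest_sentences(sents))

# max() with key=len returns only the first sentence of maximal length.
print(max(sents, key=len))
```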

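    1.4 The NPSChatCorpusReader class

    Next comes another reader class. NLTK provides a further corpus instance, nltk.corpus.nps_chat, and again we use help() to inspect it. Its class hierarchy already suggests that it reads XML files.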

    nps_chat = class NPSChatCorpusReader(nltk.corpus.reader.xmldocs.XMLCorpusReader)
     |  Method resolution order:
     |      NPSChatCorpusReader
     |      nltk.corpus.reader.xmldocs.XMLCorpusReader
     |      nltk.corpus.reader.api.CorpusReader
     |      __builtin__.object
     |  
     |  Methods defined here:
    ...

    >>> from nltk.corpus import nps_chat
    >>> nps_chat.fileids()
    ['10-19-20s_706posts.xml', '10-19-30s_705posts.xml', '10-19-40s_686posts.xml', '10-19-adults_706posts.xml', '10-24-40s_706posts.xml', '10-26-teens_706posts.xml', '11-06-adults_706posts.xml', '11-08-20s_705posts.xml', '11-08-40s_706posts.xml', '11-08-adults_705posts.xml', '11-08-teens_706posts.xml', '11-09-20s_706posts.xml', '11-09-40s_706posts.xml', '11-09-adults_706posts.xml', '11-09-teens_706posts.xml']
    >>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
    >>> chatroom[123]
    ['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

    1.5 The CategorizedTaggedCorpusReader class

    This section uses the brown corpus object as an example of the CategorizedTaggedCorpusReader class.

      1 >>> from nltk.corpus import brown
      2 >>> help(brown)
      3 class CategorizedTaggedCorpusReader(nltk.corpus.reader.api.CategorizedCorpusReader, TaggedCorpusReader)
      4  |  A reader for part-of-speech tagged corpora whose documents are
      5  |  divided into categories based on their file identifiers.
      6  |  
      7  |  Method resolution order:
      8  |      CategorizedTaggedCorpusReader
      9  |      nltk.corpus.reader.api.CategorizedCorpusReader
     10  |      TaggedCorpusReader
     11  |      nltk.corpus.reader.api.CorpusReader
     12  |      __builtin__.object
     13  |  
     14  |  Methods defined here:
     15  |  
     16  |  __init__(self, *args, **kwargs)
     17  |      Initialize the corpus reader.  Categorization arguments
     18  |      (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
     19  |      the ``CategorizedCorpusReader`` constructor.  The remaining arguments
     20  |      are passed to the ``TaggedCorpusReader``.
     21  |  
     22  |  paras(self, fileids=None, categories=None)
     23  |  
     24  |  raw(self, fileids=None, categories=None)
     25  |  
     26  |  sents(self, fileids=None, categories=None)
     27  |  
     28  |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
     29  |  
     30  |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
     31  |  
     32  |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
     33  |  
     34  |  words(self, fileids=None, categories=None)
     35  |  
     36  |  ----------------------------------------------------------------------
     37  |  Methods inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
     38  |  
     39  |  categories(self, fileids=None)
     40  |      Return a list of the categories that are defined for this corpus,
     41  |      or for the file(s) if it is given.
     42  |  
     43  |  fileids(self, categories=None)
     44  |      Return a list of file identifiers for the files that make up
     45  |      this corpus, or that make up the given category(s) if specified.
     46  |  
     47  |  ----------------------------------------------------------------------
     48  |  Data descriptors inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
     49  |  
     50  |  __dict__
     51  |      dictionary for instance variables (if defined)
     52  |  
     53  |  __weakref__
     54  |      list of weak references to the object (if defined)
     55  |  
     56  |  ----------------------------------------------------------------------
     57  |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
     58  |  
     59  |  __repr__(self)
     60  |  
     61  |  abspath(self, fileid)
     62  |      Return the absolute path for the given file.
     63  |      
     64  |      :type file: str
     65  |      :param file: The file identifier for the file whose path
     66  |          should be returned.
     67  |      :rtype: PathPointer
     68  |  
     69  |  abspaths(self, fileids=None, include_encoding=False, include_fileid=False)
     70  |      Return a list of the absolute paths for all fileids in this corpus;
     71  |      or for the given list of fileids, if specified.
     72  |      
     73  |      :type fileids: None or str or list
     74  |      :param fileids: Specifies the set of fileids for which paths should
     75  |          be returned.  Can be None, for all fileids; a list of
     76  |          file identifiers, for a specified set of fileids; or a single
     77  |          file identifier, for a single file.  Note that the return
     78  |          value is always a list of paths, even if ``fileids`` is a
     79  |          single file identifier.
     80  |      
     81  |      :param include_encoding: If true, then return a list of
     82  |          ``(path_pointer, encoding)`` tuples.
     83  |      
     84  |      :rtype: list(PathPointer)
     85  |  
     86  |  encoding(self, file)
     87  |      Return the unicode encoding for the given corpus file, if known.
     88  |      If the encoding is unknown, or if the given file should be
     89  |      processed using byte strings (str), then return None.
     90  |  
     91  |  open(self, file, sourced=False)
     92  |      Return an open stream that can be used to read the given file.
     93  |      If the file's encoding is not None, then the stream will
     94  |      automatically decode the file's contents into unicode.
     95  |      
     96  |      :param file: The file identifier of the file to read.
     97  |  
     98  |  readme(self)
     99  |      Return the contents of the corpus README file, if it exists.
    100  |  
    101  |  ----------------------------------------------------------------------
    102  |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
    103  |  
    104  |  root
    105  |      The directory where this corpus is stored.
    106  |      
    107  |      :type: PathPointer

    Now look at the contents of brown: how to list the categories and files of the Brown corpus.

    >>> from nltk.corpus import brown
    >>> brown.categories()   # the categories (genres) of the Brown corpus
    ['adventure', 'belles_lettres', 'editori', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
    >>> brown.fileids()[1:10]   # a slice of the file identifiers in the Brown corpus
    ['ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10']
    >>> brown.words(categories='news')   # the 'news' category, split into words
    ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
    >>> brown.words(fileids=['cg22'])   # the file 'cg22', split into words
    ['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
    >>> brown.sents(categories=['news','editori','reviews'])   # several categories at once, split into sentences
    [['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

    Count specific words (here, the modal verbs) within a single genre of Brown:

    from nltk.corpus import brown
    import nltk

    news_text = brown.words(categories='news')   # the words of the 'news' category
    fdist = nltk.FreqDist(w.lower() for w in news_text)   # frequency distribution of the lowercased words
    modals = ['can', 'could', 'may', 'might', 'must', 'will']
    for m in modals:
        print(m + ':', fdist[m], end=' ')   # the count of each modal verb, all on one line

    Output:

      can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
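
    To make the counting step concrete, here is a minimal sketch of what the frequency distribution does. It uses collections.Counter from the standard library as a stand-in for nltk.FreqDist, so it runs without any corpus data. The token list is a made-up toy sample, not Brown data, and count_modals is a hypothetical helper name.

```python
from collections import Counter

# Toy token list standing in for brown.words(categories='news');
# invented for illustration, not taken from the corpus.
tokens = ['The', 'jury', 'can', 'meet', '.', 'It', 'may', 'decide', ',',
          'and', 'it', 'Can', 'adjourn', '.', 'Will', 'it', '?']


def count_modals(words, modals):
    """Count each modal verb case-insensitively, like fdist[m] in the text."""
    freq = Counter(w.lower() for w in words)   # one pass over the tokens
    return {m: freq[m] for m in modals}        # a word never seen counts as 0


modals = ['can', 'could', 'may', 'might', 'must', 'will']
counts = count_modals(tokens, modals)
print(' '.join(f'{m}: {counts[m]}' for m in modals))
# can: 2 could: 0 may: 1 might: 0 must: 0 will: 1
```

    Lowercasing before counting is what lets 'Can' and 'can' land in the same bucket, which is why the original code also calls w.lower().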

    Count the same modal verbs across several genres at once:

      from nltk.corpus import brown
      import nltk

      # One frequency distribution per genre, built from (genre, word) pairs
      cfd = nltk.ConditionalFreqDist(
              (genre, word)
              for genre in brown.categories()
              for word in brown.words(categories=genre))
      genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
      modals = ['can', 'could', 'may', 'might', 'must', 'will']
      cfd.tabulate(conditions=genres, samples=modals)

                     can could  may might must will
               news   93   86   66   38   50  389
           religion   82   59   78   12   54   71
            hobbies  268   58  131   22   83  264
    science_fiction   16   49    4   12    8   16
            romance   74  193   11   51   45   43
              humor   16   30    8    8    9   13
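
    The idea behind a conditional frequency distribution can be sketched with the standard library alone: keep one counter per condition (here, per genre). This is only an illustration of the concept, not how nltk.ConditionalFreqDist is implemented. The (genre, word) pairs are invented toy data, and conditional_counts and tabulate are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Toy (genre, word) pairs standing in for the Brown corpus; invented for illustration.
pairs = [('news', 'will'), ('news', 'can'), ('news', 'will'),
         ('romance', 'could'), ('romance', 'could'), ('romance', 'might')]


def conditional_counts(pairs):
    """Group the counts by condition: one Counter of words per genre."""
    table = defaultdict(Counter)
    for genre, word in pairs:
        table[genre][word] += 1
    return table


def tabulate(table, conditions, samples):
    """Return one text row per condition, with one count column per sample."""
    width = max(len(c) for c in conditions)
    rows = [' ' * width + ''.join(f'{s:>6}' for s in samples)]
    for c in conditions:
        # .get() avoids creating an entry for a condition that was never seen
        counts = table.get(c, Counter())
        rows.append(f'{c:>{width}}' + ''.join(f'{counts[s]:>6}' for s in samples))
    return rows


cfd = conditional_counts(pairs)
rows = tabulate(cfd, ['news', 'romance', 'poetry'], ['can', 'could', 'will'])
print('\n'.join(rows))
```

    A condition that never occurred (the hypothetical 'poetry' here) simply produces a row of zeros rather than an error, so a misspelled genre name can easily go unnoticed.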

    1.6  The CategorizedPlaintextCorpusReader class

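    Both brown (a CategorizedTaggedCorpusReader) and reuters (a CategorizedPlaintextCorpusReader) are categorized corpora, so with either one you can look up the categories covered by one or more documents, or the documents that belong to one or more categories. What sets reuters apart is that its categories overlap: a news story often covers several topics, so a single file can belong to several categories.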

    >>> from nltk.corpus import reuters
    >>> reuters.fileids()[1:10]
    ['test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']
    >>> reuters.categories()
    ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
    >>> reuters.categories('training/9865')
    ['barley', 'corn', 'grain', 'wheat']
    >>> reuters.categories(['training/9865','training/9880'])
    ['barley', 'corn', 'grain', 'money-fx', 'wheat']
    >>> reuters.categories('training/9880')
    ['money-fx']

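    The same two lookups exist on brown, but the argument types must match: categories() expects file IDs and fileids() expects category names. If the two are swapped, nothing matches and the result is an empty list:

    >>> from nltk.corpus import brown
    >>> brown.categories(['news','reviews'])   # category names passed where file IDs are expected
    []
    >>> brown.fileids(['cr05','cr06'])   # file IDs passed where category names are expected
    []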
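
    The two lookups can be pictured as a pair of inverse mappings, file to categories and category to files, where a query over several keys returns the union of their values. The sketch below mimics that behavior with plain dictionaries. It is an illustration only, not NLTK's implementation. The two file entries are copied from the reuters output shown earlier, and the function names merely mirror the corpus methods.

```python
from collections import defaultdict

# File -> categories, copied from the reuters output shown in the text.
file_to_cats = {
    'training/9865': {'barley', 'corn', 'grain', 'wheat'},
    'training/9880': {'money-fx'},
}

# Invert the mapping: category -> files.
cat_to_files = defaultdict(set)
for fileid, cats in file_to_cats.items():
    for cat in cats:
        cat_to_files[cat].add(fileid)


def categories(fileids):
    """Union of the categories of the given file IDs (a string or a list)."""
    if isinstance(fileids, str):
        fileids = [fileids]
    return sorted(set().union(*(file_to_cats.get(f, set()) for f in fileids)))


def fileids(cats):
    """Union of the files that belong to the given categories (a string or a list)."""
    if isinstance(cats, str):
        cats = [cats]
    return sorted(set().union(*(cat_to_files.get(c, set()) for c in cats)))


print(categories(['training/9865', 'training/9880']))
# ['barley', 'corn', 'grain', 'money-fx', 'wheat']
print(fileids('grain'))        # ['training/9865']
print(categories(['grain']))   # []  (a category name is not a file ID)
```

    The last call shows why a lookup with the wrong kind of key quietly returns an empty list: the key is simply absent from that mapping.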

    1.7 Basic corpus functions

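    Example                       Description
    fileids()                     the files of the corpus
    fileids([categories])         the files of the corpus that belong to these categories
    categories()                  the categories of the corpus
    categories([fileids])         the categories of the corpus that these files belong to
    raw()                         the raw content of the corpus
    raw(fileids=[f1,f2,f3])       the raw content of the specified files
    raw(categories=[c1,c2])       the raw content of the specified categories
    words()                       the words of the whole corpus
    words(fileids=[f1,f2,f3])     the words of the specified files
    words(categories=[c1,c2])     the words of the specified categories
    sents()                       the sentences of the whole corpus
    sents(fileids=[f1,f2,f3])     the sentences of the specified files
    sents(categories=[c1,c2])     the sentences of the specified categories
    abspath(fileid)               the location of the given file on disk
    encoding(fileid)              the encoding of the file (if known)
    open(fileid)                  open a stream for reading the given corpus file
    root                          the path to the root of the locally installed corpus (a property, not a method)
    readme()                      the contents of the README file of the corpus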

    1.8 Loading your own corpus

    >>> from nltk.corpus import PlaintextCorpusReader
    >>> corpus_root='/Users/rcf/workspace/python/python_test/NLP_WITH_PYTHON/chapter_2'
    >>> wordlist=PlaintextCorpusReader(corpus_root,'.*')   # corpus_root is the corpus directory; '.*' is a regular expression selecting the files to include (here, all of them)
    >>> wordlist.fileids()
    ['1.py', '2.py', '3.py', '4.py']
    >>> wordlist.words('3.py')
    ['from', 'nltk', '.', 'corpus', 'import', 'brown', ...]
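
    For a version that can be tried without the directory above, the following sketch writes a few small files to a temporary directory and reads them back with PlaintextCorpusReader. It assumes NLTK is installed. The file names and contents are invented, and load_demo_corpus is a hypothetical helper name. Only words() and raw() are used, since sents() would additionally need the Punkt sentence tokenizer data.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader


def load_demo_corpus():
    """Write small text files to a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as corpus_root:
        # Hypothetical sample files; any plain-text files would do.
        samples = {
            'a.txt': 'Hello, world!',
            'b.txt': 'A second file.',
            'notes.md': 'Not matched by the pattern below.',
        }
        for name, text in samples.items():
            with open(os.path.join(corpus_root, name), 'w', encoding='utf8') as f:
                f.write(text)

        # The second argument is a regular expression over file IDs, not a shell glob.
        reader = PlaintextCorpusReader(corpus_root, r'.*\.txt')

        # Materialize the results before the temporary directory is removed.
        return reader.fileids(), list(reader.words('a.txt')), reader.raw('b.txt')


file_ids, words, raw_text = load_demo_corpus()
print(file_ids)   # ['a.txt', 'b.txt']
print(words)      # ['Hello', ',', 'world', '!']
print(raw_text)   # A second file.
```

    Restricting the pattern to r'.*\.txt' keeps unrelated files out of the corpus, whereas the '.*' pattern used above picks up every file in the directory.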
  • Original article: https://www.cnblogs.com/rcfeng/p/3930464.html