  • NLTK Knowledge Notes

    Corpora bundled with the nltk.corpus module

    NLTK comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: http://nltk.org/nltk_data/

    1. Run the Python interpreter and type the commands:
    >>> import nltk
    >>> nltk.download()
    
    2. Test that the data has been installed as follows. (This assumes you downloaded the Brown Corpus):
    >>> from nltk.corpus import brown
    >>> brown.words()
    ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
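
    Called with no arguments, nltk.download() opens an interactive downloader. For scripts, a package name can be passed instead. The helper below is a minimal sketch of a "download only if missing" check: ensure_corpus and its find/download parameters are hypothetical names introduced here for illustration, and only nltk.data.find and nltk.download are real NLTK calls.

```python
def ensure_corpus(name, find=None, download=None):
    """Download corpus `name` only if it is not already installed.

    `find` and `download` default to nltk.data.find and nltk.download;
    they are parameters (an assumption of this sketch) so the logic can
    be tried out without network access.
    Returns True if a download was triggered, False otherwise.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        # nltk.data.find raises LookupError when the resource is missing
        find(f"corpora/{name}")
        return False
    except LookupError:
        download(name)
        return True
```

    Calling ensure_corpus("brown") once at the top of a script should make the later `from nltk.corpus import brown` usable without opening the interactive downloader.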
    

    API

    • words(): list of str
    • sents(): list of (list of str)
    • paras(): list of (list of (list of str))
    • tagged_words(): list of (str,str) tuple
    • tagged_sents(): list of (list of (str,str))
    • tagged_paras(): list of (list of (list of (str,str)))
    • chunked_sents(): list of (Tree w/ (str,str) leaves)
    • parsed_sents(): list of (Tree with str leaves)
    • parsed_paras(): list of (list of (Tree with str leaves))
    • xml(): A single xml ElementTree
    • raw(): unprocessed corpus contents
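
    The return shapes listed above nest in a regular way: paragraphs contain sentences, and sentences contain words. The sketch below illustrates that nesting with a small made-up corpus in plain Python. The data and the helper names (toy_paras, sents_from_paras, words_from_sents) are assumptions for illustration, not NLTK output or NLTK functions.

```python
# Toy stand-in for a corpus: paragraphs -> sentences -> words.
# This data is made up for illustration; it is not NLTK output.
toy_paras = [
    [["The", "jury", "said", "."], ["It", "met", "Friday", "."]],
    [["No", "irregularities", "took", "place", "."]],
]

def sents_from_paras(paras):
    """Flatten one level: list of paragraphs -> list of sentences."""
    return [sent for para in paras for sent in para]

def words_from_sents(sents):
    """Flatten one level: list of sentences -> list of words."""
    return [word for sent in sents for word in sent]

toy_sents = sents_from_paras(toy_paras)   # same shape as sents()
toy_words = words_from_sents(toy_sents)   # same shape as words()

print(len(toy_paras), len(toy_sents), len(toy_words))
```

    The tagged_* variants follow the same nesting, with each str replaced by a (str, str) tuple.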

    For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

    >>> from nltk.corpus import brown
    >>> print(", ".join(brown.words()))
    The, Fulton, County, Grand, Jury, said, ...
    

    Tokenize: English word tokenization

    Tokenize some text:

    >>> import nltk
    >>> sentence = """At eight o'clock on Thursday morning
    ... Arthur didn't feel very good."""
    >>> nltk.word_tokenize(sentence)
    ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
    'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
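
    Note how the output above keeps "o'clock" whole but splits "didn't" into "did" and "n't". The regular expression below is a simplified sketch that reproduces this behaviour for the example sentence. It is not NLTK's implementation, and toy_tokenize and TOKEN_RE are hypothetical names; other cases (possessives, quotes, abbreviations) are not handled the way nltk.word_tokenize handles them.

```python
import re

# Alternatives are tried left to right at each position:
#   1. the clitic "n't" on its own
#   2. a word stem that is immediately followed by "n't"
#   3. a word, optionally with an internal apostrophe (e.g. "o'clock")
#   4. any single punctuation character
TOKEN_RE = re.compile(r"n't\b|\w+(?=n't\b)|\w+(?:'\w+)?|[^\w\s]")

def toy_tokenize(text):
    """Split text into word-like tokens; a toy approximation only."""
    return TOKEN_RE.findall(text)

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
print(toy_tokenize(sentence))
```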
    

    References

    [1] NLTK 3.2.5 documentation http://www.nltk.org/
    [2] nltk.corpus package http://www.nltk.org/api/nltk.corpus.html#module-nltk.corpus

  • Original article: https://www.cnblogs.com/fengyubo/p/8627141.html