  • A First Look at NLTK

    I needed to process English text, so I turned to Python's nltk package.

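    import nltk

    # Read the whole file into a single string
    with open(r"D:\Postgraduate\Python\Python爬取美国商标局专利\s_exp.txt") as f:
        text = f.read()
    sentences = nltk.sent_tokenize(text)  # split the text into sentences
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]  # split each sentence into words
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]  # attach a POS tag to each word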

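    The steps, in order, are:

    1. sentence splitting; 2. word tokenization; 3. part-of-speech (POS) tagging

    Then comes step 4, named entity recognition: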

    for sent in tagged_sentences:
        print(nltk.ne_chunk(sent))
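
    Printing the chunked tree shows the entities, but it is often handier to have them as plain pairs. The following is a minimal sketch, not part of the original post: extract_entities is a hypothetical helper name, and it assumes its argument behaves like the tree returned by nltk.ne_chunk, where ordinary tokens are (word, tag) tuples and each recognized entity is a subtree with label() and leaves().

```python
def extract_entities(chunked):
    """Collect (entity text, entity label) pairs from one ne_chunk result.

    Assumption: `chunked` iterates like the nltk.Tree returned by
    nltk.ne_chunk. Plain tokens are (word, tag) tuples; entities are
    subtrees that expose label() and leaves().
    """
    entities = []
    for node in chunked:
        if hasattr(node, "label"):  # a subtree, i.e. a recognized entity
            words = [word for word, _tag in node.leaves()]
            entities.append((" ".join(words), node.label()))
    return entities

# Intended use with the variables from the code above:
#   for sent in tagged_sentences:
#       print(extract_entities(nltk.ne_chunk(sent)))
```

    The helper only looks at the top level of the tree, which should be enough here because ne_chunk produces flat entity chunks directly under the sentence node.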

    Of course, the last two steps, POS tagging and named entity recognition, can also be done with Stanford's POS tagger and NER libraries:

    >>> import nltk
    >>> from nltk.tag.stanford import StanfordPOSTagger
    >>> stan_tagger = StanfordPOSTagger(r'D:\Postgraduate\Python\Python自然语言处理\stanford-postagger-full-2018-02-27\stanford-postagger-full-2018-02-27\models\english-bidirectional-distsim.tagger', r'D:\Postgraduate\Python\Python自然语言处理\stanford-postagger-full-2018-02-27\stanford-postagger-full-2018-02-27\stanford-postagger.jar')
    
    Warning (from warnings module):
      File "C:\Program Files\Python36\lib\site-packages\nltk\tag\stanford.py", line 149
        super(StanfordPOSTagger, self).__init__(*args, **kwargs)
    DeprecationWarning: 
    The StanfordTokenizer will be deprecated in version 3.2.5.
    Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
    >>> s = "I was watching TV"
    >>> tokens = nltk.word_tokenize(s)
    >>> stan_tagger.tag(tokens)
    [('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]

    Next comes named entity recognition:

    from nltk.tag.stanford import StanfordNERTagger
    # https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip
    # First argument: the classifier model; second argument: the Stanford NER jar
    st = StanfordNERTagger(r'D:\Postgraduate\Python\Python自然语言处理\stanford-ner-2017-06-09\stanford-ner-2017-06-09\classifiers\english.all.3class.distsim.crf.ser.gz', r'D:\Postgraduate\Python\Python自然语言处理\stanford-ner-2017-06-09\stanford-ner-2017-06-09\stanford-ner.jar')
    print(st.tag('Rami Eid is studying at Stony Brook University in NY'.split()))
    # Output:
    # [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

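    But the results don't seem very good: in the output above, for instance, 'NY' is tagged O rather than as a location.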
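
    The Stanford tagger labels one token at a time, so a multi-word name comes back as several separate pairs. As a rough sketch (group_entities is a hypothetical helper, not something NLTK provides), consecutive tokens that share the same non-O label can be merged into a single entity. Because the 3-class output carries no begin/inside markers, two different entities of the same type that happen to be adjacent would be merged into one, which is a known limitation of this simple approach.

```python
from itertools import groupby

def group_entities(tagged_tokens):
    """Merge consecutive tokens that share a non-'O' label into one entity."""
    entities = []
    # groupby yields one run per stretch of identical labels
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != "O":
            entities.append((" ".join(word for word, _label in run), label))
    return entities

# The tagger output shown in the post
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
print(group_entities(tagged))
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION')]
```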

    Life is short, so why not use Python?
  • Original article: https://www.cnblogs.com/yqpy/p/9131293.html