  • A First Look at NLTK

    I needed to process English text, so I turned to Python's nltk package.

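    import nltk

    # Read the raw English text from the author's local file.
    with open(r"D:PostgraduatePythonPython爬取美国商标局专利s_exp.txt") as f:
        text = f.read()
    sentences = nltk.sent_tokenize(text)  # 1. split the text into sentences
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]  # 2. split each sentence into words
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]  # 3. tag each word with its part of speech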

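    The steps, in order, are:

    1. sentence splitting; 2. word tokenization; 3. part-of-speech (POS) tagging

    Then comes step 4, named entity recognition: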

    for sent in tagged_sentences:
        print(nltk.ne_chunk(sent))
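
    nltk.ne_chunk returns a tree in which each recognized entity is a labelled subtree, while every other token stays a plain (word, tag) pair. A minimal sketch of pulling the entities out of such a tree follows. extract_entities is a hypothetical helper name, not part of NLTK, and it relies only on the label() and leaves() methods that nltk.Tree provides.

```python
def extract_entities(chunked_sentence):
    """Collect (entity text, entity label) pairs from one ne_chunk result.

    Hypothetical helper: it assumes entity subtrees expose label() and
    leaves(), as nltk.Tree does, while ordinary tokens are plain
    (word, pos_tag) tuples.
    """
    entities = []
    for node in chunked_sentence:
        # Plain tuples are ordinary tokens; only subtrees carry an entity label.
        if hasattr(node, "label"):
            words = [word for word, _pos in node.leaves()]
            entities.append((" ".join(words), node.label()))
    return entities
```

    With the tagged_sentences built earlier, extract_entities(nltk.ne_chunk(sent)) would give the entities of one sentence as a flat list.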

    Of course, the POS tagging and named entity recognition steps can also be done with Stanford's POS tagger and NER libraries.

    >>> import nltk
    >>> from nltk.tag.stanford import StanfordPOSTagger
    >>> stan_tagger = StanfordPOSTagger(r'D:PostgraduatePythonPython自然语言处理stanford-postagger-full-2018-02-27stanford-postagger-full-2018-02-27modelsenglish-bidirectional-distsim.tagger', r'D:PostgraduatePythonPython自然语言处理stanford-postagger-full-2018-02-27stanford-postagger-full-2018-02-27stanford-postagger.jar')
    
    Warning (from warnings module):
      File "C:\Program Files\Python36\lib\site-packages\nltk\tag\stanford.py", line 149
        super(StanfordPOSTagger, self).__init__(*args, **kwargs)
    DeprecationWarning: 
    The StanfordTokenizer will be deprecated in version 3.2.5.
    Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
    >>> s = "I was watching TV"
    >>> tokens = nltk.word_tokenize(s)
    >>> stan_tagger.tag(tokens)
    [('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]

    Next comes named entity recognition:

    from nltk.tag.stanford import StanfordNERTagger
    # https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip
    st = StanfordNERTagger(r'D:PostgraduatePythonPython自然语言处理stanford-ner-2017-06-09stanford-ner-2017-06-09classifiersenglish.all.3class.distsim.crf.ser.gz', r'D:PostgraduatePythonPython自然语言处理stanford-ner-2017-06-09stanford-ner-2017-06-09stanford-ner.jar')
    print(st.tag('Rami Eid is studying at Stony Brook University in NY'.split()))
    # [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
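
    The tagger labels each token separately, so a multi-word name such as Stony Brook University comes back as three ORGANIZATION tokens. One way to merge adjacent tokens that share a tag is sketched below with itertools.groupby. group_entities is a hypothetical helper, not part of NLTK or the Stanford tools, and it assumes that neighbouring tokens with the same tag belong to one entity, which is wrong when two distinct entities of the same type sit side by side.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_tag="O"):
    """Merge runs of adjacent tokens that carry the same non-O tag."""
    entities = []
    # groupby yields one group per run of consecutive items with equal tags.
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities.append((" ".join(token for token, _tag in run), tag))
    return entities


# Tagger output copied from the example above.
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
print(group_entities(tagged))
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION')]
```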

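    However, the results do not seem very good: in the tagger output above, for instance, NY is labeled O rather than as a location.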

    Life is short, so why not use Python
  • Original post: https://www.cnblogs.com/yqpy/p/9131293.html