  • A First Look at NLTK

    I needed to process English text, so I turned to Python's nltk package

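    import nltk

    # Read the whole input file into one string
    with open(r"D:\Postgraduate\Python\Python爬取美国商标局专利\s_exp.txt") as f:
        text = f.read()
    sentences = nltk.sent_tokenize(text)  # 1. split the text into sentences
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]  # 2. split each sentence into words
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]  # 3. tag each word with its part of speech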

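    The steps, in order, are:

    1. sentence splitting; 2. word tokenization; 3. part-of-speech (POS) tagging

    followed by 4. named entity recognition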

    for sent in tagged_sentences:
        print(nltk.ne_chunk(sent))
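
    `nltk.ne_chunk` returns a tree in which each recognized entity is a labeled subtree, while every other token stays a plain `(word, tag)` tuple. Below is a minimal sketch of pulling the entities out of such a tree. `extract_entities` is a hypothetical helper, not part of NLTK, and it only assumes that subtrees are iterable and have a `label()` method:

```python
def extract_entities(tree):
    """Return (entity_text, entity_label) pairs from a chunk tree.

    Assumes a flat tree like the one nltk.ne_chunk produces: entity
    subtrees expose label() and contain (word, tag) tuples, and all
    other children are plain (word, tag) tuples.
    """
    entities = []
    for node in tree:
        # Plain tuples have no label() method, so only subtrees match here.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            entities.append((text, node.label()))
    return entities
```

    With the `tagged_sentences` list from above, calling `extract_entities(nltk.ne_chunk(sent))` in the loop would collect the entities instead of printing whole trees.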

    Of course, for the POS tagging and named entity recognition steps you can use Stanford's POS tagger and NER library instead

    >>> import nltk
    >>> from nltk.tag.stanford import StanfordPOSTagger
    >>> stan_tagger = StanfordPOSTagger(r'D:\Postgraduate\Python\Python自然语言处理\stanford-postagger-full-2018-02-27\stanford-postagger-full-2018-02-27\models\english-bidirectional-distsim.tagger', r'D:\Postgraduate\Python\Python自然语言处理\stanford-postagger-full-2018-02-27\stanford-postagger-full-2018-02-27\stanford-postagger.jar')

    Warning (from warnings module):
      File "C:\Program Files\Python36\lib\site-packages\nltk\tag\stanford.py", line 149
        super(StanfordPOSTagger, self).__init__(*args, **kwargs)
    DeprecationWarning: 
    The StanfordTokenizer will be deprecated in version 3.2.5.
    Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
    >>> s = "I was watching TV"
    >>> tokens = nltk.word_tokenize(s)
    >>> stan_tagger.tag(tokens)
    [('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
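
    The tagger returns `(word, tag)` pairs using Penn Treebank tags, where verb tags start with `VB` and noun tags start with `NN`. Here is a small sketch of filtering that output by tag prefix. `words_with_tag_prefix` is a hypothetical helper name, not an NLTK function:

```python
def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


# The tagger output shown above
tagged = [('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
print(words_with_tag_prefix(tagged, 'VB'))  # ['was', 'watching']
print(words_with_tag_prefix(tagged, 'NN'))  # ['TV']
```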

    Next comes named entity recognition:

    from nltk.tag.stanford import StanfordNERTagger
    # https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip
    st = StanfordNERTagger(r'D:\Postgraduate\Python\Python自然语言处理\stanford-ner-2017-06-09\stanford-ner-2017-06-09\classifiers\english.all.3class.distsim.crf.ser.gz', r'D:\Postgraduate\Python\Python自然语言处理\stanford-ner-2017-06-09\stanford-ner-2017-06-09\stanford-ner.jar')
    st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
    # Output:
    # [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
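
    The tagger labels each token separately, so a multi-word name such as Stony Brook University comes back as three ORGANIZATION tokens. Below is a minimal sketch of merging them, assuming that adjacent tokens with the same non-`O` label belong to one entity (this can wrongly merge two neighboring entities of the same type). `group_entities` is a hypothetical helper, not part of NLTK:

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens sharing a non-'O' label into one entity."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != 'O':
            entities.append((" ".join(word for word, _ in run), label))
    return entities


# The tagger output shown above
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
print(group_entities(tagged))
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION')]
```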

    But the results don't seem very good...

    Life is short, so why not use Python
  • Original post: https://www.cnblogs.com/yqpy/p/9131293.html