I needed to process English text, so I used the nltk package for Python.
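import nltk

# Raw string so the backslashes in the Windows path are taken literally
with open(r"D:\Postgraduate\Python\Python爬取美国商标局专利\s_exp.txt") as f:
    text = f.read()
sentences = nltk.sent_tokenize(text)  # 1. split the text into sentences
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]  # 2. tokenize each sentence
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]  # 3. POS-tag each token list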
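The steps, in order, are:
1. sentence splitting; 2. tokenization; 3. part-of-speech (POS) tagging
Then comes 4. named entity recognition: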
for sent in tagged_sentences:
    print(nltk.ne_chunk(sent))
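Printing each chunked sentence shows the whole tree. If only the entities are wanted, they can be pulled out of it. The following is a minimal sketch, assuming the usual nltk behaviour that ne_chunk returns a Tree whose named-entity children are subtrees with label() and leaves() methods, while ordinary tokens stay plain (word, tag) tuples. The helper name extract_entities is hypothetical and is not part of nltk.

```python
def extract_entities(chunked):
    """Collect (entity text, label) pairs from one nltk.ne_chunk result.

    Hypothetical helper: it only relies on entity subtrees offering
    label() and leaves(), as nltk.Tree does.
    """
    entities = []
    for node in chunked:
        # Entity subtrees expose label(); plain tokens are (word, tag) tuples.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# Usage with the tagged_sentences list built earlier:
# for sent in tagged_sentences:
#     print(extract_entities(nltk.ne_chunk(sent)))
```

The function only looks at the top level of the tree, which is enough for the flat output of ne_chunk.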
Of course, the POS tagging and named entity recognition steps can also be done with Stanford's POS tagger and NER libraries:
>>> from nltk.tag.stanford import StanfordPOSTagger
>>> stan_tagger = StanfordPOSTagger(r'D:\Postgraduate\Python\Python自然语言处理\stanford-postagger-full-2018-02-27\stanford-postagger-full-2018-02-27\models\english-bidirectional-distsim.tagger', r'D:\Postgraduate\Python\Python自然语言处理\stanford-postagger-full-2018-02-27\stanford-postagger-full-2018-02-27\stanford-postagger.jar')

Warning (from warnings module):
  File "C:\Program Files\Python36\lib\site-packages\nltk\tag\stanford.py", line 149
    super(StanfordPOSTagger, self).__init__(*args, **kwargs)
DeprecationWarning: The StanfordTokenizer will be deprecated in version 3.2.5. Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
>>> s = "I was watching TV"
>>> tokens = nltk.word_tokenize(s)
>>> stan_tagger.tag(tokens)
[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
Next comes named entity recognition:
from nltk.tag.stanford import StanfordNERTagger

# https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip
# First argument: the CRF classifier model; second argument: the NER jar
st = StanfordNERTagger(
    r'D:\Postgraduate\Python\Python自然语言处理\stanford-ner-2017-06-09\stanford-ner-2017-06-09\classifiers\english.all.3class.distsim.crf.ser.gz',
    r'D:\Postgraduate\Python\Python自然语言处理\stanford-ner-2017-06-09\stanford-ner-2017-06-09\stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
>>[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
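The tagger labels each token separately, so a multi-word name such as Stony Brook University comes back as three tagged tokens. A rough way to get whole entities is to merge adjacent tokens that share a label. This is only a sketch using the standard library: group_entities is a hypothetical helper, not an nltk function, and it would wrongly merge two different entities of the same type that happen to sit next to each other.

```python
from itertools import groupby

def group_entities(tagged_tokens, outside="O"):
    """Merge runs of adjacent tokens with the same non-O label into one entity."""
    entities = []
    # groupby yields one run per stretch of consecutive tokens with an equal label
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((" ".join(word for word, _ in run), label))
    return entities

# The tagger output shown above
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
print(group_entities(tagged))
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION')]
```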
The results don't seem very good, though.