  • spacy

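    Official documentation: https://spacy.io/api

    Overview of spaCy's features

    spaCy can be used for tokenization, named entity recognition, part-of-speech tagging and more, but a pretrained model has to be downloaded first: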

    pip install --user spacy
    python -m spacy download en_core_web_sm
    pip install neuralcoref
    pip install textacy

    sentencizer

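    • Splits a document into sentences. spaCy does this by setting the is_sent_start attribute to True on certain tokens; these are the tokens that begin a sentence. (With en_core_web_sm the boundaries come from the dependency parser; the rule-based Sentencizer component sets the same attribute based on punctuation.)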
    import spacy
    nlp = spacy.load('en_core_web_sm')  # load the pretrained model
    
    txt = "some text read from one paper ..."
    doc = nlp(txt)
    
    for sent in doc.sents:
        print(sent)
        print('#'*50)
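
    To make the is_sent_start mechanism concrete, here is a minimal pure-Python sketch that needs no spaCy install. The token list and its flags are hand-written assumptions, not model output, and split_on_sent_start is a hypothetical helper rather than a spaCy function. It only shows that one boolean per token is enough to recover the sentence spans.

```python
from typing import List, Tuple


def split_on_sent_start(tokens: List[Tuple[str, bool]]) -> List[List[str]]:
    """Group (text, is_sent_start) pairs into sentences."""
    sentences: List[List[str]] = []
    for text, is_start in tokens:
        # A True flag opens a new sentence; so does the very first token.
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(text)
    return sentences


# Hand-labelled flags (an assumption for illustration, not real model output).
tokens = [
    ("Monopoles", True), ("are", False), ("hypothetical", False), (".", False),
    ("Nobody", True), ("has", False), ("observed", False), ("one", False), (".", False),
]

sentences = split_on_sent_start(tokens)
for sent in sentences:
    print(" ".join(sent))
    print("#" * 50)
```

    With a real Doc, doc.sents yields the same kind of grouping as Span objects.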

    Tokenization

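    Splits a sentence into tokens. English words are usually separated by spaces; punctuation marks become tokens of their own.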

    import spacy
    nlp = spacy.load('en_core_web_sm')
    
    txt = "A magnetic monopole is a hypothetical elementary particle."
    doc = nlp(txt)
    tokens = [token for token in doc]
    print(tokens)
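
    The difference between splitting on spaces and real tokenization shows up at the final period. The sketch below uses only the standard library; the regular expression is a rough stand-in chosen for illustration and is not how spaCy's tokenizer is implemented (spaCy applies language-specific prefix, suffix and exception rules).

```python
import re

txt = "A magnetic monopole is a hypothetical elementary particle."

# Naive approach: split on whitespace only; the period stays glued to "particle".
whitespace_tokens = txt.split()

# Rough approximation of a tokenizer: runs of word characters, or a single
# non-space, non-word character (punctuation) as its own token.
regex_tokens = re.findall(r"\w+|[^\w\s]", txt)

print(whitespace_tokens[-1])   # particle.
print(regex_tokens[-2:])       # ['particle', '.']
```

    The second result has nine tokens, matching the nine tags that the examples in the following sections print for this sentence.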

    Part-of-speech tagging

    • Part-of-speech tagging labels every word in the sentence with its part of speech: noun, verb, adjective and so on.
    pos = [token.pos_ for token in doc]
    print(pos)
    >>> ['DET', 'ADJ', 'NOUN', 'VERB', 'DET', 'ADJ', 'ADJ', 'NOUN', 'PUNCT']
    # i.e. [determiner, adjective, noun, verb, determiner, adjective, adjective, noun, punctuation]
    # The original tokens are [A, magnetic, monopole, is, a, hypothetical, elementary, particle, .]
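
    Once the tags are available they are ordinary strings, so standard tools apply. The sketch below hard-codes the token and tag lists printed above (so it runs without a model) and shows two common follow-ups: counting tags and selecting tokens by tag.

```python
from collections import Counter

# Copied from the output above so the example runs without spaCy.
tokens = ["A", "magnetic", "monopole", "is", "a",
          "hypothetical", "elementary", "particle", "."]
pos = ["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ", "ADJ", "NOUN", "PUNCT"]

# How often does each part of speech occur?
pos_counts = Counter(pos)

# Keep only the tokens tagged as nouns.
nouns = [tok for tok, tag in zip(tokens, pos) if tag == "NOUN"]

print(pos_counts.most_common(1))  # [('ADJ', 3)]
print(nouns)                      # ['monopole', 'particle']
```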
    

    Lemmatization

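    • Lemmatization finds the base form of a word: am, is, are and been are all reduced to be, plurals are reduced to the singular (cats -> cat), and past tense is reduced to present tense (had -> have). In code, use token.lemma_.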
    lem = [token.lemma_ for token in doc]
    print(lem)
    >>> ['a', 'magnetic', 'monopole', 'be', 'a', 'hypothetical', 'elementary', 'particle', '.']
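
    The mappings listed above can be imitated with a lookup table plus one suffix rule. This is a deliberately crude toy, assuming only the examples from the text; toy_lemma and IRREGULAR are hypothetical names, and spaCy's actual lemmatizer is considerably more elaborate.

```python
# Irregular forms taken from the examples in the text.
IRREGULAR = {"am": "be", "is": "be", "are": "be", "been": "be", "had": "have"}


def toy_lemma(word: str) -> str:
    """Return a rough base form: table lookup first, then a plural rule."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    # Crude plural rule: drop a trailing "s" from longer words, but not "ss".
    if len(lower) > 3 and lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]
    return lower


words = ["am", "is", "are", "been", "cats", "had", "particle"]
lemmas = [toy_lemma(w) for w in words]
print(lemmas)  # ['be', 'be', 'be', 'be', 'cat', 'have', 'particle']
```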

    Stop words

    • Identifies stop words such as a, the and so on.
    stop_words = [token.is_stop for token in doc]
    print(stop_words)
    >>> [True, False, False, True, True, False, False, False, False]
    # In this magnetic monopole example the stop words are A, is and a.
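
    The usual reason to look at the stop-word flags is to filter the text down to its content words. The sketch below reuses the tokens and flags printed above as plain lists, so it runs without spaCy; the explicit punctuation check is an assumption of this example.

```python
# Copied from the output above so the example runs without spaCy.
tokens = ["A", "magnetic", "monopole", "is", "a",
          "hypothetical", "elementary", "particle", "."]
is_stop = [True, False, False, True, True, False, False, False, False]

# Drop stop words, and drop the final period as well.
content_words = [
    tok for tok, stop in zip(tokens, is_stop)
    if not stop and tok != "."
]

print(content_words)
# ['magnetic', 'monopole', 'hypothetical', 'elementary', 'particle']
```

    On a real Doc the same filter would test token.is_stop and token.is_punct on each token.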
    

    Dependency Parsing

    Dependency parsing labels the grammatical role of each word, for example subject, predicate, object or connective. In code, use token.dep_.

    dep = [token.dep_ for token in doc]
    print(dep)
    >>> ['det', 'amod', 'nsubj', 'ROOT', 'det', 'amod', 'amod', 'attr', 'punct']
    
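    • spaCy's dependency parser uses the ClearNLP dependency labels (ClearNLP Dependency Labels). According to that label dictionary, the output above translates to: [determiner, adjectival modifier, nominal subject, root, determiner, adjectival modifier, adjectival modifier, attribute, punctuation]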
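
    The label-to-description translation above can be kept in a small dictionary. The sketch below hard-codes the tokens and labels from this example, so it runs without a model; DEP_GLOSSARY is a hypothetical name and covers only the six labels that occur here.

```python
# Descriptions for the labels seen in this example only.
DEP_GLOSSARY = {
    "det": "determiner",
    "amod": "adjectival modifier",
    "nsubj": "nominal subject",
    "ROOT": "root",
    "attr": "attribute",
    "punct": "punctuation",
}

# Copied from the output above so the example runs without spaCy.
tokens = ["A", "magnetic", "monopole", "is", "a",
          "hypothetical", "elementary", "particle", "."]
dep = ["det", "amod", "nsubj", "ROOT", "det", "amod", "amod", "attr", "punct"]

# Print one aligned row per token: text, label, description.
for tok, label in zip(tokens, dep):
    print(f"{tok:<13}{label:<7}{DEP_GLOSSARY[label]}")

# Pick out the subject and the root of the sentence.
subject = tokens[dep.index("nsubj")]
root = tokens[dep.index("ROOT")]
print(subject, root)  # monopole is
```

    spaCy also ships a built-in glossary: spacy.explain("nsubj") returns a short description of a label.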

    Noun Chunks

    • Extracts noun phrases. In code, use doc.noun_chunks.
    noun_chunks = [nc for nc in doc.noun_chunks]
    print(noun_chunks)
    >>> [A magnetic monopole, a hypothetical elementary particle]
    

    Named Entity Recognition

    • Named entity recognition identifies names of people, places and organizations, as well as dates, times, amounts of money, events, products and so on. In code, use doc.ents.
    txt = '''European authorities fined Google a record $5.1 billion
    on Wednesday for abusing its power in the mobile phone market and
    ordered the company to alter its practices'''
    doc = nlp(txt)
    ners = [(ent.text, ent.label_) for ent in doc.ents]
    print(ners)
    >>> [('European', 'NORP'), ('Google', 'ORG'), ('$5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]
    A more detailed list of the named entity label abbreviations:
    https://upload-images.jianshu.io/upload_images/11452592-d7776c24334f0a94.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/720/format/webp
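
    Because doc.ents reduces to (text, label) pairs, a common next step is grouping the entities by label. The sketch below works on the list printed above, hard-coded so that it runs without a model.

```python
from collections import defaultdict

# Copied from the output above so the example runs without spaCy.
ners = [("European", "NORP"), ("Google", "ORG"),
        ("$5.1 billion", "MONEY"), ("Wednesday", "DATE")]

# Map each entity label to the list of entity texts carrying it.
by_label = defaultdict(list)
for text, label in ners:
    by_label[label].append(text)

print(dict(by_label))
print(by_label["ORG"])  # ['Google']
```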

    Coreference Resolution

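    • Coreference resolution finds the entity that a pronoun such as he, she or it refers to. This module needs a pretrained neural coreference model; if you did not install it earlier, run: pip install neuralcoref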
    txt = "My sister has a son and she loves him."
    
    # Add the pretrained neural coreference component to the spaCy pipeline
    import neuralcoref
    neuralcoref.add_to_pipe(nlp)
    
    doc = nlp(txt)
    print(doc._.coref_clusters)
    >>> [My sister: [My sister, she], a son: [a son, him]]
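
    One way to use the clusters is to rewrite the text so that every pronoun is replaced by the mention it refers to. The sketch below does this with plain string substitution on the clusters printed above, written out by hand as a dictionary. That is a simplification made for this example: a real implementation would replace token spans rather than matching words, since the same pronoun can refer to different entities in a longer text.

```python
import re

txt = "My sister has a son and she loves him."

# Clusters from the output above, written as {main mention: [all mentions]}.
clusters = {
    "My sister": ["My sister", "she"],
    "a son": ["a son", "him"],
}

resolved = txt
for main, mentions in clusters.items():
    for mention in mentions:
        if mention != main:
            # Whole-word match so that "she" does not hit inside other words.
            resolved = re.sub(rf"\b{re.escape(mention)}\b", main, resolved)

print(resolved)  # My sister has a son and My sister loves a son.
```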
    

    Display

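    Visualization. This feature gets its own section because it is so cool. Two simple examples follow; the first visualizes the dependency parse:

    from spacy import displacy

    txt = '''In particle physics, a magnetic monopole is a
    hypothetical elementary particle.'''
    displacy.render(nlp(txt), style='dep', jupyter=True,
                    options={'distance': 90})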
    
     
     
    • The second example visualizes named entity recognition:
    from spacy import displacy
    displacy.render(doc, style='ent', jupyter=True)
    
     
     

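    Knowledge extraction

    This part uses textacy, which has to be installed with pip. The semistructured_statements() function in textacy.extract can extract all facts whose subject is Magnetic Monopole and whose predicate has the lemma be. First copy the text of the Wikipedia article on magnetic monopoles into magnetic_monopole.txt.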

    import spacy
    import textacy.extract
    
    nlp = spacy.load('en_core_web_sm')
    
    with open("magnetic_monopole.txt", "r") as fin:
        txt = fin.read()
    
    doc = nlp(txt)
    statements = textacy.extract.semistructured_statements(doc, "monopole")
    for statement in statements:
        subject, verb, fact = statement
        print(f" - {fact}")
    
    • Searching for Magnetic Monopole returns only the third item; searching for monopole gives the following:
    - a singular solution of Maxwell's equation (because it requires removing the worldline from spacetime
    - a [[topological defect]] in a compact U(1) gauge theory
    - a new [[elementary particle]], and would violate [[Gauss's law for magnetism
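
    The idea behind this kind of extraction (find the entity, then a form of be, then keep what follows as the fact) can be approximated with a regular expression. This is a toy illustration only; toy_statements is a hypothetical helper and not how textacy works, since textacy operates on the parsed document rather than on raw text. The second sentence of the sample text is made up for the example.

```python
import re
from typing import List, Tuple

BE_FORMS = r"(?:is|are|was|were)"


def toy_statements(text: str, entity: str) -> List[Tuple[str, str, str]]:
    """Return (subject, verb, fact) triples for 'ENTITY is/are/... FACT.'"""
    pattern = re.compile(
        rf"\b({re.escape(entity)})\s+({BE_FORMS})\s+([^.]+)",
        re.IGNORECASE,
    )
    return [m.groups() for m in pattern.finditer(text)]


text = ("A magnetic monopole is a hypothetical elementary particle. "
        "Magnets are dipoles.")

statements = toy_statements(text, "monopole")
for subject, verb, fact in statements:
    print(f" - {fact}")
```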

    A complete example that reads a file and runs the steps above:

    import spacy
    from spacy import displacy

    # spacy.load('en') relied on a shortcut link; load the model by its package name
    nlp = spacy.load("en_core_web_sm")
    filename = "test.txt"
    with open(filename, encoding="utf-8") as fin:
        document = nlp(fin.read())

    # Visualization
    displacy.render(document, style="ent", jupyter=True)
    displacy.render(document, style="dep", jupyter=True,
                    options={"distance": 90})

    # Tokenization: print the token text, skipping punctuation and whitespace
    print([token.orth_ for token in document if not (token.is_punct or token.is_space)])
    # POS tagging: .pos_ gives the coarse-grained tag, .tag_ the fine-grained tag
    all_tags = {w.pos: w.pos_ for w in document}
    print(all_tags)
    # Named entity recognition
    labels = set(w.label_ for w in document.ents)
    print([(i, i.label_, i.label) for i in document.ents])
  • Original article: https://www.cnblogs.com/pythonclass/p/11310587.html