zoukankan      html  css  js  c++  java
  • 自然语言18.1_Named Entity Recognition with NLTK

     

    sklearn实战-乳腺癌细胞数据挖掘(博主亲自录制视频教程)

     

    https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

     

     

     

    QQ:231469242

    欢迎nltk爱好者交流

     

    https://www.pythonprogramming.net/named-entity-recognition-nltk-tutorial/?completed=/chinking-nltk-tutorial/

    Named Entity Recognition with NLTK

    命名实体(Named Entity)类别识别



    This is a temporary script file.
    """
    
    import nltk
    from nltk.corpus import state_union
    from nltk.tokenize import PunktSentenceTokenizer
    
    sentence="Bush is a pig in WhiteHouse in America."
    words=nltk.word_tokenize(sentence)
    tagged=nltk.pos_tag(words)
    nameEnt=nltk.ne_chunk(tagged,binary=False)
        
    nameEnt.draw()
        
        
    

    This is a temporary script file.
    """
    
    import nltk
    from nltk.corpus import state_union
    from nltk.tokenize import PunktSentenceTokenizer
    
    train_text=state_union.raw("2005-GWBush.txt")
    sample_text=state_union.raw("2006-GWBush.txt")
    
    custom_sent_tokenizer=PunktSentenceTokenizer(train_text)
    #分句
    tokenized=custom_sent_tokenizer.tokenize(sample_text)
    
    
    for i in tokenized[0:5]:
        words=nltk.word_tokenize(i)
        tagged=nltk.pos_tag(words)
        nameEnt=nltk.ne_chunk(tagged,binary=False)
        #print(nameEnt)
        nameEnt.draw()
        
    
    nameEnt=nltk.ne_chunk(tagged,binary=True)


    nameEnt=nltk.ne_chunk(tagged,binary=False)

    One of the most major forms of chunking in natural language processing is called "Named Entity Recognition." The idea is to have the machine immediately be able to pull out "entities" like people, places, things, locations, monetary figures, and more.

    This can be a bit of a challenge, but NLTK is this built in for us. There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

    Here's an example:

    import nltk
    from nltk.corpus import state_union
    from nltk.tokenize import PunktSentenceTokenizer
    
    train_text = state_union.raw("2005-GWBush.txt")
    sample_text = state_union.raw("2006-GWBush.txt")
    
    custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
    
    tokenized = custom_sent_tokenizer.tokenize(sample_text)
    
    def process_content():
        try:
            for i in tokenized[5:]:
                words = nltk.word_tokenize(i)
                tagged = nltk.pos_tag(words)
                namedEnt = nltk.ne_chunk(tagged, binary=True)
                namedEnt.draw()
        except Exception as e:
            print(str(e))
    
    
    process_content()

    Here, with the option of binary = True, this means either something is a named entity, or not. There will be no further detail. The result is:

    If you set binary = False, then the result is:

    Immediately, you can see a few things. When Binary is False, it picked up the same things, but wound up splitting up terms like White House into "White" and "House" as if they were different, whereas we could see in the binary = True option, the named entity recognition was correct to say White House was part of the same named entity.

    Depending on your goals, you may use the binary option how you see fit. Here are the types of Named Entities that you can get if you have binary as false:

    NE Type and Examples
    ORGANIZATION - Georgia-Pacific Corp., WHO
    PERSON - Eddy Bonte, President Obama
    LOCATION - Murray River, Mount Everest
    DATE - June, 2008-06-29
    TIME - two fifty a m, 1:30 p.m.
    MONEY - 175 million Canadian Dollars, GBP 10.40
    PERCENT - twenty pct, 18.75 %
    FACILITY - Washington Monument, Stonehenge
    GPE - South East Asia, Midlothian

    Either way, you will probably find that you need to do a bit more work to get it just right, but this is pretty powerful right out of the box.

    In the next tutorial, we're going to talk about something similar to stemming, called lemmatizing.

  • 相关阅读:
    痞子衡嵌入式:MCUXpresso IDE下SDK工程在Build配置上与IAR,MDK差异
    13万字详细分析JDK中Stream的实现原理
    扫码登录是这样登录的
    [Vue深入组件-边界情况处理] 控制更新
    [Vue深入组件-边界情况处理] 模板定义的替代品
    [Vue深入组件]:递归组件和组件的循环引用
    #antdv 清除指定字段验证 #antdv表单验证指定清除
    [Vue深入组件-边界情况处理] 程序化的事件监听器
    [Vue深入组件-边界情况处理] 访问元素 & 组件
    [Vue深入组件]:Slot插槽
  • 原文地址:https://www.cnblogs.com/webRobot/p/6080155.html
Copyright © 2011-2022 走看看