  • Experiment 2: Word Segmentation

    jieba

    import jieba
    
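    # Read both sample texts as UTF-8 and close the files promptly.
    with open("/Users/war/Desktop/NLP/Experiment2/English.txt", encoding="utf-8") as f:
        Eng = f.read()
    
    with open("/Users/war/Desktop/NLP/Experiment2/Chinese.txt", encoding="utf-8") as f:
        Ch = f.read()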
    
    print(Eng)
    
    Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School. He was appointed president of his family's real estate business in 1971, renamed it The Trump Organization, and expanded it from Queens and Brooklyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, including licensing his name for real estate and consumer products. He managed the company until his 2017 inauguration. He co-authored several books, including The Art of the Deal. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and he produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion.
    
    print(Ch)
    
    央视315晚会曝光湖北省知名的神丹牌、莲田牌“土鸡蛋”实为普通鸡蛋冒充,同时在商标上玩猫腻,分别注册“鲜土”、注册“好土”商标,让消费者误以为是“土鸡蛋”。3月15日晚间,新京报记者就此事致电湖北神丹健康食品有限公司方面,其工作人员表示不知情,需要了解清楚情况,截至发稿暂未取得最新回应。新京报记者还查询发现,湖北神丹健康食品有限公司为农业产业化国家重点龙头企业、高新技术企业,此前曾因涉嫌虚假宣传“中国最大的蛋品企业”而被罚6万元。
    

    Accurate mode

    seg_list = jieba.cut(Eng, cut_all=False)
    print(" ".join(seg_list))
    
    Building prefix dict from the default dictionary ...
    Loading model from cache /tmp/jieba.cache
    Loading model cost 0.704 seconds.
    Prefix dict has been built successfully.
    
    
    Trump   was   born   and   raised   in   the   New   York   City   borough   of   Queens   and   received   an   economics   degree   from   the   Wharton   School .   He   was   appointed   president   of   his   family ' s   real   estate   business   in   1971 ,   renamed   it   The   Trump   Organization ,   and   expanded   it   from   Queens   and   Brooklyn   into   Manhattan .   The   company   built   or   renovated   skyscrapers ,   hotels ,   casinos ,   and   golf   courses .   Trump   later   started   various   side   ventures ,   including   licensing   his   name   for   real   estate   and   consumer   products .   He   managed   the   company   until   his   2017   inauguration .   He   co - authored   several   books ,   including   The   Art   of   the   Deal .   He   owned   the   Miss   Universe   and   Miss   USA   beauty   pageants   from   1996   to   2015 ,   and   he   produced   and   hosted   The   Apprentice ,   a   reality   television   show ,   from   2003   to   2015 .   Forbes   estimates   his   net   worth   to   be   $ 3.1   billion .
    
    seg_list = jieba.cut(Ch, cut_all=False)
    print(" ".join(seg_list))
    
     央视 315 晚会 曝光 湖北省 知名 的 神丹 牌 、 莲田牌 “ 土 鸡蛋 ” 实为 普通 鸡蛋 冒充 , 同时 在 商标 上 玩 猫腻 , 分别 注册 “ 鲜土 ” 、 注册 “ 好土 ” 商标 , 让 消费者 误以为 是 “ 土 鸡蛋 ” 。 3 月 15 日 晚间 , 新 京报 记者 就 此事 致电 湖北 神丹 健康 食品 有限公司 方面 , 其 工作人员 表示 不知情 , 需要 了解 清楚 情况 , 截至 发稿 暂未 取得 最新 回应 。 新 京报 记者 还 查询 发现 , 湖北 神丹 健康 食品 有限公司 为 农业 产业化 国家 重点 龙头企业 、 高新技术 企业 , 此前 曾 因涉嫌 虚假 宣传 “ 中国 最大 的 蛋品 企业 ” 而 被 罚 6 万元 。
    

    Full mode

    seg_list = jieba.cut(Eng, cut_all=True)
    print(" ".join(seg_list))
    
    Trump     was     born     and     raised     in     the     New     York     City     borough     of     Queens     and     received     an     economics     degree     from     the     Wharton     School .     He     was     appointed     president     of     his     family ' s     real     estate     business     in     1971 ,    renamed     it     The     Trump     Organization ,    and     expanded     it     from     Queens     and     Brooklyn     into     Manhattan .     The     company     built     or     renovated     skyscrapers ,    hotels ,    casinos ,    and     golf     courses .     Trump     later     started     various     side     ventures ,    including     licensing     his     name     for     real     estate     and     consumer     products .     He     managed     the     company     until     his     2017     inauguration .     He     co - authored     several     books ,    including     The     Art     of     the     Deal .     He     owned     the     Miss     Universe     and     Miss     USA     beauty     pageants     from     1996     to     2015 ,    and     he     produced     and     hosted     The     Apprentice ,    a     reality     television     show ,    from     2003     to     2015 .     Forbes     estimates     his     net     worth     to     be    $ 3 . 1     billion .
    
    seg_list = jieba.cut(Ch, cut_all=True)
    print(" ".join(seg_list))
    
     央视 315 晚会 曝光 湖北 湖北省 知名 的 神丹 牌 、 莲 田 牌 “ 土鸡 鸡蛋 ” 实为 普通 鸡蛋 冒充 , 同时 在 商标 标上 玩 猫腻 , 分别 注册 “ 鲜 土 ”、 注册 “ 好 土 ” 商标 , 让 消费 消费者 误以为 以为 是 “ 土鸡 鸡蛋 ”。 3 月 15 日 晚间 , 新 京报 记者 就此 此事 致电 湖北 神丹 健康 食品 有限 有限公司 公司 方面 , 其 工作 工作人员 作人 人员 表示 不知 不知情 知情 , 需要 了解 清楚 情况 , 截至 发稿 暂 未取 取得 最新 回应 。 新 京报 记者 还 查询 发现 , 湖北 神丹 健康 食品 有限 有限公司 公司 为 农业 农业产业 产业 产业化 国家 重点 龙头 龙头企业 企业 、 高新 高新技术 技术 企业 , 此前 曾 因涉嫌 涉嫌 虚假 宣传 “ 中国 最大 的 蛋品 企业 ” 而 被 罚 6 万元 。
    
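Full mode lists every dictionary word it can find, overlaps included (for example both 湖北 and 湖北省), whereas accurate mode commits to a single segmentation. The toy sketch below illustrates that difference only. It is not jieba's algorithm: jieba's accurate mode picks the most probable path through a word graph using word frequencies, while this sketch swaps in plain forward maximum matching, and `vocab` is a hypothetical five-word dictionary.

```python
def full_mode(text, vocab, max_len=4):
    """Return every dictionary word found at every start position (overlaps kept)."""
    found = []
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            if text[i:j] in vocab:
                found.append(text[i:j])
    return found


def forward_max_match(text, vocab, max_len=4):
    """Greedy stand-in for accurate mode: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # Fall back to a single character when no dictionary word matches.
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Hypothetical toy dictionary; jieba's real dictionary is far larger.
vocab = {"湖北", "湖北省", "知名", "消费", "消费者"}
text = "湖北省知名消费者"

print(full_mode(text, vocab))          # overlapping candidates
print(forward_max_match(text, vocab))  # one non-overlapping segmentation
```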

    Search engine mode

    seg_list = jieba.cut_for_search(Eng)
    print(" ".join(seg_list))
    
    Trump   was   born   and   raised   in   the   New   York   City   borough   of   Queens   and   received   an   economics   degree   from   the   Wharton   School .   He   was   appointed   president   of   his   family ' s   real   estate   business   in   1971 ,   renamed   it   The   Trump   Organization ,   and   expanded   it   from   Queens   and   Brooklyn   into   Manhattan .   The   company   built   or   renovated   skyscrapers ,   hotels ,   casinos ,   and   golf   courses .   Trump   later   started   various   side   ventures ,   including   licensing   his   name   for   real   estate   and   consumer   products .   He   managed   the   company   until   his   2017   inauguration .   He   co - authored   several   books ,   including   The   Art   of   the   Deal .   He   owned   the   Miss   Universe   and   Miss   USA   beauty   pageants   from   1996   to   2015 ,   and   he   produced   and   hosted   The   Apprentice ,   a   reality   television   show ,   from   2003   to   2015 .   Forbes   estimates   his   net   worth   to   be   $ 3.1   billion .
    
    seg_list = jieba.cut_for_search(Ch)
    print(" ".join(seg_list))
    
     央视 315 晚会 曝光 湖北 湖北省 知名 的 神丹 牌 、 莲田牌 “ 土 鸡蛋 ” 实为 普通 鸡蛋 冒充 , 同时 在 商标 上 玩 猫腻 , 分别 注册 “ 鲜土 ” 、 注册 “ 好土 ” 商标 , 让 消费 消费者 以为 误以为 是 “ 土 鸡蛋 ” 。 3 月 15 日 晚间 , 新 京报 记者 就 此事 致电 湖北 神丹 健康 食品 有限 公司 有限公司 方面 , 其 工作 作人 人员 工作人员 表示 不知 知情 不知情 , 需要 了解 清楚 情况 , 截至 发稿 暂未 取得 最新 回应 。 新 京报 记者 还 查询 发现 , 湖北 神丹 健康 食品 有限 公司 有限公司 为 农业 产业 产业化 国家 重点 龙头 企业 龙头企业 、 高新 技术 高新技术 企业 , 此前 曾 涉嫌 因涉嫌 虚假 宣传 “ 中国 最大 的 蛋品 企业 ” 而 被 罚 6 万元 。
    
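The search engine output above looks like the accurate-mode result with long words additionally broken into shorter dictionary words (工作 作人 人员 before 工作人员). As far as I understand jieba's behaviour, each word longer than two characters contributes its in-dictionary 2-grams, each word longer than three characters its in-dictionary 3-grams, and then the word itself. A minimal sketch of that expansion step, where `toy_vocab` and the function name are hypothetical:

```python
def expand_for_search(words, vocab):
    """Emit in-vocabulary 2-grams and 3-grams of long words, then the word itself."""
    out = []
    for w in words:
        if len(w) > 2:
            out.extend(w[i:i + 2] for i in range(len(w) - 1) if w[i:i + 2] in vocab)
        if len(w) > 3:
            out.extend(w[i:i + 3] for i in range(len(w) - 2) if w[i:i + 3] in vocab)
        out.append(w)
    return out


# Hypothetical toy dictionary and a hand-written accurate-mode result.
toy_vocab = {"工作", "作人", "人员", "湖北", "不知", "知情"}
precise = ["湖北省", "工作人员", "不知情", "记者"]

print(expand_for_search(precise, toy_vocab))
```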

    Custom dictionary

    jieba.load_userdict("/Users/war/Desktop/NLP/Experiment2/userdict.txt")
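The contents of userdict.txt are not shown in the post. jieba's documented format is one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. The sketch below writes such a file to a temporary directory. The three entries are hypothetical examples picked from the sample text, not the author's actual dictionary.

```python
import os
import tempfile


def format_entry(word, freq=None, tag=None):
    """Build one dictionary line: word, then optional frequency, then optional POS tag."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return " ".join(parts)


# Hypothetical entries; frequencies and the "nz" tag are illustrative values.
entries = [("新京报", 10, "nz"), ("神丹牌", None, None), ("土鸡蛋", 5, None)]
lines = [format_entry(*entry) for entry in entries]

userdict_path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
with open(userdict_path, "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

# With jieba installed, the file could then be loaded before segmenting:
#   jieba.load_userdict(userdict_path)
print(lines)
```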
    

    SnowNLP

    from snownlp import SnowNLP
    
    s_ch = SnowNLP(Ch)
    s_eng = SnowNLP(Eng)
    
    print(s_eng.words)
    
    ['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School.', 'He', 'was', 'appointed', 'president', 'of', 'his', "family's", 'real', 'estate', 'business', 'in', '1971,', 'renamed', 'it', 'The', 'Trump', 'Organization,', 'and', 'expanded', 'it', 'from', 'Queens', 'and', 'Brooklyn', 'into', 'Manhattan.', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers,', 'hotels,', 'casinos,', 'and', 'golf', 'courses.', 'Trump', 'later', 'started', 'various', 'side', 'ventures,', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products.', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration.', 'He', 'co-authored', 'several', 'books,', 'including', 'The', 'Art', 'of', 'the', 'Deal.', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015,', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice,', 'a', 'reality', 'television', 'show,', 'from', '2003', 'to', '2015.', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '$3.1', 'billion.']
    
    print(s_ch.words)
    
    ['ufeff', '央视', '315', '晚会', '曝光', '湖北省', '知名', '的', '神丹', '牌', '、', '莲', '田', '牌', '“', '土', '鸡蛋', '”', '实', '为', '普通', '鸡蛋', '冒充', ',', '同时', '在', '商标', '上', '玩猫', '腻', ',', '分别', '注册', '“', '鲜', '土', '”、', '注册', '“', '好', '土', '”', '商标', ',', '让', '消费者', '误', '以为', '是', '“', '土', '鸡蛋', '”。3', '月', '15', '日', '晚间', ',', '新京', '报', '记者', '就', '此事', '致电', '湖北', '神', '丹', '健康', '食品', '有限公司', '方面', ',', '其', '工作', '人员', '表示', '不', '知情', ',', '需要', '了解', '清楚', '情况', ',', '截至', '发稿', '暂', '未', '取得', '最新', '回应', '。', '新京', '报', '记者', '还', '查询', '发现', ',', '湖北', '神', '丹', '健康', '食品', '有限公司', '为', '农业', '产业化', '国家', '重点', '龙头', '企业', '、', '高新技术', '企业', ',', '此前', '曾', '因', '涉嫌', '虚假', '宣传', '“', '中国', '最', '大', '的', '蛋品', '企业', '”', '而', '被', '罚', '6', '万', '元', '。']
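The first token in the list above, `'ufeff'`, looks like a byte order mark (U+FEFF) left at the start of Chinese.txt, with the backslash presumably lost when the page was rendered. If that is the cause, opening the file with the `utf-8-sig` codec should drop it before any segmenter sees it. A small sketch using a temporary demo file in place of the real one:

```python
import codecs
import os
import tempfile


def read_text(path):
    """Read a UTF-8 file, dropping a leading byte order mark if present."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


# Build a small BOM-prefixed file to stand in for Chinese.txt (hypothetical content).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "央视315晚会".encode("utf-8"))

with open(demo_path, encoding="utf-8") as f:
    raw = f.read()            # plain utf-8 keeps the BOM as the first character
clean = read_text(demo_path)  # utf-8-sig removes it

print(repr(raw[0]), repr(clean[0]))
```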
    

    THULAC

    import thulac
    
    thu = thulac.thulac(seg_only=True)  # segmentation only, no POS tagging
    s_ch = thu.cut(Ch)  # segment the Chinese text
    print(s_ch)
    
    Model loaded succeed
    [['ufeff央', ''], ['视', ''], ['315', ''], ['晚会', ''], ['曝光', ''], ['湖北省', ''], ['知名', ''], ['的', ''], ['神丹牌', ''], ['、', ''], ['莲田牌', ''], ['“', ''], ['土鸡蛋', ''], ['”', ''], ['实', ''], ['为', ''], ['普通', ''], ['鸡蛋', ''], ['冒充', ''], [',', ''], ['同时', ''], ['在', ''], ['商标', ''], ['上', ''], ['玩', ''], ['猫腻', ''], [',', ''], ['分别', ''], ['注册', ''], ['“', ''], ['鲜土', ''], ['”', ''], ['、', ''], ['注册', ''], ['“', ''], ['好', ''], ['土', ''], ['”', ''], ['商标', ''], [',', ''], ['让', ''], ['消费者', ''], ['误', ''], ['以为', ''], ['是', ''], ['“', ''], ['土鸡蛋', ''], ['”', ''], ['。', ''], ['3月', ''], ['15日', ''], ['晚间', ''], [',', ''], ['新', ''], ['京报', ''], ['记者', ''], ['就', ''], ['此事', ''], ['致电', ''], ['湖北', ''], ['神丹', ''], ['健康', ''], ['食品', ''], ['有限公司', ''], ['方面', ''], [',', ''], ['其', ''], ['工作', ''], ['人员', ''], ['表示', ''], ['不', ''], ['知', ''], ['情', ''], [',', ''], ['需要', ''], ['了', ''], ['解', ''], ['清楚', ''], ['情况', ''], [',', ''], ['截至', ''], ['发稿', ''], ['暂', ''], ['未', ''], ['取得', ''], ['最新', ''], ['回应', ''], ['。', ''], ['新', ''], ['京报', ''], ['记者', ''], ['还', ''], ['查询', ''], ['发现', ''], [',', ''], ['湖北', ''], ['神丹', ''], ['健康', ''], ['食品', ''], ['有限公司', ''], ['为', ''], ['农业', ''], ['产业化', ''], ['国', ''], ['家', ''], ['重点', ''], ['龙头', ''], ['企业', ''], ['、', ''], ['高新技术', ''], ['企业', ''], [',', ''], ['此前', ''], ['曾', ''], ['因', ''], ['涉嫌', ''], ['虚假', ''], ['宣传', ''], ['“', ''], ['中国', ''], ['最', ''], ['大', ''], ['的', ''], ['蛋品', ''], ['企业', ''], ['”', ''], ['而', ''], ['被', ''], ['罚', ''], ['6万', ''], ['元', ''], ['。', '']]
    
    s_eng = thu.cut(Eng)  # segment the English text
    print(s_eng)
    
    [['Trump', ''], [' ', ''], ['was', ''], [' ', ''], ['born', ''], [' ', ''], ['and', ''], [' ', ''], ['raised', ''], [' ', ''], ['in', ''], [' ', ''], ['the', ''], [' ', ''], ['New', ''], [' ', ''], ['York', ''], [' ', ''], ['City', ''], [' ', ''], ['borough', ''], [' ', ''], ['of', ''], [' ', ''], ['Queens', ''], [' ', ''], ['and', ''], [' ', ''], ['received', ''], [' ', ''], ['an', ''], [' ', ''], ['economics', ''], [' ', ''], ['degree', ''], [' ', ''], ['from', ''], [' ', ''], ['the', ''], [' ', ''], ['Wharton', ''], [' ', ''], ['School', ''], ['.', ''], [' ', ''], ['He', ''], [' ', ''], ['was', ''], [' ', ''], ['appointed', ''], [' ', ''], ['president', ''], [' ', ''], ['of', ''], [' ', ''], ['his', ''], [' ', ''], ['family', ''], ["'", ''], ['s', ''], [' ', ''], ['real', ''], [' ', ''], ['estate', ''], [' ', ''], ['business', ''], [' ', ''], ['in', ''], [' ', ''], ['1971', ''], [',', ''], [' ', ''], ['renamed', ''], [' ', ''], ['it', ''], [' ', ''], ['The', ''], [' ', ''], ['Trump', ''], [' ', ''], ['Organization', ''], [',', ''], [' ', ''], ['and', ''], [' ', ''], ['expanded', ''], [' ', ''], ['it', ''], [' ', ''], ['from', ''], [' ', ''], ['Queens', ''], [' ', ''], ['and', ''], [' ', ''], ['Brooklyn', ''], [' ', ''], ['into', ''], [' ', ''], ['Manhatta', ''], ['n', ''], ['.', ''], [' ', ''], ['The', ''], [' ', ''], ['company', ''], [' ', ''], ['built', ''], [' ', ''], ['o', ''], ['r', ''], [' ', ''], ['renovated', ''], [' ', ''], ['skyscrapers', ''], [',', ''], [' ', ''], ['hotels', ''], [',', ''], [' ', ''], ['casinos', ''], [',', ''], [' ', ''], ['and', ''], [' ', ''], ['golf', ''], [' ', ''], ['courses', ''], ['.', ''], [' ', ''], ['Trump', ''], [' ', ''], ['later', ''], [' ', ''], ['started', ''], [' ', ''], ['various', ''], [' ', ''], ['side', ''], [' ', ''], ['ventures', ''], [',', ''], [' ', ''], ['including', ''], [' ', ''], ['licens', ''], ['ing', ''], [' ', ''], ['his', ''], [' ', ''], ['name', ''], [' ', ''], ['for', ''], [' ', ''], ['real', 
''], [' ', ''], ['estate', ''], [' ', ''], ['and', ''], [' ', ''], ['cons', ''], ['umer', ''], [' ', ''], ['products', ''], ['.', ''], [' ', ''], ['He', ''], [' ', ''], ['managed', ''], [' ', ''], ['the', ''], [' ', ''], ['company', ''], [' ', ''], ['until', ''], [' ', ''], ['his', ''], [' ', ''], ['2017', ''], [' ', ''], ['inauguration', ''], ['.', ''], [' ', ''], ['He', ''], [' ', ''], ['c', ''], ['o', ''], ['-', ''], ['authored', ''], [' ', ''], ['several', ''], [' ', ''], ['books', ''], [',', ''], [' ', ''], ['including', ''], [' ', ''], ['The', ''], [' ', ''], ['Art', ''], [' ', ''], ['of', ''], [' ', ''], ['the', ''], [' ', ''], ['Deal', ''], ['.', ''], [' ', ''], ['He', ''], [' ', ''], ['owned', ''], [' ', ''], ['the', ''], [' ', ''], ['Miss', ''], [' ', ''], ['Universe', ''], [' ', ''], ['and', ''], [' ', ''], ['Miss', ''], [' ', ''], ['USA', ''], [' ', ''], ['beauty', ''], [' ', ''], ['pageants', ''], [' ', ''], ['from', ''], [' ', ''], ['1996', ''], [' ', ''], ['t', ''], ['o', ''], [' ', ''], ['2015', ''], [',', ''], [' ', ''], ['and', ''], [' ', ''], ['he', ''], [' ', ''], ['produced', ''], [' ', ''], ['and', ''], [' ', ''], ['hosted', ''], [' ', ''], ['The', ''], [' ', ''], ['Apprentice', ''], [',', ''], [' ', ''], ['a', ''], [' ', ''], ['reality', ''], [' ', ''], ['television', ''], [' ', ''], ['show', ''], [',', ''], [' ', ''], ['from', ''], [' ', ''], ['2003', ''], [' ', ''], ['t', ''], ['o', ''], [' ', ''], ['2015', ''], ['.', ''], [' ', ''], ['Forbes', ''], [' ', ''], ['estimates', ''], [' ', ''], ['his', ''], [' ', ''], ['net', ''], [' ', ''], ['worth', ''], [' ', ''], ['t', ''], ['o', ''], [' ', ''], ['be', ''], [' ', ''], ['$', ''], ['3', ''], ['.', ''], ['1', ''], [' ', ''], ['billion', ''], ['.', '']]
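With `seg_only=True`, THULAC returns `[word, tag]` pairs whose tag is an empty string, and for the English text it also emits each space as its own token. To compare against the other tools it helps to flatten the pairs into a word list and drop the whitespace entries. A minimal sketch, where `thulac_words` is a hypothetical helper and `sample` is copied from the start of the output above:

```python
def thulac_words(pairs):
    """Keep only the word from each [word, tag] pair, skipping whitespace-only tokens."""
    return [word for word, _tag in pairs if word.strip()]


# First few pairs of the English output above.
sample = [["Trump", ""], [" ", ""], ["was", ""], [" ", ""], ["born", ""]]

print(thulac_words(sample))
```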
    

    PyNLPIR

    import pynlpir
    
    pynlpir.open()
    
    pynlpir.segment(Ch, pos_tagging=False)
    
    ['央',
     '视',
     '315',
     '晚会',
     '曝光',
     '湖北省',
     '知名',
     '的',
     '神',
     '丹',
     '牌',
     '、',
     '莲',
     '田',
     '牌',
     '“',
     '土',
     '鸡蛋',
     '”',
     '实',
     '为',
     '普通',
     '鸡蛋',
     '冒充',
     ',',
     '同时',
     '在',
     '商标',
     '上',
     '玩',
     '猫腻',
     ',',
     '分别',
     '注册',
     '“',
     '鲜',
     '土',
     '”',
     '、',
     '注册',
     '“',
     '好',
     '土',
     '”',
     '商标',
     ',',
     '让',
     '消费者',
     '误',
     '以为',
     '是',
     '“',
     '土',
     '鸡蛋',
     '”',
     '。',
     '3月',
     '15日',
     '晚间',
     ',',
     '新京报',
     '记者',
     '就',
     '此事',
     '致电',
     '湖北',
     '神',
     '丹',
     '健康',
     '食品',
     '有限公司',
     '方面',
     ',',
     '其',
     '工作',
     '人员',
     '表示',
     '不',
     '知',
     '情',
     ',',
     '需要',
     '了解',
     '清楚',
     '情况',
     ',',
     '截至',
     '发稿',
     '暂',
     '未',
     '取得',
     '最新',
     '回应',
     '。',
     '新京报',
     '记者',
     '还',
     '查询',
     '发现',
     ',',
     '湖北',
     '神',
     '丹',
     '健康',
     '食品',
     '有限公司',
     '为',
     '农业',
     '产业化',
     '国家',
     '重点',
     '龙头',
     '企业',
     '、',
     '高新技术',
     '企业',
     ',',
     '此前',
     '曾',
     '因',
     '涉嫌',
     '虚假',
     '宣传',
     '“',
     '中国',
     '最',
     '大',
     '的',
     '蛋品',
     '企业',
     '”',
     '而',
     '被',
     '罚',
     '6万',
     '元',
     '。']
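The tools disagree on some items. For instance, jieba's accurate mode gives 新 / 京报 while PyNLPIR keeps 新京报 whole. One rough way to quantify such differences is to compare token spans between two outputs. The sketch below does that for a short excerpt copied from the outputs above. Treating PyNLPIR as the reference is an arbitrary choice for illustration, since neither output is a gold standard, and both token lists are assumed to be non-empty.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def agreement(candidate, reference):
    """Span-level precision, recall and F1 of candidate against reference."""
    if "".join(candidate) != "".join(reference):
        raise ValueError("both token lists must cover the same text")
    cand, ref = to_spans(candidate), to_spans(reference)
    shared = len(cand & ref)
    precision = shared / len(cand)
    recall = shared / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return precision, recall, f1


# Excerpts copied from the jieba (accurate mode) and PyNLPIR outputs.
jieba_tokens = ["新", "京报", "记者", "就", "此事", "致电"]
pynlpir_tokens = ["新京报", "记者", "就", "此事", "致电"]

precision, recall, f1 = agreement(jieba_tokens, pynlpir_tokens)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```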
    
    pynlpir.segment(Eng, pos_tagging=False)
    
    ['Trump',
     'was',
     'born',
     'and',
     'raised',
     'in',
     'the',
     'New',
     'York',
     'City',
     'borough',
     'of',
     'Queens',
     'and',
     'received',
     'an',
     'economics',
     'degree',
     'from',
     'the',
     'Wharton',
     'School',
     '.',
     'He',
     'was',
     'appointed',
     'president',
     'of',
     'his',
     'family',
     "'s",
     'real',
     'estate',
     'business',
     'in',
     '1971',
     ',',
     'renamed',
     'it',
     'The',
     'Trump',
     'Organization',
     ',',
     'and',
     'expanded',
     'it',
     'from',
     'Queens',
     'and',
     'Brooklyn',
     'into',
     'Manhattan',
     '.',
     'The',
     'company',
     'built',
     'or',
     'renovated',
     'skyscrapers',
     ',',
     'hotels',
     ',',
     'casinos',
     ',',
     'and',
     'golf',
     'courses',
     '.',
     'Trump',
     'later',
     'started',
     'various',
     'side',
     'ventures',
     ',',
     'including',
     'licensing',
     'his',
     'name',
     'for',
     'real',
     'estate',
     'and',
     'consumer',
     'products',
     '.',
     'He',
     'managed',
     'the',
     'company',
     'until',
     'his',
     '2017',
     'inauguration',
     '.',
     'He',
     'co',
     '-',
     'authored',
     'several',
     'books',
     ',',
     'including',
     'The',
     'Art',
     'of',
     'the',
     'Deal',
     '.',
     'He',
     'owned',
     'the',
     'Miss',
     'Universe',
     'and',
     'Miss',
     'USA',
     'beauty',
     'pageants',
     'from',
     '1996',
     'to',
     '2015',
     ',',
     'and',
     'he',
     'produced',
     'and',
     'hosted',
     'The',
     'Apprentice',
     ',',
     'a',
     'reality',
     'television',
     'show',
     ',',
     'from',
     '2003',
     'to',
     '2015',
     '.',
     'Forbes',
     'estimates',
     'his',
     'net',
     'worth',
     'to',
     'be',
     '$',
     '3.1',
     'billion',
     '.']
    

    stanfordcorenlp

    from stanfordcorenlp import StanfordCoreNLP
    
    nlp = StanfordCoreNLP("/Users/war/Desktop/NLP/Experiment2/stanford-corenlp-4.2.0")
    
    nlp.word_tokenize(Eng)
    
    ['Trump',
     'was',
     'born',
     'and',
     'raised',
     'in',
     'the',
     'New',
     'York',
     'City',
     'borough',
     'of',
     'Queens',
     'and',
     'received',
     'an',
     'economics',
     'degree',
     'from',
     'the',
     'Wharton',
     'School',
     '.',
     'He',
     'was',
     'appointed',
     'president',
     'of',
     'his',
     'family',
     "'s",
     'real',
     'estate',
     'business',
     'in',
     '1971',
     ',',
     'renamed',
     'it',
     'The',
     'Trump',
     'Organization',
     ',',
     'and',
     'expanded',
     'it',
     'from',
     'Queens',
     'and',
     'Brooklyn',
     'into',
     'Manhattan',
     '.',
     'The',
     'company',
     'built',
     'or',
     'renovated',
     'skyscrapers',
     ',',
     'hotels',
     ',',
     'casinos',
     ',',
     'and',
     'golf',
     'courses',
     '.',
     'Trump',
     'later',
     'started',
     'various',
     'side',
     'ventures',
     ',',
     'including',
     'licensing',
     'his',
     'name',
     'for',
     'real',
     'estate',
     'and',
     'consumer',
     'products',
     '.',
     'He',
     'managed',
     'the',
     'company',
     'until',
     'his',
     '2017',
     'inauguration',
     '.',
     'He',
     'co-authored',
     'several',
     'books',
     ',',
     'including',
     'The',
     'Art',
     'of',
     'the',
     'Deal',
     '.',
     'He',
     'owned',
     'the',
     'Miss',
     'Universe',
     'and',
     'Miss',
     'USA',
     'beauty',
     'pageants',
     'from',
     '1996',
     'to',
     '2015',
     ',',
     'and',
     'he',
     'produced',
     'and',
     'hosted',
     'The',
     'Apprentice',
     ',',
     'a',
     'reality',
     'television',
     'show',
     ',',
     'from',
     '2003',
     'to',
     '2015',
     '.',
     'Forbes',
     'estimates',
     'his',
     'net',
     'worth',
     'to',
     'be',
     '$',
     '3.1',
     'billion',
     '.']
    
    nlp.word_tokenize(Ch)
    
    ['央视315晚会曝光湖北省知名的神丹牌',
     '、',
     '莲田牌',
     '“',
     '土鸡蛋',
     '”',
     '实为普通鸡蛋冒充',
     ',',
     '同时在商标上玩猫腻',
     ',',
     '分别注册',
     '“',
     '鲜土',
     '”',
     '、',
     '注册',
     '“',
     '好土',
     '”',
     '商标',
     ',',
     '让消费者误以为是',
     '“',
     '土鸡蛋',
     '”',
     '。',
     '3月15日晚间',
     ',',
     '新京报记者就此事致电湖北神丹健康食品有限公司方面',
     ',',
     '其工作人员表示不知情',
     ',',
     '需要了解清楚情况',
     ',',
     '截至发稿暂未取得最新回应',
     '。',
     '新京报记者还查询发现',
     ',',
     '湖北神丹健康食品有限公司为农业产业化国家重点龙头企业',
     '、',
     '高新技术企业',
     ',',
     '此前曾因涉嫌虚假宣传',
     '“',
     '中国最大的蛋品企业',
     '”',
     '而被罚6万元',
     '。']
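The Chinese result above is split almost only at punctuation, which is what one would expect if the text went through the English tokenizer. The `stanfordcorenlp` wrapper accepts a `lang` argument, and passing `lang="zh"` should route the text through CoreNLP's Chinese segmenter instead. The sketch below assumes that the Chinese models jar has been placed in the CoreNLP directory and that Java is available. `tokenize_chinese` is a hypothetical helper name.

```python
def tokenize_chinese(corenlp_dir, text):
    """Tokenize Chinese text with the stanfordcorenlp wrapper in Chinese mode."""
    # Lazy import: the wrapper needs Java and the CoreNLP files at call time.
    from stanfordcorenlp import StanfordCoreNLP

    nlp_zh = StanfordCoreNLP(corenlp_dir, lang="zh")
    try:
        return nlp_zh.word_tokenize(text)
    finally:
        nlp_zh.close()  # shut down the background Java server


# Example call (not executed here), reusing the directory from the post:
#   tokenize_chinese("/Users/war/Desktop/NLP/Experiment2/stanford-corenlp-4.2.0", Ch)
```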
    

    NLTK

    import nltk
    
    tokens_eng = nltk.word_tokenize(Eng)
    
    print(tokens_eng)
    
    ['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', '.', 'He', 'was', 'appointed', 'president', 'of', 'his', 'family', "'s", 'real', 'estate', 'business', 'in', '1971', ',', 'renamed', 'it', 'The', 'Trump', 'Organization', ',', 'and', 'expanded', 'it', 'from', 'Queens', 'and', 'Brooklyn', 'into', 'Manhattan', '.', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', ',', 'hotels', ',', 'casinos', ',', 'and', 'golf', 'courses', '.', 'Trump', 'later', 'started', 'various', 'side', 'ventures', ',', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', '.', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', '.', 'He', 'co-authored', 'several', 'books', ',', 'including', 'The', 'Art', 'of', 'the', 'Deal', '.', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', ',', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', ',', 'a', 'reality', 'television', 'show', ',', 'from', '2003', 'to', '2015', '.', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '$', '3.1', 'billion', '.']
    
    tokens_ch = nltk.word_tokenize(Ch)
    
    print(tokens_ch)
    
    ['ufeff央视315晚会曝光湖北省知名的神丹牌、莲田牌', '“', '土鸡蛋', '”', '实为普通鸡蛋冒充,同时在商标上玩猫腻,分别注册', '“', '鲜土', '”', '、注册', '“', '好土', '”', '商标,让消费者误以为是', '“', '土鸡蛋', '”', '。3月15日晚间,新京报记者就此事致电湖北神丹健康食品有限公司方面,其工作人员表示不知情,需要了解清楚情况,截至发稿暂未取得最新回应。新京报记者还查询发现,湖北神丹健康食品有限公司为农业产业化国家重点龙头企业、高新技术企业,此前曾因涉嫌虚假宣传', '“', '中国最大的蛋品企业', '”', '而被罚6万元。']
    

    SpaCy

    import spacy
    
    nlp = spacy.load('en_core_web_sm')
    
    print(nlp(Eng))
    
    Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School. He was appointed president of his family's real estate business in 1971, renamed it The Trump Organization, and expanded it from Queens and Brooklyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, including licensing his name for real estate and consumer products. He managed the company until his 2017 inauguration. He co-authored several books, including The Art of the Deal. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and he produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion.
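Printing a `Doc` echoes the original text, so the output above does not show how spaCy split it, and the same applies to the Chinese `print` that follows. Iterating over the `Doc` yields `Token` objects whose `.text` attribute holds each token. Note also that `en_core_web_sm` is an English pipeline. For the Chinese text a Chinese pipeline such as `zh_core_web_sm` would be the natural choice, assuming it has been downloaded. `token_texts` and `main` below are hypothetical helpers.

```python
def token_texts(doc):
    """Return the surface string of every token in a spaCy Doc."""
    return [token.text for token in doc]


def main():
    """Print tokens for one English and one Chinese sentence (requires spaCy and both models)."""
    import spacy  # lazy import so token_texts can be reused without spaCy installed

    nlp_en = spacy.load("en_core_web_sm")
    print(token_texts(nlp_en("Forbes estimates his net worth to be $3.1 billion.")))

    # Assumption: the zh_core_web_sm model is installed.
    nlp_zh = spacy.load("zh_core_web_sm")
    print(token_texts(nlp_zh("央视315晚会曝光湖北省知名的神丹牌")))


# Call main() in an environment where spaCy and both models are available.
```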
    
    print(nlp(Ch))
    
    央视315晚会曝光湖北省知名的神丹牌、莲田牌“土鸡蛋”实为普通鸡蛋冒充,同时在商标上玩猫腻,分别注册“鲜土”、注册“好土”商标,让消费者误以为是“土鸡蛋”。3月15日晚间,新京报记者就此事致电湖北神丹健康食品有限公司方面,其工作人员表示不知情,需要了解清楚情况,截至发稿暂未取得最新回应。新京报记者还查询发现,湖北神丹健康食品有限公司为农业产业化国家重点龙头企业、高新技术企业,此前曾因涉嫌虚假宣传“中国最大的蛋品企业”而被罚6万元。
  • Original source: https://www.cnblogs.com/war1111/p/14588008.html