  • Chinese Word Frequency Statistics and Word Cloud Creation

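    1. Instructor Zeng, the South China technical director at Chinasoft International (中软国际), will teach two more classes. What would you like Zeng to cover? (Think it over carefully before answering.)

    a. Work experience related to this course

    b. Zeng's own views on this course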

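    2. Chinese word segmentation

    a. As in the earlier English exercise, first put the article whose word frequencies are to be counted into a TXT file: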

    >>> fo = open('kang.txt','w')
    >>> fo.write('''A Chinese company offering sex dolls for rent has withdrawn its services just days after launching.
    
    Touch had begun offering five different sex doll types for daily or longer-term rent on Thursday in Beijing but quickly drew complaints and criticism.
    
    The company said in a statement on Weibo it "sincerely apologised for the negative impact" of the concept.
    
    But the firm stressed sex was "not vulgar" and said it would keep working towards more people enjoying it.
    
    Touch told the BBC the rental service had operated for two days.
    
    "We prepared ten dolls for the trial operation," a company spokesperson said via email, adding that they received very positive feedback from users.
    
    "But it’s really hard in China," the firm wrote, saying there had been a lot of controversy with the police over the issue.
    
    The company had offered the sex dolls for a daily fee of 298 yuan ($46), according to Chinese media.
    
    The models on offer were marketed as Chinese, Korean and Russian women, with one also modelled on the movie character Wonder Woman, complete with a sword and shield.
    
    In its Weibo statement, the firm said its original intention had been to make expensive silicone dolls more affordable but conceded that the service triggered a heated public debate.
    
    The company also said it would pay out compensation to users worth double the amount they had paid as a deposit for reserving a doll.
    
    The statement added that Touch would in future pay more attention to its "social duty", and would actively promote a "healthier and more harmonious sex lifestyle".
    
    Aside from its short-lived rental offering, the firm sells an array of sex toys, including sex dolls.''')
    1664
    >>> fo.close()
    >>> fr=open('kang.txt','r')
    >>> fr.readline()
    'A Chinese company offering sex dolls for rent has withdrawn its services just days after launching.\n'
    >>> 
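
    The file handling in the session above can also be written with `with` blocks, which close the file automatically, and an explicit `encoding`, which avoids depending on the platform default. The following is only a minimal sketch: the temporary directory and the two-paragraph sample text are placeholders, not part of the original exercise.

```python
import os
import tempfile

text = 'First paragraph.\n\nSecond paragraph.'

# Use a throwaway directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'kang.txt')
    # The with block closes the file even if an error occurs.
    with open(path, 'w', encoding='utf-8') as fo:
        written = fo.write(text)      # number of characters written
    with open(path, 'r', encoding='utf-8') as fr:
        first_line = fr.readline()    # keeps the trailing newline

print(written, repr(first_line))
```

    Writing and reading with the same explicit encoding matters most for Chinese text, where the Windows default (often GBK) and UTF-8 disagree.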

    Then read the file back and count the words:

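    # Read the whole article (the with block closes the file automatically)
    with open('kang.txt','r') as fo:
        news = fo.read()

    # Normalize case and turn punctuation into spaces
    news = news.lower()
    for ch in ',./-_"":;':
        news = news.replace(ch,' ')
    words = news.split(' ')

    # Tokens to ignore: stop words, plus tokens that still carry the
    # blank-line newlines because the text is split on spaces only
    exp = {'','the','\n\nthe','its','that','it','a','for','and','had','said',
           'to','of','in','on','as','they','also','or','an','\n\nin','\n\n',
           '\n\ntouch'}

    # Count each remaining word (named counts to avoid shadowing the built-in dict)
    counts = {}
    for w in set(words)-exp:
        counts[w] = words.count(w)

    # Sort by frequency, highest first, and print the top 10
    tj = list(counts.items())
    tj.sort(key=lambda x:x[1],reverse=True)
    for i in range(10):
        print(tj[i])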

    The result is as follows:

    >>> 
    =============== RESTART: C:/Users/Administrator/Desktop/词频2.py ===============
    ('sex', 7)
    ('company', 5)
    ('dolls', 5)
    ('would', 4)
    ('firm', 4)
    ('more', 4)
    ('but', 3)
    ('chinese', 3)
    ('statement', 3)
    ('offering', 3)
    >>> 
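
    The same count can be sketched more compactly with the standard library. The version below is an alternative under stated assumptions, not the script that produced the result above: it uses `re.findall` to pull out runs of ASCII letters, so punctuation and newlines never end up glued to a word, and `collections.Counter` for the tally. The function name `top_words`, the sample sentence and the shortened stop-word set are illustrative.

```python
import re
from collections import Counter

# Illustrative stop-word set; extend it as needed for the text at hand.
STOP_WORDS = {'the', 'its', 'that', 'it', 'a', 'for', 'and', 'had', 'said',
              'to', 'of', 'in', 'on', 'as', 'they', 'also', 'or', 'an'}

def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most common words in text, skipping stop words."""
    # Lowercase, then keep runs of letters (ASCII apostrophes allowed inside a word).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

sample = 'The firm said the dolls were popular. "But the firm withdrew the dolls," it said.'
print(top_words(sample, 2))
```

    Because the tokens come from a regular expression instead of `split(' ')`, the stop-word set no longer needs entries such as '\n\nthe'. Counts can differ slightly from the script above on real text, for example around curly apostrophes, which this pattern treats as separators.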

     b. Testing jieba

    >>> import jieba
    >>> word = jieba.cut('太阳出来就去耕作田地,太阳落山就回家去休息。')
    >>> w=list(word)
    >>> w
    ['太阳', '出来', '就', '去', '耕作', '田地', ',', '太阳', '落山', '就', '回家', '去', '休息', '。']
    >>> a = list(jieba.cut('太阳出来就去耕作田地,太阳落山就回家去休息。',cut_all=True))
    >>> a
    ['太阳', '出来', '就', '去', '耕作', '田地', '', '', '太阳', '落山', '就', '回家', '去', '休息', '', '']
    >>> s = list(jieba.cut_for_search('太阳出来就去耕作田地,太阳落山就回家去休息。'))
    >>> s
    ['太阳', '出来', '就', '去', '耕作', '田地', ',', '太阳', '落山', '就', '回家', '去', '休息', '。']

     c. This time I chose to segment the Chinese novel 雪山飞狐 (Fox Volant of the Snowy Mountain):

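    import jieba

    # Save the novel text, then read it back with the same encoding
    xs = open('xs.txt','w',encoding='GBK')
    xs.write('''<text of the downloaded novel 雪山飞狐>''')
    xs.close()
    with open('xs.txt','r',encoding='GBK') as f:
        fr = f.read()
    zs = jieba.cut(fr)

    # Count every token longer than one character
    dic = {}
    for z in zs:
        if len(z)==1:
            continue
        dic[z] = dic.get(z,0) + 1

    # Sort by frequency, highest first, and print the top 20
    tj = list(dic.items())
    tj.sort(key=lambda x:x[1],reverse=True)
    for item in tj[:20]:
        print(item)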

    Result:

    >>> 
     RESTART: C:/Users/Administrator/AppData/Local/Programs/Python/Python36/zhongwen222.py 
    Building prefix dict from the default dictionary ...
    Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
    Loading model cost 0.706 seconds.
    Prefix dict has been built succesfully.
    ('曹云奇', 227)
    ('苗若兰', 210)
    ('一个', 206)
    ('胡一刀', 204)
    ('众人', 203)
    ('说道', 201)
    ('金面佛', 174)
    ('胡斐', 136)
    ('自己', 132)
    ('两人', 132)
    ('心中', 127)
    ('阮士中', 127)
    ('宝树', 126)
    ('爹爹', 124)
    ('苗人凤', 121)
    ('孩子', 114)
    ('一声', 113)
    ('不知', 110)
    ('刘元鹤', 106)
    ('什么', 104)
    >>> 
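
    The top of this list mixes character names with generic words such as '一个', '说道' and '自己'. One possible refinement, sketched below, is to reuse the stop-word idea from the English exercise. The function `count_tokens` and the `CN_STOP_WORDS` set are hypothetical names introduced for illustration. The function accepts any iterable of tokens, for example the generator returned by `jieba.cut`, so the sketch itself runs without jieba.

```python
from collections import Counter

# Hypothetical stop-word set, picked from the generic words in the result above.
CN_STOP_WORDS = {'一个', '说道', '自己', '两人', '心中', '一声', '不知', '什么'}

def count_tokens(tokens, stop_words=CN_STOP_WORDS, min_len=2):
    """Count tokens, skipping stop words and tokens shorter than min_len."""
    return Counter(t for t in tokens
                   if len(t) >= min_len and t not in stop_words)

# Hand-written tokens standing in for the output of jieba.cut(text).
tokens = ['胡斐', '说道', '苗若兰', '一个', '胡斐', '的', '苗人凤', '胡斐']
for word, freq in count_tokens(tokens).most_common(2):
    print(word, freq)
```

    With the real novel, `count_tokens(jieba.cut(fr)).most_common(20)` would take the place of the manual dictionary and sort; which words belong in the stop-word set is a judgment call for each text.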
  • Original post: https://www.cnblogs.com/kang8823/p/7590905.html