  • [python3] Scraping Jianshu comments to generate a word cloud

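    1. Motivation:

          Yesterday I came across an article on Jianshu, 《中国的父母,大都有毛病》 ("Most Chinese Parents Have Something Wrong with Them"), and after reading it I largely agreed with the author.

         Skimming the comments, though, I found them highly contentious and split into two opposing camps. Curious about what the comments looked like as a whole, I wrote a crawler and built a word cloud from them.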

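    2. Approach:

         ① Inspect the page, find the request that loads the comments, look at the format of the comment data, and write the crawler

         ② Use the jieba module to segment the scraped comments into words

         ③ Use the wordcloud module to generate the word cloud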
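
Step ① depends on the shape of the comment response. As a rough offline sketch, the snippet below parses a payload carrying the three fields the crawler reads (`total_pages`, `comments`, `compiled_content`). The sample payload and the helper names `html_to_text` and `extract_comments` are made up for illustration, and the standard-library `html.parser` stands in for BeautifulSoup so the example runs without extra packages.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Strip tags from an HTML fragment and return the plain text."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)


def extract_comments(payload):
    """Return (total_pages, [plain-text comments]) for one decoded page."""
    comments = [html_to_text(c['compiled_content']) for c in payload['comments']]
    return payload['total_pages'], comments


# Hypothetical sample payload; a real response carries many more fields.
SAMPLE = json.loads('''
{
  "total_pages": 3,
  "comments": [
    {"compiled_content": "说得<b>很对</b>"},
    {"compiled_content": "不同意<br>太片面了"}
  ]
}
''')

total_pages, comments = extract_comments(SAMPLE)
print(total_pages, comments)
```

Returning `total_pages` alongside the comments is what lets a caller decide how many more pages to request.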

    3. The code:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import requests,json,time
    import jieba
    import matplotlib.pyplot as plt
    from bs4 import BeautifulSoup
    from wordcloud import WordCloud,STOPWORDS,ImageColorGenerator
    
    # Append one scraped comment to the output file
    def write(path,text):
        with open(path,'a', encoding='utf-8') as f:
            f.write(text)
            f.write('\n')
    
    # Scrape one page of comments and return the total page count reported by the server
    def getcomments(num,path):
        url = 'https://www.jianshu.com/notes/23437010/comments?comment_id=&author_only=false&since_id=0&max_id=1586510606000&order_by=likes_count&page='+str(num)
        response = requests.get(url).text
        response = json.loads(response)
        total_pages = response['total_pages']
        for i in response['comments']:
            # compiled_content is HTML, so strip the tags and keep only the text
            comment = BeautifulSoup(i['compiled_content'],'lxml').text
            write(path,comment)
        return total_pages
    
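    # Segment the saved comments into words with jieba
    def read(path):
        text=''
        with open(path, encoding='utf-8') as s:
            for line in s:
                line = line.strip()  # strip() returns a new string, so keep the result
                if not line:
                    continue
                # Join tokens with spaces; the trailing space keeps the last word of
                # this comment from fusing with the first word of the next one
                text += ' '.join(jieba.cut(line)) + ' '
        return text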
    
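    # Generate the word cloud with WordCloud
    def wordcloud(text,imagepath):
        background_image = plt.imread(imagepath)
        wc = WordCloud(background_color='white',  # background color
                       mask=background_image,  # image whose shape the cloud fills
                       max_words=2000,  # maximum number of words to display
                       stopwords=STOPWORDS,  # stop words (the built-in list is English only)
                       font_path='C:/Users/Windows/fonts/msyh.ttf',  # font file; without a font that covers Chinese, the words will not render
                       max_font_size=120,  # largest font size
                       random_state=30,  # seed for the random generator, so repeated runs give the same cloud
                       )
        wc.generate(text)
        image_colors = ImageColorGenerator(background_image)
        wc.recolor(color_func=image_colors)  # take word colors from the background image
        plt.imshow(wc)
        plt.axis('off')
        plt.show()
    
    if __name__ == '__main__':
        path = '评论.txt' # file the comments are saved to
        imagepath = 'heart.jpg' # background image for the word cloud
        print('Scraping comments')
        i,num=1,2  # num is a placeholder until the first response reports the real page count
        while i <= num:
            num=getcomments(i,path) # scrape one page of comments
            time.sleep(2)  # pause between requests
            i += 1
        print('Segmenting words')
        text = read(path)  # jieba word segmentation
        print('Generating word cloud')
        wordcloud(text,imagepath) # generate the word cloud
        print('Word cloud generated')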
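
One thing worth noting about the script above: the `STOPWORDS` set shipped with wordcloud contains English words only, so common Chinese function words such as 的 or 了 are not filtered and may dominate the cloud. The following is a minimal sketch of one possible workaround, counting the segmented tokens while skipping a Chinese stop-word list. The tiny `CN_STOPWORDS` set, the `word_frequencies` helper, and the `min_len` threshold are all assumptions made for illustration, not part of the original script.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
CN_STOPWORDS = {'的', '了', '是', '我', '也', '都', '就', '在'}


def word_frequencies(segmented_text, stopwords=CN_STOPWORDS, min_len=2):
    """Count space-separated tokens, skipping stop words and very short tokens."""
    tokens = segmented_text.split()
    kept = [t for t in tokens if t not in stopwords and len(t) >= min_len]
    return Counter(kept)


# Text in the same form read() produces: tokens separated by spaces.
sample = '父母 的 观点 我 也 认同 , 父母 都 是 这样 的'
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

The resulting mapping could then be handed to `wc.generate_from_frequencies(freqs)` in place of `wc.generate(text)`. Dropping single-character tokens also removes punctuation, at the cost of losing a few meaningful one-character words.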

    Result:

  • Original article: https://www.cnblogs.com/TurboWay/p/8435355.html