  • [Python 3] Scraping Jianshu Comments to Generate a Word Cloud

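    I. Motivation:

         Yesterday I came across an article on Jianshu titled 《中国的父母,大都有毛病》 ("Most Chinese Parents Have Something Wrong with Them"). After reading it, I largely agreed with the author.

         Scrolling through the comments, though, I found the discussion highly contentious and essentially split into two camps. Curious about what the comments looked like as a whole, I wrote a crawler and built a word cloud from them.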

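    II. Approach:

         ① Inspect the page, find the request that loads the comments, check the format of the comment data, and write the crawler

         ② Use the jieba module to segment the scraped comments into words

         ③ Use the wordcloud module to generate the word cloud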
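
Step ① can be tried offline before touching the network. The sketch below is only an illustration, not the author's code: it assumes the comment endpoint returns JSON with the fields `total_pages`, `comments`, and `compiled_content` (the field names the script in section III reads), and it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages. `fetch_page`, `strip_html`, `collect_comments`, and the canned data in `fake_fetch` are hypothetical names and made-up samples.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the plain text of an HTML fragment, dropping all tags."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


def collect_comments(fetch_page):
    """Walk every page and return all comments as plain text.

    fetch_page(n) is a hypothetical callable that returns the raw JSON
    string for page n (for example, a thin wrapper around requests.get).
    """
    comments = []
    page, total_pages = 1, 1
    while page <= total_pages:
        data = json.loads(fetch_page(page))
        total_pages = data['total_pages']  # the response reports the page count
        comments.extend(strip_html(c['compiled_content']) for c in data['comments'])
        page += 1
    return comments


def fake_fetch(page):
    """Canned two-page response standing in for the network (made-up data)."""
    pages = {
        1: {'total_pages': 2, 'comments': [{'compiled_content': '<p>同意作者</p>'}]},
        2: {'total_pages': 2, 'comments': [{'compiled_content': '不同意<br>太片面'}]},
    }
    return json.dumps(pages[page])


print(collect_comments(fake_fetch))
```

Passing the fetcher in as a parameter keeps the paging logic testable on its own; a real fetcher would also want a delay between requests.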

    III. The code:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import json
    import time

    import jieba
    import matplotlib.pyplot as plt
    import requests
    from bs4 import BeautifulSoup
    from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
    
    
    # Append one scraped comment to the output file, one comment per line
    def write(path, text):
        with open(path, 'a', encoding='utf-8') as f:
            f.write(text)
            f.write('\n')
    
    # Fetch one page of comments, save them, and return the total page count
    def getcomments(num, path):
        url = 'https://www.jianshu.com/notes/23437010/comments?comment_id=&author_only=false&since_id=0&max_id=1586510606000&order_by=likes_count&page=' + str(num)
        response = requests.get(url, timeout=10).text
        data = json.loads(response)
        for item in data['comments']:
            # compiled_content is HTML; keep only its text
            comment = BeautifulSoup(item['compiled_content'], 'lxml').text
            write(path, comment)
        return data['total_pages']
    
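    # Segment the comments with jieba and join the tokens with spaces
    def read(path):
        words = []
        with open(path, encoding='utf-8') as s:
            for line in s:
                line = line.strip()  # str.strip() returns a new string, so keep the result
                if line:
                    words.extend(jieba.cut(line))
        return ' '.join(words)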
    
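    # Build and display the word cloud with WordCloud
    def wordcloud(text, imagepath):
        background_image = plt.imread(imagepath)
        wc = WordCloud(background_color='white',  # background color
                       mask=background_image,  # image used as the shape mask
                       max_words=2000,  # maximum number of words to display
                       stopwords=STOPWORDS,  # stop words (the built-in list is English)
                       font_path='C:/Users/Windows/fonts/msyh.ttf',  # font file; without a font that has Chinese glyphs, Chinese text will not render
                       max_font_size=120,  # largest font size
                       random_state=30,  # random seed, so the layout is reproducible
                       )
        wc.generate(text)
        # Recolor the words using the colors of the background image
        image_colors = ImageColorGenerator(background_image)
        wc.recolor(color_func=image_colors)
        plt.imshow(wc)
        plt.axis('off')
        plt.show()
    
    if __name__ == '__main__':
        path = '评论.txt'  # file where the comments are stored
        imagepath = 'heart.jpg'  # background image for the word cloud
        # write() appends, so start from an empty file to avoid duplicates on reruns
        open(path, 'w', encoding='utf-8').close()
        print('Fetching comments')
        i, num = 1, 1
        while i <= num:
            num = getcomments(i, path)  # fetch page i; num becomes the total page count
            time.sleep(2)  # pause between requests
            i += 1
        print('Segmenting text')
        text = read(path)  # jieba segmentation
        print('Generating word cloud')
        wordcloud(text, imagepath)  # build and show the word cloud
        print('Word cloud generated')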
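
One thing to keep in mind with the script above: the `STOPWORDS` set that ships with wordcloud is an English list, so common Chinese function words are likely to dominate the picture. Below is a minimal sketch of one way to filter them, assuming the space-separated string returned by `read()`. `CHINESE_STOPWORDS`, `word_frequencies`, and the sample sentence are hypothetical and not part of the original script; a real stop-word list would be much longer.

```python
from collections import Counter

# Hypothetical stop-word list for illustration; a real one would be far longer.
CHINESE_STOPWORDS = {'的', '了', '是', '我们', '这个', '就是'}


def word_frequencies(segmented_text, stopwords=CHINESE_STOPWORDS, min_length=2):
    """Count tokens in a whitespace-separated string.

    Tokens shorter than min_length characters and tokens found in
    stopwords are skipped.
    """
    counts = Counter()
    for token in segmented_text.split():
        if len(token) < min_length or token in stopwords:
            continue
        counts[token] += 1
    return counts


# Made-up sample standing in for the output of read()
sample = '父母 的 教育 方式 就是 孩子 的 榜样 我们 父母 都 需要 学习'
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

The resulting counts could then be handed to `WordCloud.generate_from_frequencies`, which takes a word-to-frequency mapping, in place of `wc.generate(text)`.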

    Result: (word cloud image)

  • Original article: https://www.cnblogs.com/TurboWay/p/8435355.html