zoukankan      html  css  js  c++  java
  • scrapy-redis爬取豆瓣电影短评,使用词云wordcloud展示

    1、数据是使用scrapy-redis爬取的,存放在redis里面,爬取的是最近大热电影《海王》
    2、使用了jieba中文分词解析库
    3、使用了停用词stopwords,过滤掉一些无意义的词
    4、使用matplotlib+wordcloud绘图展示

    from redis import Redis
    import json
    import jieba
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    
    # 加载停用词
    # stopwords = set(map(lambda x: x.rstrip('
    '), open('chineseStopWords.txt').readlines()))
    stopwords = set()
    with open('chineseStopWords.txt') as f:
        for line in f.readlines():
            stopwords.add(line.rstrip('
    '))
        stopwords.add(' ')
        # print(stopwords)
        # print(len(stopwords))
    
    # 读取影评
    db = Redis(host='localhost')
    items = db.lrange('review:items', 0, -1)
    # print(items)
    # print(len(items))
    
    # 统计每个word出现的次数
    # 过滤掉停用词
    # 记录总数,用于计算词频
    words = {}
    total = 0
    
    for item in items:
        data = json.loads(item)['review']
        # print(data)
        # print('------------')
        for word in jieba.cut(data):
            if word not in stopwords:
                words[word] = words.get(word, 0) + 1
                total += 1
    
    print(sorted(words.items(), key=lambda x: x[1], reverse=True))
    # print(len(words))
    # print(total)
    
    # 词频
    freq = {k: v / total for k, v in words.items()}
    print(sorted(freq.items(), key=lambda x: x[1], reverse=True))
    
    # 词云
    wordcloud = WordCloud(font_path='simhei.ttf',
                          width=500,
                          height=300,
                          scale=10,
                          max_words=200,
                          max_font_size=40).fit_words(frequencies=freq)  # Create a word_cloud from words and frequencies
    
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()
    

    绘图结果:

    参考:
    https://github.com/amueller/word_cloud
    http://amueller.github.io/word_cloud/

  • 相关阅读:
    Storm分布式实时流计算框架相关技术总结
    上手开源项目的几点建议【转】
    笔试面试的路上——努力ing
    Storm配置项详解【转】
    storm UI
    leetcode-单词探索
    leetcode-全排列详解(回溯算法)
    leetcode-生成括号(回溯算法)
    leetcode-递增的三元子序列
    leetcode-最长无重复字符的子串
  • 原文地址:https://www.cnblogs.com/keithtt/p/10177266.html
Copyright © 2011-2022 走看看