  • A complete final project

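    1. Pick a topic that interests you.

    2. Scrape related data from the web.

    3. Run a text analysis and generate a word cloud.

    4. Explain and interpret the results of the text analysis.

    5. Write a complete blog post that includes the source code, the scraped data, and the analysis results, so that it forms a presentable deliverable.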

    1. I chose the Sina News site.

    2. Scraped data
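The data collected in this project is the text of every `<p>` element on the page. As a minimal offline sketch of that extraction step, the snippet below uses the standard-library `html.parser` in place of BeautifulSoup so it runs without a network connection or third-party packages. The class `ParagraphText` and the helper `paragraph_texts` are hypothetical names introduced only for illustration; they are not part of the original script.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside each <p> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._buf = None      # text pieces of the <p> being read, or None

    def _flush(self):
        # Store the paragraph collected so far and leave "inside <p>" state.
        self.paragraphs.append("".join(self._buf))
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly closes an unclosed previous one.
            if self._buf is not None:
                self._flush()
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            self._flush()

    def handle_data(self, data):
        # Keep text only while inside a <p>, including nested tags like <a>.
        if self._buf is not None:
            self._buf.append(data)


def paragraph_texts(html):
    """Return the text of each <p> element in an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = '<div><p>新浪<a href="#">新闻</a></p><span>x</span><p>体育</p></div>'
for text in paragraph_texts(sample):
    print(text)
```

Text outside `<p>` elements (the `<span>` in the sample) is ignored, which mirrors what `soup.find_all("p")` selects in the full script.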

    3. Full code

    import jieba
    import matplotlib.pyplot as plt
    import requests
    from bs4 import BeautifulSoup
    from wordcloud import WordCloud

    # Fetch the Sina News home page and parse the HTML.
    url = "http://news.sina.com.cn/"
    res = requests.get(url)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, "html.parser")

    # Save the text of every <p> element, one paragraph per line.
    # Mode "a+" appends, so repeated runs accumulate text in hmy.txt.
    with open("hmy.txt", "a+", encoding="utf-8") as output:
        for p in soup.find_all("p"):
            output.write(p.get_text() + "\n")

    with open("hmy.txt", "r", encoding="utf-8") as f:
        txt = f.read()

    # Segment the text and count words, skipping single-character tokens.
    counts = {}
    for word in jieba.lcut(txt):
        if len(word) == 1:
            continue
        counts[word] = counts.get(word, 0) + 1

    # Print the ten most frequent words (fewer if the text is short).
    items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    for word, count in items[:10]:
        print("{:<5}{:>2}".format(word, count))

    # Build the word cloud from a full-mode segmentation.
    # msyh.ttc (Microsoft YaHei) must be available so Chinese glyphs render.
    wordlist = jieba.cut(txt, cut_all=True)
    wl_split = "/".join(wordlist)
    mywc = WordCloud(font_path="msyh.ttc").generate(wl_split)
    plt.imshow(mywc)
    plt.axis("off")
    plt.show()
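The counting step in the script can also be written with `collections.Counter` from the standard library. The following is a minimal sketch, not the original author's code: the function name `top_words` and its parameters `n` and `min_len` are assumptions made for illustration. It takes an already segmented token list (for example the output of `jieba.lcut`), so it can be tried without downloading anything. Unlike the script, it also strips whitespace before measuring length, so tokens made up only of spaces or newlines are dropped.

```python
from collections import Counter


def top_words(tokens, n=10, min_len=2):
    """Return the n most frequent tokens that have at least min_len characters.

    Hypothetical helper: whitespace is stripped before the length check,
    so single characters and blank tokens are both skipped.
    """
    counts = Counter(t for t in tokens if len(t.strip()) >= min_len)
    return counts.most_common(n)


# Hand-written token list standing in for jieba.lcut(txt).
tokens = ["新闻", "的", "新闻", "体育", "\n", "体育", "新闻"]
for word, count in top_words(tokens, n=2):
    print("{:<5}{:>2}".format(word, count))
```

`most_common` returns fewer than `n` pairs when there are not enough distinct words, so no index check is needed for short texts.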

    4. Results

  • Original post: https://www.cnblogs.com/millmill/p/7689443.html