
Web Crawler Final Project

1. Choose a topic or website that interests you.

I chose to crawl NetEase News (news.163.com).

2. Write a crawler in Python to collect data on the chosen topic from the web.

import re

import requests
from bs4 import BeautifulSoup

url = "http://news.163.com/"
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
res = requests.get(url, headers=header)
# Decode the raw response bytes as GBK
html = res.content.decode('gbk')
soup = BeautifulSoup(html, "html.parser")
# Select every link that opens in a new tab (target="_blank")
links = soup.select('a[target="_blank"]')
# Open the output file once instead of once per link
with open('yjd.txt', 'a+', encoding='utf-8') as f:
    for link in links:
        # Strip all whitespace from the link text before appending it
        f.write(re.sub(r'\s+', '', link.get_text()))
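
The cleanup step in the crawler depends on the regular expression matching whitespace. A minimal sketch of the difference between the pattern `\s+` (runs of whitespace) and `s+` (runs of the literal letter "s"); the helper name `strip_whitespace` and the sample string are made up for illustration and are not part of the original script:

```python
import re


def strip_whitespace(text):
    # r'\s+' matches runs of spaces, tabs and newlines; the raw-string
    # prefix keeps the backslash from being treated as a string escape.
    return re.sub(r'\s+', '', text)


# Hypothetical link text with stray spaces, a newline and a tab
sample = "  网易 新闻\n\tnews headlines "

cleaned = strip_whitespace(sample)
# Without the backslash, the pattern only removes the letter "s"
# and leaves every space, tab and newline in place.
wrong = re.sub('s+', '', sample)

print(repr(cleaned))  # '网易新闻newsheadlines'
print(repr(wrong))
```

Since all whitespace is removed and nothing is written between links, the headlines end up concatenated in `yjd.txt`; writing a newline after each one would keep them separate if that is preferred.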

3. Run text analysis on the crawled data and generate a word cloud.

# -*- coding:utf-8 -*-
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import jieba

# Read the crawled text and segment it into space-separated words
with open('yjd.txt', 'r', encoding='utf-8') as f:
    text = f.read()
result = ' '.join(jieba.lcut(text))
print(result)

# 1.png serves as both the shape mask and the color source
graph = np.array(Image.open('1.png'))
# A font with Chinese glyphs is required; the raw string keeps the backslashes in the Windows path
wc = WordCloud(font_path=r'C:\Windows\Fonts\STZHONGS.TTF', background_color='white', max_words=50, mask=graph)
wc.generate_from_text(result)
image_color = ImageColorGenerator(graph)
# Recolor the cloud with the colors of the mask image, then display and save it
plt.imshow(wc.recolor(color_func=image_color))
plt.axis("off")
plt.show()
wc.to_file('xiaodada.jpg')

4. Explain the results of the text analysis.
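
One way to back up an explanation of the word cloud is to list the most frequent words explicitly. The following is a minimal sketch using only the standard library; the function name `top_words`, its parameters, and the sample token list are all hypothetical, with the list standing in for the output of `jieba.lcut` on the crawled text:

```python
from collections import Counter


def top_words(tokens, n=3, min_len=2):
    # Drop whitespace and single-character tokens, which in segmented
    # Chinese text are mostly particles and punctuation.
    words = [t.strip() for t in tokens if len(t.strip()) >= min_len]
    # most_common returns (word, count) pairs sorted by descending count
    return Counter(words).most_common(n)


# Hypothetical tokens, standing in for jieba.lcut(text)
tokens = ["新闻", "的", "经济", "新闻", " ", "，", "经济", "新闻", "体育"]

top = top_words(tokens)
print(top)  # [('新闻', 3), ('经济', 2), ('体育', 1)]
```

The counts make it possible to state which topics dominate the headlines instead of judging only by font size in the image.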

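5. Write a complete blog post describing the implementation above, the problems encountered and how they were solved, and the reasoning and conclusions of the data analysis.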

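I ran into many problems at first. While writing the functions I found that my fundamentals were weak, and I even lacked basic knowledge about importing libraries. Fortunately, with help from my classmates, I still finished the task. Crawling data at scale turned out to be interesting, but when I tried other websites I often got nothing back, probably because access was restricted. I will look into that problem later.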
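
Empty results from other sites can have several causes, and the real one was not investigated here. One possible first check, offered only as a sketch, is whether the site's robots.txt disallows the path being crawled. The rules, the user-agent name `my-crawler`, and the example.com URLs below are invented for illustration; against a real site the rules would come from its own robots.txt file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the
# site's own file instead of using this inline sample.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules permit the request
blocked = parser.can_fetch("my-crawler", "http://example.com/private/page.html")
allowed = parser.can_fetch("my-crawler", "http://example.com/news/")

print(blocked)  # False
print(allowed)  # True
```

If robots.txt permits the path, other explanations remain possible, such as the server rejecting the request or the page building its content with JavaScript after loading.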

6. Finally, submit all of the crawled data together with the source code for the crawler and the data analysis.

Original article: https://www.cnblogs.com/sunset-Panda/p/8986853.html