
Scraping epidemic data with a crawler and saving it to a file

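import json
import time

import requests
from bs4 import BeautifulSoup

# Page to load: the DXY (丁香园) novel coronavirus pneumonia page
url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia?scene=2&clicktime=1579582238&enterid=1579582238&from=singlemessage&isappinstalled=0'

# Request headers copied from the local browser.
# This must be a dict (key: value), not a single 'key: value' string.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0'
}

resp = requests.get(url, headers=headers)
resp.encoding = resp.apparent_encoding  # use the auto-detected charset for decoding
# print(resp.text)               # print the fetched content
# print(resp.encoding)           # charset reported by the page
# print(resp.apparent_encoding)  # charset detected from the content
html = resp.text

# Parse the page into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
print('--------------------------------------------------------')
# print(soup)

# Extract the data: the <script id="getAreaStat"> tag holds the per-area statistics
result = soup.find('body').find('script', {'id': 'getAreaStat'}).text
print(type(result))  # print the type of result (str)
# print(result)      # print to the console

# Write the raw script text to a file
with open('result2.txt', 'w', encoding='utf-8') as fo:
    fo.write(result)

# Read the file back (it is in the current folder, so the bare file name is enough)
with open('result2.txt', 'r', encoding='utf-8') as f:
    raw = f.read()

# Strip the JavaScript wrapper so that only the data remains
ls = raw.strip().replace('try { window.getAreaStat = ', '').replace('}catch(e){}', '')

# Convert the text to a Python list. The payload is a JSON array, so json.loads
# replaces the original eval(), which would execute whatever the page returned.
area_stats = json.loads(ls)
print(type(area_stats))  # print the type of area_stats (list)
print(area_stats)        # print the list to the console

# Write the final result to a file named after the current date.
# The names no longer shadow the time module or the built-in list.
today = time.strftime('%Y-%m-%d')
with open('{}.txt'.format(today), 'w', encoding='utf-8') as fo:
    fo.write(ls)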
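
The cleanup step above removes the `try { window.getAreaStat = ` prefix and the `}catch(e){}` suffix from the script text before turning it into a list. A minimal sketch of doing both in one helper follows. It assumes the payload inside the wrapper is valid JSON; the helper name `extract_area_stat`, the regular expression, and the sample values are illustrative and not part of the original post.

```python
import json
import re

# Matches the JavaScript wrapper seen in the script tag:
#   try { window.getAreaStat = [ ... ]}catch(e){}
# Group 1 captures everything between the assignment and the catch clause.
_WRAPPER = re.compile(
    r'try\s*\{\s*window\.getAreaStat\s*=\s*(.*)\}\s*catch\s*\(e\)\s*\{\s*\}\s*$',
    re.DOTALL,
)


def extract_area_stat(script_text):
    """Return the parsed list from the getAreaStat script text (hypothetical helper)."""
    match = _WRAPPER.search(script_text.strip())
    if match is None:
        # The page layout changed or the wrong tag was passed in.
        raise ValueError('script text does not match the expected wrapper')
    # json.loads only parses data; unlike eval it never executes code.
    return json.loads(match.group(1))


# Made-up sample in the same shape as the wrapper, not real data.
sample = 'try { window.getAreaStat = [{"provinceName": "Hubei", "confirmedCount": 1, "cities": []}]}catch(e){}'
stats = extract_area_stat(sample)
print(stats[0]['provinceName'])
```

Raising an error when the wrapper is missing makes a change in the page layout show up immediately instead of silently producing an unusable file.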

Output:

File:

Tomorrow I plan to load the data into a database and run a word cloud analysis.

Original article: https://www.cnblogs.com/qq1793033075/p/12300826.html