
Targeted Stock Data Crawler

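Functional description:

Goal: collect the names and trading information of all stocks listed on the Shanghai Stock Exchange and the Shenzhen Stock Exchange.
Output: save the results to a file.
Technical approach: requests-bs4-re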

Candidate data sites:

Sina Stocks (新浪股票): http://finance.sina.com.cn/stock/
Baidu Stocks (百度股票): https://gupiao.baidu.com/stock/

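Criteria for choosing a data site:

Selection principle: the stock information is present statically in the HTML page, is not generated by JavaScript, and is not restricted by the Robots protocol.
Selection method: use the browser's F12 developer tools, view the page source, and so on.
Selection mindset: do not get stuck on one particular site; try several information sources.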
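
The Robots-protocol part of the selection principle can be checked programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`. It assumes you have already obtained the text of a site's robots.txt; the helper name `is_allowed`, the sample rules, and the example.com URLs are hypothetical and only serve as illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/stock/sh600000.html"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data.html"))
```

A site that passes this check may still build its content with JavaScript, so viewing the page source remains necessary.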

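Program structure:

Step 1: get the stock list from East Money (东方财富网): http://quote.eastmoney.com/stocklist.html
Step 2: for each stock in the list, fetch its details from Baidu Stocks.
Step 3: store the results in a file.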
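
Steps 1 and 2 hinge on pulling stock codes out of link targets and turning each code into a detail-page URL. The sketch below illustrates that idea with the `re` module alone. The helper names `extract_codes` and `build_info_url` and the sample hrefs are assumptions made for illustration; the real link format on the list page may differ.

```python
import re

# "sh" or "sz" (Shanghai or Shenzhen) followed by six digits.
STOCK_CODE = re.compile(r"s[hz]\d{6}")


def extract_codes(hrefs):
    """Return the first stock code found in each href, skipping hrefs without one."""
    codes = []
    for href in hrefs:
        match = STOCK_CODE.search(href)
        if match:
            codes.append(match.group(0))
    return codes


def build_info_url(base_url, code):
    """Build the per-stock page URL from the base URL and a stock code."""
    return base_url + code + ".html"


# Hypothetical hrefs, for illustration only.
sample_hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/",
]

codes = extract_codes(sample_hrefs)
print(codes)  # ['sh600000', 'sz000001']
print(build_info_url("https://gupiao.baidu.com/stock/", codes[0]))
```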

Example implementation:

import re
import traceback

import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch a page and return its text, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def getStockList(lst, stockURL):
    """Append to lst every stock code (sh/sz plus six digits) found in the page's links."""
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        href = a.attrs.get('href', '')
        # \d must be escaped; a bare "d{6}" would match six letter d's.
        match = re.search(r"s[hz]\d{6}", href)
        if match:
            lst.append(match.group(0))


def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock's page and append its name and dt/dd fields to fpath."""
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            # The key '股票名称' means "stock name".
            infoDict.update({'股票名称': name.text.split()[0]})

            # Each <dt> holds a field name; the <dd> at the same position holds its value.
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for key, val in zip(keyList, valueList):
                infoDict[key.text] = val.text

            # Append one dict per line.
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            # Print the error and move on to the next stock.
            traceback.print_exc()
            continue


def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'G://Learning materials//Python//python网页爬虫与信息提取//BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)


if __name__ == '__main__':
    main()
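
Because each line of the output file is the `str()` of a Python dict, the file can be read back with `ast.literal_eval`. This is a minimal sketch of that idea, not part of the original program. The helper name `load_records` and the sample values are hypothetical, and only the '股票名称' key is taken from the code above.

```python
import ast


def load_records(lines):
    """Turn lines of str(dict) text back into a list of dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            # literal_eval only accepts Python literals, so it is safer than eval().
            records.append(ast.literal_eval(line))
    return records


# Hypothetical file content, written the same way the crawler writes it.
sample_lines = [
    str({'股票名称': 'ExampleStock', 'example-field': '10.00'}) + '\n',
    '\n',
]

records = load_records(sample_lines)
print(records[0]['股票名称'])
```

With a real output file, the same function can be called on the open file object, for example `load_records(open(path, encoding='utf-8'))`, since iterating a file yields its lines.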

Code optimization:

Improving the user experience: add a dynamic progress display.
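
One way to show dynamic progress is to reprint a single console line, using a carriage return so each update overwrites the previous one. The sketch below assumes that approach. The helper names `format_progress` and `report_progress` and the sample stock codes are hypothetical, and the loop body stands in for the real fetch-and-save work.

```python
import sys


def format_progress(done, total):
    """Return a one-line progress message that starts with a carriage return."""
    percent = done * 100 / total if total else 100.0
    return f"\rProgress: {percent:.2f}%"


def report_progress(done, total):
    """Overwrite the current console line with the latest progress."""
    sys.stdout.write(format_progress(done, total))
    sys.stdout.flush()


# Hypothetical stock codes, for illustration only.
codes = ["sh600000", "sz000001", "sz000002"]
for count, code in enumerate(codes, start=1):
    # Fetching and saving one stock would happen here.
    report_progress(count, len(codes))
print()  # Move to a fresh line once the loop finishes.
```

In the crawler, a counter incremented once per stock in `getStockInfo`, including in the error branch, would keep the percentage accurate even when some pages fail.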

北音执念i


Original article: https://www.cnblogs.com/beiyin/p/9129650.html