
Python Series: My First Web Scraper

I suddenly realized how powerful web scrapers are, so I studied them a bit and tried one out on Nowcoder (牛客网).

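The script walks through every contest in Nowcoder's contest section and builds a list showing how many problems I solved in each contest, plus the overall number of problems solved.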

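Using Chrome's developer tools (F12), I found that this information is loaded dynamically by JavaScript (P.S. I don't fully understand the mechanism, so I'll describe it this way for now). Solve status can't be static: when you get an AC on a problem, the server refreshes just that part of the page. In short, the information is not in the page source, so the earlier approach of treating the page source as a string no longer works directly. However, the dynamic data has its own URL, which F12 reveals. Fetch that URL, and the response containing the dynamic data can be processed as a string just like before. (P.S. The request also needs the current Cookie of your logged-in account.)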
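Before the full script, here is a minimal offline sketch of the "treat the response as a string" idea. The sample text is hypothetical: it is assembled only from the field names the script matches on (`index`, `myStatus`, `problemCount`), the `未通过` status value is invented for contrast, and the real endpoint's response may be laid out differently. `count_accepted` is a helper name introduced here for illustration.

```python
import re

# Hypothetical response text built from the field names used in the script.
# The real payload returned by the site may look different.
SAMPLE_RESPONSE = (
    '{"problemCount":3,"problems":['
    '{"index":"A","myStatus":"通过"},'
    '{"index":"B","myStatus":"未通过"},'
    '{"index":"C","myStatus":"通过"}]}'
)


def count_accepted(text):
    """Return (accepted, total) by pattern-matching the raw response text."""
    # One match per problem whose status is exactly "通过" (accepted)
    accepted = len(re.findall(r'"index":"\D","myStatus":"通过"', text))
    # The capture group pulls out only the digits of the problem count
    total_match = re.search(r'"problemCount":(\d+)', text)
    total = int(total_match.group(1)) if total_match else None
    return accepted, total


accepted, total = count_accepted(SAMPLE_RESPONSE)
print('"Accepted":{}/{}'.format(accepted, total))
```

Keeping the matching in a small function like this makes it possible to check the patterns against saved response text without sending any requests.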

import re
import time

import requests

headers = {
    # Paste the Cookie of your own logged-in session here (copied via F12)
    'Cookie': '_'
}

# One problem-list URL per contest id; the "_" query value is masked in this post
urls = ['https://www.nowcoder.com/acm/contest/problem-list?token=&id={}&_=**********'.format(num) for num in range(1, 139)]
tot = 0
tottxt = 'data :\n'
for num, url in enumerate(urls, start=1):
    try:
        # The request sits inside the try block so connection failures are caught
        res = requests.get(url, headers=headers, verify=False)
        # One match per accepted problem; \D matches the problem index letter (A, B, C, ...)
        ss = re.findall(r'"index":"\D","myStatus":"通过"', res.text)
        if len(ss) > 0:
            ss2 = re.search(r'"problemCount":\d+', res.text)
            tottxt = tottxt + 'Contest id:' + str(num) + '\n'
            # Fetch the contest page only to read its <title>
            temp = requests.get('https://www.nowcoder.com/acm/contest/{}#question'.format(num))
            ss3 = re.search('<title>(.*?)</title>', temp.text)
            if ss3 is not None:
                newss3 = re.sub('<title>', '', ss3.group())
                newss3 = re.sub('_牛客网</title>', '', newss3)
                tottxt = tottxt + newss3 + '\n'
            tottxt = tottxt + '     "Accepted":' + str(len(ss)) + '/'
            tot = tot + len(ss)
            if ss2 is not None:
                # Keep only the digits of the problem count
                newss2 = re.sub(r'\D', '', ss2.group())
                tottxt = tottxt + newss2
            tottxt = tottxt + '\n'
            time.sleep(2)  # pause between contests to avoid hammering the site
    except requests.exceptions.ConnectionError:
        print("**ConnectionError**")

tottxt = tottxt + '\n' + 'All Accepted:' + str(tot)
# Write the report as UTF-8 bytes; the with block closes the file
with open('tot2.txt', 'wb') as f:
    f.write(tottxt.encode('utf-8'))
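The fragments the script matches on look like JSON, although the post does not confirm the response format. If the endpoint does return JSON, a sturdier alternative to regular expressions might be to decode it and walk the result. The sketch below assumes only that accepted problems appear as objects with a `myStatus` field equal to `通过`; it makes no assumption about where those objects are nested. `iter_dicts`, `count_accepted_json`, and the sample payload are hypothetical names and data introduced for illustration.

```python
import json


def iter_dicts(node):
    """Yield every dict found anywhere inside a decoded JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_dicts(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_dicts(item)


def count_accepted_json(text):
    """Count objects whose "myStatus" is "通过"; return 0 if text is not JSON."""
    try:
        payload = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0
    return sum(1 for d in iter_dicts(payload) if d.get('myStatus') == '通过')


# Hypothetical payload; the real nesting and extra fields are unknown.
sample = ('{"data":{"problemCount":2,"problems":['
          '{"index":"A","myStatus":"通过"},'
          '{"index":"B","myStatus":""}]}}')
print(count_accepted_json(sample))
```

Unlike the regex version, this does not depend on the order of keys inside each object, so it would keep working if the site emitted `myStatus` before `index`.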


Original post: https://www.cnblogs.com/lnu161403214/p/9627030.html