Simple Web Scraper Examples

Coding tool: Jupyter

Packet capture tool: Fiddler

1: Scraping the Sogou home page

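import requests

# Fetch the Sogou home page with a plain GET request.
url = 'https://www.sogou.com/'
response = requests.get(url=url)

# response.text is the response body decoded to a str.
# In Jupyter, a bare name on the last line displays its value.
text = response.text
text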

Sogou page content

2: Scraping Douban movies by category

import requests

url = 'https://movie.douban.com/j/new_search_subjects'

# Query-string parameters; requests URL-encodes them into the request URL.
param = {
    'sort': 'U',
    'range': '0,10',
    'tags': '',
    'start': '0',
    'genres': '爱情',  # genre filter: "romance"
}
# Send a browser User-Agent instead of the default requests one.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
response = requests.get(url=url, headers=headers, params=param)

# The endpoint returns JSON, so parse it straight into Python objects.
text = response.json()
text

Douban movies
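To see what the `params` dictionary contributes to the request, the query string can be reproduced offline with the standard library. This is a minimal sketch, assuming `requests` encodes simple string values the same way `urllib.parse.urlencode` does; `build_url` is a hypothetical helper for illustration, not part of `requests`.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append URL-encoded query parameters to base (hypothetical helper)."""
    return base + '?' + urlencode(params)


param = {
    'sort': 'U',
    'range': '0,10',
    'tags': '',
    'start': '0',
    'genres': '爱情',
}

# The comma and the non-ASCII genre name are percent-encoded (UTF-8 bytes).
full_url = build_url('https://movie.douban.com/j/new_search_subjects', param)
print(full_url)
```

Comparing this string with the request URL shown in Fiddler is a quick way to check that the parameters match what the browser sends.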

3: Scraping search results for a query and writing them to a file

import requests

url = 'https://www.sogou.com/web'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
# The search term is passed as the 'query' parameter.
param = {
    'query': '校花'
}
response = requests.get(url=url, headers=headers, params=param)

# response.content is the raw body as bytes, so open the file in binary mode.
text = response.content
with open('xh.html', 'wb') as f:
    f.write(text)

Scrape and write to a file
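The file is opened in `'wb'` mode because `response.content` is raw bytes, whereas `response.text` is a decoded `str`. The sketch below illustrates the difference without touching the network. The sample HTML string and the `save_page` helper are made up for illustration and stand in for a real response body.

```python
import os
import tempfile


def save_page(path, body):
    """Write a page body to path, choosing the file mode from the type of body."""
    if isinstance(body, bytes):
        # Raw bytes (like response.content): write unchanged in binary mode.
        with open(path, 'wb') as f:
            f.write(body)
    else:
        # Decoded text (like response.text): text mode with an explicit encoding.
        with open(path, 'w', encoding='utf-8') as f:
            f.write(body)


sample_html = '<html><body>校花</body></html>'  # stand-in for a real page
tmp_dir = tempfile.mkdtemp()
bytes_path = os.path.join(tmp_dir, 'bytes.html')
text_path = os.path.join(tmp_dir, 'text.html')

save_page(bytes_path, sample_html.encode('utf-8'))
save_page(text_path, sample_html)

# Both files hold the same bytes when the text is encoded as UTF-8.
with open(bytes_path, 'rb') as f1, open(text_path, 'rb') as f2:
    same = f1.read() == f2.read()
```

Writing the bytes directly avoids guessing the page's encoding, which is probably why binary mode is the simpler choice when the goal is only to save the page.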

4: Scraping the national drug regulator's site (dynamically generated content)

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
# List endpoint: returns one page of records as JSON.
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
# Detail endpoint: returns the full record for a single ID.
url_c = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'

# Form fields sent in the POST body of the list request.
data = {
    'on': 'true',
    'page': '3',
    'pageSize': '15',
    'productName': '',
    'conditionType': '1',
    'applyname': '',
    'applysn': ''
}
response = requests.post(url=url, headers=headers, data=data)
conn_list = response.json()['list']

comm_info_list = []
for i in conn_list:
    # Only fetch details for entries with a non-empty XC_DATE.
    if i['XC_DATE']:
        data_c = {'id': i['ID']}
        res = requests.post(url=url_c, data=data_c, headers=headers)
        comm_info_list.append(res.json())
comm_info_list

How to scrape dynamically generated data
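The code above is a two-step crawl: a list endpoint supplies IDs, and a detail endpoint is then queried once per ID. The ID-selection step can be sketched on its own, without network access. The sample payload below is invented for illustration; only the `list`, `ID`, and `XC_DATE` keys come from the code above, and `select_ids` and `build_detail_forms` are hypothetical helpers.

```python
def select_ids(list_payload):
    """Return the IDs of entries whose XC_DATE field is non-empty."""
    return [item['ID'] for item in list_payload['list'] if item.get('XC_DATE')]


def build_detail_forms(ids):
    """Build one POST form body per ID for the detail endpoint."""
    return [{'id': record_id} for record_id in ids]


# Made-up stand-in for the JSON returned by the list endpoint.
sample_payload = {
    'list': [
        {'ID': 'aaa111', 'XC_DATE': '2023-01-01'},
        {'ID': 'bbb222', 'XC_DATE': ''},
        {'ID': 'ccc333', 'XC_DATE': '2024-06-30'},
    ]
}

ids = select_ids(sample_payload)
forms = build_detail_forms(ids)
```

Unlike the original loop, this sketch uses `item.get('XC_DATE')`, so an entry that lacks the key is skipped rather than raising `KeyError`. Each dictionary in `forms` could then be passed as `data=` to `requests.post`.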

Original article: https://www.cnblogs.com/duanhaoxin/p/10098639.html