
Basic Usage of the Third-Party Python Library Requests

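Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 license. It is more convenient than urllib, saves a great deal of work, and fully covers HTTP testing needs. Its design philosophy centers on the idioms of PEP 20, which makes it more Pythonic than urllib.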

Install with pip

pip install requests

1. The most basic GET request

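import requests

req = requests.get('https://www.cnblogs.com/')  # a plain GET request
print(req.text)  # body as str, decoded with req.encoding (from the Content-Type header, or guessed from the bytes)
print(req.content)  # body as raw bytes; Chinese characters show up as escape sequences
print(req.content.decode('utf-8'))  # decode explicitly; this page declares <meta charset="utf-8" />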
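
The garbling mentioned in the comments comes from decoding the same bytes with the wrong character set. The sketch below does not call Requests; it only reproduces the decoding step on a sample string chosen here for illustration (it is not taken from the page above).

```python
# Sample text chosen for illustration; it stands in for a response body.
original = "博客园"
raw = original.encode("utf-8")      # what req.content would hold: bytes

wrong = raw.decode("ISO-8859-1")    # wrong charset: never fails, but garbles the text
right = raw.decode("utf-8")         # the charset the page declares

print(raw)
print(wrong == original)
print(right == original)
```

With a real response, assigning req.encoding = 'utf-8' before reading req.text should have the same effect as decoding req.content by hand.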

requests.get('https://github.com/timeline.json')  # GET request
requests.post('http://httpbin.org/post')  # POST request
requests.put('http://httpbin.org/put')  # PUT request
requests.delete('http://httpbin.org/delete')  # DELETE request
requests.head('http://httpbin.org/get')  # HEAD request
requests.options('http://httpbin.org/get')  # OPTIONS request

GET is not the only simple method: all the other HTTP methods share the same interface style.
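
One way to see the uniform interface is to build the same request for every method without sending anything. This is a minimal sketch using requests.Request and its prepare() method; the single URL is reused for all methods purely for illustration.

```python
import requests

# Same URL for every method; nothing is sent, the requests are only prepared.
url = "http://httpbin.org/get"
methods = ["GET", "POST", "PUT", "DELETE", "HEAD", "OPTIONS"]

prepared = [requests.Request(method, url).prepare() for method in methods]
for p in prepared:
    print(p.method, p.url)
```

The helpers requests.get, requests.post and so on are shorthands for requests.request('GET', url), requests.request('POST', url) and so on, which is why they all accept the same keyword arguments.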

2. Using POST to fetch a page that requires a username and password

import requests

postdata = {
    'name': 'estate',
    'pass': '123456'
}  # form fields as a dict; the keys match the name attributes of the form inputs
req = requests.post('http://www.iqianyue.com/mypost', data=postdata)
print(req.text)  # the page returned after submitting the form

yonghu = req.content  # raw bytes of the response after login
with open('1.html', 'wb') as f:  # write the result to 1.html
    f.write(yonghu)


http://www.iqianyue.com/mypost

This page requires a login, so we define a dictionary that holds the username and password and pass it as the data argument.

If the script runs without errors, the result can be written to an HTML file:

<html>
<head>
<title>Post Test Page</title>
</head>

<body>
<form action="" method="post">
name:<input name="name" type="text" /><br>
passwd:<input name="pass" type="text" /><br>
<input name="" type="submit" value="submit" />
<br />
you input name is:estate<br>you input passwd is:123456</body>
</html>
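
The response echoes the two submitted fields. To see how the dictionary is turned into a request body, the request can be prepared without being sent. This is a minimal sketch, assuming only the form fields and URL used above.

```python
import requests

postdata = {"name": "estate", "pass": "123456"}

# Prepare the request without sending it, to inspect what would go over the wire.
prepared = requests.Request(
    "POST", "http://www.iqianyue.com/mypost", data=postdata
).prepare()

print(prepared.headers["Content-Type"])
print(prepared.body)
```

A dict passed as data is form-encoded, so the body is the same key=value&key=value string a browser would submit from the form.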

3. Using headers to get past anti-scraping checks

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}  # request headers; further header fields can be added to the same dictionary
req = requests.get('http://maoyan.com/board', headers=headers)  # pass the dict as the headers argument
print(req.text)

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

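Some pages respond with 403 Forbidden. In that case, open the page in a browser, find the User-Agent value among the request headers, add it to the dictionary, and pass the dictionary as the headers argument. With that in place, the Maoyan movie page can be scraped.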
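
A likely reason for the 403 is that Requests identifies itself with its own User-Agent unless told otherwise. The sketch below prints that default and then prepares (without sending) the request with the browser value, to show the header that would go out. It uses requests.utils.default_user_agent; the variable names are illustrative.

```python
import requests
from requests.utils import default_user_agent

# What Requests sends when no User-Agent is supplied.
default_ua = default_user_agent()
print(default_ua)

browser_ua = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")

# Prepare (but do not send) the request to see the header that would go out.
prepared = requests.Request(
    "GET", "http://maoyan.com/board", headers={"User-Agent": browser_ua}
).prepare()
print(prepared.headers["User-Agent"])
```

After a real request, req.status_code shows whether the server still answered with 403.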

4. Using cookies to skip login

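import requests

# cookies.txt holds the cookie string copied from a logged-in browser session
with open('cookies.txt', 'r') as f:
    raw = f.read()

# start with an empty cookies dictionary
cookies = {}
# split on ';' to get the individual name=value pairs, then loop over them
for pair in raw.split(';'):
    pair = pair.strip()  # drop the space after ';' and any trailing newline
    if not pair:
        continue  # skip empty pieces, e.g. after a trailing ';'
    # maxsplit=1 cuts at the first '=' only, giving exactly two parts
    name, value = pair.split('=', 1)
    # add the entry to the cookies dictionary
    cookies[name] = value

url = 'https://www.cnblogs.com/'
res = requests.get(url, cookies=cookies)
with open('bokeyuan.html', 'wb') as f1:
    f1.write(res.content)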

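First log in with your username and password in a browser, then copy the page's cookies and paste them into a new text file. The script creates an empty cookies dictionary, splits the file content on ; to get the individual cookies, and splits each cookie at its first = into a name and a value.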

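After the script runs, the result is written to an HTML file named bokeyuan. Open that file and click the browser icon to view the page as it appears after login.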
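
The splitting step is easy to get subtly wrong, because a copied cookie string usually has a space after each ; and often ends with a newline. Below is a minimal sketch of a tolerant parser; the function name parse_cookie_string and the sample cookie names and values are made up for illustration.

```python
import requests


def parse_cookie_string(raw):
    """Turn 'a=1; b=2' (a copied cookie string) into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()          # remove surrounding spaces and newlines
        if not pair:
            continue                 # ignore empty pieces after a trailing ';'
        name, _, value = pair.partition("=")  # cut at the first '=' only
        cookies[name] = value
    return cookies


# Made-up cookie names and values, for illustration only.
sample = "sessionid=abc123; theme=dark; token=a=b;\n"
cookies = parse_cookie_string(sample)
print(cookies)

# Prepare (but do not send) a request to see the resulting Cookie header.
prepared = requests.Request(
    "GET", "https://www.cnblogs.com/", cookies=cookies
).prepare()
print(prepared.headers["Cookie"])
```

Stripping each piece keeps stray spaces out of the cookie names, which would otherwise be sent to the server as part of the name.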

5. Proxy IPs

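import requests

proxies = {
    # keys are lower-case URL schemes; an https:// URL only uses the 'https' entry
    'http': 'http://183.129.244.17:10080',
    'https': 'http://183.129.244.17:10080'
}
req = requests.get('https://www.taobao.com/', proxies=proxies)
print(req.text)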

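To avoid having your IP banned while scraping, it is common to go through a proxy, and Requests supports this with the proxies parameter. Look up a proxy IP online and enter its address and port in the dictionary, keyed by the lower-case URL scheme it should handle (http or https). Further entries can be added to the same dictionary. If the proxy requires a username and password, write it like this: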

proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
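
Which dictionary entry applies to a given URL depends on the URL's scheme. The sketch below checks this without sending anything, using the select_proxy helper from requests.utils, which takes a URL and a proxies dictionary and returns the matching entry or None. The proxy address is the one from the example above, written with an explicit http:// scheme.

```python
from requests.utils import select_proxy

url = "https://www.taobao.com/"
proxy = "http://183.129.244.17:10080"

# An upper-case key is never matched.
upper = select_proxy(url, {"HTTP": proxy})
# An 'http' entry does not apply to an https:// URL either.
http_only = select_proxy(url, {"http": proxy})
# A lower-case 'https' entry is the one used for this URL.
both = select_proxy(url, {"http": proxy, "https": proxy})

print(upper, http_only, both)
```

Listing the proxy under both http and https is the simplest way to make sure every request goes through it.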

6. Timeout settings

import requests

req = requests.get('https://www.taobao.com/', timeout=1)
print(req.text)

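timeout is not a limit on the total time needed to download the response body. A single value applies both to establishing the connection and to each wait for data from the server; if either takes longer than timeout seconds, the request fails with an exception.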
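
The timeout argument also accepts a (connect, read) tuple, and the exception type tells which phase ran out of time. The sketch below is self-contained: it starts a deliberately slow local server (SlowHandler is a hypothetical class written for this demo, and the timing values are arbitrary) and requests it with a short read timeout.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical demo server that waits before answering."""

    def do_GET(self):
        time.sleep(1.0)  # longer than the client's read timeout below
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"late")
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for the local server

try:
    # 3.05 s to connect, 0.2 s to wait for data once connected
    session.get(url, timeout=(3.05, 0.2))
    outcome = "completed"
except requests.exceptions.ConnectTimeout:
    outcome = "connect timeout"
except requests.exceptions.ReadTimeout:
    outcome = "read timeout"
finally:
    server.shutdown()

print(outcome)
```

Both exception classes derive from requests.exceptions.Timeout, so a single except clause on that class is enough when the phase does not matter.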

These are the basic operations of the Requests library; more will be added later.

Original article: https://www.cnblogs.com/Estate-47/p/9799332.html