
The requests module

Installation

pip install requests
pip install -i https://pypi.doubanio.com/simple/ requests

requests.request()

Parameters accepted by a request

requests.request(method, url, **kwargs) is a function that builds and sends a request; it supports every HTTP request method.

import requests

response = requests.request(method='get', url='https://www.baidu.com')
print(response.status_code)

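The main parameters of requests.request():

method: the request method.
url: the request URL.
**kwargs:
- params: a dict or byte sequence appended to the URL as the query string. Key-value pairs are added in the form k1=v1&k2=v2; used mostly with GET requests.
- data: a dict, byte sequence, or file object sent as the request body, typically to submit a resource to the server. Unlike params, it is not placed in the URL. A string is also accepted.
- json: data to be serialized as JSON and submitted to the server as the request body.
- headers: a dict defining the request headers; for example, the User-Agent can be set here.
- cookies: a dict or CookieJar.
- auth: a tuple, used for HTTP authentication.
- files: a dict, used to upload files to the server.
- timeout: the timeout in seconds.
- proxies: a dict configuring proxy servers.
- allow_redirects: a switch controlling whether redirects are followed; defaults to True.
- stream: a switch controlling whether the response body is downloaded lazily. It defaults to False, which means the body is downloaded immediately. stream=True is meant for streaming requests such as large file downloads, where reading the whole file in one go is impractical. Note that with stream=True, requests cannot release the connection back to the connection pool until all the data has been consumed or response.close() has been called.
- verify: a switch for SSL certificate verification; defaults to True.
- cert: the path to a local SSL client certificate.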
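As a rough illustration of how some of these keyword arguments end up in the outgoing request, the sketch below builds a request with requests.Request and prepare() instead of sending it, so it runs without network access. The credentials are placeholders chosen for this example, not values from the article.

```python
import base64

import requests

# Build the request without sending it; the credentials are placeholders
req = requests.Request(
    method="GET",
    url="http://www.httpbin.org/get",
    params={"k1": "v1", "k2": "v2"},
    auth=("xxx", "666"),
)
prepared = req.prepare()

print(prepared.method)  # the HTTP verb that would be sent
print(prepared.url)     # params are appended to the URL as a query string

# A (user, password) tuple is sent as an HTTP Basic Authorization header
expected_auth = "Basic " + base64.b64encode(b"xxx:666").decode("ascii")
print(prepared.headers["Authorization"] == expected_auth)
```

Inspecting the prepared request is a convenient way to check what would go over the wire before pointing the code at a real server.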

Attributes supported by the response object

import requests

response = requests.request(method='get', url='http://www.httpbin.org/get')

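Once a request has been sent, a response comes back. requests gives this response object the following attributes and methods:

response: the response object.
response.status_code: the status code of the response.
response.text: the response content as a string.
response.json(): the response body parsed as JSON; raises ValueError if the body is not valid JSON.
response.content: the response content as bytes.
response.iter_content(chunk_size): a generator. With stream=True, iterating over it returns the content chunk by chunk, which avoids reading a large file into memory all at once. With stream=False the body has already been downloaded in full, and it comes back as a single chunk only when chunk_size=None. chunk_size sets the chunk size.
response.iter_lines(): a generator. With stream=True, it iterates over the response data one line at a time, which likewise avoids loading a large file into memory in one go. The official documentation warns that this method is not reentrant safe without saying much more; on closer inspection, calling it more than once on the same response can lose part of the received data.
response.cookies: the cookies in the response.
response.cookies.get_dict(): the cookies as a dict.
response.cookies.items(): the cookies as a list of (name, value) tuples.
response.headers: the response headers as a dict. To read one key with a fallback: response.headers.get('Content-Type', 'Oops, not found!')
response.request: the request object that produced this response (for example, response.request.method gives the request method).
response.url: the URL of the request.
response.reason: the textual reason for the HTTP status.
response.encoding: the encoding used to decode the response.
response.encoding = "gbk": changes the response encoding. For example, if the response was decoded as utf-8, response.encoding = "gbk" switches it to gbk.
response.apparent_encoding: the encoding guessed from the response bytes by a detection library (chardet, or charset_normalizer in newer releases); it is not guaranteed to be 100% accurate.
response.history: a list of the responses in the redirect history, sorted from oldest to most recent.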
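To try several of these attributes without depending on an external site, here is a minimal sketch that starts a throwaway HTTP server on localhost and queries it. DemoHandler and the JSON body it returns are made up for this example.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local endpoint that always answers with a small JSON body."""

    def do_GET(self):
        body = json.dumps({"args": {"k1": "v1"}}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/get"

response = requests.get(url, timeout=5)
summary = {
    "status_code": response.status_code,
    "reason": response.reason,
    "content_type": response.headers.get("Content-Type", "missing"),
    "json": response.json(),
    "is_bytes": isinstance(response.content, bytes),
    "method": response.request.method,
    "history": response.history,  # empty because no redirect happened
}
print(summary)
server.shutdown()
```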

requests.get()

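requests.get(url, params=None, **kwargs) sends a GET request. Parameters:

url: the request URL.
params: optional extra query parameters for the URL, as a dict or bytes.
**kwargs: see the kwargs of requests.request.

The params parameter

GET requests often carry extra parameters such as K1=V1&K2=V2.

We can concatenate them by hand: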

import requests
response = requests.get(url='http://www.httpbin.org/get?k1=v1&k2=v2')
print(response.url)  # http://www.httpbin.org/get?k1=v1&k2=v2
print(response.json().get('args'))  # {'k1': 'v1', 'k2': 'v2'}

Alternatively, the params parameter builds the query string for us.

import requests

xxx = {"user": "xxx", "pwd": "666"}
response = requests.get(url='http://www.httpbin.org/get', params=xxx)
print(response.url)  # http://www.httpbin.org/get?user=xxx&pwd=666
print(response.json().get('args'))  # {'pwd': '666', 'user': 'xxx'}
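A few details of how params is encoded are easy to check offline. The sketch below prepares a request without sending it; the keys tag and skip are invented for the example.

```python
import requests

params = {
    "tag": ["a", "b"],  # a list value repeats the key
    "skip": None,       # a None value is left out of the URL
    "user": "张开",      # non-ASCII text is percent-encoded as UTF-8
}
prepared = requests.Request("GET", "http://www.httpbin.org/get", params=params).prepare()
print(prepared.url)
```

This is the main practical reason to prefer params over manual concatenation: escaping is handled for you.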

headers

How to send headers with a GET request.

import requests
from fake_useragent import UserAgent
headers = {"user-agent": UserAgent().random}  # a random User-Agent string from the fake_useragent module
response = requests.get(url='http://www.httpbin.org/get', headers=headers)
print(response.json()['headers']['User-Agent'])  # Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5
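If fake_useragent is not installed, a fixed string works just as well. The sketch below prepares the request offline and also shows that header names are matched case-insensitively; the User-Agent value is a made-up placeholder.

```python
import requests

headers = {"user-agent": "demo-client/0.1"}  # hypothetical User-Agent value
prepared = requests.Request("GET", "http://www.httpbin.org/get", headers=headers).prepare()

# Header lookup ignores case, so either spelling finds the same value
print(prepared.headers["User-Agent"])
print(prepared.headers["USER-AGENT"])
```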

cookies

How to send cookies with a GET request.

import requests
cookies = {
    "user": "xxx",
    "pwd": "666"
}
response = requests.get(url='http://www.httpbin.org/cookies', cookies=cookies)
print(response.json())  # {'cookies': {'pwd': '666', 'user': 'xxx'}}

This URL echoes the cookies it received inside its JSON body, so we read them from response.json() rather than from response.cookies.
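On the wire, the cookies dict becomes a single Cookie request header. A small offline sketch, preparing the request rather than sending it:

```python
import requests

cookies = {"user": "xxx", "pwd": "666"}
prepared = requests.Request("GET", "http://www.httpbin.org/cookies", cookies=cookies).prepare()

cookie_header = prepared.headers["Cookie"]
pairs = sorted(cookie_header.split("; "))  # individual name=value pairs
print(cookie_header)
print(pairs)
```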

Now let's look at the cookies in a response:

import requests

url = 'http://www.baidu.com'
response = requests.get(url=url)
print(response.cookies)  # <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
print(response.cookies.get_dict())  # {'BDORZ': '27315'}
print(response.cookies.items())  # [('BDORZ', '27315')]

File download

For a small file such as an image, we can write the content straight to disk; there is no need to touch stream, so leave it at its default of False.

import requests
import webbrowser

url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1568638318957&di=1d7f37e7caece1c39af05b624f42f0a7&imgtype=0&src=http%3A%2F%2Fimg3.duitang.com%2Fuploads%2Fitem%2F201501%2F17%2F20150117224236_vYFmL.jpeg'

response = requests.get(url=url)
with open('a.jpeg', 'wb') as f:
    f.write(response.content)
webbrowser.open('a.jpeg')

For a large file, that approach won't do:

import requests
import webbrowser


url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1568638318957&di=1d7f37e7caece1c39af05b624f42f0a7&imgtype=0&src=http%3A%2F%2Fimg3.duitang.com%2Fuploads%2Fitem%2F201501%2F17%2F20150117224236_vYFmL.jpeg'

response = requests.get(url=url, stream=True)
with open('a.jpeg', 'wb') as f:
    for chunk in response.iter_content(chunk_size=256):
        f.write(chunk)
webbrowser.open('a.jpeg')

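response.iter_content(chunk_size=256) downloads the file chunk by chunk, with chunk_size setting the size of each chunk.

response.iter_lines can also be used to iterate line by line, but the official documentation notes that it is not reentrant safe.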
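One way to make sure a streamed connection is released is to use the response as a context manager. The sketch below assumes a throwaway local server (FileHandler and PAYLOAD are invented for this example) so that it runs without internet access.

```python
import os
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

PAYLOAD = b"x" * 100_000  # stand-in for a large file


class FileHandler(BaseHTTPRequestHandler):
    """Hypothetical local endpoint that serves PAYLOAD."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), FileHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/big.bin"

target = os.path.join(tempfile.mkdtemp(), "big.bin")
chunks = 0
# Leaving the with block closes the response, which releases the connection
with requests.get(url, stream=True, timeout=5) as response:
    response.raise_for_status()
    with open(target, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            chunks += 1

downloaded = os.path.getsize(target)
print(downloaded, chunks)
server.shutdown()
```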

requests.post()

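requests.post(url, data=None, json=None, **kwargs) sends a POST request. Parameters:

url: the request URL.
data: optional; a form-encoded dict, bytes, or file object to send in the request body.
json: data to send in the request body as JSON.
**kwargs: see the kwargs of requests.request.

In a POST request, both data and json accept either a str or a dict.

Differences:

1. With json, whether a str or a dict, the Content-Type defaults to application/json if headers does not set it.

2. With data as a dict, the Content-Type defaults to application/x-www-form-urlencoded if not set, which is the same as an ordinary HTML form submission.

3. With data as a str, requests sends the string as-is and does not set a Content-Type; set it in headers if the server needs one.

4. With data as a dict, the request body looks like a=1&b=2; with json, the request body looks like '{"a": 1, "b": 2}'.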
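These defaults can be checked offline by preparing each request and printing the Content-Type and body that requests would send. A minimal sketch; the payload is the a/b example from the list above:

```python
import json

import requests

url = "http://www.httpbin.org/post"
payload = {"a": 1, "b": 2}

form = requests.Request("POST", url, data=payload).prepare()   # dict passed as data
raw = requests.Request("POST", url, data="abc").prepare()      # str passed as data
as_json = requests.Request("POST", url, json=payload).prepare()  # dict passed as json

print(form.headers.get("Content-Type"), form.body)
print(raw.headers.get("Content-Type"), raw.body)
print(as_json.headers.get("Content-Type"), json.loads(as_json.body))
```

For a str passed as data, the printed Content-Type is None: requests leaves the header unset and sends the string unchanged.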

The data parameter

import requests

url = 'http://www.httpbin.org/post'
# data as a dict
data_dict = {"k1": "v1"}
response = requests.post(url=url, data=data_dict)
print(response.json())

# data as a string
data_str = "abc"
response = requests.post(url=url, data=data_str)
print(response.json(), type(response.json()['data']))

# data as a file object; the with block closes the file afterwards
with open('a.jpg', 'rb') as file:
    response = requests.post(url=url, data=file)
print(response.json())

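File upload

File upload over POST uses the files parameter.

import requests
# the with block closes the file once the upload has finished
with open('a.jpg', 'rb') as f:
    file = {"file": f}
    response = requests.post('http://www.httpbin.org/post', files=file)
print(response.json())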
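Under the hood, files produces a multipart/form-data body. The sketch below prepares such a request offline, using an in-memory io.BytesIO object as a stand-in for a real a.jpg; the byte content is made up.

```python
import io

import requests

fake_image = io.BytesIO(b"\xff\xd8\xff fake jpeg bytes")  # stand-in for open('a.jpg', 'rb')
files = {"file": ("a.jpg", fake_image, "image/jpeg")}  # (filename, file object, content type)
prepared = requests.Request("POST", "http://www.httpbin.org/post", files=files).prepare()

content_type = prepared.headers["Content-Type"]
print(content_type.split(";")[0])            # multipart/form-data
print(b'filename="a.jpg"' in prepared.body)  # the file name travels in the body
```

The tuple form lets you set the file name and content type explicitly instead of having them inferred from the file object.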

The json parameter

import requests

url = 'http://www.httpbin.org/post'
response = requests.post(url=url, json={"user": "zhangkai"})
print(response.json())

requests.head()

requests.head(url, **kwargs) sends a HEAD request. Parameters:

url: the request URL.
**kwargs: see the kwargs of requests.request.

import requests
url = 'http://httpbin.org/get'
response = requests.head(url=url)
print(response.headers)
'''
{
'Access-Control-Allow-Credentials': 'true', 
'Access-Control-Allow-Origin': '*', 
'Content-Encoding': 'gzip', 
'Content-Type': 'application/json', 
'Date': 'Mon, 16 Sep 2019 10:58:07 GMT', 
'Referrer-Policy': 'no-referrer-when-downgrade', 
'Server': 'nginx', 
'X-Content-Type-Options': 'nosniff', 
'X-Frame-Options': 'DENY', 
'X-XSS-Protection': '1; mode=block', 
'Connection': 'keep-alive'
}
'''

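The advantage of requests.head(url, **kwargs) is that it fetches the response headers with very little traffic, since no body is transferred; it can also be used in pagination.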
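A typical use is checking a resource's size before deciding whether to download it. The sketch below assumes a throwaway local server (HeadHandler and the advertised length of 2048 are invented for the example).

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class HeadHandler(BaseHTTPRequestHandler):
    """Hypothetical local endpoint that answers HEAD with headers only."""

    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", "2048")  # size a GET would return
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), HeadHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/get"

response = requests.head(url, timeout=5)
size = int(response.headers["Content-Length"])
body = response.content  # empty: a HEAD response carries no body
print(size, body)
server.shutdown()
```

Note that, unlike requests.get, requests.head does not follow redirects unless allow_redirects=True is passed.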

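Timeout

A timeout means no response arrived within the allotted time, in which case requests raises an exception.

import requests
response = requests.get('https://www.12306.cn', timeout=0.0001)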
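In practice the timeout exception should be caught. The sketch below assumes a deliberately slow local server (SlowHandler is invented for the example) and uses the (connect, read) tuple form of timeout.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local endpoint that waits one second before answering."""

    def do_GET(self):
        time.sleep(1)
        try:
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

try:
    # timeout=(connect, read): allow 3 s to connect, 0.2 s to wait for the response
    requests.get(url, timeout=(3, 0.2))
    outcome = "completed"
except requests.exceptions.Timeout as exc:
    outcome = type(exc).__name__

print(outcome)
server.shutdown()
```

Catching requests.exceptions.Timeout covers both connect and read timeouts, since ConnectTimeout and ReadTimeout are its subclasses.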

An example: scraping Yesky (天极网):

import os

import requests                  # HTTP client used to fetch the pages
from bs4 import BeautifulSoup    # BS4 parser used to extract data from the HTML

# os.path.abspath(__file__) returns the absolute path of this .py file;
# os.path.dirname() strips the file name and leaves the folder that contains it.
current_path = os.path.abspath(__file__)
path = os.path.dirname(current_path)

response = requests.get(url="http://pic.yesky.com/c/6_20491.shtml")
text = response.text
soup = BeautifulSoup(text, "html.parser")

# BeautifulSoup parses the response text. The page has a single wrapper div
# with class 1b_box, so find() is enough and find_all() is not needed.
div_obj = soup.find(name="div", attrs={"class": "1b_box"})

# There are many dd tags inside that div, so find_all() is used; it returns a list.
img_list = div_obj.find_all(name="dd")

for dd in img_list:
    dd_img = dd.find("a").get("href")      # href of the a tag inside this dd
    dd_title = dd.find("a").get("title")   # title of the a tag, used later as the folder name

    file_path = os.path.join(path, "xxxx1", dd_title)
    print(file_path)   # absolute path of the folder for this album

    # os.path.isdir() checks whether the path is an existing directory
    if not os.path.isdir(file_path):
        # create the folder (absolute path); makedirs also creates the parent xxxx1 if missing
        os.makedirs(file_path)

    a_img = requests.get(dd_img)
    a_text = a_img.text
    a_soup = BeautifulSoup(a_text, "html.parser")
    a_soup_obj = a_soup.find(name="div", attrs={"class": "overview"})

    if a_soup_obj:   # two of the album pages have no div with class overview
        a_list = a_soup_obj.find_all(name="img")
        for a in a_list:
            a_img = a.get("src")
            # swap the thumbnail size in the URL for the larger version
            a_content = requests.get(a_img.replace("113x113", "740x-"))
            # rsplit("/", 1) splits once from the right, so [-1] is the file name
            file_path_now = os.path.join(file_path, a_img.rsplit("/", 1)[-1])
            with open(file_path_now, "wb") as f:
                f.write(a_content.content)


Original article: https://www.cnblogs.com/zzsy/p/12243759.html