
Spider_Basics Summary 1_Request (get/post, URL parameters, headers, timeout) + Response

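Web Crawlers (Part 1)
I. Introduction
  1. The robots protocol (robots exclusion protocol): a site's robots.txt file tells crawlers which pages may be fetched and which may not.
- User-agent: the crawler that the rules apply to
- Allow: URLs the crawler may access
- Disallow: URLs the crawler must not access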
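
The rules above can be checked from code before a page is fetched. The following is a minimal sketch using the standard library's urllib.robotparser; the robots.txt content, the crawler name "my-crawler", and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# "my-crawler" is a hypothetical user-agent name.
allowed_public = parser.can_fetch("my-crawler", "http://example.com/public/page.html")
allowed_private = parser.can_fetch("my-crawler", "http://example.com/private/page.html")
print(allowed_public, allowed_private)  # True False
```

For a live site, RobotFileParser also offers set_url() and read() to download the file itself rather than parsing a string.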

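  2. Crawling constraints: crawling too fast or too often puts heavy load on the server. The site may block your IP or take legal action, so keep the request rate within a reasonable range.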
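
One simple way to keep the request rate down is to pause between consecutive requests. This is only a sketch: crawl_politely and fake_fetch are hypothetical names, the delay value is arbitrary, and fake_fetch stands in for a real download call so the example runs offline.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real download function such as requests.get."""
    return f"fetched {url}"


pages = crawl_politely(
    ["http://example.com/a", "http://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

A suitable delay depends on the site; some robots.txt files state one explicitly.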

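  3. Crawl workflow:
  - Fetch the page: send a request to the page's URL; the server returns the data for the whole page.
  - Parse the page (extract data): pull the data you want out of the full page.
  - Store the data: save the data, for example in a CSV file or in a database.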
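
The three steps can be sketched end to end as follows. To keep the example runnable offline, the fetch step is replaced by a hard-coded page, and parsing uses the standard library's html.parser rather than BeautifulSoup. The HTML sample and the names TitleParser, extract_titles, and to_csv are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (fetch) is replaced by a hard-coded page so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <h1 class="post-title"><a href="/post/1">First post</a></h1>
  <h1 class="post-title"><a href="/post/2">Second post</a></h1>
</body></html>
"""


class TitleParser(HTMLParser):
    """Step 2: collect the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1 and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


def to_csv(titles):
    """Step 3: write one title per row and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])
    return buffer.getvalue()


titles = extract_titles(SAMPLE_HTML)
csv_text = to_csv(titles)
print(csv_text)
```

In a real crawler the CSV text would be written to a file opened with newline="" instead of an in-memory buffer.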

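II. Writing a crawler
  1. Fetch the page:
  - Import the requests library and call requests.get(link, headers=headers) to fetch the page.
  · The headers argument makes the request look as if it came from a browser.
  · r is the Response object returned by requests, and the information you want is read from it. r.text holds the source of the page.

  2. Extract the data you need: this uses the BeautifulSoup class from the bs4 library, which is covered later.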

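III. Scraping static pages

  1. Response attributes:

  - r.text: the response body as text, decoded automatically using the character encoding given in the response headers
  - r.encoding: the text encoding used to decode r.text
  - r.status_code: the status code of the response
                · 200 means the request succeeded
                · 4xx means a client error
                · 5xx means a server error
  - r.content: the response body as bytes; gzip- and deflate-compressed responses are decompressed automatically
  - r.json(): the JSON decoder built into Requests (a method, so it is called with parentheses)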
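
The status code ranges listed above can be turned into a small helper. This is a sketch only; classify_status is a hypothetical name, and treating the whole 2xx range as success is a generalization of the 200 case mentioned above.

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to the broad category it belongs to."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"  # 1xx informational and 3xx redirects end up here


for code in (200, 404, 503):
    print(code, classify_status(code))
```

With a real response this would be called as classify_status(r.status_code). Requests also provides r.raise_for_status(), which raises requests.exceptions.HTTPError for 4xx and 5xx responses.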

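IV. Code walkthrough:

# 1 - The Requests library and the Response object:
import requests

r=requests.get("http://www.baidu.com")  # r is the server's Response object; get sends a GET request
print(r.url)              # http://www.baidu.com/
print(r.encoding)         # ISO-8859-1  text encoding
print(r.status_code)      # 200         status code: 200 = success, 4xx = client error, 5xx = server error
# print(r.text)           # source of the page returned by the server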
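
The ISO-8859-1 value printed above is the fallback Requests uses when a text response does not declare a charset in its headers. If the page is really UTF-8, r.text comes out garbled until r.encoding is changed. The sketch below reproduces the effect offline with plain bytes; the sample string is arbitrary.

```python
# What r.content would hold for a UTF-8 encoded page.
raw_bytes = "百度一下".encode("utf-8")

# What r.text looks like when the bytes are decoded with the wrong encoding.
wrong_text = raw_bytes.decode("ISO-8859-1")

# What r.text looks like once the correct encoding is used.
right_text = raw_bytes.decode("utf-8")

print(len(raw_bytes), len(wrong_text), len(right_text))  # 12 12 4
print(right_text)
```

On a real response the same fix is to set r.encoding = "utf-8" (or r.encoding = r.apparent_encoding, which guesses from the content) before reading r.text.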

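# 2 - Customizing requests

# 1) Passing URL parameters
# 2) Custom request headers
# 3) Sending a POST request
# 4) Timeouts

# 2-1) Passing URL parameters:
import requests

# Approach 1: write the query string into the URL yourself
url='http://httpbin.org/get?key1=value1'   # a raw-string prefix (r'...') is optional here
r=requests.get(url)
# Approach 2: pass a dict through params and let requests build the query string
parm_dict={'key1':'value1','key2':'value2'}
url='http://httpbin.org/get'              # URL ends with /get
r=requests.get(url,params=parm_dict)

print(r.status_code)  # 200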
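
To see what the second approach sends, the query string can be built by hand with the standard library. This sketch only approximates what the params argument does for simple string values; build_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base_url: str, params: dict) -> str:
    """Append params to base_url as a URL-encoded query string."""
    return f"{base_url}?{urlencode(params)}"


full_url = build_url("http://httpbin.org/get", {"key1": "value1", "key2": "value2"})
print(full_url)  # http://httpbin.org/get?key1=value1&key2=value2

# Round trip: parse the query string back into a dict of lists.
query = parse_qs(urlsplit(full_url).query)
print(query)  # {'key1': ['value1'], 'key2': ['value2']}
```

After a real request, printing r.url shows the final URL that requests actually used.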

# 2-2) Custom request headers
# Request headers carry information about the request, the response, or other entities being sent.
# 1) Open the site: www.santostang.com
# 2) Right-click -> Inspect Element -> Network, then click the request for www.santostang.com in the resource list on the left
# 3) Click "Headers" on the right and copy them.

# The copied content looks like this:
# Host: www.santostang.com
# User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
# Accept: text/css,*/*;q=0.1
# Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
# Accept-Encoding: gzip, deflate
# Connection: keep-alive
# Cookie: Hm_lvt_752e310cec7906ba7afeb24cd7114c48=1591794256,1591794423; PHPSESSID=1plcgphukjij28c42ns9octmq2; Hm_lpvt_752e310cec7906ba7afeb24cd7114c48=1591794423

# Keep only the important fields from the content above to get the following headers:
import requests
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Host':'www.santostang.com'
}

url='http://www.santostang.com'
r=requests.get(url,headers=headers)
print(r.status_code)  # 200
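
Picking fields out of the copied header text by hand gets tedious when there are many requests to imitate. As a possible shortcut, the copied text can be parsed into a dict. The helper parse_raw_headers and its keep parameter are hypothetical, and the sample text is a shortened copy of the headers listed above.

```python
RAW_HEADERS = """\
Host: www.santostang.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept-Encoding: gzip, deflate
Connection: keep-alive
"""


def parse_raw_headers(raw: str, keep=None) -> dict:
    """Turn 'Name: value' lines into a dict, optionally keeping only some names."""
    headers = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, since values may contain colons too.
        name, value = line.split(":", 1)
        name = name.strip()
        if keep is None or name in keep:
            headers[name] = value.strip()
    return headers


headers = parse_raw_headers(RAW_HEADERS, keep={"User-Agent", "Host"})
print(headers)
```

The resulting dict has the same shape as the hand-written headers dict and can be passed to requests.get(url, headers=headers) in the same way.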

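# 2-3) Sending a POST request
# A GET request puts its parameters in the URL, where they are exposed in browser history and server logs. A POST request sends them in the request body instead, encoded as form data.
# Store the data in a dict and pass it to the data parameter of requests.post:

import requests

headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Host':'www.santostang.com'  # copied from 2-2 and does not match httpbin.org; normally omit Host and let requests set it
}
parm_dict={'key1':'value1','key2':'value2'}
url='http://httpbin.org/post'  # URL ends with /post
r=requests.post(url,data=parm_dict,headers=headers)
print(r.status_code)           # 200
print(r.text)                  # httpbin echoes the request back, including the Host header sent above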

200
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "www.santostang.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
    "X-Amzn-Trace-Id": "Root=1-5ee5c1c1-fd164ea0042482a055a977c0"
  },
  "json": null,
  "origin": "116.153.38.222",
  "url": "http://www.santostang.com/post"
}
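
Since the body printed above is JSON, it can be decoded into Python objects instead of being read as raw text. The sketch below decodes an abbreviated copy of that body with the standard json module so it runs offline; on a real response, r.json() performs the same decoding step.

```python
import json

# Abbreviated copy of the response body printed above.
RESPONSE_TEXT = """
{
  "args": {},
  "data": "",
  "form": {"key1": "value1", "key2": "value2"},
  "json": null
}
"""

body = json.loads(RESPONSE_TEXT)  # JSON objects become dicts, null becomes None
form = body["form"]               # the form fields that were posted
print(form["key1"], form["key2"])  # value1 value2
```

The "form" entry holds the fields passed through data, which confirms that the dict was sent as form-encoded data.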

# 2-4) Timeouts
# Sometimes a server takes a long time to respond. Without a limit the crawler keeps waiting and the program stalls, so pass a time limit in seconds to the timeout parameter of get or post:
# A value of 20 is a common choice.
# import requests
# url='http://httpbin.org/get'
# r=requests.get(url,timeout=0.00001)  # set extremely small on purpose to trigger the error

# The error message is:
# ConnectTimeout: HTTPConnectionPool(host='httpbin.org', port=80): 
# Max retries exceeded with url: /get (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.
# HTTPConnection object at 0x000001DE11B38160>, 'Connection to httpbin.org timed out. (connect timeout=1e-05)'))

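# Handling it with try/except:
import requests
parm_dict={'key1':'value1','key2':'value2'}
url='http://httpbin.org/post'  # URL ends with /post
try:
    r=requests.post(url,data=parm_dict,timeout=0.00001)
    print(r.status_code)
    print(r.text)
except requests.exceptions.Timeout:  # catch only timeouts; a bare except would also hide unrelated errors
    print('Request timed out; try a larger timeout value')

Request timed out; try a larger timeout value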
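
Instead of only printing a message, a crawler may want to retry a request that timed out. The following is a sketch of one possible retry helper, not part of Requests: fetch_with_retries, its parameters, and flaky_fetch are hypothetical, and the built-in TimeoutError stands in for requests.exceptions.Timeout so the example runs offline.

```python
import time


def fetch_with_retries(fetch, retries=3, delay_seconds=1.0, retry_on=(TimeoutError,)):
    """Call fetch() until it succeeds or the allowed attempts run out."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except retry_on as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay_seconds)  # pause before the next attempt
    raise last_error  # every attempt failed, so re-raise the last error


# Stand-in for a real request: times out twice, then succeeds.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = fetch_with_retries(flaky_fetch, retries=3, delay_seconds=0.01)
print(result, attempts["count"])  # ok 3
```

With Requests, the call could look like fetch_with_retries(lambda: requests.get(url, timeout=20), retry_on=(requests.exceptions.Timeout,)).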

Original article: https://www.cnblogs.com/Collin-pxy/p/13089447.html