Beijing Python Jobs (Job Search): Python Jobs (Job Search), All on Zhaopin (智联招聘)

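When writing a crawler in python3, you will find that some sites run checks to block crawlers. In that case we need to add request headers so the crawler looks like a normal browser visit.
The header contents are visible in the browser's developer tools; copying them into the crawler code is enough.
'Accept-Encoding' is sent by the browser to the server to declare which content encodings (compression types) the browser supports, typically gzip, deflate, br, and so on.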
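As a rough illustration of attaching browser-like headers, the sketch below builds a request object with the standard library's urllib.request and does not send it. The URL and the header values are placeholders chosen for illustration, not values any particular site requires.

```python
from urllib.request import Request

# Placeholder header values, in the style of what a browser's
# developer tools would show for a normal page visit.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
}

# Build the request object; nothing is sent over the network here.
req = Request("https://example.com/", headers=browser_headers)

# urllib stores header names with only the first letter capitalized.
print(req.get_header("Accept-encoding"))
```

With the requests package used later in this post, the same dictionary is passed as requests.get(url, headers=browser_headers).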
response.text and response.content in the python3 requests package:

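response.content  # the response body as bytes; gzip and deflate are decoded for you automatically. Type: bytes
response.text  # the response body as a string, decoded automatically using the character encoding given in the response headers. Type: str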
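To make the bytes versus str distinction concrete, here is a small standard-library sketch that imitates the two steps by hand. It is only an approximation of what requests does internally, and the sample HTML string and the utf-8 charset are assumptions made for the example.

```python
import gzip

# Simulate what a server sends: UTF-8 HTML, gzip-compressed on the wire.
html = "<title>北京python招聘</title>"
wire_body = gzip.compress(html.encode("utf-8"))

# Roughly what response.content holds: the decompressed bytes.
content = gzip.decompress(wire_body)

# Roughly what response.text holds: those bytes decoded with the
# charset declared in the Content-Type header (utf-8 assumed here).
text = content.decode("utf-8")

print(type(content).__name__, type(text).__name__)
```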

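However, decoding br is not supported out of the box! (Newer urllib3 releases, 1.25 and later, do decode br transparently once the brotli package is installed; in that case response.content arrives already decompressed and must not be decompressed a second time.)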

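br stands for Brotli, a relatively new lossless compression format with a very high compression ratio (higher than gzip).
Introduction to Brotli: https://www.cnblogs.com/Leo_wl/p/9170390.html
Advantages of Brotli: https://www.cnblogs.com/upyun/p/7871959.html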

That is not the focus of this post. The focus is how to deal with it in a python3 crawler.

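Option 1: remove br from 'Accept-Encoding'.
The page you receive is then either uncompressed or compressed in a way that is decoded by default.
But I don't think that is ideal. Someone went to the trouble of building such an impressive algorithm, so we should make use of it.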
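A minimal sketch of the first option, assuming the header value was copied from a browser and still lists br. The helper name drop_encoding is hypothetical and not part of any library.

```python
def drop_encoding(accept_encoding, unwanted="br"):
    """Return the Accept-Encoding value without the unwanted coding."""
    kept = []
    for item in accept_encoding.split(","):
        item = item.strip()
        # The coding name is the part before any ";q=..." weight.
        name = item.split(";")[0].strip().lower()
        if item and name != unwanted:
            kept.append(item)
    return ", ".join(kept)


headers = {"Accept-Encoding": "gzip, deflate, br"}
headers["Accept-Encoding"] = drop_encoding(headers["Accept-Encoding"])
print(headers["Accept-Encoding"])
```

A server that honors the header will then answer with gzip, deflate, or an uncompressed body.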

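Option 2: decompress the br-compressed page.
In python3 this means importing the brotli package, which you have to install yourself (for example with pip install brotli; installation is not covered here, and a web search turns up plenty of guides).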

Below is the code that crawls the Zhaopin site:

```
import requests
import brotli
from requests.exceptions import RequestException


def get_one_page(city, keyword, page):
    '''
    Fetch the page HTML and return it
    '''
    paras = {
        'jl': city,       # search city
        'kw': keyword,    # search keyword
        'isadv': 0,       # whether to open the advanced search options
        'isfilter': 1,    # whether to filter the results
        'p': page,        # page number
        # 're': region    # short for region; 2005 means Haidian
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
        'Host': 'sou.zhaopin.com',
        'Referer': 'https://www.zhaopin.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?jl={}&kw={}&sm=0&p={}'.format(paras['jl'], paras['kw'], paras['p'])
    try:
        # Fetch the page
        response = requests.get(url, headers=headers)
        # Use the status code to check whether the request succeeded
        if response.status_code == 200:
            # response.encoding = 'utf-8'
            print(response.headers)
            print(response.encoding)
            key = 'Content-Encoding'
            # print(response.headers[key])
            print("-----------")
            # Assumes response.content is still br-compressed at this point
            if key in response.headers and response.headers[key] == 'br':
                data = brotli.decompress(response.content)
                data1 = data.decode('utf-8')
                print(data1)
                return data1
            print(response.text)
            return response.text
        return None
    except RequestException:
        return None


def main(city, keyword, pages):
    for i in range(pages):
        html = get_one_page(city, keyword, i)


if __name__ == '__main__':
    main('北京', 'python', 1)
```

Partial output:

{'Server': 'openresty', 'Date': 'Sun, 19 Aug 2018 13:15:46 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '361146', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding, Accept-Encoding', 'zp-trace-id': '8437455ebb5342a59f8af78ddaab1985', 'Set-Cookie': 'ZP-ENV-FLAG=gray'}
utf-8
北京python招聘（求职）python招聘（求职）尽在智联招聘
That is the result without br in the request headers.
Next, change Accept-Encoding to include br:

... (same as above)
'Accept-Encoding': 'br, gzip, deflate',
... (same as above)
Partial output:

{'Server': 'openresty', 'Date': 'Sun, 19 Aug 2018 13:19:02 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'zp-trace-id': '842e66a58bb2464296121c9de59a9965', 'Content-Encoding': 'br', 'Set-Cookie': 'ZP-ENV-FLAG=gray'}
utf-8
北京python招聘（求职）python招聘（求职）尽在智联招聘
When a site uses br compression, it tells us so through the value of 'Content-Encoding'.
The key part is:
```
       key = 'Content-Encoding'
       if(key in response.headers and response.headers['Content-Encoding'] == 'br'):
           data = brotli.decompress(response.content)
           data1 = data.decode('utf-8')
           print(data1)
```
And that solves the problem.
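The same check can be generalized into one helper that picks a decompressor from the Content-Encoding value. This is only a sketch under a few assumptions: decode_body is a hypothetical name, it works on raw bytes that are still compressed (requests already handles gzip and deflate itself), headers is a plain dict rather than the case-insensitive mapping requests provides, and brotli is treated as optional so the gzip and deflate branches run without it.

```python
import gzip
import zlib

try:
    import brotli  # third-party; assumed installed via "pip install brotli"
except ImportError:
    brotli = None


def decode_body(headers, raw_body, charset="utf-8"):
    """Decompress raw_body according to Content-Encoding, then decode to str."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding == "br":
        if brotli is None:
            raise RuntimeError("br response received but brotli is not installed")
        raw_body = brotli.decompress(raw_body)
    elif encoding == "gzip":
        raw_body = gzip.decompress(raw_body)
    elif encoding == "deflate":
        raw_body = zlib.decompress(raw_body)
    # Any other value (or no header) is treated as an uncompressed body.
    return raw_body.decode(charset)


sample = "python招聘".encode("utf-8")
print(decode_body({"Content-Encoding": "gzip"}, gzip.compress(sample)))
print(decode_body({}, sample))
```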

It has to be said that there are not many Chinese-language introductions to brotli online.

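Author: 思维不混乱
Link: https://www.jianshu.com/p/70c3994efcd8
Source: Jianshu (简书)
Copyright on Jianshu belongs to the author. For any form of reproduction, please contact the author for permission and credit the source.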
Original article: https://www.cnblogs.com/c-x-a/p/10049684.html
