zoukankan html css js c++ java

初级爬虫--爬取拉勾网职位信息

主要用到的库：requests

1.原始url地址，https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=。我们查看网页源代码，发现里面并没有我们想要的职位信息，这是因为拉勾网有反爬虫机制，它的职位信息是通过ajax动态加载的。

2.我们按下F12，找到network--在左侧Name中找到：positionAjax.json?needAddtionalResult=false--，在右侧找到response。

我们将显示的json格式的内容放在http://www.bejson.com/jsonviewernew/进行格式化：

发现这正是我们想要的职位信息。

3.简单爬虫的构建

import requests
#实际要爬取的url
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'

payload = {
    'first': 'true',
    'pn': '1',
    'kd': 'python',
}

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'Accept': 'application/json, text/javascript, */*; q=0.01'
}
#原始的url
urls ='https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
#建立session
s = requests.Session()
# 获取搜索页的cookies
s.get(urls, headers=header, timeout=3)
# 为此次获取的cookies
cookie = s.cookies
# 获取此次文本
response = s.post(url, data=payload, headers=header, cookies=cookie, timeout=5).text
print(response)

部分输出如下：

查看全文

相关阅读:
CentOS设置密码复杂度及过期时间等
 CentOS开启telnet连接
 Centos6.5升级openssh至7.4版本
 Centos7搭建SVN服务器
 Centos7.2yum安装时候出现db5错误的解决办法
 在Centos中yum安装和卸载软件的使用方法(转)
Centos6.5 恢复误删的系统面板
 ecmall 2.3.0 最新补丁20140618
ecmall在linux下的安装注意事项(转)
PHP Strict standards:Declaration of … should be compatible with that of…(转)

原文地址：https://www.cnblogs.com/xiximayou/p/11703742.html