  • Lagou crawler -- to be fixed

    I came across an article on the CSDN WeChat public account about scraping and analysing Lagou job postings and thought it was very good, so I copied it, only to run into a pile of errors the article never mentioned. One misstep really does bring lasting regret!

    First, the code:

    import requests
    from fake_useragent import UserAgent

    # AJAX endpoint that returns Guangzhou (广州) job postings as JSON
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
    # request headers copied from the browser; the Cookie and Referer keep Lagou
    # from rejecting the request as a crawler
    headers = {
        'Accept': 'application/json,text/javascript,*/*;q=0.01',
        'Connection': 'keep-alive',
        'Cookie': 'user_trace_token=20190219170421-589c51fd-3425-11e9-94ca-525400f775ce; LGUID=20190219170421-589c556f-3425-11e9-94ca-525400f775ce; JSESSIONID=ABAAABAAAIAACBI83910AF8CFDCD43C502B73B369BC11AE; PRE_UTM=; PRE_HOST=www.so.com; PRE_SITE=http%3A%2F%2Fwww.so.com%2Flink%3Fm%3Dah6rTRiEAqghnfjOchMrldC9g09Z6O4EM8yoD1U73IL58lzzlfsBR1G3ekEi1hDYMb8HzC5keoRl8AIGdSPOI6dMEYY3t8OajAwFORdOWma8%253D; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; sajssdk_2015_cross_new_user=1; ab_test_random_num=0; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216905871fb4c9-0ba623cd0a19b4-5d4e211f-921600-16905871fba156%22%2C%22%24device_id%22%3A%2216905871fb4c9-0ba623cd0a19b4-5d4e211f-921600-16905871fba156%22%7D; _putrc=C4C6A5FE2C61AA92123F89F2B170EADC; login=true; hasDeliver=0; gate_login_token=dae438a7f0f2180190e414e0d39025bcfe698c75269decc65dbe3ad7b5d645a8; unick=%E5%8F%B6%E7%90%86%E4%BD%A9; _gid=GA1.2.689310331.1550567063; _gat=1; _ga=GA1.2.149442038.1550567063; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1550567064,1550575932; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1550576293; LGSID=20190219193211-ff3ee358-3439-11e9-826e-5254005c3644; LGRID=20190219193811-d650b01a-343a-11e9-826e-5254005c3644; TG-TRACK-CODE=index_search; SEARCH_ID=11dee42b2b7a4d68bd2b4434a28a282c; index_location_city=%E5%B9%BF%E5%B7%9E',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=sug&fromSearch=true&suginput=p',
        'User-Agent': str(UserAgent().random),  # random User-Agent from fake_useragent
        'X-Requested-With': 'XMLHttpRequest'
    }
    # Lagou is served over HTTPS, so the proxy must be an HTTPS proxy
    proxies = {'https': '49.86.183.149:9999'}
    # rsp = requests.post(url=url, proxies=proxies)
    rsp = requests.request("post", url=url, proxies=proxies, headers=headers, timeout=10)
    print(rsp.content.decode())
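
    Note that the request above sends no POST body. When the original article was written, this endpoint normally expected the search keyword and page number as form fields; the field names first, pn and kd below are assumptions based on common Lagou crawler write-ups, not taken from this post:

    data = {
        'first': 'true',  # assumed: whether this is the first page of results
        'pn': 1,          # assumed: results page number
        'kd': 'python'    # assumed: search keyword, here the python position
    }
    rsp = requests.post(url=url, data=data, proxies=proxies, headers=headers, timeout=10)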

    Scraping Lagou job postings takes a fair amount of page analysis, but there is already plenty of that online, so I will not repeat it. In short: after logging in, search for the python position, find the URL that serves the job-posting data, and then drill down to

    ['content']['positionResult']['result']
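
    As a minimal sketch of that last step, assuming the request above succeeded and the response body is the usual JSON, the postings can be pulled out like this (the keys positionName and salary printed at the end are assumptions for illustration, not taken from the article):

    import json

    data = json.loads(rsp.content.decode())
    # walk down to the list of postings along the path shown above
    results = data['content']['positionResult']['result']
    for job in results:
        # the exact keys vary; positionName and salary are assumed here
        print(job.get('positionName'), job.get('salary'))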
    Now for the errors I ran into:
    1. Error 10060 or 10061: at first I did not use a proxy IP, so Lagou identified me as a crawler and banned my IP; every request after that was actively refused.
    2. When using a proxy IP, note that Lagou is served over HTTPS, so the key in the proxies dict must be 'https', and the proxy itself must be an HTTPS proxy; check this carefully on the proxy listing site, along with the port number. Also pick an IP marked green (verified as usable), or the request will still fail.
    3. While fixing these errors I also consolidated what I know and learned how to send a POST request with requests. There are two ways: requests.request("post", url=url, proxies=proxies) or requests.post(url, proxies=proxies); see the sketch after this list.
    4. I also got to know a third-party library: fake_useragent generates User-Agent strings, and it can either pin a specific browser type or produce a random one.
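
    A minimal sketch tying points 1 to 4 together (it reuses the proxy address from the code above, and the exception handling is my assumption about how the refused connections surface, not something from the original article):

    import requests
    from fake_useragent import UserAgent

    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
    ua = UserAgent()
    headers = {'User-Agent': ua.random}      # random User-Agent string
    # headers = {'User-Agent': ua.chrome}    # fake_useragent can also pin a browser type
    proxies = {'https': '49.86.183.149:9999'}  # key must be 'https' and the proxy must support HTTPS

    try:
        # the two equivalent ways of sending a POST request mentioned in point 3
        rsp = requests.request("post", url=url, headers=headers, proxies=proxies, timeout=10)
        # rsp = requests.post(url, headers=headers, proxies=proxies, timeout=10)
        print(rsp.status_code)
    except (requests.exceptions.ProxyError, requests.exceptions.ConnectionError) as e:
        # a banned IP or a dead proxy usually shows up here as an actively refused connection
        print('request failed:', e)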
  • Original post: https://www.cnblogs.com/fodalaoyao/p/10409584.html