  • Scraping Lagou with Python: Data Analysts in Shenzhen as an Example

    Lagou has long been a popular scraping target because so much of its data is well structured, and after several redesigns it has become harder to crawl. Once you understand how it works, though, it is still fairly easy. The key mechanism is that the data is loaded asynchronously as JSON via AJAX: at least on the search page, the URL does not change when you turn pages, and the data never appears in the page source.

    Parsing the data

    This is the data-analyst listing page for Shenzhen, opened with Chrome DevTools. Under XHR you can see a request whose name starts with positionAjax.json; open its Preview tab and you will see:

    These records match what is rendered on the front end, so we have found the data entry point and can start crawling.

    Crawling the data

    The Headers tab shows how the request is made:

    Request Header:
    Request URL:https://www.lagou.com/jobs/positionAjax.json?city=深圳&needAddtionalResult=false
    Request Method:POST
    Status Code:200 OK
    Remote Address:106.75.72.62:443

    The request headers show that the query is submitted as a POST form (so if you open the Request URL directly, the data will be wrong, because no form data was submitted).
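As a quick illustration, preparing (without sending) such a request in `requests` shows that the search terms travel in the POST body rather than in the URL, which is why the bare Request URL returns nothing useful. The form field names are taken from the captured request above; nothing here touches the network:

```python
import requests

# Sketch only: build and inspect the POST request without sending it.
url = ("https://www.lagou.com/jobs/positionAjax.json"
       "?city=深圳&needAddtionalResult=false")
form = {"first": "false", "pn": 1, "kd": "数据分析师"}

req = requests.Request("POST", url, data=form).prepare()
print(req.method)         # POST
print("kd=" in req.body)  # True -- the keyword rides in the body, not the URL
```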

    So in Python we can construct the request headers and submit the form data ourselves:

    import requests
    import time
    from sqlalchemy import create_engine
    import pandas as pd
    from random import choice
    import json
    
    engine=create_engine("...") # fill in your own database connection string here
    dl = pd.read_sql("proxys",engine)
    
    def get_proxy(dl):
        # pick a random proxy from the "proxys" table
        n = choice(range(len(dl.index)))
        proxy = {"http":"http://%s:%s" %(dl["ip"][n],dl["port"][n]),
                 "https": "http://%s:%s" % (dl["ip"][n], dl["port"][n])}
        return(proxy)
    
    def get_header():
        headers = {
            "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "Referer": "https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B8%88?px=default&city=%E6%B7%B1%E5%9C%B3&district=%E5%8D%97%E5%B1%B1%E5%8C%BA",
            "X-Requested-With": "XMLHttpRequest",
            "Host": "www.lagou.com",
            "Connection":"keep-alive",
            "Cookie":"user_trace_token=20160214102121-0be42521e365477ba08bd330fd2c9c72; LGUID=20160214102122-a3b749ae-d2c1-11e5-8a48-525400f775ce; tencentSig=9579373568; pgv_pvi=3712577536; index_location_city=%E5%85%A8%E5%9B%BD; SEARCH_ID=c684c55390a84fe5bd7b62bf1754b900; JSESSIONID=8C779B1311176D4D6B74AF3CE40CE5F2; TG-TRACK-CODE=index_hotjob; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1485318435,1485338972,1485393674,1485423558; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1485423598; _ga=GA1.2.1996921784.1455416480; LGRID=20170126174002-691cb0a5-e3ab-11e6-bdc0-525400f775ce",
            "Origin": "https://www.lagou.com",
            "Upgrade-Insecure-Requests":"1",
            "X-Anit-Forge-Code": "0",
            "X-Anit-Forge-Token": "None",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.8"
            }
        return(headers)
    
    def get_form(i):
        # pn = page number, kd = search keyword ("data analyst")
        data={"first":"false","pn":i,"kd":"数据分析师"}
        return(data)
    
    districts = ["南山区","福田区","宝安区","龙岗区","龙华新区","罗湖区","盐田区","大鹏新区"]
    pagenos = [22,10,1,4,1,2,1,1]
    url_lists = ["https://www.lagou.com/jobs/positionAjax.json?px=default&city=深圳&district=%s&needAddtionalResult=false"%area for area in districts]
    
    s = requests.Session()
    s.keep_alive = False
    s.adapters.DEFAULT_RETRIES = 10
    
    def get_jobinfo(i,j): # i = district index, j = page number
        if i >= 8 or j > pagenos[i]:
            return("Index out of range!")
        resp=s.post(url_lists[i], data=get_form(j), headers=get_header())
        resp.encoding="utf-8"
        # parse the JSON once and reuse it, instead of re-parsing per posting
        results = json.loads(resp.text)["content"]["positionResult"]["result"]
        for k in range(len(results)):
            try:
                json_data = results[k]
                df = pd.DataFrame(dict(
                    approve=json_data["approve"],
            #        businessZones=json_data["businessZones"],
                    companyId=json_data["companyId"],
            #        companyLabelList=json_data["companyLabelList"],
                    companyShortName=json_data["companyShortName"],
                    companySize=json_data["companySize"],
                    createTime=json_data["createTime"],
                    education=json_data["education"],
                    financeStage=json_data["financeStage"],
                    firstType=json_data["firstType"],
                    industryField=json_data["industryField"],
                    jobNature=json_data["jobNature"],
                    positionAdvantage=json_data["positionAdvantage"],
                    positionId=json_data["positionId"],
                    positionName=json_data["positionName"],
                    salary=json_data["salary"],
                    secondType=json_data["secondType"],
                    workYear=json_data["workYear"],
                    scrapy_time=time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))),index=[0])
                df.to_sql(con = engine, name = "job_info", if_exists = 'append', index=False)
            except Exception:
                print("District %d, page %d, item %d failed!"%(i,j,k))
    

    Given a district index and a page number, the function above fetches that page and writes its job postings to the database.

    One upside of AJAX endpoints that return JSON is that the data arrives already well structured, so little time and effort need to be spent on cleaning.
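For instance, with a small hypothetical payload shaped like the positionAjax.json response, the list of result dicts drops straight into a DataFrame with no cleaning step:

```python
import json
import pandas as pd

# Hypothetical payload mimicking the shape of the positionAjax.json response
raw = json.dumps({"content": {"positionResult": {"result": [
    {"positionName": "数据分析师", "salary": "15k-25k", "education": "本科"},
    {"positionName": "数据分析经理", "salary": "25k-40k", "education": "硕士"},
]}}})

records = json.loads(raw)["content"]["positionResult"]["result"]
df = pd.DataFrame(records)  # one row per posting, columns already named
print(df.shape)  # (2, 3)
```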

    Do remember to add delays, though: Lagou's anti-scraping measures are fairly strict, and without throttling your IP will be banned after only a short while of crawling.
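A minimal throttling helper could look like the sketch below; the interval bounds are my own guesses and should be tuned, and the commented usage assumes the `districts`, `pagenos`, and `get_jobinfo` names defined earlier:

```python
import time
from random import uniform

def polite_sleep(low=3.0, high=8.0):
    """Sleep for a random interval between requests; bounds are illustrative."""
    time.sleep(uniform(low, high))

# Usage in the crawl loop:
# for i in range(len(districts)):
#     for j in range(1, pagenos[i] + 1):
#         get_jobinfo(i, j)
#         polite_sleep()
```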

  • Original post: https://www.cnblogs.com/lafengdatascientist/p/6516649.html