  • Scraping 100,000 Job Postings from 51job with Python

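    Preface: This article describes how to collect 100,000 job postings from 51job using a proxy IP pool and multithreading. The request rate is deliberately throttled, and the data is collected for learning purposes only. Collecting 100,000 postings takes roughly ten hours.

    It started when I came across a 51job crawler written by another programmer on Zhihu. I had some doubts about a few of its counter-anti-scraping measures, so I reworked it: I improved how threads are allocated and how often pages are requested, and added a proxy IP pool, which made the crawler more efficient.

    Original article with the code: https://zhuanlan.zhihu.com/p/146425439

    First, here is the basic crawler code this article builds on: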

    import requests, time
    import pandas as pd
    from lxml import etree

    def getdata(bot,top):
        for i in range(bot,top):
            print("Crawling page " + str(i))
            url0 = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,"
            url_end = ".html?"
            url = url0 + str(i) + url_end
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
            }
            html = requests.get(url, headers=headers)
            html.encoding = "gbk"
            # Use a separate name so the lxml etree module is not shadowed
            Html = etree.HTML(html.text)
            # ① Job title
            JobName = Html.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@title')
            # ② Company name
            CompanyName = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t2"]/a[@target="_blank"]/@title')
            # ③ Work location
            Address = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t3"]/text()')
            # ④ Salary
            sal = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t4"]')
            salary = [s.text for s in sal]
            # ⑤ Posting date
            ShowTime = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t5"]/text()')
            # ⑥ URL of the job detail page
            DetailUrl = Html.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@href')
            OthersInfo = []
            JobDescribe = []
            CompanyType = []
            CompanySize = []
            Industry = []
            for j in range(len(DetailUrl)):
                HtmlInfo = requests.get(DetailUrl[j], headers=headers)
                HtmlInfo.encoding = "gbk"
                HtmlInfo = etree.HTML(HtmlInfo.text)
                # ⑦ Experience, education and other requirements
                otherinfo = HtmlInfo.xpath('//div[@class="tHeader tHjob"]//div[@class="cn"]/p[@class="msg ltype"]/text()')
                # ⑧ Job description
                JobDescibe = HtmlInfo.xpath('//div[@class="tBorderTop_box"]//div[@class="bmsg job_msg inbox"]/p/text()')
                # ⑨ Company type (kept in ComType so the CompanyType list is not overwritten)
                ComType = HtmlInfo.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[1]/@title')
                # ⑩ Company size (headcount)
                ComSize = HtmlInfo.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[2]/@title')
                # ⑪ Industry of the company
                industry = HtmlInfo.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[3]/@title')
                # Store the values in the per-page lists
                OthersInfo.append(otherinfo)
                JobDescribe.append(JobDescibe)
                CompanyType.append(ComType)
                CompanySize.append(ComSize)
                Industry.append(industry)
                # Pause between detail requests
                time.sleep(0.5)
            # Write each page to disk as soon as it is crawled
            data = pd.DataFrame()
            data["Job title"] = JobName
            data["Location"] = Address
            data["Company"] = CompanyName
            data["Salary"] = salary
            data["Posted"] = ShowTime
            data["Experience/education"] = OthersInfo
            data["Industry"] = Industry
            data["Company type"] = CompanyType
            data["Company size"] = CompanySize
            data["Job description"] = JobDescribe
            # Some postings redirect to the company's own site and return nothing, so skip them
            try:
                data.to_csv("job_info.csv", mode="a+", header=None, index=None, encoding="gbk")
            except Exception:
                print("Redirected to the company site, no data")
            time.sleep(1)
        print("Crawl finished!")

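    Testing showed several problems with this code: 1. the crawler is slow; 2. errors occur fairly often during a crawl; 3. the delays between requests are fixed, which makes the crawler easy for the site to detect.

    Start with the first problem. The original author's solution is multithreading, as follows: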

    import requests,time,warnings,threading
    import pandas as pd
    from lxml import etree
    warnings.filterwarnings("ignore")
    
    def getdata(bot,top):
        for i in range(bot,top):
            print("Crawling page " + str(i))
            url0 = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,"
            url_end = ".html?"
            url = url0 + str(i) + url_end
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
            }
            html = requests.get(url, headers=headers)
            html.encoding = "gbk"
            Html = etree.HTML(html.text)
            # ① Job title
            JobName = Html.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@title')
            # ② Company name
            CompanyName = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t2"]/a[@target="_blank"]/@title')
            # ③ Work location
            Address = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t3"]/text()')
            # ④ Salary
            sal = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t4"]')
            salary = [s.text for s in sal]
            # ⑤ Posting date
            ShowTime = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t5"]/text()')
            # ⑥ URL of the job detail page
            DetailUrl = Html.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@href')
            OthersInfo = []
            JobDescribe = []
            CompanyType = []
            CompanySize = []
            Industry = []
            for j in range(len(DetailUrl)):
                HtmlInfo = requests.get(DetailUrl[j], headers=headers)
                HtmlInfo.encoding = "gbk"
                HtmlInfo = etree.HTML(HtmlInfo.text)
                # ⑦ Experience, education and other requirements
                otherinfo = HtmlInfo.xpath('//div[@class="tHeader tHjob"]//div[@class="cn"]/p[@class="msg ltype"]/text()')
                # ⑧ Job description
                JobDescibe = HtmlInfo.xpath('//div[@class="tBorderTop_box"]//div[@class="bmsg job_msg inbox"]/p/text()')
                # ⑨ Company type
                ComType = HtmlInfo.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[1]/@title')
                # ⑩ Company size (headcount)
                ComSize = HtmlInfo.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[2]/@title')
                # ⑪ Industry of the company
                industry = HtmlInfo.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[3]/@title')
                # Store the values in the per-page lists
                OthersInfo.append(otherinfo)
                JobDescribe.append(JobDescibe)
                CompanyType.append(ComType)
                CompanySize.append(ComSize)
                Industry.append(industry)
                # Pause between detail requests
                time.sleep(0.5)
            # Write each page to disk as soon as it is crawled
            data = pd.DataFrame()
            data["Job title"] = JobName
            data["Location"] = Address
            data["Company"] = CompanyName
            data["Salary"] = salary
            data["Posted"] = ShowTime
            data["Experience/education"] = OthersInfo
            data["Industry"] = Industry
            data["Company type"] = CompanyType
            data["Company size"] = CompanySize
            data["Job description"] = JobDescribe
            # Some postings redirect to the company's own site and return nothing, so skip them
            try:
                data.to_csv("job_info.csv", mode="a+", header=None, index=None, encoding="gbk")
            except Exception:
                print("Redirected to the company site, no data")
            time.sleep(1)
        print("Crawl finished!")
    
    threads = []
    t1 = threading.Thread(target=getdata,args=(1,125))
    threads.append(t1)
    t2 = threading.Thread(target=getdata,args=(125,250))
    threads.append(t2)
    t3 = threading.Thread(target=getdata,args=(250,375))
    threads.append(t3)
    t4 = threading.Thread(target=getdata,args=(375,500))
    threads.append(t4)
    t5 = threading.Thread(target=getdata,args=(500,625))
    threads.append(t5)
    t6 = threading.Thread(target=getdata,args=(625,750))
    threads.append(t6)
    t7 = threading.Thread(target=getdata,args=(750,875))
    threads.append(t7)
    t8 = threading.Thread(target=getdata,args=(875,1000))
    threads.append(t8)
    t9 = threading.Thread(target=getdata,args=(1000,1125))
    threads.append(t9)
    t10 = threading.Thread(target=getdata,args=(1125,1250))
    threads.append(t10)
    t11 = threading.Thread(target=getdata,args=(1250,1375))
    threads.append(t11)
    t12 = threading.Thread(target=getdata,args=(1375,1500))
    threads.append(t12)
    
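    if __name__ == "__main__":
        for t in threads:
            t.daemon = True  # setDaemon() is deprecated since Python 3.10
            t.start()
        # Wait for the workers: daemon threads are killed as soon as the main thread exits
        for t in threads:
            t.join()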

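    This does speed up the crawler, but there is a catch: quality drops. More precisely, errors become more likely and the site's anti-scraping measures catch the crawler more often.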
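    Separately from being detected by the site, every thread in the listing above appends to the same job_info.csv, and uncoordinated appends from several threads can interleave. The following is a minimal sketch of serializing the writes with a lock. The helper name append_rows and the module-level csv_lock are hypothetical and not part of the original crawler, and the standard csv module stands in for pandas to keep the example self-contained.

```python
import csv
import threading

# Hypothetical helper, not part of the original crawler:
# one lock shared by every worker thread.
csv_lock = threading.Lock()

def append_rows(path, rows, encoding="utf-8"):
    """Append rows to a CSV file while holding the shared lock."""
    with csv_lock:
        with open(path, "a", newline="", encoding=encoding) as f:
            csv.writer(f).writerows(rows)
```

    With pandas, the same idea would be to wrap the existing data.to_csv(...) call in `with csv_lock:`.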

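    First, on the code side, I reworked how the threads are created: the thread count is now a parameter chosen by the user, and the total number of pages is divided evenly among the threads, as shown below: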

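    # Distribute the pages across threads
    def start_spider(num):
        count = 2000                      # total number of pages to crawl
        size, extra = divmod(count, num)  # pages per thread, plus leftover pages
        print(size)
        threads = []
        start = 1
        for k in range(num):
            # The first `extra` threads take one leftover page each
            end = start + size + (1 if k < extra else 0)
            # getdata crawls range(start, end), so `end` is exclusive
            t = threading.Thread(target=getdata,args=(start,end))
            t.start()
            threads.append(t)
            # The next chunk starts exactly where this one stopped
            start = end
        # Wait for every worker to finish
        for t in threads:
            t.join()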
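    Because getdata iterates over range(bot, top), the upper bound of each chunk is exclusive, so consecutive chunks should share a boundary: the next start equals the previous end. The page-splitting arithmetic can be checked on its own, without starting any threads. A small sketch, where split_pages is a hypothetical helper name and not something from the original code:

```python
def split_pages(count, num):
    """Split pages 1..count into num contiguous half-open ranges (start, end)."""
    if num < 1:
        raise ValueError("num must be at least 1")
    size, extra = divmod(count, num)
    ranges = []
    start = 1
    for k in range(num):
        # The first `extra` chunks absorb one leftover page each
        end = start + size + (1 if k < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

    Each pair can be passed straight to getdata as (bot, top). When count is smaller than num, some pairs are empty ranges and the corresponding threads simply do nothing.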

    With the threading code tidied up, let's change the delay between page requests to a random value:

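                Industry.append(industry)
                # Pause for a random interval (requires `import random` at the top of the file)
                time.sleep(random.uniform(0.1,1))
            # Write each page to disk as soon as it is crawled
            data = pd.DataFrame()
            data["Job title"] = JobName
            data["Location"] = Address
            data["Company"] = CompanyName
            data["Salary"] = salary
            data["Posted"] = ShowTime
            data["Experience/education"] = OthersInfo
            data["Industry"] = Industry
            data["Company type"] = CompanyType
            data["Company size"] = CompanySize
            data["Job description"] = JobDescribe
            # Some postings redirect to the company's own site and return nothing, so skip them
            try:
                data.to_csv("job_info.csv", mode="a+", header=None, index=None, encoding="gbk")
            except Exception:
                print("Redirected to the company site, no data")
            time.sleep(random.uniform(0.2,0.5))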
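    The two randomized pauses above could be factored into one small helper. This is only a sketch: the name polite_sleep and the injectable sleep argument are assumptions made for illustration and testing, not part of the original code.

```python
import random
import time

def polite_sleep(low, high, sleep=time.sleep):
    """Pause for a random duration between low and high seconds and return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

    In the crawler this would be called as polite_sleep(0.1, 1) after each detail page and polite_sleep(0.2, 0.5) after each list page.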

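    Finally, a proxy IP pool is used to improve the quality of the crawl.

    Here is a proxy IP pool project that works very well: https://github.com/jhao104/proxy_pool

    A copy of it is also included in the Gitee open-source project I share later: https://gitee.com/chengrongkai/OpenSpiders

    To set up the proxy pool, just follow that project's README.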
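    The crawler code that follows calls get_proxy() and delete_proxy(), which are not defined anywhere in this post. The sketch below shows what such helpers might look like. It assumes the proxy_pool API server is running locally at http://127.0.0.1:5010 and exposes /get/ and /delete/ endpoints that return and accept a proxy as "ip:port"; check the project's README for the actual address and response format. It uses only the standard library, and the fetch parameter is a hypothetical hook added so the helpers can be exercised without a running server.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the proxy_pool API server is running locally on this address.
POOL_API = "http://127.0.0.1:5010"

def _fetch(url):
    """Return the body of a GET request as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")

def get_proxy(fetch=_fetch):
    """Ask the pool for one proxy; assumed to return JSON like {"proxy": "ip:port"}."""
    return json.loads(fetch(POOL_API + "/get/"))

def delete_proxy(proxy, fetch=_fetch):
    """Tell the pool to discard a proxy that stopped working."""
    fetch(POOL_API + "/delete/?proxy=" + quote(proxy))
```

    With these in place, get_proxy().get("proxy") yields the "ip:port" string that the request code passes to requests.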

    Below is my modified crawler code built on this project:

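    # Fetch a page through a proxy IP
    def getHtml(url):
        # ....
        retry_count = 5
        # get_proxy() and delete_proxy() are assumed to be the helpers shown in the proxy_pool README
        proxy = get_proxy().get("proxy")
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        }
        while retry_count > 0:
            try:
                print("Proxy: {}".format(proxy))
                # The 51job URLs are https, so the proxy must be registered for https as well;
                # with only an "http" entry, requests would bypass the proxy for these pages.
                # The timeout is an added safeguard against proxies that never respond.
                html = requests.get(url, headers=headers, timeout=10,
                                    proxies={"http": "http://{}".format(proxy),
                                             "https": "http://{}".format(proxy)})
                return html
            except Exception:
                retry_count -= 1
        # After 5 failures, remove the proxy from the pool
        delete_proxy(proxy)
        return None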
    
    # Requires `import random` in addition to the imports shown earlier
    def getdata(bot,top):
        for i in range(bot,top):
            print("Crawling page " + str(i))
            url0 = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,"
            url_end = ".html?"
            url = url0 + str(i) + url_end
            html = getHtml(url)
            if html is None:
                continue
            html.encoding = "gbk"
            Html = etree.HTML(html.text)
            # ① Job title
            JobName = Html.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@title')
            # ② Company name
            CompanyName = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t2"]/a[@target="_blank"]/@title')
            # ③ Work location
            Address = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t3"]/text()')
            # ④ Salary
            sal = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t4"]')
            salary = [s.text for s in sal]
            # ⑤ Posting date
            ShowTime = Html.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t5"]/text()')
            # ⑥ URL of the job detail page
            DetailUrl = Html.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@href')
            OthersInfo = []
            JobDescribe = []
            CompanyType = []
            CompanySize = []
            Industry = []
            for j in range(len(DetailUrl)):
                resp = getHtml(DetailUrl[j])
                HtmlInfo = None
                # Check for a failed request before touching the response
                if resp is not None:
                    resp.encoding = "gbk"
                    HtmlInfo = etree.HTML(resp.text)
                if HtmlInfo is None:
                    # Request failed or the page could not be parsed: store empty
                    # values so every column still has one entry per job
                    OthersInfo.append([])
                    JobDescribe.append([])
                    CompanyType.append([])
                    CompanySize.append([])
                    Industry.append([])
                    continue
                # ⑦ Experience, education and other requirements
                otherinfo = HtmlInfo.xpath('//div[@class="tHeader tHjob"]//div[@class="cn"]/p[@class="msg ltype"]/text()')
                # ⑧ Job description
                JobDescibe = HtmlInfo.xpath('//div[@class="tBorderTop_box"]//div[@class="bmsg job_msg inbox"]/p/text()')
                # ⑨ Company type
                ComType = HtmlInfo.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[1]/@title')
                # ⑩ Company size (headcount)
                ComSize = HtmlInfo.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[2]/@title')
                # ⑪ Industry of the company
                industry = HtmlInfo.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[3]/@title')
                # Store the values in the per-page lists
                OthersInfo.append(otherinfo)
                JobDescribe.append(JobDescibe)
                CompanyType.append(ComType)
                CompanySize.append(ComSize)
                Industry.append(industry)
                # Pause for a random interval
                time.sleep(random.uniform(0.1,1))
            # Write each page to disk as soon as it is crawled
            data = pd.DataFrame()
            data["Job title"] = JobName
            data["Location"] = Address
            data["Company"] = CompanyName
            data["Salary"] = salary
            data["Posted"] = ShowTime
            data["Experience/education"] = OthersInfo
            data["Industry"] = Industry
            data["Company type"] = CompanyType
            data["Company size"] = CompanySize
            data["Job description"] = JobDescribe
            # Some postings redirect to the company's own site and return nothing, so skip them
            try:
                data.to_csv("job_info.csv", mode="a+", header=None, index=None, encoding="gbk")
            except Exception:
                print("Redirected to the company site, no data")
            time.sleep(random.uniform(0.2,0.5))
        print("Crawl finished!")

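    On my own machine, 8 threads ran for an hour and a half and collected about 15,000 records. I slowed the crawl down on purpose, so adjust it to your own situation. For example, you can drop the proxy retries and simply delete an IP from the pool as soon as a request through it fails. You can also raise the thread count to match your machine. If quality is not a concern, a single machine should reach roughly 15,000 records per hour. If you have a better approach, please leave a comment. The next article will cover how to clean and analyze the collected data.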
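    The no-retry variant suggested above might look like the following sketch. The name fetch_once and its injected callables are hypothetical: get_proxy and delete_proxy stand for the pool helpers, and http_get stands for whatever function performs the request through a given proxy.

```python
def fetch_once(url, get_proxy, delete_proxy, http_get):
    """Try a URL through one proxy; on any failure, drop that proxy and return None.

    The three callables are passed in so the sketch does not depend on a
    particular HTTP library or on a running pool server.
    """
    proxy = get_proxy().get("proxy")
    try:
        return http_get(url, proxy)
    except Exception:
        # No retries: a single failure removes the proxy from the pool
        delete_proxy(proxy)
        return None
```

    With requests, http_get could be a small function that calls requests.get with the proxy filled into the proxies argument, as getHtml does.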

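    The collected data looks like this: [screenshot omitted]

    All code from this article is open source at https://gitee.com/chengrongkai/OpenSpiders

    Stars are welcome; your encouragement is my biggest motivation.

    This article was first published at https://www.bizhibihui.com/blog/article/45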

  • Original article: https://www.cnblogs.com/chengrongkai/p/13183629.html