  • Whole-site crawling of a job site with CrawlSpider (hands-on with Lagou)

    A crawler first needs a clear idea of the target site and the content to extract.

    Open the Lagou site and decide what to scrape:
    job title, salary, city, required experience,
    required degree, full-time or part-time,
    job perks, job description,
    plus the company name, its URL on Lagou, and so on.

    Then design the table in Navicat.

    I created the table lagou_job in the database article_spider, with the following columns (a DDL sketch follows the list):

    url varchar(300)
    url_object_id varchar(50) (primary key)
    title varchar(100)
    salary varchar(20) (nullable; a posting may omit the salary)
    job_city varchar(10) (nullable)
    work_years varchar(100) (nullable)
    degree_need varchar(30) (nullable)
    job_type varchar(20) (nullable)
    publish_time varchar(20) (worth thinking about how to convert this varchar into a datetime)
    tags varchar(100) (nullable)
    job_advantage varchar(1000) (nullable)
    job_desc longtext
    job_addr varchar(50) (nullable; the concrete work address)
    company_url varchar(300) (nullable)
    company_name varchar(100) (nullable)
    crawl_time datetime
    crawl_update_time datetime (nullable)
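
    A minimal DDL sketch of that design (my own rendering of the Navicat table; the connection parameters are placeholders), executed here through pymysql:

    import pymysql

    # Schema transcribed from the column list above; NULL/NOT NULL follows
    # the "(nullable)" notes.
    CREATE_SQL = """
    CREATE TABLE IF NOT EXISTS lagou_job (
        url VARCHAR(300) NOT NULL,
        url_object_id VARCHAR(50) NOT NULL PRIMARY KEY,
        title VARCHAR(100) NOT NULL,
        salary VARCHAR(20) NULL,
        job_city VARCHAR(10) NULL,
        work_years VARCHAR(100) NULL,
        degree_need VARCHAR(30) NULL,
        job_type VARCHAR(20) NULL,
        publish_time VARCHAR(20) NOT NULL,
        tags VARCHAR(100) NULL,
        job_advantage VARCHAR(1000) NULL,
        job_desc LONGTEXT NOT NULL,
        job_addr VARCHAR(50) NULL,
        company_url VARCHAR(300) NULL,
        company_name VARCHAR(100) NULL,
        crawl_time DATETIME NOT NULL,
        crawl_update_time DATETIME NULL
    ) DEFAULT CHARSET=utf8mb4;
    """

    conn = pymysql.connect(host="localhost", user="root", password="root",
                           database="article_spider", charset="utf8mb4")
    with conn.cursor() as cursor:
        cursor.execute(CREATE_SQL)
    conn.commit()
    conn.close()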

    Then open the Anaconda prompt (any other terminal works the same) and list the spider templates:

    scrapy genspider --list

    You can see that a crawl template is available, so generate the spider with:

    scrapy genspider -t crawl lagou www.lagou.com
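
    On a standard install those two commands print something like the following (the module path assumes the project is named ArticleSpider):

    Available templates:
      basic
      crawl
      csvfeed
      xmlfeed

    Created spider 'lagou' using template 'crawl' in module:
      ArticleSpider.spiders.lagou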

    We can then find the generated spider in the project's spiders directory:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class LagouSpider(CrawlSpider):
        name = 'lagou'
        allowed_domains = ['www.lagou.com']
        start_urls = ['http://www.lagou.com/']

        rules = (
            Rule(LinkExtractor(allow=("zhaopin/.*",)), follow=True),
            Rule(LinkExtractor(allow=(r"gongsi/\d+\.html",)), follow=True),
            Rule(LinkExtractor(allow=r"jobs/\d+\.html"), callback='parse_job', follow=True),
        )
    

    The rules describe which links under Lagou's second-level paths the crawler should follow. Only the last rule has a callback, so only job detail pages are parsed and written to the database; the other two rules exist purely to lead the crawler to those pages.
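
    Roughly, the three patterns match URL shapes like these (the first two URLs are illustrative examples of the shapes; the jobs URL is the one analyzed below):

    # zhaopin/.*        -> https://www.lagou.com/zhaopin/Java/      (listing pages, follow only)
    # gongsi/\d+\.html  -> https://www.lagou.com/gongsi/147.html    (company pages, follow only)
    # jobs/\d+\.html    -> https://www.lagou.com/jobs/4804175.html  (job detail pages, parsed by parse_job)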

    Next we can work out how to extract the pieces we want from a job page.

    A good way to start is scrapy shell.

    One thing worth mentioning: Lagou now requires a valid user-agent and cookies before it will serve a page (this is not the same as requiring a login).

    So in the Anaconda prompt run:

    scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" https://www.lagou.com/jobs/4804175.html

    and analyze the page from there.

    One detail: in the shell you write

    response.css(".position-label li::text").extract()

    whereas with the ItemLoader in the CrawlSpider you only pass the selector expression ".position-label li::text"; no trailing extract() is needed, because the loader extracts the values for you.

    With that we can pull out all the data we want.
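
    A quick sketch of the two styles side by side (tags is the field this selector feeds in parse_job below):

    # In scrapy shell: you call extract() on the SelectorList yourself.
    tags = response.css(".position-label li::text").extract()

    # In the spider: the ItemLoader extracts internally; you only register
    # the CSS expression for a field.
    item_loader.add_css("tags", ".position-label li::text")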

    Now let's write the data into MySQL.

    First, the definitions in items.py:

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose, TakeFirst, Join
    from w3lib.html import remove_tags

    # SQL_DATETIME_FORMAT (e.g. "%Y-%m-%d %H:%M:%S") is assumed to be
    # defined in the project's settings.py
    from ArticleSpider.settings import SQL_DATETIME_FORMAT


    def remove_splash(value):
        # strip the "/" separator that Lagou appends to fields like the work city
        return value.replace("/", "")


    def handle_jobaddr(value):
        # drop blank fragments and the "查看地图" ("view map") link text
        addr_list = value.split("\n")
        addr_list = [item.strip() for item in addr_list
                     if item.strip() != "查看地图"]
        return "".join(addr_list)


    class LagouJobItemLoader(ItemLoader):
        # custom ItemLoader: take the first extracted value for every field
        default_output_processor = TakeFirst()


    class LagouJobItem(scrapy.Item):
        # a Lagou job posting
        title = scrapy.Field()
        url = scrapy.Field()
        url_object_id = scrapy.Field()
        salary = scrapy.Field()
        job_city = scrapy.Field(
            input_processor=MapCompose(remove_splash)
        )
        work_years = scrapy.Field(
            input_processor=MapCompose(remove_splash)
        )
        degree_need = scrapy.Field(
            input_processor=MapCompose(remove_splash)
        )
        job_type = scrapy.Field()
        publish_time = scrapy.Field()
        job_advantage = scrapy.Field()
        job_desc = scrapy.Field()
        job_addr = scrapy.Field(
            input_processor=MapCompose(remove_tags, handle_jobaddr)
        )
        company_name = scrapy.Field()
        company_url = scrapy.Field()
        tags = scrapy.Field(
            input_processor=Join(",")
        )
        crawl_time = scrapy.Field()

        def get_insert_sql(self):
            insert_sql = """
                insert into lagou_job(title, url, url_object_id,
                salary, job_city, work_years,
                degree_need, job_type, publish_time,
                job_advantage, job_desc, job_addr,
                company_name, company_url, tags,
                crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
            """
            params = (
                self["title"], self["url"], self["url_object_id"], self["salary"],
                self["job_city"], self["work_years"], self["degree_need"],
                self["job_type"], self["publish_time"], self["job_advantage"],
                self["job_desc"], self["job_addr"], self["company_name"],
                self["company_url"], self["tags"],
                self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
            )
            return insert_sql, params
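
    get_insert_sql is meant to be consumed by a generic MySQL pipeline, which the steps above don't show. Here is a minimal sketch of such a pipeline using Twisted's adbapi (a common Scrapy pattern; the MYSQL_* setting names are my own assumption):

    # pipelines.py -- asynchronous MySQL insert pipeline (sketch)
    import pymysql
    from twisted.enterprise import adbapi


    class MysqlTwistedPipeline(object):
        def __init__(self, dbpool):
            self.dbpool = dbpool

        @classmethod
        def from_settings(cls, settings):
            # MYSQL_* values are assumed to be defined in settings.py
            dbpool = adbapi.ConnectionPool(
                "pymysql",
                host=settings["MYSQL_HOST"],
                db=settings["MYSQL_DBNAME"],
                user=settings["MYSQL_USER"],
                passwd=settings["MYSQL_PASSWORD"],
                charset="utf8mb4",
                cursorclass=pymysql.cursors.DictCursor,
            )
            return cls(dbpool)

        def process_item(self, item, spider):
            # run the insert on a connection from the thread pool
            query = self.dbpool.runInteraction(self.do_insert, item)
            query.addErrback(self.handle_error)
            return item

        def handle_error(self, failure):
            print(failure)

        def do_insert(self, cursor, item):
            insert_sql, params = item.get_insert_sql()
            cursor.execute(insert_sql, params)

    Enable it in settings.py with ITEM_PIPELINES = {"ArticleSpider.pipelines.MysqlTwistedPipeline": 2}.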

    The lagou.py file then looks like this:

    import scrapy
    from datetime import datetime
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from ArticleSpider.items import LagouJobItem, LagouJobItemLoader
    from ArticleSpider.utils.common import get_md5


    class LagouSpider(CrawlSpider):
        name = 'lagou'
        allowed_domains = ['www.lagou.com']
        start_urls = ['http://www.lagou.com/']

        rules = (
            Rule(LinkExtractor(allow=("zhaopin/.*",)), follow=True),
            Rule(LinkExtractor(allow=(r"gongsi/\d+\.html",)), follow=True),
            Rule(LinkExtractor(allow=r"jobs/\d+\.html"), callback='parse_job', follow=True),  # callback parses job pages
        )

        def parse_job(self, response):
            # parse a Lagou job detail page
            item_loader = LagouJobItemLoader(item=LagouJobItem(), response=response)  # bind the custom loader to this response
            item_loader.add_css("title", ".job-name::attr(title)")
            item_loader.add_value("url", response.url)
            item_loader.add_value("url_object_id", get_md5(response.url))
            item_loader.add_css("salary", ".job_request .salary::text")
            item_loader.add_xpath("job_city", "//*[@class='job_request']/p/span[2]/text()")
            item_loader.add_xpath("work_years", "//*[@class='job_request']/p/span[3]/text()")
            item_loader.add_xpath("degree_need", "//*[@class='job_request']/p/span[4]/text()")
            item_loader.add_xpath("job_type", "//*[@class='job_request']/p/span[5]/text()")
            item_loader.add_css("tags", '.position-label li::text')
            item_loader.add_css("publish_time", ".publish_time::text")
            item_loader.add_css("job_advantage", ".job-advantage p::text")
            item_loader.add_css("job_desc", ".job_bt div")
            item_loader.add_css("job_addr", ".work_addr")
            item_loader.add_css("company_name", "#job_company dt a img::attr(alt)")
            item_loader.add_css("company_url", "#job_company dt a::attr(href)")
            item_loader.add_value("crawl_time", datetime.now())

            job_item = item_loader.load_item()

            return job_item
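
    The spider imports get_md5 from ArticleSpider.utils.common, which is not shown above. A minimal sketch of such a helper (md5-hashing the URL yields a fixed-length value for the url_object_id primary key):

    # utils/common.py -- sketch of the md5 helper
    import hashlib


    def get_md5(url):
        # md5 works on bytes, so encode str input first
        if isinstance(url, str):
            url = url.encode("utf-8")
        m = hashlib.md5()
        m.update(url)
        return m.hexdigest()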

    Note: Lagou used not to check the user-agent, but it does now, so add custom_settings to the spider class. With COOKIES_ENABLED set to False, Scrapy's cookie middleware is disabled and the hard-coded Cookie header below is sent as-is:

       custom_settings = {
            "COOKIES_ENABLED": False,
            "DOWNLOAD_DELAY": 1,
            'DEFAULT_REQUEST_HEADERS': {
                'Accept': 'application/json, text/javascript, */*; q=0.01',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.8',
                'Connection': 'keep-alive',
                'Cookie': 'user_trace_token=20171015132411-12af3b52-3a51-466f-bfae-a98fc96b4f90; LGUID=20171015132412-13eaf40f-b169-11e7-960b-525400f775ce; SEARCH_ID=070e82cdbbc04cc8b97710c2c0159ce1; ab_test_random_num=0; X_HTTP_TOKEN=d1cf855aacf760c3965ee017e0d3eb96; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=0; PRE_UTM=; PRE_HOST=www.baidu.com; PRE_SITE=https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DsXIrWUxpNGLE2g_bKzlUCXPTRJMHxfCs6L20RqgCpUq%26wd%3D%26eqid%3Dee53adaf00026e940000000559e354cc; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; index_location_city=%E5%85%A8%E5%9B%BD; TG-TRACK-CODE=index_hotjob; login=false; unick=""; _putrc=""; JSESSIONID=ABAAABAAAFCAAEG50060B788C4EED616EB9D1BF30380575; _gat=1; _ga=GA1.2.471681568.1508045060; LGSID=20171015203008-94e1afa5-b1a4-11e7-9788-525400f775ce; LGRID=20171015204552-c792b887-b1a6-11e7-9788-525400f775ce',
                'Host': 'www.lagou.com',
                'Origin': 'https://www.lagou.com',
                'Referer': 'https://www.lagou.com/',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
            }
        }

    Then you can watch rows land in our table. A handy way to launch the spider is shown below.
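
    A minimal debug entry point, assuming the standard Scrapy project layout (run it from the directory containing scrapy.cfg); running scrapy crawl lagou in a terminal works just as well:

    # main.py -- launch the spider programmatically, e.g. under a debugger
    import os
    import sys

    from scrapy.cmdline import execute

    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(["scrapy", "crawl", "lagou"])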

    One error you may hit is a KeyError on tags, because some postings have no tags at all.

    For example, this page

    https://www.lagou.com/jobs/67631.html

    has no tags.

    We can define a function to handle this error (I won't go into detail here; one possible guard is sketched below).
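
    One possible guard (my own sketch, not the author's solution): scrapy.Item is dict-like, so .get() with a default avoids the KeyError inside get_insert_sql:

    import scrapy


    class DemoItem(scrapy.Item):
        tags = scrapy.Field()


    item = DemoItem()
    # .get() returns the default when the spider never populated "tags"
    print(item.get("tags", ""))  # -> ""

    So replacing self["tags"] with self.get("tags", "") in get_insert_sql lets postings without tags insert cleanly.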
