  • <Scrapy spider> Scraping Tencent's experienced-hire job listings

    1. Create the Scrapy project

    In a command prompt window, enter:

    scrapy startproject tencent
    
    cd tencent
    

    2. Write items.py (this acts as the template: the data fields to scrape are defined here)

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class TencentItem(scrapy.Item):
        # define the fields for your item here like:
        # position name
        positionname = scrapy.Field()
        # link
        positionlink = scrapy.Field()
        # category
        positionType = scrapy.Field()
        # number of openings
        positionNum = scrapy.Field()
        # work location
        positioncation = scrapy.Field()
        # publish date
        positionTime = scrapy.Field()
    

    3. Create the spider file

    In a command prompt window, enter:

    scrapy genspider myspider tencent.com
    

    4. Write myspider.py (receives the response and processes the data)

    # -*- coding: utf-8 -*-
    import scrapy
    from tencent.items import TencentItem
    
    class MyspiderSpider(scrapy.Spider):
        name = 'myspider'
        allowed_domains = ['tencent.com']
        url = 'https://hr.tencent.com/position.php?&start='
        offset = 0
        start_urls = [url+str(offset)]
    
    
        def parse(self, response):
            for each in response.xpath('//tr[@class="even"]|//tr[@class="odd"]'):
                # initialize the item object
                item = TencentItem()
                # position name
                item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
                # link
                item['positionlink'] = 'http://hr.tencent.com/' + each.xpath("./td[1]/a/@href").extract()[0]
                # category
                item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
                # number of openings
                item['positionNum'] = each.xpath("./td[3]/text()").extract()[0]
                # work location
                item['positioncation'] = each.xpath("./td[4]/text()").extract()[0]
                # publish date
                item['positionTime'] = each.xpath("./td[5]/text()").extract()[0]
                yield item
            # request the next page until the last offset (2820) has been reached
            if self.offset < 2820:
                self.offset += 10
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
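
    Two details of the spider can be checked without running Scrapy. The sketch below uses the standard library's ElementTree, which supports only a subset of XPath but treats these two predicates the same way: `[@class='odd']` tests the class attribute, while `[class='odd']` looks for a child element named class. It also lists the page URLs produced by stepping the offset by 10 up to 2820. The sample table and the names `count_rows` and `page_urls` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the listing table, not the real page markup.
SAMPLE = """<table>
  <tr class="h"><td>header</td></tr>
  <tr class="even"><td>job A</td></tr>
  <tr class="odd"><td>job B</td></tr>
</table>"""

BASE_URL = 'https://hr.tencent.com/position.php?&start='


def count_rows(path):
    """Count the <tr> elements of SAMPLE matched by a path expression."""
    return len(ET.fromstring(SAMPLE).findall(path))


def page_urls(last_offset=2820, step=10):
    """List every listing-page URL from offset 0 up to and including last_offset."""
    return [BASE_URL + str(offset) for offset in range(0, last_offset + step, step)]


# With the @, the predicate tests the class attribute of each row.
with_at = count_rows(".//tr[@class='odd']")
# Without it, the predicate looks for a child element <class>, which no row has.
without_at = count_rows(".//tr[class='odd']")

urls = page_urls()
print(with_at, without_at, len(urls), urls[-1])
```

    In this sample, the version without the @ selects no rows at all, so a selector written that way would silently skip every "odd" row.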
    

    5. Write pipelines.py (stores the data)

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import json
    
    class TencentPipeline(object):
        def __init__(self):
            self.filename = open('tencent.json','wb')
    
        def process_item(self, item, spider):
            text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
            self.filename.write(text.encode('utf-8'))
            return item
    
        # Scrapy passes the spider to close_spider, so the method must accept it
        def close_spider(self, spider):
            self.filename.close()
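
    The pipeline writes one JSON object per line, each followed by a comma, so the file as a whole is not a single valid JSON document. A minimal sketch of the same write step and of one way to read the lines back, using an in-memory buffer in place of tencent.json; the helper names `write_item` and `read_items` and the sample values are hypothetical:

```python
import io
import json


def write_item(stream, item):
    """Write one item as the pipeline does: JSON text, a comma, a newline, as UTF-8 bytes."""
    text = json.dumps(item, ensure_ascii=False) + ',\n'
    stream.write(text.encode('utf-8'))


def read_items(data):
    """Parse the written bytes back into dicts, one per non-empty line."""
    items = []
    for line in data.decode('utf-8').split('\n'):
        line = line.strip().rstrip(',')  # drop the trailing comma added on write
        if line:
            items.append(json.loads(line))
    return items


buffer = io.BytesIO()  # stands in for the file opened with 'wb'
write_item(buffer, {'positionname': '后台开发工程师', 'positionNum': '2'})
write_item(buffer, {'positionname': '产品经理', 'positionNum': '1'})

items = read_items(buffer.getvalue())
print(items)
```

    Because of `ensure_ascii=False`, Chinese text is stored as readable UTF-8 rather than as `\uXXXX` escapes, which is why the file is opened in binary mode and the text is encoded before writing.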
    

    6. Edit settings.py (set headers, pipelines, etc.)

    robots.txt protocol

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False  
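
    For context, a robots.txt file lists the paths a site asks crawlers not to fetch; with `ROBOTSTXT_OBEY = False` Scrapy does not consult it. A minimal sketch of what such rules mean, using the standard library's `urllib.robotparser`; the rules and the example.com URLs below are hypothetical, not what the site actually serves:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
RULES = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# A path outside the disallowed prefix may be fetched.
allowed = parser.can_fetch('*', 'https://example.com/position.php')
# A path under /private/ may not.
blocked = parser.can_fetch('*', 'https://example.com/private/page.html')
print(allowed, blocked)
```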

    headers

    DEFAULT_REQUEST_HEADERS = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      # 'Accept-Language': 'en',
    }
    

    pipelines

    ITEM_PIPELINES = {
        'tencent.pipelines.TencentPipeline': 300,
    }
    

    7. Run the spider

    In a command prompt window, enter:

    scrapy crawl myspider 

    Result:

    Checking the debug output:

    2019-02-18 16:02:22 [scrapy.core.scraper] ERROR: Spider error processing <GET https://hr.tencent.com/position.php?&start=520> (referer: https://hr.tencent.com/position.php?&start=510)
    Traceback (most recent call last):
      File "E:\software\ANACONDA\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
        yield next(it)
      File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
        for x in result:
      File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "C:\Users\123\tencent\tencent\spiders\myspider.py", line 22, in parse
        item['positionType'] = each.xpath("./td[2]/text()").extract()[0]  

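    Checking the web page:

    This posting is missing one attribute - -!!! (So many tricks out there!) With the category cell empty, extract() returns an empty list, so indexing it with [0] fails.

    So change one line in myspider.py: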

    item['positionType'] = each.xpath("./td[2]/text()").extract()[0] 

    Add a check, changing it to:

    if len(each.xpath("./td[2]/text()").extract()) > 0:
      item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
    else:
      item['positionType'] = "None"
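
    The same guard can be factored into a small helper, since extract() returns a plain Python list. This is a sketch only: `first_or_default` is a hypothetical name, not part of Scrapy. Scrapy's selector lists also provide extract_first(), which accepts a default value and may be the simpler choice.

```python
def first_or_default(values, default="None"):
    """Return the first element of a list, or default when the list is empty."""
    return values[0] if values else default


# A row with a category and a row without one (sample values are made up).
print(first_or_default(["技术类"]))
print(first_or_default([]))
```

    Inside parse this would read `item['positionType'] = first_or_default(each.xpath("./td[2]/text()").extract())`, and the same call could guard the other fields as well.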
    

    Result:

    Compare with the last page on the website:

    Scrape successful!

  • Original article: https://www.cnblogs.com/shuimohei/p/10396406.html