  • CrawlSpiders

    1. Create a new tencent project with Scrapy:

    scrapy startproject tencent

    2. Define the fields to scrape in items.py:

     # -*- coding: utf-8 -*-

     # Define here the models for your scraped items
     #
     # See documentation in:
     # http://doc.scrapy.org/en/latest/topics/items.html

     import scrapy


     class TencentItem(scrapy.Item):
         # define the fields for your item here like:
         # position name
         position_name = scrapy.Field()
         # detail link
         position_link = scrapy.Field()
         # position category
         position_type = scrapy.Field()
         # number of openings
         people_number = scrapy.Field()
         # work location
         work_location = scrapy.Field()
         # publish date
         publish_time = scrapy.Field()

    3. Quickly generate a CrawlSpider template:

    scrapy genspider -t crawl tencent_spider tencent.com

    Note: the spider name must not be the same as the project name.

    4. Open tencent_spider.py and write the code:

     # -*- coding: utf-8 -*-
     import scrapy
     # link-extraction class, used to pull out links that match a rule
     from scrapy.linkextractors import LinkExtractor
     # import the CrawlSpider class and Rule
     from scrapy.spiders import CrawlSpider, Rule
     # import the TencentItem class from items.py in the tencent project
     from tencent.items import TencentItem


     class TencentSpiderSpider(CrawlSpider):
         name = 'tencent_spider'
         allowed_domains = ['hr.tencent.com']
         start_urls = ['http://hr.tencent.com/position.php?&start=0#a']
         pagelink = LinkExtractor(allow=(r"start=\d+"))  # regex match

         rules = (
             # extract the links on each list page, send a request for each,
             # keep following new pages, and call the given callback
             Rule(pagelink, callback='parse_item', follow=True),
         )

         def parse_item(self, response):
             for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
                 item = TencentItem()
                 # position name
                 item['position_name'] = each.xpath("./td[1]/a/text()").extract()[0]
                 # detail link
                 item['position_link'] = each.xpath("./td[1]/a/@href").extract()[0]
                 # position category
                 # item['position_type'] = each.xpath("./td[2]/text()").extract()[0]
                 # number of openings
                 item['people_number'] = each.xpath("./td[3]/text()").extract()[0]
                 # work location
                 # item['work_location'] = each.xpath("./td[4]/text()").extract()[0]
                 # publish date
                 item['publish_time'] = each.xpath("./td[5]/text()").extract()[0]

                 yield item

    5. Write the items to a file in pipelines.py:

     # -*- coding: utf-8 -*-

     # Define your item pipelines here
     #
     # Don't forget to add your pipeline to the ITEM_PIPELINES setting
     # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

     import json


     class TencentPipeline(object):
         def open_spider(self, spider):
             self.filename = open("tencent.json", "w", encoding="utf-8")

         def process_item(self, item, spider):
             text = json.dumps(dict(item), ensure_ascii=False) + "\n"
             self.filename.write(text)
             return item

         def close_spider(self, spider):
             self.filename.close()
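    The pipeline's header comment reminds you to register it in ITEM_PIPELINES. A minimal sketch of the corresponding settings.py entry (the priority value 300 is just a common convention, not something this project requires):

    ```python
    # settings.py -- enable the JSON pipeline for this project
    ITEM_PIPELINES = {
        "tencent.pipelines.TencentPipeline": 300,
    }
    ```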

    6. Run the spider from the command line:

    scrapy crawl tencent_spider

    Problem encountered: the spider in tencent_spider.py only runs after commenting out position_type and work_location — most likely because extract()[0] raises an IndexError when the corresponding table cell is empty...
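    A common fix for that kind of crash is to avoid indexing into a possibly-empty extract() result. A minimal sketch of the idea (the helper name safe_first is hypothetical, not part of Scrapy; Scrapy 1.0+ selectors also provide extract_first() with the same behavior):

    ```python
    def safe_first(values, default=""):
        """Return the first extracted value, or `default` when the list is empty.

        Mirrors what SelectorList.extract_first() does in Scrapy >= 1.0, so
        fields like position_type no longer crash the spider on empty cells.
        """
        return values[0] if values else default


    # instead of: item['position_type'] = each.xpath("./td[2]/text()").extract()[0]
    # write:      item['position_type'] = safe_first(each.xpath("./td[2]/text()").extract())
    ```

    With this in place, the two commented-out fields can be re-enabled without the spider dying on rows that leave those cells blank.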

  • Original article: https://www.cnblogs.com/cuzz/p/7629087.html