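  • Scrapy: Crawling the Entire Tencent Recruitment Site

    Target site: https://hr.tencent.com/

    Step 1: Analyze the site structure and the content to crawl

    (The detailed analysis is omitted here.)

    Step 2: The code (this part cannot be skipped)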

    1. Configure items.py

        import scrapy


        class HrTencentItem(scrapy.Item):
            # Fields filled in from the job list page
            position_name = scrapy.Field()  # job title
            position_type = scrapy.Field()  # job category
            detail_url = scrapy.Field()     # URL of the job detail page
            people_count = scrapy.Field()   # number of openings
            work_city = scrapy.Field()      # work location
            release_date = scrapy.Field()   # publication date
            # Fields filled in from the job detail page
            job_description = scrapy.Field()  # job description
            job_require = scrapy.Field()      # job requirements
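The spider later in this post fills these fields in two stages: six come from the list page and two from the detail page. As a rough, Scrapy-free sketch of that split, the snippet below uses plain dicts. The helper name missing_fields and the sample values are made up for illustration and are not part of the project.

```python
# Field names copied from HrTencentItem; the grouping mirrors the two spider callbacks.
LIST_PAGE_FIELDS = ("position_name", "position_type", "detail_url",
                    "people_count", "work_city", "release_date")
DETAIL_PAGE_FIELDS = ("job_description", "job_require")


def missing_fields(record):
    """Return the item fields that are absent or empty in a record (hypothetical helper)."""
    return [name for name in LIST_PAGE_FIELDS + DETAIL_PAGE_FIELDS
            if not record.get(name)]


# Made-up record as it might look after the list page but before the detail page.
partial = {name: "sample" for name in LIST_PAGE_FIELDS}
print(missing_fields(partial))
```

A check like this can be handy while debugging, since an item that reaches the pipeline with empty detail fields usually means the detail page XPath did not match.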

    2. Configure settings.py

    MongoDB settings:

    NEWSPIDER_MODULE = 'hr_tencent.spiders'
    MONGO_URL = 'localhost'
    MONGO_DB = 'hrtencent'

    Remember to register the pipeline in ITEM_PIPELINES:

    ITEM_PIPELINES = {
        # 'hr_tencent.pipelines.HrTencentPipeline': 300,
        'hr_tencent.pipelines.MongoPipeline': 400,
    }
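The integer values in ITEM_PIPELINES control ordering: Scrapy passes each item through the enabled pipelines from the lowest value to the highest. The sketch below is plain Python, not Scrapy's own code. It only illustrates that ordering with the two entries that appear in this project's setting, treating both as enabled for the sake of the example.

```python
# Illustration only: mimics how pipeline priorities translate into run order.
ITEM_PIPELINES = {
    'hr_tencent.pipelines.MongoPipeline': 400,
    'hr_tencent.pipelines.HrTencentPipeline': 300,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by priority, lowest value first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


print(pipeline_order(ITEM_PIPELINES))
```

With these values HrTencentPipeline (300) would see each item before MongoPipeline (400), regardless of the order in which the dictionary lists them.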

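    3. In the spiders folder, run the following command (genspider needs both a spider name and a domain): scrapy genspider tencent hr.tencent.com

    4. Open the generated tencent.py file and edit it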

        # -*- coding: utf-8 -*-
        import scrapy
        from hr_tencent.items import HrTencentItem


        class TencentSpider(scrapy.Spider):
            name = 'tencent'
            allowed_domains = ['hr.tencent.com']
            start_urls = ['https://hr.tencent.com/position.php']
            # Links on the page are relative, so they are prefixed with the site root.
            front_url = "https://hr.tencent.com/"

            def parse(self, response):
                # Each job posting is a table row with class "even" or "odd".
                tencenthr = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
                for job in tencenthr:
                    item = HrTencentItem()
                    item["position_name"] = job.xpath('.//a/text()').extract_first()
                    item["detail_url"] = self.front_url + job.xpath('.//a/@href').extract_first()
                    item["position_type"] = job.xpath('.//td[2]/text()').extract_first()
                    item["people_count"] = job.xpath('.//td[3]/text()').extract_first()
                    item["work_city"] = job.xpath('.//td[4]/text()').extract_first()
                    item["release_date"] = job.xpath('.//td[5]/text()').extract_first()
                    # Pass the half-filled item to the detail page callback via meta.
                    yield scrapy.Request(url=item["detail_url"], callback=self.detail_parse, meta={"item": item})

                # Follow the "next" link. Stop when it is missing or is a javascript:
                # placeholder; concatenating None would otherwise raise a TypeError.
                next_href = response.xpath('//div[@class="pagenav"]/a[@id="next"]/@href').extract_first()
                if next_href and not next_href.startswith('javascript'):
                    yield scrapy.Request(url=self.front_url + next_href, callback=self.parse)

            def detail_parse(self, response):
                item = response.meta["item"]
                # The first "squareli" list is the job description, the second the requirements.
                node_list = response.xpath('//ul[@class="squareli"]')
                item["job_description"] = ''.join(node_list[0].xpath("./li/text()").extract())
                item["job_require"] = ''.join(node_list[1].xpath("./li/text()").extract())
                yield item
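Pagination is the part of the spider most likely to break, because the "next" link may be absent or may not point to a real page once the last page is reached. The helper below is a standalone sketch of one way to handle that using only the standard library. The function name next_page_url and the sample href are assumptions for illustration, not taken from the site or from the spider above.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Return the absolute URL of the next page, or None when there is none.

    href is the raw value of the link's href attribute, which may be None
    (link not found) or a javascript: placeholder (link disabled).
    """
    if not href or href.strip().lower().startswith('javascript'):
        return None
    # urljoin resolves a relative href against the URL of the current page.
    return urljoin(page_url, href)


# Hypothetical relative href, resolved against the list page URL.
print(next_page_url('https://hr.tencent.com/position.php', 'position.php?start=10'))
print(next_page_url('https://hr.tencent.com/position.php', 'javascript:;'))
```

Inside a Scrapy callback, response.urljoin(href) does the same resolution against the response URL, so the spider would not need a hard-coded prefix.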

    5. Configure pipelines.py

        import pymongo


        class MongoPipeline(object):
            def __init__(self, mongo_url, mongo_db):
                self.mongo_url = mongo_url
                self.mongo_db = mongo_db

            @classmethod
            def from_crawler(cls, crawler):
                # Read the connection settings defined in settings.py.
                return cls(
                    mongo_url=crawler.settings.get('MONGO_URL'),
                    mongo_db=crawler.settings.get('MONGO_DB')
                )

            def open_spider(self, spider):
                # Called once when the spider starts: open the connection.
                self.client = pymongo.MongoClient(self.mongo_url)
                self.db = self.client[self.mongo_db]

            def process_item(self, item, spider):
                # Use the item class name (HrTencentItem) as the collection name.
                name = item.__class__.__name__
                # insert_one replaces Collection.insert, which was removed in PyMongo 4.
                self.db[name].insert_one(dict(item))
                return item

            def close_spider(self, spider):
                # Called once when the spider finishes: release the connection.
                self.client.close()
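To see the pipeline lifecycle without a running MongoDB server, the sketch below replays the same three hooks (open_spider, process_item, close_spider) against an in-memory stand-in. FakeCollection, FakeDatabase, InMemoryPipeline and the sample job title are all hypothetical names invented for this illustration. Only the rule that the item's class name becomes the collection name is taken from the pipeline above.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for a MongoDB collection."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class FakeDatabase(dict):
    """Creates a collection on first access, similar to indexing a database by name."""

    def __missing__(self, name):
        self[name] = FakeCollection()
        return self[name]


class HrTencentItem(dict):
    """Stand-in for the Scrapy item; only its class name matters here."""


class InMemoryPipeline:
    def open_spider(self, spider):
        self.db = FakeDatabase()

    def process_item(self, item, spider):
        # Same rule as MongoPipeline: the item's class name is the collection name.
        self.db[item.__class__.__name__].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        pass  # nothing to release for the in-memory stand-in


pipeline = InMemoryPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(HrTencentItem(position_name='Backend Engineer'), spider=None)
pipeline.close_spider(spider=None)
print(list(pipeline.db), pipeline.db['HrTencentItem'].docs)
```

Because the collection name is derived from the item class, a project with several item types would end up with one collection per type without any extra configuration.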

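    6. Create a run.py file so that you can start the crawl by running run.py instead of typing the command every time

        # -*- coding:utf-8 -*-
        from scrapy import cmdline

        # Equivalent to typing "scrapy crawl tencent" on the command line.
        cmdline.execute("scrapy crawl tencent".split())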
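cmdline.execute expects the command as a list of arguments, which is why run.py splits the string on whitespace. The sketch below builds that list explicitly. The helper name crawl_argv and the output file name are assumptions for illustration, while -o is the scrapy crawl option for writing scraped items to a file.

```python
def crawl_argv(spider_name, output_file=None):
    """Build the argument list for a "scrapy crawl" invocation (hypothetical helper)."""
    argv = ["scrapy", "crawl", spider_name]
    if output_file:
        # -o tells scrapy crawl to also export the scraped items to this file.
        argv += ["-o", output_file]
    return argv


print(crawl_argv("tencent"))
print(crawl_argv("tencent", output_file="jobs.json"))
```

Passing the result of crawl_argv("tencent") to cmdline.execute would start the same crawl as the run.py above.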

    7. Start MongoDB locally or on a server (for a remote MongoDB instance, set its address in MONGO_URL yourself)

    8. Run run.py and the data is yours

  • Original article: https://www.cnblogs.com/jackzz/p/10750463.html