  • Scraping Zhilian Zhaopin (智联招聘)

    Create the project

    wljdeMacBook-Pro:PycharmProjects wlj$ scrapy startproject zhilianzhaopin
    New Scrapy project 'zhilianzhaopin', using template directory '/usr/local/lib/python3.6/site-packages/scrapy/templates/project', created in:
        /Users/wlj/work/PycharmProjects/zhilianzhaopin

    You can start your first spider with:
        cd zhilianzhaopin
        scrapy genspider example example.com

    wljdeMacBook-Pro:PycharmProjects wlj$ scrapy genspider --list
    Available templates:
      basic
      crawl
      csvfeed
      xmlfeed
    wljdeMacBook-Pro:PycharmProjects wlj$ scrapy genspider -t crawl zhaopin sou.zhaopin.com
    Created spider 'zhaopin' using template 'crawl'
    wljdeMacBook-Pro:PycharmProjects wlj$
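
    For reference, startproject generates the standard Scrapy skeleton (details
    such as middlewares.py can differ slightly between Scrapy versions):

        zhilianzhaopin/
            scrapy.cfg                # deploy configuration
            zhilianzhaopin/
                __init__.py
                items.py              # item definitions (edited below)
                middlewares.py
                pipelines.py          # item pipelines (edited below)
                settings.py           # project settings (edited below)
                spiders/
                    __init__.py
                    zhaopin.py        # created by genspider -t crawl above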

     items.py

     import scrapy

     class ZhilianzhaopinItem(scrapy.Item):
         # define the fields for your item here like:
         # name = scrapy.Field()
         # job title
         position = scrapy.Field()
         # company
         company = scrapy.Field()
         # job description
         position_1 = scrapy.Field()
         # company introduction
         company_1 = scrapy.Field()
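
     An Item behaves like a dict: fields are assigned by key and the whole
     item converts with dict(). A quick sanity check in a Python shell (the
     values here are made up for illustration):

         from zhilianzhaopin.items import ZhilianzhaopinItem

         item = ZhilianzhaopinItem()
         item['position'] = 'Python Developer'   # hypothetical value
         item['company'] = 'Example Co.'         # hypothetical value
         print(dict(item))
         # {'position': 'Python Developer', 'company': 'Example Co.'}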

    zhaopin.py

     import scrapy
     from scrapy.linkextractors import LinkExtractor
     from scrapy.spiders import CrawlSpider, Rule
     from zhilianzhaopin.items import ZhilianzhaopinItem

     class ZhaopinSpider(CrawlSpider):
         name = 'zhaopin'
         allowed_domains = ['sou.zhaopin.com']
         start_urls = ['https://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=python&p=1']

         rules = (
             # follow pagination links (p=1, p=2, ...) without parsing them
             Rule(LinkExtractor(allow=r'python&p=\d+'), follow=True),
             # send each job detail page to parse_item
             Rule(LinkExtractor(allow=r'\d+\.htm'), callback='parse_item'),
         )

         def parse_item(self, response):
             i = ZhilianzhaopinItem()
             #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
             #i['name'] = response.xpath('//div[@id="name"]').extract()
             #i['description'] = response.xpath('//div[@id="description"]').extract()

             # job title
             i['position'] = response.xpath('//div[@class="fixed-inner-box"]//h1/text()').extract()[0]
             # company
             i['company'] = response.xpath('//div[@class="fixed-inner-box"]//a/text()').extract()[0].strip()
             # job description
             i['position_1'] = response.xpath('//div[@class="tab-inner-cont"]/p/text()').extract()
             # company introduction
             i['company_1'] = response.xpath('//div[@class="tab-inner-cont"]/div/div/text()').extract()[0].strip()

             yield i
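
     The two rules split the crawl in two: the first matches pagination URLs
     and only follows them, while the second matches job detail pages and
     hands them to parse_item. Both patterns rely on escaped \d (digits) and
     \. (literal dot). A quick check with re (the detail URL below is
     hypothetical; the real site's URL scheme may differ):

         import re

         page_url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=python&p=2'
         job_url = 'https://sou.zhaopin.com/12345678.htm'  # hypothetical detail URL

         print(bool(re.search(r'python&p=\d+', page_url)))  # True -> followed
         print(bool(re.search(r'\d+\.htm', job_url)))       # True -> parse_item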

    pipelines.py

     import json

     class ZhilianzhaopinPipeline(object):
         def __init__(self):
             # open the output file once when the pipeline is created
             self.filename = open("zhilian.json", "w", encoding="utf-8")

         def process_item(self, item, spider):
             # one JSON object per line, keeping Chinese text readable
             text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
             self.filename.write(text)
             return item

         def close_spider(self, spider):
             self.filename.close()
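
     Each item is written as a JSON object followed by a comma and newline,
     so zhilian.json is a line-per-item file rather than one valid JSON
     document. A minimal sketch for reading it back (assuming the format
     produced above):

         import json

         with open("zhilian.json", encoding="utf-8") as f:
             items = [json.loads(line.rstrip(",\n")) for line in f if line.strip()]
         print(len(items), "items loaded")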

    settings.py

     BOT_NAME = 'zhilianzhaopin'

     SPIDER_MODULES = ['zhilianzhaopin.spiders']
     NEWSPIDER_MODULE = 'zhilianzhaopin.spiders'

     # Obey robots.txt rules
     ROBOTSTXT_OBEY = False

     ITEM_PIPELINES = {
         'zhilianzhaopin.pipelines.ZhilianzhaopinPipeline': 300,
     }

     LOG_FILE = "dg.log"
     LOG_LEVEL = "DEBUG"
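
     Any of these settings can also be overridden per run with -s on the
     command line, which avoids editing settings.py for one-off changes,
     for example:

         scrapy crawl zhaopin -s LOG_LEVEL=INFO -s LOG_FILE=run.log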

     Run

    scrapy crawl zhaopin
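
    With LOG_FILE set, console output goes to dg.log instead of the terminal,
    and the pipeline writes scraped items to zhilian.json in the directory the
    command is run from. A quick way to check the results:

        tail dg.log
        head -n 3 zhilian.json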

  • Original post: https://www.cnblogs.com/wanglinjie/p/9227124.html