  • Scrapy framework: general-purpose crawling with CrawlSpider

    Step 01: Create the crawler project

    scrapy startproject quotes
    

    Step 02: Create the spider from the crawl template

    scrapy genspider -t crawl quotes quotes.toscrape.com
    

    Step 03: Edit the spider file quotes.py

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    
    class Quotes(CrawlSpider):
        # Spider name, used by "scrapy crawl"
        name = "get_quotes"
        # Restrict the crawl to this domain
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']
    
        # Crawl rules
        rules = (
            # Quote listing pages: handle with parse_quotes and
            # keep following the links this rule extracts
            Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_quotes', follow=True),
            # Author pages: handle with parse_author to extract the data
            Rule(LinkExtractor(allow=r'/author/\w+'), callback='parse_author'),
        )
    
        # Extract the quotes from a listing page
        def parse_quotes(self, response):
            for quote in response.css(".quote"):
                yield {'content': quote.css('.text::text').extract_first(),
                       'author': quote.css('.author::text').extract_first(),
                       'tags': quote.css('.tag::text').extract()
                       }

        # Extract the author details from an author page
        def parse_author(self, response):
            name = response.css('.author-title::text').extract_first()
            author_born_date = response.css('.author-born-date::text').extract_first()
            author_born_location = response.css('.author-born-location::text').extract_first()
            author_description = response.css('.author-description::text').extract_first()

            return {'name': name,
                    'author_born_date': author_born_date,
                    'author_born_location': author_born_location,
                    'author_description': author_description
                    }
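
    The allow patterns in the rules are regular expressions, so the backslashes in \d+ and \w+ matter. The sketch below uses only Python's re module, not Scrapy itself, and assumes each pattern is applied as a search against the absolute URL. The sample URLs and the classify helper are hypothetical and exist only for illustration.

```python
import re

# The same patterns that are passed to LinkExtractor(allow=...) in the spider.
PAGE_PATTERN = re.compile(r'/page/\d+')
AUTHOR_PATTERN = re.compile(r'/author/\w+')

# The same page pattern with the backslash lost: "d+" now means the letter d.
BROKEN_PAGE_PATTERN = re.compile(r'/page/d+')

# Hypothetical sample URLs in the shape used by quotes.toscrape.com.
SAMPLE_URLS = [
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/author/Albert-Einstein/',
    'http://quotes.toscrape.com/tag/love/',
]


def classify(url):
    """Return the callback name whose pattern is found in the URL, or None."""
    if PAGE_PATTERN.search(url):
        return 'parse_quotes'
    if AUTHOR_PATTERN.search(url):
        return 'parse_author'
    return None


if __name__ == '__main__':
    for url in SAMPLE_URLS:
        print(url, '->', classify(url))
```

    Without the backslash, the page pattern no longer matches /page/2/, so a rule written that way would extract no pagination links.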
    

    Step 04: Run the spider, using the name set in the spider class

    scrapy crawl get_quotes
    
  • Original article: https://www.cnblogs.com/hankleo/p/11872497.html