  • Scrapy Framework: Generic Crawling with CrawlSpider

    Step 01: Create the crawler project

    scrapy startproject quotes
    
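    This generates the standard Scrapy project skeleton. A typical layout (file names may differ slightly between Scrapy versions) looks like:

    quotes/
    ├── scrapy.cfg                # deploy configuration
    └── quotes/
        ├── __init__.py
        ├── items.py              # item definitions
        ├── middlewares.py        # spider / downloader middlewares
        ├── pipelines.py          # item pipelines
        ├── settings.py           # project settings
        └── spiders/              # spiders go here (quotes.py is added in step 02)
            └── __init__.py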

    Step 02: Generate the spider from the CrawlSpider template

    scrapy genspider -t crawl quotes quotes.toscrape.com
    
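    The -t option selects the spider template; crawl produces a CrawlSpider stub instead of the default basic spider. If you are unsure which templates your Scrapy installation ships with, you can list them first:

    # List the spider templates bundled with Scrapy
    scrapy genspider -l
    # Available templates:
    #   basic
    #   crawl
    #   csvfeed
    #   xmlfeed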

    Step 03: Configure the spider file quotes.py

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    
    class Quotes(CrawlSpider):
        # Spider name
        name = "get_quotes"
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']
    
        # Crawling rules
        rules = (
            # For pagination URLs, call parse_quotes to extract data
            # and keep following links matched by this rule
            Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_quotes', follow=True),
            # For author detail URLs, call parse_author to extract data
            Rule(LinkExtractor(allow=r'/author/\w+'), callback='parse_author')
        )
    
        # Extract data from quote listing pages
        def parse_quotes(self, response):
            for quote in response.css(".quote"):
                yield {'content': quote.css('.text::text').extract_first(),
                       'author': quote.css('.author::text').extract_first(),
                       'tags': quote.css('.tag::text').extract()
                       }
    
        # Extract data from author detail pages
        def parse_author(self, response):
            name = response.css('.author-title::text').extract_first()
            author_born_date = response.css('.author-born-date::text').extract_first()
            author_born_location = response.css('.author-born-location::text').extract_first()
            author_description = response.css('.author-description::text').extract_first()
    
            return {'name': name,
                    'author_born_date': author_born_date,
                    'author_born_location': author_born_location,
                    'author_description': author_description
                    }
    
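    The allow patterns used in the rules can be verified interactively before running the full crawl. The session below is a sketch using scrapy shell against the start URL; the exact URLs returned depend on the live page:

    scrapy shell http://quotes.toscrape.com/
    >>> from scrapy.linkextractors import LinkExtractor
    >>> links = LinkExtractor(allow=r'/page/\d+').extract_links(response)
    >>> [link.url for link in links]
    ['http://quotes.toscrape.com/page/2/']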

    Step 04: Run the spider

    scrapy crawl get_quotes
    
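    By default the yielded items only appear in the crawl log. To persist them, pass Scrapy's feed export option -o, for example:

    # Write the scraped items to a file via feed exports
    scrapy crawl get_quotes -o quotes.json
    scrapy crawl get_quotes -o quotes.csv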