  • Fixing the Scrapy crawl error AttributeError: 'xxxSpider' object has no attribute '_rules'

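    I wrote a demo following the official documentation, with one addition: an __init__ method. Running it failed with an error saying the spider has no _rules attribute. The error log: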

     ......
      File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py", line 82, in _parse_response
        for request_or_item in self._requests_to_follow(response):
      File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py", line 60, in _requests_to_follow
        for n, rule in enumerate(self._rules):
    AttributeError: 'TestSpider' object has no attribute '_rules'
    

    The spider code that triggers the problem:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from newtest.items import NewtestItem
    
    
    class TestSpider(CrawlSpider):
        
        def __init__(self,*args, **kwargs):
            self.headers = {
                'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
                'Accept-Encoding':'gzip, deflate',
                'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
            }
    
        name = 'test'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']
    
        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(LinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
    
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor(allow=('item.php', )), callback='parse_item'),
        )
    
        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            item = scrapy.Item()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item
    

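    On closer inspection, the only difference from the official example is that I overrode the __init__ method. Judging from the log, the override replaces CrawlSpider's __init__ without calling the parent implementation, so the _rules attribute is never created. Looking at the CrawlSpider source (shown as two screenshots in the original post): its __init__ calls self._compile_rules(), and that method is where self._rules gets assigned.
    So if a spider inherits from CrawlSpider and needs its own __init__ method, remember to call the parent class's __init__ so that CrawlSpider's attributes are initialized properly.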
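    The mechanism can be reproduced without Scrapy. The following is a minimal sketch, not Scrapy's actual code: BaseCrawler is a hypothetical stand-in for CrawlSpider, and BrokenSpider, FixedSpider, requests_to_follow and try_follow are names invented for illustration. It only assumes that the base __init__ is what creates _rules.

```python
import copy


class BaseCrawler:
    """Hypothetical stand-in for CrawlSpider: __init__ builds _rules from rules."""

    rules = ()

    def __init__(self, *args, **kwargs):
        self._compile_rules()

    def _compile_rules(self):
        # The private attribute only exists once __init__ has run.
        self._rules = [copy.copy(rule) for rule in self.rules]

    def requests_to_follow(self):
        # Mirrors the loop from the traceback: it reads self._rules.
        return [(n, rule) for n, rule in enumerate(self._rules)]


class BrokenSpider(BaseCrawler):
    rules = ("category", "item")

    def __init__(self, *args, **kwargs):
        # The parent __init__ is never called, so _rules is never created.
        self.headers = {"Accept-Encoding": "gzip, deflate"}


class FixedSpider(BaseCrawler):
    rules = ("category", "item")

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # creates _rules first
        self.headers = {"Accept-Encoding": "gzip, deflate"}


def try_follow(spider):
    """Return the enumerated rules, or the error message if _rules is missing."""
    try:
        return spider.requests_to_follow()
    except AttributeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(try_follow(BrokenSpider()))  # an AttributeError message mentioning _rules
    print(try_follow(FixedSpider()))   # the enumerated rules
```

    The class-level rules attribute is present in both subclasses; only the instance attribute derived from it depends on the parent initializer having run.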
    The corrected code is as follows:

    The super(TestSpider, self).__init__(*args, **kwargs) call on line 11 is the key:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from newtest.items import NewtestItem
    
    
    class TestSpider(CrawlSpider):
        
        def __init__(self, *args, **kwargs):
            super(TestSpider, self).__init__(*args, **kwargs)  # this is the key line
            self.headers = {
                'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
                'Accept-Encoding':'gzip, deflate',
                'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
            }
    
        name = 'test'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']
    
        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(LinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
    
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor(allow=('item.php', )), callback='parse_item'),
        )
    
        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            item = scrapy.Item()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item
    
  • Original post: https://www.cnblogs.com/xiaocy66/p/10589277.html