  • Attributes and methods of scrapy.Spider

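    Attributes:
    name: the spider's name; must be unique within a project
    allowed_domains: the domains the spider may crawl; limits the crawl scope
    start_urls: the initial URLs to crawl
    custom_settings: per-spider settings that override the project-wide settings
    crawler: the Crawler object this spider is bound to (set by from_crawler())
    settings: the Settings instance holding all configuration values for this run
    logger: a logger instance for writing debug and status messages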
    
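    Methods:
    from_crawler(crawler, *args, **kwargs): class method that Scrapy uses to create the spider
    start_requests(): generates the initial requests, by default one per URL in start_urls
    make_requests_from_url(url): builds a Request for a single URL; start_requests() historically called it for each URL in start_urls (deprecated in later Scrapy releases)
    parse(response): the default callback, used to parse the page content
    log(message[, level, component]): writes a log message; prefer the logger attribute instead, e.g. self.logger.info('visited success')
    closed(reason): called when the spider closes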
    
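    Subclasses:
    The main one is CrawlSpider
    1: the most commonly used spider, for crawling regular websites
    2: it adds two members
    1) rules: a set of crawling rules that define which links to follow and which parse callback handles each link
    2) parse_start_url(response): parses the responses for the start URLs
    Example: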
    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']
    
        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(LinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
    
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor(allow=('item.php', )), callback='parse_item'),
        )
    
        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            # A bare scrapy.Item() declares no fields, so item['id'] = ... would raise KeyError; use a dict.
            item = {}
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item
  • Original source: https://www.cnblogs.com/themost/p/7105645.html