  • Attributes and methods of scrapy.Spider

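    Attributes:
    name: the spider's name; must be unique within the project
    allowed_domains: the domains the spider is allowed to crawl, which limits its scope
    start_urls: the initial URLs
    custom_settings: per-spider settings that override the project-wide settings
    crawler: the Crawler object the spider is bound to
    settings: the Settings instance, holding all configuration values for the project
    logger: the logger instance, used to print debug information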
    
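    Methods:
    from_crawler(crawler, *args, **kwargs): class method used to create the spider
    start_requests(): generates the initial requests
    make_requests_from_url(url): builds a Request from a single URL; start_requests() calls it for each URL in start_urls (deprecated in later Scrapy releases)
    parse(response): parses the page content
    log(message[, level, component]): records a log message; prefer the logger attribute instead, e.g. self.logger.info('visited success')
    closed(reason): called when the spider closes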
    
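    Subclasses:
    The main one is CrawlSpider
    1: the most commonly used spider, for crawling regular websites
    2: it adds two members
    1) rules: defines the crawling rules, i.e. how links are followed and which parse function handles each link
    2) parse_start_url(response): parses the responses for the start URLs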
    Example:
    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']
    
        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(LinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
    
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor(allow=('item.php', )), callback='parse_item'),
        )
    
        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            # A bare scrapy.Item() declares no fields, so a plain dict is used here
            item = {}
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item
  • Original article: https://www.cnblogs.com/themost/p/7105645.html