  • Scrapy spiders

    Official documentation: https://docs.scrapy.org/en/latest/topics/spiders.html#

    One-sentence summary: a spider is where you define the crawling behaviour (whether to follow new links) and parse the page structure (extract data and return items).

    I scrapy.Spider

      1 name

      2 allowed_domains  <----------------------->  OffsiteMiddleware

      3 start_urls  <-----------------------> start_requests()

      4 custom_settings  <-------------------------> Built-in settings reference

      It must be defined as a class attribute since the settings are updated before instantiation.

    import scrapy

    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        # domains only, without the URL scheme
        allowed_domains = ['www.baidu.com']
        start_urls = ['https://www.baidu.com/']
        # keys are setting names from the Built-in settings reference
        custom_settings = {
            'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        }

        def parse(self, response):
            pass

      5 crawler <----------> from_crawler()

      6 settings

      7 logger
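      crawler, settings and logger are all available on the spider instance once it is running. A minimal sketch of how they are typically read (the spider name and URL are placeholders):

    import scrapy

    class AttrsDemoSpider(scrapy.Spider):
        name = 'attrs_demo'
        start_urls = ['http://www.example.com/']

        def parse(self, response):
            # self.settings: read-only access to the merged project settings
            delay = self.settings.getfloat('DOWNLOAD_DELAY')
            # self.logger: a standard Python logger named after the spider
            self.logger.info('DOWNLOAD_DELAY is %s', delay)
            # self.crawler: crawler-level components, e.g. the stats collector
            self.crawler.stats.inc_value('attrs_demo/pages_seen')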

      8 from_crawler(crawler, *args, **kwargs)

      This is the class method used by Scrapy to create your spiders.
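      A common reason to override it is to hook up signals as the spider is created. A hedged sketch (the handler below just logs on close):

    import scrapy
    from scrapy import signals

    class SignalSpider(scrapy.Spider):
        name = 'signal_demo'

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            # let the base class build the spider and bind the crawler
            spider = super().from_crawler(crawler, *args, **kwargs)
            # then connect a method to the spider_closed signal
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            self.logger.info('Spider closed: %s', spider.name)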

      9 start_requests()

      It is called by Scrapy when the spider is opened for scraping. 

      Core code (the default implementation):

    for url in self.start_urls:
        yield Request(url, dont_filter=True)

       A note on Request. Below is the signature from the Request source:

    class Request(object_ref):
    
        def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                     cookies=None, meta=None, encoding='utf-8', priority=0,
                     dont_filter=False, errback=None, flags=None):

      As the source shows, Request defaults to a GET request; to send a POST request, you need to override this method. This is where the Request class (and its FormRequest subclass, used below) comes in.

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
    
        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]
    
        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass

      10 parse(response)

      This method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
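      For example, a callback can mix extracted data with follow-up requests. The CSS selectors below assume a hypothetical page layout:

    def parse(self, response):
        # one dict per listed item (selectors are placeholders)
        for row in response.css('div.item'):
            yield {'title': row.css('a::text').get()}
        # follow pagination with the same callback
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)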

       11 log(message[, level, component])

      A wrapper that sends a log message through the spider's logger, kept for backward compatibility.

      12 closed(reason)

      

    II Spider arguments

      Arguments are passed with the crawl command's -a option and reach the spider's __init__ as keyword arguments, as in the sketch below.
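      A hedged sketch (spider name, URL pattern and the category argument are all illustrative):

    import scrapy

    class ArgsSpider(scrapy.Spider):
        name = 'args_demo'

        def __init__(self, category=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # the value passed with -a arrives here as a keyword argument
            self.start_urls = ['http://www.example.com/categories/%s' % category]

      Run it as: scrapy crawl args_demo -a category=electronics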

    III Generic Spiders

      1 CrawlSpider

        Recommended.

        Adds rules, which simplify following links; see the sketch below.
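        A hedged CrawlSpider sketch (domain and link patterns are placeholders). Each Rule pairs a LinkExtractor with an optional callback; note that CrawlSpider uses parse() internally, so callbacks must have other names:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DemoCrawlSpider(CrawlSpider):
        name = 'crawl_demo'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/']

        rules = (
            # keep following category links without parsing them
            Rule(LinkExtractor(allow=r'/category/'), follow=True),
            # parse matched item pages with parse_item
            Rule(LinkExtractor(allow=r'/item/'), callback='parse_item'),
        )

        def parse_item(self, response):
            yield {'url': response.url, 'title': response.css('title::text').get()}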

      2 XMLFeedSpider

      3 CSVFeedSpider

      4 SitemapSpider
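      For completeness, a minimal SitemapSpider sketch (the sitemap URL and rule pattern are placeholders); sitemap_rules routes matching URLs to callbacks:

    from scrapy.spiders import SitemapSpider

    class DemoSitemapSpider(SitemapSpider):
        name = 'sitemap_demo'
        sitemap_urls = ['http://www.example.com/sitemap.xml']
        # route product URLs to a dedicated callback
        sitemap_rules = [('/product/', 'parse_product')]

        def parse_product(self, response):
            yield {'url': response.url}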
