  • scrapy: spiders

    Official documentation: https://docs.scrapy.org/en/latest/topics/spiders.html#

    One-sentence summary: a spider is where you define the crawling behavior (e.g. whether to follow new links) and how to parse the page structure (extract data and return items).

    I. scrapy.Spider

      1 name

      2 allowed_domains  <----------------------->  OffsiteMiddleware

      3 start_urls  <-----------------------> start_requests()

      4 custom_settings  <-------------------------> Built-in settings reference

      It must be defined as a class attribute since the settings are updated before instantiation.

    import scrapy

    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.baidu.com']
        start_urls = ['https://www.baidu.com/']
        custom_settings = {
            # setting names use the upper-case keys from the Built-in settings reference
            'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        }

        def parse(self, response):
            pass

      5 crawler <----------> from_crawler()

      6 settings

      7 logger
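
      A minimal sketch (spider name and URL are placeholders) showing how the settings and logger attributes are typically used inside a callback:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://www.example.com/']

        def parse(self, response):
            # self.settings exposes the merged project settings (read-only here)
            delay = self.settings.getfloat('DOWNLOAD_DELAY')
            # self.logger is a standard logging.Logger named after the spider
            self.logger.info('Crawled %s (download delay: %s)', response.url, delay)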

      8 from_crawler(crawler, *args, **kwargs)

      This is the class method used by Scrapy to create your spiders.
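
      You rarely call from_crawler() yourself, but overriding it is the standard hook for reaching the crawler (settings, signals, stats). A minimal sketch, with an illustrative logging line that is not from the source:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            # let the base class construct the spider as usual
            spider = super().from_crawler(crawler, *args, **kwargs)
            # crawler exposes settings, signals, stats, extensions, ...
            spider.logger.info('CONCURRENT_REQUESTS = %s',
                               crawler.settings.getint('CONCURRENT_REQUESTS'))
            return spider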

      9 start_requests()

      It is called by Scrapy when the spider is opened for scraping. 

      Core code of the default implementation:

    for url in self.start_urls:
        yield Request(url, dont_filter=True)

      A note on Request: the snippet below is the constructor signature from the Request source code.

    class Request(object_ref):
    
        def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                     cookies=None, meta=None, encoding='utf-8', priority=0,
                     dont_filter=False, errback=None, flags=None):

      As the source shows, Request defaults to a GET request; to send a POST request, you need to override this method. This is where FormRequest, a subclass of Request, comes in:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
    
        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]
    
        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass

      10 parse(response)

      This method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
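
      A minimal sketch of a parse() that does both, written against the quotes.toscrape.com practice site (the CSS selectors are assumptions about that site's markup):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # yield extracted data as plain dicts (Item objects also work)
            for quote in response.css('div.quote'):
                yield {'text': quote.css('span.text::text').get()}
            # and/or yield follow-up Requests; response.follow resolves relative URLs
            next_page = response.css('li.next a::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)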

       11 log(message[, level, component])

      12 closed(reason)
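
      Called when the spider finishes, as a shortcut for the spider_closed signal. A minimal sketch (spider name and URL are placeholders):

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['http://www.example.com/']

        def parse(self, response):
            pass

        def closed(self, reason):
            # reason is a string such as 'finished', 'cancelled' or 'shutdown'
            self.logger.info('Spider closed: %s', reason)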

      

    II. Spider arguments

      Spider arguments are passed on the command line with the -a option and are received as keyword arguments by the spider's __init__; a sketch follows.
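
      A minimal sketch (the category argument and URL pattern are assumptions modeled on the official docs):

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def __init__(self, category=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # every -a name=value pair arrives here as a keyword argument
            self.start_urls = ['http://www.example.com/categories/%s' % category]

      Run it as: scrapy crawl myspider -a category=electronics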

    III. Generic Spiders

      1 CrawlSpider

        Recommended.

        Adds rules, which simplify following links and extracting from the pages they lead to; a sketch follows.
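
        A minimal sketch (the URL patterns are assumptions). Note that CrawlSpider uses parse() internally, so the Rule callback must have a different name:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MyCrawlSpider(CrawlSpider):
        name = 'mycrawl'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/']

        rules = (
            # follow category pages without extracting anything from them
            Rule(LinkExtractor(allow=r'/category/'), follow=True),
            # send item pages to parse_item
            Rule(LinkExtractor(allow=r'/item/\d+'), callback='parse_item'),
        )

        def parse_item(self, response):
            yield {'url': response.url, 'title': response.css('title::text').get()}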

      2 XMLFeedSpider

      3 CSVFeedSpider

      4 SitemapSpider
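
      For instance, SitemapSpider discovers URLs from a site's sitemap. A minimal sketch (the sitemap URL is a placeholder); by default every discovered URL is dispatched to parse():

    from scrapy.spiders import SitemapSpider

    class MySitemapSpider(SitemapSpider):
        name = 'mysitemap'
        sitemap_urls = ['http://www.example.com/sitemap.xml']

        def parse(self, response):
            yield {'url': response.url, 'title': response.css('title::text').get()}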
