  • Scrapy Framework

    Source: http://doc.scrapy.org/en/latest/topics/architecture.html

    Overview

    The following diagram shows an overview of the Scrapy architecture with its components and an outline of the data flow that takes place inside the system.

    Components

    Scrapy Engine

    The engine is responsible for controlling the data flow between all components of the system, and triggering events when certain actions occur.

    Scheduler

    The scheduler receives requests from the engine and enqueues them, feeding them back later when the engine asks for them.

    Downloader

    The downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to the spiders.

    Spiders

    Spiders are custom classes written by Scrapy users to parse responses and extract items from them, or additional URLs to follow. Each spider handles a specific domain (or group of domains).
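
    For illustration, a minimal spider might look like the sketch below. It targets the quotes.toscrape.com practice site; the spider name, CSS selectors, and item fields are assumptions made for the example, not part of the architecture description.

        import scrapy


        class QuotesSpider(scrapy.Spider):
            """A minimal spider: extracts items and follow-up URLs from responses."""
            name = "quotes"
            allowed_domains = ["quotes.toscrape.com"]
            start_urls = ["http://quotes.toscrape.com/"]

            def parse(self, response):
                # Extract items from the response...
                for quote in response.css("div.quote"):
                    yield {
                        "text": quote.css("span.text::text").get(),
                        "author": quote.css("small.author::text").get(),
                    }
                # ...and additional URLs to follow.
                next_page = response.css("li.next a::attr(href)").get()
                if next_page is not None:
                    yield response.follow(next_page, callback=self.parse)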

    Item Pipeline

    The Item Pipeline is responsible for processing the items once they have been extracted by the spiders.
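
    A minimal pipeline sketch, assuming items carry a "text" field as in the spider example above (the class name and the validation rule are hypothetical):

        from scrapy.exceptions import DropItem


        class TextCleanupPipeline:
            """Validates and cleans each item the spiders extract."""

            def process_item(self, item, spider):
                # Drop incomplete items; pass cleaned ones downstream.
                if not item.get("text"):
                    raise DropItem("missing text field")
                item["text"] = item["text"].strip()
                return item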

    Downloader middlewares

    Downloader middlewares are specific hooks that sit between the Engine and the Downloader and process requests as they pass from the Engine to the Downloader, and responses as they pass from the Downloader to the Engine. They provide a convenient mechanism for extending Scrapy functionality by plugging in custom code.
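
    As a sketch, a downloader middleware implementing both hooks might look like this (the class name and the header value are made up for illustration):

        class CustomHeaderMiddleware:
            """Touches requests on the way out and responses on the way back."""

            def process_request(self, request, spider):
                # Called for each request passing from the Engine to the Downloader.
                request.headers.setdefault("User-Agent", "my-crawler/1.0")
                return None  # None means: continue handling this request normally

            def process_response(self, request, response, spider):
                # Called for each response passing from the Downloader to the Engine.
                spider.logger.debug("got %s for %s", response.status, request.url)
                return response

    To take effect, such a class is listed in the DOWNLOADER_MIDDLEWARES setting together with an order number that determines its position in the chain.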

    Spider middlewares

    Spider middlewares are specific hooks that sit between the Engine and the Spiders and are able to process spider input (responses) and output (items and requests). They provide a convenient mechanism for extending Scrapy functionality by plugging in custom code.
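
    A corresponding sketch for the spider side (again with a hypothetical class name), enabled through the SPIDER_MIDDLEWARES setting:

        class LoggingSpiderMiddleware:
            """Observes spider input (responses) and output (items and requests)."""

            def process_spider_input(self, response, spider):
                # Called for each response before it reaches the spider.
                spider.logger.debug("response received: %s", response.url)
                return None

            def process_spider_output(self, response, result, spider):
                # Called with whatever the spider returned; must yield it onward.
                for element in result:
                    yield element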

    Data flow

    The data flow in Scrapy is controlled by the execution engine, and goes like this:

    [Figure: Scrapy architecture diagram]

    1. The Engine opens a domain, locates the Spider that handles that domain, and asks the spider for the first URLs to crawl.
    2. The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.
    3. The Engine asks the Scheduler for the next URLs to crawl.
    4. The Scheduler returns the next URLs to crawl to the Engine, and the Engine sends them to the Downloader, passing through the Downloader Middleware.
    5. Once the page finishes downloading, the Downloader generates a Response and sends it to the Engine, passing through the Downloader Middleware.
    6. The Engine receives the Response from the Downloader Middleware and sends it to the Spider for processing, passing through the Spider Middleware.
    7. The Spider processes the Response and returns scraped items and new Requests to the Engine.
    8. The Engine sends scraped items to the Item Pipeline and Requests to the Scheduler.
    9. The process repeats (from step 2) until there are no more requests in the Scheduler, and the Engine closes the domain. (A simplified sketch of this loop follows below.)
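
    The following is not Scrapy's actual implementation, only a single-threaded toy sketch of the loop just described; the Request class here, and the fetch and process_item callables, are hypothetical stand-ins for the Downloader and the Item Pipeline:

        from collections import deque


        class Request:
            """Stand-in for scrapy.Request: a URL plus the callback to parse it."""
            def __init__(self, url, callback):
                self.url = url
                self.callback = callback


        def crawl(start_requests, fetch, process_item):
            scheduler = deque(start_requests)               # steps 1-2: schedule first URLs
            while scheduler:                                # step 9: stop when empty
                request = scheduler.popleft()               # steps 3-4: next request
                response = fetch(request)                   # step 5: downloader fetches page
                for result in request.callback(response):   # steps 6-7: spider parses
                    if isinstance(result, Request):
                        scheduler.append(result)            # step 8: new request to schedule
                    else:
                        process_item(result)                # step 8: scraped item to pipeline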

    Event-driven networking

    Scrapy is written with Twisted, a popular event-driven networking framework for Python.
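
    As a minimal illustration of the event-driven style (plain Twisted, not Scrapy code): instead of blocking in a sleep, a callback is scheduled and the reactor's event loop dispatches it when the timer fires.

        from twisted.internet import reactor


        def say_hello():
            print("timer fired; the reactor dispatched this callback")
            reactor.stop()  # shut the event loop down


        reactor.callLater(1, say_hello)  # schedule an event instead of blocking
        reactor.run()                    # enter the event loop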
