• Scrapy Framework Basics

1. Scrapy workflow

(Source: https://www.cnblogs.com/hewanli/p/12090042.html)

1) When a spider (Spider) wants to crawl a URL, it builds a Request object from that URL, sets a callback function, and submits the Request to the engine (Scrapy Engine).

The spider's initial Requests are obtained by calling start_requests(), which reads the URLs in start_urls and generates a Request for each one with parse as the callback.

2) The Request enters the scheduler (Scheduler), which queues it according to some policy; the scheduler later dequeues it and hands it to the downloader.

3) The downloader (Downloader) sends an HTTP request to the web server for the URL in the Request, downloads the resource, and wraps it in a Response object.

4) The Response object is eventually delivered to the spider's page-parsing callback for processing.

5) If the callback produces an item (Item), it is passed to the item pipeline (Item Pipeline) for further processing. Items returned by the spider can be stored in a database (by some item pipeline) or written to a file via Feed exports.

6) If the callback produces a link (URL), the URL is handed back to the scheduler (Scheduler) and waits to be crawled.

That is the running flow of the Scrapy framework, i.e. how it works: Request and Response objects are its bloodstream, and Items are what it produces. A minimal spider that follows this flow is sketched right below.
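A minimal sketch of this flow, using the quotes.toscrape.com site that the rest of this post uses (the spider name and selectors here are illustrative only; the full example is developed in section 2):

    import scrapy

    class MiniQuotesSpider(scrapy.Spider):
        name = "mini_quotes"
        # start_requests() turns these URLs into Requests with parse as the callback (steps 1-2)
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # step 5: yielded items are handed to the item pipelines / feed exports
            for text in response.css(".quote .text::text").getall():
                yield {"text": text}
            # step 6: yielded Requests go back to the scheduler and wait to be crawled
            next_page = response.css(".pager .next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)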

(1) Installation:

    pip install scrapy

(2) Create a project

2.1 Use the cd command to enter the directory where the crawler project should be created.

2.2 Create the project: scrapy startproject scrapystudy (project name)

(3) Create a spider: scrapy genspider xxx (spider name) xxx.com (the domain to crawl; give only the domain, without the leading http:// or https://)

(4) Run the spider: scrapy crawl xxx (spider name)

(5) Project directory

Overview of the generated files:
   items.py: a data structure (container) used to hold the scraped data
   middlewares.py: middlewares defined for the crawl; they can process requests and responses
   pipelines.py: item pipelines that receive the items the spider extracts from the pages; they are mainly responsible for cleaning, validating, and storing the data in a database
   settings.py: configuration for the project
(A sample command session and the resulting directory layout are sketched below.)
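For reference, an end-to-end session with the names used in this post might look like the following; /path/to/workspace is a placeholder, and the exact layout may differ slightly between Scrapy versions:

    cd /path/to/workspace
    scrapy startproject scrapystudy
    cd scrapystudy
    scrapy genspider quotes quotes.toscrape.com
    scrapy crawl quotes

    scrapystudy/
    ├── scrapy.cfg                # deploy configuration
    └── scrapystudy/
        ├── __init__.py
        ├── items.py              # item definitions (section 2.1)
        ├── middlewares.py        # spider/downloader middlewares
        ├── pipelines.py          # item pipelines (section 2.3)
        ├── settings.py           # project settings (section 2.4)
        └── spiders/
            ├── __init__.py
            └── quotes.py         # the spider created by scrapy genspider (section 2.2)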

2. A simple application

Example site to crawl: http://quotes.toscrape.com/

2.1 Define the item fields (the fields to be scraped) in items.py, as follows:

    # Define here the models for your scraped items
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy

    class QuotesItem(scrapy.Item):
        # define the fields for your item here like:
        # fields used to store the scraped content
        text = scrapy.Field()
        author = scrapy.Field()
        tags = scrapy.Field()
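QuotesItem behaves like a dictionary: fields are set and read with item["field"]. A quick illustration (not part of the project files):

    from scrapystudy.items import QuotesItem

    item = QuotesItem(text="some quote", author="someone", tags=["life"])
    print(item["author"])   # dict-style access
    print(dict(item))       # plain dict, which is how the MongoDB pipeline in section 2.3 stores it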

2.2 The spider file quotes.py parses the page and extracts the results, as follows:

    import scrapy
    from scrapystudy.items import QuotesItem

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            quotes = response.css(".quote")
            for quote in quotes:
                items = QuotesItem()
                text = quote.css(".text::text").extract_first()  # extract_first() returns the first match, extract() returns all matches
                author = quote.css(".author::text").extract_first()
                tags = quote.css(".tags .tag::text").extract()  # a quote can have several tags
                # assign the values to the item fields
                items["text"] = text
                items["author"] = author
                items["tags"] = tags
                yield items
            # handle pagination
            next_page = response.css(".pager .next a::attr(href)").extract_first()
            if next_page is not None:  # on the last page there is no "next" link
                url = response.urljoin(next_page)  # build an absolute URL
                yield scrapy.Request(url=url, callback=self.parse)  # callback: parse() is reused for the next page
            # to save the data, run: scrapy crawl quotes -o quotes.json
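Note: in current Scrapy versions, .get() and .getall() are the preferred aliases for .extract_first() and .extract(); both spellings work, so the selectors above could equally be written as:

    text = quote.css(".text::text").get()           # same as extract_first()
    tags = quote.css(".tags .tag::text").getall()   # same as extract()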

Saving the data:

    # Run any of the following commands to save the output in the corresponding format
    scrapy crawl quotes -o quotes.json
    scrapy crawl quotes -o quotes.jl
    scrapy crawl quotes -o quotes.csv
    scrapy crawl quotes -o quotes.xml

    # Saving to a JSON file with Chinese content: when using scrapy crawl -o file.json, Chinese
    # characters are written as Unicode escape sequences, so the feed export encoding needs to be
    # switched to UTF-8.

    Conversion command: scrapy crawl quotes -o quotes.json -s FEED_EXPORT_ENCODING=UTF-8
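Instead of passing -s on every run, the encoding can also be set once in settings.py (FEED_EXPORT_ENCODING is a standard Scrapy setting):

    # settings.py
    FEED_EXPORT_ENCODING = "utf-8"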

2.3 pipelines.py limits the length of the item text and stores the items in MongoDB, as follows:

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
    import pymongo

    class TextPipeline:

        def __init__(self):
            self.limit = 50

        def process_item(self, item, spider):
            "Truncate the item's text field to a maximum length"
            if item['text']:
                if len(item['text']) > self.limit:
                    item['text'] = item['text'][0:self.limit].rstrip() + '...'
                return item
            else:
                raise DropItem('Missing Text')  # raising DropItem discards the item

    class MongoPipeline(object):
        "Store the items in MongoDB"

        def __init__(self, mongo_url, mongo_db):
            self.mongo_url = mongo_url
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_url = crawler.settings.get('MONGO_URL'),
                mongo_db = crawler.settings.get('MONGO_DB')
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_url)
            self.db = self.client[self.mongo_db]

        def process_item(self, item, spider):
            name = item.__class__.__name__  # collection name, e.g. QuotesItem
            self.db[name].insert_one(dict(item))
            return item

        def close_spider(self, spider):
            self.client.close()
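After a crawl, the stored documents can be checked with a few lines of pymongo (a quick sanity check, assuming MongoDB is running locally with the MONGO_URL/MONGO_DB values configured in section 2.4; the collection name QuotesItem comes from item.__class__.__name__):

    import pymongo

    client = pymongo.MongoClient("localhost")
    db = client["mydb"]
    print(db["QuotesItem"].count_documents({}))  # number of stored quotes
    print(db["QuotesItem"].find_one())           # one sample document
    client.close()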

2.4 Enable and configure everything in settings.py:

    # Scrapy settings for scrapystudy project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'scrapystudy'

    SPIDER_MODULES = ['scrapystudy.spiders']
    NEWSPIDER_MODULE = 'scrapystudy.spiders'

    MONGO_URL = "localhost"
    MONGO_DB = "mydb"


    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'scrapystudy (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}

    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'scrapystudy.middlewares.ScrapystudySpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'scrapystudy.middlewares.ScrapystudyDownloaderMiddleware': 543,
    #}

    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'scrapystudy.pipelines.TextPipeline': 300,
       'scrapystudy.pipelines.MongoPipeline': 400,
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
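Note on ITEM_PIPELINES: the numbers (300 and 400 above) are priorities, and pipelines with lower numbers run first, so TextPipeline trims the text before MongoPipeline writes the item to MongoDB. MONGO_URL and MONGO_DB near the top of the file are the custom settings that MongoPipeline reads through from_crawler().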
• Original post: https://www.cnblogs.com/yzmPython/p/14230799.html