  • Basic usage of the Scrapy framework

    PyCharm + Scrapy

    It has been more than half a year since I last used Scrapy, so it is time to pick it back up.

    Straight to the point, here is the scraping target:

    Start URL: http://quotes.toscrape.com/

    Goal: parse the quote text, author, and tags from every entry on every page, and save them to a JSON file or a MongoDB database.

    Open a terminal and run

    scrapy startproject quotetutorial      # generates a project named quotetutorial in the current directory

    Then cd quotetutorial and run

    scrapy genspider quotes quotes.toscrape.com      # create a spider for the target site

    The project structure now looks like this:
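
    (Roughly, a project freshly generated by the two commands above has the standard Scrapy layout:)

    quotetutorial/
        scrapy.cfg
        quotetutorial/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                quotes.py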

    A quick explanation of each:

    items: defines the Item classes used to store scraped data

    settings: project configuration variables

    pipelines: process the Items extracted by the Spider; typical uses are cleaning HTML data, validating scraped data (checking that an Item contains certain fields), filtering out duplicates, and saving results to a file or a database (a small duplicates-filter sketch follows this list)

    middlewares: middleware components

    spiders > quotes: the spider module
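
    To illustrate the "filter out duplicates" use case, here is a minimal, hypothetical pipeline sketch. The class name DuplicatesPipeline and the choice of keying on the quote text are my own; it is not part of the project built below:

    from scrapy.exceptions import DropItem

    class DuplicatesPipeline(object):
        """Drop items whose 'text' field has already been seen during this crawl."""

        def __init__(self):
            self.seen_texts = set()

        def process_item(self, item, spider):
            if item['text'] in self.seen_texts:
                raise DropItem('Duplicate quote: %s' % item['text'])
            self.seen_texts.add(item['text'])
            return item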

    Next, edit quotes.py:

    # -*- coding: utf-8 -*-
    import scrapy
    from quotetutorial.items import QuotetutorialItem
    
    
    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']
    
        def parse(self, response):
            quotes = response.css('.quote')
            for quote in quotes:
                item = QuotetutorialItem()
                text = quote.css('.text::text').extract_first()
                author = quote.css('.author::text').extract_first()
                tags = quote.css('.tags .tag::text').extract()
                item['text'] = text
                item['author'] = author
                item['tags'] = tags
                yield item
    
            next_page = response.css('.pager .next a::attr(href)').extract_first()  # relative URL of the next page
            if next_page is not None:  # extract_first() returns None on the last page
                url = response.urljoin(next_page)  # join it with the base URL
                yield scrapy.Request(url=url, callback=self.parse)  # recurse with the same parse callback
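
    Each quote box on a page produces one item; on the first page the first yielded item looks roughly like this (text abbreviated):

    {'text': '“The world as we have created it is a process of our thinking. ...”',
     'author': 'Albert Einstein',
     'tags': ['change', 'deep-thoughts', 'thinking', 'world']}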

    Next, pipelines.py:

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    from scrapy.exceptions import DropItem
    from pymongo import MongoClient
    
    
    class TextPipeline(object):  # post-process the item: cap the length of the text field
        def __init__(self):
            self.limit = 50
    
        def process_item(self, item, spider):
            if item['text']:
                if len(item['text']) > self.limit:
                    item['text'] = item['text'][0:self.limit].rstrip() + '...'
                return item
            else:
                raise DropItem('Missing Text')  # DropItem must be raised, not returned
    
    
    class MongoPipeline(object):  # save items to MongoDB
    
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
    
        @classmethod
        def from_crawler(cls, crawler):
            # read the connection settings defined in settings.py
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DB')
            )
    
        def open_spider(self, spider):
            self.client = MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
    
        def process_item(self, item, spider):
            name = item.__class__.__name__  # collection named after the item class, e.g. QuotetutorialItem
            self.db[name].insert_one(dict(item))  # insert_one() replaces the deprecated insert()
            return item
    
        def close_spider(self, spider):
            self.client.close()
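
    After a crawl you can check the stored documents from the mongo shell (assuming a local MongoDB instance and the MONGO_DB name configured below; the collection is named after the item class):

    > use quotestutorial
    > db.QuotetutorialItem.find().count()
    > db.QuotetutorialItem.findOne()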

    Then items.py:

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class QuotetutorialItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        text = scrapy.Field()
        author = scrapy.Field()
        tags = scrapy.Field()

    Then modify settings.py:

    SPIDER_MODULES = ['quotetutorial.spiders']
    NEWSPIDER_MODULE = 'quotetutorial.spiders'
    
    MONGO_URI = 'localhost'
    MONGO_DB = 'quotestutorial'
    
    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'quotetutorial.pipelines.TextPipeline': 300,   # lower number = higher priority, runs first
        'quotetutorial.pipelines.MongoPipeline': 400,
    }
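
    With everything wired up, the crawl can be started from the project root. The JSON file name below is only an example; -o uses Scrapy's built-in feed export, while the MongoDB writes happen through MongoPipeline:

    scrapy crawl quotes                     # run the spider; items go through the pipelines into MongoDB
    scrapy crawl quotes -o quotes.json      # additionally export the items to a JSON file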

    A few things worth noting here:

    Scrapy has its own data-extraction mechanism, called Selectors, which parse HTML via XPath or CSS expressions; the usage is the same as for ordinary selectors. You can also try them out interactively, as in the shell example below.
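
    Before writing selectors into the spider, it is handy to experiment in the Scrapy shell (a quick illustrative session; output omitted):

    scrapy shell 'http://quotes.toscrape.com/'
    >>> response.css('.quote .text::text').extract_first()
    >>> response.xpath(".//*[@class='quote']//span[@class='text']/text()").extract_first()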

    Swapping the CSS selectors for XPath gives:

        def parse(self, response):
            quotes = response.xpath(".//*[@class='quote']")
            for quote in quotes:
                item = QuotetutorialItem()
                # CSS versions for comparison:
                # text = quote.css('.text::text').extract_first()
                # author = quote.css('.author::text').extract_first()
                # tags = quote.css('.tags .tag::text').extract()
                text = quote.xpath(".//span[@class='text']/text()").extract()[0]
                author = quote.xpath(".//span/small[@class='author']/text()").extract()[0]
                tags = quote.xpath(".//div[@class='tags']/a/text()").extract()
                item['text'] = text
                item['author'] = author
                item['tags'] = tags
                yield item
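
    The pagination logic stays the same as in the CSS version; if you want it in XPath as well, something like this should work at the end of parse (a sketch, assuming the pager keeps its ul.pager > li.next > a structure):

    next_page = response.xpath(".//ul[@class='pager']/li[@class='next']/a/@href").extract_first()
    if next_page is not None:
        yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
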
    Life is short, why not use Python?
  • Original article: https://www.cnblogs.com/yqpy/p/8694866.html