  • Crawler: scraping Qiushibaike with Scrapy and persisting the data to a txt file

    Project directory structure
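
    (The original screenshot of the layout is not reproduced here; a typical layout for this project, sketched from the files shown below, looks roughly like this:)

    firstBlood/
    ├── scrapy.cfg
    └── firstBlood/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            ├── __init__.py
            └── first.py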

     Source of the first spider (under spiders/)

      

    # -*- coding: utf-8 -*-
    import scrapy
    from firstBlood.items import FirstbloodItem
    class FirstSpider(scrapy.Spider):
        # Name of the spider.
        # When a project contains several spider files, this name identifies which one to run.
        name = 'first'
        # allowed_domains: restricts requests to the listed domains; commented out here so requests are not filtered.
        #allowed_domains = ['www.xxx.com']
        # start_urls: list of URLs that Scrapy requests automatically when the spider starts.
        start_urls = ['https://www.qiushibaike.com/text/']
        def parse(self, response):
            '''
            Parse the response of each request.
            Regular expressions or XPath can be used; since Scrapy has built-in XPath support, XPath is recommended.
            Each XPath query returns a list of Selector objects.
            :param response:
            :return:
            '''
            all_data = []
            div_list=response.xpath('//div[@id="content-left"]/div')
            for div in div_list:
                # author = div.xpath('./div[1]/a[2]/h2/text()')
                # The query returns Selector objects rather than raw source data;
                # the text itself is pulled out of a Selector with extract():
                # author = author[0].extract()
                # That version breaks on anonymous posts (the markup for anonymous users
                # differs from that of logged-in users), hence the improved XPath below.
    
                author = div.xpath('./div[1]/a[2]/h2/text()| ./div[1]/span[2]/h2/text()')[0].extract()
                content=div.xpath('.//div[@class="content"]/span//text()').extract()
                content=''.join(content)
                # print(author + ':' + content.strip())
    
    
            # Terminal-based storage (commented-out version):
            #     dic = {
            #         'author': author,
            #         'content': content
            #     }
            #     all_data.append(dic)
            # return all_data
            # Two ways to persist the scraped data:
                # 1. Feed export via a terminal command: parse() must return the data, e.g.
                #    scrapy crawl first -o qiubai.csv --nolog
                #    (only formats such as json, csv and xml can be stored this way)
                # 2. Item pipeline (used here):
                item = FirstbloodItem()  # inside the loop: one item instance per entry
                item['author'] = author
                item['content'] = content
                yield item  # hand the item over to the pipeline
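
    A minimal sketch of the terminal-based persistence variant mentioned in the comments above: parse() returns a list of dicts instead of yielding items, and the data is exported with a feed-export command such as scrapy crawl first -o qiubai.csv --nolog (only formats like json, csv and xml are supported this way).

        def parse(self, response):
            all_data = []
            div_list = response.xpath('//div[@id="content-left"]/div')
            for div in div_list:
                author = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()')[0].extract()
                content = ''.join(div.xpath('.//div[@class="content"]/span//text()').extract())
                # collect plain dicts; the feed exporter serializes the returned list
                all_data.append({'author': author, 'content': content})
            return all_data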

    items.py

      

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class FirstbloodItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # Item fields behave like dict keys and can hold values of any type (strings, dicts/JSON, etc.);
        # only the fields declared here may be assigned.
        author = scrapy.Field()
        content = scrapy.Field()
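
    A small illustration (with hypothetical values) of how the item behaves: fields are accessed like dict keys, values may be of any type, and assigning an undeclared field raises a KeyError.

        from firstBlood.items import FirstbloodItem

        item = FirstbloodItem()
        item['author'] = 'some author'        # declared field: OK
        item['content'] = 'some joke text'    # values can be any type
        print(dict(item))                     # convert to a plain dict if needed
        # item['title'] = 'oops'              # KeyError: 'title' is not a declared field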

    pipelines.py

      

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    # All code related to persistent storage belongs in this file.
    class FirstbloodPipeline(object):
        fp=None
        def open_spider(self, spider):
            # called once when the spider starts: open the output file
            print('Spider started')
            self.fp = open('./qiushibaike.txt', 'w', encoding='utf-8')
        def process_item(self, item, spider):
            '''
            Process one item: write author and content to the txt file.
            :param item:
            :param spider:
            :return:
            '''
            self.fp.write(item['author'] + ':' + item['content'] + '\n')
            print(item['author'],item['content'])
            return item
        def close_spider(self, spider):
            # called once when the spider finishes: close the output file
            print('Spider finished')
            self.fp.close()
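
    If a second storage target is needed, another pipeline class can be added to the same file; each pipeline must return the item so the next one receives it. A minimal sketch (the JsonWriterPipeline name and the output path are assumptions, not part of the original project):

        import json

        class JsonWriterPipeline(object):
            fp = None
            def open_spider(self, spider):
                self.fp = open('./qiushibaike.json', 'w', encoding='utf-8')
            def process_item(self, item, spider):
                # serialize the item as one JSON object per line
                self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
                return item  # pass the item on to any later pipeline
            def close_spider(self, spider):
                self.fp.close()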

    settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for firstBlood project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'firstBlood'
    
    SPIDER_MODULES = ['firstBlood.spiders']
    NEWSPIDER_MODULE = 'firstBlood.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    
    # Obey robots.txt rules
    # Defaults to True; set to False so the crawler does not obey robots.txt rules (they would otherwise block these requests).
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'firstBlood.middlewares.FirstbloodSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'firstBlood.middlewares.FirstbloodDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'firstBlood.pipelines.FirstbloodPipeline': 300,  # 300 is the priority: lower values run earlier (see the sketch after this file)
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
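
    To run more than one pipeline, each class is registered in ITEM_PIPELINES with a priority number; lower numbers run earlier. A sketch assuming the hypothetical JsonWriterPipeline from the pipelines section above:

        ITEM_PIPELINES = {
           'firstBlood.pipelines.FirstbloodPipeline': 300,   # runs first (lower value)
           'firstBlood.pipelines.JsonWriterPipeline': 400,   # runs after the txt pipeline
        }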
  • Original article: https://www.cnblogs.com/SunIan/p/10328276.html