  • Scrapy example: crawling weather, temperature, and related data

    1. Create the project

    scrapy startproject weather  # weather is the project name
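
    Running this command creates the usual Scrapy project skeleton (the exact set of files varies slightly between Scrapy versions):

    weather/
        scrapy.cfg            # deploy configuration
        weather/              # the project's Python package
            __init__.py
            items.py          # item field definitions (step 3)
            middlewares.py    # spider / downloader middlewares (not used here)
            pipelines.py      # item pipelines (step 5)
            settings.py       # project settings (step 4)
            spiders/          # spider modules go here
                __init__.py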

    2. Determine the crawl targets:

    How a crawler built with Scrapy works:

    When you run scrapy crawl spidername, Scrapy builds Requests from start_urls and sends them, then calls the parse callback on each response. During parsing, the rules are applied to extract matching links from the HTML (or XML) text; every extracted link generates a new Request, and the cycle repeats until the returned pages contain no more matching links or the scheduler runs out of Request objects, at which point the crawl stops.

    allowed_domains: as the name suggests, the allowed domains; the spider only crawls URLs under these domains.

    rules: define the crawl rules; the spider only crawls URLs that match one of them.

      Each Rule takes an allow argument written as a regular expression. If you are not fluent in regular expressions, write the pattern and check it with an online regex tester; after a few attempts simple patterns become easy, and the one needed here is not complicated.

      A Rule also takes a callback argument naming the function to call whenever a matching URL is found; note that it must not be the default callback parse. (The scraped data is visible in the command-line output as the spider runs.)

      Finally, a Rule has a follow attribute: when True, the spider keeps following every matching URL found on each page; when False it does not. I initially set it to False to keep test runs short, since following everything yields roughly two thousand weather records; the final spider below uses follow=True.

    import scrapy
    from weather.items import WeatherItem
    from scrapy.spiders import Rule, CrawlSpider
    from scrapy.linkextractors import LinkExtractor
    
    class Spider(CrawlSpider):
        name = 'weatherSpider'
        #allowed_domains = ["www.weather.com.cn"]
        start_urls = [
            #"http://www.weather.com.cn/weather1d/101020100.shtml#search"
            "http://www.weather.com.cn/forecast/"
        ]
        rules = (
            #Rule(LinkExtractor(allow=(r'http://www.weather.com.cn/weather1d/101\d{6}.shtml#around2')), follow=False, callback='parse_item'),
            Rule(LinkExtractor(allow=(r'http://www.weather.com.cn/weather1d/101\d{6}.shtml$')), follow=True, callback='parse_item'),
        )
    
        # When crawling multiple pages, the callback needs a custom name;
        # it must not be called parse (CrawlSpider uses parse internally).
        def parse_item(self, response):
            item = WeatherItem()
            # Province or municipality name (second breadcrumb link)
            item['city'] = response.xpath("//div[@class='crumbs fl']/a[2]/text()").extract_first()
            # Alternative for non-municipalities: the province is the second-to-last link
            #item['city'] = response.xpath("//div[@class='crumbs fl']/a[last()-1]/text()").extract_first()
    
            # City name (last breadcrumb link; equals item['city'] for municipalities)
            item['city_addition'] = response.xpath("//div[@class='crumbs fl']/a[last()]/text()").extract_first()
    
            #item['city_addition2'] = response.xpath("//div[@class='crumbs fl']/span[3]/text()").extract_first()
    
            # The hidden input's value begins with the date
            weatherData = response.xpath("//div[@class='today clearfix']/input[1]/@value").extract_first()
            item['data'] = weatherData[0:6]  # date
            print("data:" + item['data'])
            item['weather'] = response.xpath("//p[@class='wea']/text()").extract_first()  # weather description
            item['temperatureMax'] = response.xpath("//ul[@class='clearfix']/li[1]/p[@class='tem']/span[1]/text()").extract_first()  # max temperature
            item['temperatureMin'] = response.xpath("//ul[@class='clearfix']/li[2]/p[@class='tem']/span[1]/text()").extract_first()  # min temperature
            yield item


    spider.py is, as the name implies, the spider file.

    Before filling in spider.py, let's look at how to extract the information we need.

    The command prompt from earlier should still be open; it does not matter if you closed it.

    Press Win+R, open cmd again, and type: scrapy shell http://www.weather.com.cn/weather1d/101020100.shtml#search  (the address is the URL you want to crawl)

    This is Scrapy's shell command. It lets you inspect and debug the site's response without launching the whole spider, and it is mainly used to test the XPath expressions that extract elements; a short sample session is sketched below.
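
    Inside the shell, response holds the downloaded page, so you can try the exact expressions used in parse_item above before committing them to the spider, for example:

    >>> response.xpath("//div[@class='crumbs fl']/a[2]/text()").extract_first()       # province / municipality
    >>> response.xpath("//div[@class='crumbs fl']/a[last()]/text()").extract_first()  # city name
    >>> response.xpath("//p[@class='wea']/text()").extract_first()                    # weather description
    >>> exit()   # leave the shell when done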

    3. Fill in items.py

    items.py is only used to declare the fields you want to collect:

    Give each piece of information you want a name:

    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    import scrapy
    
    class WeatherItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        city = scrapy.Field()
        city_addition = scrapy.Field()
        city_addition2 = scrapy.Field()
        weather = scrapy.Field()
        data = scrapy.Field()
        temperatureMax = scrapy.Field()
        temperatureMin = scrapy.Field()
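
    Each Field declared here becomes a key the spider is allowed to assign; a WeatherItem otherwise behaves like a dict. A quick, purely illustrative check in a Python shell:

    >>> from weather.items import WeatherItem
    >>> item = WeatherItem()
    >>> item['city'] = 'Jiangsu'     # declared above, so assignment works
    >>> item['country'] = 'China'    # not declared above: Scrapy raises KeyError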

    The pipeline classes (written in step 5 below) also have to be enabled in the settings.py configuration file:

    4. Fill in settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for weather project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'weather'
    
    SPIDER_MODULES = ['weather.spiders']
    NEWSPIDER_MODULE = 'weather.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'weather (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 1
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'weather.middlewares.WeatherSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'weather.middlewares.WeatherDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'weather.pipelines.TxtPipeline': 600,
        #'weather.pipelines.JsonPipeline': 6,
        #'weather.pipelines.ExcelPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    5. Fill in pipelines.py

    To actually save the scraped data, you also need to write pipelines.py:

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import os
    import codecs
    import json
    import csv
    from scrapy.exporters import JsonItemExporter
    from openpyxl import Workbook
    
    base_dir = os.getcwd()
    filename = os.path.join(base_dir, 'weather.txt')
    with open(filename, 'w+') as f:  # open (or create) the output file
        f.truncate()  # clear any contents left over from a previous run
    
    
    class JsonPipeline(object):
        # Use JsonItemExporter to save the items as JSON
        def __init__(self):
            self.file = open('weather1.json', 'wb')
            self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
            self.exporter.start_exporting()
    
        def process_item(self,item,spider):
            print('Write')
            self.exporter.export_item(item)
            return item
    
        def close_spider(self,spider):
            print('Close')
            self.exporter.finish_exporting()
            self.file.close()
    
            
    class TxtPipeline(object):
        def process_item(self, item, spider):
            # get the current working directory
            #base_dir = os.getcwd()
            #filename = os.path.join(base_dir, 'weather.txt')
            print("city:" + item['city'])
            print("city_addition:" + item['city_addition'])
    
            # open the file in append mode and write the corresponding fields
            with open(filename, 'a') as f:
                if item['city'] != item['city_addition']:
                    f.write('城市:' + item['city'] + '>')
                    f.write(item['city_addition'] + '\n')
                else:
                    f.write('城市:' + item['city'] + '\n')
                f.write('日期:' + item['data'] + '\n')
                f.write('天气:' + item['weather'] + '\n')
                f.write('温度:' + item['temperatureMin'] + '~' + item['temperatureMax'] + '℃\n')
            return item
        
    class ExcelPipeline(object):
        # Create the Excel workbook and write the header row
        def __init__(self):
            self.wb = Workbook()
            self.ws = self.wb.active
            # header row
            self.ws.append(['省', '市', '县(乡)', '日期', '天气', '最高温', '最低温'])
    
        def process_item(self, item, spider):
            # note: city_addition2 is only filled if the corresponding xpath in the spider is uncommented
            line = [item['city'], item['city_addition'], item['city_addition2'], item['data'], item['weather'], item['temperatureMax'], item['temperatureMin']]
            self.ws.append(line)  # append the data as a new row in the xlsx file
            self.wb.save('weather.xlsx')
            return item
        '''Alternative: write the rows to a CSV file instead
        def process_item(self, item, spider):
            base_dir = os.getcwd()
            filename = os.path.join(base_dir, 'weather.csv')
            print('creating CSV')
            with open(filename, 'w') as f:
                fieldnames = ['省', '市', '县(乡)', '天气', '日期', '最高温', '最低温']  # column names
                writer = csv.DictWriter(f, fieldnames=fieldnames)  # build a DictWriter over those columns
                writer.writeheader()  # write the header row
                writer.writerow(dict(item))  # write this item's fields
            return item
        '''
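
    As a side note, if JSON output is all you need, Scrapy's built-in feed export can be used instead of the JsonPipeline above, without writing any pipeline code:

    scrapy crawl weatherSpider -o weather.json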

    Crawler output:
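
    Start the crawl from the project directory:

    scrapy crawl weatherSpider

    With the TxtPipeline enabled in settings.py, weather.txt fills with one record per crawled page; given the writes in process_item, each record has the shape below (placeholders, not real output):

    城市:<province>><city>
    日期:<date>
    天气:<weather>
    温度:<min>~<max>℃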

    Determining the crawl targets:

    China Weather Net (weather.com.cn) is used as the data source. Before crawling, always analyse the page first: decide which pieces of information you need and how to extract them most conveniently. Only part of the page source is shown here:

    <div class="ctop clearfix">
                <div class="crumbs fl">
                    <a href="http://js.weather.com.cn" target="_blank">江苏</a>
                    <span>></span>
                    <a href="http://www.weather.com.cn/weather/101190801.shtml" target="_blank">徐州</a><span>></span>  <span>鼓楼</span>
                </div>
                <div class="time fr"></div>
            </div>

    For a city that is not a municipality, get the province name with:

     //div[@class='crumbs fl']/a[last()-1]/text()

    XPath for the last book element:

    book[last()]

    XPath for the second-to-last book element:

    book[last()-1]
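
    A minimal sketch of how last() and last()-1 behave, run against the breadcrumb fragment shown above using Scrapy's Selector directly (outside of any spider):

    from scrapy.selector import Selector

    html = '''
    <div class="crumbs fl">
        <a href="http://js.weather.com.cn" target="_blank">江苏</a>
        <span>></span>
        <a href="http://www.weather.com.cn/weather/101190801.shtml" target="_blank">徐州</a>
        <span>></span> <span>鼓楼</span>
    </div>
    '''

    sel = Selector(text=html)
    # a[last()] is the last <a> in the breadcrumb: the city
    print(sel.xpath("//div[@class='crumbs fl']/a[last()]/text()").extract_first())    # 徐州
    # a[last()-1] is the second-to-last <a>: the province
    print(sel.xpath("//div[@class='crumbs fl']/a[last()-1]/text()").extract_first())  # 江苏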

  • Original post: https://www.cnblogs.com/qmfsun/p/11512606.html