Python Crawler: scrapy-redis Distributed Example (Part 1)

    Goal: take the earlier Sina News Scrapy project, convert it into a scrapy-redis distributed crawler based on the RedisSpider class, and store the scraped data in a Redis database.

    1. The item file: identical to the previous project, no changes needed

    # -*- coding: utf-8 -*-

    import scrapy
    import sys

    # Python 2 idiom: reset the interpreter's default encoding to UTF-8
    reload(sys)
    sys.setdefaultencoding("utf-8")


    class SinanewsItem(scrapy.Item):
        # Title and URL of each top-level category
        parentTitle = scrapy.Field()
        parentUrls = scrapy.Field()

        # Title and URL of each sub-category
        subTitle = scrapy.Field()
        subUrls = scrapy.Field()

        # Directory path where the sub-category is stored
        subFilename = scrapy.Field()

        # Child links under each sub-category
        sonUrls = scrapy.Field()

        # Article headline and body
        head = scrapy.Field()
        content = scrapy.Field()
    
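    A SinanewsItem is filled like a dictionary throughout the spider; a quick sanity check with hypothetical values:

    item = SinanewsItem()
    item['parentTitle'] = u'news'
    item['parentUrls'] = 'http://news.sina.com.cn/'
    print(dict(item))  # scrapy items convert cleanly to plain dicts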

    2. The spider file: replace the previous Spider class with RedisSpider; the rest needs only minor changes. Full code below:

    # -*- coding: utf-8 -*-

    import scrapy
    import os
    import sys
    from sinaNews.items import SinanewsItem
    from scrapy_redis.spiders import RedisSpider

    # Python 2 idiom: reset the interpreter's default encoding to UTF-8
    reload(sys)
    sys.setdefaultencoding("utf-8")


    class SinaSpider(RedisSpider):
        name = "sina"
        # The Redis key that start URLs are pushed to in order to kick off the crawl
        redis_key = "sinaspider:start_urls"

        # Define the allowed crawl domains dynamically
        def __init__(self, *args, **kwargs):
            domain = kwargs.pop('domain', '')
            self.allowed_domains = filter(None, domain.split(','))
            super(SinaSpider, self).__init__(*args, **kwargs)

        def parse(self, response):
            items = []
            # URLs and titles of all top-level categories
            parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()
            parentTitle = response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()

            # URLs and titles of all sub-categories
            subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
            subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

            # Iterate over all top-level categories
            for i in range(0, len(parentTitle)):

                # Iterate over all sub-categories
                for j in range(0, len(subUrls)):
                    item = SinanewsItem()

                    # Save the top-level category's title and URL
                    item['parentTitle'] = parentTitle[i]
                    item['parentUrls'] = parentUrls[i]

                    # True if the sub-category URL starts with the matching top-level
                    # URL, e.g. sports.sina.com.cn and sports.sina.com.cn/nba
                    if_belong = subUrls[j].startswith(item['parentUrls'])

                    # If the sub-category belongs to this top-level category,
                    # store it under the top-level category
                    if if_belong:
                        # Save the sub-category's url and title fields
                        item['subUrls'] = subUrls[j]
                        item['subTitle'] = subTitle[j]
                        items.append(item)

            # Request each sub-category URL; the Response, together with its meta
            # data, is handed to the second_parse callback
            for item in items:
                yield scrapy.Request(url=item['subUrls'], meta={'meta_1': item}, callback=self.second_parse)

        # Recursively follow the URLs found on each sub-category page
        def second_parse(self, response):
            # Extract the meta data carried by this Response
            meta_1 = response.meta['meta_1']

            # Collect all child links on the sub-category page
            sonUrls = response.xpath('//a/@href').extract()

            items = []
            for i in range(0, len(sonUrls)):
                # True if the link starts with the top-level URL and ends with .shtml
                if_belong = sonUrls[i].endswith('.shtml') and sonUrls[i].startswith(meta_1['parentUrls'])

                # If it belongs to this top-level category, copy all fields into
                # one item so they travel together
                if if_belong:
                    item = SinanewsItem()
                    item['parentTitle'] = meta_1['parentTitle']
                    item['parentUrls'] = meta_1['parentUrls']
                    item['subUrls'] = meta_1['subUrls']
                    item['subTitle'] = meta_1['subTitle']
                    item['sonUrls'] = sonUrls[i]
                    items.append(item)

            # Request each child link; the Response, together with its meta data,
            # is handed to the detail_parse callback
            for item in items:
                yield scrapy.Request(url=item['sonUrls'], meta={'meta_2': item}, callback=self.detail_parse)

        # Parse the article page for the headline and body text
        def detail_parse(self, response):
            item = response.meta['meta_2']
            content = ""
            head = response.xpath('//h1[@id="main_title"]/text()').extract()
            content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()

            # Concatenate the text of all <p> tags
            for content_one in content_list:
                content += content_one

            item['head'] = head[0] if len(head) > 0 else "NULL"
            item['content'] = content

            yield item
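    Because __init__ pops an optional domain keyword argument, the crawl scope can be set when the spider is launched; if it is omitted, allowed_domains stays empty and no offsite filtering is applied. A usage sketch (the domain list is illustrative):

    scrapy runspider sina.py -a domain=news.sina.com.cn,sports.sina.com.cn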
    

    3. The settings file

    SPIDER_MODULES = ['sinaNews.spiders']
    NEWSPIDER_MODULE = 'sinaNews.spiders'

    # Use scrapy-redis's dedupe component instead of Scrapy's default one
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use scrapy-redis's scheduler instead of the default scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Allow pausing: the request records kept in Redis are not lost
    SCHEDULER_PERSIST = True
    # Default scrapy-redis request queue (ordered by priority)
    SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
    # Plain queue: first in, first out
    # SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
    # Stack: last in, first out
    # SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

    # The data goes straight into Redis, so no custom pipelines file is needed
    ITEM_PIPELINES = {
        # 'Sina.pipelines.SinaPipeline': 300,
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }

    # LOG_LEVEL = 'DEBUG'

    # Introduce an artificial delay between requests
    DOWNLOAD_DELAY = 1
    # Redis host IP
    REDIS_HOST = "192.168.13.26"
    # Redis port
    REDIS_PORT = 6379
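    Once the spider is running, you can see which keys the scheduler and the pipeline create. Assuming scrapy-redis's default key patterns (they derive from the spider name), the pending request queue lives in sina:requests, the fingerprints of already-seen requests in sina:dupefilter, and scraped items in sina:items. Because SCHEDULER_PERSIST = True, the first two survive a shutdown, so an interrupted crawl resumes where it left off. To check:

    redis-cli> keys sina:*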


    Running the crawler:

    This walkthrough uses a local Redis instance, so comment out REDIS_HOST and REDIS_PORT in the settings file.
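    That is, in settings.py (with both lines commented out, scrapy-redis falls back to its defaults, localhost and 6379):

    # REDIS_HOST = "192.168.13.26"
    # REDIS_PORT = 6379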

    Start the crawler:

    scrapy runspider sina.py
    

    After launching, the spider sits idle: it is blocked waiting for start URLs to appear under its redis_key. To make the crawl distributed, launch the same command on additional machines or terminals pointed at the same Redis; every instance pulls from the shared request queue. At this point, execute the following command on the Redis side:

    redis-cli> lpush sinaspider:start_urls http://news.sina.com.cn/guide/

    Here http://news.sina.com.cn/guide/ is the start URL; as soon as it is pushed, the crawl begins.
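    With RedisPipeline enabled, every scraped item is serialized to JSON and pushed onto a Redis list. A minimal inspection sketch, assuming a local Redis and the default item key "<spider name>:items", i.e. sina:items here:

    # -*- coding: utf-8 -*-
    import json
    import redis

    r = redis.StrictRedis(host="localhost", port=6379)

    # How many items have been scraped so far
    print(r.llen("sina:items"))

    # Peek at the first five items
    for raw in r.lrange("sina:items", 0, 4):
        item = json.loads(raw)
        print("%s -> %s" % (item["parentTitle"], item["sonUrls"]))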
