  • Implementing a distributed crawler with Redis

    Redis distributed crawler 

    Concept: the same crawler program runs on multiple machines, and together they crawl a website's data.
    Native Scrapy cannot implement a distributed crawler, for the following reasons:

    • The scheduler cannot be shared
    • The pipeline cannot be shared

    The scrapy-redis component: a set of components developed specifically for Scrapy that lets Scrapy run in a distributed fashion. Install it with pip install scrapy-redis

    Distributed crawling workflow:

    1 Edit the Redis configuration file (see the snippet below)

    •  Comment out bind 127.0.0.1
    •  Set protected-mode no to disable protected mode
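
    For reference, the two relevant lines in redis.conf end up looking roughly like this (a minimal sketch; the rest of the file is left untouched):

    # bind 127.0.0.1
    protected-mode no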

    2 Start the Redis server using that configuration file
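
    For example, starting the server against the edited configuration file (the path is illustrative):

    redis-server /path/to/redis.conf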

    3 Create a Scrapy project, then create a CrawlSpider-based spider file

    4 Import the RedisCrawlSpider class: from scrapy_redis.spiders import RedisCrawlSpider

    5 Replace start_urls with redis_key = 'xxx'

    6 Write the parsing code

    7 Configure the project's pipeline and scheduler to use the ones provided by the scrapy-redis component

    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }
    # use the scrapy-redis dedup filter
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # use the scheduler provided by scrapy-redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # allow pausing/resuming (persist the request queue in Redis)
    SCHEDULER_PERSIST = True

    8 Configure the Redis server address and port

    # if the Redis server is not on the local machine, configure it like this
    REDIS_HOST = '192.168.0.108'
    REDIS_PORT = 6379
    REDIS_PARAMS = {"password":123456}
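
    A password in REDIS_PARAMS only makes sense if the Redis server itself requires one, which presumably means its redis.conf also contains something like:

    requirepass 123456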

    9 Run the spider file

    scrapy runspider qiubai.py

    10 Push a start URL into the scheduler queue (done in the Redis client): lpush <redis_key value> <start url>

    lpush qiubaispider https://www.qiushibaike.com/pic/
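
    A hedged example of what the full client session might look like, using the host, port and password from the settings above (llen is only there to confirm the URL was queued):

    redis-cli -h 192.168.0.108 -p 6379 -a 123456
    lpush qiubaispider https://www.qiushibaike.com/pic/
    llen qiubaispider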

    Implementation code

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule
    from scrapy_redis.spiders import RedisCrawlSpider
    from redisPro.items import RedisproItem  # the project's items module (module name assumed)


    class QiubaiSpider(RedisCrawlSpider):
        name = 'qiubai'
        # allowed_domains = ['www.qiushibaike.com/pic']
        # start_urls = ['http://www.qiushibaike.com/pic/']
        redis_key = 'qiubaispider'  # plays the same role as start_urls
        link = LinkExtractor(allow=r'/pic/page/\d+')  # follow pagination links
        rules = (
            Rule(link, callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            print('start crawling')
            div_list = response.xpath('//*[@id="content-left"]/div')
            for div in div_list:
                print(div)
                img_url = "http://" + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
                item = RedisproItem()
                item['img_url'] = img_url
                yield item
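
    The spider stores the image URL in a RedisproItem; a minimal items.py sketch that would match it (the class name comes from the spider above, the project layout is assumed):

    import scrapy


    class RedisproItem(scrapy.Item):
        # the only field used by the spider above
        img_url = scrapy.Field()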

    Distributed crawler based on RedisSpider

    Case requirement: crawl text-based news data from four sections (Domestic, International, Military, Aviation)

    • 1 Import the webdriver class in the spider file
    • 2 Instantiate a browser object in the spider class's constructor
    • 3 Close the browser in the spider class's closed method
    • 4 Perform the browser automation in the download middleware's process_response method

    wangyi.py:

    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver
    from wanyiPro.items import WanyiproItem
    from scrapy_redis.spiders import RedisSpider
    
    
    class WangyiSpider(RedisSpider):
        name = 'wangyi'
        # allowed_domains = ['news.163.com']
        # start_urls = ['https://news.163.com/']
        redis_key = "wangyi"
    
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # instantiate a browser object (points at a local chromedriver)
            self.bro = webdriver.Chrome(executable_path='G:myprogram路飞学城第七模块wanyiProchromedriver.exe')

        # the browser must only be closed once the whole crawl has finished
        def closed(self, spider):
            print('spider closed')
            self.bro.quit()
    
        def parse(self, response):
            lis = response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs = [3, 4, 6, 7]
            li_list = []  # li tags for the four sections: Domestic, International, Military, Aviation
            for index in indexs:
                li_list.append(lis[index])
            # get the link and text title of each of the four sections

            for li in li_list:
                url = li.xpath('./a/@href').extract_first()
                title = li.xpath('./a/text()').extract_first()
                # print(url+":"+title)
                # request each section's url to get its page data (title, thumbnail, keywords, publish time, url)
                yield scrapy.Request(url=url, callback=self.parseSecond, meta={'title': title})
    
        def parseSecond(self, response):
            div_list = response.xpath('//div[@class="data_row news_article clearfix "]')
            for div in div_list:
                head = div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                url = div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
                img_url = div.xpath('./a/img/@src').extract_first()
                tag_list = div.xpath('.//div[@class="news_tag"]//text()').extract()
                tags = []
                for t in tag_list:
                    t = t.strip(' \n\t')  # strip surrounding whitespace and newlines
                    tags.append(t)
                tag = "".join(tags)
                # get the title passed along via meta
                title = response.meta['title']
                print(head + ":" + url + ":" + img_url)
                # instantiate an item and store the parsed values in it
                item = WanyiproItem()
                item['head'] = head
                item['url'] = url
                item['imgUrl'] = img_url
                item['tag'] = tag
                item['title'] = title
                # request the article url to parse the detailed news content
                yield scrapy.Request(url=url, callback=self.getContent, meta={'item': item})
    
        def getContent(self, response):
            # get the item passed along via meta
            item = response.meta['item']
            # parse the news content stored on the current page
            content_list = response.xpath('//div[@class="post_text"]/p/text()').extract()
            content = "".join(content_list)
            item['content'] = content
            yield item
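
    wangyi.py fills six fields on WanyiproItem, so the matching items.py would presumably look like this (field names taken directly from the spider code):

    import scrapy


    class WanyiproItem(scrapy.Item):
        head = scrapy.Field()     # article headline
        url = scrapy.Field()      # article url
        imgUrl = scrapy.Field()   # thumbnail url
        tag = scrapy.Field()      # keywords / tags
        title = scrapy.Field()    # section title (Domestic, International, ...)
        content = scrapy.Field()  # article body text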

    middlewares.py:

    import time

    from scrapy import signals
    from scrapy.http import HtmlResponse
    class WanyiproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None
    
        def process_response(self, request, response, spider):
            # intercepts the response object the downloader passes to the spider
            # request: the request object that produced this response
            # response: the intercepted response object
            # spider: the instance of the spider class in the spider file
            print(request.url + " passed through the download middleware")
            # tamper with the page data stored in the response object
            if request.url in ['http://news.163.com/domestic/', 'http://news.163.com/world/', 'http://war.163.com/',
                               'http://news.163.com/air/']:
                spider.bro.get(url=request.url)
                js = 'window.scrollTo(0,document.body.scrollHeight)'
                spider.bro.execute_script(js)
                time.sleep(2)  # give the browser some time to load the dynamic data
                # the page source now contains the dynamically loaded news data
                page_text = spider.bro.page_source
                return HtmlResponse(url=spider.bro.current_url, body=page_text, encoding='utf-8', request=request)
            else:
                return response
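
    For this middleware to run it also has to be enabled in settings.py; a sketch, assuming the project is named wanyiPro and using an arbitrary priority:

    DOWNLOADER_MIDDLEWARES = {
        'wanyiPro.middlewares.WanyiproDownloaderMiddleware': 543,
    }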

    User-Agent pool and proxy pool:

    from scrapy import signals
    from scrapy.http import HtmlResponse
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    import random
    
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    
    # UA pool: a separate download middleware class that wraps the UA pool
    # (subclasses the UserAgentMiddleware class imported above)
    class RandomUserAgent(UserAgentMiddleware):
        def process_request(self, request, spider):
            # randomly pick a UA value from the list
            ua = random.choice(user_agent_list)
            # write the chosen UA into the intercepted request's headers
            request.headers.setdefault('User-Agent', ua)
    
    
    # candidate proxy IPs
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]
    
    # swap in a different proxy IP for every intercepted request
    class Proxy(object):
        def process_request(self, request, spider):
            # check the scheme of the intercepted request's url (http or https)
            # request.url looks like: http://www.xxx.com
            h = request.url.split(':')[0]  # the request's scheme
            if h == 'https':
                ip = random.choice(PROXY_https)
                request.meta['proxy'] = 'https://' + ip
            else:
                ip = random.choice(PROXY_http)
                request.meta['proxy'] = 'http://' + ip
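
    The UA pool and proxy pool classes likewise only take effect once they are registered in settings.py; a sketch, assuming they live in the same middlewares.py of the wanyiPro project and using illustrative priorities:

    DOWNLOADER_MIDDLEWARES = {
        'wanyiPro.middlewares.WanyiproDownloaderMiddleware': 543,
        'wanyiPro.middlewares.RandomUserAgent': 542,
        'wanyiPro.middlewares.Proxy': 541,
    }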

    Steps to implement a distributed crawler based on RedisSpider

    1 Import: from scrapy_redis.spiders import RedisSpider
    2 Change the spider class's parent class to RedisSpider
    3 Comment out the start URL list and add a redis_key attribute (the name of the scheduler queue)
    4 Edit the Redis configuration file:

    • Comment out bind 127.0.0.1
    • Set protected-mode no to disable protected mode

    5 Configure Redis in settings.py

    REDIS_HOST = '192.168.0.108'
    REDIS_PORT = 6379
    REDIS_PARAMS = {"password": 123456}
    
    # use the scrapy-redis dedup filter
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # use the scheduler provided by scrapy-redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # allow pausing/resuming (persist the request queue in Redis)
    SCHEDULER_PERSIST = True
    
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }

    6 Run the spider file

    scrapy runspider wangyi.py

    7 Push a start URL into the scheduler queue (in the Redis client)

    lpush wangyi https://news.163.com/
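
    Once the slaves start producing items, the RedisPipeline writes them back into Redis, by default into a list named <spider name>:items, so a quick sanity check from the Redis client might look like this (key name assumes the scrapy-redis default):

    llen wangyi:items
    lrange wangyi:items 0 0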