  • Crawlers (Part 4): Distributed Crawlers

     01 - Redis-based distributed crawler (using the RedisCrawlSpider class)

    Distributed crawling:
        1. Concept: the same crawler program runs on multiple machines, which together crawl a website's data in a distributed fashion.
        2. Native Scrapy cannot implement a distributed crawler on its own:
            2.1 the scheduler cannot be shared between machines
            2.2 the item pipeline cannot be shared between machines
        3. scrapy-redis: a component developed specifically for Scrapy that implements distributed crawling.
            3.1 Install it with: pip install scrapy-redis
        4. Distributed crawling workflow:
            a. Install the Redis database.
            b. Edit the Redis configuration file (see the sketch after this list).
            c. Start the Redis service with that configuration file.
            d. Create a Scrapy project, then create a CrawlSpider-based spider file.
            e. Import the class: from scrapy_redis.spiders import RedisCrawlSpider
            f. Make the spider class inherit from RedisCrawlSpider:
                class RedisqiubaiSpider(RedisCrawlSpider):
            g. Replace the start URL list:
                # start_urls = ['https://www.qiushibaike.com/pic/']

                # name of the scheduler queue; plays the same role as start_urls
                redis_key = 'qiubaiSpider'
            h. Configure the project's pipeline and scheduler to use the scrapy-redis components.
            i. Run the spider file:
                scrapy runspider redisQiubai.py
            j. In the Redis client, push the start URL into the scheduler queue (qiubaiSpider is the queue name):
                lpush qiubaiSpider https://www.qiushibaike.com/pic/
            k. Inspect the crawled results:
                lrange redisQiubai:items 0 -1
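
    A minimal sketch of the Redis configuration changes behind steps b and c (assuming the stock redis.conf that ships with Redis; the exact file name and location depend on your install):

    # redis.conf (or a copy of it)
    # comment out the bind line so machines other than localhost can connect
    # bind 127.0.0.1
    # turn protected mode off so remote connections are accepted
    protected-mode no

    Then start the server against that file: redis-server ./redis.conf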

    Modify the configuration file

    # settings.py

    ROBOTSTXT_OBEY = False
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
    # Use the pipeline shipped with scrapy-redis: the data scraped by every machine is
    # stored into the Redis database through this pipeline, so the pipeline is shared.

    ITEM_PIPELINES = {
        # 'redisPro.pipelines.RedisproPipeline': 300,
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }

    # Use the scheduler shipped with scrapy-redis: all URLs are stored in the shared
    # scheduler, so every machine pulls requests from the same queue.

    # use the scrapy-redis de-duplication queue
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # use the scrapy-redis scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # allow pausing/resuming
    SCHEDULER_PERSIST = True

    # If the Redis server is not on this machine, also configure:
    # REDIS_HOST = '<IP address of the Redis server>'
    # REDIS_PORT = 6379
    # Create a new project
    scrapy startproject redisPro
    cd redisPro
    scrapy genspider -t crawl redisQiubai www.qiushibaike.com/pic/
    # redisQiubai.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from redisPro.items import RedisproItem

    from scrapy_redis.spiders import RedisCrawlSpider


    class RedisqiubaiSpider(RedisCrawlSpider):
        name = 'redisQiubai'
        # allowed_domains = ['www.qiushibaike.com/pic/']
        # start_urls = ['https://www.qiushibaike.com/pic/']

        # name of the scheduler queue; plays the same role as start_urls
        redis_key = 'qiubaiSpider'

        link = LinkExtractor(allow=r'/pic/page/\d+')
        rules = (
            Rule(link, callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            div_list = response.xpath('//*[@id="content-left"]/div')
            for div in div_list:
                img_url = "https:" + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
                item = RedisproItem()
                item['img_url'] = img_url

                yield item
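
    The spider above imports RedisproItem from redisPro.items, which is not listed in these notes. A minimal sketch, assuming img_url is the only field the spider fills in:

    # items.py (hypothetical minimal version)
    import scrapy


    class RedisproItem(scrapy.Item):
        # the only field assigned in parse_item above
        img_url = scrapy.Field()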
    # Change into the spider directory
    cd redisPro/redisPro/spiders/

    # Run the spider file
    scrapy runspider redisQiubai.py

    Push the start URL into the scheduler queue, then inspect the crawled results (a redis-cli sketch follows).
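
    A short redis-cli session, assuming Redis runs locally on the default port and the spider above is running via scrapy runspider:

    # push the start URL into the scheduler queue named by redis_key
    lpush qiubaiSpider https://www.qiushibaike.com/pic/
    # RedisPipeline stores scraped items under <spider name>:items
    lrange redisQiubai:items 0 -1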

    02 - UA pool (User-Agent pool)

    - Purpose: disguise the requests sent by the Scrapy project as coming from as many different browser identities as possible.
    
    - Workflow:
    
        1. Intercept requests in a downloader middleware.
        2. Tamper with the User-Agent in each intercepted request's headers.
        3. Enable the downloader middleware in the settings file (see the settings sketch after the code below).
    # middlewares.py
    
    # imports
    # (the old scrapy.contrib path is deprecated; recent Scrapy versions expose the class here)
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    import random
    # The UA pool: a dedicated downloader-middleware class
    class RandomUserAgent(UserAgentMiddleware):
    
        def process_request(self, request, spider):
            # pick a random UA value from the list
            ua = random.choice(user_agent_list)
            # write that UA into the intercepted request's headers
            request.headers.setdefault('User-Agent', ua)
    
    
    user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
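
    Step 3 of the workflow, enabling the middleware in settings.py (a sketch; 'myproject.middlewares' is a placeholder for your own project package, and the priority 542 matches the value used in the wangyiPro example later on):

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgent': 542,
    }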

    03 - Proxy pool

    - Purpose: send the project's requests from as many different IP addresses as possible.
    - Free proxy IPs: http://www.goubanjia.com/
    - Workflow:
    
        1. Intercept requests in a downloader middleware.
        2. Rewrite each intercepted request to go out through some proxy IP.
        3. Enable the downloader middleware in the settings file.
    # middlewares.py (continued from the UA pool above; random is already imported there)
    
    # Swap the IP of intercepted requests in bulk
    # A dedicated downloader-middleware class
    class Proxy(object):
        def process_request(self, request, spider):
            # check whether the intercepted request uses http or https
            # request.url looks like: http://www.xxx.com
            h = request.url.split(':')[0]  # the URL scheme
            if h == 'https':
                ip = random.choice(PROXY_https)
                request.meta['proxy'] = 'https://' + ip
            else:
                ip = random.choice(PROXY_http)
                request.meta['proxy'] = 'http://' + ip
    
    # candidate proxy IPs
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]
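
    A quick way to confirm the proxy middleware is working is to request a service that echoes back the client IP. A throwaway sketch, assuming httpbin.org is reachable and the Proxy middleware is enabled in the project settings:

    # ipcheck.py
    # run from inside the project directory with: scrapy runspider ipcheck.py
    import scrapy


    class IpCheckSpider(scrapy.Spider):
        name = 'ipcheck'
        start_urls = ['http://httpbin.org/ip']

        def parse(self, response):
            # httpbin returns the address it saw as JSON; if the middleware rewrote
            # the request, this should be a proxy IP rather than your own
            self.logger.info(response.text)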

    04 - Using selenium inside Scrapy

    How selenium is wired into Scrapy:
        a) import the webdriver class in the spider file
        b) instantiate the browser object in the spider class's constructor
        c) close the browser in the spider class's closed method
        d) drive the browser automation in the downloader middleware's process_response method
    
    · Requirement: crawl text-based news data from the Domestic, World, Military and Aviation sections of NetEase news (news.163.com).

    # -*- coding: utf-8 -*-
    
    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    from scrapy.http import HtmlResponse
    import time
    
    '''
    UA pool
    '''
    # imports
    # (the old scrapy.contrib path is deprecated; recent Scrapy versions expose the class here)
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    import random
    # The UA pool: a dedicated downloader-middleware class
    
    
    class RandomUserAgent(UserAgentMiddleware):
    
        def process_request(self, request, spider):
            # pick a random UA value from the list
            ua = random.choice(user_agent_list)
            # write that UA into the intercepted request's headers
            request.headers.setdefault('User-Agent', ua)
    
    
    user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    
    
    class WangyiproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None
    
        # Intercepts response objects (the responses the downloader hands to the spider)
        # request:  the request object the response corresponds to
        # response: the intercepted response object
        # spider:   the spider instance from the spider file
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
    
            # Must either;
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
    
            # Replace the page data stored in the response object
            # print(request.url)
            if request.url in ['http://news.163.com/domestic/', 'http://news.163.com/air/', 'http://war.163.com/', 'http://news.163.com/world/']:
                spider.bro.get(url=request.url)
    
                js = 'window.scrollTo(0, document.body.scrollHeight)'
                spider.bro.execute_script(js)
                # give the browser time to load the data revealed by scrolling
                time.sleep(2)
                # page_text now contains the dynamically loaded page data
                page_text = spider.bro.page_source
                # current_url: the URL the browser just requested
                # body: the data carried by the new response object
                return HtmlResponse(url=spider.bro.current_url, body=page_text, encoding='utf-8', request=request)
            else:
                return response
    
    
    '''
    Proxy pool
    '''
    # Swap the IP of intercepted requests in bulk
    # A dedicated downloader-middleware class
    
    
    class Proxy(object):
        def process_request(self, request, spider):
            # check whether the intercepted request uses http or https
            # request.url looks like: http://www.xxx.com
            h = request.url.split(':')[0]  # the URL scheme
            if h == 'https':
                ip = random.choice(PROXY_https)
                request.meta['proxy'] = 'https://' + ip
            else:
                ip = random.choice(PROXY_http)
                request.meta['proxy'] = 'http://' + ip
    
    # candidate proxy IPs
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]
    
    middlewares.py
    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver
    from wangyiPro.items import WangyiproItem
    
    
    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['https://news.163.com']
        start_urls = ['https://news.163.com/']
    
        def __init__(self):
            # instantiate one browser object for the whole spider
            self.bro = webdriver.Chrome(executable_path='./chromedriver 2')
    
        # the browser must be closed once the whole crawl has finished
        def closed(self, spider):
            print("Crawl finished!")
            self.bro.quit()
    
        def parse(self, response):
            lis = response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs = [3, 4, 6, 7]
            # the li tags for the four sections: Domestic, World, Military, Aviation
            li_list = []
            for index in indexs:
                li_list.append(lis[index])
            # get each section's link and text title
            for li in li_list:
                url = li.xpath('./a/@href').extract_first()
                title = li.xpath('./a/text()').extract_first()
    
                # headers = {
                #     'Connection': 'close',
                # }
    
                # print(title, url)
                # request each section's url to get its page data
                # (title, thumbnail, keywords, publish time, article url)
                yield scrapy.Request(url=url, callback=self.parseSecond, meta={'title': title, 'url': url})
    
        def parseSecond(self, response):
            div_list = response.xpath('//div[@class="data_row news_article clearfix"]')
            # print(len(div_list))
            for div in div_list:
                # headline
                head = div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                # thumbnail link
                img_url = div.xpath('./a/img/@src').extract_first()
    
                tag = div.xpath('.//div[@class="news_tag"]//text()').extract()
                tags = []
                for i in tag:
                    # strip surrounding whitespace (spaces, newlines, tabs)
                    i = i.strip(' \n\t')
                    tags.append(i)
                tag = "".join(tags)
    
                # article link
                title_url = div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
    wangyi.py
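
    The spider above opens a visible Chrome window. When the crawler runs on a server it is common to pass Chrome options to run headless; a sketch, assuming a selenium 3.x install with a local chromedriver (older selenium releases name the keyword chrome_options instead of options):

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')      # no visible browser window
    options.add_argument('--disable-gpu')   # commonly recommended for headless Chrome
    bro = webdriver.Chrome(executable_path='./chromedriver', options=options)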
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for wangyiPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'wangyiPro'
    
    SPIDER_MODULES = ['wangyiPro.spiders']
    NEWSPIDER_MODULE = 'wangyiPro.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # USER_AGENT = 'wangyiPro (+http://www.yourdomain.com)'
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    # CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'wangyiPro.middlewares.WangyiproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
       'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
       'wangyiPro.middlewares.RandomUserAgent': 542,
       'wangyiPro.middlewares.Proxy': 541,
    }
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'wangyiPro.pipelines.WangyiproPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    settings.py 
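
    The settings above enable 'wangyiPro.pipelines.WangyiproPipeline', whose source is not included in these notes. A hypothetical minimal placeholder, just passing items through so the crawl can be observed:

    # pipelines.py (hypothetical minimal version)
    class WangyiproPipeline(object):
        def process_item(self, item, spider):
            # a real pipeline would persist the item; this one only prints it
            print(item)
            return item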

    05 - Distributed crawler based on the RedisSpider class

    a) Code changes (spider class):
        i.   import: from scrapy_redis.spiders import RedisSpider
        ii.  change the spider class's parent class to RedisSpider
        iii. comment out the start URL list and add a redis_key attribute (the name of the scheduler queue):
             redis_key = 'wangyi'
    b) Redis configuration file (redisxxx.conf):
        i.  # bind 127.0.0.1      (comment the bind line out)
        ii. protected-mode no
    c) Project settings:
        REDIS_HOST = '<IP address of the Redis server>'
        REDIS_PORT = 6379
        REDIS_ENCODING = 'utf-8'
        REDIS_PARAMS = {'password': '123456'}
        # use the scrapy-redis de-duplication queue
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
        # use the scrapy-redis scheduler
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        # allow pausing/resuming
        SCHEDULER_PERSIST = True
        # use the pipeline shipped with scrapy-redis
        ITEM_PIPELINES = {
            'scrapy_redis.pipelines.RedisPipeline': 400,
        }
    d) Start the Redis service: redis-server <config file>
    e) Run the spider file from the spiders directory:
        scrapy runspider wangyi.py
    f) Push a start URL into the scheduler queue (see the redis-cli sketch after this list):
        i.  open a Redis client: redis-cli
        ii. push a start URL into the scheduler queue:
            lpush wangyi https://news.163.com
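
    A short redis-cli session for steps e) and f), assuming Redis is reachable with the settings above (RedisPipeline stores items under <spider name>:items, so the results land in wangyi:items):

    lpush wangyi https://news.163.com
    lrange wangyi:items 0 -1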
    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver
    from wangyiPro.items import WangyiproItem
    from scrapy_redis.spiders import RedisSpider
    
    
    class WangyiSpider(RedisSpider):
        name = 'wangyi'
        # allowed_domains = ['https://news.163.com']
        # start_urls = ['https://news.163.com/']
    
        redis_key = 'wangyi'
    
        def __init__(self):
            # instantiate one browser object for the whole spider
            self.bro = webdriver.Chrome(executable_path='/Users/huangyanpeng/Desktop/scrapy框架/firstScrapy/wangyiPro/chromedriver 2')
    
        # the browser must be closed once the whole crawl has finished
        def closed(self, spider):
            print("Crawl finished!")
            self.bro.quit()
    
        def parse(self, response):
            lis = response.xpath('//div[@class="ns_area list"]/ul/li')
            indexs = [3, 4, 6, 7]
            # the li tags for the four sections: Domestic, World, Military, Aviation
            li_list = []
            for index in indexs:
                li_list.append(lis[index])
            # get each section's link and text title
            for li in li_list:
                url = li.xpath('./a/@href').extract_first()
                title = li.xpath('./a/text()').extract_first()
    
                # print(title, url)
                # request each section's url to get its page data
                # (title, thumbnail, keywords, publish time, article url)
                yield scrapy.Request(url=url, callback=self.parseSecond, meta={'title': title, 'url': url})
    
        def parseSecond(self, response):
            div_list = response.xpath('//div[@class="data_row news_article clearfix "]')
            # print(len(div_list))
            for div in div_list:
                head = div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                url = div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
                imgUrl = div.xpath('./a/img/@src').extract_first()
                tag = div.xpath('.//div[@class="news_tag"]//text()').extract()
                tags = []
                for t in tag:
                    # strip surrounding whitespace (spaces, newlines, tabs)
                    t = t.strip(' \n\t')
                    tags.append(t)
                tag = "".join(tags)
    
                # the title passed along via meta
                title = response.meta['title']
    
                # instantiate an item object and store the parsed values in it
                item = WangyiproItem()
                item['head'] = head
                item['url'] = url
                item['imgUrl'] = imgUrl
                item['tag'] = tag
                item['title'] = title
    
                # request the article url to get the news body stored on that page
                yield scrapy.Request(url=url, callback=self.getContent, meta={'item': item})
                print(head + ":" + url + ":" + imgUrl + ":" + tag)
    
        def getContent(self, response):
            # get the item passed along via meta
            item = response.meta['item']
    
            # parse the news body stored on the current page
            content_list = response.xpath('//div[@class="post_text"]/p/text()').extract()
            content = "".join(content_list)
            item['content'] = content
    
            yield item
    wangyi.py
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class WangyiproItem(scrapy.Item):
        # define the fields for your item here like:
        head = scrapy.Field()
        url = scrapy.Field()
        imgUrl = scrapy.Field()
        tag = scrapy.Field()
        title = scrapy.Field()
        content = scrapy.Field()
    items.py
    middlewares.py: identical to the version listed in section 04 above (the UA pool, the selenium-driven WangyiproDownloaderMiddleware, and the proxy pool).
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for wangyiPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'wangyiPro'
    
    SPIDER_MODULES = ['wangyiPro.spiders']
    NEWSPIDER_MODULE = 'wangyiPro.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # USER_AGENT = 'wangyiPro (+http://www.yourdomain.com)'
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    # CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'wangyiPro.middlewares.WangyiproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
       'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
       'wangyiPro.middlewares.RandomUserAgent': 542,
       'wangyiPro.middlewares.Proxy': 541,
    }
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       # 'wangyiPro.pipelines.WangyiproPipeline': 300,
       'scrapy_redis.pipelines.RedisPipeline': 400,
    
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    # REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379
    REDIS_ENCODING = 'utf-8'
    # REDIS_PARAMS = {'password': '123456'}
    
    
    # use the scrapy-redis de-duplication queue
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # use the scrapy-redis scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # allow pausing/resuming
    SCHEDULER_PERSIST = True
    settings.py