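  • day26 - Advanced web scraping

    5. Sending requests in code: crawling every page of a site
    Example 4: crawl all pages (choutiAll project) by sending requests manually, with start_urls = ['https://dig.chouti.com/r/pic/hot/1']
    Goal: parse every link in Chouti's picture section.
    # A URL template shared by all pages (pageNum is the page number)
    url = 'https://dig.chouti.com/r/pic/hot/%d'
    The key step is re-invoking parse for each following page: yield scrapy.Request(url=url, callback=self.parse)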
    # -*- coding: utf-8 -*-
    import scrapy
    from choutiAllPro.items import ChoutiallproItem
    
    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        #allowed_domains = ['www.ddd.com']
        start_urls = ['https://dig.chouti.com/r/pic/hot/1']
    
        # A URL template shared by all pages (pageNum is the page number)
        url = 'https://dig.chouti.com/r/pic/hot/%d'
        pageNum = 1

        def parse(self, response):
            div_list = response.xpath('//div[@class="content-list"]/div')
            for div in div_list:
                title = div.xpath('./div[3]/div[1]/a/text()').extract_first()
                item = ChoutiallproItem()
                item['title'] = title

                yield item

            # Request the URLs of the remaining pages
            if self.pageNum < 120: # assume there are only 120 pages
                self.pageNum += 1
                url = self.url % self.pageNum
                #print(url)
                # Send the request manually
                yield scrapy.Request(url=url, callback=self.parse) # with yield, one request is sent per page and parse is called again for each response; without yield, only the start URL is fetched
                
                
    chouti.py
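The page bookkeeping in parse can be checked offline. The sketch below is not part of the project: `page_urls` and `last_page` are hypothetical names, and it only reproduces the URL arithmetic, assuming (as the spider does) that the section has 120 pages.

```python
def page_urls(template, last_page, first_page=1):
    """Yield one URL per page, from first_page to last_page inclusive."""
    page = first_page
    while page <= last_page:
        yield template % page
        page += 1


# Same template as the spider; 120 pages is the spider's own assumption.
urls = list(page_urls('https://dig.chouti.com/r/pic/hot/%d', last_page=120))
print(len(urls), urls[0], urls[-1])
```

Counting the generated URLs this way makes it easy to spot an off-by-one in the stop condition before running a real crawl.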

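    //text() selects the text nodes of every descendant, so it can return several strings; /text() selects only the element's own direct text nodes.
    Scrapy handles cookies automatically (cookies are enabled by default), so ordinary GET requests need no extra cookie code.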
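The difference between the two forms can be seen without Scrapy. The sketch below uses the standard library's ElementTree as a stand-in (an assumption made for illustration; Scrapy evaluates XPath through its own selectors): the element's own text nodes correspond to what /text() selects, and the text of all descendants to what //text() selects. The HTML fragment is made up.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring('<div>hello <a>world</a> again</div>')

# Direct text nodes of <div> (what ./text() selects): the text before the
# first child plus the tail text that follows each child element.
direct_text = [div.text] + [child.tail for child in div]

# Text nodes of <div> and all its descendants (what .//text() selects).
all_text = list(div.itertext())

print(direct_text)
print(all_text)
```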

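    Example 5: Baidu Translate (postPro project): sending a POST request, with cookies handled by Scrapy
    Override the parent-class method:
    def start_requests(self):
        for url in self.start_urls:
            # FormRequest sends a POST request
            yield scrapy.FormRequest(url=url,callback=self.parse,formdata={'kw':'dog'})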
    # -*- coding: utf-8 -*-
    import scrapy
    
    # Goal: send a POST request to each URL in start_urls
    class PostSpider(scrapy.Spider):
        name = 'post'
        #allowed_domains = ['www.xxx.com']
        start_urls = ['https://fanyi.baidu.com/sug']

        # A method of the parent Spider class: it sends a request for each URL in start_urls in turn
        def start_requests(self):
            for url in self.start_urls:
                # yield scrapy.Request(url=url, callback=self.parse) # sends a GET request by default
                # FormRequest sends a POST request
                yield scrapy.FormRequest(url=url,callback=self.parse,formdata={'kw':'dog'}) # formdata carries the form parameters

        def parse(self, response):
            print(response.text) # the result is a JSON string
    post.py
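To see what formdata turns into on the wire, the sketch below builds (but never sends) an equivalent request with the standard library. It is an illustration only, not Scrapy's internals: a form body is the URL-encoded key/value pairs, and supplying a body is what makes the request a POST.

```python
from urllib.parse import urlencode
from urllib.request import Request

formdata = {'kw': 'dog'}
body = urlencode(formdata).encode('utf-8')

# The request is only constructed here; nothing goes over the network.
req = Request('https://fanyi.baidu.com/sug', data=body)

print(req.get_method())  # urllib picks POST because data is not None
print(req.data)
```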
          
    Example 6: logging in (to Douban Movie) with a POST request (loginPro project)
    Logging in is what obtains the cookie.
    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class LoginSpider(scrapy.Spider):
        name = 'login'
        #allowed_domains = ['www.xxx.com']
        start_urls = ['https://accounts.douban.com/login']
        
        def start_requests(self):
            data = {
                'source':    'movie',
                'redir':    'https://movie.douban.com/',
                'form_email':    '15027900535',
                'form_password':    'bobo@15027900535',
                'login':    '登录',
            }
            for url in self.start_urls:
                yield scrapy.FormRequest(url=url,callback=self.parse,formdata=data)
        
        def getPageText(self,response):
            page_text = response.text
            with open('./douban.html','w',encoding='utf-8') as fp:
                fp.write(page_text)
                print('over')
        
        def parse(self, response):
            # Fetch the current user's profile page (if it shows user info, the cookie was sent; otherwise it is the login page)
            url = 'https://www.douban.com/people/185687620/'
            yield scrapy.Request(url=url,callback=self.getPageText)
    login.py
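What the cookie handling amounts to can be sketched with the standard library. This is a simplified illustration, not Scrapy's cookie middleware: the Set-Cookie value and the `sessionid` name are invented, and the idea is only that a cookie received in one response is replayed in the Cookie header of the next request.

```python
from http.cookies import SimpleCookie

# A made-up Set-Cookie header value, as a login response might return it.
set_cookie = 'sessionid=abc123; Path=/'

jar = SimpleCookie()
jar.load(set_cookie)

# Header to attach to the follow-up request (e.g. the profile page).
cookie_header = '; '.join(
    f'{name}={morsel.value}' for name, morsel in jar.items()
)
print(cookie_header)
```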


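    6. Scrapy's core components: the 5 core components
    Summary of the flow:
    The engine calls start_requests in the spider file, which wraps each URL (from start_urls, and later from every yielded Request) in a request object, producing a series of request objects. The engine hands them to the scheduler, which removes duplicates
    and puts them in its queue. The scheduler dispatches each request to the downloader, which fetches the page from the internet. Once the page data has been downloaded, the downloader passes it (via the engine) to the spider file.
    The spider file parses it (by calling parse), wraps the parsed data in item objects and submits them to the pipeline, which persists them.
    Note: the scheduler holds a queue and de-duplicates request objects.
    1. Engine: coordinates everything; it triggers all the method calls
    2. Scheduler: receives requests from the engine, pushes them onto a queue and drops duplicate URLs
    3. Downloader: downloads page content and returns it to the spiders (Scrapy's spiders, i.e. the spider files)
    4. Spider files (spiders): do the work of parsing the fetched page data
    5. Pipeline: persists the data
    Internet: the external source the downloader fetches pages from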
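The scheduler's two jobs, queueing and de-duplication, can be sketched in a few lines. `ToyScheduler` is a hypothetical stand-in written for these notes, not Scrapy's scheduler (which fingerprints whole requests rather than comparing raw URLs, and does not necessarily serve them in arrival order).

```python
from collections import deque


class ToyScheduler:
    """Queue URLs in arrival order and drop any URL seen before."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url in self.seen:
            return False  # duplicate: not queued again
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_request(self):
        return self.queue.popleft() if self.queue else None


scheduler = ToyScheduler()
for url in ['https://dig.chouti.com/r/pic/hot/1',
            'https://dig.chouti.com/r/pic/hot/2',
            'https://dig.chouti.com/r/pic/hot/1']:  # repeated on purpose
    scheduler.enqueue(url)

print(len(scheduler.queue))
```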

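    Downloader middleware (sits between the engine and the downloader): it can be used to switch proxy IPs
    Example 7: using a proxy middleware (dailiPro project)
    Write daili.py, then edit the process_request method of DailiproDownloaderMiddleware in middlewares.py:
    def process_request(self, request, spider):
        # the request argument is the intercepted request object
        request.meta['proxy'] = "https://151.106.15.3:1080"
        return None
    Enable it in DOWNLOADER_MIDDLEWARES in settings.py (lines 55-57)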
    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class DailiSpider(scrapy.Spider):
        name = 'daili'
        #allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.baidu.com/s?wd=ip']
    
        def parse(self, response):
           page_text = response.text
           with open('daili.html','w',encoding='utf-8') as fp:
               fp.write(page_text)
    daili.py
    # -*- coding: utf-8 -*-
    from scrapy import signals
    
    
    class DailiproDownloaderMiddleware(object):
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_request(self, request, spider):
            # the request argument is the intercepted request object
            request.meta['proxy'] = "https://151.106.15.3:1080"
            # request.meta={"https":"151.106.15.3:1080"} # not recommended
            return None
    
        def process_response(self, request, response, spider):
            return response
    
        def process_exception(self, request, exception, spider):
            pass
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    middlewares.py
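Since the point of the middleware is to change proxy IPs, one possible extension is to pick a proxy per request. The sketch below is an illustration under stated assumptions: the `RandomProxyMiddleware` name, the `PROXIES` pool and its addresses (documentation-range IPs) are invented, and `FakeRequest` only stands in for a Scrapy request so that the snippet runs on its own. The proxy is chosen by the scheme of the request URL.

```python
import random
from urllib.parse import urlparse

# Hypothetical proxy pool; replace with proxies you actually control.
PROXIES = {
    'http': ['http://203.0.113.10:8080', 'http://203.0.113.11:8080'],
    'https': ['https://203.0.113.20:1080'],
}


class RandomProxyMiddleware:
    def process_request(self, request, spider):
        scheme = urlparse(request.url).scheme
        candidates = PROXIES.get(scheme)
        if candidates:
            request.meta['proxy'] = random.choice(candidates)
        return None  # let the request continue down the middleware chain


class FakeRequest:
    """Minimal stand-in exposing the two attributes used above."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


request = FakeRequest('https://www.baidu.com/s?wd=ip')
RandomProxyMiddleware().process_request(request, spider=None)
print(request.meta['proxy'])
```

In a real project such a class would be registered in DOWNLOADER_MIDDLEWARES in the same way as DailiproDownloaderMiddleware.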
     1 # -*- coding: utf-8 -*-
     2 
     3 # Scrapy settings for dailiPro project
     4 #
     5 # For simplicity, this file contains only settings considered important or
     6 # commonly used. You can find more settings consulting the documentation:
     7 #
     8 #     https://doc.scrapy.org/en/latest/topics/settings.html
     9 #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    10 #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    11 
    12 BOT_NAME = 'dailiPro'
    13 
    14 SPIDER_MODULES = ['dailiPro.spiders']
    15 NEWSPIDER_MODULE = 'dailiPro.spiders'
    16 
    17 
    18 # Crawl responsibly by identifying yourself (and your website) on the user-agent
    19 #USER_AGENT = 'dailiPro (+http://www.yourdomain.com)'
    20 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    21 # Obey robots.txt rules
    22 ROBOTSTXT_OBEY = False
    23 
    24 # Configure maximum concurrent requests performed by Scrapy (default: 16)
    25 #CONCURRENT_REQUESTS = 32
    26 
    27 # Configure a delay for requests for the same website (default: 0)
    28 # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    29 # See also autothrottle settings and docs
    30 #DOWNLOAD_DELAY = 3
    31 # The download delay setting will honor only one of:
    32 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    33 #CONCURRENT_REQUESTS_PER_IP = 16
    34 
    35 # Disable cookies (enabled by default)
    36 #COOKIES_ENABLED = False
    37 
    38 # Disable Telnet Console (enabled by default)
    39 #TELNETCONSOLE_ENABLED = False
    40 
    41 # Override the default request headers:
    42 #DEFAULT_REQUEST_HEADERS = {
    43 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    44 #   'Accept-Language': 'en',
    45 #}
    46 
    47 # Enable or disable spider middlewares
    48 # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    49 #SPIDER_MIDDLEWARES = {
    50 #    'dailiPro.middlewares.DailiproSpiderMiddleware': 543,
    51 #}
    52 
    53 # Enable or disable downloader middlewares
    54 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    55 DOWNLOADER_MIDDLEWARES = {
    56     'dailiPro.middlewares.DailiproDownloaderMiddleware': 543,
    57 }
    58 
    59 # Enable or disable extensions
    60 # See https://doc.scrapy.org/en/latest/topics/extensions.html
    61 #EXTENSIONS = {
    62 #    'scrapy.extensions.telnet.TelnetConsole': None,
    63 #}
    64 
    65 # Configure item pipelines
    66 # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    67 #ITEM_PIPELINES = {
    68 #    'dailiPro.pipelines.DailiproPipeline': 300,
    69 #}
    70 
    71 # Enable and configure the AutoThrottle extension (disabled by default)
    72 # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    73 #AUTOTHROTTLE_ENABLED = True
    74 # The initial download delay
    75 #AUTOTHROTTLE_START_DELAY = 5
    76 # The maximum download delay to be set in case of high latencies
    77 #AUTOTHROTTLE_MAX_DELAY = 60
    78 # The average number of requests Scrapy should be sending in parallel to
    79 # each remote server
    80 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    81 # Enable showing throttling stats for every response received:
    82 #AUTOTHROTTLE_DEBUG = False
    83 
    84 # Enable and configure HTTP caching (disabled by default)
    85 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    86 #HTTPCACHE_ENABLED = True
    87 #HTTPCACHE_EXPIRATION_SECS = 0
    88 #HTTPCACHE_DIR = 'httpcache'
    89 #HTTPCACHE_IGNORE_HTTP_CODES = []
    90 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    91 
    92 #DEBUG  INFO  ERROR  WARNING
    93 #LOG_LEVEL = 'ERROR'
    94 
    95 LOG_FILE = 'log.txt'
    settings.py


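    7. Configuring log output
    Log levels: DEBUG INFO WARNING ERROR
    In settings.py, LOG_LEVEL = 'ERROR' limits the output to messages of level ERROR and above (it is commented out in the settings.py above)
    LOG_FILE = 'log.txt' writes the log to a file; see the settings.py in section 6 above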
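A log level is a threshold rather than an exact filter. The standard-library sketch below (Scrapy's logging is built on the same module; the logger name and messages are made up) shows that a logger set to ERROR drops DEBUG, INFO and WARNING records but keeps ERROR and CRITICAL.

```python
import logging


class ListHandler(logging.Handler):
    """Collect the level names of emitted records for inspection."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def emit(self, record):
        self.levels.append(record.levelname)


handler = ListHandler()
logger = logging.getLogger('log_level_demo')
logger.setLevel(logging.ERROR)  # same threshold as LOG_LEVEL = 'ERROR'
logger.addHandler(handler)
logger.propagate = False  # keep the demo records out of the root logger

logger.debug('dropped')
logger.info('dropped')
logger.warning('dropped')
logger.error('kept')
logger.critical('kept')

print(handler.levels)
```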


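    8. Passing data between requests: when the data to scrape is spread across several pages
    Open issue: the regex did not take effect!???
    Example 8: passing data between requests, scraping movie detail data (moviePro project)
    Put values from different pages into the same item (the name and the author, stored as actor in the code)
    Send the follow-up request manually with yield
    Passing data: the meta argument of Request hands a specific value to the callback named in that Request; the callback reads it back through response,
    item = response.meta['item']. The name is taken from the list page, the author from the second-level detail page
    yield scrapy.Request(url=url,callback=self.getSencodPageText,meta={'item':item})

    def getSencodPageText(self,response):
        # 2. Receive the item object passed along by the Request
        item = response.meta['item']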
    # -*- coding: utf-8 -*-
    import scrapy
    from moviePro.items import MovieproItem
    
    class MovieSpider(scrapy.Spider):
        name = 'movie'
        #allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.dy2018.com/html/gndy/dyzz/']
        # Parses the data on a movie's detail page
        def getSencodPageText(self,response):
            # 2. Receive the item object passed along by the Request
            item = response.meta['item']
            actor = response.xpath('//*[@id="Zoom"]/p[16]/text()').extract_first()
            item['actor'] = actor
            
            yield item
            
        def parse(self, response):
            print(response.text)
            table_list = response.xpath('//div[@class="co_content8"]/ul/table')
            for table in table_list:
                url = "https://www.dy2018.com"+table.xpath('./tbody/tr[2]/td[2]/b/a/@href').extract_first() # the href is relative, so the https://www.dy2018.com prefix must be prepended
                name = table.xpath('./tbody/tr[2]/td[2]/b/a/text()').extract_first()
                print(url)
                item = MovieproItem() # instantiate an item object
                item['name']=name

                # 1. Have the Request pass the item object to getSencodPageText via meta
                yield scrapy.Request(url=url,callback=self.getSencodPageText,meta={'item':item}) # send the request manually
                
    movie.py
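The hand-off through meta can be traced without a network. The stubs below (`FakeRequest`, `FakeResponse`, `run`) are hypothetical stand-ins for Scrapy's request/response cycle, and the movie data is invented; the point is only that the dict given as meta on the request is the one the callback reads from response.meta.

```python
class FakeRequest:
    def __init__(self, url, callback, meta=None):
        self.url, self.callback, self.meta = url, callback, meta or {}


class FakeResponse:
    def __init__(self, request, data):
        self.meta = request.meta  # the response exposes its request's meta
        self.data = data


def parse_list(names):
    for name in names:
        item = {'name': name}  # first half of the item, from the list page
        yield FakeRequest('/detail/' + name, parse_detail, meta={'item': item})


def parse_detail(response):
    item = response.meta['item']   # the same dict that parse_list created
    item['actor'] = response.data  # second half, from the detail page
    return item


def run(names, details):
    """Play the engine: 'download' each request, then call its callback."""
    return [req.callback(FakeResponse(req, details[req.url]))
            for req in parse_list(names)]


items = run(['movie-a'], {'/detail/movie-a': 'actor-a'})
print(items)
```

Newer Scrapy versions (1.7 and later) also accept a cb_kwargs argument on Request for passing values to a callback; meta, as used in these notes, keeps working.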


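    9. Using CrawlSpider: link extractors and rules
    CrawlSpider can crawl an entire site. Key topic!
    Example 9: using CrawlSpider to crawl all the picture pages of Qiushibaike (crawlPro project)
    Note: create the spider with scrapy genspider -t crawl qiubai www.xxx.com
    How to reach the first page's link? Note that allow keeps only the links that match the regex: link1 = LinkExtractor(allow=r'/pic/$')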
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class QiubaiSpider(CrawlSpider):
        name = 'qiubai'
        #allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.qiushibaike.com/pic/']
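        # Link extractor (for the pagination links): extracts the specified links from the page source of the start URL
        # allow: a regular expression; every link in the page source that matches it is extracted
        link = LinkExtractor(allow=r'/pic/page/\d+\?s=\d+')
        #href="/pic/page/5?s=5144132"

        link1 = LinkExtractor(allow=r'/pic/$') # extracts every link that matches the regex
        #href="/pic/"
        rules = (
            # Rule: the pages behind the extracted links are parsed by the given callback
            # follow=True: keep applying the link extractor to the pages behind the links it extracted (so link is applied again); with False it is applied only to start_urls, yielding just a few results.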
            Rule(link, callback='parse_item', follow=True),
            Rule(link1, callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            print(response)
            
        
    qiubai.py
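The allow pattern can be tried against the sample hrefs from the comments with the standard `re` module. This sketch assumes the link extractor applies the pattern as a regular-expression search on each URL; it shows why the digits and the literal question mark need backslashes.

```python
import re

href = '/pic/page/5?s=5144132'  # sample href from the spider's comments

escaped = r'/pic/page/\d+\?s=\d+'  # \d+ = digits, \? = a literal '?'
unescaped = r'/pic/page/d+?s=d+'   # matches the letter 'd', not digits

print(bool(re.search(escaped, href)))
print(bool(re.search(unescaped, href)))
print(bool(re.search(r'/pic/$', '/pic/')))
```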


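    10. Distributed crawling: several machines crawl the same site's data together. Key topic!
    Install redis from within PyCharm

    Example 10: distributed crawling of Chouti's 42 section (redisPro project)
    # Crawl the URLs of all the images in Chouti's 42 section
    Submit the items to the Redis-backed pipeline
    In settings.py: ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
    }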
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_redis.spiders import RedisCrawlSpider
    from redisPro.items import RedisproItem
    # 0. Import the RedisCrawlSpider class
    # 1. Change the spider's parent class to RedisCrawlSpider
    # 2. Replace start_urls with the redis_key attribute
    # 3. Write the actual parsing code
    # 4. Submit items to the pipeline that ships with scrapy-redis (in settings.py: ITEM_PIPELINES = {
    #     'scrapy_redis.pipelines.RedisPipeline': 400
    # })
    # 5. Submit all request objects for the URLs produced by the spider to the scheduler that ships with scrapy-redis (settings.py, lines 95-100)
    # 6. State in the settings which Redis database stores the scraped data (settings.py, lines 105-108)
    # 7. Edit the Redis configuration file (redis.windows.conf): protected-mode no   #bind 127.0.0.1
    # 8. Run the spider file: scrapy runspider xxx.py
    # 9. Push a start URL into the scheduler
    class ChoutiSpider(RedisCrawlSpider):
        name = 'chouti'
        #allowed_domains = ['www.xxx.com']
        #start_urls = ['http://www.xxx.com/']
        # Name of the scheduler queue: the start URL is pushed into the queue with this name
        redis_key = "chouti"

        rules = (
            Rule(LinkExtractor(allow=r'/r/news/hot/\d+'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):

            imgUrl_list =  response.xpath('//div[@class="news-pic"]/img/@src').extract()
            for url in imgUrl_list:
                item = RedisproItem()
                item['url'] = url

                yield item
    chouti.py
      1 # -*- coding: utf-8 -*-
      2 
      3 # Scrapy settings for redisPro project
      4 #
      5 # For simplicity, this file contains only settings considered important or
      6 # commonly used. You can find more settings consulting the documentation:
      7 #
      8 #     https://doc.scrapy.org/en/latest/topics/settings.html
      9 #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
     10 #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
     11 
     12 BOT_NAME = 'redisPro'
     13 
     14 SPIDER_MODULES = ['redisPro.spiders']
     15 NEWSPIDER_MODULE = 'redisPro.spiders'
     16 
     17 
     18 # Crawl responsibly by identifying yourself (and your website) on the user-agent
     19 #USER_AGENT = 'redisPro (+http://www.yourdomain.com)'
     20 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
     21 # Obey robots.txt rules
     22 ROBOTSTXT_OBEY = False
     23 
     24 # Configure maximum concurrent requests performed by Scrapy (default: 16)
     25 #CONCURRENT_REQUESTS = 32
     26 
     27 # Configure a delay for requests for the same website (default: 0)
     28 # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
     29 # See also autothrottle settings and docs
     30 #DOWNLOAD_DELAY = 3
     31 # The download delay setting will honor only one of:
     32 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
     33 #CONCURRENT_REQUESTS_PER_IP = 16
     34 
     35 # Disable cookies (enabled by default)
     36 #COOKIES_ENABLED = False
     37 
     38 # Disable Telnet Console (enabled by default)
     39 #TELNETCONSOLE_ENABLED = False
     40 
     41 # Override the default request headers:
     42 #DEFAULT_REQUEST_HEADERS = {
     43 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
     44 #   'Accept-Language': 'en',
     45 #}
     46 
     47 # Enable or disable spider middlewares
     48 # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
     49 #SPIDER_MIDDLEWARES = {
     50 #    'redisPro.middlewares.RedisproSpiderMiddleware': 543,
     51 #}
     52 
     53 # Enable or disable downloader middlewares
     54 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
     55 #DOWNLOADER_MIDDLEWARES = {
     56 #    'redisPro.middlewares.RedisproDownloaderMiddleware': 543,
     57 #}
     58 
     59 # Enable or disable extensions
     60 # See https://doc.scrapy.org/en/latest/topics/extensions.html
     61 #EXTENSIONS = {
     62 #    'scrapy.extensions.telnet.TelnetConsole': None,
     63 #}
     64 
     65 # Configure item pipelines
     66 # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
     67 ITEM_PIPELINES = {
     68     'scrapy_redis.pipelines.RedisPipeline': 400
     69 
     70 #    'redisPro.pipelines.RedisproPipeline': 300,
     71 
     72 }
     73 
     74 # Enable and configure the AutoThrottle extension (disabled by default)
     75 # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
     76 #AUTOTHROTTLE_ENABLED = True
     77 # The initial download delay
     78 #AUTOTHROTTLE_START_DELAY = 5
     79 # The maximum download delay to be set in case of high latencies
     80 #AUTOTHROTTLE_MAX_DELAY = 60
     81 # The average number of requests Scrapy should be sending in parallel to
     82 # each remote server
     83 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
     84 # Enable showing throttling stats for every response received:
     85 #AUTOTHROTTLE_DEBUG = False
     86 
     87 # Enable and configure HTTP caching (disabled by default)
     88 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
     89 #HTTPCACHE_ENABLED = True
     90 #HTTPCACHE_EXPIRATION_SECS = 0
     91 #HTTPCACHE_DIR = 'httpcache'
     92 #HTTPCACHE_IGNORE_HTTP_CODES = []
     93 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
     94 
     95 # Use the scrapy-redis duplicate filter
     96 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
     97 # Use scrapy-redis's own scheduler
     98 SCHEDULER = "scrapy_redis.scheduler.Scheduler"
     99 # Whether pausing is allowed (the queue is kept in Redis so the crawl can resume)
    100 SCHEDULER_PERSIST = True
    101 
    102 
    103 
    104 
    105 REDIS_HOST = '192.168.12.65'
    106 REDIS_PORT = 6379
    107 #REDIS_ENCODING = 'utf-8'
    108 #REDIS_PARAMS = {'password':'123456'}
    settings.py

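    In the Redis configuration file, comment out line 56 (the bind 127.0.0.1 line) and set protected mode to no on line 75
    To run:
    1. Start the Redis server: go to the Redis directory and run redis-server ./redis.windows.conf in cmd
    2. Start the Redis client: redis-cli

    3. Run the spider file: in cmd, change into the project's redisPro\redisPro\spiders directory, then run
    scrapy runspider chouti.py, which stops and waits (listening) for a start URL

    4. In Redis (redis-cli):
    lpush chouti https://dig.chouti.com/r/news/hot/1 pushes the start URL; the cmd window running the project then starts crawling

    5. Inspect the scraped data in Redis
    keys * shows that chouti:items exists
    lrange chouti:items 0 -1

    To delete the data: in redis-cli
    run flushall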
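Each entry that lrange returns from chouti:items is one serialized item. Assuming the pipeline stores items as JSON strings (believed to be the default serialization of scrapy-redis's RedisPipeline), they can be decoded as sketched below. The two sample entries are invented, not real crawl output.

```python
import json

# Invented examples of what `lrange chouti:items 0 -1` might return.
raw_entries = [
    '{"url": "https://example.com/img/1.jpg"}',
    '{"url": "https://example.com/img/2.jpg"}',
]

image_urls = [json.loads(entry)['url'] for entry in raw_entries]
print(image_urls)
```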

      
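    Recap (18:40-50): answers to the review questions:
    1. Two crawler modules: requests and urllib
    2. What the robots protocol does: it keeps out the honest but not the determined; a common anti-crawling measure
    3. Use the Yundama service or manual recognition. Note: CAPTCHAs are also one of a portal site's anti-crawling measures
    4. Three parsing approaches: xpath, BeautifulSoup, regular expressions
    5. selenium (can execute JS code) with PhantomJS or headless Chrome
    6. Important! Encrypted data (the download is ciphertext) and crawling dynamically loaded data (Pear Video)
    token: the value of rkey at login
    7. Five: spider files, engine, scheduler, downloader, pipeline
    8. Spider / CrawlSpider / RedisCrawlSpider
    9. The 10 steps summarized above; try them yourself and keep the distributed example as a template
    10. Not covered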




    Wrap the content you want in parentheses
  • Original source: https://www.cnblogs.com/lijie123/p/10029627.html