  • Using Scrapy to download all images under a Zhihu question

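    Preface:

      1. I only want to download the images. I hold no copyright over images uploaded by other people; I download them for my own enjoyment, for example as phone wallpapers, and not for commercial use.

      2. Crawlers go stale quickly, so note that this code was written on 2019-02-13.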

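    1. About crawling Zhihu

      In theory, anything you can reach on the web can be crawled; the only questions are the site's anti-crawler measures and how complicated the crawl is. Zhihu content is roughly questions plus answers (I have only just started using it, so this is my working understanding). There are two typical ways to reach the content: <1> log in --> open the home page --> click a question in the home page feed --> read the question and its answers --> read the comments, or <2> find a question through Baidu --> read the question and its answers. On the web version the second route does not require logging in. Accordingly, there are two crawling goals, each with its own method:

      1.1. Start from the Zhihu home page and crawl all questions (or questions of one type) together with their answers (and comments)

        This requires simulating the login, then visiting questions from the home page and fetching answers and comments from each question URL. An article on simulating the Zhihu login (https://blog.csdn.net/sinat_34200786/article/details/78449499) covers this, but these products (including their anti-crawler measures) keep changing, so you still have to analyze the current site yourself.

      1.2. Crawl the answers under one specific question

        At the time of writing this works without logging in. I tried the method above first and then found that my goal of downloading images does not need a login at all; I had simply gotten one small detail wrong earlier.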

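    2. The Scrapy project

      This project was also a chance to brush up on Scrapy.

      Analyzing the browser's network traffic shows that the data I want is JSON returned by the following URL: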

        https://www.zhihu.com/api/v4/questions/309298287/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset=3&limit=1&sort_by=default&platform=desktop

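       The URL carries many parameters; the important ones are offset and limit. The request headers the browser sends look somewhat complicated, but I later found that setting a basic "User-Agent" is about enough, since no login is needed. I use Scrapy's built-in ImagesPipeline, because I am not interested in anything except the images. The code follows: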
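
      The offset/limit paging described above can be sketched as follows. This is a minimal illustration, not the project's code: the helper name page_urls is hypothetical, and the shortened include value "data[*].content" is an assumption (the original URL requests many more fields).

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above; the helper itself is hypothetical.
API = "https://www.zhihu.com/api/v4/questions/{qid}/answers"


def page_urls(qid, total, page_size=3, include="data[*].content"):
    """Yield one API URL per page of `page_size` answers."""
    for offset in range(0, total, page_size):
        query = urlencode({
            "include": include,      # assumption: a shortened field list
            "offset": offset,        # index of the first answer on this page
            "limit": page_size,      # number of answers per page
            "sort_by": "default",
            "platform": "desktop",
        })
        yield API.format(qid=qid) + "?" + query


urls = list(page_urls(309298287, total=7))
print(len(urls))  # 3 pages, with offsets 0, 3 and 6
print(urls[0])
```

      Stepping the offset by the page size means each answer is requested exactly once, whatever the total is.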

      2.1. items.py is simple: it only stores the image URLs, as a list such as ['xxx.jpg','yyy.jpg']

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class Img584294770Item(scrapy.Item):
        # list of image URLs; ImagesPipeline reads it through IMAGES_URLS_FIELD
        imgs = scrapy.Field()

    items.py

       2.2. Img584294770.py is the spider file; it defines the crawl

    # -*- coding: utf-8 -*-
    # Author: lwx
    # 21090111: I want to download the images in the answers to the Zhihu question with id 584294770
    import json
    import re

    import requests
    import scrapy
    from scrapy import Spider

    from zhihu.items import Img584294770Item


    class Img584294770(Spider):
        name = 'Img584294770'
        start_urls = [
            'https://www.zhihu.com/',
            'https://www.zhihu.com/api/v4/questions/309298287/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,is_labeled;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&offset={offset}&limit={limit}&sort_by=default&platform=desktop'
        ]
        # Request headers
        headers = {
            'Accept': '*/*',
            # 'Accept-Encoding': 'gzip, deflate, br',  # do not set this: the response comes back garbled
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36',
            'x-requested-with': 'fetch',
        }

        # Build the requests
        def start_requests(self):
            # Fetch a single answer first, only to read the total number of answers
            url = self.start_urls[1].format(offset=0, limit=1)
            res = requests.get(url, headers=self.headers)
            data = json.loads(res.text)
            if 'data' in data:  # the API returned data
                total = data['paging']['totals']
                # Request the answers three at a time; stepping by 3 avoids empty pages
                for offset in range(0, total, 3):
                    url = self.start_urls[1].format(offset=offset, limit=3)
                    yield scrapy.Request(url=url, callback=self.parse_imgs, headers=self.headers)

        # Collect the image URLs of every answer on the page
        def parse_imgs(self, response):
            res = json.loads(response.text)
            if 'data' in res:  # the API returned data
                for d in res['data']:
                    item = Img584294770Item()
                    item['imgs'] = self.get_imgs(d['content'])
                    yield item

        # Extract image URLs of the form src="xxx.jpg" or src="xxx.png" from a string
        def get_imgs(self, content):
            # \s keeps attributes such as data-src from matching
            imgs_url_list = re.findall(r'\ssrc="(.*?)"', content)
            return [u for u in imgs_url_list if u.split('.')[-1] in ('jpg', 'png')]
    Img584294770.py
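
      The extraction step can be tried without touching the network. The sketch below runs a src pattern with a leading whitespace match over a made-up response; the JSON sample, the pic.example.com URLs, and the helper name extract_images are all hypothetical, and only the data/content layout follows the API response described above.

```python
import json
import re

# Hypothetical sample shaped like the API response (a "data" list whose
# entries carry the answer HTML in "content").
sample = json.dumps({
    "paging": {"totals": 1},
    "data": [{
        "content": '<p>hi</p><img src="https://pic.example.com/a.jpg" data-src="x">'
                   '<img src="data:image/svg+xml;utf8,<svg/>">'
                   '<img class="c" src="https://pic.example.com/b.png">'
    }],
})

# \s requires whitespace before "src", so "data-src" is not matched.
IMG_SRC = re.compile(r'\ssrc="(.*?)"')


def extract_images(body):
    """Return the .jpg/.png URLs found in every answer of a JSON body."""
    urls = []
    for answer in json.loads(body).get("data", []):
        for url in IMG_SRC.findall(answer["content"]):
            if url.rsplit(".", 1)[-1] in ("jpg", "png"):
                urls.append(url)
    return urls


print(extract_images(sample))
# ['https://pic.example.com/a.jpg', 'https://pic.example.com/b.png']
```

      The extension filter is what drops inline placeholders such as data: URIs, which also sit in a src attribute.
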
      2.3. settings.py sets the required parameters
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for zhihu project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'zhihu'
    
    SPIDER_MODULES = ['zhihu.spiders']
    NEWSPIDER_MODULE = 'zhihu.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'zhihu (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False  # False means robots.txt is ignored, so content the site disallows is crawled too
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 3  # add a delay between requests
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'zhihu.middlewares.ZhihuSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'zhihu.middlewares.ZhihuDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
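    import os
    IMAGES_EXPIRES = 90  # expiry in days; images downloaded within this period are not downloaded again

    IMAGES_URLS_FIELD = "imgs"  # name of the item field that holds the image URLs
    project_dir = os.path.abspath(os.path.dirname(__file__))
    IMAGES_STORE = os.path.join(project_dir, 'images')  # folder where the images are stored

    ITEM_PIPELINES = {
        # 'scrapy.contrib.pipeline.images.ImagesPipeline' is the pre-1.0 path; current Scrapy uses the one below
        'scrapy.pipelines.images.ImagesPipeline': 200,
        #'zhihu.pipelines.ZhihuPipeline': 300,
        'zhihu.pipelines.Img584294770Pipeline': 300,
    }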
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    settings.py
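
      With these settings the images land under IMAGES_STORE. As far as I know, the stock ImagesPipeline names each file full/<SHA-1 of the image URL>.jpg and converts the image to JPEG. The sketch below only reproduces that naming convention; the helper default_image_path is hypothetical, the example URL is made up, and the result should be checked against the installed Scrapy version.

```python
import hashlib


def default_image_path(url):
    """Relative path (below IMAGES_STORE) assumed for a downloaded image."""
    guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(guid)


path = default_image_path("https://pic.example.com/a.png")
print(path)  # full/<40 hex characters>.jpg, even though the source was a .png
```

      Because the name depends only on the URL, the same image URL always maps to the same file, which is what lets IMAGES_EXPIRES skip recent downloads.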

     3. Summary

      3.1. 'utf-8' errors: save every file you use (items, spider or pipeline) as UTF-8. This problem makes me sick every time I see it, yet I never manage to remember it.

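      3.2. "ROBOTSTXT_OBEY = False": the settings.py generated for a new project sets this value to True, which means the crawler obeys robots.txt. If you leave it that way, you will find that your crawler clearly went in and looked around but, like a perfect gentleman, touched nothing: you get a 200 but none of the data you wanted. True is the mode that search engines normally use; crawlers like this one are unwelcome to the site owner and go beyond what robots.txt permits.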
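
      What "obeying robots.txt" means can be illustrated with the standard library. The rules below are hypothetical and are not Zhihu's actual robots.txt; the user agent name "mybot" and the example.com URLs are likewise made up. The sketch only shows how a rule set decides whether a URL may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, NOT the real file of any site.
rules = """
User-agent: *
Disallow: /api/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

api_allowed = parser.can_fetch("mybot", "https://www.example.com/api/v4/questions/1/answers")
page_allowed = parser.can_fetch("mybot", "https://www.example.com/question/1")
print(api_allowed)   # False: the path starts with the disallowed /api/
print(page_allowed)  # True: only the Allow rule matches
```

      A crawler that obeys such rules would fetch robots.txt successfully and then skip the disallowed API request, which matches the symptom described above.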

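      3.3. This small project dragged on for three days. The cause was not that I had half forgotten the Scrapy workflow; it was mainly that I was watching Zhihu experts show off under the pretext of analyzing the page (not really): it was the encoding problem. Even though the fix turned out to be simply not setting 'Accept-Encoding':'gzip, deflate, br' in the headers, encoding problems gave me a thorough beating once again, and because I have yet again forgotten how encodings work I now have to go back and reread the book.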
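
      One plausible explanation, which the post itself does not confirm, is that advertising br invites a Brotli-compressed reply that the client cannot decode unless a Brotli package is installed, so the compressed bytes get read as text. The sketch below shows the same effect with gzip instead of Brotli, because gzip ships with the standard library; the payload is made up.

```python
import gzip
import json

# Made-up JSON payload standing in for an API response.
payload = json.dumps({"data": [{"content": "图片"}]}, ensure_ascii=False).encode("utf-8")
compressed = gzip.compress(payload)

# Treating the still-compressed bytes as text yields mojibake.
garbled = compressed.decode("utf-8", errors="replace")

# Decompressing first and decoding afterwards restores the JSON.
restored = gzip.decompress(compressed).decode("utf-8")

print(repr(garbled[:20]))
print(restored)
```

      So "garbled" output here is usually a missing decompression step rather than a wrong character encoding.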

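    More than 6,000 images so far... Downloading them by right-clicking would be a real chore, yet the ones I actually like number only a hundred or so... so it is still a chore, because the code cannot pick them out according to your taste...

    References: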

    https://blog.csdn.net/xwbk12/article/details/79009995

    https://blog.csdn.net/sinat_34200786/article/details/78449499

    The deeper you look, the more you find how vast the world is and how shallow your understanding of it; always stay humble.
  • Original article: https://www.cnblogs.com/liwxmyself/p/10369543.html