  Crawling a whole site's images with Python's Scrapy framework and saving them locally (Meizitu)

    You can clone the full source code from GitHub.

    Github:https://github.com/williamzxl/Scrapy_CrawlMeiziTu

    Official Scrapy documentation (Chinese translation): http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

    Working through that documentation once is basically enough to learn how to use the framework.

    Step 1:

    Before you start crawling you must create a new Scrapy project. Change into the directory where you want the code to live and run:

    scrapy startproject CrawlMeiziTu

    This command creates a CrawlMeiziTu directory with the following contents:

    CrawlMeiziTu/
        scrapy.cfg
        CrawlMeiziTu/
            __init__.py
            items.py
            pipelines.py
            settings.py
            middlewares.py
            spiders/
                __init__.py
                ...

    Then change into the project and generate a spider:

    cd CrawlMeiziTu
    scrapy genspider Meizitu http://www.meizitu.com/a/list_1_1.html

    This command creates a spider called Meizitu under spiders/, so the project now looks like this:

    CrawlMeiziTu/
        scrapy.cfg
        CrawlMeiziTu/
            __init__.py
            items.py
            pipelines.py
            settings.py
            middlewares.py
            spiders/
                Meizitu.py
                __init__.py
                ...
    The files we will mainly edit are settings.py, items.py, pipelines.py, and spiders/Meizitu.py.


    main.py was added afterwards; it contains just two lines:

    from scrapy import cmdline

    cmdline.execute("scrapy crawl Meizitu".split())

    It is there purely to make running the spider convenient.
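
    If you would rather not go through cmdline, a sketch of an equivalent main.py using Scrapy's CrawlerProcess API (just an alternative, not what the repository does) looks like this:

    # alternative main.py sketch: run the spider in-process instead of via cmdline
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
    process.crawl("Meizitu")                           # spider name defined in MeizituSpider.name
    process.start()                                    # blocks until the crawl finishes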

    Step 2: Edit settings.py as shown below:
    BOT_NAME = 'CrawlMeiziTu'

    SPIDER_MODULES = ['CrawlMeiziTu.spiders']
    NEWSPIDER_MODULE = 'CrawlMeiziTu.spiders'
    ITEM_PIPELINES = {
        'CrawlMeiziTu.pipelines.CrawlmeizituPipeline': 300,
    }
    IMAGES_STORE = 'D://pic2'
    DOWNLOAD_DELAY = 0.3

    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    ROBOTSTXT_OBEY = True

    The main things to configure are USER_AGENT, the download path (IMAGES_STORE), and the download delay (DOWNLOAD_DELAY).
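
    Note that pipelines.py below imports IMAGES_STORE directly from settings.py. Scrapy also exposes the settings on the spider object, so a small sketch of the more conventional way to read them (just an alternative, not what this project uses; the pipeline name here is made up) would be:

    # sketch: read IMAGES_STORE through the settings API instead of importing settings.py
    class SomePipeline(object):
        def process_item(self, item, spider):
            store = spider.settings.get('IMAGES_STORE')  # same value as set in settings.py
            # ... download logic would go here ...
            return item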




    Step 3: Edit items.py.
    Items hold the information the spider scrapes. Since we are crawling Meizitu, we want to capture each image's name, its link, its tags, and so on:
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    class CrawlmeizituItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # title is used as the folder name
        title = scrapy.Field()
        url = scrapy.Field()
        tags = scrapy.Field()
        # links to the images
        src = scrapy.Field()
        # alt is used as the image file name
        alt = scrapy.Field()
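
    To make the field names concrete, here is a tiny sketch (with made-up values) of how parse_item() in Step 5 fills such an item:

    # sketch with hypothetical values: how the spider populates CrawlmeizituItem
    from CrawlMeiziTu.items import CrawlmeizituItem

    item = CrawlmeizituItem()
    item['title'] = ['Some album title']            # becomes the folder/file name prefix
    item['url'] = ['http://www.meizitu.com/a/example.html']
    item['tags'] = ['example tag']
    item['src'] = ['http://example.com/1.jpg']      # image links to download
    item['alt'] = ['some image name']               # becomes the image file name
    print(dict(item))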



    Step 4: Edit pipelines.py
    The pipeline processes the information carried by the items: for example, it builds folder and file names from the title, and downloads each image from its link.
    # -*- coding: utf-8 -*-
    import os
    import requests
    from CrawlMeiziTu.settings import IMAGES_STORE
    
    class CrawlmeizituPipeline(object):
    
        def process_item(self, item, spider):
            fold_name = "".join(item['title'])
            header = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
                'Cookie': 'b963ef2d97e050aaf90fd5fab8e78633',
                # the request needs this cookie, otherwise the downloaded images cannot be viewed
            }
            images = []
            # all images go into a single folder
            dir_path = '{}'.format(IMAGES_STORE)
            if not os.path.exists(dir_path) and len(item['src']) != 0:
                os.mkdir(dir_path)
            if len(item['src']) == 0:
                # record albums that produced no image links so they can be checked later
                with open('..//check.txt', 'a+') as fp:
                    fp.write("".join(item['title']) + ":" + "".join(item['url']))
                    fp.write("\n")
    
            for jpg_url, name, num in zip(item['src'], item['alt'],range(0,100)):
                file_name = name + str(num)
                file_path = '{}//{}.jpg'.format(dir_path, file_name)
                images.append(file_path)
                # skip images that were already downloaded on a previous run
                if os.path.exists(file_path):
                    continue

                with open(file_path, 'wb') as f:
                    req = requests.get(jpg_url, headers=header)
                    f.write(req.content)
    
            return item
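
    As a side note, IMAGES_STORE is also the setting used by Scrapy's built-in ImagesPipeline. If you do not need the custom cookie header or file-name scheme above, a minimal sketch of that route (it requires Pillow, assumes the item exposes the default image_urls field, and the item class name here is made up) would be:

    # settings.py (sketch): use the built-in image pipeline instead of the custom one
    # ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 300}
    # IMAGES_STORE = 'D://pic2'

    # items.py (sketch): the built-in pipeline looks for these field names by default
    import scrapy

    class MeizituImageItem(scrapy.Item):
        image_urls = scrapy.Field()   # list of image URLs to fetch
        images = scrapy.Field()       # filled in by the pipeline with download results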



    Step 5: Edit the spider itself, Meizitu.py.
    This is the most important part:
    # -*- coding: utf-8 -*-
    import scrapy
    from CrawlMeiziTu.items import CrawlmeizituItem
    #from CrawlMeiziTu.items import CrawlmeizituItemPage
    import time
    class MeizituSpider(scrapy.Spider):
        name = "Meizitu"
        #allowed_domains = ["meizitu.com/"]
    
        start_urls = []
        last_url = []
        with open('..//url.txt', 'r') as fp:
            crawl_urls = fp.readlines()
            for start_url in crawl_urls:
                last_url.append(start_url.strip('\n'))
        start_urls.append("".join(last_url[-1]))
    
    
        def parse(self, response):
            selector = scrapy.Selector(response)
            #item = CrawlmeizituItemPage()
    
            next_pages = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
            next_pages_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
            if '下一页' in next_pages_text:   # '下一页' means "next page"
                next_url = "http://www.meizitu.com/a/{}".format(next_pages[-2])
                # record the next list page in url.txt so a later run can resume from it
                with open('..//url.txt', 'a+') as fp:
                    fp.write('\n')
                    fp.write(next_url)
                    fp.write("\n")
                request = scrapy.http.Request(next_url, callback=self.parse)
                time.sleep(2)
                yield request
    
            # collect the link of every album on this list page
            all_info = selector.xpath('//h3[@class="tit"]/a')
            links = []
            for info in all_info:
                links.extend(info.xpath('./@href').extract())
            for link in links:
                request = scrapy.http.Request(link, callback=self.parse_item)
                time.sleep(1)
                yield request
    
            # next_link = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
            # next_link_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
            # if '下一页' in next_link_text:
            #     nextPage = "http://www.meizitu.com/a/{}".format(next_link[-2])
            #     item['page_url'] = nextPage
            #     yield item
    
        # scrape the details of each album
        def parse_item(self, response):
            item = CrawlmeizituItem()
            selector = scrapy.Selector(response)

            image_title = selector.xpath('//h2/a/text()').extract()
            image_url = selector.xpath('//h2/a/@href').extract()
            image_tags = selector.xpath('//div[@class="metaRight"]/p/text()').extract()
            if selector.xpath('//*[@id="picture"]/p/img/@src').extract():
                image_src = selector.xpath('//*[@id="picture"]/p/img/@src').extract()
            else:
                image_src = selector.xpath('//*[@id="maincontent"]/div/p/img/@src').extract()
            if selector.xpath('//*[@id="picture"]/p/img/@alt').extract():
                pic_name = selector.xpath('//*[@id="picture"]/p/img/@alt').extract()
            else:
                pic_name = selector.xpath('//*[@id="maincontent"]/div/p/img/@alt').extract()
            #//*[@id="maincontent"]/div/p/img/@alt
            item['title'] = image_title
            item['url'] = image_url
            item['tags'] = image_tags
            item['src'] = image_src
            item['alt'] = pic_name
            print(item)
            time.sleep(1)
            yield item
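
    One thing to watch: the spider reads its starting page from ..//url.txt (the same file parse() keeps appending the "next page" links to), so that file must exist and contain at least one list-page URL before the first run. A one-off seed script could look like this (the exact path is an assumption; adjust it so it matches the '..//url.txt' used above, relative to where you launch the crawl):

    # one-off seed sketch: write the first list page into url.txt before the first crawl
    with open('url.txt', 'w') as fp:
        fp.write('http://www.meizitu.com/a/list_1_1.html\n')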


  Original article: https://www.cnblogs.com/william126/p/6923017.html