  • A Beginner's First Crawler: Scraping Mzitu [Updated]

    I got interested in crawlers recently and followed http://cuiqingcai.com/3179.html as a reference. It mostly worked, but every image it downloaded was corrupt and wouldn't open, which is not what I wanted. Opening the image URLs directly in the browser, ones I had already viewed loaded fine while fresh ones did not, so I suspected cookies were involved; the requests did indeed carry two cookies, but I could never trace where they came from. I nearly gave up and thought about crawling some other site, but I couldn't let the girls go, so I kept at it for another half day in the browser's network inspector, checking parameters and trying things, until I found the trick: the image request needs one extra header. See the code for the details. My first run also had its connections cut off part-way through, so I added random User-Agent rotation as well. Without further ado, the code:
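    The discovery boils down to two request headers. Here is a minimal sketch (the function and list names are mine, not from the code below) of how the headers are assembled — the Referer must be the gallery page the image is embedded in, and the User-Agent is picked at random:

```python
import random

# Trimmed User-Agent pool; the full script below carries a longer list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

def build_headers(page_url):
    """Build image-request headers: random User-Agent plus the gallery
    page URL as Referer — without the Referer the server rejects the image."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": page_url,
    }

headers = build_headers("http://www.mzitu.com/12345/3")
print(headers["Referer"])  # http://www.mzitu.com/12345/3
```

    Without the `Referer` key the server serves a placeholder instead of the real image, which is exactly the "corrupt download" symptom described above.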

    import requests
    from bs4 import BeautifulSoup
    import os
    import random
     
     
    class mzitu():
     
        def all_url(self, url):
            html = self.request(url)
            all_a = BeautifulSoup(html.text, 'lxml').find('div', class_='all').find_all('a')
            for a in all_a:
                title = a.get_text()
                if title == '早期图片':  ## skip the "early pictures" archive page
                    continue
                print('Saving:', title)
                path = str(title).replace("?", '_')
                if not self.mkdir(path): ## skip folders that already exist
                    print('Already exists, skipping:', title)
                    continue
                href = a['href']
                self.html(href)
        def html(self, href):
            html = self.request(href)
            max_span = BeautifulSoup(html.text, 'lxml').find('div', class_='pagenavi').find_all('span')[-2].get_text()
            for page in range(1, int(max_span) + 1):
                page_url = href + '/' + str(page)
                self.img(page_url)
     
        def img(self, page_url):
            img_html = self.request(page_url)
            img_url = BeautifulSoup(img_html.text, 'lxml').find('div', class_='main-image').find('img')['src']
            self.save(img_url, page_url)
     
        def save(self, img_url, page_url):
            name = img_url[-9:-4] ## crude file name: five chars from the tail of the image URL
            try:
                img = self.requestpic(img_url, page_url)
                with open(name + '.jpg', 'wb') as f:
                    f.write(img.content)
            except FileNotFoundError: ## catch the error and keep crawling
                print('Image missing, skipped:', img_url)
                return False
     
        def mkdir(self, path): ## create the folder for one gallery
            path = path.strip()
            isExists = os.path.exists(os.path.join("D:\\mzitu", path))
            if not isExists:
                print('Created a folder named', path)
                os.makedirs(os.path.join("D:\\mzitu", path))
                os.chdir(os.path.join("D:\\mzitu", path)) ## switch into it so images save there
                return True
            else:
                print('A folder named', path, 'already exists!')
                return False
     
        def requestpic(self, url, Referer): ## fetch an image response, sending Referer and a random User-Agent
            user_agent_list = [ 
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", 
                "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", 
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", 
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", 
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", 
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", 
                "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", 
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 
                "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", 
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", 
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 
                "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", 
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", 
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
            ]
            ua = random.choice(user_agent_list)
            headers = {'User-Agent': ua, "Referer": Referer} ## the Referer header is the key addition over the previous version
            content = requests.get(url, headers=headers)
            return content
     
        def request(self, url): ## fetch a page response and return it
            headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
            content = requests.get(url, headers=headers)
            return content
     
    Mzitu = mzitu() ## instantiate
    Mzitu.all_url('http://www.mzitu.com/all') ## entry point: pass the list-of-all-galleries URL to start the crawl
    print('All downloads finished!')
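    Since my first run was still cut off mid-way even with the rotating User-Agent, a generic retry wrapper with exponential backoff could make the crawl more resilient. This is a sketch of my own (not part of the script above); `fetch` stands in for any callable that raises on a failed request, e.g. a `requests.get` wrapper:

```python
import time
import random

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on any exception.

    Re-raises the last exception once all attempts are exhausted."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            # Back off 1x, 2x, 4x ... the base delay, plus a little jitter.
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

# Demo with a flaky stand-in that fails twice before succeeding:
calls = {"n": 0}
def flaky_get(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection reset")
    return "ok"

print(fetch_with_retry(flaky_get, "http://example.com", base_delay=0.01))  # ok
```

    Wrapping the `requestpic` call this way would let a single dropped connection cost one retry instead of killing the whole run.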

    Thanks to those who shared their work before me, thanks to http://cuiqingcai.com/, and thanks to the mzitu webmaster for providing so many quality galleries.

     
  • Original post: https://www.cnblogs.com/wangbg/p/7282543.html