  • A Python 3 crawler for jandan.net (煎蛋网)

    My first crawler ran straight into a tough target: either I didn't set the HTTP headers up properly, or I fired off too many requests, but either way I couldn't crawl the jandan.net girl-picture pages in a loop...

    Still, it felt a little wicked... the technique itself is innocent~~~

    The few pages of pictures I did pull down were quite pleasing to the eye~

    import re
    import time
    import urllib.request

    k = 1  # page counter

    def read_url(url, k):
        # Pretend to be a regular browser; without these headers jandan.net refuses the request.
        user_agent = ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
        headers = {'User-Agent': user_agent,
                   'Referer': 'http://jandan.net/ooxx/page-2406'}
        # cok = {"Cookie": "_ga=GA1.2.1842145399.1491574879; Hm_lvt_fd93b7fb546adcfbcf80c4fc2b54da2c=1491574879; Hm_lpvt_fd93b7fb546adcfbcf80c4fc2b54da2c=1491575669"}
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req)
        image_d(resp.read().decode('utf-8'), k)

    def image_d(data, k):
        print('Crawling images on page %d' % k)
        # Save directory; a raw string keeps the Windows backslashes intact.
        dirct = r'C:\Users\eexf\Desktop\jiandan'
        pattern = re.compile('<img src="(.*?)" /></p>')
        res = re.findall(pattern, data)

        for i in res:
            img_url = 'http:' + i                      # the matched src is protocol-relative
            img_data = urllib.request.urlopen(img_url).read()
            name = re.split('/', img_url)[-1]          # last path segment becomes the file name
            path = dirct + '/' + name
            with open(path, 'wb') as f:
                f.write(img_data)
        print('Page finished')

    if __name__ == '__main__':
        url = 'http://jandan.net/ooxx/page-2406'  # append a page number here to crawl other pages
        read_url(url, k)
        k += 1
        # time.sleep(3)  # the intended delay between pages

    The basic structure is: request the page source --> match the image addresses with a regular expression and return them as a list --> download every address in the list and write each image into a folder.
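
    For reference, here is a minimal sketch of the multi-page loop that the commented-out bits (`#+str(i)` and `# time.sleep(3)`) were aiming at. The `page-<n>` URL pattern and the 3-second pause come from the original code; the page range and the loop itself are assumptions, and jandan.net may still block repeated requests:

    import time
    import urllib.request

    # Hypothetical paginated crawl: walk a few pages backwards from 2406,
    # pausing between requests to stay under the site's rate limits.
    BASE = 'http://jandan.net/ooxx/page-'
    HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}

    for page in range(2406, 2401, -1):   # five pages, newest first (assumed range)
        req = urllib.request.Request(BASE + str(page), headers=HEADERS)
        html = urllib.request.urlopen(req).read().decode('utf-8')
        # hand the page source to image_d() from the script above:
        # image_d(html, page)
        time.sleep(3)                    # polite delay between pages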

    The code is messy; this first crawler is kept here purely as a keepsake~~~

    Bonus pictures:

  • Original post: https://www.cnblogs.com/jokerspace/p/6685114.html