  • A Python 3 scraper for jandan.net (煎蛋网)

    My very first scraper ran straight into a tough nut: either my HTTP headers weren't set up right, or I simply hit the site too many times, but either way I couldn't loop-scrape the girl-pic (妹子图) pages on jandan.net...

    Still, it felt pleasantly naughty... the technology itself is innocent~~~

    The few pages of photos I did manage to grab were quite easy on the eyes~

    import re
    import urllib.request

    # Page counter, used only for the progress message.
    k = 1

    def read_url(url, k):
        # A browser-like User-Agent plus a Referer; without them jandan.net
        # refuses the request.
        user_agent = ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
        headers = {
            'User-Agent': user_agent,
            'Referer': 'http://jandan.net/ooxx/page-2406',
        }
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req)
        image_d(resp.read().decode('utf-8'), k)

    def image_d(data, k):
        print('Scraping images from page %d' % k)
        # Target folder; a raw string keeps the backslashes from being
        # read as escape sequences.
        dirct = r'C:\Users\eexf\Desktop\jiandan'
        pattern = re.compile('<img src="(.*?)" /></p>')
        res = re.findall(pattern, data)

        for i in res:
            # The matched src is protocol-relative, so prepend the scheme.
            j = 'http:' + i
            data1 = urllib.request.urlopen(j).read()
            # The last path segment serves as the file name.
            name = re.split('/', j)[-1]
            path = dirct + '/' + name
            with open(path, 'wb') as f:
                f.write(data1)
        print('Page done')

    if __name__ == '__main__':
        # Only a single page here; the loop sketched below handles more.
        url = 'http://jandan.net/ooxx/page-2406'
        read_url(url, k)
        k += 1
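    The commented-out `#+str(i)` and `time.sleep(3)` show where a multi-page loop was meant to go. A minimal sketch of that loop, with a pause between pages so the site is less likely to block repeated requests (the page range here is made up for illustration):

    import time

    if __name__ == '__main__':
        for page in range(2406, 2400, -1):  # hypothetical range of page numbers
            url = 'http://jandan.net/ooxx/page-' + str(page)
            read_url(url, page)
            time.sleep(3)  # pause between pages to avoid being blocked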

    The basic structure is: request the page source --> match the image URLs with a regular expression, getting back a list --> download every URL in the list and write the files into one folder.
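    The requests library is imported above but never used; the same three steps fit naturally into it. A compact sketch under the same assumptions (same regex; the output folder name is hypothetical):

    import os
    import re
    import requests

    def scrape_page(url, out_dir='jiandan'):  # out_dir is a hypothetical folder name
        headers = {'User-Agent': 'Mozilla/5.0',
                   'Referer': 'http://jandan.net/ooxx/'}
        os.makedirs(out_dir, exist_ok=True)
        html = requests.get(url, headers=headers).text      # 1. request page source
        srcs = re.findall('<img src="(.*?)" /></p>', html)  # 2. regex -> list of URLs
        for src in srcs:                                    # 3. write files to folder
            img = requests.get('http:' + src, headers=headers).content
            with open(os.path.join(out_dir, src.split('/')[-1]), 'wb') as f:
                f.write(img)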

    The code is messy; my first scraper is kept here as a memento~~~

    Bonus pics attached:
