  • GitHub project share: fetching HD desktop wallpapers with a Python crawler

    https://github.com/zhongjiajie/crawler-photo

    No preamble: the repository address comes first.

    This is my first crawler, so it is no match for what the experts write; I am sharing it purely for learning purposes.

    # -*- coding:utf-8 -*-
    '''
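    Multithreaded, guards against network stalls (with a timeout), presents itself as IE, no sleep between requests, records elapsed time, crawls the site's open image database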
    '''
    import requests
    import re
    import time
    from multiprocessing.dummy import Pool as ThreadPool
    
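    # Fetch the URLs of all images from the site's database and return them as a list
    def GetPictureUrl():
        # Request the site's database page and read the HTML it returns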
        url1 = 'http://www.socwall.com/images/wallpapers/'
        html1 = requests.get(url1).text
    
        # Use regular expressions to extract the image links
        content1 = re.search('<a href="/views/images/">(.*?)<a href="staging/">',html1,re.S).group(1)
        content2 = re.findall('<a href="(.*?)">',content1,re.S)
    
        # Build the list of full image URLs
        UrlList = []
        for each in content2:
            url2 = 'http://www.socwall.com/images/wallpapers/' + each
            UrlList.append(url2)
        return UrlList
    
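    # Download one image while presenting an IE User-Agent
    def DownloadPicture(url):
        try:
            head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'}
            html = requests.get(url,headers = head,timeout = 40)      # request as IE, give up after a 40s timeout
            with open(r'E://picture2//%s'%url[41:], 'wb+') as f:     # url[41:] is the file name after the 41-character base URL
                f.write(html.content)
        except Exception:
            # Record the failed URL so it can be retried later
            with open(r'failure_url.txt','a') as f:
                f.write(url + '\n')
            print('download ' + url + ' failure!')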
    
    if __name__ == '__main__':
        time1 = time.time()             # record the start time, time1
        UrlList = GetPictureUrl()       # collect the image URLs
    
        pool = ThreadPool(4)            # start four worker threads
        results = pool.map(DownloadPicture, UrlList)   # download in parallel
        pool.close()
        pool.join()
        time2 = time.time()             # record the end time, time2
        print('Total elapsed time: ' + str(time2 - time1))        # elapsed time in seconds
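
The download function above derives each file name with `url[41:]`, which only works while the base URL stays exactly 41 characters long. As a hedged alternative (not the author's code), the sketch below takes the last path component of the URL instead. It assumes Python 3, and the helper name `filename_from_url` and the sample file name are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url):
    """Return the last path component of a URL, ignoring query and fragment.

    Hypothetical helper: it replaces the fixed-offset slice url[41:].
    """
    path = urlsplit(url).path              # e.g. '/images/wallpapers/name.jpg'
    return posixpath.basename(unquote(path))


if __name__ == '__main__':
    # The file name below is invented for the example.
    sample = 'http://www.socwall.com/images/wallpapers/sample-1920x1080.jpg'
    print(filename_from_url(sample))
```

Because the name comes from the URL's path rather than from a character offset, the helper keeps working if the host or directory ever changes.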

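    Features: multithreading, a timeout to guard against network stalls, an IE User-Agent, no sleep between requests, elapsed-time logging, and crawling the site's open image database.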
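
To show the thread pool and the timing in isolation, here is a minimal sketch that runs without network access. It assumes a stand-in worker, `fake_download`, that only sleeps briefly; that function, `run_pool`, and the example.com URLs are hypothetical and not part of the original script.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_download(url):
    # Stand-in for a real network request: wait briefly, then report success.
    time.sleep(0.1)
    return (url, True)


def run_pool(urls, workers=4):
    """Map fake_download over urls with a thread pool and time the whole run."""
    start = time.time()
    pool = ThreadPool(workers)
    try:
        results = pool.map(fake_download, urls)   # results keep the input order
    finally:
        pool.close()
        pool.join()
    return results, time.time() - start


if __name__ == '__main__':
    urls = ['http://example.com/%d.jpg' % i for i in range(8)]
    results, elapsed = run_pool(urls)
    print('processed %d items in %.2fs' % (len(results), elapsed))
```

With four workers, the eight simulated downloads should finish in roughly two rounds rather than eight, which is the benefit the original script gets from `ThreadPool(4)` while it waits on network I/O.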

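    PS: I noticed that www.socwall.com keeps its images in an openly accessible database, so I simply used it as is: the script crawls http://www.socwall.com/images/wallpapers/ directly. In that sense it is still a very simple crawler.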
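
The link-extraction step can be tried offline. The sketch below is an illustration under assumptions, not the author's method: it filters links by image extension instead of slicing the page between two marker links, it assumes Python 3, and the `SAMPLE_HTML` listing is invented to resemble a directory index page.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://www.socwall.com/images/wallpapers/'

# Invented sample of what a directory index page might look like.
SAMPLE_HTML = '''
<a href="/views/images/">Parent Directory</a>
<a href="one-1920x1080.jpg">one-1920x1080.jpg</a>
<a href="two-1920x1200.png">two-1920x1200.png</a>
<a href="staging/">staging/</a>
'''


def extract_image_urls(html, base_url=BASE_URL):
    """Return absolute URLs for every link that ends in an image extension."""
    hrefs = re.findall(r'<a href="([^"]+)">', html)
    image_exts = ('.jpg', '.jpeg', '.png')
    return [urljoin(base_url, h) for h in hrefs if h.lower().endswith(image_exts)]


if __name__ == '__main__':
    for url in extract_image_urls(SAMPLE_HTML):
        print(url)
```

Filtering by extension skips the parent-directory and `staging/` links without relying on their exact position in the page, at the cost of missing images with other extensions.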

  • Original post: https://www.cnblogs.com/zhongjiajie/p/5215048.html