  • Python multiprocessing: what it feels like to download two hundred images in one minute

    I needed to crawl a site in China that bans IPs, so going through proxies was the only option. I built my own proxy pool, maintained by 20 processes.
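
    The proxy pool itself is not shown in this post. As a rough illustration only, here is a minimal sketch of what the refresher side might look like, assuming each worker keeps one proxy in Redis under the keys "1" through "10" that the download code later reads with r.get; fetch_proxy_from_vendor and the Redis connection parameters are made-up placeholders, not the author's actual code.

    # coding:utf-8
    # Hypothetical proxy-pool refresher (not part of the original post): each
    # worker keeps one Redis key ("1".."10") filled with a live "ip:port" proxy.
    import time
    import multiprocessing

    import redis

    r = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)

    def fetch_proxy_from_vendor():
        # Placeholder: return an "ip:port" string from whatever proxy source is used.
        return "1.2.3.4:8888"

    def refresh_slot(slot):
        # Keep one pool slot fresh forever.
        while True:
            r.set(str(slot), fetch_proxy_from_vendor())
            time.sleep(60)   # swap in a new proxy every minute

    if __name__ == "__main__":
        pool = multiprocessing.Pool(10)
        for slot in range(1, 11):
            pool.apply_async(refresh_slot, (slot,))
        pool.close()
        pool.join()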

    The connection is a 20 Mbit line, which in practice gives about 2.5 MB/s; for various reasons the real throughput is often lower than that. Crawling images is quite demanding on bandwidth.

    First, scrape the image URLs and save them to a database; then download the images with a pool of worker processes.
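
    The select_data helper used at the bottom of the script below is imported from common.contest and is not shown in the post. Here is a minimal sketch of what such a helper might look like, assuming the URLs sit in a MySQL table and each row comes back as a dict (the worker reads item['item_imgurl']); the pymysql connection parameters are placeholders.

    # coding:utf-8
    # Hypothetical stand-in for the select_data helper from common.contest.
    import pymysql

    def select_data(sql):
        # Run a SELECT and return every row as a dict,
        # e.g. {'item_id': 1, 'item_imgurl': 'http://...'}.
        conn = pymysql.connect(host="127.0.0.1", user="root", password="xxxxxx",
                               db="spider", charset="utf8",
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cursor:
                cursor.execute(sql)
                return cursor.fetchall()
        finally:
            conn.close()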

    In testing, 40 processes downloaded about 200 images per minute, but with 60 processes throughput dropped to roughly 120 images per minute.

    Note: when downloading images or videos, always send request headers (in particular Accept-Encoding) so the server can use compressed transfer.

    Here is the full code:

    # coding:utf-8
    # common.contest provides the shared imports used below (requests, os, random,
    # time, multiprocessing), the proxy-pool client r, and the select_data helper.
    from common.contest import *
    
    def save_img(source_url, dir_path, file_name, maxQuests=11):
    
        headers = {
            "Host": "img5.artron.net",
            "Connection": "keep-alive",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
            "Accept": "image/webp,image/apng,image/*,*/*;q=0.8",
            "Referer": "http://auction.artron.net/paimai-art5113610001/",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.8",
        }
        # Pick a random proxy from the proxy pool (keys "1" to "10" read via r).
        proxies = r.get(str(random.randint(1, 10)))
        proxies = {"http": "http://" + str(proxies)}
        print "Using proxy:", proxies
        try:
            response = requests.get(url=source_url, headers=headers, verify=False,
                                    proxies=proxies, timeout=15)
            if response.status_code == 200:
                if not os.path.exists(dir_path):
                    os.makedirs(dir_path)
                total_path = dir_path + '/' + file_name

                # Stream the body to disk in 1 KB chunks.
                with open(total_path, 'wb') as f:
                    for chunk in response.iter_content(1024):
                        f.write(chunk)
                print "Image saved locally"
                return "1"
            else:
                print "Image was not saved"
                return "0"
        except Exception as e:
            # The request itself failed (response may not exist here),
            # so retry with one fewer attempt left.
            print e
            if maxQuests > 0:
                return save_img(source_url, dir_path, file_name, maxQuests - 1)
            return "0"
    
    
    
    
    def getUpdataImage(item):
    
        url = item['item_imgurl']
        print "Downloading url:", url
    
        # Build a file name from the last four path segments of the image URL.
        filename = "_".join(url.split('/')[-4:])
    
        # Strip the image extension to get the directory name.
        filenamestr = filename
        for ext in ('.jpg', '.JPG', '.JPEG', '.jpeg', '.png', '.bmp', '.tif', '.gif'):
            filenamestr = filenamestr.replace(ext, '')
    
        localpath = 'G:/helloworld/' + filenamestr
    
        save_localpath = localpath + "/" + filename
        print "图片保存路径是:",save_localpath
    
    
        try:
            result = save_img(url, localpath, filename)
    
            if result == "1":
                print "Image downloaded successfully"
            else:
                print "Image download failed"
    
        except IOError:
            pass
    
    
    
    if __name__ == "__main__":
    
        time1 = time.time()
        sql = """SELECT item_id, item_imgurl FROM 2017_xia_erci_pic  """
        resultList = select_data(sql)
        print len(resultList)
        # Note: in the test above, 40 worker processes were faster than 60.
        pool = multiprocessing.Pool(60)
        for item in resultList:
            pool.apply_async(getUpdataImage, (item,))
        pool.close()
        pool.join()
  • Original post: https://www.cnblogs.com/xuchunlin/p/7615590.html