I needed to crawl a Chinese website that bans IPs, so there was no way around using proxies. I built my own proxy pool, maintained by 20 processes.
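For reference, here is a minimal sketch of how the pool is read on the consumer side. It assumes the pool's refresher processes (not shown) keep "host:port" strings under the numeric Redis keys "1" through "10", which is the key scheme the download script below relies on; the connection parameters are placeholders.

import random
import redis

# Assumed Redis instance holding the proxy pool; host/port are placeholders.
r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

def get_random_proxy():
    # Pick one of the ten slots at random, mirroring the download script below.
    proxy = r.get(str(random.randint(1, 10)))
    return {"http": "http://" + str(proxy)}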
The connection is a 20 Mbit line, which works out to roughly 2.5 MB/s of real throughput, though for various reasons it doesn't always reach that. Crawling puts a fairly heavy demand on bandwidth.
First, scrape the image URLs and save them to the database (see the sketch below); then use multiprocessing to download the images.
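A hypothetical sketch of that first stage, inserting scraped image URLs into the table the downloader later reads. Only the table and column names (2017_xia_erci_pic, item_id, item_imgurl) come from the main script's SELECT; the connection details and the helper itself are assumptions.

import pymysql

# Placeholder credentials; adjust to your environment.
conn = pymysql.connect(host='127.0.0.1', user='root', password='root',
                       db='spider', charset='utf8')

def save_img_url(item_id, item_imgurl):
    # Store one scraped image URL for the download stage to pick up.
    with conn.cursor() as cursor:
        cursor.execute(
            "INSERT INTO 2017_xia_erci_pic (item_id, item_imgurl) VALUES (%s, %s)",
            (item_id, item_imgurl))
    conn.commit()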
In testing, 40 processes collected about 200 images per minute, but raising it to 60 processes dropped the rate to about 120 per minute: once the line is saturated, extra processes only add contention.
Note: when downloading images or videos, always send request headers, in particular Accept-Encoding, so the server delivers the body compressed.
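Isolated from the full script, the header fields that matter look like this; gzip/deflate in Accept-Encoding lets the server compress the response, and requests decompresses it transparently. The URL variable here is a placeholder.

import requests

headers = {
    "Accept": "image/webp,image/apng,image/*,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",   # ask for a compressed body
}
resp = requests.get(image_url, headers=headers, timeout=15)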
The full code is pasted below:
# coding:utf-8
# common.contest is the project's shared module; it is assumed to provide
# requests, random, os, time, multiprocessing, the Redis client `r`,
# and the select_data() helper used below.
from common.contest import *


def save_img(source_url, dir_path, file_name, maxQuests=11):
    headers = {
        "Host": "img5.artron.net",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
        "Accept": "image/webp,image/apng,image/*,*/*;q=0.8",
        "Referer": "http://auction.artron.net/paimai-art5113610001/",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8",
    }
    # Pick a random proxy from the Redis-backed pool (keys "1".."10").
    proxies = r.get(str(random.randint(1, 10)))
    proxies = {"http": "http://" + str(proxies)}
    print "Using proxy:", proxies
    try:
        response = requests.get(url=source_url, headers=headers, verify=False,
                                proxies=proxies, timeout=15)
        if response.status_code == 200:
            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            total_path = dir_path + '/' + file_name
            # Stream the body to disk in 1 KB chunks.
            with open(total_path, 'wb') as f:
                for chunk in response.iter_content(1024):
                    f.write(chunk)
            print "Image saved locally"
            return "1"
        else:
            print "Image not saved"
            return "0"
    except Exception as e:
        print e
        # On failure, retry through another proxy until the budget runs out.
        if maxQuests > 0:
            return save_img(source_url, dir_path, file_name, maxQuests - 1)
        return "0"


def getUpdataImage(item):
    url = item['item_imgurl']
    print "Crawling url:", url
    # Build a unique file name from the last four path segments of the URL.
    filenamelist = url.split('/')
    filename = "_".join(filenamelist[-4:])
    # Strip the extension to get the directory name.
    filenamestr = filename
    for ext in ('.jpg', '.JPG', '.JPEG', '.jpeg', '.png', '.bmp', '.tif', '.gif'):
        filenamestr = filenamestr.replace(ext, '')
    localpath = 'G:/helloworld/' + filenamestr
    save_localpath = localpath + "/" + filename
    print "Image will be saved to:", save_localpath
    try:
        result = save_img(url, localpath, filename)
        if result == "1":
            print "Image collected successfully"
        else:
            print "Image collection failed"
    except IOError:
        pass


if __name__ == "__main__":
    time1 = time.time()
    sql = """SELECT item_id, item_imgurl FROM 2017_xia_erci_pic"""
    resultList = select_data(sql)
    print len(resultList)
    pool = multiprocessing.Pool(60)
    for item in resultList:
        pool.apply_async(getUpdataImage, (item,))
    pool.close()
    pool.join()
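One thing worth tuning: the script creates a Pool of 60 workers, but per the test above, 40 processes gave higher throughput on this 20 Mbit line, so the pool size should be matched to your bandwidth rather than set as high as possible.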