Today my cousin asked me to help scrape the China pharmaceutical science data site and export the results as JSON for him. 180,000 records in total.
I took a look at the site: http://pharm.ncmi.cn/dataContent/admin/index.jsp?submenu=183
Turns out it's all plain GET requests. If I don't scrape this, what would I scrape...
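For reference, a single record can be probed with one GET before writing the full scraper. A minimal sketch, using the method=viewpage / did=26 parameters that appear in the script below (id 2228 is just the first record the script starts from):

# Minimal probe of the endpoint used by the scraper below; a sketch, not the final script.
import requests

url = 'http://pharm.ncmi.cn/dataContent/dataSearch.do'
# Parameters taken from the scraper below; id is the record number.
resp = requests.get(url, params={'method': 'viewpage', 'id': 2228, 'did': 26}, timeout=60)
print(resp.status_code)
print(resp.text[:500])  # eyeball the HTML to locate the data cells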
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Guoyabin
import re
import requests
import threading
import time


def inserttxt(file, text):
    # Append a chunk of text to the output file.
    f = open(file, 'a+')
    f.write(text)
    f.close()


def down(begin, end):
    url = 'http://pharm.ncmi.cn/dataContent/dataSearch.do'
    for i in range(begin, end):
        file = str(end) + '.txt'  # one output file per thread, named after its end id
        params = {'method': 'viewpage', 'id': i, 'did': 26}
        try:
            html = requests.get(url, params=params, timeout=60)
            r = html.text
            html.close()
            # Strip entities and whitespace so the regex below matches on one line.
            for junk in ('&nbsp;', '\r', '\n', '\t', ' '):
                r = r.replace(junk, '')
            req = 'width="89%">(.*?)</td>'
            yaovalue = re.findall(req, r)
            yaokey = ['{ name:"', '", english:"', '", number:"',
                      '", shanpinmingchen:"', '", danwei:"', '", date:"',
                      '", class:"', '", guige:"', '", jixing:"',
                      '", leibie:"', '", pizhun:"']
            for pair in zip(yaokey, yaovalue):
                for x in pair:
                    inserttxt(file, x)
            inserttxt(file, '" }, ')
            # Sleep 3 seconds between requests. The first version had no delay,
            # which piled up TCP connections and got my IP banned outright.
            # 180k records / 10 threads * 3 s wait / 60 s / 60 min = 15 hours to
            # pull everything. Better to tweak the script and run it on several
            # machines with independent IPs.
            time.sleep(3)
        except Exception:
            print('url request failed, skipping id %d' % i)
            continue


if __name__ == '__main__':
    # Split ids 2228..183662 across 10 threads. The last two ranges were
    # mistyped as 16000/18000 in the first version, which re-crawled old ids.
    ranges = [(2228, 20000), (20000, 40000), (40000, 60000), (60000, 80000),
              (80000, 100000), (100000, 120000), (120000, 140000),
              (140000, 160000), (160000, 180000), (180000, 183662)]
    threads = [threading.Thread(target=down, args=rng) for rng in ranges]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread, not just the last one
    input('Download finished, press Enter to exit')
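Note that the script's output is JSON-ish text, not valid JSON (bare keys, trailing commas), so a post-processing step is still needed before handing the file over. A minimal sketch that builds real dicts and dumps them with the json module; the parse_record helper and the pharm.json filename are my own naming, and the field list mirrors yaokey:

# Sketch: emit real JSON instead of the hand-built pseudo-JSON above.
# Assumes the same regex captures per page, in yaokey order.
import json
import re
import requests

FIELDS = ['name', 'english', 'number', 'shanpinmingchen', 'danwei',
          'date', 'class', 'guige', 'jixing', 'leibie', 'pizhun']

def parse_record(html_text):
    # Hypothetical helper: same cleanup and regex as the scraper above.
    r = html_text
    for junk in ('&nbsp;', '\r', '\n', '\t', ' '):
        r = r.replace(junk, '')
    values = re.findall('width="89%">(.*?)</td>', r)
    return dict(zip(FIELDS, values))

records = []
url = 'http://pharm.ncmi.cn/dataContent/dataSearch.do'
for i in range(2228, 2238):  # small demo range
    resp = requests.get(url, params={'method': 'viewpage', 'id': i, 'did': 26}, timeout=60)
    records.append(parse_record(resp.text))

with open('pharm.json', 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False, indent=2)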
The first few runs went fine and I had already pulled down half the data, but after a while my IP got banned outright. Probably scraping too fast. Allow me a sad face.
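If running the script on several machines with separate IPs isn't an option, requests also accepts a proxies dict per call, so a pool of proxies can be rotated instead. A minimal sketch; the proxy addresses below are placeholders, not real endpoints:

# Sketch: rotate requests through a pool of HTTP proxies to spread out the
# source IP. The addresses are placeholders; substitute working proxies.
import itertools
import requests

PROXIES = itertools.cycle([
    {'http': 'http://10.0.0.1:8080'},
    {'http': 'http://10.0.0.2:8080'},
])

url = 'http://pharm.ncmi.cn/dataContent/dataSearch.do'
for i in range(2228, 2233):
    resp = requests.get(url,
                        params={'method': 'viewpage', 'id': i, 'did': 26},
                        proxies=next(PROXIES), timeout=60)
    print(i, resp.status_code)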
Shamelessly asking for sponsorship.