A Simple Web Crawler

import re
import urllib.error
import urllib.parse
import urllib.request

keyname = "短裙"  # search keyword ("short skirt")
key = urllib.parse.quote(keyname)  # percent-encode the keyword for use in the URL

# Pose as a browser by sending a browser User-Agent (Taobao can detect crawler programs)
headers = ("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Install the opener globally so that urlopen() and urlretrieve() use it
urllib.request.install_opener(opener)

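# One iteration per results page; range(3, 5) fetches page indexes 3 and 4
for i in range(3, 5):
    # s is the item offset of page i; the script assumes 44 items per page
    url = "https://s.taobao.com/search?q="+key+"&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20171209&ie=utf8&bcoffset=4&ntoffset=4&p4ppushleft=1%2C48&s="+str(i * 44)
    # Fetch the results page and decode its source
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    # Regular expression that captures each image URL (without the leading "//")
    pattern = 'pic_url":"//(.*?)"'
    # Find every match on this page; the result is a list of URL strings
    imagelist = re.compile(pattern).findall(data)
    # Save each image to a local folder (the folder must already exist)
    for j, thisimg in enumerate(imagelist):
        thisimageurl = "http://" + thisimg
        # Separate page and image indexes with "_" so file names cannot collide
        file = "/home/tarena/aid1808/pbase/pachong/tupian/" + "b" + str(i) + "_" + str(j) + ".jpg"
        try:
            urllib.request.urlretrieve(thisimageurl, file)
        except urllib.error.URLError as e:
            # Skip images that fail to download instead of aborting the whole run
            print("Failed to download %s: %s" % (thisimageurl, e))
            continue
        print("Saved image %d from page %d" % (j, i))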
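
The two pure steps of the script, building the paged search URL and pulling image links out of the page source, can be tried without touching the network. The sketch below is only an illustration: `build_search_url` and `extract_image_urls` are hypothetical helper names, the sample string is made up, and the shortened query string (only `q`, `ie` and `s`) is an assumption that has not been checked against the live site.

```python
import re
from urllib.parse import quote

# Items per results page; 44 is taken from the `s=` offset used in the script above.
ITEMS_PER_PAGE = 44

# Same pattern as the script: capture what follows `pic_url":"//` up to the next quote.
PIC_URL_PATTERN = re.compile(r'pic_url":"//(.*?)"')


def build_search_url(keyword, page_index):
    """Hypothetical helper: build the search URL for one results page."""
    # Assumption: a reduced parameter set is enough; the original URL carries more.
    return ("https://s.taobao.com/search?q=" + quote(keyword)
            + "&ie=utf8&s=" + str(page_index * ITEMS_PER_PAGE))


def extract_image_urls(page_source):
    """Hypothetical helper: return absolute image URLs found in the page source."""
    return ["http://" + rest for rest in PIC_URL_PATTERN.findall(page_source)]


if __name__ == "__main__":
    # Made-up sample that imitates the JSON fragments embedded in a results page.
    sample = '{"pic_url":"//img.example.com/a.jpg","nid":"1","pic_url":"//img.example.com/b.jpg"}'
    print(build_search_url("短裙", 3))
    print(extract_image_urls(sample))
```

Keeping these steps as separate functions makes it possible to check the regular expression against a saved copy of a page before any images are downloaded.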

Original article: https://www.cnblogs.com/sky-ai/p/9744599.html