  • Scraping Doutula meme images with Python to become a meme-battle pro

    Without further ado, here are the results (only 10 pages were scraped).

    Here is the code (it can be run as-is). It uses XPath.

    #encoding:utf-8
    # __author__ = 'donghao'
    # __time__ = 2018/12/24 15:20
    import requests
    import urllib.request
    import os
    import re
    import time
    from lxml import etree
    
    
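    def parse_page(url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'
        }
        resp = requests.get(url=url, headers=headers)
        text = resp.text
        html = etree.HTML(text)
        # Select every <img> in the content block whose class is not 'gif'
        imgs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
        # Make sure the output folder exists before downloading
        os.makedirs('images', exist_ok=True)
        for img in imgs:
            # Get the image URL
            img_url = img.get('data-original')
            if not img_url:
                # Skip images that carry no data-original attribute
                continue
            # Get the file extension from the image URL
            end = os.path.splitext(img_url)[1]
            # Strip special characters from the description so it is usable as a file name
            alt = re.sub(r'[,,。??/\\·]', '', img.get('alt') or '')
            # File name: image description plus extension
            name = alt + end
            # Download into the local images folder
            urllib.request.urlretrieve(img_url, 'images/' + name)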
    
    def main():
        # Scrape 10 pages: range(1, 11) yields pages 1 through 10
        for x in range(1, 11):
            url = 'http://www.doutula.com/photo/list/?page=%d' % x
            parse_page(url)
    
    
    if __name__ == '__main__':
        start = time.time()
        main()
        end = time.time()
        print('Elapsed: %.2fs' % (end - start))
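
    The file-naming step in the script above (take the extension from the image URL, clean up the description, join the two) can be tried on its own without touching the network. The following is only a minimal sketch under stated assumptions: build_file_name, UNSAFE, and the fallback parameter are hypothetical names that do not appear in the original script, the example.com URLs are placeholders, and the character set is the script's own plus a few characters that Windows rejects in file names.

```python
import os
import re
from urllib.parse import urlparse

# Characters stripped from the description. Assumption: the script's own set
# plus a few characters that Windows does not allow in file names.
UNSAFE = re.compile(r'[,,。??!!/\\·:*"<>|]')


def build_file_name(img_url, alt, fallback='image'):
    """Combine an image description and the URL's extension into a file name.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Read the extension from the URL path only, so a query string
    # such as '?v=2' does not leak into the extension.
    path = urlparse(img_url).path
    ext = os.path.splitext(path)[1]
    # Remove unsafe characters; use a fixed stem if the description is
    # missing or nothing is left after cleaning.
    stem = UNSAFE.sub('', alt or '') or fallback
    return stem + ext


if __name__ == '__main__':
    print(build_file_name('http://example.com/a/cat.jpg?v=2', 'funny cat?'))
    print(build_file_name('http://example.com/a/dog.png', None))
```

    Keeping this logic in a small pure function makes it easy to check edge cases, such as a missing alt attribute or a slash in the description, before any download is attempted. Two images with the same description would still map to the same file name, so a real run may want to add a counter or a hash to the stem.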
    
  • Original post: https://www.cnblogs.com/donghaoblogs/p/10389699.html