  • Python Web Scraping (5)

    Downloading media files

    I. urllib.request.urlretrieve downloads a file from a URL and saves it under the filename you specify.

    from urllib.request import urlopen, urlretrieve
    from bs4 import BeautifulSoup

    with urlopen("http://www.pythonscraping.com") as html:
        bsObj = BeautifulSoup(html, 'html.parser')
    # Locate the <img> inside the logo link and read its src attribute.
    imageLocation = bsObj.find('a', {'id': "logo"}).find("img")["src"]
    # Save the image as logo.jpg in the current working directory.
    urlretrieve(imageLocation, "logo.jpg")
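
    urlretrieve also returns a (filename, headers) tuple and accepts an optional reporthook callback, which is handy for showing progress on larger files. The sketch below is only an illustration: download_with_progress and progress_log are hypothetical names that do not appear in the original post.

```python
import urllib.request


def download_with_progress(url, filename):
    """Download url to filename and return (saved_path, progress_log).

    Hypothetical helper: progress_log is a list of (bytes_so_far, total_size)
    tuples collected by the reporthook. total_size is -1 when the response
    carries no Content-Length header.
    """
    progress_log = []

    def report(block_num, block_size, total_size):
        # urlretrieve calls the hook once before the first read and once
        # after each block. block_num * block_size can overshoot the total
        # on the last block, so cap it when the total is known.
        done = block_num * block_size
        if total_size >= 0:
            done = min(done, total_size)
        progress_log.append((done, total_size))

    # urlretrieve returns the local filename and the response headers.
    saved_path, _headers = urllib.request.urlretrieve(
        url, filename, reporthook=report
    )
    return saved_path, progress_log
```

    Called as download_with_progress(imageLocation, "logo.jpg"), it saves the same file as the listing above, and the last entry of progress_log tells you how many bytes were reported.
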
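
    II. The script below goes further: it downloads every file on the page that is referenced by a src attribute and saves it under a local directory.
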
    import os
    from urllib.request import urlopen, urlretrieve
    from bs4 import BeautifulSoup

    downloadDirectory = "downloaded"
    baseUrl = "http://pythonscraping.com"

    def getAbsoluteURL(baseUrl, source):
        # Normalize a src value to an absolute URL without the "www." prefix.
        if source.startswith("http://www."):
            url = "http://" + source[11:]
        elif source.startswith("http://"):
            url = source
        elif source.startswith("www."):
            url = "http://" + source[4:]
        else:
            # Treat anything else as a path relative to the site root.
            url = baseUrl + "/" + source
        # Skip resources hosted on other sites.
        if baseUrl not in url:
            return None
        return url

    def getDownLoadPath(baseUrl, absoluteUrl, downloadDirectory):
        # Map the URL onto a local path under downloadDirectory.
        path = absoluteUrl.replace("www.", "")
        path = path.replace(baseUrl, "")
        path = downloadDirectory + path
        # Drop the query string so it does not end up in the filename.
        path = path.split("?")[0]
        directory = os.path.dirname(path)
        if not os.path.exists(directory):
            os.makedirs(directory)
        return path

    with urlopen("http://www.pythonscraping.com") as html:
        bsObj = BeautifulSoup(html, "html.parser")

    # Every tag that carries a src attribute (images, scripts, and so on).
    downloadList = bsObj.find_all(src=True)

    for download in downloadList:
        fileUrl = getAbsoluteURL(baseUrl, download["src"])
        if fileUrl is not None:
            print(fileUrl)
            savePath = getDownLoadPath(baseUrl, fileUrl, downloadDirectory)
            print("save: " + savePath)
            urlretrieve(fileUrl, savePath)
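
    The prefix checks in getAbsoluteURL do not cover https:// links or protocol-relative sources such as //host/file.js. One possible alternative, sketched here with the standard urllib.parse module, is to let urljoin resolve the source and then compare host names. The names to_absolute_url and to_local_path and the "index.html" fallback are assumptions made for this sketch, not part of the original script. Unlike getAbsoluteURL, this version keeps any "www." prefix in the returned URL.

```python
import os
import posixpath
from urllib.parse import urljoin, urlparse


def to_absolute_url(base_url, source):
    """Resolve source against base_url; return None for other hosts."""
    # Mirror the original script: a bare "www." source is taken to be http.
    if source.startswith("www."):
        source = "http://" + source
    absolute = urljoin(base_url, source)
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # e.g. data: or mailto: values

    def bare_host(netloc):
        # Compare hosts case-insensitively and without a "www." prefix.
        host = netloc.lower()
        return host[4:] if host.startswith("www.") else host

    if bare_host(parts.netloc) != bare_host(urlparse(base_url).netloc):
        return None
    return absolute


def to_local_path(download_dir, absolute_url):
    """Map the URL path onto download_dir, ignoring query and fragment."""
    url_path = urlparse(absolute_url).path
    # normpath collapses "." and ".." segments; anchoring at "/" first keeps
    # the result from climbing above the download directory.
    clean = posixpath.normpath("/" + url_path.lstrip("/")).lstrip("/")
    if not clean:
        clean = "index.html"  # assumed filename for a URL with no path
    return os.path.join(download_dir, *clean.split("/"))
```

    Neither function touches the network or the disk. Before passing the result of to_local_path to urlretrieve, the target directory still has to be created, for example with os.makedirs(os.path.dirname(path), exist_ok=True).
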
    

      

  • Original article: https://www.cnblogs.com/someoneHan/p/6237572.html