  • python_crawler: batch-downloading files

    This is my first Python 3 web crawler, written with the book "Web Scraping with Python" as a reference. Its main job is to crawl a given site and batch-download the .rar, .doc, .docx and .zip files it links to.

    Planned improvements: identify files to download by their extension, bring in a Bloom filter so the visited-page set stays manageable on large sites (see the sketch after the code listing), and learn more about the target site's anti-crawling mechanisms.
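    A minimal sketch of the extension-based filtering mentioned above, assuming the wanted suffixes are the .rar/.doc/.docx/.zip types listed earlier; the regex and helper name are illustrative and not part of the original script:

    import re

    # Illustrative helper: match attachment links by suffix rather than by directory.
    ATTACHMENT_RE = re.compile(r"\.(rar|docx?|zip)$", re.IGNORECASE)

    def is_wanted_attachment(href):
        # True if the link points at a file type we want to download.
        return bool(ATTACHMENT_RE.search(href))

    # e.g. is_wanted_attachment("/uploads/attachments/report.docx") -> True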

    # -*- coding: utf-8 -*-

    import os
    import re
    from urllib.request import urlopen, urlretrieve
    from urllib.parse import quote
    from bs4 import BeautifulSoup

    downloadDirectory = "downloaded"
    baseUrl = "http://computer.hdu.edu.cn"

    def is_chinese(uchar):
        # True if the single character lies in the CJK range.
        return u'\u2e80' <= uchar <= u'\ufe4f'

    def getAbsoluteURL(baseUrl, source):
        # Normalize a link found on the page into an absolute URL under baseUrl.
        if source.startswith("http://www."):
            url = "http://" + source[11:]
        elif source.startswith("http://"):
            url = source
        elif source.startswith("www."):
            url = "http://" + source[4:]
        else:
            url = baseUrl + source
        if baseUrl not in url:
            return None
        return url

    def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
        # Map the absolute URL onto a local path under downloadDirectory,
        # creating intermediate directories as needed.
        path = absoluteUrl.replace("www.", "")
        path = path.replace(baseUrl, "")
        path = downloadDirectory + path
        directory = os.path.dirname(path)
        if not os.path.exists(directory):
            os.makedirs(directory)
        print(path)
        return path

    pages = set()
    def getLinks(pageUrl):
        global pages
        html = urlopen("http://computer.hdu.edu.cn" + pageUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        try:
            print(bsObj.h1.get_text())
            print(bsObj.h2.get_text())
            print(bsObj.h3.get_text())
            # To restrict to one type, e.g. .doc:
            # my_files = bsObj.findAll("a", {"href": re.compile("/uploads/attachments/.*\.doc")})
            my_files = bsObj.findAll("a", {"href": re.compile("/uploads/attachments/")})

            for my_file in my_files:
                # Percent-encode hrefs that contain Chinese characters.
                if any(is_chinese(ch) for ch in my_file["href"]):
                    my_file["href"] = quote(my_file["href"])
                print(my_file["href"])
                url = getAbsoluteURL(baseUrl, my_file["href"])
                print(url)
                if url is not None:
                    urlretrieve(url, getDownloadPath(baseUrl, url, downloadDirectory))
        except AttributeError:
            print("This page is missing something! No worries though!")

        # Recurse into every internal page under /index.php/ that has not been seen yet.
        for link in bsObj.findAll("a", href=re.compile("^(/index.php/)")):
            if 'href' in link.attrs:
                if link.attrs['href'] not in pages:
                    # We have encountered a new page
                    newPage = link.attrs['href']
                    print("---------------- " + newPage)
                    pages.add(newPage)
                    getLinks(newPage)

    getLinks("")
