  • Python crawler: re-downloading and retrying failed requests

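    When writing a crawler you will inevitably run into errors such as 4XX or 5XX responses. Some are caused by network problems, others by something else; in those cases we want the program to attempt the download again.

    One crude solution is to nest try/except blocks, one level per retry: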

      

    try:
        -------
    except:
        try:
            --------
        except:
            try:
                ------
            except:
                try:
                    ------
                except:
                    try:
                        ------
                    except:
                        try:
                            ------
                        except:
                            ------

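    Is there a cleaner way to write this?

    We can implement the retry with recursion: on failure the function calls itself with the remaining retry count reduced by one.

    The code is as follows: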

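    import requests

    # Shared session so connections are reused across requests.
    session = requests.Session()

    request_urls = [
        "https://www.baidu.com/",
        "https://www.baidu.com/",
        "https://www.baidu.com/",
        "https://www.ba111111idu.com/",  # deliberately invalid host, to trigger retries
        "https://www.baidu.com/",
        "https://www.baidu.com/",
    ]

    def down_load(url, request_max=3):
        """Return the body of url, retrying up to request_max more times on failure."""
        print("正在请求的URL是:", url)  # "Requesting URL:"
        result_html = ""
        result_status_code = None
        try:
            result = session.get(url=url)
            result_html = result.content
            result_status_code = result.status_code
            print(result_status_code)
        except Exception as e:
            # Network-level failure (DNS lookup, refused connection, ...).
            print(e)
        # Retry after an exception or any non-200 status (e.g. 4XX, 5XX).
        if result_status_code != 200 and request_max > 0:
            return down_load(url, request_max - 1)
        return result_html

    for url in request_urls:
        down_load(url=url, request_max=13)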

    Output (captured by the author under Python 2.7):

    C:\Python27\python.exe C:/Users/xuchunlin/PycharmProjects/A9_25/auction/test.py
    正在请求的URL是: https://www.baidu.com/
    200
    正在请求的URL是: https://www.baidu.com/
    200
    正在请求的URL是: https://www.baidu.com/
    200
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6208>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6438>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA65F8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6828>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6A90>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA62E8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6D30>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6DD8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B682B0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B68080>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B685C0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B687F0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B68A20>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.ba111111idu.com/
    HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B68C50>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
    正在请求的URL是: https://www.baidu.com/
    200
    正在请求的URL是: https://www.baidu.com/
    200
    
    Process finished with exit code 0
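
The recursive version adds one stack frame per retry. The same idea can also be written as a plain loop. What follows is a minimal sketch under stated assumptions, not the author's code: `download_with_retry`, `make_flaky_fetch`, the `fetch` callable and the `delay` parameter are hypothetical names introduced here, and the real network call is replaced by a stand-in so the example runs offline.

```python
import time


def download_with_retry(url, fetch, max_retries=3, delay=0.0):
    """Call fetch(url) until it returns status 200 or the retries run out.

    fetch is a hypothetical callable returning (status_code, body); it may
    raise on network errors. Returns the body, or "" if every attempt failed.
    """
    attempts = max_retries + 1  # one initial attempt plus the retries
    for attempt in range(1, attempts + 1):
        try:
            status_code, body = fetch(url)
            if status_code == 200:
                return body
            print("attempt", attempt, "got status", status_code)
        except Exception as exc:
            print("attempt", attempt, "failed:", exc)
        if attempt < attempts and delay > 0:
            time.sleep(delay)  # brief pause before the next attempt
    return ""


def make_flaky_fetch(failures):
    """Build a stand-in fetch that fails `failures` times, then succeeds."""
    calls = {"count": 0}

    def fetch(url):
        calls["count"] += 1
        if calls["count"] <= failures:
            raise IOError("simulated network error")
        return 200, "<html>ok</html>"

    return fetch, calls


if __name__ == "__main__":
    fetch, calls = make_flaky_fetch(failures=2)
    print(download_with_retry("https://www.baidu.com/", fetch, max_retries=3))
    print("calls made:", calls["count"])
```

With requests, `fetch` could be a small wrapper that calls `session.get(url)` and returns `(response.status_code, response.content)`. Passing the fetch function in as an argument is only a design choice made for this sketch, so that the retry logic can be exercised without a network connection.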
  • Original article: https://www.cnblogs.com/xuchunlin/p/8565952.html