  • 1.4.3 ID iteration crawler (updated daily)

    # -*- coding: utf-8 -*-
    '''
    Created on May 7, 2019
    
    @author: 薛卫卫
    '''
    import itertools
    import urllib.error
    import urllib.request
    
    def download(url, user_agent="wswp", num_retries=2):
        """Download url and return its body as bytes, or None on failure."""
        print("Downloading:", url)
        headers = {'User-agent': user_agent}
        request = urllib.request.Request(url, headers=headers)
        try:
            html = urllib.request.urlopen(request).read()
        except urllib.error.URLError as e:
            print('Download error:', e.reason)
            html = None
            # retry only on 5xx server errors, which may be temporary
            if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
        return html
    
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        html = download(url)
        if html is None:
            break
        else:
            # success - can scrape the result
            pass
        
    # Alternative: stop only after several consecutive download errors,
    # so that a single missing ID does not end the crawl.
    #
    # # maximum number of consecutive download errors allowed
    # max_errors = 5
    # # current number of consecutive download errors
    # num_errors = 0
    # for page in itertools.count(1):
    #     url = 'http://example.webscraping.com/view/-%d' % page
    #     html = download(url)
    #     if html is None:
    #         # received an error trying to download this webpage
    #         num_errors += 1
    #         if num_errors == max_errors:
    #             # reached maximum number of
    #             # consecutive errors so exit
    #             break
    #     else:
    #         # success - can scrape the result
    #         # ...
    #         num_errors = 0
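
The commented-out variant above keeps crawling past gaps in the ID sequence and stops only after several consecutive failures. Below is a minimal sketch of that idea as a reusable generator. The names `iter_ids`, `fetch`, and `url_template` are hypothetical and do not appear in the original. `fetch` stands in for `download`, so the loop can be tried offline against a dictionary instead of the real site.

```python
import itertools


def iter_ids(fetch, url_template, max_errors=5):
    """Yield (page_id, html) for successive IDs.

    Stops after max_errors consecutive failures. fetch is any callable
    that returns the page body, or None on failure (a hypothetical
    stand-in for the download() function in the listing above).
    """
    num_errors = 0  # current number of consecutive download errors
    for page in itertools.count(1):
        html = fetch(url_template % page)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # too many consecutive misses: assume the last ID was passed
                break
        else:
            num_errors = 0  # a success resets the counter
            yield page, html


# Offline usage example (made-up data): IDs 1, 2 and 4 exist, 3 is a gap.
fake_site = {
    'http://example.webscraping.com/view/-1': b'one',
    'http://example.webscraping.com/view/-2': b'two',
    'http://example.webscraping.com/view/-4': b'four',
}
template = 'http://example.webscraping.com/view/-%d'
found = [page for page, _ in iter_ids(fake_site.get, template, max_errors=2)]
print(found)
```

To crawl the real site, pass `download` as `fetch`. Resetting the counter on every success is what distinguishes consecutive failures from a total failure count.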
    

      

  • Original article: https://www.cnblogs.com/xww115/p/10835223.html