  • Meta Blogging

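    Background

    It occurred to me one day: if cnblogs ever went down, what would happen to all the posts I have written there? Could I download them and keep a local copy? I had just come across a library called requests, so why not give it a try.

    Getting started

    With requests and BeautifulSoup, the code is quite simple. The one thing to watch out for is not to call requests.get too frequently, otherwise the requests start failing. Most websites presumably have some self-protection mechanism like this, so that crawlers cannot hammer them to death.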
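
    As a rough illustration of that throttling advice, and separate from the script that follows, here is a minimal Python 3 sketch of a retry wrapper that pauses longer after each failed attempt. The name fetch_with_retry and its parameters are hypothetical, not part of requests. It assumes the failures surface as OSError subclasses, which holds for requests' own exceptions on Python 3.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=5.0,
                     retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url), pausing and retrying when it raises a network error.

    `fetch` is any callable that takes a URL (for example requests.get).
    `sleep` is injectable so the pauses can be skipped or recorded in tests.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            # Back off: wait delay, then 2 * delay, then 4 * delay, ...
            sleep(delay * 2 ** attempt)
```

    With requests installed, fetch_with_retry(requests.get, link) could stand in for a bare requests.get(link). The fixed pause between pages is still worth keeping, since the wrapper only reacts after a failure has already happened.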

    import requests
    from BeautifulSoup import BeautifulSoup
    import re
    import os
    import time
    
    
    URL='http://www.cnblogs.com/fangwenyu/p/'
    URL_PATTERN = r'http://www\.cnblogs\.com/fangwenyu/(p|archive)/'  # group the alternation so it only covers the path segment
    pattern = re.compile(URL_PATTERN)
    DIRECTORY = os.path.dirname(__file__)
    ESCAPE_CHARS = '\\/:*?"<>|' # These characters are not allowed in Windows file names.
    tbl = {ord(char): u'' for char in ESCAPE_CHARS}
    
    # get the total page number
    page_count = 0
    resp = requests.get(URL)
    if resp.status_code == requests.codes.ok:
        soup = BeautifulSoup(resp.content)
        attr = {'class':'Pager'}
        result = soup.find('div', attr)
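        # Take the first run of digits in the pager text so that counts of 10 or more also parse.
        page_count = int(re.search(r'\d+', result.getText()).group())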
    
    with open(os.path.join(DIRECTORY, 'blog_archive.txt'), 'w') as blog_archive:
        for page in range(1,page_count+1):
            param = {'page':page}
            resp = requests.get(URL, params=param)
            soup = BeautifulSoup(resp.content, convertEntities=BeautifulSoup.HTML_ENTITIES)
            
            blog_list = [(a.getText(), a.get('href')) for a in soup.findAll('a', id=True, href=pattern)]
            for title, link in blog_list:
                norm_title = title.translate(tbl)
                item = '%s |[%s]| %s ' % (title, norm_title, link)
                blog_archive.write(item.encode('utf-8'))
                blog_archive.write('\n')
                
                with open(os.path.join(DIRECTORY, norm_title + '.html'), 'w') as f:
                    f.write(requests.get(link).content)
            
            # Sleep between pages: accessing cnblogs too frequently makes the server stop responding,
            # with an error like this:
            # ...
            # requests.exceptions.ConnectionError: ('Connection aborted.', error(10060, 'A connection attempt failed 
            # because the connected party did not properly respond after a period of time, or established connection failed 
            # because connected host has failed to respond'))
            time.sleep(5)
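
    The title clean-up step in the script (the ESCAPE_CHARS translation table) can also be written as a small standalone helper. The following is a minimal Python 3 sketch, not the author's code: the names FORBIDDEN and safe_filename and the 'untitled' fallback are hypothetical, and the backslash is stripped as well, since Windows rejects it in file names too.

```python
# Characters that Windows does not allow in file names.
FORBIDDEN = '\\/:*?"<>|'

# Mapping a code point to None makes str.translate delete that character.
TABLE = {ord(ch): None for ch in FORBIDDEN}


def safe_filename(title, suffix='.html'):
    """Return a file name built from a post title with forbidden characters removed."""
    cleaned = title.translate(TABLE).strip()
    # Fall back to a placeholder if nothing is left of the title.
    return (cleaned or 'untitled') + suffix
```

    Two different titles can collapse to the same file name after cleaning, so a fuller version might append the post ID from the link to keep the names unique.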
    

  • Original post: https://www.cnblogs.com/fangwenyu/p/4191709.html