  • Scraping the URLs of all posts on a cnblogs (博客园) blog, with a post count and the matching titles

    1. Scrape the URLs of all posts on a cnblogs blog, count them, and collect the matching titles

    import re
    import requests
    from lxml import etree
    
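    # Combine the post links with their titles
    def func_1_deco(func_1):
        def wrapper(*args, **kwargs):
            dic = dict()
            # func_1 returns (number of posts, list of post URLs)
            count, url_lis = func_1(*args, **kwargs)
            dic['count'] = count
            # XPath of the title link on a cnblogs post page
            name_xpath = '//*[@id="cb_post_title_url"]/text()'
            for url in url_lis:
                response = requests.get(url)
                response_html = etree.HTML(response.text)
                names = response_html.xpath(name_xpath)
                if not names:
                    # no title element found, so this is not a post page; skip it
                    # (as a result 'count' can exceed the number of titles)
                    continue
                name = names[0].strip()
                print(name)
                dic[name] = url
            return dic
        return wrapper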
    
    @func_1_deco
    def func(url):
        lis = []   # unique post URLs collected so far
        count = 1  # current page number
        while True:
            count_1 = len(lis)
            response = requests.get(f'{url}default.html?page={count}')
            response = response.text
            data_1 = re.findall(' href="(.*?)"', response, re.S)
            for a in data_1:  # type: str
                # keep absolute links to .html pages that are not archive pages
                if a.startswith('http') and a.endswith('html') and 'archive' not in a:
                    lis.append(a)
            count += 1
            lis = list(set(lis))  # remove duplicates
            count_2 = len(lis)
    
            # stop once a page contributes no new URLs
            if count_1 == count_2:
                return count_2, lis  # number of posts, list of post URLs
    
            
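    # Note: the URL must end with '/'. The returned dict has one 'count' entry;
    # every other entry maps a post title to its URL.
    dic = func('URL of your blog home page')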
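
    The link filter inside func can be checked without any network access. The sketch below is only an illustration: extract_post_urls and sample_html are hypothetical names that do not appear in the original post, and the example.com-style blog paths are made up. It applies the same regular expression and the same three conditions to a hand-written snippet of HTML, and it deduplicates with dict.fromkeys so that the first-seen order is kept (the original uses set, which does not preserve order).

```python
import re


def extract_post_urls(html):
    """Return post-like URLs from one page of HTML, deduplicated in first-seen order."""
    # Same pattern as in func: every double-quoted href value preceded by a space
    hrefs = re.findall(r' href="(.*?)"', html, re.S)
    kept = [
        a for a in hrefs
        # absolute link, ends with 'html', and is not an archive page
        if a.startswith('http') and a.endswith('html') and 'archive' not in a
    ]
    # dict.fromkeys drops duplicates while keeping insertion order
    return list(dict.fromkeys(kept))


# Hand-written sample page (assumption: made-up URLs for illustration only)
sample_html = '''
<a href="https://www.cnblogs.com/example/p/1.html">first</a>
<a href="https://www.cnblogs.com/example/p/1.html">first again</a>
<a href="https://www.cnblogs.com/example/archive/2019/06.html">archive</a>
<a href="/example/p/2.html">relative</a>
<a href="https://www.cnblogs.com/example/p/3.html">third</a>
<link href="https://www.cnblogs.com/css/site.css">
'''

print(extract_post_urls(sample_html))
# ['https://www.cnblogs.com/example/p/1.html', 'https://www.cnblogs.com/example/p/3.html']
```

    Note that the filter accepts any absolute link ending in html, including links to other sites, so restricting matches to the blog's own address may be worth adding.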
    
  • Original article: https://www.cnblogs.com/pythonywy/p/11046302.html