  • cnblogs (博客园) article scraper (hastily written; some pages fail to scrape)

    Blog scraping (hastily written)

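    import re
    import requests

    # Blogs to scan: display name plus the cnblogs home page URL.
    web = [
        {"name": "张三", "blog_url": "http://www.cnblogs.com/bladecheng/"},
        {"name": "甲", "blog_url": "http://www.cnblogs.com/pythonywy/"},
        {"name": "乙", "blog_url": "http://www.cnblogs.com/pythonywy/"},
        {"name": "丙", "blog_url": "http://www.cnblogs.com/zrx19960128/"},
        {"name": "丁", "blog_url": "http://www.cnblogs.com/itboy-newking/"},
        {"name": "帅哥", "blog_url": "http://www.cnblogs.com/chuwanliu/"},
        {"name": "浪哥", "blog_url": "http://www.cnblogs.com/einsam/"},
        {"name": "强哥", "blog_url": "http://www.cnblogs.com/wsxiaoyao"},
        {"name": "云哥", "blog_url": "http://www.cnblogs.com/yellowcloud/"},
    ]

    # Capture the link target and the link text of each post title.
    # [^"]* tolerates extra class names after postTitle2; re.S lets a title span lines.
    pattern = re.compile(r'postTitle2[^"]*" href="(.*?)">(.*?)</a>', re.S)

    for blog in web:
        print(f"Article URLs for {blog['name']}'s blog:")
        try:
            response = requests.get(blog["blog_url"], timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f"  request failed: {exc}")  # skip blogs that cannot be fetched
            continue
        for url, title in pattern.findall(response.text):  # (url, title) pairs
            print(f"{url}  Article title: {title.strip()}")
    print("Scraping finished!")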
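The regex above depends on the exact order and spacing of the attributes in the title link, which may be one reason some blogs return nothing (an assumption, not something verified against every theme). As a rough alternative, here is a minimal sketch that uses the standard-library `html.parser` to pick out `<a>` tags whose class list includes `postTitle2`. The names `PostTitleParser`, `extract_posts`, and `SAMPLE_HTML` are made up for illustration, and the sample markup is invented so the sketch runs without network access.

```python
from html.parser import HTMLParser


class PostTitleParser(HTMLParser):
    """Collect (url, title) pairs from <a> tags whose class includes postTitle2."""

    def __init__(self):
        super().__init__()
        self.posts = []     # finished (url, title) pairs
        self._href = None   # href of the title link currently being read
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "postTitle2" in classes and attr.get("href"):
            self._href = attr["href"]
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a title link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and newlines into single spaces.
            title = " ".join("".join(self._text).split())
            self.posts.append((self._href, title))
            self._href = None


def extract_posts(html_text):
    """Return a list of (url, title) tuples found in html_text."""
    parser = PostTitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.posts


# Invented sample markup (not copied from a real page), for an offline check.
SAMPLE_HTML = """
<div class="postTitle">
  <a id="t1" class="postTitle2" href="https://example.com/p/1.html">First post</a>
</div>
<div class="postTitle">
  <a class="postTitle2 vertical-middle" href="https://example.com/p/2.html">
    <span>Second   post</span>
  </a>
</div>
<a class="other" href="https://example.com/about">About</a>
"""

for url, title in extract_posts(SAMPLE_HTML):
    print(f"{url}  Article title: {title}")
```

To try it on a real page, pass the text of a downloaded page to `extract_posts` in place of `SAMPLE_HTML`. Whether it recovers the pages the regex misses depends on how each blog theme marks up its titles.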
    
  • Original article: https://www.cnblogs.com/bladecheng/p/10883555.html