  • Python Web Crawler

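    Let's build a real crawler example.

    It scrapes the titles and URLs of the recommended articles listed on the front page of my cnblogs (博客园) home page.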

    scrape_home_articles.py

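    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    # Fetch the blog home page and parse it with Python's built-in HTML parser
    html = urlopen("http://www.cnblogs.com/davidgu")
    bsObj = BeautifulSoup(html, "html.parser")

    # Article URLs start with the blog's /p path; the dots are escaped so they match literally
    article_href = re.compile(r"^http://www\.cnblogs\.com/davidgu/p")
    container = bsObj.find("div", {"id": "main_container"})
    # find() returns None when the container is missing, so fall back to an empty list
    links = container.find_all("a", href=article_href) if container is not None else []
    for link in links:
        # Keep only links without a class attribute (the href filter already guarantees href exists)
        if 'class' not in link.attrs:
            print(link.string)
            print(link.attrs['href'])
            print("--------------------------------------------------------------")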

    Output:
    [置顶]解决adb server端口被占用的问题
    http://www.cnblogs.com/davidgu/p/4515236.html
    --------------------------------------------------------------
    [置顶]解决Eclipse下不自动拷贝apk到模拟器问题( The connection to adb is down, and a sever
    http://www.cnblogs.com/davidgu/p/4390661.html
    --------------------------------------------------------------
    常用的正则表达式一览
    http://www.cnblogs.com/davidgu/p/4831357.html
    --------------------------------------------------------------
    C++ 11 - STL - 函数对象(Function Object) (上)
    http://www.cnblogs.com/davidgu/p/4829097.html
    --------------------------------------------------------------

    ...
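
The filtering rule used by the script (anchors inside the `main_container` div, whose `href` starts with the article prefix, and which have no `class` attribute) can be tried offline. The following is a minimal sketch, not the script above: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without network access or third-party packages. `SAMPLE_HTML`, `ArticleLinkParser`, and `extract_article_links` are hypothetical names introduced for illustration, and the sample markup is made up rather than taken from the real page.

```python
from html.parser import HTMLParser
import re

# Same prefix rule as the script: article URLs under the blog's /p path
ARTICLE_HREF = re.compile(r"^http://www\.cnblogs\.com/davidgu/p")

# Hypothetical markup standing in for the real page
SAMPLE_HTML = """
<div id="main_container">
  <a href="http://www.cnblogs.com/davidgu/p/1.html">First post</a>
  <a class="more" href="http://www.cnblogs.com/davidgu/p/2.html">Read more</a>
  <a href="http://www.cnblogs.com/davidgu/archive/2015.html">Archive</a>
</div>
<a href="http://www.cnblogs.com/davidgu/p/3.html">Outside the container</a>
"""


class ArticleLinkParser(HTMLParser):
    """Collects (title, href) pairs for article links inside main_container."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # <div> nesting depth inside main_container (0 = outside)
        self.current_href = None  # href of the accepted <a> currently being read
        self.current_text = []    # text fragments seen inside that <a>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the container, then track nested divs
            if self.div_depth > 0 or attrs.get("id") == "main_container":
                self.div_depth += 1
        elif tag == "a" and self.div_depth > 0:
            href = attrs.get("href") or ""
            # Accept only article URLs on anchors without a class attribute
            if ARTICLE_HREF.match(href) and "class" not in attrs:
                self.current_href = href
                self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.links.append((title, self.current_href))
            self.current_href = None


def extract_article_links(html):
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links


for title, href in extract_article_links(SAMPLE_HTML):
    print(title)
    print(href)
```

Only the first anchor in the sample passes all three conditions: the second has a `class` attribute, the third does not match the `/p` prefix, and the fourth sits outside the container.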

  • Original article: https://www.cnblogs.com/twodog/p/12135312.html