  • Python Web Crawler

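    Now let's build a real crawler example.

    It scrapes the list of recommended articles, along with their URLs, from the front page of my cnblogs (博客园) home page.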

    scrape_home_articles.py

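    import re
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    # Fetch the blog home page and parse it with Python's built-in HTML parser.
    html = urlopen("http://www.cnblogs.com/davidgu")
    bsObj = BeautifulSoup(html, "html.parser")

    # Article links sit inside the main container and start with .../davidgu/p.
    # The dots are escaped so they match a literal "." rather than any character.
    article_url = re.compile(r"^http://www\.cnblogs\.com/davidgu/p")
    for link in bsObj.find("div", {"id": "main_container"}).find_all("a", href=article_url):
        # Keep only anchors without a class attribute.
        if 'href' in link.attrs and 'class' not in link.attrs:
            print(link.string)
            print(link.attrs['href'])
            print("--------------------------------------------------------------")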

    Output:
    [置顶]解决adb server端口被占用的问题
    http://www.cnblogs.com/davidgu/p/4515236.html
    --------------------------------------------------------------
    [置顶]解决Eclipse下不自动拷贝apk到模拟器问题( The connection to adb is down, and a sever
    http://www.cnblogs.com/davidgu/p/4390661.html
    --------------------------------------------------------------
    常用的正则表达式一览
    http://www.cnblogs.com/davidgu/p/4831357.html
    --------------------------------------------------------------
    C++ 11 - STL - 函数对象(Function Object) (上)
    http://www.cnblogs.com/davidgu/p/4829097.html
    --------------------------------------------------------------

    ...
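
    The filtering logic in the script above can be tried offline, without hitting the network. The following is a minimal sketch, not the author's code: the function name extract_articles and the SAMPLE_HTML markup are hypothetical, made up for illustration, and the real page structure may differ. It also returns an empty list when the main_container div is missing, a case in which the original script would raise an AttributeError, and it uses get_text() instead of link.string so that a title containing nested tags still yields text.

```python
import re

from bs4 import BeautifulSoup

# Same URL prefix as the script's filter, with dots escaped to match literally.
ARTICLE_URL = re.compile(r"^http://www\.cnblogs\.com/davidgu/p")


def extract_articles(html_text):
    """Return (title, url) pairs for article links inside #main_container."""
    soup = BeautifulSoup(html_text, "html.parser")
    container = soup.find("div", {"id": "main_container"})
    if container is None:
        # The layout changed or the server returned an error page.
        return []
    articles = []
    for link in container.find_all("a", href=ARTICLE_URL):
        # As in the script, skip anchors that carry a class attribute.
        if "class" not in link.attrs:
            articles.append((link.get_text(strip=True), link["href"]))
    return articles


# Hypothetical sample markup (an assumption, not the real page source).
SAMPLE_HTML = """
<div id="main_container">
  <a href="http://www.cnblogs.com/davidgu/p/4831357.html">Post A</a>
  <a class="more" href="http://www.cnblogs.com/davidgu/p/4829097.html">Read more</a>
  <a href="http://www.cnblogs.com/davidgu/archive/2015.html">Archive</a>
</div>
<a href="http://www.cnblogs.com/davidgu/p/1.html">Outside the container</a>
"""

if __name__ == "__main__":
    for title, url in extract_articles(SAMPLE_HTML):
        print(title)
        print(url)
```

    Of the four links in the sample, only the first passes all three checks: the second has a class attribute, the third does not match the URL prefix, and the fourth lies outside the container.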

  • Original article: https://www.cnblogs.com/twodog/p/12135312.html