Python web crawling

    1. Scraping static pages

    This is the simplest case: if right-click -> View Page Source already shows all the information you want, you only need to download the page source directly. The code is as follows:

    # Simple open web
    import urllib2
    print urllib2.urlopen('http://stockrt.github.com').read()
    # With password?
    import urllib
    opener = urllib.FancyURLopener()
    print opener.open('http://user:password@stockrt.github.com').read()
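
    Note that urllib2 exists only under Python 2. If you are on Python 3, a rough equivalent of the simple fetch (my own sketch, not part of the original post) is:

    # Python 3: urllib2 was merged into urllib.request
    from urllib.request import urlopen

    html = urlopen('http://stockrt.github.com').read()
    print(html.decode('utf-8', errors='ignore'))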

    2. Content loaded dynamically on scroll

    Some pages do not show everything when first opened; additional content is loaded dynamically as you scroll. To crawl such a page you need to find the URL that the dynamic loading triggers. The usual approach is: right-click -> Inspect -> Network,

    then look for the request that fires each time you scroll, work out which parameters in its URL change on every scroll, and assemble the corresponding URLs in your code.
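
    For example, a minimal sketch of this idea (the endpoint, parameter names, and JSON fields below are all made up; use whatever you actually see in the Network panel):

    # Hypothetical scroll-triggered API: each "scroll" fetches the next 20 items
    import json
    import urllib2

    base_url = 'http://example.com/api/feed?offset=%d&limit=20'  # made-up endpoint
    for page in range(5):                         # simulate five scrolls
        url = base_url % (page * 20)              # only the offset parameter changes
        data = json.loads(urllib2.urlopen(url).read())
        for item in data.get('items', []):        # 'items'/'title' are assumptions
            print item.get('title')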

    3. Using mechanize to emulate a browser

    Sometimes the methods above don't work: what you download doesn't match what the page shows, and a lot of content is missing. In that case you need to disguise your crawler as a browser and mimic browser behaviour by instantiating a browser object on the command line or in a Python script. The code below comes from a linked web page.

    Emulating a browser:

    import mechanize
    import cookielib
    # Browser
    br = mechanize.Browser()
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    # Follows refresh 0 but not hangs on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # Want debugging messages?
    #br.set_debug_http(True)
    #br.set_debug_redirects(True)
    #br.set_debug_responses(True)
    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    You now have a browser instance, the br object. With it you can open a page, using code like the following:

    # Open some site, let's pick a random one, the first that pops in mind:
    r = br.open('http://google.com')
    html = r.read()
    # Show the source
    print html
    # or
    print br.response().read()
    # Show the html title
    print br.title()
    # Show the response headers
    print r.info()
    # or
    print br.response().info()
    # Show the available forms
    for f in br.forms():
        print f
    # Select the first (index zero) form
    br.select_form(nr=0)
    # Let's search
    br.form['q'] = 'weekend codes'
    br.submit()
    print br.response().read()
    # Looking at some results in link format
    for l in br.links(url_regex='stockrt'):
        print l

    If the site you are visiting requires authentication (HTTP basic auth), then:

    # If the protected site didn't receive the authentication data you would
    # end up with a 401 error in your face
    br.add_password('http://safe-site.domain', 'username', 'password')
    br.open('http://safe-site.domain')

    Because we attached a Cookie Jar earlier, you don't have to manage the site's login session yourself, i.e. the case where you would otherwise have to POST a username and password. In that situation the site typically asks your browser to store a session cookie so that you don't have to log in again, and that cookie ends up in your cookie store. Storing and re-sending that session cookie is all handled for you by the Cookie Jar.
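
    For the login-form case, a minimal sketch might look like this (the login URL, form name, and field names are made up; inspect the real page to find them):

    # Hypothetical login form: mechanize fills it in, the Cookie Jar keeps the session
    br.open('http://safe-site.domain/login')    # made-up login URL
    br.select_form(name='login')                # or select_form(nr=0) if the form has no name
    br.form['username'] = 'joe'                 # field names are assumptions
    br.form['password'] = 'secret'
    br.submit()
    # Later requests re-send the stored session cookie automatically
    print br.response().read()
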
    You can also move back and forth through your browser history:

    # Testing presence of link (if the link is not found you would have to
    # handle a LinkNotFoundError exception)
    br.find_link(text='Weekend codes')
    # Actually clicking the link
    req = br.click_link(text='Weekend codes')
    br.open(req)
    print br.response().read()
    print br.geturl()
    # Back
    br.back()
    print br.response().read()
    print br.geturl()

    Downloading a file:

    # Download
    f = br.retrieve('http://www.google.com.br/intl/pt-BR_br/images/logo.gif')[0]
    print f
    fh = open(f, 'rb')  # open the downloaded temp file in binary mode
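
    If you would rather save the file under a name of your own choosing, you can also just read the response and write it out yourself (a small sketch):

    # Fetch the image and write it to a local file of our choice
    data = br.open('http://www.google.com.br/intl/pt-BR_br/images/logo.gif').read()
    with open('logo.gif', 'wb') as out:
        out.write(data)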

    Setting an HTTP proxy:

    # Proxy and user/password
    br.set_proxies({"http": "joe:password@myproxy.example.com:3128"})
    # Proxy
    br.set_proxies({"http": "myproxy.example.com:3128"})
    # Proxy password
    br.add_proxy_password("joe", "password")