
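Crawling dynamic pages with PhantomJS + Selenium

I used to crawl dynamic pages by driving a browser with Selenium + Firefox. Firefox updates frequently, though, and after an update webdriver often fails to start, so I switched to PhantomJS + Selenium.
Using PhantomJS is not much different from using a regular browser.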

1. First, download PhantomJS

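PhantomJS supports all the major platforms; get it from the download page. Unpack it into a directory of your choice, for example D:\phantomjs.
The PhantomJS executable is in the bin directory, so you can add D:\phantomjs\bin to the PATH environment variable. If you do not add it to PATH, Selenium has to be given the path explicitly when it drives PhantomJS.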
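The choice between "on PATH" and "explicit path" can be checked before starting the driver. The following is a minimal sketch using only the standard library; the helper name `find_phantomjs` and its parameters are hypothetical and not part of Selenium or PhantomJS.

```python
import os
import shutil


def find_phantomjs(fallback_path=None, name="phantomjs"):
    """Return a path to the PhantomJS executable, or None if none is found.

    `fallback_path` is an assumed explicit location, used only when the
    executable is not on PATH (for example D:\\phantomjs\\bin\\phantomjs.exe).
    """
    # shutil.which searches the directories listed in the PATH variable
    on_path = shutil.which(name)
    if on_path:
        return on_path
    # Not on PATH: accept the explicit location only if the file exists
    if fallback_path and os.path.isfile(fallback_path):
        return fallback_path
    return None
```

If the function returns a path, it can be passed as `executable_path` when creating the driver; if it returns None, PhantomJS is neither on PATH nor at the fallback location.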

2. Driving PhantomJS from Selenium

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

## PhantomJS can be configured
#cap = webdriver.DesiredCapabilities.PHANTOMJS.copy()    # copy of webdriver's default PhantomJS capabilities
#cap["phantomjs.page.settings.resourceTimeout"] = 5000    # resource load timeout, in milliseconds
#cap["phantomjs.page.settings.loadImages"] = False    # whether to load images
#driver = webdriver.PhantomJS(desired_capabilities=cap)

# If PhantomJS is not on PATH, the path to the executable must be given
#driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs\bin\phantomjs.exe")
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(5)    # page load timeout, in seconds
#driver.set_script_timeout(5)    # script (JS) timeout, in seconds; either timeout raises TimeoutException

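## Stop loading the page once the timeout is hit
## Some pages keep loading other resources after the data you want has already arrived
try:
    driver.get("http://www.tvmao.com")    # driver.get needs a full URL, including the scheme
except TimeoutException:
    driver.execute_script("window.stop()")

## With the page source in hand, you can save it and go on to parse the data
page_source = driver.page_source    # page_source is a property, not a method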

############
#
# Data parsing goes here
#
############
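The original leaves the parsing step open. As one possible illustration, the sketch below pulls every link out of a page source string with the standard library's `html.parser`. The class `LinkCollector`, the function `extract_links`, and the sample HTML are all hypothetical; a real crawl would target whatever elements hold the data you need.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside an <a> that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(page_source):
    """Return a list of (href, text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(page_source)
    parser.close()
    return parser.links


# Made-up sample standing in for driver.page_source
sample = '<html><body><a href="/a">First</a><p>x</p><a href="/b"> Second </a></body></html>'
links = extract_links(sample)
```

In the crawler, `extract_links(page_source)` would be called on the string obtained from the driver instead of on `sample`.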

For the options PhantomJS lets you configure, see the official documentation.
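Page settings are passed to PhantomJS as capability keys with the `phantomjs.page.settings.` prefix, as in the commented-out configuration in the code above. A small helper can build those keys from keyword arguments. This is a sketch only: `phantomjs_capabilities` is a hypothetical name, the base dictionary here is a stand-in for webdriver's default PhantomJS capabilities, and the user agent string is made up.

```python
def phantomjs_capabilities(base=None, **page_settings):
    """Return a new capabilities dict with PhantomJS page settings added.

    Each keyword argument becomes a "phantomjs.page.settings.<name>" key.
    The `base` dict is copied, so the caller's defaults are not mutated.
    """
    caps = dict(base or {})
    for name, value in page_settings.items():
        caps["phantomjs.page.settings." + name] = value
    return caps


# Stand-in for webdriver.DesiredCapabilities.PHANTOMJS (assumption)
defaults = {"browserName": "phantomjs"}

caps = phantomjs_capabilities(
    defaults,
    resourceTimeout=5000,                       # milliseconds
    loadImages=False,                           # skip images to speed up loading
    userAgent="Mozilla/5.0 (example crawler)",  # made-up user agent string
)
```

The resulting `caps` would then be passed as `desired_capabilities` when creating the PhantomJS driver.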


Original article: https://www.cnblogs.com/bencakes/p/5971859.html