
Selenium & PhantomJS

Problem: scraping pages whose data is loaded dynamically

Selenium
PhantomJS

Selenium: a third-party library that drives a real browser and automates its actions

- Environment setup
    1 Install: pip install selenium
    2 Get the driver program for your browser
        Download: http://chromedriver.storage.googleapis.com/index.html
        Table of matching browser and driver versions: https://blog.csdn.net/huilan_same/article/details/51896672
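
Before creating a browser object, it can help to confirm that the driver binary is where the script expects it. The following is a minimal sketch, assuming the driver sits either in the working directory or somewhere on PATH; `locate_driver` is a hypothetical helper name and not part of Selenium.

```python
import os
import shutil


def locate_driver(name="chromedriver", search_dir="."):
    """Return a path to the driver executable, or None if it cannot be found.

    Hypothetical helper: looks in search_dir first, then on PATH.
    """
    # On Windows the binary is chromedriver.exe, elsewhere just chromedriver.
    for candidate in (name, name + ".exe"):
        local = os.path.join(search_dir, candidate)
        if os.path.isfile(local):
            return os.path.abspath(local)
    # shutil.which searches the directories listed in PATH.
    return shutil.which(name)


if __name__ == "__main__":
    path = locate_driver()
    print(path or "chromedriver not found; download it first")
```

The returned path can then be passed as `executable_path` when the browser object is created.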

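Use the methods below to locate an element, then operate on it. (These find_element_by_* helpers are the Selenium 3 and earlier API; Selenium 4 deprecated them in favor of find_element(By.ID, ...) and its siblings.)

find_element_by_id             find a node by id
find_element_by_name           find a node by name
find_element_by_xpath          find a node by XPath
find_element_by_tag_name       find a node by tag name
find_element_by_class_name     find a node by class name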
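
The five lookups above differ only in the locator strategy. Here is a minimal sketch of how they map onto the generic `find_element(by, value)` call that Selenium also provides. The `LOCATORS` table and the `find` helper are hypothetical names used for illustration, and the strategy strings are written out on the assumption that they match the constants in `selenium.webdriver.common.by.By`.

```python
# Strategy strings as defined by selenium.webdriver.common.by.By
# (By.ID == "id", By.TAG_NAME == "tag name", and so on).
LOCATORS = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "tag_name": "tag name",
    "class_name": "class name",
}


def find(driver, how, value):
    """Hypothetical helper: find(bro, "id", "kw") mirrors find_element_by_id("kw")."""
    if how not in LOCATORS:
        raise ValueError("unknown locator strategy: %r" % how)
    # find_element(by, value) is the generic lookup behind the by_* shortcuts.
    return driver.find_element(LOCATORS[how], value)
```

In real code one would normally import `By` and write `driver.find_element(By.ID, "kw")` directly; the table only makes the correspondence explicit.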

# Workflow:
from selenium import webdriver
from time import sleep

# Create a browser object; executable_path is the path to the driver
bro = webdriver.Chrome(executable_path='./chromedriver')
# get() makes the browser request the given URL
bro.get('http://www.baidu.com')
sleep(1)
# Have Baidu search for a given term
text = bro.find_element_by_id('kw')  # locate the search text box
text.send_keys('人民币')  # send_keys types the given text into the box
sleep(1)
button = bro.find_element_by_id('su')  # locate the search button
button.click()  # click performs a mouse click
sleep(3)
bro.quit()  # close the browser
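
The fixed `sleep` calls above only give the page time to react, so they are either too long or too short. One hedged alternative is to poll until the element actually appears. The sketch below is a generic helper (`wait_until` is a hypothetical name); Selenium's own `WebDriverWait` class serves the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    Exceptions raised by condition() are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element is not in the page yet
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(interval)
```

With the browser object from the workflow, `text = wait_until(lambda: bro.find_element_by_id('kw'))` would replace the first `sleep(1)` followed by the lookup.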

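PhantomJS: a headless browser (no UI); automating it follows the same steps as the Chrome workflow above. With a recent Selenium, creating the PhantomJS driver prints the following warning: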

C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '

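Cause of the warning and how to get rid of it

Recent Selenium releases have deprecated PhantomJS support (and Selenium 4 removed it), so to keep using PhantomJS you have to downgrade Selenium.

Run the following two commands in a command prompt (cmd), one after the other:

pip uninstall selenium
pip install selenium==2.48.0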
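
The warning itself suggests another route: running Chrome in headless mode instead of downgrading. The following is a minimal sketch, assuming a Selenium 3.8+ install in which `webdriver.Chrome` accepts an `options` argument, and a chromedriver at `./chromedriver`. `headless_arguments` and `make_headless_chrome` are hypothetical helper names.

```python
def headless_arguments(width=1920, height=1080):
    """Command-line switches that make Chrome run without a visible window."""
    return [
        "--headless",
        "--disable-gpu",  # historically recommended on Windows
        "--window-size=%d,%d" % (width, height),
    ]


def make_headless_chrome(driver_path="./chromedriver"):
    """Create a headless Chrome driver.

    Assumes the Selenium 3.8+ constructor; newer Selenium 4 releases take a
    Service object instead of executable_path.
    """
    # Imported here so headless_arguments() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(executable_path=driver_path, options=options)
```

The object returned by `make_headless_chrome()` supports the same `get`, `save_screenshot`, and `quit` calls used with PhantomJS in this post.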

from selenium import webdriver

# The r prefix makes the Windows path a raw string
bro = webdriver.PhantomJS(executable_path=r'C:\Users\Administrator\Desktop\phantomjs-2.1.1-windows\bin\phantomjs.exe')
# Load the page
bro.get('http://www.baidu.com')
# Take a screenshot
bro.save_screenshot('./1.png')
text = bro.find_element_by_id('kw')  # locate the search text box
text.send_keys('人民币')  # send_keys types the given text into the box
bro.save_screenshot('./2.png')

bro.quit()

# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
# Fix: put r in front of the path so that \U in C:\Users is not read as the start of a Unicode escape:  executable_path=r'C:\Users\Administrator\Desktop\phantomjs-2.1.1-windows\bin\phantomjs.exe'
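
To make the fix concrete, here is a small illustration of ways to write a Windows path in Python without triggering the escape error. The shortened path is only an example, and the variable names are made up for this sketch.

```python
from pathlib import PureWindowsPath

# Three ways to write the same Windows path without escape errors.
raw = r'C:\Users\Administrator\Desktop'        # raw string: backslashes kept as-is
escaped = 'C:\\Users\\Administrator\\Desktop'  # every backslash doubled
forward = 'C:/Users/Administrator/Desktop'     # forward slashes, generally accepted on Windows

# The raw and the escaped spelling produce the identical string.
assert raw == escaped
# PureWindowsPath normalizes forward slashes to backslashes.
assert str(PureWindowsPath(forward)) == raw
```

Writing `'C:\Users\...'` without the `r` prefix fails at compile time because `\U` must be followed by eight hex digits.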

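Using Selenium + PhantomJS to scrape dynamically loaded data

- Goal: fetch the additional movie details that Douban Movies loads dynamically as the page is scrolled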

from selenium import webdriver
from time import sleep

bro = webdriver.PhantomJS(executable_path=r'C:\Users\Administrator\Desktop\phantomjs-2.1.1-windows\bin\phantomjs.exe')
url = 'https://movie.douban.com/typerank?type_name=%E6%AD%8C%E8%88%9E&type=7&interval_id=100:90&action='
bro.get(url)
sleep(1)
bro.save_screenshot('./1.png')
# JS snippet that scrolls the page down to the bottom
js = 'window.scrollTo(0,document.body.scrollHeight)'
# execute_script makes the browser object run the JS code
bro.execute_script(js)
sleep(1)
bro.save_screenshot('./2.png')
# Get the page after the extra data has loaded: page_source holds the browser's current page content
page_text = bro.page_source
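
A single scroll loads only one more batch of movies. Below is a minimal sketch of repeating the scroll until the page stops growing, assuming the page height increases each time new entries are appended. `scroll_to_end` is a hypothetical helper name, and the pause and round limit are arbitrary defaults.

```python
import time


def scroll_to_end(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed. `driver` is any object with
    Selenium's execute_script(script) method.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

Calling `scroll_to_end(bro)` before reading `bro.page_source` would capture every batch the page is willing to load, up to `max_rounds` scrolls.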


Original article: https://www.cnblogs.com/lys666/p/10478520.html