Crawler 3: HTML pages + the webdriver module + demo
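Some well-protected sites do not return their page content to a plain request. In that case you can use the webdriver module to launch a real browser first and scrape from it. You can even operate on the page, for example with click, and then scrape the result.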
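The click-then-scrape idea can be sketched roughly as follows. This is a minimal sketch, not the author's code: the helper name `fetch_html_after_click` and its parameters are hypothetical, and it assumes the caller passes in an already created Selenium driver.

```python
def fetch_html_after_click(driver, url, click_selector, wait_seconds=10):
    """Open url, click the first element matching click_selector, return the HTML.

    Hypothetical helper: `driver` is assumed to be a Selenium WebDriver
    that the caller has already created (Firefox, Chrome, ...).
    """
    # Let element lookups poll for up to wait_seconds before giving up.
    driver.implicitly_wait(wait_seconds)
    driver.get(url)
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    button = driver.find_element("css selector", click_selector)
    button.click()
    # page_source returns the source of the page currently loaded in the browser.
    return driver.page_source
```

With a real browser this might be called as `fetch_html_after_click(webdriver.Firefox(), url, "a.more")`, where the URL and the selector are placeholders. If the click triggers a slow asynchronous update, an explicit wait before reading `page_source` may still be needed.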

A typical demo flow:

1) Import the selenium module
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
2) Use the Firefox browser (Chrome works too)
driver = webdriver.Firefox()
3) Open the page with get (the URL is omitted for confidentiality)
driver.get("http://www.---------------")
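4) Filter with a CSS selector
from selenium.webdriver.common.by import By
# find_elements_by_css_selector was removed in Selenium 4; use find_elements with By.CSS_SELECTOR
elements = driver.find_elements(By.CSS_SELECTOR, "span.c9.ng-binding")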
5) The webdriver module's own filtering is not very convenient, so it is better to take the page as HTML and filter it with BeautifulSoup
html = driver.page_source
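6) Filter the information with BeautifulSoup; find_all and CSS selectors are more convenient
from bs4 import BeautifulSoup
import re

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")
# soup.find_all('div', text=re.compile(u"信息"))[0]
# Print every link whose href contains "human"
for i in soup.select('a[href*="human"]'):
    print(i)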
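The two filtering styles from step 6 can be tried without a browser. The sketch below runs them against a small hand-written HTML string that stands in for `driver.page_source`. The sample markup and the helper names `extract_links` and `find_divs_containing` are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source.
SAMPLE_HTML = """
<div>
  <span class="c9 ng-binding">first</span>
  <a href="/human/1">Alice</a>
  <a href="/robot/2">R2</a>
  <a href="/human/3">Bob</a>
  <div>Info: details</div>
</div>
"""


def extract_links(html, href_part):
    """Return (href, text) pairs for links whose href contains href_part."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS attribute selector: *= means "attribute value contains".
    links = soup.select(f'a[href*="{href_part}"]')
    return [(a["href"], a.get_text(strip=True)) for a in links]


def find_divs_containing(html, pattern):
    """Return div tags whose own text matches the regular expression."""
    soup = BeautifulSoup(html, "html.parser")
    # string= only matches tags whose sole content is a single text node,
    # so the outer wrapper div is not returned here.
    return soup.find_all("div", string=re.compile(pattern))
```

For example, `extract_links(SAMPLE_HTML, "human")` keeps the two `/human/` links and skips the `/robot/` one, while `find_divs_containing(SAMPLE_HTML, "Info")` returns only the inner div.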
Original article: https://www.cnblogs.com/rongyux/p/5513780.html