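I. What Selenium is
Selenium is a module for browser automation.
How Selenium relates to web scraping:
1. It makes it easy to obtain dynamically loaded data.
2. It can simulate a login.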
Setup
    pip install selenium
Get the driver program for your browser.
Download link for the Google Chrome driver: http://chromedriver.storage.googleapis.com/index.html
Basic Selenium usage
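    from time import sleep

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # Instantiate a browser object (Selenium 4 style: the driver path goes through Service)
    bro = webdriver.Chrome(service=Service('./chromedriver.exe'))
    # Send a request
    bro.get('https://www.jd.com/')
    sleep(2)
    # Locate the search box and type a query
    search_tag = bro.find_elements(By.XPATH, '//*[@id="key"]')[0]
    search_tag.send_keys('mac pro')
    # Locate the search button and click it
    btn = bro.find_element(By.XPATH, '//*[@id="search"]/div/div[2]/button')
    btn.click()
    # Inject JavaScript to scroll to the bottom of the page
    bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    sleep(3)
    bro.quit()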
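The fixed `sleep(2)` calls above pause for the same time whether or not the page is ready. A common alternative is to poll for a condition until a timeout expires; Selenium provides this as `WebDriverWait`. The sketch below shows the same idea in plain Python so that it runs without a browser. The helper name `wait_until`, its parameters, and the stand-in `search_box_ready` function are all hypothetical and not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    `wait_until` is a hypothetical helper, not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page whose search box only appears on the third check.
attempts = {"count": 0}


def search_box_ready():
    attempts["count"] += 1
    return "search box" if attempts["count"] >= 3 else None


found = wait_until(search_box_ready, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

With a real driver, the condition could be a lambda that looks up the `//*[@id="key"]` element with a "find elements" call, which returns an empty list until the element exists.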
Scraping dynamically loaded data with Selenium
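    from time import sleep

    from lxml import etree
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # Instantiate a browser object (Selenium 4 style: the driver path goes through Service)
    bro = webdriver.Chrome(service=Service('./chromedriver.exe'))
    bro.get('http://125.35.6.84:81/xk/')
    sleep(1)
    # page_source holds all the page data the browser is currently displaying
    page_text = bro.page_source
    all_page_text = [page_text]
    # Click "next page" three times, saving the source of each page
    for i in range(1, 4):
        next_page_tag = bro.find_element(By.XPATH, '//*[@id="pageIto_next"]')
        next_page_tag.click()
        sleep(1)
        all_page_text.append(bro.page_source)
    # Parse every saved page and print the titles it contains
    for page_text in all_page_text:
        tree = etree.HTML(page_text)
        li_list = tree.xpath('//*[@id="gzlist"]/li')
        for li in li_list:
            title = li.xpath('./dl/a/text()')
            print(title)
    sleep(3)
    bro.quit()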
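The paging loop above can be pulled into a small reusable function. This is a minimal sketch under stated assumptions: `collect_page_sources` and `FakeDriver` are hypothetical names introduced here for illustration, and the fake driver exists only so the example runs offline. The function passes the locator strategy as the plain string "xpath", which is the value Selenium's `By.XPATH` constant holds.

```python
import time


def collect_page_sources(driver, next_xpath, extra_pages, delay=1.0):
    """Return the current page source plus `extra_pages` further pages.

    `driver` only needs a `page_source` attribute and a
    `find_element(by, value)` method, as Selenium drivers provide.
    """
    sources = [driver.page_source]
    for _ in range(extra_pages):
        driver.find_element("xpath", next_xpath).click()
        time.sleep(delay)  # crude fixed wait for the next page to render
        sources.append(driver.page_source)
    return sources


class FakeDriver:
    """Hypothetical stand-in for a browser so the sketch runs offline."""

    def __init__(self):
        self.page = 1

    @property
    def page_source(self):
        return f"<ul id='gzlist'><li>page {self.page}</li></ul>"

    def find_element(self, by, value):
        return self  # the fake driver doubles as the "next page" button

    def click(self):
        self.page += 1


pages = collect_page_sources(FakeDriver(), '//*[@id="pageIto_next"]', 3, delay=0)
print(len(pages))
```

Keeping the collection step separate from the parsing step means the browser can be closed as soon as the page sources are saved, before any `etree` work starts.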