  • Fetching dynamic, JavaScript-rendered pages with Selenium

    Using Selenium with a WebKit-based browser (PhantomJS) to build a crawler:

    http://www.cnblogs.com/luxiaojun/p/6144748.html

    https://www.cnblogs.com/chenqingyang/p/3772673.html

    Headless Chrome has since replaced PhantomJS:

    https://zhuanlan.zhihu.com/p/27100187
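The same kind of fetch can be sketched with headless Chrome. This is a minimal sketch under stated assumptions, not the original author's code: it assumes a recent Selenium release, a local Chrome install, and a chromedriver that Selenium can locate. The helper names `headless_arguments` and `fetch_rendered_html` are hypothetical.

```python
def headless_arguments(user_agent=None):
    """Build the Chrome command-line switches for a headless session."""
    args = ["--headless", "--disable-gpu"]
    if user_agent:
        # Chrome accepts a custom User-Agent as a command-line switch.
        args.append("--user-agent=" + user_agent)
    return args


def fetch_rendered_html(url, user_agent=None):
    """Open `url` in headless Chrome and return the rendered page source."""
    # Imported here so headless_arguments() can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in headless_arguments(user_agent):
        options.add_argument(arg)

    # Assumption: chromedriver is on PATH or can be found by Selenium.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


# Example call (needs Chrome and chromedriver):
# html = fetch_rendered_html("http://chart.icaile.com/sd11x5.php")
```

Wrapping the session in `try`/`finally` keeps stray browser processes from piling up when a page fails to load.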

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    import time

    # Note: webdriver.PhantomJS is only available in Selenium 3.x and earlier.
    # Set a custom User-Agent for PhantomJS
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0"
    )

    # Start PhantomJS (raw string so the backslashes in the Windows path are kept)
    obj = webdriver.PhantomJS(
        executable_path=r'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\Scripts\phantomjs.exe',
        desired_capabilities=dcap,
    )
    obj.get('http://chart.icaile.com/sd11x5.php')  # open the page

    # time.sleep(10)  # uncomment if the page needs extra time to render
    pageSource = obj.page_source  # page source as seen by the browser
    print(pageSource)

    obj.quit()
    

      

    Once you have the page content, you can parse it with BeautifulSoup:

    https://cuiqingcai.com/1319.html
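As a rough illustration of that parsing step, the sketch below pulls the rows out of an element by id. It assumes the `beautifulsoup4` package is installed and that the target element contains `<tr>` rows. The function name `table_rows` and the sample HTML are made up for the example; in practice `html` would be the `page_source` returned by Selenium.

```python
from bs4 import BeautifulSoup


def table_rows(html, element_id="fixedtable"):
    """Return the cell texts of every row inside the element with `element_id`."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=element_id)
    if container is None:
        return []
    rows = []
    for tr in container.find_all("tr"):
        # Header and data cells are treated the same way.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Made-up sample markup standing in for the real page source.
sample = """
<table id="fixedtable">
  <tr><th>Draw</th><th>Numbers</th></tr>
  <tr><td>001</td><td>01 05 07 09 11</td></tr>
</table>
"""
print(table_rows(sample))
```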

    Getting the table's text directly:

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Default PhantomJS capabilities; uncomment the next line to set a custom User-Agent
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    #dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0")

    # Start PhantomJS (raw string so the backslashes in the Windows path are kept)
    obj = webdriver.PhantomJS(
        executable_path=r'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\Scripts\phantomjs.exe',
        desired_capabilities=dcap,
    )
    obj.get('http://chart.icaile.com/sd11x5.php')  # open the page

    # Read the visible text of the element whose id is "fixedtable"
    text = obj.find_element_by_id("fixedtable").text

    print(text)

    obj.quit()
    

      

    import re
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Set a custom User-Agent for PhantomJS
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0"
    )

    # Start PhantomJS (raw string so the backslashes in the Windows path are kept)
    obj = webdriver.PhantomJS(
        executable_path=r'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\Scripts\phantomjs.exe',
        desired_capabilities=dcap,
    )
    obj.get('http://chart.icaile.com/sd11x5.php')  # open the page

    page = obj.page_source

    # Collect every href value, then keep only those that contain "http"
    url_context = re.findall('href="(.*?)"', page, re.S)
    url_list = []
    for url in url_context:
        if 'http' in url:
            url_list.append(url)
            print(url)

    obj.quit()
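The regular expression above matches any `href="..."` text and then discards relative links. As a hedged alternative sketch (not the author's method), the standard library's `html.parser` can read `href` only from `<a>` tags, and `urllib.parse.urljoin` can resolve relative links against the page URL instead of dropping them. The names `LinkCollector` and `extract_links` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect unique absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                is_web_link = urlparse(absolute).scheme in ("http", "https")
                if is_web_link and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(page_source, base_url):
    """Return the links found in `page_source`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(page_source)
    return collector.links


# Made-up sample markup standing in for obj.page_source.
sample = (
    '<a href="/sd11x5.php">chart</a> '
    '<a href="http://example.com/a">a</a> '
    '<a href="javascript:void(0)">x</a>'
)
print(extract_links(sample, "http://chart.icaile.com/"))
```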
    

      

  • Original article: https://www.cnblogs.com/coolyylu/p/8277439.html