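  • Fetching JavaScript-rendered pages with Selenium

    Selenium can drive a WebKit-based browser (PhantomJS), which makes it usable as a crawler for pages whose content is generated by JavaScript: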

    http://www.cnblogs.com/luxiaojun/p/6144748.html

    https://www.cnblogs.com/chenqingyang/p/3772673.html

    Headless Chrome has since replaced PhantomJS:

    https://zhuanlan.zhihu.com/p/27100187

    import time

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Note: webdriver.PhantomJS is only available in Selenium 3.x and earlier.
    # Set the user agent that PhantomJS reports.
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) "
        "Gecko/20100101 Firefox/25.0"
    )

    # Start PhantomJS (raw string so the backslashes in the Windows path survive).
    obj = webdriver.PhantomJS(
        executable_path=r'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\Scripts\phantomjs.exe',
        desired_capabilities=dcap,
    )
    obj.get('http://chart.icaile.com/sd11x5.php')  # open the page

    # time.sleep(10)  # uncomment if the page needs more time to render
    page_source = obj.page_source  # HTML after JavaScript has run
    print(page_source)

    obj.quit()
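
Since headless Chrome has replaced PhantomJS, the same fetch could be sketched with Chrome instead. This is only an illustration: the function names `headless_chrome_arguments` and `fetch_page_source` are made up for this example, and it assumes a recent Selenium release with a matching chromedriver available on the machine.

```python
def headless_chrome_arguments(user_agent=None):
    """Return the Chrome command-line flags for a headless session."""
    args = ["--headless", "--disable-gpu"]
    if user_agent:
        # Chrome accepts a custom user agent as a command-line flag.
        args.append(f"--user-agent={user_agent}")
    return args


def fetch_page_source(url, user_agent=None):
    """Load `url` in headless Chrome and return the rendered HTML."""
    # Imported lazily so the helper above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in headless_chrome_arguments(user_agent):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # HTML after JavaScript has run
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Calling `fetch_page_source('http://chart.icaile.com/sd11x5.php')` would play the role of the PhantomJS script above; the `try`/`finally` is a design choice so that no browser process is left running if the page fails to load.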
    

      

    Once you have the page content, you can parse it with BeautifulSoup:

    https://cuiqingcai.com/1319.html
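
As a rough sketch of that parsing step, the snippet below pulls the rows out of a table by id. The sample markup is invented for illustration and is not the real structure of the page; only the id `fixedtable` comes from the scripts in this post. The helper name `table_rows` is hypothetical, and the `bs4` package is assumed to be installed.

```python
# Made-up sample markup; the real page will look different.
SAMPLE_HTML = """
<table id="fixedtable">
  <tr><th>Issue</th><th>Numbers</th></tr>
  <tr><td>001</td><td>01 02 03 04 05</td></tr>
  <tr><td>002</td><td>03 05 07 09 11</td></tr>
</table>
"""


def table_rows(html, table_id="fixedtable"):
    """Return the cell texts of each row in the table with the given id."""
    # Imported here so the module still loads where bs4 is not installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []  # no such table in this document

    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows
```

In practice the `html` argument would be the `page_source` string returned by Selenium rather than `SAMPLE_HTML`.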

    Getting the table's text directly:

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Default PhantomJS capabilities (no custom user agent this time).
    dcap = dict(DesiredCapabilities.PHANTOMJS)

    # Start PhantomJS (raw string so the backslashes in the Windows path survive).
    obj = webdriver.PhantomJS(
        executable_path=r'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\Scripts\phantomjs.exe',
        desired_capabilities=dcap,
    )
    obj.get('http://chart.icaile.com/sd11x5.php')  # open the page

    # .text returns the visible text of the element and all of its children.
    text = obj.find_element_by_id("fixedtable").text

    print(text)

    obj.quit()
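
If the table is filled in by JavaScript after the page loads, looking up `fixedtable` immediately may fail. One possible approach, sketched here under the assumption that Selenium's explicit-wait helpers are acceptable for this page, is to wait for the element before reading it. The function name `wait_for_element_text` is hypothetical.

```python
def wait_for_element_text(driver, element_id, timeout=10):
    """Wait up to `timeout` seconds for the element to exist, then return its text."""
    # Imported lazily so this sketch loads even where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # until() polls the condition and raises TimeoutException if it never holds.
    element = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, element_id))
    )
    return element.text
```

With the driver from the script above, `wait_for_element_text(obj, "fixedtable")` would stand in for the direct lookup, and it avoids a fixed `time.sleep` that is either too short or needlessly long.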
    

      

    Extracting the links on the page with a regular expression:

    import re

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Set the user agent that PhantomJS reports.
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) "
        "Gecko/20100101 Firefox/25.0"
    )

    # Start PhantomJS (raw string so the backslashes in the Windows path survive).
    obj = webdriver.PhantomJS(
        executable_path=r'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\Scripts\phantomjs.exe',
        desired_capabilities=dcap,
    )
    obj.get('http://chart.icaile.com/sd11x5.php')  # open the page

    page = obj.page_source

    # Collect every double-quoted href value, then print those containing 'http'.
    url_context = re.findall('href="(.*?)"', page, re.S)
    for url in url_context:
        if 'http' in url:
            print(url)

    obj.quit()
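
The regular expression above only sees double-quoted `href` values and skips relative links. As an alternative sketch, not the approach used in this post, the standard library's `HTMLParser` can collect the links of `<a>` tags and `urljoin` can resolve relative ones. The names `LinkCollector` and `extract_links` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # ignore <link>, <base> and other tags that carry href
        for name, value in attrs:
            if name == "href" and value:
                # Relative links such as "/a.php" become absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(page_source, base_url):
    """Return the absolute http(s) links in `page_source`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(page_source)
    return [
        link for link in collector.links
        if link.startswith(("http://", "https://"))
    ]
```

With a live driver this could be called as `extract_links(obj.page_source, obj.current_url)` before `obj.quit()`. Unlike the regex, it reports only anchor links and drops `javascript:` pseudo-links.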
    

      

  • Original article: https://www.cnblogs.com/coolyylu/p/8277439.html