  • Scrolling down a page slowly with Selenium

    I am trying to scrape some data from a flight search page.

    The page works like this:

    You fill in a form and click the search button – that part is fine. When you click the button you are redirected to the results page, and that is where the problem is. The page keeps appending results continuously for about a minute, which by itself is not a big deal – the problem is collecting all of those results. In a real browser you have to scroll down the page for the results to appear, so I tried to scroll down with Selenium. But it scrolls to the bottom of the page very quickly, or rather jumps instead of scrolling, and then the page does not load any new results.

    When you scroll down slowly, more results are loaded, but if you scroll too quickly the loading stops.
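
    In other words, the scrolling has to happen in small steps with a pause after each one, so the page has time to append results. Below is a minimal sketch of that idea only; the step size, pause and helper name are arbitrary values picked for illustration:
    import time

    def scroll_slowly(driver, step=500, pause=1.0, max_steps=100):
        """Scroll down in small increments, pausing after each step."""
        for _ in range(max_steps):
            driver.execute_script("window.scrollBy(0, arguments[0]);", step)
            time.sleep(pause)  # give the page a chance to load more results
            at_bottom = driver.execute_script(
                "return window.pageYOffset + window.innerHeight >= document.body.scrollHeight;")
            if at_bottom:
                break  # reached the bottom of the currently loaded page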

    I am not sure whether my code helps to explain the problem, but I am attaching it anyway:

    import re

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    import mLib  # the asker's own helper module (provides getSoup_html)

    SEARCH_STRING = """URL"""
    
    class spider():
    
        def __init__(self):
            self.driver = webdriver.Firefox()
    
        @staticmethod
        def prepare_get(dep_airport,arr_airport,dep_date,arr_date):
            string = SEARCH_STRING%(dep_airport,arr_airport,arr_airport,dep_airport,dep_date,arr_date)
            return string
    
    
        def find_flights_html(self,dep_airport, arr_airport, dep_date, arr_date):
            if isinstance(dep_airport, list):
                airports_string = str(r'%20').join(dep_airport)
                dep_airport = airports_string
    
            wait = WebDriverWait(self.driver, 60) # wait for results
            self.driver.get(spider.prepare_get(dep_airport, arr_airport, dep_date, arr_date))
            wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
            wait.until(EC.invisibility_of_element_located((By.XPATH, u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))
            self.driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
    
            self.driver.find_element_by_xpath('//body').send_keys(Keys.CONTROL+Keys.END)
            return self.driver.page_source
    
        @staticmethod 
        def get_info_from_borderbox(div):
            arrival = div.find('div',class_='departure').text
            price = div.find('div',class_='pricebox').find('div',class_=re.compile('price'))
            departure = div.find_all('div',class_='departure')[1].contents
            date_departure = departure[1].text 
            airport_departure = departure[5].text
            arrival = div.find_all('div', class_= 'arrival')[0].contents
            date_arrival = arrival[1].text
            airport_arrival = arrival[3].text[1:]
            print 'DEPARTURE: ' 
            print date_departure,airport_departure
            print 'ARRIVAL: '
            print date_arrival,airport_arrival
    
        @staticmethod
        def get_flights_from_result_page(html):
    
            def match_tag(tag, classes):
                return (tag.name == 'div'
                        and 'class' in tag.attrs
                        and all([c in tag['class'] for c in classes]))
    
            soup = mLib.getSoup_html(html)
            divs = soup.find_all(lambda t: match_tag(t, ['borderbox', 'flightbox', 'p2']))
    
            for div in divs:
                spider.get_info_from_borderbox(div)
    
            print len(divs)
    
    
    spider_inst = spider() 
    
    print spider.get_flights_from_result_page(spider_inst.find_flights_html(['BTS','BRU','PAR'], 'MAD', '2015-07-15', '2015-08-15'))

    So I think the main problem is that the scrolling happens too fast to trigger the loading of new results.

    Do you know how to make it work?

    Best answer
    Here is a different approach that works for me: scroll the last search result into view and wait for additional elements to load before scrolling again:
    # -*- coding: utf-8 -*-
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.support import expected_conditions as EC
    
    
    class wait_for_more_than_n_elements(object):
        def __init__(self, locator, count):
            self.locator = locator
            self.count = count
    
        def __call__(self, driver):
            try:
                count = len(EC._find_elements(driver, self.locator))
                return count >= self.count
            except StaleElementReferenceException:
                return False
    
    
    driver = webdriver.Firefox()
    
    dep_airport = ['BTS', 'BRU', 'PAR']
    arr_airport = 'MAD'
    dep_date = '2015-07-15'
    arr_date = '2015-08-15'
    
    airports_string = str(r'%20').join(dep_airport)
    dep_airport = airports_string
    
    url = "https://www.pelikan.sk/sk/flights/list?dfc=C%s&dtc=C%s&rfc=C%s&rtc=C%s&dd=%s&rd=%s&px=1000&ns=0&prc=&rng=1&rbd=0&ct=0" % (dep_airport, arr_airport, arr_airport, dep_airport, dep_date, arr_date)
    driver.maximize_window()
    driver.get(url)
    
    wait = WebDriverWait(driver, 60)
    wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
    wait.until(EC.invisibility_of_element_located((By.XPATH,
                                                   u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))
    
    while True:  # TODO: make the endless loop end
        results = driver.find_elements_by_css_selector("div.flightbox")
        print "Results count: %d" % len(results)
    
        # scroll to the last element
        driver.execute_script("arguments[0].scrollIntoView();", results[-1])
    
        # wait for more results to load
        wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, 'div.flightbox'), len(results)))

    Notes:

    > You need to figure out when to stop the loop – for example, at a particular len(results) value (see the sketch below)
    > wait_for_more_than_n_elements is a custom Expected Condition that helps to detect when the next portion of results has loaded and it is safe to scroll again
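
    For example, one way to end the loop is to cap the number of collected results and to break out when no new result appears within the wait timeout. The sketch below is only an illustration built on the answer's code above; MAX_RESULTS is an arbitrary value, and len(results) + 1 is passed so that the count >= n check in the condition really waits for at least one new flightbox:
    from selenium.common.exceptions import TimeoutException

    MAX_RESULTS = 500  # arbitrary cap, pick whatever suits your use case

    while True:
        results = driver.find_elements_by_css_selector("div.flightbox")
        print "Results count: %d" % len(results)

        if len(results) >= MAX_RESULTS:
            break  # collected enough results

        # scroll to the last result to trigger loading of the next portion
        driver.execute_script("arguments[0].scrollIntoView();", results[-1])

        try:
            # wait for at least one more flightbox; give up when none appears
            wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, 'div.flightbox'),
                                                     len(results) + 1))
        except TimeoutException:
            break  # no new results within the timeout - we are done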

    Reposted from: https://www.cnblogs.com/yipianshuying/p/10040461.html
