  • Scrolling down a page slowly with Selenium

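    I am trying to scrape some data from a flight search page.

    The page works like this:

    You fill in a form and click the search button; that part is fine. When you click the button, you are redirected to a page with the results, and this is where the problem starts. The page keeps adding results for about a minute, which is not a big deal in itself. The problem is getting all of those results. In a real browser you have to scroll down the page for the results to appear, so I tried to scroll down with Selenium. It probably reaches the bottom of the page so fast, or jumps there instead of scrolling, that the page does not load any new results.

    When you scroll down slowly, the page loads more results, but if you scroll very quickly it stops loading.

    I am not sure whether my code helps in understanding the problem, so I am attaching it.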

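    import re

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # mLib (used below) is a helper module that is not shown in the post

    SEARCH_STRING = """URL"""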
    
    class spider():
    
        def __init__(self):
            self.driver = webdriver.Firefox()
    
        @staticmethod
        def prepare_get(dep_airport,arr_airport,dep_date,arr_date):
            string = SEARCH_STRING%(dep_airport,arr_airport,arr_airport,dep_airport,dep_date,arr_date)
            return string
    
    
        def find_flights_html(self,dep_airport, arr_airport, dep_date, arr_date):
            if isinstance(dep_airport, list):
                airports_string = str(r'%20').join(dep_airport)
                dep_airport = airports_string
    
            wait = WebDriverWait(self.driver, 60) # wait for results
            self.driver.get(spider.prepare_get(dep_airport, arr_airport, dep_date, arr_date))
            wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
            wait.until(EC.invisibility_of_element_located((By.XPATH, u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))
            self.driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
    
            self.driver.find_element(By.XPATH, '//body').send_keys(Keys.CONTROL+Keys.END)
            return self.driver.page_source
    
        @staticmethod 
        def get_info_from_borderbox(div):
            arrival = div.find('div',class_='departure').text
            price = div.find('div',class_='pricebox').find('div',class_=re.compile('price'))
            departure = div.find_all('div',class_='departure')[1].contents
            date_departure = departure[1].text 
            airport_departure = departure[5].text
            arrival = div.find_all('div', class_= 'arrival')[0].contents
            date_arrival = arrival[1].text
            airport_arrival = arrival[3].text[1:]
            print('DEPARTURE: ')
            print('%s %s' % (date_departure, airport_departure))
            print('ARRIVAL: ')
            print('%s %s' % (date_arrival, airport_arrival))
    
        @staticmethod
        def get_flights_from_result_page(html):
    
            def match_tag(tag, classes):
                return (tag.name == 'div'
                        and 'class' in tag.attrs
                        and all([c in tag['class'] for c in classes]))
    
            soup = mLib.getSoup_html(html)
            divs = soup.find_all(lambda t: match_tag(t, ['borderbox', 'flightbox', 'p2']))
    
            for div in divs:
                spider.get_info_from_borderbox(div)
    
            print(len(divs))
    
    
    spider_inst = spider() 
    
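    # get_flights_from_result_page prints its findings and returns nothing,
    # so there is no return value worth printing here
    spider.get_flights_from_result_page(spider_inst.find_flights_html(['BTS','BRU','PAR'], 'MAD', '2015-07-15', '2015-08-15'))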

    So I think the main problem is that the page is scrolled too fast to trigger loading of new results.

    Do you know how to make it work?
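
    To make "scrolling slowly" concrete, here is a minimal sketch that scrolls in small steps with a pause after each one, instead of jumping to the bottom. The function name scroll_in_steps and the parameters step_px, pause, max_steps and settle_checks are hypothetical, and the values are guesses that would need tuning for the actual page. It only relies on driver.execute_script, and it treats "still at the bottom after several consecutive steps" as a sign that nothing more is being appended, which is an assumption rather than a guarantee.

```python
import time

# JavaScript check: has the viewport reached the bottom of the document?
# The 1px tolerance guards against fractional scroll positions.
AT_BOTTOM_JS = (
    "return window.innerHeight + window.pageYOffset"
    " >= document.body.scrollHeight - 1;"
)


def scroll_in_steps(driver, step_px=400, pause=0.5, max_steps=200,
                    settle_checks=3):
    """Scroll down step_px pixels at a time, pausing after every step.

    Stops after max_steps steps, or once the viewport has been at the
    bottom for settle_checks consecutive steps. Returns the number of
    scroll steps performed.
    """
    steps = 0
    bottom_hits = 0
    while steps < max_steps and bottom_hits < settle_checks:
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)
        steps += 1
        time.sleep(pause)  # give the page time to append more results
        at_bottom = driver.execute_script(AT_BOTTOM_JS)
        # reset the counter whenever the page grew and we are no longer
        # at the bottom
        bottom_hits = bottom_hits + 1 if at_bottom else 0
    return steps
```

    With a real browser this would be called as scroll_in_steps(self.driver) in place of the window.scrollTo call, before reading page_source.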

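    Best answer
    Here is a different approach that worked for me. It involves scrolling the last search result into view and waiting for additional elements to load before scrolling again: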
    # -*- coding: utf-8 -*-
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.support import expected_conditions as EC
    
    
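    class wait_for_more_than_n_elements(object):
        """Expected condition: more than `count` elements match `locator`."""

        def __init__(self, locator, count):
            self.locator = locator
            self.count = count
    
        def __call__(self, driver):
            try:
                # use the public find_elements API; EC._find_elements is a
                # private helper and may not exist in other Selenium versions
                count = len(driver.find_elements(*self.locator))
                # strictly greater: with >= the condition is already true
                # before any new result has loaded
                return count > self.count
            except StaleElementReferenceException:
                return False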
    
    
    driver = webdriver.Firefox()
    
    dep_airport = ['BTS', 'BRU', 'PAR']
    arr_airport = 'MAD'
    dep_date = '2015-07-15'
    arr_date = '2015-08-15'
    
    airports_string = str(r'%20').join(dep_airport)
    dep_airport = airports_string
    
    url = "https://www.pelikan.sk/sk/flights/list?dfc=C%s&dtc=C%s&rfc=C%s&rtc=C%s&dd=%s&rd=%s&px=1000&ns=0&prc=&rng=1&rbd=0&ct=0" % (dep_airport, arr_airport, arr_airport, dep_airport, dep_date, arr_date)
    driver.maximize_window()
    driver.get(url)
    
    wait = WebDriverWait(driver, 60)
    wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
    wait.until(EC.invisibility_of_element_located((By.XPATH,
                                                   u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))
    
    while True:  # TODO: make the endless loop end
        results = driver.find_elements(By.CSS_SELECTOR, "div.flightbox")
        print("Results count: %d" % len(results))
    
        # scroll to the last element
        driver.execute_script("arguments[0].scrollIntoView();", results[-1])
    
        # wait for more results to load
        wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, 'div.flightbox'), len(results)))

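    Notes:

    - you need to figure out when to stop the loop, for example at a particular len(results) value
    - wait_for_more_than_n_elements is a custom Expected Condition that helps detect when the next batch has loaded, so that we can scroll again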
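
    One possible way to end the loop, sketched under assumptions: stop when no new results show up within a time limit after a scroll, or when an optional cap on the number of results is reached. The function name scroll_until_stable and the parameters load_timeout, poll and max_results are hypothetical and not part of the answer above. The sketch polls with driver.find_elements rather than WebDriverWait so that it has no Selenium imports of its own, and it does not handle elements going stale between the lookup and the scroll.

```python
import time


def scroll_until_stable(driver, locator, load_timeout=10.0, poll=0.5,
                        max_results=None):
    """Scroll the last match into view until no new matches appear.

    `locator` is a (by, value) pair such as (By.CSS_SELECTOR, 'div.flightbox').
    Returns the final number of matching elements.
    """
    results = driver.find_elements(*locator)
    while results and (max_results is None or len(results) < max_results):
        # scroll to the last element currently on the page
        driver.execute_script("arguments[0].scrollIntoView();", results[-1])

        # poll until more elements appear or the time limit expires
        deadline = time.monotonic() + load_timeout
        more = results
        while len(more) <= len(results) and time.monotonic() < deadline:
            time.sleep(poll)
            more = driver.find_elements(*locator)

        if len(more) <= len(results):
            break  # nothing new within load_timeout: assume loading is done
        results = more
    return len(results)
```

    In the script above this would replace the while True loop, for example total = scroll_until_stable(driver, (By.CSS_SELECTOR, 'div.flightbox')). The time limit is a heuristic: a page that is merely slow could be cut off early, so load_timeout should be generous.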

    Reposted from: https://www.cnblogs.com/yipianshuying/p/10040461.html

  • Original URL: https://www.cnblogs.com/perfectdata/p/10586183.html