  • Integrating PhantomJS into Scrapy: disabling images and rotating the User-Agent

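    PhantomJS is a headless browser that supports a range of web standards, including DOM handling, CSS selectors, JSON, Canvas, and SVG, which makes it very useful for scraping pages rendered by JavaScript. However, PhantomJS's default User-Agent is generally blocked by sites with anti-scraping measures: the string identifies the browser, and a visitor using it is almost certainly a crawler rather than a normal user.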

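    PhantomJS behaves like a real browser: it loads and renders everything a browser would, just without a window. As a result, fetching pages with it is slow. Skipping image loading makes pages load noticeably faster.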
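    The two adjustments described above come down to one capability entry and one command-line switch. The following is a minimal sketch of how they could be assembled, using only the standard library. The `USER_AGENTS` pool and the helper `build_phantomjs_options` are hypothetical names introduced for illustration; the article itself takes its User-Agent from `fake_useragent`.

```python
import random

# Hypothetical UA pool for illustration; the article draws from fake_useragent instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 "
    "(KHTML, like Gecko) Version/11.0.2 Safari/604.4.7",
]


def build_phantomjs_options(user_agents=USER_AGENTS, load_images=False):
    """Return (capabilities, service_args) for a PhantomJS session."""
    capabilities = {
        # Capability key that overrides the User-Agent PhantomJS sends
        "phantomjs.page.settings.userAgent": random.choice(user_agents),
    }
    service_args = [
        # PhantomJS command-line switch that controls image loading
        "--load-images=" + ("true" if load_images else "false"),
    ]
    return capabilities, service_args


capabilities, service_args = build_phantomjs_options()
print(capabilities["phantomjs.page.settings.userAgent"])
print(service_args)
```

    The returned pair is shaped to be merged into the `desired_capabilities` dict and passed as `service_args` when the driver is created.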

    Below is the complete code of a Scrapy middleware that disables image loading in PhantomJS and switches to a random User-Agent:

       

    # Note: webdriver.PhantomJS is only available in Selenium 3.x and earlier.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from fake_useragent import UserAgent
    from scrapy.http import HtmlResponse


    class SeleniumSpiderMiddleware(object):

        def process_request(self, request, spider):
            # Pick a random User-Agent string
            ua = UserAgent()
            ua_use = ua.random
            # Copy the default PhantomJS capabilities so the UA can be overridden
            dcap = dict(DesiredCapabilities.PHANTOMJS)
            # --load-images=false          do not load images
            # --disk-cache=true            enable the disk cache
            # --max-disk-cache-size=1024   limit the size of the disk cache
            SERVICE_ARGS = ['--disk-cache=true', '--max-disk-cache-size=1024', '--load-images=false']
            dcap["phantomjs.page.settings.userAgent"] = ua_use
            # Start PhantomJS with the custom UA and with image loading disabled
            driver = webdriver.PhantomJS(desired_capabilities=dcap, service_args=SERVICE_ARGS)
            # The URL that Scrapy asked for
            url = request.url
            driver.get(url)
            # Wait up to 10 seconds until all td elements are present
            locator = (By.CSS_SELECTOR, 'tbody > tr > td')
            WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located(locator))
            # Take a screenshot to check whether any images were loaded
            driver.save_screenshot('aqi.png')
            body = driver.page_source
            # quit() also ends the PhantomJS process; close() only closes the window
            driver.quit()
            # body must be of type bytes
            response = HtmlResponse(url=url, request=request, encoding='utf8', body=body.encode())
            # Returning the response hands it straight to the engine; the page is not downloaded again
            return response
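
    For Scrapy to call `process_request`, the middleware has to be registered in the project's `settings.py` under `DOWNLOADER_MIDDLEWARES`. A minimal sketch, assuming the class lives in a hypothetical module `myproject.middlewares`; the priority value 543 is only an example:

```python
# settings.py (fragment)
# "myproject.middlewares" is a hypothetical module path; use the project's real one.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeleniumSpiderMiddleware": 543,
}
```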

  • Original article: https://www.cnblogs.com/liuqianli/p/8390444.html