  • [Original] Experience Notes (39): cascading unpersist of Spark caches

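    Problem: in Spark, suppose there are two DataFrames (or Datasets), where DataFrameA depends on DataFrameB and both have been cached. After DataFrameB is unpersisted, DataFrameA's cache is invalidated as well. The official explanation is as follows: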

    When invalidating a cache, we invalid other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed.

    However, in other cases, like when user simply want to drop a cache to free up memory, we do not need to invalidate dependent caches since no underlying data has been changed. For this reason, we would like to introduce a new cache invalidation mode: the non-cascading cache invalidation.

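    The earlier default is the regular mode. To guarantee that cached data is up to date (not stale), unpersisting a cache in this mode cascades: every other cache that depends on it, directly or indirectly, is cleared too.
    Spark 2.4 introduces a new mode, the non-cascading mode, in which unpersisting a cache does not cascade to dependent caches.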
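The difference between the two modes can be sketched with a toy model in plain Python. This is not Spark's implementation: `ToyCacheManager`, `remaining_after_unpersist`, and the cache names `A`, `B`, `C` are hypothetical, made up only to illustrate how a cascade walks direct and indirect dependents.

```python
from collections import defaultdict


class ToyCacheManager:
    """Toy registry of named caches. Illustrative only, NOT Spark's CacheManager."""

    def __init__(self):
        self.cached = set()                  # names that are currently cached
        self.dependents = defaultdict(set)   # upstream name -> direct downstream names

    def cache(self, name, depends_on=()):
        self.cached.add(name)
        for upstream in depends_on:
            self.dependents[upstream].add(name)

    def unpersist(self, name, cascade):
        self.cached.discard(name)
        if not cascade:
            return  # non-cascading mode: dependent caches are left alone
        # Regular mode: drop every direct and indirect dependent as well.
        stack = list(self.dependents[name])
        seen = set()
        while stack:
            child = stack.pop()
            if child in seen:
                continue
            seen.add(child)
            self.cached.discard(child)
            stack.extend(self.dependents[child])


def remaining_after_unpersist(cascade):
    """Cache B, A (depends on B) and C (depends on A), then unpersist B."""
    mgr = ToyCacheManager()
    mgr.cache("B")
    mgr.cache("A", depends_on=["B"])
    mgr.cache("C", depends_on=["A"])
    mgr.unpersist("B", cascade=cascade)
    return sorted(mgr.cached)
```

In this model, `remaining_after_unpersist(cascade=True)` leaves nothing cached, while `remaining_after_unpersist(cascade=False)` keeps `A` and `C`, which mirrors the regular and non-cascading behaviour described above.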

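    The cache operation on a DataFrame/Dataset uses the MEMORY_AND_DISK storage level by default. Unless you have explicitly specified a memory-only level (e.g. MEMORY_ONLY) and confirmed that memory is sufficient, unpersisting an earlier cache seems unnecessary.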
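One way to check this behaviour on a particular Spark version is to read `storageLevel` on the downstream DataFrame before and after unpersisting the upstream one. The following PySpark function is a minimal sketch under assumptions: the function name, the generated data, and the `bucket` column are hypothetical, and the caller is assumed to pass in an existing `SparkSession`.

```python
def storage_levels_after_upstream_unpersist(spark):
    """Return df_a's storage level before and after df_b.unpersist().

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    # Hypothetical upstream data: a single integer column named "k".
    df_b = spark.range(1000).withColumnRenamed("id", "k")
    df_b.cache()  # no level given, so the default storage level is used

    # Downstream DataFrame derived from df_b, cached as well.
    df_a = df_b.groupBy((df_b["k"] % 10).alias("bucket")).count()
    df_a.cache()
    df_a.count()  # run an action so the cached data is actually computed

    level_before = df_a.storageLevel
    df_b.unpersist()
    level_after = df_a.storageLevel
    return level_before, level_after
```

Called as `storage_levels_after_upstream_unpersist(SparkSession.builder.getOrCreate())`, the first value shows the default level in use. Going by the description above, the second value would be expected to still report a level that uses memory or disk on Spark 2.4 and later, and to report no storage on earlier versions; the sketch is a way to verify that, not a guarantee of the result.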

    References:
    https://issues.apache.org/jira/browse/SPARK-21478
    https://issues.apache.org/jira/browse/SPARK-24596
    https://issues.apache.org/jira/browse/SPARK-21579

  • Original article: https://www.cnblogs.com/barneywill/p/10524805.html