  • Spark 2.x (56): Queue's AM resource limit exceeded.

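    Background:

    To meet a business requirement, the data was split into 60 parts and 60 applications were launched, one per part. The submit script for each application is: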

    #!/bin/sh
    #LANG=zh_CN.utf8
    #export LANG
    export SPARK_KAFKA_VERSION=0.10
    export LANG=zh_CN.UTF-8

    # Collect every jar under sparkjars/ into a comma-separated list for --jars.
    jarspath=''
    for file in /home/dx/pro2.0/app01/sparkjars/*.jar
    do
      jarspath=${file},$jarspath
    done
    jarspath=${jarspath%?}  # strip the trailing comma
    echo "$jarspath"

    # In yarn-cluster mode the driver runs inside the YARN ApplicationMaster container.
    ./bin/spark-submit.sh \
      --jars "$jarspath" \
      --properties-file ../conf/spark-properties.conf \
      --verbose \
      --master yarn \
      --deploy-mode cluster \
      --name "Streaming-$2-$3-$4-$5-$1-Agg-Parser" \
      --driver-memory 9g \
      --driver-cores 1 \
      --num-executors 1 \
      --executor-cores 12 \
      --executor-memory 22g \
      --driver-java-options "-XX:+TraceClassPaths" \
      --class com.dx.app01.streaming.Main \
      /home/dx/pro2.0/app01/lib/app01-streaming-driver.jar $1 $2 $3 $4 $5

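    The cluster has 43 worker nodes, each with 24 vcores and 64G of memory.

    YARN configuration:

    yarn.scheduler.minimum-allocation-mb  minimum memory a single container can request  1G
    yarn.scheduler.maximum-allocation-mb  maximum memory a single container can request  51G
    yarn.nodemanager.resource.cpu-vcores  total virtual CPUs available to the NodeManager  21 vcores
    yarn.nodemanager.resource.memory-mb  maximum memory available on each node; the two scheduler values above should not exceed it  51G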
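
    As a rough sanity check of what this configuration provides, the sketch below multiplies out the figures quoted above and compares them with what 60 applications would request. It counts only the --driver-memory and --executor-memory values from the submit script and ignores any per-container memory overhead Spark may add, so treat the result as an estimate. Whether vcores are enforced at all depends on the scheduler's resource calculator, which the post does not state.

```shell
#!/bin/sh
# Figures taken from the text above.
nodes=43
node_mem_gb=51        # yarn.nodemanager.resource.memory-mb
node_vcores=21        # yarn.nodemanager.resource.cpu-vcores
driver_gb=9           # --driver-memory
executor_gb=22        # --executor-memory, one executor per app
executor_cores=12     # --executor-cores
apps=60

cluster_mem_gb=$((nodes * node_mem_gb))
cluster_vcores=$((nodes * node_vcores))
demand_gb=$((apps * (driver_gb + executor_gb)))
# If vcores are enforced, this many 12-core executors fit across the cluster.
executors_by_vcores=$((nodes * (node_vcores / executor_cores)))

echo "cluster memory: ${cluster_mem_gb}G, vcores: ${cluster_vcores}"
echo "memory requested by ${apps} apps: ${demand_gb}G"
echo "12-core executors that fit if vcores are enforced: ${executors_by_vcores}"
```

    By this estimate, total memory alone would not stop all 60 applications from running, and even with vcores enforced there is room for one executor per node.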

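    Problem:

    Running the script above started 60 applications, but testing showed that at most 24 of them could run at once; the rest stayed in the ACCEPTED state. In this setup at least 43 applications need to run.

    The yarn node -list command shows the containers running on each node: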

    Node-Id Node-State Node-Http-Address Number-of-Running-Containers
    node-53:45454 RUNNING node-53:8042 1
    node-62:45454 RUNNING node-62:8042 4
    node-44:45454 RUNNING node-44:8042 3
    node-37:45454 RUNNING node-37:8042 0
    node-35:45454 RUNNING node-35:8042 1
    node-07:45454 RUNNING node-07:8042 0
    node-30:45454 RUNNING node-30:8042 0
    node-56:45454 RUNNING node-56:8042 2
    node-47:45454 RUNNING node-47:8042 0
    node-42:45454 RUNNING node-42:8042 2
    node-03:45454 RUNNING node-03:8042 6
    node-51:45454 RUNNING node-51:8042 2
    node-33:45454 RUNNING node-33:8042 1
    node-04:45454 RUNNING node-04:8042 1
    node-48:45454 RUNNING node-48:8042 6
    node-39:45454 RUNNING node-39:8042 0
    node-60:45454 RUNNING node-60:8042 1
    node-54:45454 RUNNING node-54:8042 0
    node-45:45454 RUNNING node-45:8042 0
    node-63:45454 RUNNING node-63:8042 1
    node-09:45454 RUNNING node-09:8042 1
    node-01:45454 RUNNING node-01:8042 1
    node-36:45454 RUNNING node-36:8042 3
    node-06:45454 RUNNING node-06:8042 0
    node-61:45454 RUNNING node-61:8042 1
    node-31:45454 RUNNING node-31:8042 0
    node-40:45454 RUNNING node-40:8042 0
    node-57:45454 RUNNING node-57:8042 1
    node-59:45454 RUNNING node-59:8042 1
    node-43:45454 RUNNING node-43:8042 1
    node-52:45454 RUNNING node-52:8042 1
    node-34:45454 RUNNING node-34:8042 1
    node-38:45454 RUNNING node-38:8042 0
    node-50:45454 RUNNING node-50:8042 4
    node-46:45454 RUNNING node-46:8042 1
    node-08:45454 RUNNING node-08:8042 1
    node-55:45454 RUNNING node-55:8042 1
    node-32:45454 RUNNING node-32:8042 0
    node-41:45454 RUNNING node-41:8042 2
    node-05:45454 RUNNING node-05:8042 1
    node-02:45454 RUNNING node-02:8042 1
    node-58:45454 RUNNING node-58:8042 0
    node-49:45454 RUNNING node-49:8042 0
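
    To read a listing like this without counting by hand, a small awk filter can total the last column. This is a sketch that assumes the four-column layout shown above (node id, state, HTTP address, container count); summarize_nodes is a helper name made up for this example.

```shell
#!/bin/sh
# Summarize `yarn node -list` output read from stdin.
# Only rows whose second column is RUNNING are counted, so header lines are skipped.
summarize_nodes() {
  awk '$2 == "RUNNING" {
         nodes++
         containers += $NF
         if ($NF == 0) idle++
       }
       END { printf "nodes=%d idle=%d containers=%d\n", nodes, idle, containers }'
}

# Example with three rows copied from the listing above.
printf '%s\n' \
  'node-53:45454 RUNNING node-53:8042 1' \
  'node-62:45454 RUNNING node-62:8042 4' \
  'node-37:45454 RUNNING node-37:8042 0' | summarize_nodes
```

    On a live cluster the same function can be fed directly: yarn node -list | summarize_nodes. In the 43 rows above, 14 nodes show 0 running containers.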

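    Clearly, some nodes in the cluster are still unused, which indicates that resources are sufficient.

    So at least 43 applications should be able to run, yet only 24 are running, and YARN reports the following diagnostic: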

    [Tue Jul 30 16:33:29 +0000 2019] Application is added to the scheduler and is not yet activated. 
    Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; 
    AM Resource Request = <memory:9216MB(9G), vCores:1>; 
    Queue Resource Limit for AM = <memory:454656MB(444G), vCores:1>; 
    User AM Resource Limit of the queue = <memory:229376MB(224G), vCores:1>; 
    Queue AM Resource Usage = <memory:221184MB(216G), vCores:24>;

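    Solution:

    The line "Queue AM Resource Usage = <memory:221184MB(216G), vCores:24>;" in the diagnostic reflects the 24 apps already running (in yarn-cluster mode every app has one driver, and the driver runs inside the AM): each driver uses 1 vcore, 24 vcores in total, and each driver uses 9G of memory, 9G*24=216G.
    The line "User AM Resource Limit of the queue = <memory:229376MB(224G), vCores:1>;" says that the ApplicationMasters started by this user in the queue may use at most 224G in total. Activating a 25th AM would need 216G+9G=225G, which exceeds 224G, so the application waits in ACCEPTED. This limit is governed by the parameter "yarn.scheduler.capacity.maximum-am-resource-percent".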
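
    The numbers in the diagnostic can be checked directly. The sketch below uses only the values printed in the message and assumes, as a simplification, that an application is activated only while the AM memory already in use plus the new request stays within the user AM limit.

```shell
#!/bin/sh
# Values copied from the YARN diagnostic above (all in MB).
am_request_mb=9216        # AM Resource Request
user_am_limit_mb=229376   # User AM Resource Limit of the queue
am_usage_mb=221184        # Queue AM Resource Usage

running_ams=$((am_usage_mb / am_request_mb))
max_ams=$((user_am_limit_mb / am_request_mb))
after_next_mb=$((am_usage_mb + am_request_mb))

echo "AMs running now: ${running_ams}"
echo "AMs that fit under the user limit: ${max_ams}"
if [ "$after_next_mb" -gt "$user_am_limit_mb" ]; then
  echo "one more AM needs ${after_next_mb}MB > ${user_am_limit_mb}MB: stays in ACCEPTED"
fi
```

    Integer division gives 24 in both cases, which matches the 24 applications observed running.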

    yarn.scheduler.capacity.maximum-am-resource-percent

    / yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent

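    The upper bound on the share of cluster resources that may be used to run ApplicationMasters; it is typically used to limit the number of concurrently active applications. The value is a float and defaults to 0.1, i.e. 10%.

    The limit for all queues is set with yarn.scheduler.capacity.maximum-am-resource-percent (which acts as the default),

    while an individual queue can override it with yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent.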

    1) Increase yarn.scheduler.capacity.maximum-am-resource-percent

    <property>
        <!-- Maximum resources to allocate to application masters
        If this is too high application masters can crowd out actual work -->
        <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
        <value>0.5</value>
    </property>

    2) Reduce the driver memory.
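
    Both options can be sized with a little integer arithmetic. The sketch below is an estimate under stated assumptions, not a description of how YARN computes its limits: for option 1 it assumes the queue has access to the whole cluster (43 nodes × 51G) and ignores per-user limits; for option 2 it keeps the current 229376MB user AM limit and assumes each AM requests exactly its driver memory, as the diagnostic above shows for 9g. max_driver_gb is a helper name made up for this example.

```shell
#!/bin/sh
mb_per_gb=1024
cluster_mb=$((43 * 51 * mb_per_gb))   # nodes * yarn.nodemanager.resource.memory-mb
am_request_mb=9216                    # current AM request from the diagnostic
user_am_limit_mb=229376               # current user AM limit from the diagnostic

# Option 1: raise maximum-am-resource-percent to 0.5 (written as 50 to stay in integers).
am_percent=50
am_budget_mb=$((cluster_mb * am_percent / 100))
echo "option 1: AM budget ${am_budget_mb}MB, room for $((am_budget_mb / am_request_mb)) AMs of 9G"

# Option 2: keep the limit and find the largest whole-GB driver that lets N apps run.
max_driver_gb() {
  # $1 = number of applications that must be active at the same time
  echo $((user_am_limit_mb / $1 / mb_per_gb))
}
echo "option 2: 43 apps -> driver <= $(max_driver_gb 43)G, 60 apps -> driver <= $(max_driver_gb 60)G"
```

    Under these assumptions, a value of 0.5 leaves room for well over 60 AMs, while at the current limit the driver would have to shrink to about 5G for 43 applications or 3G for 60. Any memory overhead added on top of the driver heap would lower those figures further.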

    For more, and more authoritative, information on the YARN Capacity Scheduler, see the official documentation: "Hadoop: Capacity Scheduler".

  • Original article: https://www.cnblogs.com/yy3b2007com/p/11273169.html