Restarts in YARN: RM Restart/RM HA/Timeline Server/NM Restart

ResourceManager Restart

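The ResourceManager (RM) handles resource management and application scheduling. It is the central component of YARN and therefore a potential single point of failure. ResourceManager Restart is the feature that lets a YARN cluster keep working across an RM restart and keeps RM downtime invisible to end users.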

ResourceManager Restart feature is divided into two phases:
- ResourceManager Restart Phase 1 (Non-work-preserving RM restart, since Hadoop 2.4.0): Enhance RM to persist application/attempt state and other credentials information in a pluggable state-store. RM will reload this information from the state-store upon restart and re-kick the previously running applications. Users are not required to re-submit the applications.
- ResourceManager Restart Phase 2 (Work-preserving RM restart, since Hadoop 2.6.0): Focus on re-constructing the running state of ResourceManager by combining the container statuses from NodeManagers and container requests from ApplicationMasters upon restart. The key difference from phase 1 is that previously running applications will not be killed after RM restarts, so applications won't lose their work because of an RM outage.
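
As a rough illustration, the two phases map onto a handful of yarn-site.xml properties. The fragment below is a sketch, assuming a ZooKeeper-backed state store; the ZooKeeper host names are placeholders, not values from a real cluster.

```xml
<!-- yarn-site.xml fragment (sketch): RM restart with a ZooKeeper state store -->

<!-- Phase 1: persist application/attempt state and reload it on restart -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <!-- placeholder hosts; replace with the real ZooKeeper quorum -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>

<!-- Phase 2: keep running applications alive across the restart -->
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
```

Leaving the last property at false gives the phase 1 behaviour: state is recovered, but running applications are restarted.
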
ResourceManager High Availability

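Before Hadoop 2.4.0, the ResourceManager was a single point of failure. YARN HA (high availability) uses an Active/Standby architecture: at any moment there is exactly one Active RM and one or more Standby RMs. In effect, the ResourceManager is given backups, so that a Standby RM can take over when the Active RM fails.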

Manual transitions and failover

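Use the yarn rmadmin command, for example yarn rmadmin -getServiceState to check whether an RM is Active or Standby, and yarn rmadmin -transitionToActive or -transitionToStandby to switch its state by hand.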

Automatic failover

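When the Active RM fails or stops responding, a ZooKeeper-based ActiveStandbyElector elects a new Active RM. The elector is embedded in the RM, so there is no need to run a separate ZKFC daemon.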
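
The properties involved in automatic failover can be sketched as follows. This is only an illustrative fragment: both switches are expected to default to true once HA is enabled, and the ZooKeeper addresses are placeholders.

```xml
<!-- yarn-site.xml fragment (sketch): automatic failover via the embedded elector -->
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- use the ActiveStandbyElector embedded in the RM instead of a separate ZKFC -->
  <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
  <value>true</value>
</property>
<property>
  <!-- placeholder hosts; the elector coordinates through this ZooKeeper quorum -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```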

Client, ApplicationMaster and NodeManager on RM failover

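When there are multiple RMs, the yarn-site.xml file on every node must list all of them. Clients, AMs, and NMs try the RMs in round-robin order until they reach the Active RM. If the Active RM fails, they resume the round-robin search until they find the new Active RM.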
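
A minimal sketch of how two RMs might be listed in yarn-site.xml is shown below. The cluster id, the RM ids rm1 and rm2, and the host names are illustrative assumptions; the failover proxy provider shown is the one that implements the round-robin retry behaviour described above.

```xml
<!-- yarn-site.xml fragment (sketch): the same RM list is deployed to every node -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- hypothetical cluster name -->
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
</property>
<property>
  <!-- logical ids for the RMs; the ids themselves are arbitrary labels -->
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <!-- class used by Clients, AMs and NMs to fail over between the listed RMs -->
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
```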

The YARN Timeline Server

YARN uses the Timeline Server to store and retrieve both current and historical information about applications. The Timeline Server has two responsibilities:

Persisting Application Specific Information

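This covers information that is collected and retrieved for a specific application or framework. For example, the MapReduce framework can publish the number of map tasks, the number of reduce tasks, counters, and so on. Such application-specific information can be published through a TimelineClient, either from the Application Master or from the application's containers.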

Persisting Generic Information about Completed Applications

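Generic information is application-level data such as the queue name and user information. The ResourceManager publishes this generic data to the timeline store, and the web UI uses it to display information about completed applications.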
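
For orientation, a basic Timeline Server setup in yarn-site.xml might look like the sketch below. The host name is a placeholder, and the exact set of properties should be checked against the documentation for the Hadoop version in use.

```xml
<!-- yarn-site.xml fragment (sketch): enable the Timeline Server and generic history -->
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- placeholder host running the Timeline Server -->
  <name>yarn.timeline-service.hostname</name>
  <value>timeline.example.com</value>
</property>
<property>
  <!-- let the RM publish generic application data to the timeline store -->
  <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- tell clients to query generic application data from the timeline service -->
  <name>yarn.timeline-service.generic-application-history.enabled</name>
  <value>true</value>
</property>
```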

NodeManager Restart

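The NodeManager Restart mechanism allows a NodeManager (NM) to restart without losing the active containers running on its node. While handling container-management requests, the NM stores the necessary state in a local state-store. When the NM restarts, it first loads the state for each of its subsystems and then lets each subsystem recover from the loaded state.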

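Enabling NM Restart:

(1) Set yarn.nodemanager.recovery.enabled to true in conf/yarn-site.xml. The default is false.

(2) Configure a path to the local file-system directory where the NodeManager can save its run state (yarn.nodemanager.recovery.dir).

(3) Configure a valid RPC address for the NodeManager (yarn.nodemanager.address), using a fixed port rather than an ephemeral one, so that clients can reconnect to the NM after it restarts.

(4) Make sure that any auxiliary services configured on the NodeManager also support recovery.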
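
Steps (1) to (3) can be sketched as the following yarn-site.xml fragment. The recovery directory and the port number are illustrative assumptions, not required values.

```xml
<!-- yarn-site.xml fragment (sketch): NodeManager restart with state recovery -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- hypothetical local directory where the NM saves its run state -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<property>
  <!-- fixed (non-ephemeral) port; 45454 is only an example -->
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
```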

Links:

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/TimelineServer.html

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html

Original article: https://www.cnblogs.com/sodawoods-blogs/p/8715231.html
