References:
http://www.cnblogs.com/Richardzhu/p/3435989.html
http://blog.csdn.net/wuzhilon88/article/details/49506873
Method 1: Use the namespaceID
1. On the namenode, empty the directory specified by dfs.name.dir (here, the name directory) to simulate a failure.
    [hadoop@node1 name]$ ls
    current  image  in_use.lock
    [hadoop@node1 name]$ rm -rf *
2. After the cluster is shut down and restarted, the namenode daemon is no longer running.
    [hadoop@node1 name]$ stop-all.sh
    stopping jobtracker
    192.168.1.152: stopping tasktracker
    192.168.1.153: stopping tasktracker
    stopping namenode
    192.168.1.152: stopping datanode
    192.168.1.153: stopping datanode
    192.168.1.152: stopping secondarynamenode
    [hadoop@node1 name]$ start-all.sh
    starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out
    192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out
    192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out
    192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out
    starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out
    192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out
    192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out
    [hadoop@node1 name]$ jps
    31942 Jps
    31872 JobTracker
The namenode log also reports an error:
    2013-11-14 06:19:59,172 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG: host = node1/192.168.1.151
    STARTUP_MSG: args = []
    STARTUP_MSG: version = 0.20.2
    STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    2013-11-14 06:19:59,395 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
    2013-11-14 06:19:59,400 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: node1.com/192.168.1.151:9000
    2013-11-14 06:19:59,403 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
    2013-11-14 06:19:59,407 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
    2013-11-14 06:19:59,557 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
    2013-11-14 06:19:59,558 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
    2013-11-14 06:19:59,558 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
    2013-11-14 06:19:59,568 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
    2013-11-14 06:19:59,569 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
    2013-11-14 06:19:59,654 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
    java.io.IOException: NameNode is not formatted.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:317)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
    2013-11-14 06:19:59,658 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
    2013-11-14 06:19:59,663 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: NameNode is not formatted.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:317)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

    2013-11-14 06:19:59,664 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.151
    ************************************************************/
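The two symptoms in step 2 (no NameNode entry in the jps listing, and the "NameNode is not formatted" exception in the log) can be checked with a couple of small shell helpers. This is only a sketch: both function names are made up for illustration and are not part of Hadoop.

```shell
# Hypothetical helpers, not part of Hadoop.

# Succeeds if the jps listing on stdin contains a process with the given name.
# jps prints one "<pid> <name>" pair per line.
daemon_listed() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Succeeds if the given namenode log contains the "not formatted" exception.
log_reports_unformatted() {
  grep -q 'java.io.IOException: NameNode is not formatted' "$1"
}
```

For example, `jps | daemon_listed NameNode || echo "NameNode is not running"`. The log to pass to `log_reports_unformatted` would normally be the .log file that sits next to the .out file named in the start-all.sh output; that location is an assumption about this installation.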
3. Listing files in HDFS fails:
    [hadoop@node1 name]$ hadoop dfs -ls /user/hive/warehouse
    13/11/14 06:21:06 INFO ipc.Client: Retrying connect to server: node1/192.168.1.151:9000. Already tried 0 time(s).
    13/11/14 06:21:07 INFO ipc.Client: Retrying connect to server: node1/192.168.1.151:9000. Already tried 1 time(s).
    13/11/14 06:21:08 INFO ipc.Client: Retrying connect to server: node1/192.168.1.151:9000. Already tried 2 time(s).
    13/11/14 06:21:09 INFO ipc.Client: Retrying connect to server: node1/192.168.1.151:9000. Already tried 3 time(s).
4. Shut down the cluster and format the namenode:
    [hadoop@node1 name]$ stop-all.sh
    stopping jobtracker
    192.168.1.152: stopping tasktracker
    192.168.1.153: stopping tasktracker
    no namenode to stop
    192.168.1.152: stopping datanode
    192.168.1.153: stopping datanode
    192.168.1.152: stopping secondarynamenode
    [hadoop@node1 name]$ hadoop namenode -format
    13/11/14 06:21:37 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG: host = node1/192.168.1.151
    STARTUP_MSG: args = [-format]
    STARTUP_MSG: version = 0.20.2
    STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    Re-format filesystem in /app/user/hdfs/name ? (Y or N) Y
    13/11/14 06:21:39 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
    13/11/14 06:21:39 INFO namenode.FSNamesystem: supergroup=supergroup
    13/11/14 06:21:39 INFO namenode.FSNamesystem: isPermissionEnabled=true
    13/11/14 06:21:39 INFO common.Storage: Image file of size 96 saved in 0 seconds.
    13/11/14 06:21:39 INFO common.Storage: Storage directory /app/user/hdfs/name has been successfully formatted.
    13/11/14 06:21:39 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.151
    ************************************************************/
5. From any datanode, get the namespaceID that was in use before the namenode was formatted, then change the namenode's namespaceID to match the datanode's:
    [hadoop@node2 current]$ cat VERSION
    #Thu Nov 14 02:27:10 CST 2013
    namespaceID=2062292356
    storageID=DS-107813142-192.168.1.152-50010-1379339943465
    cTime=0
    storageType=DATA_NODE
    layoutVersion=-18
    [hadoop@node2 current]$ pwd
    /app/user/hdfs/data/current
    ---- Edit the namenode's namespaceID ----
    [hadoop@node1 current]$ cat VERSION
    #Thu Nov 14 06:29:31 CST 2013
    namespaceID=2062292356
    cTime=0
    storageType=NAME_NODE
    layoutVersion=-18
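Step 5 edits the VERSION file by hand. The same change can be sketched as two shell helpers that read the namespaceID from one VERSION file and write it into another. The function names are hypothetical, and the sketch assumes the cluster is stopped while the file is edited.

```shell
# Hypothetical helpers, not part of Hadoop.

# Print the namespaceID recorded in a VERSION file.
read_namespace_id() {
  sed -n 's/^namespaceID=//p' "$1"
}

# Rewrite the namespaceID line of a VERSION file ($1) to the given ID ($2),
# keeping the previous contents in a .bak copy next to it.
set_namespace_id() {
  cp "$1" "$1.bak" || return 1
  sed "s/^namespaceID=.*/namespaceID=$2/" "$1.bak" > "$1"
}
```

With the paths shown above, this would be `read_namespace_id /app/user/hdfs/data/current/VERSION` on node2, followed by `set_namespace_id /app/user/hdfs/name/current/VERSION 2062292356` on node1.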
6. Delete the fsimage file that the format created on the namenode:
    [hadoop@node1 current]$ ll
    total 16
    -rw-rw-r-- 1 hadoop hadoop   4 Nov 14 06:21 edits
    -rw-rw-r-- 1 hadoop hadoop  96 Nov 14 06:21 fsimage
    -rw-rw-r-- 1 hadoop hadoop   8 Nov 14 06:21 fstime
    -rw-rw-r-- 1 hadoop hadoop 101 Nov 14 06:22 VERSION
    [hadoop@node1 current]$ rm fsimage
7. Copy the fsimage from the secondarynamenode into the namenode's current directory:
    [hadoop@node2 current]$ ll
    total 16
    -rw-rw-r-- 1 hadoop hadoop    4 Nov 14 05:38 edits
    -rw-rw-r-- 1 hadoop hadoop 2410 Nov 14 05:38 fsimage
    -rw-rw-r-- 1 hadoop hadoop    8 Nov 14 05:38 fstime
    -rw-rw-r-- 1 hadoop hadoop  101 Nov 14 05:38 VERSION
    [hadoop@node2 current]$ scp fsimage node1:/app/user/hdfs/name/current
    The authenticity of host 'node1 (192.168.1.151)' can't be established.
    RSA key fingerprint is ca:9a:7e:19:ee:a1:35:44:7e:9d:d4:09:5c:fc:c5:0a.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'node1,192.168.1.151' (RSA) to the list of known hosts.
    fsimage                                       100% 2410     2.4KB/s   00:00
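Steps 6 and 7 delete the freshly formatted fsimage and then overwrite it over the network. A slightly more cautious variant, sketched below, moves the old image to a separate backup directory and verifies the copy byte for byte. This is not what the walkthrough does: it is an illustration that assumes the checkpoint image has already been transferred to a scratch path on the namenode host, and the function name and the three-argument layout are made up here.

```shell
# Hypothetical helper, not part of Hadoop.
# Usage: install_fsimage <checkpoint fsimage> <namenode current dir> <backup dir>
install_fsimage() {
  src=$1
  current_dir=$2
  backup_dir=$3
  # Keep the freshly formatted image outside the storage directory
  # instead of deleting it outright.
  if [ -f "$current_dir/fsimage" ]; then
    mv "$current_dir/fsimage" "$backup_dir/fsimage.formatted" || return 1
  fi
  cp "$src" "$current_dir/fsimage" || return 1
  # Byte-for-byte comparison to confirm the copy is intact.
  cmp -s "$src" "$current_dir/fsimage"
}
```

The function returns non-zero if any step fails, so it can gate the restart in step 8.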
8. Restart the cluster:
    [hadoop@node1 current]$ start-all.sh
    starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out
    192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out
    192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out
    192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out
    starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out
    192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out
    192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out
    [hadoop@node1 current]$ jps
    32486 Jps
    32419 JobTracker
    32271 NameNode
9. Verify data integrity:
    [hadoop@node1 current]$ hadoop dfs -ls /user/hive/warehouse
    Found 8 items
    drwxr-xr-x - hadoop supergroup 0 2013-10-17 16:18 /user/hive/warehouse/echo
    drwxr-xr-x - hadoop supergroup 0 2013-10-28 13:48 /user/hive/warehouse/jack
    drwxr-xr-x - hadoop supergroup 0 2013-09-18 15:54 /user/hive/warehouse/table4
    drwxr-xr-x - hadoop supergroup 0 2013-09-18 15:53 /user/hive/warehouse/table5
    drwxr-xr-x - hadoop supergroup 0 2013-09-18 15:48 /user/hive/warehouse/test
    drwxr-xr-x - hadoop supergroup 0 2013-10-25 14:50 /user/hive/warehouse/test1
    drwxr-xr-x - hadoop supergroup 0 2013-10-25 14:52 /user/hive/warehouse/test2
    drwxr-xr-x - hadoop supergroup 0 2013-10-25 14:30 /user/hive/warehouse/test3

    [hadoop@node3 conf]$ hive

    Logging initialized using configuration in jar:file:/app/hive/lib/hive-common-0.11.0.jar!/hive-log4j.properties
    Hive history file=/tmp/hadoop/hive_job_log_hadoop_7451@node3_201311111325_424288589.txt
    hive> show tables;
    OK
    echo
    jack
    table4
    table5
    test
    test1
    test2
    test3
    Time taken: 27.589 seconds, Fetched: 8 row(s)
    hive> select * from table4;
    OK
    NULL NULL NULL
    1 1 5
    2 4 5
    3 4 5
    4 5 6
    5 6 7
    6 1 5
    7 5 6
    8 3 6
    NULL NULL NULL
    Time taken: 2.124 seconds, Fetched: 10 row(s)
The data that was there before the failure has not been lost.
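Listing a directory and reading one Hive table only touches part of the filesystem. As a complementary check, the report from hadoop fsck / covers every file and its blocks. The helper below is a sketch that looks for the verdict line in that report. The function name is hypothetical, and the match assumes the report's closing line reads "The filesystem under path '/' is HEALTHY".

```shell
# Hypothetical helper, not part of Hadoop.
# Succeeds if the fsck report on stdin contains a HEALTHY verdict line.
fsck_is_healthy() {
  grep -q "The filesystem under path '.*' is HEALTHY"
}
```

A possible use is `hadoop fsck / | fsck_is_healthy && echo "HDFS reports healthy"`, run after the namenode has left safe mode.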
Method 2: Use hadoop namenode -importCheckpoint
1. Delete the name directory:
    [hadoop@node1 hdfs]$ rm -rf name
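2. Shut down the cluster, then copy the namesecondary directory from the secondarynamenode to the namenode, next to dfs.name.dir (into /app/user/hdfs/ here). Judging by the log in step 3, fs.checkpoint.dir on the namenode points at /app/user/hdfs/namesecondary, which is where -importCheckpoint reads the checkpoint from: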
    [hadoop@node2 hdfs]$ scp -r namesecondary node1:/app/user/hdfs/
    fsimage                                       100%  157     0.2KB/s   00:00
    fstime                                        100%    8     0.0KB/s   00:00
    fsimage                                       100% 2410     2.4KB/s   00:00
    VERSION                                       100%  101     0.1KB/s   00:00
    edits                                         100%    4     0.0KB/s   00:00
    fstime                                        100%    8     0.0KB/s   00:00
    fsimage                                       100% 2410     2.4KB/s   00:00
    VERSION                                       100%  101     0.1KB/s   00:00
    edits                                         100%    4     0.0KB/s   00:00
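Before starting the namenode with -importCheckpoint, it may be worth confirming the two preconditions this method relies on: the checkpoint directory holds an image to import, and the name directory holds nothing that could be overwritten. The sketch below checks both. The function name is made up for illustration, and the directory arguments stand for whatever dfs.name.dir and fs.checkpoint.dir are set to on the namenode.

```shell
# Hypothetical helper, not part of Hadoop.
# Usage: ready_for_import <dfs.name.dir> <fs.checkpoint.dir>
ready_for_import() {
  name_dir=$1
  checkpoint_dir=$2
  # The checkpoint must contain an image to import.
  [ -f "$checkpoint_dir/current/fsimage" ] || return 1
  # ls -A prints nothing for an empty directory (and, with errors
  # silenced, for a missing one), so this passes in both cases.
  [ -z "$(ls -A "$name_dir" 2>/dev/null)" ]
}
```

With the paths used in this walkthrough, the call would be `ready_for_import /app/user/hdfs/name /app/user/hdfs/namesecondary`.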
3. On the namenode, run hadoop namenode -importCheckpoint:
    [hadoop@node1 hdfs]$ hadoop namenode -importCheckpoint
    13/11/14 07:24:20 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG: host = node1/192.168.1.151
    STARTUP_MSG: args = [-importCheckpoint]
    STARTUP_MSG: version = 0.20.2
    STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    13/11/14 07:24:20 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
    13/11/14 07:24:20 INFO namenode.NameNode: Namenode up at: node1.com/192.168.1.151:9000
    13/11/14 07:24:20 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
    13/11/14 07:24:20 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
    13/11/14 07:24:21 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
    13/11/14 07:24:21 INFO namenode.FSNamesystem: supergroup=supergroup
    13/11/14 07:24:21 INFO namenode.FSNamesystem: isPermissionEnabled=true
    13/11/14 07:24:21 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
    13/11/14 07:24:21 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean
    13/11/14 07:24:21 INFO common.Storage: Storage directory /app/user/hdfs/name is not formatted.
    13/11/14 07:24:21 INFO common.Storage: Formatting ...
    13/11/14 07:24:21 INFO common.Storage: Number of files = 26
    13/11/14 07:24:21 INFO common.Storage: Number of files under construction = 0
    13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 loaded in 0 seconds.
    13/11/14 07:24:21 INFO common.Storage: Edits file /app/user/hdfs/namesecondary/current/edits of size 4 edits # 0 loaded in 0 seconds.
    13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 saved in 0 seconds.
    13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 saved in 0 seconds.
    13/11/14 07:24:21 INFO namenode.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
    13/11/14 07:24:21 INFO namenode.FSNamesystem: Finished loading FSImage in 252 msecs
    13/11/14 07:24:21 INFO hdfs.StateChange: STATE* Safe mode ON. The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
    13/11/14 07:24:21 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
    13/11/14 07:24:21 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50070
    13/11/14 07:24:21 INFO http.HttpServer: listener.getLocalPort() returned 50070 webServer.getConnectors()[0].getLocalPort() returned 50070
    13/11/14 07:24:21 INFO http.HttpServer: Jetty bound to port 50070
    13/11/14 07:24:21 INFO mortbay.log: jetty-6.1.14
    13/11/14 07:24:21 INFO mortbay.log: Started SelectChannelConnector@node1.com:50070
    13/11/14 07:24:21 INFO namenode.NameNode: Web-server up at: node1.com:50070
    13/11/14 07:24:21 INFO ipc.Server: IPC Server Responder: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server listener on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 0 on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 1 on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 2 on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 3 on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 4 on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 5 on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 6 on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 9 on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 7 on 9000: starting
    13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 8 on 9000: starting
    13/11/14 07:37:05 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.151
    ************************************************************/
    [hadoop@node1 current]$ start-all.sh
    starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out
    192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out
    192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out
    192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out
    starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out
    192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out
    192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out
    [hadoop@node1 current]$ jps
    1027 JobTracker
    1121 Jps
    879 NameNode
4. Verify data integrity:
    [hadoop@node3 conf]$ hive

    Logging initialized using configuration in jar:file:/app/hive/lib/hive-common-0.11.0.jar!/hive-log4j.properties
    Hive history file=/tmp/hadoop/hive_job_log_hadoop_8383@node3_201311111443_2018635710.txt
    hive> select * from table4;
    OK
    NULL NULL NULL
    1 1 5
    2 4 5
    3 4 5
    4 5 6
    5 6 7
    6 1 5
    7 5 6
    8 3 6
    NULL NULL NULL
    Time taken: 3.081 seconds, Fetched: 10 row(s)
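Summary:
Note: in the recovered namenode, everything written between the secondarynamenode's most recent checkpoint and the failure is lost, so the value of fs.checkpoint.period is a trade-off that should be weighed carefully in a real deployment. The contents of the secondarynamenode should also be backed up regularly, because the secondarynamenode is itself a single point of failure.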
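The backup advice above could be implemented in many ways. One minimal sketch is a helper that archives the checkpoint directory under a timestamped name, suitable for calling from cron. The function name and the archive naming scheme are made up for illustration, and the sketch does not coordinate with the secondarynamenode, so an archive taken while a checkpoint is being written may be inconsistent.

```shell
# Hypothetical helper, not part of Hadoop.
# Usage: backup_checkpoint <fs.checkpoint.dir> <backup dir>
# Prints the path of the archive it wrote.
backup_checkpoint() {
  checkpoint_dir=$1
  backup_dir=$2
  archive="$backup_dir/namesecondary-$(date +%Y%m%d-%H%M%S).tar.gz"
  mkdir -p "$backup_dir" || return 1
  # -C makes the paths inside the archive relative to the parent of the
  # checkpoint directory, e.g. namesecondary/current/fsimage.
  tar -czf "$archive" -C "$(dirname "$checkpoint_dir")" "$(basename "$checkpoint_dir")" || return 1
  echo "$archive"
}
```

On node2 in this walkthrough the first argument would be /app/user/hdfs/namesecondary. The backup directory is best placed on a different machine or volume, since the point is to survive the loss of the secondarynamenode host.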
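Additional notes: if a new node is used to recover the namenode, keep the following in mind:
1. The new node's Linux environment, directory layout, environment variables and other settings must be identical to those of the original namenode, including every configuration file under the conf directory.
2. The new namenode's hostname should be the same as the original namenode's. If the host is given a different name, the hosts files on all datanodes and on the secondarynamenode must be updated, and the following properties must be reconfigured:
    fs.default.name in core-site.xml
    dfs.http.address in hdfs-site.xml (on the secondarynamenode)
    mapred.job.tracker in mapred-site.xml (if the jobtracker runs on the same machine as the namenode, which is usually the case)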
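After a hostname change it is easy to miss one of these properties on one of the nodes. The helper below is a rough sketch for printing a property's value from a *-site.xml file so the nodes can be compared. The function name is hypothetical, and the parsing is deliberately naive: it assumes each `<name>` and `<value>` element fits on a single line, with the value on the same line as the name or on a later line, and it is not a real XML parser.

```shell
# Hypothetical helper, not part of Hadoop.
# Usage: get_prop <site.xml file> <property name>
get_prop() {
  awk -v n="$2" '
    # Remember that the wanted <name> element has been seen.
    index($0, "<name>" n "</name>") { found = 1 }
    # Print the first <value> at or after that point, tags stripped.
    found && /<value>/ {
      gsub(/.*<value>|<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}
```

For example, `get_prop /app/hadoop/conf/core-site.xml fs.default.name` on each node, where the conf path is an assumption based on the /app/hadoop install prefix seen in the logs above.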