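I found the following approach online: it monitors the ZooKeeper (zk) process and restarts zk if the process is gone.
There is one case it cannot handle: when zk hangs, the process is still alive, but many TCP connections are stuck in CLOSE_WAIT, so clients cannot connect to zk.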
#!/bin/sh
# Restart ZooKeeper when its process has disappeared.
while true; do
    time1=$(date)
    echo "$time1"
    # grep exits non-zero when no zookeeper process is found
    count=`ps -ef | grep zookeeper | grep -v grep`
    if [ "$?" != "0" ]; then
        echo ">>>>zookeeper has shutdown"
        echo ">>>>restart zookeeper now !"
        sh zkServer.sh start
    else
        echo ">>>>zookeeper is running..."
    fi
    sleep 60
done
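The hung state described above can be spotted by counting CLOSE_WAIT sockets on the zk client port. The following is only a rough sketch: the helper name count_close_wait is hypothetical, port 2181 is assumed because it is ZooKeeper's default client port, and the input is assumed to have the column layout that Linux `netstat -ant` prints (local address in column 4, state in column 6).

```shell
#!/bin/sh
# Hypothetical helper: count CLOSE_WAIT sockets on a local port.
# Reads `netstat -ant`-style lines on stdin; assumed columns are
# proto, recv-q, send-q, local-address, foreign-address, state.
count_close_wait() {
    port=${1:-2181}   # assumption: 2181 is the default zk client port
    awk -v port="$port" '
        # keep lines in CLOSE_WAIT whose local address ends in ":<port>"
        $6 == "CLOSE_WAIT" && $4 ~ (":" port "$") { n++ }
        END { print n + 0 }
    '
}

# Usage (not executed here):
#   netstat -ant | count_close_wait 2181
```

A count that keeps growing while the zk process is still listed by ps suggests the hang that the process check above cannot detect.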
When zk hangs, running sh zkServer.sh status returns an error string; when zk is healthy, it returns Mode: leader or Mode: follower.
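The check described above can be sketched as a small helper that looks for the Mode line in the status text. This is an illustration only: the function name zk_mode is hypothetical, and it assumes the mode appears on a line of its own that starts with "Mode: ", as in the two healthy outputs quoted above.

```shell
#!/bin/sh
# Hypothetical helper: extract the mode from the text printed by
# `zkServer.sh status`. Prints the mode (e.g. "leader") and returns 0
# when a "Mode: ..." line is present; returns 1 otherwise.
zk_mode() {
    mode=$(printf '%s\n' "$1" | sed -n 's/^Mode: *//p' | head -n 1)
    [ -n "$mode" ] || return 1
    printf '%s\n' "$mode"
}

# Usage (not executed here):
#   if zk_mode "$(sh zkServer.sh status)" >/dev/null; then
#       echo "healthy"
#   else
#       echo "hung or stopped"
#   fi
```

Matching a whole line rather than the start of the captured text keeps the check working even if the status command prints other lines before the mode.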
The improved monitoring script is as follows:
monitorzk.sh
#!/bin/sh

# Restart ZooKeeper when "zkServer.sh status" does not report a mode.
while true
do
    time1=$(date)
    echo "$time1"
    t=`sh zkServer.sh status`
    # POSIX case instead of [[ ]], which a plain sh may not support
    case "$t" in
        *Mode:*)
            echo ">>>>zookeeper is running..."
            ;;
        *)
            echo ">>>>zookeeper has shutdown"
            echo ">>>>restart zookeeper now !"
            # Force-kill the hung process before starting a new one
            kill -9 $(cat "/usr/local/zookeeper-3.4.6/data/zookeeper_server.pid")
            sh zkServer.sh start
            ;;
    esac
    sleep 60
done
startMonitor.sh
nohup sh monitorzk.sh >> monitor.log 2>&1 &