Fixing an etcd failure in a k8s cluster

    While creating an instance in a k8s cluster, I found that the etcd cluster was reporting connection failures, which caused the instance creation to fail. So I looked into the cause.

    Source of the problem

    Here is the health status of the etcd cluster:

    [root@docker01 ~]# cd /opt/kubernetes/ssl/
    [root@docker01 ssl]# /opt/kubernetes/bin/etcdctl \
    > --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem \
    > --endpoints="https://10.0.0.99:2379,https://10.0.0.100:2379,https://10.0.0.111:2379" \
    > cluster-health
    member 1bd4d12de986e887 is healthy: got healthy result from https://10.0.0.99:2379
    member 45396926a395958b is healthy: got healthy result from https://10.0.0.100:2379
    failed to check the health of member c2c5804bd87e2884 on https://10.0.0.111:2379: Get https://10.0.0.111:2379/health: net/http: TLS handshake timeout
    member c2c5804bd87e2884 is unreachable: [https://10.0.0.111:2379] are all unreachable
    cluster is healthy
    [root@docker01 ssl]# 

    It is clear that etcd node 03 (10.0.0.111) is the one with the problem.

    So I went to node 03 and tried to restart the etcd service:

    [root@docker03 ~]# systemctl restart etcd
    Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
    [root@docker03 ~]# journalctl -xe
    Mar 24 22:24:32 docker03 etcd[1895]: setting maximum number of CPUs to 1, total number of available CPUs is 1
    Mar 24 22:24:32 docker03 etcd[1895]: the server is already initialized as member before, starting as etcd member...
    Mar 24 22:24:32 docker03 etcd[1895]: peerTLS: cert = /opt/kubernetes/ssl/server.pem, key = /opt/kubernetes/ssl/server-key.pem, ca = , trusted-ca = /opt/kubernetes/ssl
    Mar 24 22:24:32 docker03 etcd[1895]: listening for peers on https://10.0.0.111:2380
    Mar 24 22:24:32 docker03 etcd[1895]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
    Mar 24 22:24:32 docker03 etcd[1895]: listening for client requests on 127.0.0.1:2379
    Mar 24 22:24:32 docker03 etcd[1895]: listening for client requests on 10.0.0.111:2379
    Mar 24 22:24:32 docker03 etcd[1895]: member c2c5804bd87e2884 has already been bootstrapped
    Mar 24 22:24:32 docker03 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
    Mar 24 22:24:32 docker03 systemd[1]: Failed to start Etcd Server.
    -- Subject: Unit etcd.service has failed
    -- Defined-By: systemd
    -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
    -- 
    -- Unit etcd.service has failed.
    -- 
    -- The result is failed.
    Mar 24 22:24:32 docker03 systemd[1]: Unit etcd.service entered failed state.
    Mar 24 22:24:32 docker03 systemd[1]: etcd.service failed.
    Mar 24 22:24:33 docker03 systemd[1]: etcd.service holdoff time over, scheduling restart.
    Mar 24 22:24:33 docker03 systemd[1]: start request repeated too quickly for etcd.service
    Mar 24 22:24:33 docker03 systemd[1]: Failed to start Etcd Server.
    -- Subject: Unit etcd.service has failed
    -- Defined-By: systemd
    -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
    -- 
    -- Unit etcd.service has failed.
    -- 
    -- The result is failed.
    Mar 24 22:24:33 docker03 systemd[1]: Unit etcd.service entered failed state.
    Mar 24 22:24:33 docker03 systemd[1]: etcd.service failed.

    The service did not come back up. The key message in the log is: member c2c5804bd87e2884 has already been bootstrapped

    The explanation I found in the references is:
    One of the member was bootstrapped via discovery service. You must remove the previous data-dir to clean up the member information. Or the member will ignore the new configuration and start with the old configuration. That is why you see the mismatch.
    In other words, a member that has already been bootstrapped keeps using the configuration recorded on disk; unless the old data-dir is removed, any new bootstrap configuration passed at startup is ignored, hence the mismatch.
    So the cause is clear: etcd fails to start because the member information recorded in the data-dir (/var/lib/etcd/default.etcd) does not match the information implied by the options etcd is being started with.
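
    A quick way to see that the member data is indeed already on disk is to look inside the data-dir itself (the path is the default.etcd directory mentioned above; the member/snap and member/wal layout is the usual etcd on-disk layout and may vary slightly between versions):

    [root@docker03 ~]# ls /var/lib/etcd/default.etcd/member/
    snap  wal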

    Solving the problem

    The first option is to fix the error by changing the startup parameters. Since the data-dir already records the member information, there is no need to pass the bootstrap configuration again at startup. Concretely, change the --initial-cluster-state parameter (in the unit file below it has already been changed from new to existing):

    [root@docker03 ~]# cat /usr/lib/systemd/system/etcd.service
    [Unit]
    Description=Etcd Server
    After=network.target
    After=network-online.target
    Wants=network-online.target
    
    [Service]
    Type=notify
    EnvironmentFile=-/opt/kubernetes/cfg/etcd
    ExecStart=/opt/kubernetes/bin/etcd \
    --name=${ETCD_NAME} \
    --data-dir=${ETCD_DATA_DIR} \
    --listen-peer-urls=${ETCD_LISTEN_PEER_URLS} \
    --listen-client-urls=${ETCD_LISTEN_CLIENT_URLS},http://127.0.0.1:2379 \
    --advertise-client-urls=${ETCD_ADVERTISE_CLIENT_URLS} \
    --initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
    --initial-cluster=${ETCD_INITIAL_CLUSTER} \
    --initial-cluster-token=${ETCD_INITIAL_CLUSTER} \
    --initial-cluster-state=existing \
    --cert-file=/opt/kubernetes/ssl/server.pem \
    --key-file=/opt/kubernetes/ssl/server-key.pem \
    --peer-cert-file=/opt/kubernetes/ssl/server.pem \
    --peer-key-file=/opt/kubernetes/ssl/server-key.pem \
    --trusted-ca-file=/opt/kubernetes/ssl/ca.pem \
    --peer-trusted-ca-file=/opt/kubernetes/ssl/ca.pem
    Restart=on-failure
    LimitNOFILE=65536
    
    [Install]
    WantedBy=multi-user.target
    

    Change --initial-cluster-state=new to --initial-cluster-state=existing, restart the service, and etcd comes up normally.
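
    After editing the unit file, reload systemd and restart etcd on node 03, then re-run the cluster-health check from the beginning of this post to confirm that all three members report healthy:

    [root@docker03 ~]# systemctl daemon-reload
    [root@docker03 ~]# systemctl restart etcd
    [root@docker03 ~]# systemctl status etcd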

    The second option is to delete the data-dir on all etcd nodes (it can also work without deleting it), then restart the etcd service on each node; every node's data-dir is rebuilt and the failure above no longer occurs. A sketch of this approach follows below.
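
    A minimal sketch of the wipe-and-restart approach, run on each etcd node in turn (the data-dir path is the default.etcd directory mentioned above; moving it aside rather than deleting it keeps a backup). Keep in mind that rebuilding every data-dir this way recreates the cluster state, so anything stored in etcd is lost unless it is restored afterwards:

    [root@docker03 ~]# systemctl stop etcd
    [root@docker03 ~]# mv /var/lib/etcd/default.etcd /var/lib/etcd/default.etcd.bak
    [root@docker03 ~]# systemctl restart etcd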

    The third option is to copy the contents of the data-dir from another node, use that data to forcibly bring up a single-member cluster with --force-new-cluster, and then restore the full cluster by adding the other nodes back as new members. A rough sketch is shown below.
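
    Roughly, that recovery could look like the following. The member name etcd02 and the copy direction are assumptions for illustration (they are not taken from this cluster's actual configuration), and --force-new-cluster must be removed from the startup options again once the cluster has been rebuilt:

    # copy a healthy node's data-dir to the node that will seed the new cluster
    [root@docker01 ~]# scp -r /var/lib/etcd/default.etcd root@10.0.0.111:/var/lib/etcd/
    # on that node, temporarily add --force-new-cluster to the etcd options in the unit file, then
    [root@docker03 ~]# systemctl daemon-reload && systemctl restart etcd
    # re-add the remaining nodes one at a time as new members
    [root@docker03 ~]# /opt/kubernetes/bin/etcdctl \
    > --ca-file=/opt/kubernetes/ssl/ca.pem --cert-file=/opt/kubernetes/ssl/server.pem --key-file=/opt/kubernetes/ssl/server-key.pem \
    > --endpoints="https://10.0.0.111:2379" \
    > member add etcd02 https://10.0.0.100:2380

    Each node added this way then needs its old data-dir cleared and etcd started with --initial-cluster-state=existing before it can join the rebuilt cluster.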

    These are the fixes I know of for this problem so far.
