  • Handling an etcd node failure

    Problem: a routine inspection found the k8s cluster's etcd cluster in a bad state, with one node reported as unhealthy:

    [root@k8s-master1 ~]# kubectl get cs
    NAME                 STATUS      MESSAGE                                  ERROR
    controller-manager   Healthy     ok                                       
    scheduler            Healthy     ok                                       
    etcd-1               Healthy     {"health":"true"}                        
    etcd-0               Healthy     {"health":"true"}                        
    etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
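    The componentstatus output only reports a 503 for one endpoint. One way to confirm which member is failing is to probe each endpoint separately with etcdctl's endpoint health subcommand. The following is a minimal sketch, not part of the original procedure: the function name check_endpoints is hypothetical, and the binary and certificate paths are assumed to be the ones used later in this post.

```shell
# Paths are the ones used in this post; override them for other layouts.
ETCDCTL="${ETCDCTL:-/opt/etcd/bin/etcdctl}"
SSL_DIR="${SSL_DIR:-/opt/etcd/ssl}"

# Probe each endpoint on its own so one bad member does not hide the others.
# Prints etcdctl's output, plus an "unhealthy: <endpoint>" line on failure.
check_endpoints() {
  local ep
  for ep in "$@"; do
    "$ETCDCTL" --cacert="$SSL_DIR/ca.pem" --cert="$SSL_DIR/server.pem" \
      --key="$SSL_DIR/server-key.pem" --endpoints="$ep" endpoint health \
      || echo "unhealthy: $ep"
  done
}
```

    It would be called with one argument per endpoint, for example: check_endpoints https://172.16.23.120:2379 https://172.16.23.121:2379 https://172.16.23.122:2379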

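    The etcd logs showed very few errors, the system time and certificates were fine, and no firewall was in the way, so I went ahead with the steps below.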

    1. Remove the faulty etcd node from the cluster:

    [root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member list
    20fd79755169a89, started, etcd-3, https://172.16.23.122:2380, https://172.16.23.122:2379, false
    39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
    506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false

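    From the output above, the endpoint that kubectl reports as etcd-2 is the member named etcd-3, i.e. the machine at 172.16.23.122. (kubectl numbers the etcd endpoints starting from 0, so its etcd-2 is not the member whose name is etcd-2.)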
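    Picking the member ID out of the list by eye is error-prone, and removing the wrong member would be costly. As a small sketch, the ID can be looked up by the node's IP instead. The function name member_id_for_ip is hypothetical, and the parsing assumes etcdctl's default comma-separated member list output as shown above.

```shell
# Print the ID of the member whose peer URL (4th column) contains the given IP.
# Reads `etcdctl member list` output in its default format on stdin.
member_id_for_ip() {
  awk -F', ' -v ip="$1" 'index($4, "//" ip ":") { print $1 }'
}
```

    Piping the member list output above into member_id_for_ip 172.16.23.122 prints 20fd79755169a89, the ID used in the remove command that follows.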

    [root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member remove 20fd79755169a89
    Member  20fd79755169a89 removed from cluster ad1f122f981ee2bf
    [root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member list
    39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
    506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false
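    The same TLS and endpoint flags are repeated on every etcdctl call in this post. etcdctl also accepts its flags as ETCDCTL_-prefixed environment variables, so a session could set them once. This is a convenience sketch using the paths and addresses from this post, not something the original procedure did.

```shell
# Set once per shell session; etcdctl then needs no TLS/endpoint flags.
export ETCDCTL_API=3   # only needed for etcdctl releases older than 3.4
export ETCDCTL_CACERT=/opt/etcd/ssl/ca.pem
export ETCDCTL_CERT=/opt/etcd/ssl/server.pem
export ETCDCTL_KEY=/opt/etcd/ssl/server-key.pem
export ETCDCTL_ENDPOINTS="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379"
```

    With these set, the listing above reduces to /opt/etcd/bin/etcdctl member list. Do not pass the same option both as a flag and as a variable in one call.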

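    2. Step 1 removed the faulty member (etcd-2 in kubectl's numbering) from the cluster. Now work on that node itself, etcd-3: with the etcd service stopped on it (step 4 starts it again), delete its etcd data, then change the cluster state in the etcd config file from new to existing.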

    # rm -rf /var/lib/etcd/default.etcd/member/
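    A more cautious variant of the command above moves the member directory aside rather than deleting it, so the old data can still be inspected later. This swaps rm -rf for a rename and is only a sketch: the function name archive_member_dir and the timestamped .bak suffix are hypothetical choices, not something the original post used.

```shell
# Move an etcd member directory aside instead of deleting it.
# Prints the backup path; fails if the directory does not exist.
archive_member_dir() {
  local dir="${1%/}" backup
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
  mv "$dir" "$backup" && echo "$backup"
}
```

    On the faulty node this would be run as archive_member_dir /var/lib/etcd/default.etcd/member/ while etcd is stopped.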

    Edit the etcd config file and change new to existing:

    Before:

    ETCD_INITIAL_CLUSTER_STATE="new"

    After:

    ETCD_INITIAL_CLUSTER_STATE="existing"
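    The edit can also be scripted. The sketch below uses sed to rewrite the line in place and keeps a .bak copy of the original file. The function name set_cluster_state_existing is hypothetical, and the config file path is passed as an argument because the original post does not say where the file lives.

```shell
# Set ETCD_INITIAL_CLUSTER_STATE to "existing" in the given config file.
# The original file is kept next to it with a .bak suffix.
set_cluster_state_existing() {
  sed -i.bak 's/^ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE="existing"/' "$1"
}
```

    Only the ETCD_INITIAL_CLUSTER_STATE line is touched; every other setting in the file is left as it was.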

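    3. Then add the etcd-3 node back to the cluster. Note that the command passes etcd-2 as the member name even though this node was listed as etcd-3 before; as the member list that follows shows, the name column stays empty until the node has started.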

    [root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member add etcd-2 --peer-urls=https://172.16.23.122:2380
    Member a98137c10970d43c added to cluster ad1f122f981ee2bf

    Then check the member list:

    [root@k8s-master1 ~]# /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" member list
    39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
    506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false
    a98137c10970d43c, unstarted, , https://172.16.23.122:2380, , false
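    The new member stays in the unstarted state until etcd is started on that node. As a small sketch, this condition can be checked from a script instead of by eye. The function name has_unstarted_member is hypothetical, and the parsing again assumes etcdctl's default member list format.

```shell
# Exit 0 if any member in `etcdctl member list` output on stdin is still
# "unstarted" (2nd column), exit 1 otherwise.
has_unstarted_member() {
  awk -F', ' '$2 == "unstarted" { found = 1 } END { exit !found }'
}
```

    Fed the listing above it succeeds, because a98137c10970d43c has not started yet; once every member reports started it returns a non-zero status.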

    4. Start etcd again on the repaired node:

    [root@k8s-master3 ~]# systemctl start etcd
    [root@k8s-master3 ~]# systemctl status etcd
    ● etcd.service - Etcd Server
       Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
       Active: active (running) since 日 2021-02-28 22:04:34 CST; 4s ago

    Finally, check etcd as seen from the k8s cluster:

    [root@k8s-master1 ~]# kubectl get cs
    NAME                 STATUS    MESSAGE             ERROR
    scheduler            Healthy   ok                  
    controller-manager   Healthy   ok                  
    etcd-2               Healthy   {"health":"true"}   
    etcd-0               Healthy   {"health":"true"}   
    etcd-1               Healthy   {"health":"true"}
  • Original article: https://www.cnblogs.com/jsonhc/p/14460885.html