  • Deploying a Disaster-Recovery Environment for a K8S Cluster

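    etcd is a critical service in a Kubernetes cluster: it stores all of the cluster's state, such as Namespaces, Pods, Services, and routing information. If the etcd cluster suffers a disaster or loses its data, the Kubernetes cluster data cannot be recovered without a backup. Backing up etcd is therefore essential to a Kubernetes disaster-recovery setup.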
     
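    I. Backing up the etcd cluster
    The etcdctl commands differ slightly between etcd versions but are broadly similar. The backup here takes a snapshot with snapshot save.
    Points to note:
    • The backup only needs to be run on one node of the etcd cluster.
    • The etcd v3 API is used here, because Kubernetes dropped support for etcd v2 in 1.13, so all cluster data is stored through the v3 API. The snapshot therefore contains only data written through v3; data written through v2 is not backed up.
    • This example uses a binary-deployed k8s v1.18.6 + Calico environment (in the commands below, "ETCDCTL_API=3 etcdctl" is equivalent to plain "etcdctl", since etcdctl 3.4 and later default to the v3 API).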
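Whether the ETCDCTL_API=3 prefix can be dropped depends on the etcdctl version: 3.4 and later default to the v3 API, while earlier 3.x releases default to v2. The sketch below shows one way to check this from a version string. The helper name defaults_to_v3 is an assumption made for illustration and is not part of etcdctl.

```shell
#!/usr/bin/env bash
# defaults_to_v3 VERSION: succeed if an etcdctl of this version (e.g. 3.4.9)
# uses the v3 API by default, i.e. the version is 3.4 or newer.
# Hypothetical helper; it only compares the major and minor numbers.
defaults_to_v3() {
    local major minor
    major=${1%%.*}          # text before the first dot
    minor=${1#*.}           # text after the first dot ...
    minor=${minor%%.*}      # ... cut at the next dot
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 4 ]; }
}

# Usage (assumes the version line looks like "etcdctl version: 3.4.9"):
# ver=$(ETCDCTL_API=3 etcdctl version | awk '/etcdctl version/ {print $3}')
# defaults_to_v3 "$ver" || export ETCDCTL_API=3
```

When in doubt, keeping the explicit ETCDCTL_API=3 prefix is harmless on every 3.x version.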
     
    1) Before backing up, take a look at the etcd data
    etcd data directory
    [root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_DATA_DIR="
    export ETCD_DATA_DIR="/data/k8s/etcd/data"
         
    etcd WAL directory
    [root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_WAL_DIR="
    export ETCD_WAL_DIR="/data/k8s/etcd/wal"
    
    [root@k8s-master01 ~]# ls /data/k8s/etcd/data/
    member
    [root@k8s-master01 ~]# ls /data/k8s/etcd/data/member/
    snap
    [root@k8s-master01 ~]# ls /data/k8s/etcd/wal/
    0000000000000000-0000000000000000.wal  0.tmp
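Later steps refer to these two paths repeatedly, so reading them from the same file avoids typing them by hand. A minimal sketch, assuming environment.sh contains plain export NAME="value" lines as shown above; the helper name read_env_var is hypothetical.

```shell
#!/usr/bin/env bash
# read_env_var FILE NAME: print the value of an `export NAME="value"` line.
# Hypothetical helper. It parses the file rather than sourcing it, so nothing
# else in the file is executed. If NAME appears twice, the last one wins.
read_env_var() {
    local file="$1" name="$2"
    sed -n "s/^export ${name}=\"\(.*\)\"\$/\1/p" "$file" | tail -n 1
}

# Usage (file path as in this article):
# ETCD_DATA_DIR=$(read_env_var /opt/k8s/bin/environment.sh ETCD_DATA_DIR)
# ETCD_WAL_DIR=$(read_env_var /opt/k8s/bin/environment.sh ETCD_WAL_DIR)
# ls "$ETCD_DATA_DIR/member" "$ETCD_WAL_DIR"
```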
    

      

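    2) Back up the etcd cluster data
    Run the backup on one node of the etcd cluster, then copy the backup file to the other nodes.
     
    First create the backup directory on every etcd node: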
    # mkdir -p /data/etcd_backup_dir
    

    Run the backup on one of the etcd nodes (k8s-master01 here):

    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints=https://172.16.60.231:2379 snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db
    

    Copy the backup file to the other etcd nodes:
    [root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master02:/data/etcd_backup_dir/
    [root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master03:/data/etcd_backup_dir/
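A backup is only useful if the file is intact, so it is worth checking the snapshot and its copies. The sketch below assumes GNU coreutils (sha256sum) on all nodes; file_sha256 and snapshot_ok are hypothetical helper names. etcdctl's own snapshot status subcommand, shown in the usage comment, prints the hash, revision, key count, and size of a snapshot file.

```shell
#!/usr/bin/env bash
# file_sha256 FILE: print only the SHA-256 digest of FILE.
file_sha256() {
    sha256sum "$1" | awk '{print $1}'
}

# snapshot_ok FILE: succeed only if FILE exists and is not empty.
snapshot_ok() {
    [ -s "$1" ]
}

# Usage on k8s-master01 (file name and ssh port as in this article):
# snap=/data/etcd_backup_dir/etcd-snapshot-20200820.db
# snapshot_ok "$snap" && ETCDCTL_API=3 etcdctl snapshot status "$snap" -w table
# local_sum=$(file_sha256 "$snap")
# remote_sum=$(ssh -p22 root@k8s-master02 sha256sum "$snap" | awk '{print $1}')
# [ "$local_sum" = "$remote_sum" ] && echo "copy on k8s-master02 matches"
```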
    

      

    The backup command run on k8s-master01 above can be put into a script and scheduled with crontab:

    [root@k8s-master01 ~]# cat /data/etcd_backup_dir/etcd_backup.sh
    #!/usr/bin/bash
    
    date;
    CACERT="/etc/kubernetes/cert/ca.pem"
    CERT="/etc/etcd/cert/etcd.pem"
    KEY="/etc/etcd/cert/etcd-key.pem"
    ENDPOINTS="https://172.16.60.231:2379"
    
    # Take the snapshot; each line except the last needs a trailing backslash
    ETCDCTL_API=3 /opt/k8s/bin/etcdctl \
    --cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
    --endpoints="${ENDPOINTS}" \
    snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db
    
    # Keep backups for 30 days
    find /data/etcd_backup_dir/ -name "*.db" -mtime +30 -exec rm -f {} \;
    
    # Sync to the other two etcd nodes
    /bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master02:/data/etcd_backup_dir/
    /bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master03:/data/etcd_backup_dir/
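The retention step is the destructive part of the script, so it helps to isolate it in a function that can be tried on a scratch directory first. A minimal sketch, assuming GNU find; prune_snapshots is a hypothetical helper name and is not taken from the original script.

```shell
#!/usr/bin/env bash
# prune_snapshots DIR DAYS: delete *.db files directly inside DIR whose last
# modification is more than DAYS days old, printing each path as it is removed.
# Only regular files are touched, and subdirectories are not descended into.
prune_snapshots() {
    local dir="$1" days="$2"
    find "$dir" -maxdepth 1 -type f -name '*.db' -mtime +"$days" -print -delete
}

# Usage, equivalent to the find line in the script above:
# prune_snapshots /data/etcd_backup_dir 30
```

Because the rsync commands in the script use --delete, any file pruned on k8s-master01 is also removed from the other two nodes on the next sync.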
    

      

    Set up a crontab job that runs the backup every day at 5:00 a.m.:
    [root@k8s-master01 ~]# chmod 755 /data/etcd_backup_dir/etcd_backup.sh
    [root@k8s-master01 ~]# crontab -l
    # etcd cluster data backup
    0 5 * * * /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh > /dev/null 2>&1
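Since this cron entry discards all output, a failed backup would go unnoticed. One possible safeguard is a separate check that a fresh, non-empty snapshot exists. The sketch below assumes GNU find; has_recent_snapshot is a hypothetical helper, and the 26-hour window in the usage line is an arbitrary choice that leaves some slack around the daily schedule.

```shell
#!/usr/bin/env bash
# has_recent_snapshot DIR HOURS: succeed if DIR contains at least one
# non-empty *.db file modified within the last HOURS hours.
has_recent_snapshot() {
    local dir="$1" hours="$2" found
    found=$(find "$dir" -maxdepth 1 -type f -name '*.db' -size +0 \
                 -mmin -"$((hours * 60))" -print -quit)
    [ -n "$found" ]
}

# Usage, for example from a monitoring job:
# has_recent_snapshot /data/etcd_backup_dir 26 || echo "etcd backup is stale" >&2
```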
    

      

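    II. Restoring the etcd cluster
    The backup only needs to be taken on one etcd node, with the file then copied to the other nodes.
    The restore, however, must be performed on every etcd node!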

    1) Simulate loss of the etcd cluster data
    Delete the data on all three etcd nodes (or delete the data directory itself):

    # rm -rf /data/k8s/etcd/data/*
    

    Check the k8s cluster status:

    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS      MESSAGE                                                                                           ERROR
    etcd-2               Unhealthy   Get https://172.16.60.233:2379/health: dial tcp 172.16.60.233:2379: connect: connection refused
    etcd-1               Unhealthy   Get https://172.16.60.232:2379/health: dial tcp 172.16.60.232:2379: connect: connection refused
    etcd-0               Unhealthy   Get https://172.16.60.231:2379/health: dial tcp 172.16.60.231:2379: connect: connection refused
    scheduler            Healthy     ok
    controller-manager   Healthy     ok
    

      

    Because the etcd service is still running on all three nodes, the cluster status returns to healthy after a short while:
    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS    MESSAGE             ERROR
    controller-manager   Healthy   ok
    scheduler            Healthy   ok
    etcd-0               Healthy   {"health":"true"}
    etcd-2               Healthy   {"health":"true"}
    etcd-1               Healthy   {"health":"true"}
    
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
    https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 9.918673ms
    https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 10.985279ms
    https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 13.422545ms
    
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem member list --write-out=table
    +------------------+---------+------------+----------------------------+----------------------------+------------+
    |        ID        | STATUS  |    NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
    +------------------+---------+------------+----------------------------+----------------------------+------------+
    | 1d1d7edbba38c293 | started | k8s-etcd03 | https://172.16.60.233:2380 | https://172.16.60.233:2379 |      false |
    | 4c0cfad24e92e45f | started | k8s-etcd02 | https://172.16.60.232:2380 | https://172.16.60.232:2379 |      false |
    | 79cf4f0a8c3da54b | started | k8s-etcd01 | https://172.16.60.231:2380 | https://172.16.60.231:2379 |      false |
    +------------------+---------+------------+----------------------------+----------------------------+------------+
    

      

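    Note that the member list output has no leader column: IS LEARNER = false only means that each member is a full voting member, so this table does not show whether a leader has been elected. In this walkthrough the etcd service was restarted on all three nodes at this point: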
    # systemctl restart etcd
    

      

    After the restart, endpoint status shows that the etcd cluster has an elected leader and is healthy:
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem   --cert=/etc/etcd/cert/etcd.pem   --key=/etc/etcd/cert/etcd-key.pem   --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  1.6 MB |      true |      false |         5 |      24658 |              24658 |        |
    | https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  1.6 MB |     false |      false |         5 |      24658 |              24658 |        |
    | https://172.16.60.233:2379 | 1d1d7edbba38c293 |   3.4.9 |  1.7 MB |     false |      false |         5 |      24658 |              24658 |        |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
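The IS LEADER column of endpoint status is what shows whether a leader exists. As a sketch, the table can also be checked mechanically. count_leaders is a hypothetical helper that assumes the column layout shown above, with IS LEADER as the fifth column.

```shell
#!/usr/bin/env bash
# count_leaders: read `etcdctl endpoint status -w table` output on stdin and
# print how many endpoints report IS LEADER = true.
count_leaders() {
    awk -F'|' '
        { gsub(/ /, "", $6) }    # splitting on | puts the fifth column in field 6
        $6 == "true" { n++ }
        END { print n + 0 }      # print 0 rather than an empty line
    '
}

# Usage (connection flags as in the command above):
# ETCDCTL_API=3 etcdctl -w table <connection flags> endpoint status | count_leaders
```

A healthy cluster should yield exactly 1 when all endpoints are listed in --endpoints.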
    

      

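    However, the k8s cluster data is in fact gone: the Pods and other resources in every namespace have disappeared. The data now has to be restored from the etcd backup, that is, from the snapshot file taken above.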
    [root@k8s-master01 ~]# kubectl get ns
    NAME              STATUS   AGE
    default           Active   9m47s
    kube-node-lease   Active   9m39s
    kube-public       Active   9m39s
    kube-system       Active   9m47s
    [root@k8s-master01 ~]# kubectl get pods -n kube-system
    No resources found in kube-system namespace.
    [root@k8s-master01 ~]# kubectl get pods --all-namespaces
    No resources found
    

      

    2) Restore the etcd cluster data, and with it the Kubernetes cluster data
    Before restoring, stop the kube-apiserver service on every master node, then the etcd service on every etcd node:
    # systemctl stop kube-apiserver
    # systemctl stop etcd
    

      

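    Important: before restoring, the old data and WAL working directories (here /data/k8s/etcd/data and /data/k8s/etcd/wal) must be removed on every etcd node; otherwise the restore may fail with an error that the data directory already exists.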
    # rm -rf /data/k8s/etcd/data && rm -rf /data/k8s/etcd/wal
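A more cautious variant, not used in the original walkthrough, is to rename the old directories instead of deleting them, so they can still be inspected if the restore goes wrong. A minimal sketch, assuming enough free disk space for the copies; set_aside is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# set_aside DIR SUFFIX: rename DIR to DIR.SUFFIX if DIR exists.
# Returns success when DIR is already absent, so it is safe to re-run.
set_aside() {
    local dir="$1" suffix="$2"
    if [ -e "$dir" ]; then
        mv -- "$dir" "${dir}.${suffix}"
    fi
}

# Usage on each etcd node, after etcd has been stopped:
# stamp=$(date +%Y%m%d%H%M%S)
# set_aside /data/k8s/etcd/data "bak-$stamp"
# set_aside /data/k8s/etcd/wal  "bak-$stamp"
```

Either way, the target paths must not exist when snapshot restore runs.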
    

      

    Run the restore on every etcd node:
    Node 172.16.60.231
    -------------------------------------------------------
    ETCDCTL_API=3 etcdctl \
    --name=k8s-etcd01 \
    --endpoints="https://172.16.60.231:2379" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls=https://172.16.60.231:2380 \
    --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
    
    
    Node 172.16.60.232
    -------------------------------------------------------
    ETCDCTL_API=3 etcdctl \
    --name=k8s-etcd02 \
    --endpoints="https://172.16.60.232:2379" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls=https://172.16.60.232:2380 \
    --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
    
    
    Node 172.16.60.233
    -------------------------------------------------------
    ETCDCTL_API=3 etcdctl \
    --name=k8s-etcd03 \
    --endpoints="https://172.16.60.233:2379" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls=https://172.16.60.233:2380 \
    --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
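The three commands differ only in the member name and IP address, so generating them from one template reduces the risk of copy-paste mistakes in the addresses. The sketch below assumes the third member's peer address is 172.16.60.233, as listed in the member list output earlier; restore_cmd is a hypothetical helper that only prints the command. The connection flags (--endpoints, --cert, --key, --cacert) are left out here because snapshot restore works on the local snapshot file and does not contact the cluster.

```shell
#!/usr/bin/env bash
# Cluster layout taken from the `member list` output earlier in this article.
INITIAL_CLUSTER="k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380"

# restore_cmd NAME IP SNAPSHOT: print (not run) the restore command for one
# member, as a single line.
restore_cmd() {
    local name="$1" ip="$2" snapshot="$3"
    echo "ETCDCTL_API=3 etcdctl snapshot restore ${snapshot}" \
         "--name=${name}" \
         "--initial-cluster-token=etcd-cluster-0" \
         "--initial-advertise-peer-urls=https://${ip}:2380" \
         "--initial-cluster=${INITIAL_CLUSTER}" \
         "--data-dir=/data/k8s/etcd/data" \
         "--wal-dir=/data/k8s/etcd/wal"
}

# Usage: review the printed command, then run it on the matching node.
# restore_cmd k8s-etcd01 172.16.60.231 /data/etcd_backup_dir/etcd-snapshot-20200820.db
```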
    

      

    Start the etcd service on every etcd node in turn:
    # systemctl start etcd
    # systemctl status etcd
    

      

    Check the etcd cluster status (as shown below, the cluster has elected a leader):
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
    https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 12.837393ms
    https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 13.306671ms
    https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 13.602805ms
    
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem   --cert=/etc/etcd/cert/etcd.pem   --key=/etc/etcd/cert/etcd-key.pem   --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
    | https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  9.0 MB |      true |      false |         2 |         13 |                 13 |        |
    | https://172.16.60.233:2379 | 5f70664d346a6ebd |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    

      

    Then start the kube-apiserver service on every master node in turn:
    # systemctl start kube-apiserver
    # systemctl status kube-apiserver
    

      

    Check the Kubernetes cluster status:
    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS      MESSAGE                                  ERROR
    controller-manager   Healthy     ok
    scheduler            Healthy     ok
    etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
    etcd-1               Unhealthy   HTTP probe failed with statuscode: 503
    etcd-0               Unhealthy   HTTP probe failed with statuscode: 503
    
    etcd has only just been restarted, so it may take a few checks before the status turns healthy:
    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS    MESSAGE             ERROR
    controller-manager   Healthy   ok
    scheduler            Healthy   ok
    etcd-2               Healthy   {"health":"true"}
    etcd-0               Healthy   {"health":"true"}
    etcd-1               Healthy   {"health":"true"}
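Instead of re-running the check by hand, a small retry loop can poll until etcd reports healthy. The sketch below uses a generic helper named retry, which is hypothetical. The usage line polls etcdctl endpoint health rather than kubectl get cs, because endpoint health exits with a non-zero status when an endpoint is unhealthy. The attempt count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Returns 1 if every try failed.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage (connection flags as in the health check earlier in this section):
# retry 30 5 env ETCDCTL_API=3 etcdctl <connection flags> endpoint health
```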
    

      

    Check the Kubernetes resources:
    [root@k8s-master01 ~]# kubectl get ns
    NAME              STATUS   AGE
    default           Active   7d4h
    kevin             Active   5d18h
    kube-node-lease   Active   7d4h
    kube-public       Active   7d4h
    kube-system       Active   7d4h
    
    [root@k8s-master01 ~]# kubectl get pods --all-namespaces
    NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE
    default       dnsutils-ds-22q87                          0/1     ContainerCreating   171        7d3h
    default       dnsutils-ds-bp8tm                          0/1     ContainerCreating   138        5d18h
    default       dnsutils-ds-bzzqg                          0/1     ContainerCreating   138        5d18h
    default       dnsutils-ds-jcvng                          1/1     Running             171        7d3h
    default       dnsutils-ds-xrl2x                          0/1     ContainerCreating   138        5d18h
    default       dnsutils-ds-zjg5l                          1/1     Running             0          7d3h
    default       kevin-t-84cdd49d65-ck47f                   0/1     ContainerCreating   0          2d2h
    default       nginx-ds-98rm2                             1/1     Running             2          7d3h
    default       nginx-ds-bbx68                             1/1     Running             0          7d3h
    default       nginx-ds-kfctv                             0/1     ContainerCreating   1          5d18h
    default       nginx-ds-mdcd9                             0/1     ContainerCreating   1          5d18h
    default       nginx-ds-ngqcm                             1/1     Running             0          7d3h
    default       nginx-ds-tpcxs                             0/1     ContainerCreating   1          5d18h
    kevin         nginx-ingress-controller-797ffb479-vrq6w   0/1     ContainerCreating   0          5d18h
    kevin         test-nginx-7d4f96b486-qd4fl                0/1     ContainerCreating   0          2d1h
    kevin         test-nginx-7d4f96b486-qfddd                0/1     Running             0          2d1h
    kube-system   calico-kube-controllers-578894d4cd-9rp4c   1/1     Running             1          7d3h
    kube-system   calico-node-d7wq8                          0/1     PodInitializing     1          7d3h
    Once the etcd data is restored, the Pods gradually return to the Running state. At this point the whole Kubernetes cluster has been restored from the etcd backup.
     
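    III. Summary
    Backing up a Kubernetes cluster mainly means backing up the etcd cluster. For a restore, the main concern is the order of operations:
    Stop kube-apiserver --> stop etcd --> restore data --> start etcd --> start kube-apiserver
     
    Important:
    • When backing up the etcd cluster, take the snapshot on one etcd node only, then sync it to the other nodes.
    • When restoring, use the snapshot taken from one node to restore every node.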
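The ordering above can be written down as a dry-run plan that is reviewed before anything is executed. This is only a sketch: restore_plan is a hypothetical helper that prints the steps and runs nothing, and the host names in the usage line are the ones used in this article.

```shell
#!/usr/bin/env bash
# restore_plan MASTERS ETCD_NODES: print the restore steps in order.
# Both arguments are space-separated host lists. Nothing is executed.
restore_plan() {
    local masters="$1" etcd_nodes="$2" host
    for host in $masters; do echo "$host: systemctl stop kube-apiserver"; done
    for host in $etcd_nodes; do echo "$host: systemctl stop etcd"; done
    for host in $etcd_nodes; do echo "$host: remove old data and wal dirs, then etcdctl snapshot restore"; done
    for host in $etcd_nodes; do echo "$host: systemctl start etcd"; done
    for host in $masters; do echo "$host: systemctl start kube-apiserver"; done
}

# Usage:
# nodes="k8s-master01 k8s-master02 k8s-master03"
# restore_plan "$nodes" "$nodes"
```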
  • Original article: https://www.cnblogs.com/kevingrace/p/14616824.html