K8S Cluster Disaster Recovery Environment Setup

etcd is a critical service in a Kubernetes cluster: it stores all of the cluster's state, such as Namespaces, Pods, Services, and routing information. If the etcd cluster suffers a disaster or loses its data, recovery of the Kubernetes cluster's data is affected. Backing up the etcd data is therefore the key to building a disaster-recovery setup for Kubernetes.
     
I. etcd cluster backup
Different etcd versions ship slightly different etcdctl commands, but they are largely the same; here the backup is taken as a snapshot with "snapshot save".
A few points to note:
• The backup only needs to run on one node of the etcd cluster.
• The etcd v3 API is used here, because starting with Kubernetes 1.13, k8s no longer supports etcd v2, so all of the cluster's data lives in the v3 keyspace. Consequently only data written through the v3 API is backed up; data written through the v2 API is not (a quick way to confirm the API version follows this list).
• This walkthrough uses a binary-deployed k8s v1.18.6 + Calico environment ("ETCDCTL_API=3 etcdctl" in the commands below is equivalent to plain "etcdctl").
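A quick way to confirm that etcdctl is speaking the v3 API (a minimal check; /opt/k8s/bin/etcdctl is the binary path used later in this post, adjust it if yours differs) is to print its version, which should report API version 3.x:

    [root@k8s-master01 ~]# ETCDCTL_API=3 /opt/k8s/bin/etcdctl version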
     
1) Before backing up, take a look at the etcd data
etcd data directory
    [root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_DATA_DIR="
    export ETCD_DATA_DIR="/data/k8s/etcd/data"
         
etcd WAL directory
    [root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh |grep "ETCD_WAL_DIR="
    export ETCD_WAL_DIR="/data/k8s/etcd/wal"
    
    [root@k8s-master01 ~]# ls /data/k8s/etcd/data/
    member
    [root@k8s-master01 ~]# ls /data/k8s/etcd/data/member/
    snap
    [root@k8s-master01 ~]# ls /data/k8s/etcd/wal/
    0000000000000000-0000000000000000.wal  0.tmp
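To confirm that the Kubernetes objects really live in the v3 keyspace, you can list a few keys under the /registry prefix (kube-apiserver's default etcd prefix). A minimal sketch, reusing the same certificates and endpoint as the backup command below:

    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints=https://172.16.60.231:2379 get /registry --prefix --keys-only | head -20

Keys such as /registry/namespaces/default and /registry/pods/... should show up if the cluster data is present.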
    

      

2) Take the etcd cluster data backup
Run the backup on one node of the etcd cluster, then copy the backup file to the other nodes.
     
First create the backup directory on every etcd node
    # mkdir -p /data/etcd_backup_dir
    

Run the backup on one of the etcd nodes (here, k8s-master01):

    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints=https://172.16.60.231:2379 snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db
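Right after taking the snapshot it is worth checking its integrity; in etcd v3.4 this can be done with the "snapshot status" subcommand (a quick sketch, pointed at the file just written). The output shows the snapshot's hash, revision, total keys, and total size:

    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl snapshot status /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db --write-out=table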
    

Copy the backup file to the other etcd nodes
    [root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master02:/data/etcd_backup_dir/
    [root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master03:/data/etcd_backup_dir/
    

      

The backup command above on k8s-master01 can be put into a script and scheduled with crontab:

    [root@k8s-master01 ~]# cat /data/etcd_backup_dir/etcd_backup.sh
    #!/usr/bin/bash
    
    date;
    CACERT="/etc/kubernetes/cert/ca.pem"
    CERT="/etc/etcd/cert/etcd.pem"
    KEY="/etc/etcd/cert/etcd-key.pem"
    ENDPOINTS="172.16.60.231:2379"

    ETCDCTL_API=3 /opt/k8s/bin/etcdctl \
      --cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
      --endpoints=${ENDPOINTS} \
      snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db

    # Keep backups for 30 days
    find /data/etcd_backup_dir/ -name "*.db" -mtime +30 -exec rm -f {} \;

    # Sync the backups to the other two etcd nodes
    /bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master02:/data/etcd_backup_dir/
    /bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master03:/data/etcd_backup_dir/
    

      

Set up a crontab entry to run the backup every day at 05:00:
    [root@k8s-master01 ~]# chmod 755 /data/etcd_backup_dir/etcd_backup.sh
    [root@k8s-master01 ~]# crontab -l
    # etcd cluster data backup
    0 5 * * * /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh > /dev/null 2>&1
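If you would rather keep a record of each run instead of discarding the output, a variant of the cron entry (the log file path here is only an example) could be:

    # etcd cluster data backup, keeping a log of each run
    0 5 * * * /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh >> /data/etcd_backup_dir/etcd_backup.log 2>&1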
    

      

II. etcd cluster restore
The etcd backup only needs to be taken on one etcd node and then copied to the other nodes.
The etcd restore, however, must be performed on every etcd node!

1) Simulate losing the etcd cluster data
Delete the data on all three etcd nodes (or delete the data directory itself)

    # rm -rf /data/k8s/etcd/data/*
    

Check the k8s cluster status:

    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS      MESSAGE                                                                                           ERROR
    etcd-2               Unhealthy   Get https://172.16.60.233:2379/health: dial tcp 172.16.60.233:2379: connect: connection refused
    etcd-1               Unhealthy   Get https://172.16.60.232:2379/health: dial tcp 172.16.60.232:2379: connect: connection refused
    etcd-0               Unhealthy   Get https://172.16.60.231:2379/health: dial tcp 172.16.60.231:2379: connect: connection refused
    scheduler            Healthy     ok
    controller-manager   Healthy     ok
    

      

Because the etcd service is still running on all three nodes, the cluster status comes back as healthy after a short while:
    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS    MESSAGE             ERROR
    controller-manager   Healthy   ok
    scheduler            Healthy   ok
    etcd-0               Healthy   {"health":"true"}
    etcd-2               Healthy   {"health":"true"}
    etcd-1               Healthy   {"health":"true"}
    
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
    https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 9.918673ms
    https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 10.985279ms
    https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 13.422545ms
    
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem member list --write-out=table
    +------------------+---------+------------+----------------------------+----------------------------+------------+
    |        ID        | STATUS  |    NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
    +------------------+---------+------------+----------------------------+----------------------------+------------+
    | 1d1d7edbba38c293 | started | k8s-etcd03 | https://172.16.60.233:2380 | https://172.16.60.233:2379 |      false |
    | 4c0cfad24e92e45f | started | k8s-etcd02 | https://172.16.60.232:2380 | https://172.16.60.232:2379 |      false |
    | 79cf4f0a8c3da54b | started | k8s-etcd01 | https://172.16.60.231:2380 | https://172.16.60.231:2379 |      false |
    +------------------+---------+------------+----------------------------+----------------------------+------------+
    

      

Although the member list above shows all three members as started, the cluster has not properly elected a leader after having its data wiped out from underneath it. Restart the etcd service on all three nodes:
    # systemctl restart etcd
    

      

After the restart, checking again shows that the etcd cluster has elected a leader and the cluster status is healthy:
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem   --cert=/etc/etcd/cert/etcd.pem   --key=/etc/etcd/cert/etcd-key.pem   --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  1.6 MB |      true |      false |         5 |      24658 |              24658 |        |
    | https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  1.6 MB |     false |      false |         5 |      24658 |              24658 |        |
    | https://172.16.60.233:2379 | 1d1d7edbba38c293 |   3.4.9 |  1.7 MB |     false |      false |         5 |      24658 |              24658 |        |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    

      

However, the Kubernetes data is in fact gone: the Pods and other resources in the namespaces no longer exist. It now has to be restored from the etcd backup, i.e. from the snapshot file taken above.
    [root@k8s-master01 ~]# kubectl get ns
    NAME              STATUS   AGE
    default           Active   9m47s
    kube-node-lease   Active   9m39s
    kube-public       Active   9m39s
    kube-system       Active   9m47s
    [root@k8s-master01 ~]# kubectl get pods -n kube-system
    No resources found in kube-system namespace.
    [root@k8s-master01 ~]# kubectl get pods --all-namespaces
    No resources found
    

      

2) Restore the etcd cluster data, i.e. restore the Kubernetes cluster data
Before restoring the etcd data, first stop the kube-apiserver service on all master nodes, then the etcd service on all etcd nodes:
    # systemctl stop kube-apiserver
    # systemctl stop etcd
    

      

Important: before restoring the etcd cluster data, make sure the old data and wal working directories are deleted on all etcd nodes, i.e. the /data/k8s/etcd/data and /data/k8s/etcd/wal directories; otherwise the restore may fail (the restore command complains that the data directory already exists).
    # rm -rf /data/k8s/etcd/data && rm -rf /data/k8s/etcd/wal
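If you would rather keep the old directories around for inspection instead of deleting them outright, an alternative (the .bak naming is arbitrary) is to move them aside:

    # mv /data/k8s/etcd/data /data/k8s/etcd/data.bak.$(date +%Y%m%d%H%M)
    # mv /data/k8s/etcd/wal /data/k8s/etcd/wal.bak.$(date +%Y%m%d%H%M)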
    

      

Run the restore on every etcd node:
Node 172.16.60.231
    -------------------------------------------------------
    ETCDCTL_API=3 etcdctl \
    --name=k8s-etcd01 \
    --endpoints="https://172.16.60.231:2379" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls=https://172.16.60.231:2380 \
    --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
    
    
Node 172.16.60.232
    -------------------------------------------------------
    ETCDCTL_API=3 etcdctl \
    --name=k8s-etcd02 \
    --endpoints="https://172.16.60.232:2379" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls=https://172.16.60.232:2380 \
    --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
    
    
Node 172.16.60.233
    -------------------------------------------------------
    ETCDCTL_API=3 etcdctl \
    --name=k8s-etcd03 \
    --endpoints="https://172.16.60.233:2379" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls=https://172.16.60.233:2380 \
    --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
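Since the three restore commands differ only in the member name and the node's own IP, they can also be kept as one small script that is edited and run on each node. This is a sketch under the same paths and certificates as above; NODE_NAME and NODE_IP are placeholders that must be set per node (e.g. k8s-etcd01 / 172.16.60.231 on the first node):

    #!/usr/bin/bash
    # Restore the shared snapshot into fresh data/wal directories on this node.
    # Run only after kube-apiserver and etcd are stopped and the old directories removed.
    NODE_NAME=k8s-etcd01        # set per node: k8s-etcd01 / k8s-etcd02 / k8s-etcd03
    NODE_IP=172.16.60.231       # set per node: 172.16.60.231 / 172.16.60.232 / 172.16.60.233
    SNAPSHOT=/data/etcd_backup_dir/etcd-snapshot-20200820.db

    ETCDCTL_API=3 etcdctl \
    --name=${NODE_NAME} \
    --endpoints="https://${NODE_IP}:2379" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls=https://${NODE_IP}:2380 \
    --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore ${SNAPSHOT}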
    

      

Start the etcd service on every etcd node, one after another:
    # systemctl start etcd
    # systemctl status etcd
    

      

Check the etcd cluster status (as shown below, a leader has been successfully elected):
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
    https://172.16.60.232:2379 is healthy: successfully committed proposal: took = 12.837393ms
    https://172.16.60.233:2379 is healthy: successfully committed proposal: took = 13.306671ms
    https://172.16.60.231:2379 is healthy: successfully committed proposal: took = 13.602805ms
    
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem   --cert=/etc/etcd/cert/etcd.pem   --key=/etc/etcd/cert/etcd-key.pem   --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" endpoint status
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | https://172.16.60.231:2379 | 79cf4f0a8c3da54b |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
    | https://172.16.60.232:2379 | 4c0cfad24e92e45f |   3.4.9 |  9.0 MB |      true |      false |         2 |         13 |                 13 |        |
    | https://172.16.60.233:2379 | 5f70664d346a6ebd |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    

      

Then start the kube-apiserver service on all master nodes, one after another:
    # systemctl start kube-apiserver
    # systemctl status kube-apiserver
    

      

Check the Kubernetes cluster status:
    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS      MESSAGE                                  ERROR
    controller-manager   Healthy     ok
    scheduler            Healthy     ok
    etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
    etcd-1               Unhealthy   HTTP probe failed with statuscode: 503
    etcd-0               Unhealthy   HTTP probe failed with statuscode: 503
    
Since the etcd service has just been restarted, refresh the status a few more times and it comes back healthy:
    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS    MESSAGE             ERROR
    controller-manager   Healthy   ok
    scheduler            Healthy   ok
    etcd-2               Healthy   {"health":"true"}
    etcd-0               Healthy   {"health":"true"}
    etcd-1               Healthy   {"health":"true"}
    

      

Check the Kubernetes resources:
    [root@k8s-master01 ~]# kubectl get ns
    NAME              STATUS   AGE
    default           Active   7d4h
    kevin             Active   5d18h
    kube-node-lease   Active   7d4h
    kube-public       Active   7d4h
    kube-system       Active   7d4h
    
    [root@k8s-master01 ~]# kubectl get pods --all-namespaces
    NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE
    default       dnsutils-ds-22q87                          0/1     ContainerCreating   171        7d3h
    default       dnsutils-ds-bp8tm                          0/1     ContainerCreating   138        5d18h
    default       dnsutils-ds-bzzqg                          0/1     ContainerCreating   138        5d18h
    default       dnsutils-ds-jcvng                          1/1     Running             171        7d3h
    default       dnsutils-ds-xrl2x                          0/1     ContainerCreating   138        5d18h
    default       dnsutils-ds-zjg5l                          1/1     Running             0          7d3h
    default       kevin-t-84cdd49d65-ck47f                   0/1     ContainerCreating   0          2d2h
    default       nginx-ds-98rm2                             1/1     Running             2          7d3h
    default       nginx-ds-bbx68                             1/1     Running             0          7d3h
    default       nginx-ds-kfctv                             0/1     ContainerCreating   1          5d18h
    default       nginx-ds-mdcd9                             0/1     ContainerCreating   1          5d18h
    default       nginx-ds-ngqcm                             1/1     Running             0          7d3h
    default       nginx-ds-tpcxs                             0/1     ContainerCreating   1          5d18h
    kevin         nginx-ingress-controller-797ffb479-vrq6w   0/1     ContainerCreating   0          5d18h
    kevin         test-nginx-7d4f96b486-qd4fl                0/1     ContainerCreating   0          2d1h
    kevin         test-nginx-7d4f96b486-qfddd                0/1     Running             0          2d1h
    kube-system   calico-kube-controllers-578894d4cd-9rp4c   1/1     Running             1          7d3h
    kube-system   calico-node-d7wq8                          0/1     PodInitializing     1          7d3h

After the etcd cluster data is restored, the Pod containers gradually return to the Running state. At this point the entire Kubernetes cluster has been recovered from the etcd backup.
     
III. Summary
Backing up a Kubernetes cluster is mainly about backing up the etcd cluster. When restoring, what matters most is the overall order:
stop kube-apiserver --> stop etcd --> restore the data --> start etcd --> start kube-apiserver

Important notes:
• When backing up the etcd cluster, back up the data on one node only and then sync it to the other nodes.
• When restoring the etcd data, restoring every node from that one node's backup file is sufficient (the full sequence is recapped in the sketch below).
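Condensed into a single ordered checklist (only the sequence is the point here; each step runs on the nodes indicated):

    # 1. On all master nodes:  systemctl stop kube-apiserver
    # 2. On all etcd nodes:    systemctl stop etcd
    # 3. On all etcd nodes:    rm -rf /data/k8s/etcd/data /data/k8s/etcd/wal
    # 4. On all etcd nodes:    run the per-node "snapshot restore" command shown above
    # 5. On all etcd nodes:    systemctl start etcd
    # 6. On all master nodes:  systemctl start kube-apiserver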