  • "The .NET 5.0 Scapegoat Case", Episode 6 - Back at the Scene: How the Kubernetes Deployment Behaved During the Failure

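    Our blog system is deployed on a Kubernetes cluster that we built ourselves on Alibaba Cloud servers. The failure already shows up while k8s is updating pods during a deployment. During yesterday's release we made a point of watching the process, and in this episode we share what we saw.

    During a deployment, k8s goes through 3 phases of pod updates: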

    1. "xxx new replicas have been updated"
    2. "xxx replicas are pending termination"
    3. "xxx updated replicas are available"

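    In a normal release, the whole deployment usually finishes in about 5-8 minutes (this depends on the livenessProbe and readinessProbe configuration). Below is the console output during a deployment: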

    Waiting for deployment "blog-web" rollout to finish: 4 out of 8 new replicas have been updated...
    Waiting for deployment spec update to be observed...
    Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
    Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
    Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
    Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
    Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
    Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
    ...
    Waiting for deployment "blog-web" rollout to finish: 4 old replicas are pending termination...
    ...
    Waiting for deployment "blog-web" rollout to finish: 14 of 15 updated replicas are available...
    deployment "blog-web" successfully rolled out
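
    To make the three phases easier to spot in a long log, the progress lines can be classified mechanically. The following is a minimal sketch, not part of our tooling: the phase names and the helper functions classify and phase_sequence are made up for illustration, and the sample lines are copied from the output above.

```python
import re

# Patterns for the three progress messages quoted above.
# The phase names on the left are hypothetical labels, not Kubernetes terms.
PHASES = [
    ("updating", re.compile(r"(\d+) out of (\d+) new replicas have been updated")),
    ("terminating", re.compile(r"(\d+) old replicas are pending termination")),
    ("waiting_available", re.compile(r"(\d+) of (\d+) updated replicas are available")),
]


def classify(line):
    """Return (phase, numbers) for a progress line, or None for any other line."""
    for phase, pattern in PHASES:
        match = pattern.search(line)
        if match:
            return phase, tuple(int(n) for n in match.groups())
    return None


def phase_sequence(lines):
    """Collapse consecutive lines of the same phase into an ordered list of phases."""
    phases = []
    for line in lines:
        result = classify(line)
        if result and (not phases or phases[-1] != result[0]):
            phases.append(result[0])
    return phases


# Lines copied from the console output above.
SAMPLE = [
    'Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...',
    'Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...',
    'Waiting for deployment "blog-web" rollout to finish: 4 old replicas are pending termination...',
    'Waiting for deployment "blog-web" rollout to finish: 14 of 15 updated replicas are available...',
    'deployment "blog-web" successfully rolled out',
]

print(phase_sequence(SAMPLE))
```

    Timestamping each line as it arrives and grouping by phase would show which of the three phases accounts for the extra time in a slow rollout.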
    

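    In the failure scenario, the whole deployment takes about 15 minutes to finish. All 3 phases of the pod update are slower than normal, especially the "old replicas are pending termination" phase.

    During the deployment we checked pod status with kubectl get pods -l app=blog-web -o wide. The new pods that had been scheduled were in Running status, meaning their containers were up and, at that moment, passing the livenessProbe. Most of them had not become ready, however, which means their readinessProbe checks were failing, and a RESTARTS count above 0 indicates that the livenessProbe had failed earlier and the container had been restarted.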

    NAME                        READY   STATUS    RESTARTS   AGE     IP                NODE         NOMINATED NODE   READINESS GATES
    blog-web-55d5677cf-2854n    0/1     Running   1          5m1s    192.168.107.213   k8s-node3    <none>           <none>
    blog-web-55d5677cf-7vkqb    0/1     Running   2          6m17s   192.168.228.33    k8s-n9       <none>           <none>
    blog-web-55d5677cf-8gq6n    0/1     Running   2          5m29s   192.168.102.235   k8s-n19      <none>           <none>
    blog-web-55d5677cf-g8dsr    0/1     Running   2          5m54s   192.168.104.78    k8s-node11   <none>           <none>
    blog-web-55d5677cf-kk9mf    0/1     Running   2          6m9s    192.168.42.3      k8s-n13      <none>           <none>
    blog-web-55d5677cf-kqwzc    0/1     Pending   0          4m44s   <none>            <none>       <none>           <none>
    blog-web-55d5677cf-lmbvf    0/1     Running   2          5m54s   192.168.201.123   k8s-n14      <none>           <none>
    blog-web-55d5677cf-ms2tk    0/1     Pending   0          6m9s    <none>            <none>       <none>           <none>
    blog-web-55d5677cf-nkjrd    1/1     Running   2          6m17s   192.168.254.129   k8s-n7       <none>           <none>
    blog-web-55d5677cf-nnjdx    0/1     Pending   0          4m48s   <none>            <none>       <none>           <none>
    blog-web-55d5677cf-pqgpr    0/1     Pending   0          4m33s   <none>            <none>       <none>           <none>
    blog-web-55d5677cf-qrjr5    0/1     Pending   0          2m38s   <none>            <none>       <none>           <none>
    blog-web-55d5677cf-t5wvq    1/1     Running   3          6m17s   192.168.10.100    k8s-n12      <none>           <none>
    blog-web-55d5677cf-w52xc    1/1     Running   3          6m17s   192.168.73.35     k8s-node10   <none>           <none>
    blog-web-55d5677cf-zk559    0/1     Running   1          5m21s   192.168.118.6     k8s-n4       <none>           <none>
    blog-web-5b57b7fcb6-7cbdt   1/1     Running   2          18m     192.168.168.77    k8s-n6       <none>           <none>
    blog-web-5b57b7fcb6-cgfr4   1/1     Running   4          19m     192.168.89.250    k8s-n8       <none>           <none>
    blog-web-5b57b7fcb6-cz278   1/1     Running   3          19m     192.168.218.99    k8s-n18      <none>           <none>
    blog-web-5b57b7fcb6-hvzwp   1/1     Running   3          18m     192.168.195.242   k8s-node5    <none>           <none>
    blog-web-5b57b7fcb6-rhgkq   1/1     Running   1          16m     192.168.86.126    k8s-n20      <none>           <none>
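
    A listing like this is easier to read once it is tallied per ReplicaSet. The sketch below is illustrative only: the function summarize and its three buckets are our own naming, it assumes pod names of the form <deployment>-<replicaset hash>-<suffix>, and the sample is an abridged copy of four rows from the table above.

```python
from collections import defaultdict


def summarize(table_text):
    """Count pods per ReplicaSet hash: ready, running but not ready, pending."""
    summary = defaultdict(lambda: {"ready": 0, "not_ready": 0, "pending": 0})
    rows = table_text.strip().splitlines()[1:]  # skip the header row
    for row in rows:
        name, ready, status = row.split()[:3]
        # Assumed name layout: <deployment>-<replicaset hash>-<pod suffix>.
        rs_hash = name.rsplit("-", 2)[1]
        have, want = (int(n) for n in ready.split("/"))
        if status == "Pending":
            summary[rs_hash]["pending"] += 1
        elif have == want:
            summary[rs_hash]["ready"] += 1
        else:
            summary[rs_hash]["not_ready"] += 1
    return dict(summary)


# Four rows taken from the listing above (trailing columns omitted).
SAMPLE = """
NAME                        READY   STATUS    RESTARTS   AGE
blog-web-55d5677cf-2854n    0/1     Running   1          5m1s
blog-web-55d5677cf-kqwzc    0/1     Pending   0          4m44s
blog-web-55d5677cf-nkjrd    1/1     Running   2          6m17s
blog-web-5b57b7fcb6-7cbdt   1/1     Running   2          18m
"""

for rs_hash, counts in summarize(SAMPLE).items():
    print(rs_hash, counts)
```

    Counting the full listing above by hand gives the same picture: of the 15 new pods (55d5677cf), 3 are ready, 7 are running but not ready, and 5 are still Pending, while all 5 remaining old pods (5b57b7fcb6) are ready.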
    

    In our k8s deployment configuration, livenessProbe and readinessProbe check the same address. The configuration is as follows:

    livenessProbe:
      httpGet:
        path: /
        port: 80
        httpHeaders:
        - name: X-Forwarded-Proto
          value: https
        - name: Host
          value: www.cnblogs.com
      initialDelaySeconds: 30
      periodSeconds: 3
      successThreshold: 1
      failureThreshold: 5
      timeoutSeconds: 5
    readinessProbe:
      httpGet:
        path: /
        port: 80
        httpHeaders:
        - name: X-Forwarded-Proto
          value: https
        - name: Host
          value: www.cnblogs.com
      initialDelaySeconds: 40
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 5
      timeoutSeconds: 5
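
    The numbers in this configuration translate into rough time budgets. The sketch below works them out. It is an approximation under stated assumptions: the Probe class and its methods are invented for illustration, and the estimate ignores probe timeouts, scheduling jitter, and the time a slow response adds to each check.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """The probe fields used above, with the same meaning as in the YAML."""
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int
    timeout_seconds: int

    def earliest_first_check(self):
        """Seconds after container start before the first check can run."""
        return self.initial_delay_seconds

    def failure_window(self):
        """Approximate seconds of consecutive failures before the probe counts as failed.

        Simplification: failure_threshold checks spaced period_seconds apart,
        with timeouts and jitter ignored.
        """
        return self.period_seconds * self.failure_threshold


# Values copied from the configuration above.
liveness = Probe(initial_delay_seconds=30, period_seconds=3,
                 failure_threshold=5, timeout_seconds=5)
readiness = Probe(initial_delay_seconds=40, period_seconds=5,
                  failure_threshold=5, timeout_seconds=5)

print("liveness failure window:", liveness.failure_window(), "s")
print("readiness failure window:", readiness.failure_window(), "s")
print("earliest readiness check:", readiness.earliest_first_check(), "s")
```

    Under these simplifications, about 15 seconds of consecutive failed liveness checks is enough to trigger a restart, and a restarted container cannot be marked ready sooner than 40 seconds after it starts. Because both probes hit the same path, anything that slows down / affects both checks at once.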
    

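    A latent concurrency problem made the livenessProbe and readinessProbe checks fail frequently, so k8s stumbled its way through the pod update. During this process some old pods were still sharing the load, and whenever new pods ran into trouble the rollout paused and waited for the pods being deployed to recover. The impact of the failure was therefore contained at this stage, and to visitors the site worked intermittently.

    This stumbling, painful deployment does finish eventually, and the moment it finishes is the moment the failure breaks out in full. Once the deployment completes, the new pods take over all of the load. Carrying the concurrency problem, they collapse under the weight of concurrent requests: many pods are restarted because the livenessProbe fails, and after restarting they struggle to reach the ready state and share the load because the readinessProbe fails. The few pods that remain are overwhelmed, CrashLoopBackOff appears on one pod after another, and under the unrelenting stream of concurrent requests there are never enough pods to handle the current load, so the failure never recovers.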
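
    The feedback loop described above can be illustrated with a toy model. This is only a sketch with made-up numbers, not a measurement of our system: it assumes the load is split evenly across ready pods, that a pod fails its probes once its share exceeds a fixed capacity, and that failed pods do not come back. The function name and all parameter values are hypothetical.

```python
def ready_pods_after_cascade(total_load, per_pod_capacity, ready_pods):
    """Toy model of the cascade.

    Load is split evenly across ready pods. While each pod's share exceeds
    its capacity, one more pod fails its probes and drops out, which raises
    the share of every pod that is left.
    """
    while ready_pods > 0 and total_load / ready_pods > per_pod_capacity:
        ready_pods -= 1
    return ready_pods


# Hypothetical numbers, chosen only to show the tipping point.
print(ready_pods_after_cascade(total_load=1000, per_pod_capacity=70, ready_pods=15))
print(ready_pods_after_cascade(total_load=1000, per_pod_capacity=70, ready_pods=14))
```

    With 15 ready pods each share is about 66.7, below the assumed capacity of 70, so all 15 stay up. With 14 ready pods each share is about 71.4, one pod drops out, the share of the rest rises further, and the pool drains to 0. In this simplified picture, losing a single pod past the tipping point is enough to take down the rest.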

  • Original article: https://www.cnblogs.com/cmt/p/13999061.html