Flagger on ASM - Progressive Canary Release Based on Mixerless Telemetry, Part 3: Progressive Canary Release


    As a CNCF member, Weave Flagger (https://flagger.app) provides continuous integration and continuous delivery capabilities. Flagger groups progressive delivery into three categories:

    • Canary release: progressively shifts traffic to the canary version (progressive traffic shifting)
    • A/B Testing: routes user requests to the A or B version based on request information (HTTP headers and cookies traffic routing)
    • Blue/Green: traffic switching and mirroring

    This article walks through a progressive canary release with Flagger on ASM.

    Setup Flagger

    1 Deploy Flagger

    Run the following commands to deploy Flagger (see demo_canary.sh for the complete script):

    alias k="kubectl --kubeconfig $USER_CONFIG"
    alias h="helm --kubeconfig $USER_CONFIG"
    
    cp $MESH_CONFIG kubeconfig
    k -n istio-system create secret generic istio-kubeconfig --from-file kubeconfig
    k -n istio-system label secret istio-kubeconfig istio/multiCluster=true
    
    h repo add flagger https://flagger.app
    h repo update
    k apply -f $FLAAGER_SRC/artifacts/flagger/crd.yaml
    h upgrade -i flagger flagger/flagger --namespace=istio-system \
        --set crd.create=false \
        --set meshProvider=istio \
        --set metricsServer=http://prometheus:9090 \
        --set istio.kubeconfig.secretName=istio-kubeconfig \
        --set istio.kubeconfig.key=kubeconfig
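
    After the Helm release is installed, it is worth confirming that the Flagger controller is up before continuing. A minimal check might look like this (a sketch; the deployment name flagger follows the chart defaults used above):

    kubectl --kubeconfig "$USER_CONFIG" -n istio-system rollout status deployment/flagger
    kubectl --kubeconfig "$USER_CONFIG" -n istio-system logs deployment/flagger --tail=20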

    2 Deploy the Gateway

    During a canary release, Flagger asks ASM to update the VirtualService that controls the canary traffic split, and that VirtualService references a Gateway named public-gateway. We therefore create the Gateway configuration file public-gateway.yaml as follows:

    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: public-gateway
      namespace: istio-system
    spec:
      selector:
        istio: ingressgateway
      servers:
        - port:
            number: 80
            name: http
            protocol: HTTP
          hosts:
            - "*"

    Run the following command to deploy the Gateway:

    kubectl --kubeconfig "$MESH_CONFIG" apply -f resources_canary/public-gateway.yaml

    3 Deploy flagger-loadtester

    flagger-loadtester is the application used during the canary analysis to probe the canary pod instances.

    Run the following command to deploy flagger-loadtester:

    kubectl --kubeconfig "$USER_CONFIG" apply -k "https://github.com/fluxcd/flagger//kustomize/tester?ref=main"

    4 Deploy PodInfo and its HPA

    We first use the HPA configuration shipped with the Flagger release (an operations-level HPA). After completing the full flow, we will switch to an application-level HPA.

    Run the following command to deploy PodInfo and its HPA:

    kubectl --kubeconfig "$USER_CONFIG" apply -k "https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main"

    Progressive Canary Release

    1 Deploy the Canary

    The Canary is the core CRD for canary releases with Flagger (see How it works). We first deploy the following Canary configuration file, podinfo-canary.yaml, to complete the full progressive canary flow, and then build on it by introducing application-level metrics to make the canary release application-aware.

    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: podinfo
      namespace: test
    spec:
      # deployment reference
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: podinfo
      # the maximum time in seconds for the canary deployment
      # to make progress before it is rolled back (default 600s)
      progressDeadlineSeconds: 60
      # HPA reference (optional)
      autoscalerRef:
        apiVersion: autoscaling/v2beta2
        kind: HorizontalPodAutoscaler
        name: podinfo
      service:
        # service port number
        port: 9898
        # container port number or name (optional)
        targetPort: 9898
        # Istio gateways (optional)
        gateways:
        - public-gateway.istio-system.svc.cluster.local
        # Istio virtual service host names (optional)
        hosts:
        - '*'
        # Istio traffic policy (optional)
        trafficPolicy:
          tls:
            # use ISTIO_MUTUAL when mTLS is enabled
            mode: DISABLE
        # Istio retry policy (optional)
        retries:
          attempts: 3
          perTryTimeout: 1s
          retryOn: "gateway-error,connect-failure,refused-stream"
      analysis:
        # schedule interval (default 60s)
        interval: 1m
        # max number of failed metric checks before rollback
        threshold: 5
        # max traffic percentage routed to canary
        # percentage (0-100)
        maxWeight: 50
        # canary increment step
        # percentage (0-100)
        stepWeight: 10
        metrics:
        - name: request-success-rate
          # minimum req success rate (non 5xx responses)
          # percentage (0-100)
          thresholdRange:
            min: 99
          interval: 1m
        - name: request-duration
          # maximum req duration P99
          # milliseconds
          thresholdRange:
            max: 500
          interval: 30s
        # testing (optional)
        webhooks:
          - name: acceptance-test
            type: pre-rollout
            url: http://flagger-loadtester.test/
            timeout: 30s
            metadata:
              type: bash
              cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
          - name: load-test
            url: http://flagger-loadtester.test/
            timeout: 5s
            metadata:
              cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"

    Run the following command to deploy the Canary:

    kubectl --kubeconfig "$USER_CONFIG" apply -f resources_canary/podinfo-canary.yaml

    After the Canary is deployed, Flagger copies the Deployment named podinfo into a Deployment named podinfo-primary and scales podinfo-primary up to the minimum number of pods defined by the HPA. It then gradually scales the original podinfo Deployment down to 0. In other words, podinfo becomes the canary-version Deployment and podinfo-primary becomes the production-version Deployment.

    At the same time, three services are created: podinfo, podinfo-primary, and podinfo-canary. The first two point to the podinfo-primary Deployment, and the last one points to the podinfo Deployment.
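
    Once initialization is done, the generated objects can be inspected as follows (a sketch); the Canary should report an Initialized status and podinfo-primary should be running:

    kubectl --kubeconfig "$USER_CONFIG" -n test get deploy,svc
    kubectl --kubeconfig "$USER_CONFIG" -n test get canary podinfo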

    2 Upgrade podinfo

    Run the following command to upgrade the canary Deployment from version 3.1.0 to 3.1.1:

    kubectl --kubeconfig "$USER_CONFIG" -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1

    3 Progressive canary release

    At this point, Flagger starts the progressive canary process described in the first article of this series. In brief, the main steps are:

    1. Gradually scale up the canary pods and validate
    2. Progressively shift traffic and validate
    3. Roll out the production Deployment and validate
    4. Shift 100% of the traffic back to production
    5. Scale the canary pods down to 0

    We can observe the progressive traffic shifting with the following command:

    while true; do kubectl --kubeconfig "$USER_CONFIG" -n test describe canary/podinfo; sleep 10s;done

    The output events look like this:

    Events:
      Type     Reason  Age                From     Message
      ----     ------  ----               ----     -------
      Warning  Synced  39m                flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
      Normal   Synced  38m (x2 over 39m)  flagger  all the metrics providers are available!
      Normal   Synced  38m                flagger  Initialization done! podinfo.test
      Normal   Synced  37m                flagger  New revision detected! Scaling up podinfo.test
      Normal   Synced  36m                flagger  Starting canary analysis for podinfo.test
      Normal   Synced  36m                flagger  Pre-rollout check acceptance-test passed
      Normal   Synced  36m                flagger  Advance podinfo.test canary weight 10
      Normal   Synced  35m                flagger  Advance podinfo.test canary weight 20
      Normal   Synced  34m                flagger  Advance podinfo.test canary weight 30
      Normal   Synced  33m                flagger  Advance podinfo.test canary weight 40
      Normal   Synced  29m (x4 over 32m)  flagger  (combined from similar events): Promotion completed! Scaling down podinfo.test

    The corresponding Kiali view (optional) is shown below:

    (Figure: kiali.png - Kiali view of the canary traffic shifting)

    At this point, we have completed a full progressive canary release. The sections below are extended reading.

    Application-Level Scaling During the Canary Release

    Building on the progressive canary flow above, let us now look at the HPA-related part of the Canary configuration:

      autoscalerRef:
        apiVersion: autoscaling/v2beta2
        kind: HorizontalPodAutoscaler
        name: podinfo

    This HPA named podinfo is the configuration shipped with Flagger; it scales the canary Deployment up when CPU utilization reaches 99%. The complete configuration is as follows:

    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: podinfo
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: podinfo
      minReplicas: 2
      maxReplicas: 4
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              # scale up if usage is above
              # 99% of the requested CPU (100m)
              averageUtilization: 99

    The previous article in this series covered application-level scaling in practice; here we apply it to the canary release process.

    1 An HPA that senses application QPS

    Run the following command to deploy an HPA that reacts to the number of application requests, scaling up when QPS reaches 10 (see advanced_canary.sh for the complete script):

    kubectl --kubeconfig "$USER_CONFIG" apply -f resources_hpa/requests_total_hpa.yaml

    Accordingly, the Canary configuration is updated to:

      autoscalerRef:
        apiVersion: autoscaling/v2beta2
        kind: HorizontalPodAutoscaler
        name: podinfo-total

    2 Upgrade podinfo

    Run the following command to upgrade the canary Deployment from version 3.1.0 to 3.1.1:

    kubectl --kubeconfig "$USER_CONFIG" -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1

    3 Verify the progressive canary release and the HPA

    Use the following command to observe the progressive traffic shifting:

    while true; do k -n test describe canary/podinfo; sleep 10s;done

    During the canary release (after the event Advance podinfo.test canary weight 10 appears), we use the following commands to send requests through the ingress gateway and raise the QPS:

    INGRESS_GATEWAY=$(kubectl --kubeconfig $USER_CONFIG -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    hey -z 20m -c 2 -q 10 http://$INGRESS_GATEWAY

    Use the following command to watch the progress of the canary release:

    watch kubectl --kubeconfig $USER_CONFIG get canaries --all-namespaces

    Use the following command to watch the replica count reported by the HPA:

    watch kubectl --kubeconfig $USER_CONFIG -n test get hpa/podinfo-total

    The result is shown below: during the canary release, at a point where 30% of the traffic has been shifted, the canary Deployment has 4 replicas:

    (Figure: hpa-canary.png - HPA replica count during the canary release)

    Application-Level Metrics During the Canary Release

    Building on the application-level scaling above, let us finally look at the metrics part of the Canary configuration:

      analysis:
        metrics:
        - name: request-success-rate
          # minimum req success rate (non 5xx responses)
          # percentage (0-100)
          thresholdRange:
            min: 99
          interval: 1m
        - name: request-duration
          # maximum req duration P99
          # milliseconds
          thresholdRange:
            max: 500
          interval: 30s
        # testing (optional)

    1 Flagger built-in metrics

    So far, the metrics configuration in the Canary has used Flagger's two built-in metrics: request success rate (request-success-rate) and request duration (request-duration). As shown in the figure below, the definition of these built-in metrics differs across the platforms Flagger supports; for Istio, they are based on the Mixerless Telemetry data introduced in the first article of this series.

    (Figure: image.png - Flagger built-in metric definitions per mesh provider)

    2 Custom metrics

    To show the extra flexibility telemetry data brings to validating the canary environment during a release, we again take istio_requests_total as an example and create a MetricTemplate named not-found-percentage, which measures the percentage of requests returning a 404 status code out of all requests.

    The configuration file metrics-404.yaml is as follows (see advanced_canary.sh for the complete script):

    apiVersion: flagger.app/v1beta1
    kind: MetricTemplate
    metadata:
      name: not-found-percentage
      namespace: istio-system
    spec:
      provider:
        type: prometheus
        address: http://prometheus.istio-system:9090
      query: |
        100 - sum(
            rate(
                istio_requests_total{
                  reporter="destination",
                  destination_workload_namespace="{{ namespace }}",
                  destination_workload="{{ target }}",
                  response_code!="404"
                }[{{ interval }}]
            )
        )
        /
        sum(
            rate(
                istio_requests_total{
                  reporter="destination",
                  destination_workload_namespace="{{ namespace }}",
                  destination_workload="{{ target }}"
                }[{{ interval }}]
            )
        ) * 100

    Run the following command to create the MetricTemplate above:

    k apply -f resources_canary2/metrics-404.yaml
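
    Before wiring the template into the Canary, the query can be sanity-checked against the in-cluster Prometheus (a sketch; it assumes you port-forward Prometheus locally and substitute the namespace, target, and interval placeholders by hand):

    kubectl --kubeconfig "$USER_CONFIG" -n istio-system port-forward svc/prometheus 9090:9090 &
    curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=100 - sum(rate(istio_requests_total{reporter="destination",destination_workload_namespace="test",destination_workload="podinfo",response_code!="404"}[1m])) / sum(rate(istio_requests_total{reporter="destination",destination_workload_namespace="test",destination_workload="podinfo"}[1m])) * 100'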

    Accordingly, the metrics section of the Canary is updated to:

      analysis:
        metrics:
          - name: "404s percentage"
            templateRef:
              name: not-found-percentage
              namespace: istio-system
            thresholdRange:
              max: 5
            interval: 1m
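
    To see this gate in action, one can deliberately generate 404 responses from the canary during the analysis and watch the Canary fail its checks and roll back. A sketch, assuming podinfo's /status/{code} endpoint, the hey binary bundled in the loadtester image, and the container name loadtester from the tester kustomization:

    kubectl --kubeconfig "$USER_CONFIG" -n test exec deploy/flagger-loadtester -c loadtester -- hey -z 1m -c 5 -q 5 http://podinfo-canary.test:9898/status/404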

    3 Final verification

    Finally, we run the complete experiment in one pass. The script advanced_canary.sh looks like this:

    #!/usr/bin/env sh
    SCRIPT_PATH="$(
        cd "$(dirname "$0")" >/dev/null 2>&1
        pwd -P
    )/"
    cd "$SCRIPT_PATH" || exit
    
    source config
    alias k="kubectl --kubeconfig $USER_CONFIG"
    alias m="kubectl --kubeconfig $MESH_CONFIG"
    alias h="helm --kubeconfig $USER_CONFIG"
    
    echo "#### I Bootstrap ####"
    echo "1 Create a test namespace with Istio sidecar injection enabled:"
    k delete ns test
    m delete ns test
    k create ns test
    m create ns test
    m label namespace test istio-injection=enabled
    
    echo "2 Create a deployment and a horizontal pod autoscaler:"
    k apply -f $FLAAGER_SRC/kustomize/podinfo/deployment.yaml -n test
    k apply -f resources_hpa/requests_total_hpa.yaml
    k get hpa -n test
    
    echo "3 Deploy the load testing service to generate traffic during the canary analysis:"
    k apply -k "https://github.com/fluxcd/flagger//kustomize/tester?ref=main"
    
    k get pod,svc -n test
    echo "......"
    sleep 40s
    
    echo "4 Create a canary custom resource:"
    k apply -f resources_canary2/metrics-404.yaml
    k apply -f resources_canary2/podinfo-canary.yaml
    
    k get pod,svc -n test
    echo "......"
    sleep 120s
    
    echo "#### III Automated canary promotion ####"
    
    echo "1 Trigger a canary deployment by updating the container image:"
    k -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1
    
    echo "2 Flagger detects that the deployment revision changed and starts a new rollout:"
    
    while true; do k -n test describe canary/podinfo; sleep 10s;done

    Run the complete experiment script with the following command:

    sh progressive_delivery/advanced_canary.sh

    The output of the experiment looks like this:

    
    #### I Bootstrap ####
    1 Create a test namespace with Istio sidecar injection enabled:
    namespace "test" deleted
    namespace "test" deleted
    namespace/test created
    namespace/test created
    namespace/test labeled
    2 Create a deployment and a horizontal pod autoscaler:
    deployment.apps/podinfo created
    horizontalpodautoscaler.autoscaling/podinfo-total created
    NAME            REFERENCE            TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
    podinfo-total   Deployment/podinfo   <unknown>/10 (avg)   1         5         0          0s
    3 Deploy the load testing service to generate traffic during the canary analysis:
    service/flagger-loadtester created
    deployment.apps/flagger-loadtester created
    NAME                                      READY   STATUS     RESTARTS   AGE
    pod/flagger-loadtester-76798b5f4c-ftlbn   0/2     Init:0/1   0          1s
    pod/podinfo-689f645b78-65n9d              1/1     Running    0          28s
    
    NAME                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
    service/flagger-loadtester   ClusterIP   172.21.15.223   <none>        80/TCP    1s
    ......
    4 Create a canary custom resource:
    metrictemplate.flagger.app/not-found-percentage created
    canary.flagger.app/podinfo created
    NAME                                      READY   STATUS    RESTARTS   AGE
    pod/flagger-loadtester-76798b5f4c-ftlbn   2/2     Running   0          41s
    pod/podinfo-689f645b78-65n9d              1/1     Running   0          68s
    
    NAME                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
    service/flagger-loadtester   ClusterIP   172.21.15.223   <none>        80/TCP    41s
    ......
    #### III Automated canary promotion ####
    1 Trigger a canary deployment by updating the container image:
    deployment.apps/podinfo image updated
    2 Flagger detects that the deployment revision changed and starts a new rollout:
    
    Events:
      Type     Reason  Age                  From     Message
      ----     ------  ----                 ----     -------
      Warning  Synced  10m                  flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
      Normal   Synced  9m23s (x2 over 10m)  flagger  all the metrics providers are available!
      Normal   Synced  9m23s                flagger  Initialization done! podinfo.test
      Normal   Synced  8m23s                flagger  New revision detected! Scaling up podinfo.test
      Normal   Synced  7m23s                flagger  Starting canary analysis for podinfo.test
      Normal   Synced  7m23s                flagger  Pre-rollout check acceptance-test passed
      Normal   Synced  7m23s                flagger  Advance podinfo.test canary weight 10
      Normal   Synced  6m23s                flagger  Advance podinfo.test canary weight 20
      Normal   Synced  5m23s                flagger  Advance podinfo.test canary weight 30
      Normal   Synced  4m23s                flagger  Advance podinfo.test canary weight 40
      Normal   Synced  23s (x4 over 3m23s)  flagger  (combined from similar events): Promotion completed! Scaling down podinfo.test
     