Flagger on ASM: Progressive Canary Release Based on Mixerless Telemetry, Part 1: Telemetry Data

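Summary: The Mixerless Telemetry technology of the service mesh ASM provides non-intrusive telemetry data for application containers. On the one hand, the telemetry data is collected by ARMS/Prometheus as monitoring metrics for service mesh observability; on the other hand, it is consumed by HPA and Flagger, forming the foundation for application-level autoscaling and progressive canary release. This series focuses on how telemetry data is used in practice for application-level autoscaling and progressive canary release, and is split into three parts: telemetry data (monitoring metrics), application-level autoscaling, and progressive canary release.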

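Overall Architecture

The overall architecture of this series is shown in the figure below:

ASM pushes the EnvoyFilter configuration for Mixerless Telemetry to each ASM sidecar (Envoy), which enables the collection of application-level monitoring metrics.
Application traffic enters through the Ingress Gateway, and each ASM sidecar starts collecting the relevant monitoring metrics.
Prometheus scrapes the monitoring metrics from each pod.
HPA queries Prometheus through an adapter for the monitoring metrics of the relevant pods, and scales out or in according to its configuration.
Flagger queries Prometheus for the monitoring metrics of the relevant pods, and according to its configuration submits VirtualService configuration updates to ASM.
ASM pushes the VirtualService configuration to each ASM sidecar, which implements the progressive canary release.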
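
To make the HPA and Flagger steps above more concrete, the following sketch builds two PromQL expressions of the kind such components typically evaluate against the sidecar metrics: a request success rate and a P99 request latency. The label selectors, the default 1m window, and the `test`/`podinfo` names in the usage comment are assumptions for illustration; they are not claimed to be the exact queries that Flagger or the HPA adapter ship with.

```shell
#!/usr/bin/env bash
# Sketch: PromQL builders for request success rate and P99 latency.
# Selectors and the default window are illustrative assumptions.

# success_rate_query NAMESPACE WORKLOAD [WINDOW]
# Percentage of requests without a 5xx response code, as reported by the
# destination sidecar.
success_rate_query() {
  local ns="$1" wl="$2" win="${3:-1m}"
  local sel="reporter=\"destination\",destination_workload_namespace=\"${ns}\",destination_workload=\"${wl}\""
  printf 'sum(rate(istio_requests_total{%s,response_code!~"5.*"}[%s])) / sum(rate(istio_requests_total{%s}[%s])) * 100\n' \
    "$sel" "$win" "$sel" "$win"
}

# p99_latency_query NAMESPACE WORKLOAD [WINDOW]
# 99th percentile of the request duration histogram, in milliseconds.
p99_latency_query() {
  local ns="$1" wl="$2" win="${3:-1m}"
  printf 'histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_workload_namespace="%s",destination_workload="%s"}[%s])) by (le))\n' \
    "$ns" "$wl" "$win"
}

# Usage: print a query and paste it into the Prometheus query box.
#   success_rate_query test podinfo
#   p99_latency_query test podinfo 5m
```

Keeping the selector in one variable means the numerator and denominator of the success rate cannot drift apart when the namespace or workload changes.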

Flagger Progressive Release Process

The Flagger website describes the progressive release process, quoted below. In this series, the acceptance tests and load tests are initiated by a flagger-loadtester instance, and traffic is mirrored from primary to canary only when mirroring is configured.

With the above configuration, Flagger will run a canary release with the following steps:

detect new revision (deployment spec, secrets or configmaps changes)

scale from zero the canary deployment

wait for the HPA to set the canary minimum replicas

check canary pods health

run the acceptance tests

abort the canary release if tests fail

start the load tests

mirror 100% of the traffic from primary to canary

check request success rate and request duration every minute

abort the canary release if the metrics check failure threshold is reached

stop traffic mirroring after the number of iterations is reached

route live traffic to the canary pods

promote the canary (update the primary secrets, configmaps and deployment spec)

wait for the primary deployment rollout to finish

wait for the HPA to set the primary minimum replicas

check primary pods health

switch live traffic back to primary

scale to zero the canary

send notification with the canary analysis result
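
The quoted steps correspond to fields of a Flagger Canary resource. As a rough illustration, the sketch below writes such a manifest to a temporary file so that it can be reviewed before being applied. Everything specific in it is an assumption made for this example: the `podinfo` and `test` names, port 9898, the thresholds, the `flagger-loadtester.test` URL, and the HPA API version. It is not the configuration used later in this series.

```shell
#!/usr/bin/env bash
# Sketch: an illustrative Flagger Canary manifest (all values are assumptions).
canary_file="${TMPDIR:-/tmp}/podinfo-canary-sketch.yaml"

cat > "$canary_file" <<'EOF'
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  targetRef:                 # Deployment that Flagger watches for new revisions
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:             # HPA that sets the minimum replicas
    apiVersion: autoscaling/v2beta2   # adjust to the HPA API version of the cluster
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    port: 9898
  analysis:
    interval: 1m             # check the metrics every minute
    threshold: 5             # abort after this many failed checks
    iterations: 10           # number of analysis iterations
    mirror: true             # mirror traffic from primary to canary
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99              # percent
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500             # milliseconds
      interval: 1m
    webhooks:
    - name: acceptance-test
      type: pre-rollout      # runs before traffic reaches the canary
      url: http://flagger-loadtester.test/
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -s http://podinfo-canary.test:9898/version"
    - name: load-test
      type: rollout          # runs during the analysis
      url: http://flagger-loadtester.test/
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
EOF

echo "wrote $canary_file"
# To apply it to a cluster where Flagger is installed (assumption):
#   kubectl --kubeconfig "$USER_CONFIG" apply -f "$canary_file"
```

The quoted heredoc delimiter (`'EOF'`) keeps the shell from expanding anything inside the manifest.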

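Prerequisites

An ACK cluster has been created. For details, see Create a managed Kubernetes cluster.
An ASM instance has been created. For details, see Create an ASM instance.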

Setup Mixerless Telemetry

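This part describes how to configure and collect application-level monitoring metrics on ASM, such as the total request count istio_requests_total and the request latency istio_request_duration. The main steps are creating the EnvoyFilter, verifying the telemetry data in Envoy, and verifying that Prometheus collects the telemetry data.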

1 EnvoyFilter

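Log on to the ASM console, choose Service Mesh > Mesh Management in the left-side navigation pane, and open the feature configuration page of the ASM instance.

Select Enable Prometheus metrics collection.
Select Enable self-built Prometheus and enter the Prometheus service address: prometheus:9090 (this series uses the community edition of Prometheus, and later sections rely on this setting). If you use the Alibaba Cloud product ARMS, see Integrate ARMS Prometheus to monitor the mesh.
Select Enable Kiali (optional).

After you click OK, the list of EnvoyFilters generated by ASM appears in the control plane: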
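
The same list can probably also be retrieved from the command line. The sketch below assumes that a kubeconfig for the ASM control plane is available in a variable named `MESH_CONFIG` and that the generated EnvoyFilters live in the `istio-system` namespace; both are assumptions that the article itself does not state.

```shell
#!/usr/bin/env bash
# Sketch: list the generated EnvoyFilters with kubectl.
# MESH_CONFIG is an assumed variable holding the ASM control-plane kubeconfig.

# list_envoyfilters [NAMESPACE]
list_envoyfilters() {
  kubectl --kubeconfig "$MESH_CONFIG" get envoyfilter -n "${1:-istio-system}"
}

# Usage:
#   list_envoyfilters            # defaults to istio-system
```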

2 Prometheus

2.1 Install

Run the following command to install Prometheus (for the complete script, see demo_mixerless.sh).

kubectl --kubeconfig "$USER_CONFIG" apply -f $ISTIO_SRC/samples/addons/prometheus.yaml

2.2 Config Scrape

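After Prometheus is installed, its configuration needs to be extended with the Istio-related monitoring metrics. Log on to the ACK console, choose Configurations > ConfigMaps in the left-side navigation pane, find the prometheus entry under istio-system, and click Edit.

In the prometheus.yml configuration, append the contents of scrape_configs.yaml to scrape_configs.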
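
For orientation, the sketch below writes a trimmed example of one such scrape job to a temporary file. It is modeled on the envoy-stats job commonly used with Istio and is an assumption about what a file like scrape_configs.yaml may contain; the actual file referenced above may differ and defines more jobs.

```shell
#!/usr/bin/env bash
# Sketch: a trimmed scrape job for sidecar metrics (illustrative assumption,
# not the actual contents of scrape_configs.yaml).
scrape_file="${TMPDIR:-/tmp}/envoy-stats-scrape-sketch.yaml"

cat > "$scrape_file" <<'EOF'
- job_name: 'envoy-stats'
  metrics_path: /stats/prometheus   # endpoint served by the sidecar
  kubernetes_sd_configs:
  - role: pod                       # discover pods through the Kubernetes API
  relabel_configs:
  # keep only container ports whose name ends in -envoy-prom
  - source_labels: [__meta_kubernetes_pod_container_port_name]
    action: keep
    regex: '.*-envoy-prom'
  # copy namespace and pod name into regular labels
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod_name
EOF

echo "wrote $scrape_file"
```

The `keep` rule is what restricts scraping to sidecar ports, so pods without an injected sidecar are skipped by this job.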

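After saving the configuration, choose Workloads > Pods in the left-side navigation pane, find the prometheus entry under istio-system, and delete the Prometheus pod so that the configuration takes effect in the new pod.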
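
As a command-line alternative to deleting the pod in the console, a rollout restart should have the same effect. The sketch assumes that Prometheus runs as a Deployment named `prometheus` in `istio-system`, which is how the Istio sample addon is expected to install it; the 120s timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: recreate the Prometheus pod so that it picks up the edited ConfigMap.
# Assumes a Deployment named "prometheus" in the istio-system namespace.
restart_prometheus() {
  kubectl --kubeconfig "$USER_CONFIG" -n istio-system rollout restart deployment/prometheus &&
    kubectl --kubeconfig "$USER_CONFIG" -n istio-system rollout status deployment/prometheus --timeout=120s
}

# Usage:
#   restart_prometheus
```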

Run the following command to list the job_name entries in the Prometheus configuration:

kubectl --kubeconfig "$USER_CONFIG" get cm prometheus -n istio-system -o jsonpath='{.data.prometheus\.yml}' | grep job_name
- job_name: 'istio-mesh'
- job_name: 'envoy-stats'
- job_name: 'istio-policy'
- job_name: 'istio-telemetry'
- job_name: 'pilot'
- job_name: 'sidecar-injector'
- job_name: prometheus
  job_name: kubernetes-apiservers
  job_name: kubernetes-nodes
  job_name: kubernetes-nodes-cadvisor
- job_name: kubernetes-service-endpoints
- job_name: kubernetes-service-endpoints-slow
  job_name: prometheus-pushgateway
- job_name: kubernetes-services
- job_name: kubernetes-pods
- job_name: kubernetes-pods-slow

Verifying Mixerless Telemetry

1 podinfo

1.1 Deploy

Use the following commands to deploy podinfo, the sample application of this series:

kubectl --kubeconfig "$USER_CONFIG" apply -f $PODINFO_SRC/kustomize/deployment.yaml -n test
kubectl --kubeconfig "$USER_CONFIG" apply -f $PODINFO_SRC/kustomize/service.yaml -n test
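
Because the metrics in this series are produced by the sidecar, it can help to confirm that the podinfo pod actually received one before generating load. The following sketch checks the container names of the first podinfo pod; the `app=podinfo` selector and the `istio-proxy` container name match the commands used elsewhere in this article, while the helper function names are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: check that the podinfo pod has an injected istio-proxy sidecar.

# has_sidecar CONTAINER_NAMES
# Succeeds if the space-separated list contains istio-proxy.
has_sidecar() {
  case " $1 " in
    *" istio-proxy "*) return 0 ;;
    *) return 1 ;;
  esac
}

# podinfo_containers: print the container names of the first podinfo pod.
podinfo_containers() {
  kubectl --kubeconfig "$USER_CONFIG" get po -n test -l app=podinfo \
    -o jsonpath='{.items[0].spec.containers[*].name}'
}

# Usage:
#   has_sidecar "$(podinfo_containers)" && echo "sidecar injected" || echo "sidecar missing"
```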

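1.2 Generate Load

Use the following commands to send requests to podinfo and produce monitoring metric data:

# Look up the name of the podinfo pod (first matching pod)
podinfo_pod=$(kubectl --kubeconfig "$USER_CONFIG" get po -n test -l app=podinfo -o jsonpath='{.items[0].metadata.name}')
# Send 10 requests from inside the podinfod container to generate metrics
for i in {1..10}; do
  kubectl --kubeconfig "$USER_CONFIG" exec "$podinfo_pod" -c podinfod -n test -- curl -s podinfo:9898/version
  echo
done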

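2 Confirm Generation (Envoy)

The monitoring metrics this series focuses on are istio_requests_total and istio_request_duration. First, we confirm inside the Envoy container that these metrics have been generated.

2.1 istio_requests_total

Use the following command to request the stats metrics from Envoy and confirm that they include istio_requests_total.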

kubectl --kubeconfig "$USER_CONFIG" exec $podinfo_pod -n test -c istio-proxy -- curl -s localhost:15090/stats/prometheus | grep istio_requests_total

The output is as follows:

:::: istio_requests_total ::::
# TYPE istio_requests_total counter
istio_requests_total{response_code="200",reporter="destination",source_workload="podinfo",source_workload_namespace="test",source_principal="spiffe://cluster.local/ns/test/sa/default",source_app="podinfo",source_version="unknown",source_cluster="c199d81d4e3104a5d90254b2a210914c8",destination_workload="podinfo",destination_workload_namespace="test",destination_principal="spiffe://cluster.local/ns/test/sa/default",destination_app="podinfo",destination_version="unknown",destination_service="podinfo.test.svc.cluster.local",destination_service_name="podinfo",destination_service_namespace="test",destination_cluster="c199d81d4e3104a5d90254b2a210914c8",request_protocol="http",response_flags="-",grpc_response_status="",connection_security_policy="mutual_tls",source_canonical_service="podinfo",destination_canonical_service="podinfo",source_canonical_revision="latest",destination_canonical_revision="latest"} 10

istio_requests_total{response_code="200",reporter="source",source_workload="podinfo",source_workload_namespace="test",source_principal="spiffe://cluster.local/ns/test/sa/default",source_app="podinfo",source_version="unknown",source_cluster="c199d81d4e3104a5d90254b2a210914c8",destination_workload="podinfo",destination_workload_namespace="test",destination_principal="spiffe://cluster.local/ns/test/sa/default",destination_app="podinfo",destination_version="unknown",destination_service="podinfo.test.svc.cluster.local",destination_service_name="podinfo",destination_service_namespace="test",destination_cluster="c199d81d4e3104a5d90254b2a210914c8",request_protocol="http",response_flags="-",grpc_response_status="",connection_security_policy="unknown",source_canonical_service="podinfo",destination_canonical_service="podinfo",source_canonical_revision="latest",destination_canonical_revision="latest"} 10
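
With ten requests sent, each reporter shows a counter value of 10 in the output above. When there are several series (for example, different response codes), adding them up by hand becomes tedious; the sketch below sums the samples for one reporter from the Prometheus text format. The function name is made up for this example, and it assumes that label values contain no spaces, as in the output shown.

```shell
#!/usr/bin/env bash
# Sketch: sum istio_requests_total samples for one reporter.

# sum_requests REPORTER
# Reads Prometheus text format on stdin and adds up the istio_requests_total
# samples whose labels contain reporter="REPORTER".
sum_requests() {
  awk -v rep="reporter=\"$1\"" '
    index($0, "istio_requests_total{") == 1 && index($0, rep) { total += $NF }
    END { print total + 0 }
  '
}

# Usage:
#   kubectl --kubeconfig "$USER_CONFIG" exec "$podinfo_pod" -n test -c istio-proxy -- \
#     curl -s localhost:15090/stats/prometheus | sum_requests destination
```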

2.2 istio_request_duration

Use the following command to request the stats metrics from Envoy and confirm that they include istio_request_duration.

kubectl --kubeconfig "$USER_CONFIG" exec $podinfo_pod -n test -c istio-proxy -- curl -s localhost:15090/stats/prometheus | grep istio_request_duration

The output is as follows:

:::: istio_request_duration ::::
# TYPE istio_request_duration_milliseconds histogram
istio_request_duration_milliseconds_bucket{response_code="200",reporter="destination",source_workload="podinfo",source_workload_namespace="test",source_principal="spiffe://cluster.local/ns/test/sa/default",source_app="podinfo",source_version="unknown",source_cluster="c199d81d4e3104a5d90254b2a210914c8",destination_workload="podinfo",destination_workload_namespace="test",destination_principal="spiffe://cluster.local/ns/test/sa/default",destination_app="podinfo",destination_version="unknown",destination_service="podinfo.test.svc.cluster.local",destination_service_name="podinfo",destination_service_namespace="test",destination_cluster="c199d81d4e3104a5d90254b2a210914c8",request_protocol="http",response_flags="-",grpc_response_status="",connection_security_policy="mutual_tls",source_canonical_service="podinfo",destination_canonical_service="podinfo",source_canonical_revision="latest",destination_canonical_revision="latest",le="0.5"} 10

istio_request_duration_milliseconds_bucket{response_code="200",reporter="destination",source_workload="podinfo",source_workload_namespace="test",source_principal="spiffe://cluster.local/ns/test/sa/default",source_app="podinfo",source_version="unknown",source_cluster="c199d81d4e3104a5d90254b2a210914c8",destination_workload="podinfo",destination_workload_namespace="test",destination_principal="spiffe://cluster.local/ns/test/sa/default",destination_app="podinfo",destination_version="unknown",destination_service="podinfo.test.svc.cluster.local",destination_service_name="podinfo",destination_service_namespace="test",destination_cluster="c199d81d4e3104a5d90254b2a210914c8",request_protocol="http",response_flags="-",grpc_response_status="",connection_security_policy="mutual_tls",source_canonical_service="podinfo",destination_canonical_service="podinfo",source_canonical_revision="latest",destination_canonical_revision="latest",le="1"} 10
...

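3 Confirm Collection (Prometheus)

Finally, we verify that the monitoring metrics generated by Envoy are collected by Prometheus in real time. Expose the Prometheus service externally and open it in a browser. Then enter istio_requests_total in the query box; the result is shown in the figure below.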
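
The same check can be made without a browser through the Prometheus HTTP API. The sketch below wraps an instant query in a small function; the port-forward in the usage comment assumes that the Service is named `prometheus` in `istio-system` and listens on port 9090, and the function name is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: run an instant query against the Prometheus HTTP API.

# prometheus_instant_query QUERY [BASE_URL]
# Prints the raw JSON response; BASE_URL defaults to a local port-forward.
prometheus_instant_query() {
  curl -s -G "${2:-http://localhost:9090}/api/v1/query" --data-urlencode "query=$1"
}

# Usage (assumes a Service named prometheus in istio-system):
#   kubectl --kubeconfig "$USER_CONFIG" -n istio-system port-forward svc/prometheus 9090:9090 &
#   prometheus_instant_query 'istio_requests_total'
```

A response whose `data.result` array is non-empty indicates that Prometheus has scraped the sidecar metrics.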

This article is original content from Alibaba Cloud and may not be reproduced without permission.