Background
In Tencent Cloud's discussion of optimizing load on Kubernetes clusters, a node resource overcommit (oversale) scheme is proposed. In short: based on each node's real historical load data, dynamically adjust the node's Allocatable (the total amount of schedulable resources) to control how many pods are allowed onto that node. The article gives the overall technical design and also raises a number of details that we have to work out ourselves; here we attempt a simple implementation of the scheme.
Implementation
The scheme described in the article is as follows:
- Each node's overcommit ratio is stored in the Node's annotations, e.g. the CPU ratio in the annotation stke.platform/cpu-oversale-ratio.
- Each node's ratio is adjusted dynamically/periodically by an in-house component based on the node's historical monitoring data.
- The overcommit feature must be switchable and reversible: setting the Node annotation stke.platform/mutate: "false" disables overcommit, and the node restores its original resources on the next heartbeat.
- A Mutating Admission Webhook on kube-apiserver intercepts Node Create and Status Update events, recomputes the Node's Allocatable & Capacity resources according to the ratio, and patches them back to the APIServer.
The key is to think through the following details.
1. How exactly does the kubelet register the Node with the apiserver (Kubelet Register Node To ApiServer), and is it feasible to patch Node Status directly through a webhook?
1.1 The heartbeat mechanisms in brief: Kubernetes currently supports two heartbeat mechanisms. The newer one, NodeLease, is much lighter (each heartbeat is roughly 0.1 KB) and largely relieves the pressure that frequent heartbeats from a large number of nodes put on the apiserver. Since NodeLease was still a feature that had to be enabled explicitly on the kubelet at the time of writing, only the original status-based heartbeat is considered here (how the two heartbeats cooperate is worth writing up separately).
A quick look at the source, the tryUpdateNodeStatus method in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go: it first gets the Node object (the comment notes that this Get is served from the apiserver cache when possible, which is cheaper), then setNodeStatus fills in the node's status (the heartbeat payload is essentially the node.status field), e.g. MachineInfo (CPU and memory), the image list, conditions and so on (see pkg/kubelet/kubelet_node_status.go#defaultNodeStatusFuncs for the full list of setters). Finally PatchNodeStatus sends the node to the apiserver; the nodeStatusUpdateFrequency parameter controls how often this happens, 10s by default.
```go
func (kl *Kubelet) tryUpdateNodeStatus(tryNumber int) error {
	// In large clusters, GET and PUT operations on Node objects coming
	// from here are the majority of load on apiserver and etcd.
	// To reduce the load on etcd, we are serving GET operations from
	// apiserver cache (the data might be slightly delayed but it doesn't
	// seem to cause more conflict - the delays are pretty small).
	// If it result in a conflict, all retries are served directly from etcd.
	opts := metav1.GetOptions{}
	if tryNumber == 0 {
		util.FromApiserverCache(&opts)
	}
	node, err := kl.heartbeatClient.CoreV1().Nodes().Get(context.TODO(), string(kl.nodeName), opts)
	if err != nil {
		return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
	}

	originalNode := node.DeepCopy()
	if originalNode == nil {
		return fmt.Errorf("nil %q node object", kl.nodeName)
	}

	podCIDRChanged := false
	if len(node.Spec.PodCIDRs) != 0 {
		// Pod CIDR could have been updated before, so we cannot rely on
		// node.Spec.PodCIDR being non-empty. We also need to know if pod CIDR is
		// actually changed.
		podCIDRs := strings.Join(node.Spec.PodCIDRs, ",")
		if podCIDRChanged, err = kl.updatePodCIDR(podCIDRs); err != nil {
			klog.Errorf(err.Error())
		}
	}

	kl.setNodeStatus(node)

	now := kl.clock.Now()
	if now.Before(kl.lastStatusReportTime.Add(kl.nodeStatusReportFrequency)) {
		if !podCIDRChanged && !nodeStatusHasChanged(&originalNode.Status, &node.Status) {
			// We must mark the volumes as ReportedInUse in volume manager's dsw even
			// if no changes were made to the node status (no volumes were added or removed
			// from the VolumesInUse list).
			//
			// The reason is that on a kubelet restart, the volume manager's dsw is
			// repopulated and the volume ReportedInUse is initialized to false, while the
			// VolumesInUse list from the Node object still contains the state from the
			// previous kubelet instantiation.
			//
			// Once the volumes are added to the dsw, the ReportedInUse field needs to be
			// synced from the VolumesInUse list in the Node.Status.
			//
			// The MarkVolumesAsReportedInUse() call cannot be performed in dsw directly
			// because it does not have access to the Node object.
			// This also cannot be populated on node status manager init because the volume
			// may not have been added to dsw at that time.
			kl.volumeManager.MarkVolumesAsReportedInUse(node.Status.VolumesInUse)
			return nil
		}
	}

	// Patch the current status on the API server
	updatedNode, _, err := nodeutil.PatchNodeStatus(kl.heartbeatClient.CoreV1(), types.NodeName(kl.nodeName), originalNode, node)
	if err != nil {
		return err
	}
	kl.lastStatusReportTime = now
	kl.setLastObservedNodeAddresses(updatedNode.Status.Addresses)
	// If update finishes successfully, mark the volumeInUse as reportedInUse to indicate
	// those volumes are already updated in the node's status
	kl.volumeManager.MarkVolumesAsReportedInUse(updatedNode.Status.VolumesInUse)
	return nil
}
```
1.2 Is it feasible to patch node.status through a webhook? Verified: a mutating webhook configured to intercept the heartbeat can patch node.status successfully.
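To make this concrete, below is a minimal sketch of the mutating logic, assuming the admission/v1 API and the annotations from the scheme above; the helper names (mutateNode, patchOp) and the exact scaling rule are ours, not the original implementation. It builds the JSON patch that the admission response carries back to the apiserver:

```go
package webhook

import (
	"encoding/json"
	"fmt"
	"strconv"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// patchOp is a single JSON Patch operation (RFC 6902).
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// mutateNode handles a Node create / status-update admission request: it reads the
// oversale ratio from the node's annotations and rewrites the CPU values in
// status.allocatable and status.capacity accordingly.
func mutateNode(req *admissionv1.AdmissionRequest) (*admissionv1.AdmissionResponse, error) {
	node := corev1.Node{}
	if err := json.Unmarshal(req.Object.Raw, &node); err != nil {
		return nil, fmt.Errorf("decode node: %v", err)
	}

	resp := &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}

	// Overcommit can be switched off per node; in that case let the heartbeat through unchanged.
	if node.Annotations["stke.platform/mutate"] == "false" {
		return resp, nil
	}
	ratio, err := strconv.ParseFloat(node.Annotations["stke.platform/cpu-oversale-ratio"], 64)
	if err != nil || ratio <= 1 {
		return resp, nil // no ratio set, or nothing to oversell
	}

	// Scale a CPU quantity by the ratio, working in millicores to avoid float quantities.
	scale := func(q resource.Quantity) string {
		return resource.NewMilliQuantity(int64(float64(q.MilliValue())*ratio), resource.DecimalSI).String()
	}
	patch := []patchOp{
		{Op: "replace", Path: "/status/allocatable/cpu", Value: scale(node.Status.Allocatable[corev1.ResourceCPU])},
		{Op: "replace", Path: "/status/capacity/cpu", Value: scale(node.Status.Capacity[corev1.ResourceCPU])},
	}
	raw, err := json.Marshal(patch)
	if err != nil {
		return nil, err
	}

	pt := admissionv1.PatchTypeJSONPatch
	resp.Patch = raw
	resp.PatchType = &pt
	return resp, nil
}
```

Because the kubelet re-reports the real values every heartbeat, the webhook has to re-apply the ratio on every nodes/status update; conversely, removing the annotation (or setting stke.platform/mutate to "false") lets the next heartbeat restore the real values, which is exactly the rollback behaviour the scheme asks for.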
1.3 Webhook configuration details: a MutatingWebhookConfiguration registers the webhook with the apiserver, as shown below.
```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: demo-webhook
  labels:
    app: demo-webhook
    kind: mutator
webhooks:
  - name: demo-webhook.app.svc
    clientConfig:
      service:
        name: demo-webhook
        namespace: app
        path: "/mutate"
      caBundle: ${CA_BUNDLE}
    rules:
      - operations: [ "UPDATE" ]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["nodes/status"]
```
Note the .webhooks[0].rules field in the YAML above. For a Kubernetes resource such as a Pod, pod/status is called a subresource. At first we set rules.resources to ["nodes"] or ["*"], and in both cases the webhook never received the 10-second heartbeats. The explanation is in the apiserver code that decides whether an incoming request should be forwarded to a webhook: the matcher checks the resource and the subresource separately, so only ["*/*"] matches every case (and the status heartbeat specifically needs ["nodes/status"]):
```go
func (r *Matcher) resource() bool {
	opRes, opSub := r.Attr.GetResource().Resource, r.Attr.GetSubresource()
	for _, res := range r.Rule.Resources {
		res, sub := splitResource(res)
		resMatch := res == "*" || res == opRes
		subMatch := sub == "*" || sub == opSub
		if resMatch && subMatch {
			return true
		}
	}
	return false
}
```
1.4 What kinds of patch does Kubernetes support? The apiserver distinguishes them by the Content-Type header of the request.
JSON Patch, Content-Type: application/json-patch+json, see https://tools.ietf.org/html/rfc6902. This flavour supports a rich set of operations such as add, replace, remove and copy.
Merge Patch, Content-Type: application/merge-patch+json, see https://tools.ietf.org/html/rfc7386. With this flavour, whatever the patch says about a field always replaces the current value of that field.
Strategic Merge Patch, Content-Type: application/strategic-merge-patch+json. This flavour uses metadata declared on the object's type definition to decide which fields should be merged rather than replaced by default. For example, the containers field in pod.Spec declares via patchStrategy and patchMergeKey that entries with different name values are merged:
```go
Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`
```
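For illustration (the clientset wiring and the example annotation value are assumptions), the three flavours can be sent with client-go as follows; the patch type constant passed to Patch() decides which Content-Type goes on the wire:

```go
package patches

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchNodeRatio expresses the same logical change - setting the oversale ratio
// annotation on a node - with each of the three patch flavours.
func patchNodeRatio(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	// JSON Patch (application/json-patch+json): explicit operations; "/" inside the
	// annotation key must be escaped as "~1", and the annotations map must already exist.
	jsonPatch := []byte(`[{"op": "add", "path": "/metadata/annotations/stke.platform~1cpu-oversale-ratio", "value": "1.5"}]`)
	if _, err := cs.CoreV1().Nodes().Patch(ctx, nodeName, types.JSONPatchType, jsonPatch, metav1.PatchOptions{}); err != nil {
		return err
	}

	// Merge Patch (application/merge-patch+json): the given fields replace the current values.
	mergePatch := []byte(`{"metadata": {"annotations": {"stke.platform/cpu-oversale-ratio": "1.5"}}}`)
	if _, err := cs.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, mergePatch, metav1.PatchOptions{}); err != nil {
		return err
	}

	// Strategic Merge Patch (application/strategic-merge-patch+json): merge behaviour is
	// driven by the patchStrategy/patchMergeKey tags on the type definition.
	strategicPatch := []byte(`{"metadata": {"annotations": {"stke.platform/cpu-oversale-ratio": "1.5"}}}`)
	_, err := cs.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, strategicPatch, metav1.PatchOptions{})
	return err
}
```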
2. After node resources are oversold, does Kubernetes' dynamic cgroup adjustment keep working correctly?
3. Node status is updated very frequently and every status update triggers the webhook; in a large cluster this can become a performance problem for the apiserver. How do we deal with it?
The NodeLease heartbeat could be enabled (it needs to be checked whether the NodeLease heartbeat carries the allocatable watermark at all); the status heartbeat interval could also be increased so that it is reported less often, and in any case the webhook logic should be kept as simple as possible.
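As a quick way to perform that check, the sketch below (helper name is ours) reads a node's Lease from the kube-node-lease namespace; in the coordination.k8s.io/v1 API its spec only holds renewal metadata (holderIdentity, leaseDurationSeconds, renewTime, ...) rather than any resource values, so it would not carry allocatable:

```go
package leasecheck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpNodeLease prints the spec of a node's Lease heartbeat object. Node leases
// live in the kube-node-lease namespace and are named after the node.
func dumpNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Only renewal metadata is printed here, because that is all the LeaseSpec contains.
	fmt.Printf("%+v\n", lease.Spec)
	return nil
}
```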
4. Does overselling node resources also effectively oversell the kubelet eviction configuration, or does eviction still act on the node's real capacity and load? If eviction is affected, how should that be solved?
5. When the ratio is lowered, a node may end up with Sum(pods' request resource) > node's allocatable. Is this risky, and how should it be handled?
A first-cut answer is to never let allocatable drop below Sum(pods' request resource) when lowering the ratio, as sketched below.
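A minimal sketch of that clamp, with an assumed helper name and assuming the sum of CPU requests for the pods bound to the node has already been computed elsewhere:

```go
package oversale

import (
	"k8s.io/apimachinery/pkg/api/resource"
)

// clampedAllocatable applies the (possibly lowered) oversale ratio to the real
// allocatable value, but never returns less than what the pods on the node have
// already requested, so lowering the ratio cannot push the node into
// Sum(requests) > allocatable.
func clampedAllocatable(realAllocatable resource.Quantity, ratio float64, requested resource.Quantity) resource.Quantity {
	scaled := resource.NewMilliQuantity(int64(float64(realAllocatable.MilliValue())*ratio), resource.DecimalSI)
	if scaled.Cmp(requested) < 0 {
		return requested
	}
	return *scaled
}
```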
6. Node monitoring depends on the Node's Allocatable & Capacity resources, so after overselling, the monitoring data for the node is no longer accurate and needs some correction. How can the monitoring system learn the current ratio dynamically and adjust its data and dashboards accordingly?
7. How should Allocatable and Capacity each be oversold, and what effect does overselling have on the node's reserved resources?
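For context, the kubelet normally derives Allocatable as Capacity minus kube-reserved, system-reserved and the hard eviction threshold. One possible answer, sketched below with an assumed helper (not the original design), is to scale Capacity by the ratio and then subtract the real, un-inflated reservation, so that the reserved slice keeps its true size:

```go
package oversale

import (
	"k8s.io/apimachinery/pkg/api/resource"
)

// oversoldResources scales the node's real Capacity by the ratio and re-derives
// Allocatable from it, keeping the reserved slice (kube-reserved + system-reserved
// + hard eviction threshold) at its real size instead of inflating it as well.
func oversoldResources(realCapacity, reserved resource.Quantity, ratio float64) (capacity, allocatable resource.Quantity) {
	capacity = *resource.NewMilliQuantity(int64(float64(realCapacity.MilliValue())*ratio), resource.DecimalSI)
	allocatable = capacity.DeepCopy()
	allocatable.Sub(reserved)
	return capacity, allocatable
}
```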
Notes
In an OCP (OpenShift Container Platform) environment, the mutating admission webhook has to be enabled by adding the following to the master configuration:
```yaml
admissionConfig:
  pluginConfig:
    MutatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        kind: DefaultAdmissionConfig
        disable: false
```