1 Node taints, pod tolerations, and node affinity compared
1.1 Node taints and pod tolerations
Taints are a property of nodes, while tolerations are a property of pods. A pod can only be scheduled onto a node if the pod's tolerations cover that node's taints.
1.2 When to use taints and tolerations versus node affinity
Taints work from the node side: you add taints to existing nodes to keep certain pods from being scheduled there. Node affinity works from the pod side: the pod definition explicitly states which nodes the pod can or cannot be scheduled onto.
2 Understanding node taints and pod tolerations
2.1 Viewing a node's taints
Name:               master
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=master
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"b6:6a:dc:5d:74:7e"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 172.16.70.6
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 21 Dec 2020 11:40:57 +0800
Taints:             node-role.kubernetes.io/master:NoSchedule
- The master node carries a node-role.kubernetes.io/master:NoSchedule taint
- Pods without a matching toleration cannot be scheduled onto this node; normally only system-level pods tolerate it and can land on the master (a matching toleration is sketched below)
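As a minimal sketch (not taken from the cluster output above), a pod that needs to run on the master could declare a toleration like the following in its spec; operator: Exists matches the taint regardless of its value:

tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule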
2.2 Viewing a pod's tolerations
Name:            kube-proxy-z6nwk
Namespace:       kube-system
......
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     op=Exists
                 CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:          <none>
- The Tolerations field above lists the taints this system pod tolerates
2.3 Understanding taint effects
Every taint on a node is associated with one of three effects (example taint commands for each effect follow the list):
- NoSchedule: pods that do not tolerate the taint cannot be scheduled onto the node
- PreferNoSchedule: the scheduler tries to avoid placing non-tolerating pods on the node, but will still schedule them there if no other node is available
- NoExecute: also affects pods already running on the node; pods that do not tolerate the taint are evicted, even if they were already scheduled there
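For reference, taints with each effect could be added with commands like the following (the key/value names here are just placeholders, chosen to match the style used in the next subsection):

## hypothetical taints, one per effect
k taint node node01 node-type=product:NoSchedule
k taint node node01 node-type=product:PreferNoSchedule
k taint node node01 maintenance=true:NoExecute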
2.4 Adding and removing a taint on a node
## add a taint
k taint node node01 node-type=product:NoSchedule
## remove the taint
k taint node node01 node-type:NoSchedule-
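To confirm that the taint was applied or removed, the node can be inspected, for example:

k describe node node01 | grep Taints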
2.5 Creating a few pods to observe the effect
[root@node01 wxm]# k get po test{1,2,3,4,5} -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
test1   1/1     Running   0          2m41s   10.244.1.126   node02   <none>           <none>
test2   1/1     Running   0          2m26s   10.244.1.127   node02   <none>           <none>
test3   1/1     Running   0          2m21s   10.244.1.128   node02   <none>           <none>
test4   1/1     Running   0          2m15s   10.244.1.129   node02   <none>           <none>
test5   1/1     Running   0          2m10s   10.244.1.130   node02   <none>           <none>
- All of the pods were scheduled onto node02, because node01 is tainted and these pods carry no matching toleration (the exact creation commands are not shown above; a plausible version follows)
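The test pods could have been created with commands like these; this is only an assumption consistent with the pod names, since the original does not show them:

## hypothetical: create five plain busybox pods with no tolerations
for i in 1 2 3 4 5; do
  k run test$i --image=busybox -- sleep 999999
done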
2.6 Adding a taint toleration to a pod
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prod
spec:
  replicas: 5
  template:
    metadata:
      name: prod
      labels:
        app: prod
    spec:
      containers:
      - image: busybox
        command: ["sleep","999999"]
        name: busybox
      tolerations:
      - key: node-type
        operator: Equal
        value: product
        effect: NoSchedule
- Adding the toleration allows the pod to be scheduled onto nodes carrying that taint, but it does not restrict the pod to those tainted nodes; a sketch of how to actually pin it there follows
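If the goal is to dedicate the tainted node to these pods, the toleration is usually combined with a nodeSelector (or node affinity) in the same pod spec. A sketch, assuming node01 also carries a node-type=product label (that label is not shown in the original):

tolerations:
- key: node-type
  operator: Equal
  value: product
  effect: NoSchedule
nodeSelector:
  node-type: product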
2.7 Inspecting the tolerations of the new pods
Name:            prod-7c8c7f9b47-xcbkt
Namespace:       default
........
Tolerations:     node-type=product:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
........
- This pod has three tolerations. The first tolerates the node-type=product:NoSchedule taint, so the pod may be scheduled onto nodes carrying that taint
- The second and third apply when a node becomes not-ready or unreachable: pods on such a node are normally evicted and rescheduled, and the 300s is the grace period before that happens. If the node is still not-ready or unreachable after 300s, the pod is evicted and rescheduled onto another node. This period can also be set explicitly, as sketched below
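The grace period corresponds to tolerationSeconds in the pod spec; a minimal sketch that overrides the default (the 60s value is just an example):

tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60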
3 Understanding node affinity
3.1 Label the two worker nodes with an availability zone and a dedicated/shared type
k label node node01 availability-zone=zone1
k label node node02 availability-zone=zone2
k label node node01 share-type=dedicated
k label node node02 share-type=shared
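The labels can be verified with the -L flag, which adds the label values as extra columns:

k get nodes -L availability-zone,share-type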
3.2 Deploy a Deployment with preferred node-affinity weights of 80 for zone1 and 20 for dedicated nodes
apiVersion: extensions/v1beta1 kind: Deployment metadata: name: prod spec: replicas: 5 template: metadata: name: prod labels: app: prod spec: containers: - image: busybox command: ["sleep","999999"] name: busybox affinity: nodeAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 80 preference: matchExpressions: - key: availability-zone operator: In values: - zone1 - weight: 20 preference: matchExpressions: - key: share-type operator: In values: - dedicated
3.3 Observe where the pods were scheduled
[root@node01 Chapter16]# k get po -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP             NODE     NOMINATED NODE   READINESS GATES
prod-5ffd8886-4646n   1/1     Running   0          52m   10.244.2.130   node01   <none>           <none>
prod-5ffd8886-c8p4v   1/1     Running   0          52m   10.244.2.129   node01   <none>           <none>
prod-5ffd8886-crhtp   1/1     Running   0          52m   10.244.1.156   node02   <none>           <none>
prod-5ffd8886-hrn2b   1/1     Running   0          52m   10.244.2.131   node01   <none>           <none>
prod-5ffd8886-x4qv6   1/1     Running   0          52m   10.244.2.132   node01   <none>           <none>
- Most of the Deployment's pods landed on node01, which matches expectations: node01 carries the availability-zone=zone1 label that the high-weight preference selects
- You may wonder why one pod still ended up on node02. The scheduler also tries to spread pods of the same Deployment across nodes, so that if one node fails, the remaining pods can still serve traffic. If the pods really must run only on zone1 nodes, the required variant sketched below can be used instead
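For comparison, a hard requirement (rather than a preference) would use requiredDuringSchedulingIgnoredDuringExecution; a minimal sketch against the same labels:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: availability-zone
          operator: In
          values:
          - zone1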
4 Understanding inter-pod affinity
4.1 What is pod affinity, and when is it useful?
Pod affinity lets you schedule pods relative to other pods: if a pod with a certain label is already running on a node, other pods can be attracted to that same node. For example, suppose a backend pod is running on node01; when deploying the frontend pods, we would like them to land on the same machine as the backend, which keeps traffic local and can noticeably improve performance.
4.2 Example, step 1: deploy a backend pod
k run backend -l app=backend --image=busybox -- sleep 999999
4.3 Then deploy five frontend pods
apiVersion: extensions/v1beta1 kind: Deployment metadata: name: prod spec: replicas: 5 template: metadata: name: prod labels: app: prod spec: containers: - image: busybox command: ["sleep","999999"] name: busybox affinity: podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - topologyKey: kubernetes.io/hostname labelSelector: matchLabels: app: backend
4.4 Check whether the pods landed on the same machine
NAME                   READY   STATUS    RESTARTS   AGE    IP             NODE     NOMINATED NODE   READINESS GATES
backend                1/1     Running   0          12m    10.244.1.157   node02   <none>           <none>
prod-bdd66cb75-7bcmj   1/1     Running   0          118s   10.244.1.162   node02   <none>           <none>
prod-bdd66cb75-r69q7   1/1     Running   0          118s   10.244.1.161   node02   <none>           <none>
prod-bdd66cb75-r8fjn   1/1     Running   0          118s   10.244.1.159   node02   <none>           <none>
prod-bdd66cb75-twphc   1/1     Running   0          118s   10.244.1.158   node02   <none>           <none>
prod-bdd66cb75-vkrm8   1/1     Running   0          118s   10.244.1.160   node02   <none>           <none>
The result is as expected: all the frontend pods were scheduled onto the same node as the backend.
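The topologyKey controls the scope of "together": with kubernetes.io/hostname it means the same node, while a zone label would only require the same zone. A sketch reusing the availability-zone node label added earlier:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: availability-zone
      labelSelector:
        matchLabels:
          app: backend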
5 Relaxing pod affinity to a preference, and pod anti-affinity
5.1 Turn the previous hard requirement into a preference: the scheduler tries to place the pods on those nodes first, but may fall back to other nodes if needed. An example is shown below
[root@node01 Chapter16]# cat frontend-podaffinity-host-2.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prod
spec:
  replicas: 5
  template:
    metadata:
      name: prod
      labels:
        app: prod
    spec:
      containers:
      - image: busybox
        command: ["sleep","999999"]
        name: busybox
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: backend
5.2 Check how the frontend pods were scheduled
[root@node01 Chapter16]# k get po -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP             NODE     NOMINATED NODE   READINESS GATES
backend                 1/1     Running   0          37m   10.244.1.157   node02   <none>           <none>
prod-7cd8bf84c4-862qw   1/1     Running   0          11s   10.244.1.165   node02   <none>           <none>
prod-7cd8bf84c4-db6j5   1/1     Running   0          11s   10.244.1.164   node02   <none>           <none>
prod-7cd8bf84c4-dzsq7   1/1     Running   0          11s   10.244.2.133   node01   <none>           <none>
prod-7cd8bf84c4-fhds5   1/1     Running   0          11s   10.244.1.163   node02   <none>           <none>
prod-7cd8bf84c4-mkps2   1/1     Running   0          11s   10.244.1.166   node02   <none>           <none>
- Most of the pods were scheduled onto the same node as backend
- One pod ended up on node01; that is the benefit of not making the rule mandatory: the scheduler can still spread the pods
- So in most cases a preferred (soft) rule is a better choice than a required (hard) one
5.3 Using pod anti-affinity to spread pods across different nodes
apiVersion: extensions/v1beta1 kind: Deployment metadata: name: prod spec: replicas: 5 template: metadata: name: prod labels: app: prod spec: containers: - image: busybox command: ["sleep","999999"] name: busybox affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - topologyKey: kubernetes.io/hostname labelSelector: matchLabels: app: prod
- With anti-affinity, a pod cannot be scheduled onto a node that already runs a pod matching the selector; here each pod labeled app=prod repels the others
- The key field is podAntiAffinity
5.4 Check the scheduling result
NAME                    READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
prod-76d8477ff8-5jvv5   1/1     Running   0          3m10s   10.244.2.141   node01   <none>           <none>
prod-76d8477ff8-bb2t9   0/1     Pending   0          3m10s   <none>         <none>   <none>           <none>
prod-76d8477ff8-gtj2t   0/1     Pending   0          3m10s   <none>         <none>   <none>           <none>
prod-76d8477ff8-hwskt   1/1     Running   0          3m10s   10.244.1.175   node02   <none>           <none>
prod-76d8477ff8-hzz4j   0/1     Pending   0          3m10s   <none>         <none>   <none>           <none>
- The cluster has only two schedulable worker nodes (the master is tainted), so only two pods could be scheduled; the rest remain Pending, as expected
5.5 Look at one of the Pending pods
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  29s (x5 over 4m54s)  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules
- The scheduling failure reasons match expectations: the master is tainted, and the two worker nodes already run an app=prod pod, which violates the anti-affinity rule. If Pending pods are not acceptable, a preferred anti-affinity rule can be used instead, as sketched below
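As with pod affinity, anti-affinity also has a soft form; a minimal sketch that prefers to spread the app=prod pods across nodes but still schedules them when spreading is impossible:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: prod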