1. Check whether CoreDNS has started normally
kubectl -n kube-system get po|grep core
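The check above relies on reading the output by eye. As a minimal sketch, the same check can be scripted. The function name `coredns_unhealthy` is hypothetical, and it assumes the default `kubectl get po` column order (NAME, READY, STATUS, ...) and that the pod names start with `coredns`.

```shell
# Reads `kubectl get po` style output on stdin and prints the names of
# CoreDNS pods that are not in the Running phase or not fully ready.
# Assumed column order: NAME READY STATUS RESTARTS AGE
coredns_unhealthy() {
  awk '
    $1 ~ /^coredns/ {
      split($2, ready, "/")            # READY looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }
  '
}
```

Possible usage: `kubectl -n kube-system get po --no-headers | coredns_unhealthy`. No output means every CoreDNS pod found is Running and ready.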
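2. If it has not, and you have confirmed the YAML configuration is correct, add nodeName to the pod spec in the CoreDNS Deployment YAML (spec.template.spec) to try scheduling the pod onto another node. This shows whether the node is the cause. Alternatively, go straight to step 4 to find the problem node.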
spec:
  nodeName: 1.1.1.1
  containers:
  - name: xxx
    image: xxx
    ports:
    - containerPort: 8080
3. If the CoreDNS pod starts and runs after being pinned with nodeName, verify that CoreDNS works by resolving a service name from any node. Example:
# CoreDNS address: 10.96.0.10
# Service name of any service: tiller-deploy
nslookup tiller-deploy.kube-system.svc.cluster.local 10.96.0.10
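To repeat this lookup for several services or from several nodes, a small helper can build the name and run the query. This is only a sketch: `svc_fqdn` and `check_svc_dns` are hypothetical names, `cluster.local` is assumed as the cluster domain (the default, but configurable), and the result relies on `nslookup` returning a non-zero exit status on failure, which may vary between implementations.

```shell
# Build the in-cluster DNS name of a Service.
# $1 = service name, $2 = namespace, $3 = cluster domain (assumed default: cluster.local)
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Ask one DNS server to resolve a Service name.
# $1 = DNS server IP, $2 = service name, $3 = namespace
# Returns the exit status of nslookup.
check_svc_dns() {
  nslookup "$(svc_fqdn "$2" "$3")" "$1" >/dev/null 2>&1
}
```

Possible usage: `check_svc_dns 10.96.0.10 tiller-deploy kube-system && echo ok || echo failed`.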
4. If resolution fails on a particular node, test whether that node can reach the pod
# Pod IP (example): 10.53.5.165
ping 10.53.5.165
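5. If the connectivity test fails, check whether flannel is running normally on the problem node. If it is running, continue with the checks below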
# 1. Check the flannel container on the problem node
docker ps | grep flannel
# 2. Check the state of the flannel interface
ifconfig flannel.1
# 3. Compare the routing table with a healthy node's to see whether any routes are missing. Example:
route -n
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.2.0      10.244.2.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.3.0      10.244.3.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.4.0      10.244.4.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.5.0      10.244.5.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.6.0      10.244.6.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.7.0      10.244.7.0      255.255.255.0   UG    0      0        0 flannel.1
# If any of the checks above look wrong, check whether the NetworkManager service is running; it can break flannel
# Check its status
systemctl status NetworkManager
# Stop and disable it
systemctl stop NetworkManager && systemctl disable NetworkManager
# If this service is the cause, the logs contain entries like:
device (flannel.1): state change: unmanaged -> unavailable (reason 'connection-assumed')
# Also check the DNS resolver addresses configured on the node
cat /etc/resolv.conf
# Remove the flannel container so that it is recreated
docker rm -f flannel
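Comparing routing tables by eye is error-prone when there are many nodes. As a sketch, the comparison in item 3 can be done on saved `route -n` output. The function names are hypothetical, and the code assumes the standard `route -n` column layout, where column 1 is the destination and column 8 is the interface.

```shell
# Print the destination networks routed via flannel.1, one per line, sorted.
# Reads `route -n` output on stdin.
flannel_routes() {
  awk '$8 == "flannel.1" { print $1 }' | sort
}

# Print subnets that the healthy node routes via flannel.1 but the suspect node does not.
# $1 = file holding `route -n` output from a healthy node
# $2 = file holding `route -n` output from the suspect node
missing_flannel_routes() {
  _mfr_healthy=$(mktemp)
  _mfr_suspect=$(mktemp)
  flannel_routes < "$1" > "$_mfr_healthy"
  flannel_routes < "$2" > "$_mfr_suspect"
  comm -23 "$_mfr_healthy" "$_mfr_suspect"   # lines only in the healthy list
  rm -f "$_mfr_healthy" "$_mfr_suspect"
}
```

Possible usage: save `route -n > /tmp/routes.txt` on each node, copy both files to one machine, then run `missing_flannel_routes healthy.txt suspect.txt`. The suspect node's own pod subnet will normally appear in the output, because a node typically reaches its local subnet through the bridge interface rather than flannel.1; any other subnet listed is worth investigating.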
6. If flannel is not running
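# 1. Check whether kubelet is running
netstat -tnlp | grep kubelet
# 2. If it is not running, check whether swap is enabled; that is the usual cause
# Disable swap temporarily (does not survive a reboot)
swapoff -a
# Disable swap permanently
sed -ri 's/.*swap.*/#&/' /etc/fstab
# Confirm with free that swap is off
free -m
              total        used        free      shared  buff/cache   available
Swap:             0           0           0
# Start kubelet, which will bring flannel up
systemctl start kubelet
# Once kubelet and flannel are both running normally, continue troubleshooting with step 5 and test service availability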
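The swap check can also be scripted so that it runs before kubelet is started. This is a minimal sketch that parses `free -m` text; the function names `swap_total_mb` and `swap_is_off` are hypothetical.

```shell
# Reads `free -m` output on stdin and prints the total swap size in MiB.
swap_total_mb() {
  awk '$1 == "Swap:" { print $2 }'
}

# Reads `free -m` output on stdin; succeeds (exit 0) only when total swap is 0.
swap_is_off() {
  [ "$(swap_total_mb)" = "0" ]
}
```

Possible usage: `free -m | swap_is_off && systemctl start kubelet`.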