  • 「Bug」Deleting a ReplicaSet does not cascade-delete its Pods

    2020-07-17

    Problem description

    Kubernetes' cascading management had failed across the board:

    1. Deleting a ReplicaSet does not cascade-delete its Pods
    2. Deleting a CronJob does not cascade-delete its Jobs
    3. Deleting a Job does not delete the corresponding Pods either
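
    The failure class is easy to see with a throwaway Deployment, since deleting it should remove its ReplicaSet and Pods via the garbage collector (the name and image below are illustrative):

    # Create a Deployment; its ReplicaSet owns the Pods via ownerReferences
    kubectl create deployment rs-test --image=nginx

    # Deleting the Deployment should cascade-delete the ReplicaSet and Pods...
    kubectl delete deployment rs-test
    # ...but in this cluster the ReplicaSet and Pods were left behind:
    kubectl get rs,pods -l app=rs-test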

    Troubleshooting

    Some research confirmed that cascading deletion is the garbage collector's job. First checked the kubelet gc logs (the gc that cleans up containers and images), which turned up nothing conclusive.
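
    For reference, assuming a systemd-managed kubelet (the usual setup), those gc logs can be grepped with something like:

    # Kubelet image/container gc activity; image_gc_manager is the kubelet's
    # image-cleanup component (unrelated to API-object cascading deletion)
    journalctl -u kubelet --no-pager | grep -iE 'image_gc_manager|garbage'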

    ReplicaSet and CronJob are both controller-managed types, so the next stop was the kube-controller-manager logs. The logs on the three master nodes each looked different, and contained obvious errors:
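
    On a kubeadm-style cluster the control-plane components run as static pods, so the logs can be compared per master like this (pod names are illustrative; the same pattern works for kube-apiserver):

    # One controller-manager pod per master node
    kubectl -n kube-system get pods -l component=kube-controller-manager -o wide

    # Tail a specific master's controller-manager log
    kubectl -n kube-system logs kube-controller-manager-master1 --tail=200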

    Master node 1, controller-manager errors:

    I0703 17:42:50.697659       1 serving.go:319] Generated self-signed cert in-memory
    I0703 17:42:52.253742       1 controllermanager.go:161] Version: v1.16.0
    I0703 17:42:52.258189       1 secure_serving.go:123] Serving securely on 127.0.0.1:10257
    I0703 17:42:52.261809       1 deprecated_insecure_serving.go:53] Serving insecurely on [::]:10252
    I0703 17:42:52.261993       1 leaderelection.go:241] attempting to acquire leader lease  kube-system/kube-controller-manager...
    E0703 20:39:10.062914       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: rpc error: code = Unavailable desc = etcdserver: leader changed
    E0706 10:37:12.396567       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: net/http: TLS handshake timeout
    E0706 10:37:16.441577       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: endpoints "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "endpoints" in API group "" in the namespace "kube-system"
    E0706 10:37:18.598949       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
    E0706 10:37:21.205271       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
    E0706 10:37:24.042719       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
    E0706 10:37:26.528240       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
    E0706 10:37:29.040759       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
    E0706 10:37:31.755211       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 192.168.1.90:6443: connect: connection refused
    E0706 10:37:43.769537       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get https://apiserver.cluster.local:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    E0706 10:37:46.599186       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: endpoints "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "endpoints" in API group "" in the namespace "kube-system"
    

    Master node 2, controller-manager errors:

    /apis/monitoring.coreos.com/v1/prometheusrules?limit=500&resourceVersion=0: stream error: stream ID 304059; INTERNAL_ERROR
    E0707 16:05:15.833656       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/thanosrulers?limit=500&resourceVersion=0: stream error: stream ID 304063; INTERNAL_ERROR
    E0707 16:05:16.067473       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0: stream error: stream ID 304065; INTERNAL_ERROR
    E0707 16:05:16.718849       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/prometheusrules?limit=500&resourceVersion=0: stream error: stream ID 304071; INTERNAL_ERROR
    E0707 16:05:16.841991       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/thanosrulers?limit=500&resourceVersion=0: stream error: stream ID 304073; INTERNAL_ERROR
    E0707 16:05:17.070573       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0: stream error: stream ID 304075; INTERNAL_ERROR
    E0707 16:05:17.721035       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/prometheusrules?limit=500&resourceVersion=0: stream error: stream ID 304077; INTERNAL_ERROR
    E0707 16:05:17.850094       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/thanosrulers?limit=500&resourceVersion=0: stream error: stream ID 304079; INTERNAL_ERROR
    E0707 16:05:18.073291       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0: stream error: stream ID 304081; INTERNAL_ERROR
    E0707 16:05:18.224933       1 shared_informer.go:200] unable to sync caches for garbage collector
    E0707 16:05:18.224983       1 garbagecollector.go:230] timed out waiting for dependency graph builder sync during GC sync (attempt 803)
    E0707 16:05:18.388394       1 namespace_controller.go:148] deletion of namespace monitoring failed: [Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors: stream error: stream ID 1728289; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules: stream error: stream ID 1728343; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/thanosrulers: stream error: stream ID 1728381; INTERNAL_ERROR]
    E0707 16:05:18.723249       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/prometheusrules?limit=500&resourceVersion=0: stream error: stream ID 304087; INTERNAL_ERROR
    E0707 16:05:18.859193       1 reflector.go:123] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: Get https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/thanosrulers?limit=500&resourceVersion=0: stream error: stream ID 304089; INTERNAL_ERROR
    

    Master node 3, controller-manager log (startup messages only; apparently a standby still waiting to acquire the leader lease):

    I0706 21:13:58.923131       1 serving.go:319] Generated self-signed cert in-memory
    I0706 21:13:59.235491       1 controllermanager.go:161] Version: v1.16.0
    I0706 21:13:59.235870       1 secure_serving.go:123] Serving securely on 127.0.0.1:10257
    I0706 21:13:59.236162       1 deprecated_insecure_serving.go:53] Serving insecurely on [::]:10252
    I0706 21:13:59.236201       1 leaderelection.go:241] attempting to acquire leader lease  kube-system/kube-controller-manager...
    

    So the controller-manager's requests to the apiserver were failing with INTERNAL_ERROR (master node 2, the one with the GC errors, was evidently the active leader). Note also the lines "unable to sync caches for garbage collector" and "timed out waiting for dependency graph builder sync": the garbage collector refuses to process deletions until its informer caches have synced for every discovered resource type, so a single broken API group is enough to stall cascading deletion cluster-wide. Next, the apiserver itself. Master node 2's apiserver had obvious error logs:

    goroutine 9351585 [running]:
    k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3a32ce0, 0xc01f4aa850)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
    k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc029cfdcd8, 0x1, 0x1)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
    panic(0x3a32ce0, 0xc01f4aa850)
            /usr/local/go/src/runtime/panic.go:522 +0x1b5
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc00ca32240, 0x7b10de0, 0xc02014a310, 0xc0340b7d00)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:118 +0x3ef
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x7b10de0, 0xc02014a310, 0xc0340b7c00)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/waitgroup.go:47 +0xf3
    net/http.HandlerFunc.ServeHTTP(0xc004f981e0, 0x7b10de0, 0xc02014a310, 0xc0340b7c00)
            /usr/local/go/src/net/http/server.go:1995 +0x44
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x7b10de0, 0xc02014a310, 0xc0340b7b00)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x2b8
    net/http.HandlerFunc.ServeHTTP(0xc004f98210, 0x7b10de0, 0xc02014a310, 0xc0340b7b00)
            /usr/local/go/src/net/http/server.go:1995 +0x44
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x7b10de0, 0xc02014a310, 0xc0340b7b00)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
    net/http.HandlerFunc.ServeHTTP(0xc00ca32260, 0x7b10de0, 0xc02014a310, 0xc0340b7b00)
            /usr/local/go/src/net/http/server.go:1995 +0x44
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x7b04720, 0xc013c82c18, 0xc0340b7a00)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:89 +0x29c
    net/http.HandlerFunc.ServeHTTP(0xc00ca32280, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
            /usr/local/go/src/net/http/server.go:1995 +0x44
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x7b04720, 0xc013c82c18, 0xc0340b7a00)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:51 +0x105
    net/http.HandlerFunc.ServeHTTP(0xc00ca322a0, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
            /usr/local/go/src/net/http/server.go:1995 +0x44
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc004f98240, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:189 +0x51
    net/http.serverHandler.ServeHTTP(0xc00172a0d0, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
            /usr/local/go/src/net/http/server.go:2774 +0xa8
    net/http.initNPNRequest.ServeHTTP(0xc0169cee00, 0xc00172a0d0, 0x7b04720, 0xc013c82c18, 0xc0340b7a00)
            /usr/local/go/src/net/http/server.go:3323 +0x8d
    k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler(0xc00a8e1080, 0xc013c82c18, 0xc0340b7a00, 0xc0004b9be0)
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2125 +0x89
    created by k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).processHeaders
            /workspace/anago-v1.16.0-rc.2.1+2bd9643cee5b3b/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:1859 +0x4f4
    E0707 16:00:11.468119       1 wrap.go:39] apiserver panic'd on GET /apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0
    I0707 16:00:11.468205       1 log.go:172] http2: panic serving 192.168.1.91:37280: runtime error: invalid memory address or nil pointer dereference
    ...
    

    The error shows that while handling GET /apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0, the apiserver panicked with: runtime error: invalid memory address or nil pointer dereference.
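
    The failing request can be probed directly against the apiserver (note: the controller-manager's metadata informers ask for metadata-only responses via an Accept header, which kubectl get --raw does not set, so this probe may not hit the identical code path):

    # Replay the request the informer was making; in this cluster the
    # same endpoint kept surfacing http2 INTERNAL_ERROR / stream errors
    kubectl get --raw "/apis/monitoring.coreos.com/v1/servicemonitors?limit=500&resourceVersion=0"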

    That looks like a monitoring-related API. Searching the cluster for monitoring resources turned up the monitoring namespace:

    [root@192-168-1-90 ~]# kubectl get ns
    NAME                   STATUS        AGE
    ......  # several namespaces omitted
    default                Active        86d
    istio-system           Active        69d
    kube-node-lease        Active        86d
    kube-public            Active        86d
    kube-system            Active        86d
    kubernetes-dashboard   Active        86d
    monitoring             Terminating   30h
    

    monitoring is stuck in Terminating. Inspecting the namespace (e.g. with kubectl get ns monitoring -o yaml):

    apiVersion: v1
    kind: Namespace
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"v1","kind":"Namespace","metadata":{"annotations":{},"name":"monitoring"}}
      creationTimestamp: "2020-07-06T02:33:50Z"
      deletionTimestamp: "2020-07-06T02:36:59Z"
      name: monitoring
      resourceVersion: "56322781"
      selfLink: /api/v1/namespaces/monitoring
      uid: 2a41ac04-d86c-4086-9325-5c87dd2a15ac
    spec:
      finalizers:
      - kubernetes
    status:
      conditions:
      - lastTransitionTime: "2020-07-06T02:37:50Z"
        message: All resources successfully discovered
        reason: ResourcesDiscovered
        status: "False"
        type: NamespaceDeletionDiscoveryFailure
      - lastTransitionTime: "2020-07-06T02:37:50Z"
        message: All legacy kube types successfully parsed
        reason: ParsedGroupVersions
        status: "False"
        type: NamespaceDeletionGroupVersionParsingFailure
      - lastTransitionTime: "2020-07-06T02:37:50Z"
        message: 'Failed to delete all resource types, 3 remaining: Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules:
          stream error: stream ID 190291; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors:
          stream error: stream ID 190119; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/thanosrulers:
          stream error: stream ID 190153; INTERNAL_ERROR'
        reason: ContentDeletionFailed
        status: "True"
        type: NamespaceDeletionContentFailure
      phase: Terminating
    

    Tried the fix circulating online, modifying spec.finalizers to get the namespace removed, with no effect:

    # Manually edit the namespace and remove the spec.finalizers field; no effect.
    kubectl edit ns monitoring
    
    # A direct forced delete does not work either:
    [root@192-168-1-90 ~]# kubectl delete ns monitoring  --grace-period=0 --force
    warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
    Error from server (Conflict): Operation cannot be fulfilled on namespaces "monitoring": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.
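
    Side note: a plain kubectl edit cannot actually clear a namespace's finalizers, because spec.finalizers on a Namespace is only writable through the /finalize subresource. The workaround usually cited online is a raw call against that subresource, sketched below for completeness (it requires jq, and it is not what resolved this incident):

    # Strip the "kubernetes" finalizer via the finalize subresource,
    # forcing the namespace out of Terminating
    kubectl get ns monitoring -o json \
      | jq '.spec.finalizers = []' \
      | kubectl replace --raw /api/v1/namespaces/monitoring/finalize -f -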
    

    Looking more closely at the output above, status.conditions contains:

      - lastTransitionTime: "2020-07-06T02:37:50Z"
        message: 'Failed to delete all resource types, 3 remaining: Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules:
          stream error: stream ID 190291; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors:
          stream error: stream ID 190119; INTERNAL_ERROR, Delete https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/thanosrulers:
          stream error: stream ID 190153; INTERNAL_ERROR'
    

    From these messages, deleting the namespace also has to go through the apiserver endpoint https://apiserver.cluster.local:6443/apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors (and its siblings for prometheusrules and thanosrulers).
    Those endpoints keep returning INTERNAL_ERROR, so the namespace can never finish deleting.
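
    This is by design: before releasing a namespace, the namespace controller enumerates every namespaced resource type that supports delete and empties it, so one broken API group blocks the whole namespace. The list it has to work through can be approximated with:

    # Every namespaced, deletable resource type -- the namespace controller
    # must empty all of these, including the monitoring.coreos.com ones
    kubectl api-resources --namespaced=true --verbs=delete -o name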

    Filed an issue on GitHub: https://github.com/kubernetes/kubernetes/issues/92858

    Awaiting feedback.

    ===========

    Update: following a hint from the Kubernetes maintainers, upgraded straight to Kubernetes 1.16.2+; no problems since. (The panic appears to match a known apiserver bug in early 1.16 releases when serving metadata-only requests for custom resources, fixed in the 1.16.2 patch release.)

    Summary

    In short: while the monitoring namespace was being deleted, cascading deletion of the CoreOS monitoring CRDs tripped an apiserver panic, and that single failure stalled the garbage collector's dependency graph sync, knocking out Kubernetes' cascading management for the entire cluster.
    A butterfly effect, heh.
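
    For the record, the mechanism that failed here is driven by metadata.ownerReferences, which the garbage collector resolves through a cluster-wide dependency graph; that graph is exactly what could not sync. The owner link is visible on any controller-created object (the pod name is illustrative):

    # A Pod created by a ReplicaSet points back at it; the GC deletes the Pod
    # once the owner is gone
    kubectl get pod rs-test-abc12 \
      -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{end}'
    # Example output: ReplicaSet/rs-test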
