
    A Roundup of Kubernetes Troubleshooting Notes <1>

    While deploying fluentd-elasticsearch logging I ran into a whole pile of problems, and working through them taught me quite a lot.

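    How to delete an rc, deployment, or service left in an inconsistent state

    In some situations the kubectl process hangs partway through a delete. A subsequent get then shows that only half of the resources were removed and the rest cannot be deleted: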

    [root@k8s-master ~]# kubectl get -f fluentd-elasticsearch/
    NAME DESIRED CURRENT READY AGE
    rc/elasticsearch-logging-v1 0 2 2 15h
    
    NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
    deploy/kibana-logging 0 1 1 1 15h
    Error from server (NotFound): services "elasticsearch-logging" not found
    Error from server (NotFound): daemonsets.extensions "fluentd-es-v1.22" not found
    Error from server (NotFound): services "kibana-logging" not found
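
    To see at a glance which resources are already gone, the NotFound lines in the output above can be filtered out. This is a minimal sketch; list_not_found is a hypothetical helper name, and it only assumes the error format shown in the listing.

```shell
# Hypothetical helper: read `kubectl get -f DIR/` output on stdin and print
# "KIND NAME" for every resource the server reports as NotFound.
list_not_found() {
    sed -n 's/^ *Error from server (NotFound): \([^ ]*\) "\([^"]*\)" not found$/\1 \2/p'
}
```

    kubectl prints these errors on stderr, so a call would look like: kubectl get -f fluentd-elasticsearch/ 2>&1 | list_not_found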
    

    The commands for deleting these deployments, services, or rcs are as follows:

    kubectl delete deployment kibana-logging -n kube-system --cascade=false
    
    kubectl delete deployment kibana-logging -n kube-system  --ignore-not-found
    
    kubectl delete rc elasticsearch-logging-v1 -n kube-system --force --grace-period=0
    

    How to reset etcd when the resources still cannot be deleted

    rm -rf  /var/lib/etcd/*
    

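    Note that this wipes all cluster state held in etcd. After deleting the data, reboot the master node.

    After resetting etcd, the network configuration has to be set again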

    etcdctl mk /atomic.io/network/config '{ "Network": "192.168.0.0/16" }'
    

    apiserver fails to start

    Every start attempt reports:

    start request repeated too quickly for kube-apiserver.service
    

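    This is not really a start-frequency problem, though; the underlying cause is in /var/log/messages. In my case, after enabling ServiceAccount the apiserver could not find files such as ca.crt, so startup failed: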

    May 21 07:56:41 k8s-master kube-apiserver: Flag --port has been deprecated, see --insecure-port instead.
    May 21 07:56:41 k8s-master kube-apiserver: F0521 07:56:41.692480 4299 universal_validation.go:104] Validate server run options failed: unable to load client CA file: open /var/run/kubernetes/ca.crt: no such file or directory
    May 21 07:56:41 k8s-master systemd: kube-apiserver.service: main process exited, code=exited, status=255/n/a
    May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.
    May 21 07:56:41 k8s-master systemd: Unit kube-apiserver.service entered failed state.
    May 21 07:56:41 k8s-master systemd: kube-apiserver.service failed.
    May 21 07:56:41 k8s-master systemd: kube-apiserver.service holdoff time over, scheduling restart.
    May 21 07:56:41 k8s-master systemd: start request repeated too quickly for kube-apiserver.service
    May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.
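
    The systemd messages in a log like this bury the one line that matters. A minimal sketch for pulling out only the fatal apiserver lines, assuming the log format shown above (apiserver_fatal_lines is a hypothetical helper name):

```shell
# Hypothetical helper: print only the fatal kube-apiserver lines from a
# syslog-style file such as /var/log/messages.
apiserver_fatal_lines() {
    # In the log above, the fatal message is tagged "F" followed by the
    # month and day (F0521), which is what this pattern matches.
    grep -E 'kube-apiserver: F[0-9]{4} ' "$1"
}
```

    For example: apiserver_fatal_lines /var/log/messages | tail -n 1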
    

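    When deploying logging components such as fluentd, many problems come from the fact that the ServiceAccount option must be enabled and that in turn requires security settings, so in the end ServiceAccount has to be configured properly.

    Permission denied errors

    When configuring fluentd, the error cannot create /var/log/fluentd.log: Permission denied appears because SELinux has not been turned off.

    To fix it, change SELINUX=enforcing to SELINUX=disabled in /etc/selinux/config, then reboot.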
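
    The edit to /etc/selinux/config can also be scripted. The following is a minimal sketch, not a definitive procedure; selinux_disable_in is a hypothetical helper name, and it keeps a .bak copy so the change can be undone.

```shell
# Hypothetical helper: switch SELINUX=enforcing to SELINUX=disabled in the
# given config file, keeping the original as FILE.bak.
selinux_disable_in() {
    cfg=$1
    cp "$cfg" "$cfg.bak" &&
    sed 's/^SELINUX=enforcing$/SELINUX=disabled/' "$cfg.bak" > "$cfg"
}
```

    Usage would be selinux_disable_in /etc/selinux/config followed by a reboot; afterwards getenforce should report Disabled.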

    ServiceAccount-based configuration

    First generate the keys that are needed; replace k8s-master with the hostname of your master.

    openssl genrsa -out ca.key 2048
    openssl req -x509 -new -nodes -key ca.key -subj "/CN=k8s-master" -days 10000 -out ca.crt
    openssl genrsa -out server.key 2048
    
    echo subjectAltName=IP:10.254.0.1 > extfile.cnf
    
    # the IP is determined by the following command
    
    #kubectl get services --all-namespaces |grep 'default'|grep 'kubernetes'|grep '443'|awk '{print $3}'
    
    
    openssl req -new -key server.key -subj "/CN=k8s-master" -out server.csr
    
    openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -extfile extfile.cnf -out server.crt -days 10000
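
    Before pointing the apiserver at server.crt, it can be worth confirming that the subjectAltName from extfile.cnf actually made it into the certificate. A minimal sketch, assuming the text format that openssl x509 -noout -text prints for IP entries ("IP Address:..."); has_san_ip is a hypothetical helper name.

```shell
# Hypothetical helper: read `openssl x509 -noout -text` output on stdin and
# succeed only if the Subject Alternative Name list contains the given IP.
has_san_ip() {
    # Escape the dots so they match literally in the regular expression.
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    grep -Eq "IP Address:${escaped}(,|[[:space:]]|\$)"
}
```

    For example: openssl x509 -in server.crt -noout -text | has_san_ip 10.254.0.1 && echo "SAN ok"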
    

    If you modify the parameters in the /etc/kubernetes/apiserver configuration file, starting with systemctl start kube-apiserver fails with this error:

    Validate server run options failed: unable to load client CA file: open /root/keys/ca.crt: permission denied
    

    The API Server can, however, be started from the command line:

    /usr/bin/kube-apiserver --logtostderr=true --v=0 --etcd-servers=http://k8s-master:2379 --address=0.0.0.0 --port=8080 --kubelet-port=10250 --allow-privileged=true --service-cluster-ip-range=10.254.0.0/16 --admission-control=ServiceAccount --insecure-bind-address=0.0.0.0 --client-ca-file=/root/keys/ca.crt --tls-cert-file=/root/keys/server.crt --tls-private-key-file=/root/keys/server.key --basic-auth-file=/root/keys/basic_auth.csv --secure-port=443 &>> /var/log/kubernetes/kube-apiserver.log &
    

    Starting the Controller-manager from the command line

    /usr/bin/kube-controller-manager --logtostderr=true --v=0 --master=http://k8s-master:8080 --root-ca-file=/root/keys/ca.crt --service-account-private-key-file=/root/keys/server.key &>> /var/log/kubernetes/kube-controller-manager.log &
    

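    etcd will not start: problem <1>

    etcd plays the role of ZooKeeper in a Kubernetes cluster: almost every service depends on etcd being up, for example flanneld, apiserver, docker, and so on.

    The error log when starting etcd is as follows: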

    May 24 13:39:09 k8s-master systemd: Stopped Flanneld overlay address etcd agent.
    May 24 13:39:28 k8s-master systemd: Starting Etcd Server...
    May 24 13:39:28 k8s-master etcd: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd:2379,http://etcd:4001
    May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag 
    May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag 
    May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag 
    May 24 13:39:28 k8s-master etcd: etcd Version: 3.1.3
    May 24 13:39:28 k8s-master etcd: Git SHA: 21fdcc6
    May 24 13:39:28 k8s-master etcd: Go Version: go1.7.4
    May 24 13:39:28 k8s-master etcd: Go OS/Arch: linux/amd64
    May 24 13:39:28 k8s-master etcd: setting maximum number of CPUs to 1, total number of available CPUs is 1
    May 24 13:39:28 k8s-master etcd: the server is already initialized as member before, starting as etcd member...
    May 24 13:39:28 k8s-master etcd: listening for peers on http://localhost:2380
    May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:2379
    May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:4001
    May 24 13:39:28 k8s-master etcd: recovered store from snapshot at index 140014
    May 24 13:39:28 k8s-master etcd: name = master
    May 24 13:39:28 k8s-master etcd: data dir = /var/lib/etcd/default.etcd
    May 24 13:39:28 k8s-master etcd: member dir = /var/lib/etcd/default.etcd/member
    May 24 13:39:28 k8s-master etcd: heartbeat = 100ms
    May 24 13:39:28 k8s-master etcd: election = 1000ms
    May 24 13:39:28 k8s-master etcd: snapshot count = 10000
    May 24 13:39:28 k8s-master etcd: advertise client URLs = http://etcd:2379,http://etcd:4001
    May 24 13:39:28 k8s-master etcd: ignored file 0000000000000001-0000000000012700.wal.broken in wal
    May 24 13:39:29 k8s-master etcd: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 148905
    May 24 13:39:29 k8s-master etcd: 8e9e05c52164694d became follower at term 12
    May 24 13:39:29 k8s-master etcd: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 12, commit: 148905, applied: 140014, lastindex: 148905, lastterm: 12]
    May 24 13:39:29 k8s-master etcd: enabled capabilities for version 3.1
    May 24 13:39:29 k8s-master etcd: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
    May 24 13:39:29 k8s-master etcd: set the cluster version to 3.1 from store
    May 24 13:39:29 k8s-master etcd: starting server... [version: 3.1.3, cluster version: 3.1]
    May 24 13:39:29 k8s-master etcd: raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory
    May 24 13:39:29 k8s-master systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
    May 24 13:39:29 k8s-master systemd: Failed to start Etcd Server.
    May 24 13:39:29 k8s-master systemd: Unit etcd.service entered failed state.
    May 24 13:39:29 k8s-master systemd: etcd.service failed.
    May 24 13:39:29 k8s-master systemd: etcd.service holdoff time over, scheduling restart.
    

    The key line

    raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory

    Go into that directory and delete 0.tmp, and etcd will start again.
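
    To check for leftovers of this kind before deleting anything, the WAL directory can be scanned for stray *.tmp directories. This is a minimal sketch based only on the error message above; find_stale_wal_tmp is a hypothetical helper name.

```shell
# Hypothetical helper: list leftover "*.tmp" directories directly under an
# etcd WAL directory (the log above complains about wal/0.tmp).
find_stale_wal_tmp() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -name '*.tmp'
}
```

    For example: find_stale_wal_tmp /var/lib/etcd/default.etcd/member/wal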
    

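    etcd will not start: timeout problem <2>

    Background: three etcd nodes were deployed, and one day all three machines lost power and went down. After they were restarted the K8s cluster could be used normally, but a check of the components showed that etcd on one node would not start.

    Investigation showed that the system clock was wrong. After correcting the time with ntpdate ntp.aliyun.com and restarting etcd, it still would not come up, with the following errors: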

    Mar 05 14:27:15 k8s-node2 etcd[3248]: etcd Version: 3.3.13
    Mar 05 14:27:15 k8s-node2 etcd[3248]: Git SHA: 98d3084
    Mar 05 14:27:15 k8s-node2 etcd[3248]: Go Version: go1.10.8
    Mar 05 14:27:15 k8s-node2 etcd[3248]: Go OS/Arch: linux/amd64
    Mar 05 14:27:15 k8s-node2 etcd[3248]: setting maximum number of CPUs to 4, total number of available CPUs is 4
    Mar 05 14:27:15 k8s-node2 etcd[3248]: the server is already initialized as member before, starting as etcd member
    ...
    Mar 05 14:27:15 k8s-node2 etcd[3248]: peerTLS: cert = /opt/etcd/ssl/server.pem, key = /opt/etcd/ssl/server-key.pem, ca = , trusted-ca = /opt/etcd/ssl/ca.pem, client-cert-auth = false, crl-file = 
    Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for peers on https://192.168.25.226:2380
    Mar 05 14:27:15 k8s-node2 etcd[3248]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
    Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 127.0.0.1:2379
    Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 192.168.25.226:2379
    Mar 05 14:27:15 k8s-node2 etcd[3248]: member 9c166b8b7cb6ecb8 has already been bootstrapped
    Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
    Mar 05 14:27:15 k8s-node2 systemd[1]: Failed to start Etcd Server.
    Mar 05 14:27:15 k8s-node2 systemd[1]: Unit etcd.service entered failed state.
    Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
    Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
    Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service holdoff time over, scheduling restart.
    Mar 05 14:27:15 k8s-node2 systemd[1]: Starting Etcd Server...
    Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
    Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
    Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_PEER_URLS, but unused: shadowed by corresponding flag
    Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
    Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS, but unused: shadowed by corresponding flag
    Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_ADVERTISE_CLIENT_URLS, but unused: shadowed by corresponding flag
    Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER, but unused: shadowed by corresponding flag
    Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_TOKEN, but unused: shadowed by corresponding flag
    Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_STATE, but unused: shadowed by corresponding flag
    
    

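    Solution:

    The logs show no particularly obvious error apart from the line member 9c166b8b7cb6ecb8 has already been bootstrapped. From experience, one broken etcd node out of three has little effect on the cluster, which is why the cluster is already usable, but the broken etcd node still has not started. The fix is as follows: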

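    1. Go into etcd's data directory and back it up

      Back up the existing data (-r is needed because member/ contains subdirectories such as wal):
      mkdir -p /data/bak
      cd /var/lib/etcd/default.etcd/member/
      cp -r *  /data/bak/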
      
      
    2. Delete all data files in this directory

      rm -rf /var/lib/etcd/default.etcd/member/* 
      
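    3. Stop etcd on the other two nodes as well, because the etcd members need to be started together; once they have started successfully the cluster can be used.

      master node
      systemctl stop etcd
      systemctl restart etcd
      
      node1
      systemctl stop etcd
      systemctl restart etcd
      
      node2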
      systemctl stop etcd
      systemctl restart etcd
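
    Steps 1 and 2 above can be combined so that the member directory is only emptied once the backup copy has succeeded. This is a minimal sketch under that assumption; backup_and_clear is a hypothetical helper name, and the timestamped folder layout is an illustrative choice rather than something from the original procedure.

```shell
# Hypothetical helper: copy everything under MEMBER_DIR into a timestamped
# folder below BACKUP_ROOT, and only then empty MEMBER_DIR.
# Prints the path of the backup folder on success.
backup_and_clear() {
    member_dir=$1
    backup_dir=$2/member-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$backup_dir" &&
    cp -a "$member_dir"/. "$backup_dir"/ &&
    rm -rf "${member_dir:?}"/* &&
    printf '%s\n' "$backup_dir"
}
```

    For example: backup_and_clear /var/lib/etcd/default.etcd/member /data/bak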
      

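    Setting up host trust (passwordless SSH) on CentOS

    • On every server, as the user that needs host trust, run the following command to generate a public/private key pair; pressing Enter to accept the defaults is enough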
    ssh-keygen -t rsa 
    

    This generates the public key file.

    • Exchange public keys; a password is required the first time, after which no password is needed
    ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.199.132 (-p 2222)
    

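    -p sets the port. Leave out -p for the default port; if the SSH port has been changed, -p is required.

    An authorized_keys file is created under .ssh/ on the target server, recording the public keys of the other servers that are allowed to log in to it.

    • Test whether login works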
    ssh 192.168.199.132 (-p 2222)
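
    With more than a couple of servers, the key exchange above is easier to run as a loop. A minimal sketch, assuming the same root user and key path as the commands above; push_key_to_hosts is a hypothetical helper name, and any host other than 192.168.199.132 is a placeholder.

```shell
# Hypothetical helper: push the local public key to every host in turn.
# First argument is the SSH port, the rest are host addresses.
# Set DRY_RUN=1 to print the commands instead of running them.
push_key_to_hosts() {
    port=$1
    shift
    for host in "$@"; do
        ${DRY_RUN:+echo} ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" -p "$port" "root@$host"
    done
}
```

    For example, DRY_RUN=1 push_key_to_hosts 22 192.168.199.132 shows what would be run without contacting the host.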
    

    Changing the hostname on CentOS

    hostnamectl set-hostname k8s-master1
    

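    Enabling copy and paste for a CentOS guest in VirtualBox

    If a package is not installed or a command produces no output, change update to install and run it again.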

    yum update
    yum update kernel
    yum update kernel-devel
    yum install kernel-headers
    yum install gcc make
    

    When these have finished, run sh VBoxLinuxAdditions.run

    Pod stuck in the Terminating state on deletion

    It can be force-deleted with the following command:

    kubectl delete pod NAME --grace-period=0 --force
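
    To find which pods need this treatment, the STATUS column of kubectl get pods --all-namespaces can be filtered. A minimal sketch, assuming the default column order NAMESPACE NAME READY STATUS RESTARTS AGE; terminating_pods is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces` output on
# stdin and print "NAMESPACE NAME" for each pod whose STATUS is Terminating.
terminating_pods() {
    # NR > 1 skips the header row; $4 is the STATUS column.
    awk 'NR > 1 && $4 == "Terminating" { print $1, $2 }'
}
```

    For example: kubectl get pods --all-namespaces | terminating_pods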
    

    Namespace stuck in the Terminating state on deletion

    It can be force-deleted with the following script:

    [root@k8s-master1 k8s]# cat delete-ns.sh 
    #!/bin/bash
    set -e
    
    usage(){
        echo "usage:"
        echo "  delete-ns.sh NAMESPACE"
    }
    
    if [ $# -lt 1 ];then
        usage
        exit 1
    fi
    
    NAMESPACE=$1
    JSONFILE=${NAMESPACE}.json
    kubectl get ns "${NAMESPACE}" -o json > "${JSONFILE}"
    # In the editor, remove the entries under spec.finalizers, then save.
    vi "${JSONFILE}"
    # The URL below assumes `kubectl proxy` is serving on its default port, 8001.
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @"${JSONFILE}" \
        http://127.0.0.1:8001/api/v1/namespaces/"${NAMESPACE}"/finalize
    

    What can go wrong when a container has valid CPU/memory requests but no limits?

    Below we create such a container, with requests set but no limits:

    - name: busybox-cnt02
      image: busybox
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo hello from cnt02; sleep 10;done"]
      resources:
        requests:
          memory: "100Mi"
          cpu: "100m"
    

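    What problems does a container created this way have?
    In a normal environment there is no real problem. For resource-intensive pods, however, if some containers have no limits set, resources can be taken by other pods, which may cause the containerized application to fail. A LimitRange policy can handle this by having defaults applied to pods automatically, provided the LimitRange rules are configured in advance.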
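
    As an illustration of such a LimitRange rule, the sketch below writes a manifest that gives containers default limits and default requests when they set none. The object name default-limits and all CPU/memory values are assumptions chosen for the example, and write_limitrange is a hypothetical helper name.

```shell
# Hypothetical helper: write a LimitRange manifest to the given file.
# The name and all resource values below are illustrative assumptions.
write_limitrange() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - type: Container
    default:            # limits applied when a container sets none
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:     # requests applied when a container sets none
      cpu: "100m"
      memory: "100Mi"
EOF
}
```

    After write_limitrange limitrange.yaml, the file would be applied to the target namespace with kubectl apply -f limitrange.yaml -n NAMESPACE; the defaults take effect only for pods created afterwards.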

  • Original article: https://www.cnblogs.com/passzhang/p/12420488.html