  • Installing k8s and the NVIDIA environment

    Installation environment

    System requirements

    CPU: 2 cores

    Memory: 2 GB

    GPU: NVIDIA series

    Installing Docker

    apt install docker.io

    Installing k8s

    Add the package source

    For convenience, switch the download server in Ubuntu's software settings to the Aliyun mirror.

    Screenshot from 2019-12-18 14-25-44

    Add the k8s package source to /etc/apt/sources.list:

    deb https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial main
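    As a minimal sketch of that step, the helper below appends the line only if it is not already present, so it can be run repeatedly. The function name add_apt_source is hypothetical, and writing to a separate file under /etc/apt/sources.list.d/ instead of /etc/apt/sources.list is an assumption, not what the original article does.

```shell
# Append a "deb ..." line to an apt list file unless the exact line exists.
# $1: list file, $2: full source line
add_apt_source() {
    local file="$1" line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Hypothetical usage (as root):
#   add_apt_source /etc/apt/sources.list.d/kubernetes.list \
#       "deb https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial main"
```

    Keeping third-party sources in their own file makes them easy to remove later without touching the main list.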

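    Refresh the package index with apt update.

    Problem: NO_PUBKEY

    apt update may fail with a missing-key error such as:

    NO_PUBKEY BA300B7755AFCFAE

    Import the key, passing the key ID that your own error message reports:

    apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 6A030B21BA07F4FB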
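    Because the key ID differs between setups, it can help to take it from the error output rather than typing it by hand. The sketch below assumes the error text contains NO_PUBKEY followed by a 16-digit hexadecimal ID, as in the message above; the function name missing_keys is hypothetical.

```shell
# Read apt output on stdin and print each missing key ID once per line.
missing_keys() {
    grep -o 'NO_PUBKEY [0-9A-F]\{16\}' | awk '{ print $2 }' | sort -u
}

# Hypothetical usage (as root, needs network access):
#   apt update 2>&1 | missing_keys | while read -r key; do
#       apt-key adv --recv-keys --keyserver keyserver.ubuntu.com "$key"
#   done
```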

    Problem: unmet dependencies

    Run apt update
    or apt --fix-broken install

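    Edit the hosts file

    vim /etc/hosts
    Comment out the line
    127.0.0.1 computer_name
    then add one entry per machine, using the IP addresses of the cluster you want to form: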
    192.168.9.103 master
    192.168.9.104 node1

    192.168.9.105 node2 ......

    #127.0.0.1      localhost
    #127.0.1.1      dell3
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    
    192.168.9.103 master
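    The same edit can be scripted. This is a sketch under assumptions: both function names are hypothetical, and unlike the file shown above it leaves the localhost line untouched and only comments out the loopback line that carries the machine's own name.

```shell
# Prefix "#" to loopback lines (127.x.x.x) whose first host name is $2.
comment_out_loopback_name() {
    local file="$1" name="$2" tmp
    tmp=$(mktemp) || return 1
    sed -E "s/^(127\.[0-9]+\.[0-9]+\.[0-9]+[[:space:]]+${name}([[:space:]]|\$))/#\1/" \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Append "<ip> <name>" unless an identical mapping is already present.
add_host_entry() {
    local file="$1" ip="$2" name="$3"
    awk -v ip="$ip" -v n="$name" '$1 == ip && $2 == n { found = 1 } END { exit !found }' "$file" ||
        printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Hypothetical usage (as root):
#   comment_out_loopback_name /etc/hosts dell3
#   add_host_entry /etc/hosts 192.168.9.103 master
#   add_host_entry /etc/hosts 192.168.9.104 node1
```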
    

    Installing kubeadm

    apt install kubeadm

    Installing kubeadm also installs kubectl and kubelet automatically.

    List the required images

    kubeadm config images list

    The output is:

    k8s.gcr.io/kube-apiserver:v1.17.0
    k8s.gcr.io/kube-controller-manager:v1.17.0
    k8s.gcr.io/kube-scheduler:v1.17.0
    k8s.gcr.io/kube-proxy:v1.17.0
    k8s.gcr.io/pause:3.1
    k8s.gcr.io/etcd:3.4.3-0
    k8s.gcr.io/coredns:1.6.5
    

    Pull these images from a mirror registry inside China

    docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.17.0
    docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.17.0
    docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.17.0
    docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.17.0
    docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1
    docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0
    docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5
    

    Retag them with docker tag so that they carry the k8s.gcr.io names kubeadm expects

    docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1 k8s.gcr.io/pause:3.1
    docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.17.0 k8s.gcr.io/kube-apiserver:v1.17.0
    docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.17.0 k8s.gcr.io/kube-controller-manager:v1.17.0
    docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.17.0 k8s.gcr.io/kube-scheduler:v1.17.0
    docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.17.0 k8s.gcr.io/kube-proxy:v1.17.0
    docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0 k8s.gcr.io/etcd:3.4.3-0
    docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5 k8s.gcr.io/coredns:1.6.5
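
    The pull and tag steps repeat the same pattern for every image, so they can be driven from the image list itself. The following is a sketch, not the author's method: the function names and the DRY_RUN switch are hypothetical, and it assumes the mirror stores each image under the same final name and tag as k8s.gcr.io, as the commands above suggest.

```shell
MIRROR=registry.cn-hangzhou.aliyuncs.com/google_containers

# Map "k8s.gcr.io/<name>:<tag>" to the same image under $MIRROR.
mirror_name() {
    printf '%s/%s\n' "$MIRROR" "${1##*/}"
}

# Read image names from stdin; pull each from the mirror and retag it.
# Set DRY_RUN=1 to print the docker commands instead of running them.
pull_and_retag() {
    local image src run=
    [ -n "${DRY_RUN:-}" ] && run=echo
    while read -r image; do
        [ -n "$image" ] || continue
        src=$(mirror_name "$image")
        $run docker pull "$src" && $run docker tag "$src" "$image"
    done
}

# Hypothetical usage:
#   kubeadm config images list | pull_and_retag
```

    Running it once with DRY_RUN=1 shows the exact commands before anything is downloaded.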
    

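    Configuring the master

    First disable swap

    swapoff -a

    Then initialize the cluster. The output below comes from a run where swap was still enabled: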

    root@dell3:~# kubeadm init --kubernetes-version=v1.17.0 --pod-network-cidr 192.168.0.0/16
    W1218 14:48:40.560734   20883 validation.go:28] Cannot validate kube-proxy config - no validator is available
    W1218 14:48:40.560767   20883 validation.go:28] Cannot validate kubelet config - no validator is available
    [init] Using Kubernetes version: v1.17.0
    [preflight] Running pre-flight checks
    	[WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
    	[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
    error execution phase preflight: [preflight] Some fatal errors occurred:
    	[ERROR Swap]: running with swap on is not supported. Please disable swap
    [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
    To see the stack trace of this error execute with --v=5 or higher
    
    

    The error message says swap must be disabled:

    swapoff -a

    Then run kubeadm init again.
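
    To avoid hitting the preflight error at all, swap can be checked before running kubeadm init. A minimal sketch, assuming the usual /proc/swaps layout of one header line followed by one line per active swap device; both function names are hypothetical.

```shell
# Print the devices listed in a /proc/swaps-style file (default: /proc/swaps).
active_swaps() {
    awk 'NR > 1 { print $1 }' "${1:-/proc/swaps}"
}

# Succeed only when no swap device is active.
swap_is_off() {
    [ -z "$(active_swaps "$@")" ]
}

# Hypothetical usage (as root):
#   swap_is_off || swapoff -a
```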

    When initialization completes, the output ends with:

    Your Kubernetes control-plane has initialized successfully!
    
    To start using your cluster, you need to run the following as a regular user:
    
      mkdir -p $HOME/.kube
      sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
      sudo chown $(id -u):$(id -g) $HOME/.kube/config
    
    You should now deploy a pod network to the cluster.
    Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
      https://kubernetes.io/docs/concepts/cluster-administration/addons/
    
    Then you can join any number of worker nodes by running the following on each as root:
    
    kubeadm join 192.168.9.103:6443 --token tn3e9a.6fgbdbu3vvus8ia9 \
        --discovery-token-ca-cert-hash sha256:ce5aa219f8fd1da40646997f2c3d27ee905989812b115146356ecfc9304036ba 
    
    

    Run the three commands as instructed:

      mkdir -p $HOME/.kube
      sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
      sudo chown $(id -u):$(id -g) $HOME/.kube/config
    

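    Add the pod network

    Note: the stock kube-flannel.yml sets Network to 10.244.0.0/16 in net-conf.json. That value should match the --pod-network-cidr passed to kubeadm init, so either initialize with 10.244.0.0/16 or edit the manifest before applying it.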

    kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

    The output on success is:

    podsecuritypolicy.policy/psp.flannel.unprivileged created
    clusterrole.rbac.authorization.k8s.io/flannel created
    clusterrolebinding.rbac.authorization.k8s.io/flannel created
    serviceaccount/flannel created
    configmap/kube-flannel-cfg created
    daemonset.apps/kube-flannel-ds-amd64 created
    daemonset.apps/kube-flannel-ds-arm64 created
    daemonset.apps/kube-flannel-ds-arm created
    daemonset.apps/kube-flannel-ds-ppc64le created
    daemonset.apps/kube-flannel-ds-s390x created
    

    Check the pods

    root@dell3:~# kubectl get pod
    No resources found in default namespace.
    root@dell3:~# kubectl get pod -n kube-system
    NAME                            READY   STATUS     RESTARTS   AGE
    coredns-6955765f44-gccbp        0/1     Pending    0          2m33s
    coredns-6955765f44-gl7zg        0/1     Pending    0          2m33s
    etcd-dell3                      1/1     Running    0          2m33s
    kube-apiserver-dell3            1/1     Running    0          2m33s
    kube-controller-manager-dell3   1/1     Running    0          2m33s
    kube-flannel-ds-amd64-rrhng     0/1     Init:0/1   0          70s
    kube-proxy-srnvg                1/1     Running    0          2m33s
    kube-scheduler-dell3            1/1     Running    0          2m33s
    
    

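    If the core k8s components are all in the Running state, k8s was installed successfully. The coredns pods stay Pending until the network plugin has finished initializing.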
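
    Scanning the table by eye gets tedious on a busy cluster. The sketch below filters the default kubectl get pod table down to pods that are neither Running nor Completed; it assumes the column order shown above (NAME, READY, STATUS), and the function name pods_not_ready is hypothetical.

```shell
# Read `kubectl get pod` table output on stdin and print the name and
# status of every pod whose STATUS is neither Running nor Completed.
pods_not_ready() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Hypothetical usage:
#   kubectl get pod -n kube-system | pods_not_ready
```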

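    Installing CUDA

    NVIDIA is said to recommend installing CUDA before installing the NVIDIA driver.

    ./cuda.run --override

    The --override flag skips the installer's gcc version check.

    Installing the NVIDIA driver

    Find a suitable version

    Look up the driver version that matches your GPU model on the NVIDIA website.

    Install

    For convenience, install it with Additional Drivers, the driver management tool that ships with Ubuntu 19.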

    Screenshot from 2019-12-18 16-50-06

    Signs of success

    Open NVIDIA X Server Settings to see the GPU details. If the window is blank, no NVIDIA driver is currently installed.

    Screenshot from 2019-12-18 15-09-12

    Run nvidia-smi to see the driver details.

    root@dell3:~# nvidia-smi
    Wed Dec 18 15:07:25 2019       
    +------------------------------------------------------+                       
    | NVIDIA-SMI 340.107    Driver Version: 340.107        |                       
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GT 610      Off  | 0000:01:00.0     N/A |                  N/A |
    |100%   56C    P8    N/A /  N/A |    129MiB /  1023MiB |     N/A      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GT 610      Off  | 0000:06:00.0     N/A |                  N/A |
    |100%   46C    P8    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Compute processes:                                               GPU Memory |
    |  GPU       PID  Process name                                     Usage      |
    |=============================================================================|
    |    0            Not Supported                                               |
    |    1            Not Supported                                               |
    +-----------------------------------------------------------------------------+
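
    When the driver check is part of a script, the version can be read from that banner. A small sketch, assuming the header contains the text "Driver Version: <number>" as in the output above; the function name driver_version is hypothetical.

```shell
# Read nvidia-smi output on stdin and print the driver version from its banner.
driver_version() {
    sed -n 's/.*Driver Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical usage:
#   nvidia-smi | driver_version
```

    Empty output means the banner was not found, which usually indicates the driver is not loaded.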
    
    

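    Problems

    Login loop

    After booting, press Ctrl+Alt+F[1-6] to switch to a text console and uninstall the NVIDIA driver.

    apt remove --purge 'nvidia-*'

    If you still have the driver's installer package, you can also run

    ./nvidia-*.run --uninstall

    Reboot, log in to the stock graphical session again, and turn off password login in Settings.

    Then install the NVIDIA driver again.

    Cannot enter the graphical session

    Reinstall the NVIDIA driver.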

    Installing the NVIDIA plugin for k8s

    Install nvidia-docker2

    # Add the package repositories
    $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    $ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    $ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    
    $ sudo apt-get update && sudo apt-get install -y nvidia-docker2
    $ sudo systemctl restart docker
    

    Test nvidia-docker2

    Run a CUDA container through the nvidia runtime:

    docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

    Screenshot from 2019-12-18 15-28-48

    The first run has to download the image. If the download is too slow, configure a registry mirror.

    The output is:

    root@dell:~# docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
    Wed Dec 18 08:55:00 2019       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |
    | 33%   30C    P8     1W /  38W |    359MiB /  1999MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    

    Add the device plugin configuration

    $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
    

    Change the runtime

    Edit /etc/docker/daemon.json and add the default-runtime key.

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
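
    A typo in this file is easy to miss, so it is worth reading the value back before restarting Docker. The sketch below is a plain text match, not a JSON parser: it assumes the key and its value sit on one line as in the file above, and the function name default_runtime is hypothetical.

```shell
# Print the value of "default-runtime" from a daemon.json-style file
# (default: /etc/docker/daemon.json). Prints nothing if the key is absent.
default_runtime() {
    sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
        "${1:-/etc/docker/daemon.json}"
}

# Hypothetical usage:
#   [ "$(default_runtime)" = "nvidia" ] || echo "default-runtime is not nvidia"
# After the restart, `docker info | grep -i 'default runtime'` shows what
# the daemon actually uses.
```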
    

    Restart Docker

    # systemctl daemon-reload

    # systemctl restart docker

    Usage

    Check whether k8s detects the GPU

    Run kubectl describe node node_name to see the details of this node:

    root@dell:~/mypod# kubectl describe nodes
    Name:               dell
    Roles:              master
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        disktype=ssd
                        kubernetes.io/arch=amd64
                        kubernetes.io/hostname=dell
                        kubernetes.io/os=linux
                        node-role.kubernetes.io/master=
    Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"c6:9a:2d:50:03:4b"}
                        flannel.alpha.coreos.com/backend-type: vxlan
                        flannel.alpha.coreos.com/kube-subnet-manager: true
                        flannel.alpha.coreos.com/public-ip: 192.168.8.52
                        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                        node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp:  Thu, 12 Dec 2019 10:25:16 +0800
    Taints:             <none>
    Unschedulable:      false
    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      MemoryPressure   False   Wed, 18 Dec 2019 17:17:55 +0800   Thu, 12 Dec 2019 18:00:39 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure     False   Wed, 18 Dec 2019 17:17:55 +0800   Thu, 12 Dec 2019 18:00:39 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
      PIDPressure      False   Wed, 18 Dec 2019 17:17:55 +0800   Thu, 12 Dec 2019 18:00:39 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
      Ready            True    Wed, 18 Dec 2019 17:17:55 +0800   Mon, 16 Dec 2019 10:30:19 +0800   KubeletReady                 kubelet is posting ready status. AppArmor enabled
    Addresses:
      InternalIP:  192.168.8.52
      Hostname:    dell
    Capacity:
     cpu:                4
     ephemeral-storage:  479152840Ki
     hugepages-1Gi:      0
     hugepages-2Mi:      0
     memory:             24568140Ki
     nvidia.com/gpu:     1
     pods:               110
    Allocatable:
     cpu:                4
     ephemeral-storage:  441587256613
     hugepages-1Gi:      0
     hugepages-2Mi:      0
     memory:             24465740Ki
     nvidia.com/gpu:     1
     pods:               110
    System Info:
     Machine ID:                 833fac65cd12401db017c0b0033439e7
     System UUID:                28d52460-d7da-11dd-9d00-40167e218cad
     Boot ID:                    7a4a6548-28da-447c-845a-fab20ed82181
     Kernel Version:             5.3.0-24-generic
     OS Image:                   Ubuntu 19.10
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  docker://19.3.2
     Kubelet Version:            v1.16.3
     Kube-Proxy Version:         v1.16.3
    PodCIDR:                     192.168.0.0/24
    PodCIDRs:                    192.168.0.0/24
    Non-terminated Pods:         (13 in total)
      
    

    If the Capacity section now contains nvidia.com/gpu: 1, k8s has detected that this machine has one GPU.
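
    The describe output is long, so the GPU count can be picked out of the Capacity section directly. A sketch under assumptions: it expects the output of a single node in the layout shown above, and the function name gpu_capacity is hypothetical.

```shell
# Read `kubectl describe node` output on stdin; print the nvidia.com/gpu
# count from the Capacity section, or 0 if the resource is absent.
gpu_capacity() {
    awk '
        /^Capacity:/ { in_cap = 1; next }
        /^[^ ]/      { in_cap = 0 }
        in_cap && $1 == "nvidia.com/gpu:" { n = $2 }
        END { print n + 0 }
    '
}

# Hypothetical usage:
#   kubectl describe node node_name | gpu_capacity
```

    A result of 0 after the device plugin has been deployed usually means Docker's default runtime is not yet nvidia.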

    Using the GPU

    Create a pod that uses the GPU

    Create a file named gpu-pod.yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: tf-pod
    spec:
      containers:
        - name: tf-container
          image: tensorflow/tensorflow:latest-gpu
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPUs
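
    When several such pods are needed with different images or GPU counts, the manifest can be generated instead of copied. This is a sketch: the function name render_gpu_pod and its parameters are hypothetical, and it reproduces the structure of the file above with the container name derived from the pod name.

```shell
# Print a Pod manifest that requests GPUs through the nvidia.com/gpu limit.
# $1: pod name, $2: image, $3: number of GPUs (default 1)
render_gpu_pod() {
    local name="$1" image="$2" gpus="${3:-1}"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
    - name: ${name}-container
      image: ${image}
      resources:
        limits:
          nvidia.com/gpu: ${gpus}
EOF
}

# Hypothetical usage:
#   render_gpu_pod tf-pod tensorflow/tensorflow:latest-gpu 1 | kubectl apply -f -
```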
    
    
    

    Then run kubectl apply -f gpu-pod.yaml
    and use kubectl get pod to check the pod's status.

    root@dell:~/mypod# kubectl get pod
    NAME            READY   STATUS              RESTARTS   AGE
    busybox-6t962   0/1     Completed           0          6d2h
    cuda3           0/1     Completed           47         2d7h
    gpu-cuda        0/1     Completed           0          5d
    gpu-pod         0/1     Completed           0          5d1h
    gpu-pod23       0/2     Pending             0          2d6h
    hello-world     0/1     ContainerCreating   0          6d5h
    myjob-k9hx5     0/1     Completed           0          6d3h
    myjob2-xmdm8    0/1     Completed           0          6d1h
    pi-9cttz        0/1     Completed           0          6d2h
    
    

    Use kubectl describe pod pod_name to see the details of a pod.

    Problem: kubectl commands fail

    An error like this:

    root@dell3:~# kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
    The connection to the server 192.168.9.103:6443 was refused - did you specify the right host or port?

    This mostly appears after a reboot, when Docker and the kubelet behind the API server have not come back up. The fix:

    # swapoff -a
    # systemctl daemon-reload
    # systemctl restart docker
    # systemctl restart kubelet
    

    The most important of these steps is disabling swap, because swapoff -a does not persist across reboots.
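
    To keep swap off after a reboot, the swap entries in /etc/fstab can be commented out. A minimal sketch, assuming standard fstab lines whose third field is the filesystem type; the function name disable_swap_in_fstab is hypothetical, and it is worth backing up the file before running it.

```shell
# Comment out every active swap entry in an fstab-style file
# (default: /etc/fstab). Lines that are already comments are left alone.
disable_swap_in_fstab() {
    local file="${1:-/etc/fstab}" tmp
    tmp=$(mktemp) || return 1
    awk '!/^[ \t]*#/ && $3 == "swap" { print "#" $0; next } { print }' "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Hypothetical usage (as root):
#   cp /etc/fstab /etc/fstab.bak && disable_swap_in_fstab
```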

  • Original article: https://www.cnblogs.com/liuluopeng/p/12098071.html