Monday, March 18, 2019

Kubernetes 1.13.4 Installation Test -- NVIDIA Device Plugin


Machine names and their IPs

K8S01    Master        192.168.8.53    Ubuntu 18.04

K8S02    Node          192.168.8.54    Ubuntu 18.04

K8S03    Node          192.168.8.55    Ubuntu 18.04

SVAI01   Node - GPU    192.168.3.48    Ubuntu 18.04

SVAI02   Node - GPU    192.168.3.49    Ubuntu 18.04


Before Installing

  • Set the hostname

sudo hostnamectl set-hostname k8s-master

sudo vi /etc/hostname

  • Add every host in the cluster to /etc/hosts
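As a rough sketch of this step, the helper below appends the machines from the table above to a hosts file, skipping names that are already present. The lowercase hostnames are an assumption (only k8s02 appears in the later console output), and the function names are made up for illustration.

```shell
# Print one "IP  hostname" line per machine from the table above.
# The lowercase hostnames are assumed; adjust them to the real hostnames.
print_cluster_hosts() {
  cat <<'EOF'
192.168.8.53  k8s01
192.168.8.54  k8s02
192.168.8.55  k8s03
192.168.3.48  svai01
192.168.3.49  svai02
EOF
}

# Append entries to the hosts file given as $1, skipping names already listed.
add_cluster_hosts() {
  hosts_file="$1"
  print_cluster_hosts | while read -r ip name; do
    grep -qE "[[:space:]]${name}([[:space:]]|\$)" "$hosts_file" ||
      printf '%s  %s\n' "$ip" "$name" >> "$hosts_file"
  done
}
```

One way to use it is to run add_cluster_hosts on a copy of /etc/hosts, review the result, and then copy it back with sudo.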

  • Disable the firewall

sudo iptables -F

  • Turn off system swap

sudo swapoff -a

Edit /etc/fstab so that swap is not mounted automatically at boot

sudo sed -e '/swap/ s/^#*/#/' -i /etc/fstab
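Because the sed expression edits /etc/fstab in place, it can help to preview the change first. The sketch below applies the same expression to any file and prints the result instead of writing it; the function name is made up for illustration.

```shell
# Print the file given as $1 with every line containing "swap" commented out.
# Same expression as the in-place command above, but without -i.
# Lines that are already commented stay unchanged, so it is safe to repeat.
preview_swap_comment() {
  sed -e '/swap/ s/^#*/#/' "$1"
}
```

For example, preview_swap_comment /etc/fstab shows what the file would look like without modifying it.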

Confirm that swap is off (the Swap row should show 0)

free -m
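For a scripted version of the same check, a minimal sketch is to look at /proc/swaps, which holds a header line followed by one line per active swap area. The function name is made up, and the optional argument exists only so the check can be tried against a sample file.

```shell
# Exit status 0 when no swap area is active, non-zero otherwise.
# $1 (optional): file to inspect, defaults to /proc/swaps.
swap_is_off() {
  swaps_file="${1:-/proc/swaps}"
  # Skip the header line and count the remaining non-empty lines.
  [ "$(tail -n +2 "$swaps_file" | grep -c .)" -eq 0 ]
}
```

Usage: swap_is_off && echo "swap is off" || echo "swap is still active".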

  • Update the system packages to the latest version on all nodes:

sudo apt-get update

sudo apt-get upgrade

sudo apt-get install linux-image-extra-virtual

sudo reboot

  • Add a user to manage the Kubernetes cluster:

sudo useradd -s /bin/bash -m kube

sudo passwd kube   # password used in this test: kube

sudo usermod -aG sudo kube

echo "kube ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/kube


Install Docker Engine

First make sure any old versions of the Docker engine have been removed from the system:

sudo apt-get remove docker docker-engine docker.io

Install the prerequisite packages

sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

Install Docker

sudo apt install docker.io

sudo systemctl enable docker

Install Docker CE

Install the GPG key

https_proxy=192.168.1.88:3128 wget https://download.docker.com/linux/ubuntu/gpg -O docker.key

sudo apt-key add docker.key

Add the package source, either by creating a new file for the Docker repository at /etc/apt/sources.list.d/docker.list or with add-apt-repository as below (an Aliyun mirror is used here):

sudo add-apt-repository "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"

Install Docker CE

sudo apt-get install docker-ce


Test with hello-world; this failed with an error:

sudo docker run hello-world

To give the Docker daemon a proxy, first create the following directory:

sudo mkdir /etc/systemd/system/docker.service.d

Then create an http-proxy.conf file:

sudo vi /etc/systemd/system/docker.service.d/http-proxy.conf

with the following contents:

[Service]

Environment="HTTP_PROXY=http://192.168.2.91:80/"

Environment="HTTPS_PROXY=http://192.168.2.91:80/"


sudo systemctl daemon-reload

sudo systemctl show --property Environment docker

sudo systemctl restart docker
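The directory and file steps above can also be scripted. This is only a sketch: the function name and its arguments are made up, and the optional target directory is there so the file can be generated somewhere harmless and inspected before it is installed.

```shell
# Write a systemd drop-in that passes proxy settings to the Docker daemon.
# $1: proxy URL, for example http://192.168.2.91:80/
# $2 (optional): drop-in directory, defaults to the real Docker one.
write_docker_proxy_conf() {
  proxy_url="$1"
  dropin_dir="${2:-/etc/systemd/system/docker.service.d}"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/http-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=$proxy_url"
Environment="HTTPS_PROXY=$proxy_url"
EOF
}
```

Writing to the default directory needs root. After the file is in place, the daemon-reload and restart commands above still have to be run.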

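Running sudo docker run hello-world again still failed, but with a different error code. This time docker login was needed.

Register an account on Docker Hub and run docker login once.

sudo docker run hello-world     # ran it once more and it finally succeeded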


Kubernetes Installation

Add the signing key and the repository

curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

sudo apt-add-repository "deb http://apt.kubernetes.io/ kubernetes-xenial main"

Install the K8S packages

sudo apt install kubeadm kubectl kubelet

Initialize the Master

sudo kubeadm init --kubernetes-version=v1.13.4 --pod-network-cidr=10.244.0.0/16 --service-cidr=10.96.0.0/12

If swap has not been turned off, this step fails with an error.

Console output

neo@u1810:~$ sudo kubeadm init --kubernetes-version=v1.13.4 --pod-network-cidr=10.244.0.0/16 service-cidr=10.96.0.0/12

[init] Using Kubernetes version: v1.13.4

[preflight] Running pre-flight checks

        [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.09.3. Latest validated version: 18.06

[preflight] Pulling images required for setting up a Kubernetes cluster

[preflight] This might take a minute or two, depending on the speed of your internet connection

[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'

[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"

[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"

[kubelet-start] Activating the kubelet service

[certs] Using certificateDir folder "/etc/kubernetes/pki"

[certs] Generating "ca" certificate and key

[certs] Generating "apiserver-kubelet-client" certificate and key

[certs] Generating "apiserver" certificate and key

[certs] apiserver serving cert is signed for DNS names [u1810 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.8.53]

[certs] Generating "etcd/ca" certificate and key

[certs] Generating "etcd/server" certificate and key

[certs] etcd/server serving cert is signed for DNS names [u1810 localhost] and IPs [192.168.8.53 127.0.0.1 ::1]

[certs] Generating "etcd/peer" certificate and key

[certs] etcd/peer serving cert is signed for DNS names [u1810 localhost] and IPs [192.168.8.53 127.0.0.1 ::1]

[certs] Generating "etcd/healthcheck-client" certificate and key

[certs] Generating "apiserver-etcd-client" certificate and key

[certs] Generating "front-proxy-ca" certificate and key

[certs] Generating "front-proxy-client" certificate and key

[certs] Generating "sa" key and public key

[kubeconfig] Using kubeconfig folder "/etc/kubernetes"

[kubeconfig] Writing "admin.conf" kubeconfig file

[kubeconfig] Writing "kubelet.conf" kubeconfig file

[kubeconfig] Writing "controller-manager.conf" kubeconfig file

[kubeconfig] Writing "scheduler.conf" kubeconfig file

[control-plane] Using manifest folder "/etc/kubernetes/manifests"

[control-plane] Creating static Pod manifest for "kube-apiserver"

[control-plane] Creating static Pod manifest for "kube-controller-manager"

[control-plane] Creating static Pod manifest for "kube-scheduler"

[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"

[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s

[apiclient] All control plane components are healthy after 31.014621 seconds

[uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace

[kubelet] Creating a ConfigMap "kubelet-config-1.13" in namespace kube-system with the configuration for the kubelets in the cluster

[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "u1810" as an annotation

[mark-control-plane] Marking the node u1810 as control-plane by adding the label "node-role.kubernetes.io/master=''"

[mark-control-plane] Marking the node u1810 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]

[bootstrap-token] Using token: rnrbe5.tq9bglome3cmceci

[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles

[bootstraptoken] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials

[bootstraptoken] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token

[bootstraptoken] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster

[bootstraptoken] creating the "cluster-info" ConfigMap in the "kube-public" namespace

[addons] Applied essential addon: CoreDNS

[addons] Applied essential addon: kube-proxy

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube

  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.

Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:

https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of machines by running the following on each node

as root:  # the line below is the command and token needed to add a Node

kubeadm join 192.168.8.53:6443 --token rnrbe5.tq9bglome3cmceci --discovery-token-ca-cert-hash sha256:e7db4a5329742758c6a448bced245b1a9f257e17430fac437dda1c889b13af4f


Create the user's kubeconfig file

mkdir -p $HOME/.kube

sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

sudo chown $(id -u):$(id -g) $HOME/.kube/config

Test

kubectl get componentstatus

kubectl get nodes


Install the pod network -- flannel

sudo kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml


Check that the Master is healthy

kubectl get nodes

kubectl get pods --all-namespaces
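Nodes typically report NotReady until the pod network is up, so it can be handy to count them instead of reading the table by eye. The sketch below assumes the default kubectl get nodes layout (a header row, then NAME and STATUS as the first two columns); the function name is made up.

```shell
# Read "kubectl get nodes" output on stdin and print how many nodes have a
# STATUS other than exactly "Ready" (so "Ready,SchedulingDisabled" counts too).
count_not_ready() {
  awk 'NR > 1 && $2 != "Ready" { n++ } END { print n + 0 }'
}
```

Usage: kubectl get nodes | count_not_ready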


  • Add a Node to the K8S cluster

Take the join command printed at the end of kubeadm init and run it on the Node being added:

sudo kubeadm join 192.168.8.53:6443 --token rnrbe5.tq9bglome3cmceci --discovery-token-ca-cert-hash sha256:e7db4a5329742758c6a448bced245b1a9f257e17430fac437dda1c889b13af4f


# The token expires after 24 hours; to add more nodes later, generate a new token

kubeadm token create
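When the join line has been saved somewhere, the small helpers below pull the token and CA hash back out of it and check that a token has the expected shape (six characters, a dot, sixteen characters, all lowercase letters or digits). The function names are made up for illustration. On the Master, kubeadm token create --print-join-command prints a complete, fresh join command instead.

```shell
# Print the value that follows --token in a saved "kubeadm join" line.
join_token() {
  printf '%s\n' "$1" | sed -n 's/.*--token \([^ ]*\).*/\1/p'
}

# Print the value that follows --discovery-token-ca-cert-hash.
join_ca_hash() {
  printf '%s\n' "$1" | sed -n 's/.*--discovery-token-ca-cert-hash \([^ ]*\).*/\1/p'
}

# Exit status 0 when $1 looks like a bootstrap token.
is_bootstrap_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}
```

Usage: is_bootstrap_token "$(join_token "$saved_line")" && echo "token format looks fine", where saved_line holds the join command.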


neo@k8s02:~$ sudo kubeadm join 192.168.8.53:6443 --token rnrbe5.tq9bglome3cmceci --discovery-token-ca-cert-hash sha256:e7db4a5329742758c6a448bced245b1a9f257e17430fac437dda1c889b13af4f

[preflight] Running pre-flight checks

        [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.09.3. Latest validated version: 18.06

[discovery] Trying to connect to API Server "192.168.8.53:6443"

[discovery] Created cluster-info discovery client, requesting info from "https://192.168.8.53:6443"

[discovery] Requesting info from "https://192.168.8.53:6443" again to validate TLS against the pinned public key

[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "192.168.8.53:6443"

[discovery] Successfully established connection with API Server "192.168.8.53:6443"

[join] Reading configuration from the cluster...

[join] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'

[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.13" ConfigMap in the kube-system namespace

[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"

[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"

[kubelet-start] Activating the kubelet service

[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...

[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "k8s02" as an annotation

This node has joined the cluster:

* Certificate signing request was sent to apiserver and a response was received.

* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the master to see this node join the cluster.


On the Master, run kubectl get nodes to confirm that the new Node has appeared.


Check that the cluster is healthy

kubectl get cs

kubectl cluster-info

kubectl version --short=true


Installing the NVIDIA Device Plugin on K8S

Follow the steps in the official repository: https://github.com/NVIDIA/k8s-device-plugin


  • Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:


# List the NVIDIA hardware currently installed

lspci | grep NVIDIA

# Install manually, using the official NVIDIA driver

https://www.nvidia.com/Download/

The GPU used this time is a 2080 Ti


sudo chmod +x NVIDIA-Linux-x86_64-418.43.run

sudo ./NVIDIA-Linux-x86_64-418.43.run -no-x-check -no-nouveau-check -no-opengl-files


# Load the NVIDIA driver module

sudo modprobe nvidia

# Show GPU information

nvidia-smi  


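# On each Node that has a GPU, enable the nvidia runtime as the default runtime

# (the path below assumes nvidia-docker2 / nvidia-container-runtime is already installed on that Node)

# Edit /etc/docker/daemon.json as follows: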

{

    "default-runtime": "nvidia",

    "runtimes": {

        "nvidia": {

            "path": "/usr/bin/nvidia-container-runtime",

            "runtimeArgs": []

        }

    }

}


# Restart docker

sudo systemctl daemon-reload && sudo systemctl restart docker
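A quick sanity check of the edited file, as a sketch: the function below (a made-up name) only greps for the default-runtime key and does not validate the JSON. Once Docker has restarted, sudo docker info | grep -i runtime should also mention nvidia if the setting took effect.

```shell
# Exit status 0 when the daemon.json given as $1 (default: the real one)
# sets "default-runtime" to "nvidia". This is a text match, not a JSON parse.
has_nvidia_default_runtime() {
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' \
    "${1:-/etc/docker/daemon.json}"
}
```

Usage: has_nvidia_default_runtime || echo "default runtime is not nvidia"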


#Enabling GPU Support in Kubernetes

# Run on the Master to enable GPU support

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml

# Restart kubelet

sudo systemctl daemon-reload && sudo systemctl restart kubelet

# Confirm that the GPU Node has GPU resources available for allocation

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"