雜思集: March 2019

Machine names and IP addresses

K8S01 Master 192.168.8.53 Ubuntu 18.04

K8S02 Node 192.168.8.54 Ubuntu 18.04

K8S03 Node 192.168.8.55 Ubuntu 18.04

SVAI01 Node (GPU) 192.168.3.48 Ubuntu 18.04

SVAI02 Node (GPU) 192.168.3.49 Ubuntu 18.04

Before installing

Set the hostname

sudo hostnamectl set-hostname k8s-master

sudo vi /etc/hostname

Add every host in the cluster to /etc/hosts

Disable the firewall (this flushes all iptables rules)
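The /etc/hosts step can be scripted so it is safe to repeat on every machine. This is a minimal sketch: the IP addresses come from the machine list above, but the lowercase host names and the two function names are assumptions, so substitute whatever hostnamectl actually set on each machine.

```shell
# Print one "IP name" line per machine in the list at the top of this post.
# The lowercase names are an assumption, not taken from the original setup.
cluster_hosts() {
  cat <<'EOF'
192.168.8.53 k8s01
192.168.8.54 k8s02
192.168.8.55 k8s03
192.168.3.48 svai01
192.168.3.49 svai02
EOF
}

# Append only the entries whose host name is not already in the given file,
# so running it twice does not create duplicates.
add_missing_hosts() {
  hosts_file="$1"
  cluster_hosts | while read -r ip name; do
    grep -qw -- "$name" "$hosts_file" || echo "$ip $name" >> "$hosts_file"
  done
}
```

Run as root, the call would be add_missing_hosts /etc/hosts.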

sudo iptables -F

Turn off swap

sudo swapoff -a

Edit /etc/fstab so that swap is not mounted again at boot

sudo sed -e '/swap/ s/^#*/#/' -i /etc/fstab

Confirm swap is off

free -m
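free -m only shows the current state, so it does not tell you whether swap will come back after the reboot further down. A small check of /etc/fstab covers that case. This is a sketch; the function name active_swap_entries is made up for illustration.

```shell
# Print the fstab lines that would still mount swap at boot:
# lines that are not commented out and whose third field (the type) is "swap".
active_swap_entries() {
  awk '!/^[[:space:]]*#/ && $3 == "swap"' "$1"
}
```

After the sed command above has run, active_swap_entries /etc/fstab should print nothing.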

Update the system packages to the latest versions on all nodes:

sudo apt-get update

sudo apt-get upgrade

sudo apt-get install linux-image-extra-virtual

sudo reboot

Add a user to manage the Kubernetes cluster:

sudo useradd -s /bin/bash -m kube

sudo passwd kube

# the password set here was kube

sudo usermod -aG sudo kube

echo "kube ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/kube

Install Docker Engine

First make sure any older versions of Docker Engine have been removed from the system:

sudo apt-get remove docker docker-engine docker.io

Install the prerequisite packages

sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

Install Docker

sudo apt install docker.io

sudo systemctl enable docker

Install Docker CE

Install the GPG key

https_proxy=192.168.1.88:3128 wget https://download.docker.com/linux/ubuntu/gpg -O docker.key

sudo apt-key add docker.key

Add the package source

Create a new file for the Docker repository at /etc/apt/sources.list.d/docker.list, or add the source with add-apt-repository:

sudo add-apt-repository "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"

Install Docker CE

sudo apt-get install docker-ce

Testing with hello-world produced an error

sudo docker run hello-world

The Docker daemon needs its own proxy settings. First create the following directory

sudo mkdir /etc/systemd/system/docker.service.d

Then create an http-proxy.conf file

sudo vi /etc/systemd/system/docker.service.d/http-proxy.conf

with the following content:

[Service]

Environment="HTTP_PROXY=http://192.168.2.91:80/"

Environment="HTTPS_PROXY=http://192.168.2.91:80/"

sudo systemctl daemon-reload

sudo systemctl show --property Environment docker

sudo systemctl restart docker
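The drop-in file can also be written without opening an editor, which helps when the same proxy has to be set on every node. This is a sketch: the function name and its two parameters are assumptions, and the file content is the same three lines shown above.

```shell
# Write a systemd drop-in that gives the Docker daemon an HTTP/HTTPS proxy.
#   $1: drop-in directory (on a real node: /etc/systemd/system/docker.service.d)
#   $2: proxy URL (in this post: http://192.168.2.91:80/)
write_docker_proxy_dropin() {
  dropin_dir="$1"
  proxy_url="$2"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/http-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=$proxy_url"
Environment="HTTPS_PROXY=$proxy_url"
EOF
}
```

On a node this needs root to write under /etc/systemd, followed by the daemon-reload and restart commands above.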

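Running sudo docker run hello-world again still failed, but with a different error code. docker login was needed.

Register an account with Docker, then run docker login once

sudo docker run hello-world -- ran it once more, and it finally succeeded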

Kubernetes installation

Add the signing key and the repository

curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

sudo apt-add-repository "deb http://apt.kubernetes.io/ kubernetes-xenial main"

Install the K8S packages

sudo apt install kubeadm kubectl kubelet

Initialize the Master

sudo kubeadm init --kubernetes-version=v1.13.4 --pod-network-cidr=10.244.0.0/16 --service-cidr=10.96.0.0/12

If swap has not been turned off, this step reports an error.

Output of the run:

neo@u1810:~$ sudo kubeadm init --kubernetes-version=v1.13.4 --pod-network-cidr=10.244.0.0/16 service-cidr=10.96.0.0/12

[init] Using Kubernetes version: v1.13.4

[preflight] Running pre-flight checks

[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.09.3. Latest validated version: 18.06

[preflight] Pulling images required for setting up a Kubernetes cluster

[preflight] This might take a minute or two, depending on the speed of your internet connection

[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'

[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"

[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"

[kubelet-start] Activating the kubelet service

[certs] Using certificateDir folder "/etc/kubernetes/pki"

[certs] Generating "ca" certificate and key

[certs] Generating "apiserver-kubelet-client" certificate and key

[certs] Generating "apiserver" certificate and key

[certs] apiserver serving cert is signed for DNS names [u1810 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.8.53]

[certs] Generating "etcd/ca" certificate and key

[certs] Generating "etcd/server" certificate and key

[certs] etcd/server serving cert is signed for DNS names [u1810 localhost] and IPs [192.168.8.53 127.0.0.1 ::1]

[certs] Generating "etcd/peer" certificate and key

[certs] etcd/peer serving cert is signed for DNS names [u1810 localhost] and IPs [192.168.8.53 127.0.0.1 ::1]

[certs] Generating "etcd/healthcheck-client" certificate and key

[certs] Generating "apiserver-etcd-client" certificate and key

[certs] Generating "front-proxy-ca" certificate and key

[certs] Generating "front-proxy-client" certificate and key

[certs] Generating "sa" key and public key

[kubeconfig] Using kubeconfig folder "/etc/kubernetes"

[kubeconfig] Writing "admin.conf" kubeconfig file

[kubeconfig] Writing "kubelet.conf" kubeconfig file

[kubeconfig] Writing "controller-manager.conf" kubeconfig file

[kubeconfig] Writing "scheduler.conf" kubeconfig file

[control-plane] Using manifest folder "/etc/kubernetes/manifests"

[control-plane] Creating static Pod manifest for "kube-apiserver"

[control-plane] Creating static Pod manifest for "kube-controller-manager"

[control-plane] Creating static Pod manifest for "kube-scheduler"

[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"

[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s

[apiclient] All control plane components are healthy after 31.014621 seconds

[uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace

[kubelet] Creating a ConfigMap "kubelet-config-1.13" in namespace kube-system with the configuration for the kubelets in the cluster

[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "u1810" as an annotation

[mark-control-plane] Marking the node u1810 as control-plane by adding the label "node-role.kubernetes.io/master=''"

[mark-control-plane] Marking the node u1810 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]

[bootstrap-token] Using token: rnrbe5.tq9bglome3cmceci

[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles

[bootstraptoken] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials

[bootstraptoken] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token

[bootstraptoken] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster

[bootstraptoken] creating the "cluster-info" ConfigMap in the "kube-public" namespace

[addons] Applied essential addon: CoreDNS

[addons] Applied essential addon: kube-proxy

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

mkdir -p $HOME/.kube

sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.

Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:

https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of machines by running the following on each node

as root: (the line below is the command and token needed to add a Node)

kubeadm join 192.168.8.53:6443 --token rnrbe5.tq9bglome3cmceci --discovery-token-ca-cert-hash sha256:e7db4a5329742758c6a448bced245b1a9f257e17430fac437dda1c889b13af4f

Create the user's kubeconfig file

mkdir -p $HOME/.kube

sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

sudo chown $(id -u):$(id -g) $HOME/.kube/config

Test

kubectl get componentstatus

kubectl get nodes

Install the pod network -- flannel

sudo kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

Confirm the Master is healthy

kubectl get nodes

kubectl get pods --all-namespaces
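Reading the node list by eye works for a handful of machines. A small filter makes the same check usable in a script. This is a sketch that assumes the default kubectl get nodes table, with NAME in the first column and STATUS in the second. The function name is made up.

```shell
# Read "kubectl get nodes" output on stdin and print the name of every node
# whose STATUS column is not exactly "Ready". The header row is skipped.
# A cordoned node ("Ready,SchedulingDisabled") is also printed.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

Usage: kubectl get nodes | not_ready_nodes. Empty output means every node reports Ready. Right after flannel is applied, nodes may stay NotReady for a short while.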

Add a Node to the K8S cluster

Using the information printed at the end of kubeadm init, run the following on the Node being added

sudo kubeadm join 192.168.8.53:6443 --token rnrbe5.tq9bglome3cmceci --discovery-token-ca-cert-hash sha256:e7db4a5329742758c6a448bced245b1a9f257e17430fac437dda1c889b13af4f

# The token expires after 24 hours; to add more nodes later, generate a new token

kubeadm token create
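kubeadm token create prints only the new token, so the join command still has to be put together by hand. The sketch below does that and first checks the token's shape (six characters, a dot, sixteen characters, all lowercase letters or digits). Both function names are assumptions. The CA certificate hash is the one printed by kubeadm init and does not change while the cluster CA stays the same.

```shell
# Succeed only if the argument has the kubeadm bootstrap token format.
is_bootstrap_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# Print a join command.
#   $1: API server address, for example 192.168.8.53:6443
#   $2: bootstrap token
#   $3: CA cert hash without the "sha256:" prefix
build_join_command() {
  api_server="$1"
  token="$2"
  ca_hash="$3"
  is_bootstrap_token "$token" || { echo "bad token: $token" >&2; return 1; }
  echo "kubeadm join $api_server --token $token --discovery-token-ca-cert-hash sha256:$ca_hash"
}
```

kubeadm can also print a complete join command itself with kubeadm token create --print-join-command, which avoids the manual assembly.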

neo@k8s02:~$ sudo kubeadm join 192.168.8.53:6443 --token rnrbe5.tq9bglome3cmceci --discovery-token-ca-cert-hash sha256:e7db4a5329742758c6a448bced245b1a9f257e17430fac437dda1c889b13af4f

[preflight] Running pre-flight checks

[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.09.3. Latest validated version: 18.06

[discovery] Trying to connect to API Server "192.168.8.53:6443"

[discovery] Created cluster-info discovery client, requesting info from "https://192.168.8.53:6443"

[discovery] Requesting info from "https://192.168.8.53:6443" again to validate TLS against the pinned public key

[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "192.168.8.53:6443"

[discovery] Successfully established connection with API Server "192.168.8.53:6443"

[join] Reading configuration from the cluster...

[join] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'

[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.13" ConfigMap in the kube-system namespace

[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"

[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"

[kubelet-start] Activating the kubelet service

[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...

[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "k8s02" as an annotation

This node has joined the cluster:

* Certificate signing request was sent to apiserver and a response was received.

* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the master to see this node join the cluster.

Run kubectl get nodes on the Master to confirm that the new Node appears

Check that the cluster is healthy

kubectl get cs

kubectl cluster-info

kubectl version --short=true

Installing the NVIDIA Device Plugin on K8S

Following the steps in the official repository: https://github.com/NVIDIA/k8s-device-plugin

Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:

NVIDIA drivers ~= 361.93
nvidia-docker version > 2.0 (see how to install and its prerequisites)
docker configured with nvidia as the default runtime.
Kubernetes version = 1.11

# List the NVIDIA hardware in the machine

lspci | grep NVIDIA

# Install manually using the official NVIDIA driver

https://www.nvidia.com/Download/

The graphics card used this time is a 2080 Ti

sudo chmod +x NVIDIA-Linux-x86_64-418.43.run

sudo ./NVIDIA-Linux-x86_64-418.43.run -no-x-check -no-nouveau-check -no-opengl-files

# Load the NVIDIA driver module

sudo modprobe nvidia

# Show the graphics card information

nvidia-smi

# On each Node that has a GPU, enable the nvidia runtime as the default runtime

# Edit /etc/docker/daemon.json as follows:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

# Restart docker

sudo systemctl daemon-reload && sudo systemctl restart docker

#Enabling GPU Support in Kubernetes

# Run on the Master to enable GPU support

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml

# Restart kubelet

sudo systemctl daemon-reload && sudo systemctl restart kubelet

# Confirm that the GPU Node has allocatable GPU resources

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
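Once the node reports allocatable GPUs, a pod that requests the nvidia.com/gpu resource is a direct way to confirm scheduling works end to end. This is a sketch rather than part of the original test: the pod name, container name, and default image are illustrative assumptions, and the image must be one your nodes can pull through the proxy.

```shell
# Print a Pod manifest that requests GPUs through the nvidia.com/gpu resource.
#   $1: number of GPUs to request (default 1)
#   $2: container image (default is an assumption; any CUDA image with nvidia-smi should do)
gpu_test_pod() {
  gpu_count="${1:-1}"
  image="${2:-nvidia/cuda:9.0-devel}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: $image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: $gpu_count
EOF
}
```

Usage: gpu_test_pod 1 | kubectl apply -f - and then kubectl logs gpu-test once the pod has completed.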


Monday, March 18, 2019

Kubernetes 1.13.4 installation test -- NVIDIA Device Plugin
