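Ubuntu froze, and after a forced reboot
the boot process dropped to the BusyBox initramfs prompt.
Once fsck had finished checking the disk, rebooting the server brought it back to normal.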
Notes:
Machine names and their IP addresses
Name    Role        IP            OS
K8S01   Master      192.168.8.53  Ubuntu 18.04
K8S02   Node        192.168.8.54  Ubuntu 18.04
K8S03   Node        192.168.8.55  Ubuntu 18.04
SVAI01  Node (GPU)  192.168.3.48  Ubuntu 18.04
SVAI02  Node (GPU)  192.168.3.49  Ubuntu 18.04
Before installing
Set the hostname
sudo hostnamectl set-hostname k8s-master
sudo vi /etc/hostname
Add every host in the cluster to /etc/hosts
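As a minimal sketch of the /etc/hosts step, the helper below appends one "IP NAME" line per machine and skips names that already have an entry. The lowercase host names and the add_host_entry helper are assumptions for illustration; the script writes to a scratch file unless HOSTS_FILE is set.

```shell
#!/bin/sh
# Sketch: add one "IP NAME" line per machine, skipping names already present.
# HOSTS_FILE defaults to a scratch file; set HOSTS_FILE=/etc/hosts and run as
# root to apply it for real. The lowercase host names are assumptions.
HOSTS_FILE="${HOSTS_FILE:-$(mktemp)}"

add_host_entry() {
    # $1 = IP address, $2 = host name
    # grep -w matches whole words only, so k8s01 does not match k8s010
    if ! grep -qw -- "$2" "$HOSTS_FILE" 2>/dev/null; then
        printf '%s %s\n' "$1" "$2" >> "$HOSTS_FILE"
    fi
}

add_host_entry 192.168.8.53 k8s01
add_host_entry 192.168.8.54 k8s02
add_host_entry 192.168.8.55 k8s03
add_host_entry 192.168.3.48 svai01
add_host_entry 192.168.3.49 svai02

cat "$HOSTS_FILE"
```

Running it a second time against the same file adds nothing, so it is safe to repeat on every node.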
Disable the firewall (flush all iptables rules)
sudo iptables -F
Turn off swap
sudo swapoff -a
Edit /etc/fstab so that swap is not mounted automatically at boot
sudo sed -e '/swap/ s/^#*/#/' -i /etc/fstab
Confirm that swap is off (the Swap row should show 0)
free -m
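The sed expression above prefixes every /etc/fstab line that mentions swap with a single '#'. A small sketch for trying it on a scratch copy first; the sample fstab content and the swap_is_active helper are illustrative assumptions, not part of the original procedure.

```shell
#!/bin/sh
# Try the fstab edit on a scratch copy before touching the real file.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
UUID=1111-2222 /     ext4 errors=remount-ro 0 1
/swapfile      none  swap sw                0 0
EOF

# Same expression as in the notes: any run of leading '#' on a line that
# mentions "swap" is replaced by exactly one '#', so re-running is harmless.
sed -e '/swap/ s/^#*/#/' -i "$sample"
cat "$sample"

# Succeeds when a /proc/swaps-style file lists an active swap area.
# The first line of /proc/swaps is a header, so only later lines count.
swap_is_active() {
    awk 'NR > 1 { found = 1 } END { exit !found }' "${1:-/proc/swaps}"
}
```

On a node, `swap_is_active && echo "swap is still on"` gives a scriptable alternative to reading the free -m output by eye.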
Update the system packages to the latest version on all nodes:
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install linux-image-extra-virtual
sudo reboot
Add a user to manage the Kubernetes cluster:
sudo useradd -s /bin/bash -m kube
sudo passwd kube    # password used in these notes: kube
sudo usermod -aG sudo kube
echo "kube ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/kube
Install Docker Engine
First make sure any older version of Docker Engine has been removed from the system:
sudo apt-get remove docker docker-engine docker.io
Install the prerequisite packages
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
Install Docker
sudo apt install docker.io
sudo systemctl enable docker
Install Docker CE
Install the GPG key
https_proxy=192.168.1.88:3128 wget https://download.docker.com/linux/ubuntu/gpg -O docker.key
sudo apt-key add docker.key
Add the package source
Create a new file for the Docker repository at /etc/apt/sources.list.d/docker.list
Write the package source entry:
sudo add-apt-repository "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
Install Docker CE
sudo apt-get install docker-ce
Testing with hello-world produced an error
sudo docker run hello-world
First create the following directory
sudo mkdir /etc/systemd/system/docker.service.d
Then add an http-proxy.conf file
sudo vi /etc/systemd/system/docker.service.d/http-proxy.conf
File contents:
[Service]
Environment="HTTP_PROXY=http://192.168.2.91:80/"
Environment="HTTPS_PROXY=http://192.168.2.91:80/"
sudo systemctl daemon-reload
sudo systemctl show --property Environment docker
sudo systemctl restart docker
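The proxy drop-in above can also be written non-interactively. The sketch below additionally sets NO_PROXY so that local and cluster addresses bypass the proxy; the NO_PROXY list and the scratch DROPIN_DIR default are assumptions, so adjust them to the actual network before using the result.

```shell
#!/bin/sh
# Sketch: generate the Docker proxy drop-in without opening an editor.
# DROPIN_DIR defaults to a scratch directory; on a real host use
# /etc/systemd/system/docker.service.d and run as root.
# The NO_PROXY list is an assumption; adjust it to your network.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
PROXY_URL="http://192.168.2.91:80/"
NO_PROXY_LIST="localhost,127.0.0.1,192.168.8.53,192.168.8.54,192.168.8.55"

mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/http-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=$PROXY_URL"
Environment="HTTPS_PROXY=$PROXY_URL"
Environment="NO_PROXY=$NO_PROXY_LIST"
EOF

cat "$DROPIN_DIR/http-proxy.conf"
```

After the file is in place under the real systemd directory, the daemon-reload and restart commands shown earlier still apply.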
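Running sudo docker run hello-world again still failed, but with a different error code; this time docker login was required.
Register an account with Docker, then run sudo docker login once.
sudo docker run hello-world    # ran it again and it finally succeeded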
Kubernetes installation
Add the signing key and the repository
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-add-repository "deb http://apt.kubernetes.io/ kubernetes-xenial main"
Install the Kubernetes packages
sudo apt install kubeadm kubectl kubelet
Initialize the Master
sudo kubeadm init --kubernetes-version=v1.13.4 --pod-network-cidr=10.244.0.0/16 --service-cidr=10.96.0.0/12
If swap has not been turned off, kubeadm init stops with an error.
Output of the run:
neo@u1810:~$ sudo kubeadm init --kubernetes-version=v1.13.4 --pod-network-cidr=10.244.0.0/16 service-cidr=10.96.0.0/12
[init] Using Kubernetes version: v1.13.4
[preflight] Running pre-flight checks
[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.09.3. Latest validated version: 18.06
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [u1810 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.8.53]
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [u1810 localhost] and IPs [192.168.8.53 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [u1810 localhost] and IPs [192.168.8.53 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 31.014621 seconds
[uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.13" in namespace kube-system with the configuration for the kubelets in the cluster
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "u1810" as an annotation
[mark-control-plane] Marking the node u1810 as control-plane by adding the label "node-role.kubernetes.io/master=''"
[mark-control-plane] Marking the node u1810 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[bootstrap-token] Using token: rnrbe5.tq9bglome3cmceci
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstraptoken] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstraptoken] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstraptoken] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstraptoken] creating the "cluster-info" ConfigMap in the "kube-public" namespace
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy
Your Kubernetes master has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
You can now join any number of machines by running the following on each node
as root: (the line below is the command and token needed to add a Node)
kubeadm join 192.168.8.53:6443 --token rnrbe5.tq9bglome3cmceci --discovery-token-ca-cert-hash sha256:e7db4a5329742758c6a448bced245b1a9f257e17430fac437dda1c889b13af4f
Create the kubeconfig file for the current user
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Test
kubectl get componentstatus
kubectl get nodes
Install the pod network (flannel)
sudo kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Confirm that the Master is healthy
kubectl get nodes
kubectl get pods --all-namespaces
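Nodes typically report NotReady until the pod network is running, so it can help to check the STATUS column in a script rather than by eye. A small sketch; the not_ready_nodes helper is hypothetical and only parses text that kubectl has already printed.

```shell
#!/bin/sh
# Prints the name of every node whose STATUS column is not exactly "Ready".
# Input: the output of `kubectl get nodes --no-headers` on stdin
# (columns: NAME STATUS ROLES AGE VERSION).
# A cordoned node ("Ready,SchedulingDisabled") is reported as well.
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}
```

Usage on the Master: `kubectl get nodes --no-headers | not_ready_nodes`. Empty output means every node is Ready.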
Using the information printed at the end of kubeadm init, run the following on each Node to be added:
sudo kubeadm join 192.168.8.53:6443 --token rnrbe5.tq9bglome3cmceci --discovery-token-ca-cert-hash sha256:e7db4a5329742758c6a448bced245b1a9f257e17430fac437dda1c889b13af4f
# The token expires after 24 hours; to add more nodes later, generate a new token
kubeadm token create
neo@k8s02:~$ sudo kubeadm join 192.168.8.53:6443 --token rnrbe5.tq9bglome3cmceci --discovery-token-ca-cert-hash sha256:e7db4a5329742758c6a448bced245b1a9f257e17430fac437dda1c889b13af4f
[preflight] Running pre-flight checks
[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.09.3. Latest validated version: 18.06
[discovery] Trying to connect to API Server "192.168.8.53:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://192.168.8.53:6443"
[discovery] Requesting info from "https://192.168.8.53:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "192.168.8.53:6443"
[discovery] Successfully established connection with API Server "192.168.8.53:6443"
[join] Reading configuration from the cluster...
[join] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.13" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Activating the kubelet service
[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "k8s02" as an annotation
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the master to see this node join the cluster.
Run kubectl get nodes on the Master to confirm that the new Node has been added.
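Because the bootstrap token in the join command expires, a later join usually needs a fresh one. The sketch below checks that a string has the bootstrap-token shape before it is pasted into a join command; the is_bootstrap_token helper is an illustrative assumption, and the kubeadm commands are shown as comments because they need a running Master.

```shell
#!/bin/sh
# Succeeds when $1 has the bootstrap-token shape kubeadm uses:
# six lowercase alphanumerics, a dot, then sixteen more.
is_bootstrap_token() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# On the Master, print a complete join command with a fresh token:
#   sudo kubeadm token create --print-join-command
# List existing tokens and when they expire:
#   sudo kubeadm token list
```

For example, `is_bootstrap_token rnrbe5.tq9bglome3cmceci && echo ok` accepts the token from the run above.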
Check that the cluster is healthy
kubectl get cs
kubectl cluster-info
kubectl version --short=true
Installing the NVIDIA Device Plugin on Kubernetes
The list of prerequisites for running the NVIDIA device plugin is described below:
NVIDIA drivers ~= 361.93
nvidia-docker version > 2.0 (see how to install it and its prerequisites)
docker configured with nvidia as the default runtime.
Kubernetes version = 1.11
# List the NVIDIA hardware currently installed
lspci | grep NVIDIA
# Install manually using the official NVIDIA driver
https://www.nvidia.com/Download/
The GPU used here is a 2080 Ti
sudo chmod +x NVIDIA-Linux-x86_64-418.43.run
sudo ./NVIDIA-Linux-x86_64-418.43.run -no-x-check -no-nouveau-check -no-opengl-files
# Load the NVIDIA kernel module
sudo modprobe nvidia
# Show GPU information
nvidia-smi
# On every Node that has a GPU, enable the nvidia runtime as the default runtime
# Edit /etc/docker/daemon.json as follows:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
# Restart docker
sudo systemctl daemon-reload && sudo systemctl restart docker
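After the restart it is worth confirming that Docker actually picked up the new default runtime. A minimal sketch that reads `docker info` output from stdin; the default_runtime helper is hypothetical and relies only on the "Default Runtime" line that docker info prints.

```shell
#!/bin/sh
# Extracts the value of the "Default Runtime" line from `docker info`
# output read on stdin. Leading spaces before the label are tolerated.
default_runtime() {
    awk -F': ' '/Default Runtime/ { print $2 }'
}
```

Usage on a GPU node: `sudo docker info 2>/dev/null | default_runtime`, which should print nvidia once the daemon.json change is active.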
# Run on the Master to enable GPU support
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
# Restart kubelet
sudo systemctl daemon-reload && sudo systemctl restart kubelet
# Confirm that the GPU Nodes have allocatable GPU resources
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
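Once a node reports allocatable GPUs, a one-shot pod that requests a GPU and runs nvidia-smi is a quick end-to-end check. The sketch below only writes the manifest; the pod name, the scratch file location and the image tag are assumptions, so substitute any CUDA image that the installed driver supports.

```shell
#!/bin/sh
# Sketch: write a one-shot pod that requests a single GPU and runs nvidia-smi.
# The pod name, file location and image tag are assumptions for illustration.
MANIFEST="${MANIFEST:-$(mktemp)}"
IMAGE="${IMAGE:-nvidia/cuda:10.0-base}"

cat > "$MANIFEST" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: $IMAGE
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

echo "manifest written to $MANIFEST"
```

On the Master, `kubectl apply -f "$MANIFEST"` followed by `kubectl logs gpu-smoke-test` should show the same table that nvidia-smi prints on the node itself.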