
[Arktos-Mizar-Integration] Nginx pods scheduled to the new worker node are unable to enter the Running state in a two-node Arktos scale-up cluster #562

Closed
q131172019 opened this issue Nov 3, 2021 · 5 comments
@q131172019

q131172019 commented Nov 3, 2021

What happened:
During Arktos and Mizar integration, the Arktos team wants to test this case: worker node joining, i.e. a new worker node should be able to join the cluster, and basic pod connectivity should be provided.

In a two-node Arktos scale-up cluster with Mizar, the new worker node is able to join the cluster and enters the Ready state. However, when the nginx application is deployed, pods scheduled to the new worker node are unable to enter the Running state and are stuck in the ContainerCreating state. The kubelet log on the new worker node shows a CNI problem.

What you expected to happen:
The nginx application pods scheduled to the new worker node should enter the Running state.

How to reproduce it (as minimally and precisely as possible):

  1. Create a single-node Arktos cluster with Mizar using the procedure at https://github.com/Click2Cloud-Centaurus/arktos/blob/default-cni-mizar/docs/setup-guide/arktos-with-mizar-cni.md and apply PR 1114 (Support for Mizar CNI in arktos-up arktos#1114); then verify the health status with the procedure at https://github.com/CentaurusInfra/mizar/wiki/Mizar-Cluster-Health-Criteria.

  2. Create a worker node running AWS Ubuntu 18.04. SSH/SCP should work between the master and worker nodes, and the corresponding ports should be opened in the security group on both nodes. Upgrade the kernel to 5.6-rc2, clone the Arktos repository, and install the required dependencies. Then follow step 3 and step 4 at https://github.com/q131172019/arktos/blob/CarlXie_singleNodeArktosCluster/docs/setup-guide/multi-node-dev-scale-up-cluster.md to join the cluster.

  3. Check the status of two nodes

./cluster/kubectl.sh get nodes
NAME               STATUS   ROLES    AGE     VERSION
ip-172-31-28-132   Ready    <none>   59m     v0.9.0
ip-172-31-28-9     Ready    <none>   5m37s   v0.9.0
  4. Deploy the Nginx application
./cluster/kubectl.sh run nginx --image=nginx --replicas=10
  5. Check the status of the Nginx pods
./cluster/kubectl.sh get pods -o wide
NAME                              HASHKEY               READY   STATUS              RESTARTS   AGE     IP              NODE               NOMINATED NODE   READINESS GATES
mizar-daemon-jm68c                8489468700754586427   1/1     Running             1          6m22s   172.31.28.9     ip-172-31-28-9     <none>           <none>
mizar-daemon-qth7w                9137939745832437098   1/1     Running             0          59m     172.31.28.132   ip-172-31-28-132   <none>           <none>
mizar-operator-6b78d7ffc4-fm4pp   3559156546840379852   1/1     Running             0          59m     172.31.28.132   ip-172-31-28-132   <none>           <none>
netpod1                           6169807998527740827   1/1     Running             0          56m     20.0.0.41       ip-172-31-28-132   <none>           <none>
netpod2                           8273861328765425857   1/1     Running             0          56m     20.0.0.18       ip-172-31-28-132   <none>           <none>
nginx-5d79788459-48l6q            1654048144042850706   0/1     ContainerCreating   0          10s     <none>          ip-172-31-28-9     <none>           <none>
nginx-5d79788459-5g7w5            7238259642648495854   1/1     Running             0          10s     20.0.0.37       ip-172-31-28-132   <none>           <none>
nginx-5d79788459-5lrp9            2544086530344874677   0/1     ContainerCreating   0          10s     <none>          ip-172-31-28-9     <none>           <none>
nginx-5d79788459-bvqmt            8343757557393359410   1/1     Running             0          10s     20.0.0.45       ip-172-31-28-132   <none>           <none>
nginx-5d79788459-fqhhx            3882393408941073858   0/1     ContainerCreating   0          10s     <none>          ip-172-31-28-9     <none>           <none>
nginx-5d79788459-l46z5            5947515209238197563   0/1     ContainerCreating   0          10s     <none>          ip-172-31-28-9     <none>           <none>
nginx-5d79788459-l9m77            4466493257680056631   0/1     ContainerCreating   0          10s     <none>          ip-172-31-28-9     <none>           <none>
nginx-5d79788459-n6v7x            6802894149274701769   0/1     ContainerCreating   0          10s     <none>          ip-172-31-28-9     <none>           <none>
nginx-5d79788459-spvfs            1493173479578877424   0/1     ContainerCreating   0          10s     <none>          ip-172-31-28-9     <none>           <none>
nginx-5d79788459-zf92h            7575485586534491466   0/1     ContainerCreating   0          10s     <none>          ip-172-31-28-9     <none>           <none>
ubuntu@ip-172-31-28-132:~/go/src/k8s.io/arktos$
  6. Check the containerd errors
journalctl -u containerd
Nov 01 22:31:19 ip-172-31-28-9 containerd[2324]: time="2021-11-01T22:31:19.318504821Z" level=info msg="No cni config template is specified, wait for other system components to drop the config."
Nov 01 22:31:19 ip-172-31-28-9 containerd[2324]: time="2021-11-01T22:31:19.419328912Z" level=info msg="No cni config template is specified, wait for other system components to drop the config."
Nov 01 22:31:19 ip-172-31-28-9 containerd[2324]: time="2021-11-01T22:31:19.520129201Z" level=info msg="No cni config template is specified, wait for other system components to drop the config."
Nov 01 22:31:19 ip-172-31-28-9 containerd[2324]: time="2021-11-01T22:31:19.721247645Z" level=info msg="No cni config template is specified, wait for other system components to drop the config."
Nov 01 22:31:19 ip-172-31-28-9 containerd[2324]: time="2021-11-01T22:31:19.821904690Z" level=info msg="No cni config template is specified, wait for other system components to drop the config."
Nov 01 22:31:19 ip-172-31-28-9 containerd[2324]: time="2021-11-01T22:31:19.922523863Z" level=info msg="No cni config template is specified, wait for other system components to drop the config."
Nov 01 22:31:20 ip-172-31-28-9 containerd[2324]: time="2021-11-01T22:31:20.023186444Z" level=info msg="No cni config template is specified, wait for other system components to drop the config."
Nov 01 22:31:20 ip-172-31-28-9 containerd[2324]: time="2021-11-01T22:31:20.123857308Z" level=info msg="No cni config template is specified, wait for other system components to drop the config."
  7. Check the errors in the kubelet log at /tmp/kubelet.log
grep nginx-5d79788459-48l6q /tmp/kubelet.worker.log |tail -3
E1103 23:01:50.925691    3384 kuberuntime_manager.go:1024] createPodSandbox for pod "nginx-5d79788459-48l6q_default_system(9fef41f2-b9a0-4203-a26c-a3f28a5df19f)" failed: rpc error: code = Unknown desc = failed to setup network for sandbox "b81277168b31e4d5bae52855bff3a1e7e97fa35130db727dfbf8b3c8c833bd1b": rpc error: code = DeadlineExceeded desc = Deadline Exceeded
E1103 23:01:50.925771    3384 pod_workers.go:196] Error syncing pod 9fef41f2-b9a0-4203-a26c-a3f28a5df19f ("nginx-5d79788459-48l6q_default_system(9fef41f2-b9a0-4203-a26c-a3f28a5df19f)"), skipping: failed to "CreatePodSandbox" for "nginx-5d79788459-48l6q_default_system(9fef41f2-b9a0-4203-a26c-a3f28a5df19f)" with CreatePodSandboxError: "CreatePodSandbox for pod \"nginx-5d79788459-48l6q_default_system(9fef41f2-b9a0-4203-a26c-a3f28a5df19f)\" failed: rpc error: code = Unknown desc = failed to setup network for sandbox \"b81277168b31e4d5bae52855bff3a1e7e97fa35130db727dfbf8b3c8c833bd1b\": rpc error: code = DeadlineExceeded desc = Deadline Exceeded"
I1103 23:01:50.925803    3384 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"nginx-5d79788459-48l6q", UID:"9fef41f2-b9a0-4203-a26c-a3f28a5df19f", APIVersion:"v1", ResourceVersion:"1927", FieldPath:"", Tenant:"system"}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b81277168b31e4d5bae52855bff3a1e7e97fa35130db727dfbf8b3c8c833bd1b": rpc error: code = DeadlineExceeded desc = Deadline Exceeded
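When many pods fail this way, pulling the pod name, sandbox ID, and final gRPC error code out of these kubelet lines by hand gets tedious. A small parser sketch (not part of Arktos; the sample line below is taken from the log above) can do it:

```python
import re

# Extract pod, sandbox ID, and the innermost gRPC error code from a kubelet
# CreatePodSandbox failure line, in the format seen in the log above.
SANDBOX_ERR = re.compile(
    r'pod "(?P<pod>[^"]+)".*sandbox "(?P<sandbox>[0-9a-f]+)".*code = (?P<code>\w+)'
)

def parse_sandbox_failure(line):
    m = SANDBOX_ERR.search(line)
    return m.groupdict() if m else None

line = ('createPodSandbox for pod "nginx-5d79788459-48l6q_default_system'
        '(9fef41f2-b9a0-4203-a26c-a3f28a5df19f)" failed: rpc error: code = Unknown '
        'desc = failed to setup network for sandbox '
        '"b81277168b31e4d5bae52855bff3a1e7e97fa35130db727dfbf8b3c8c833bd1b": '
        'rpc error: code = DeadlineExceeded desc = Deadline Exceeded')
info = parse_sandbox_failure(line)
# Greedy matching picks the last "code =", so info["code"] reports the
# innermost error, DeadlineExceeded, rather than the outer Unknown wrapper.
```

The innermost code (DeadlineExceeded) is the useful one here: it points at the CNI plugin timing out rather than a generic runtime failure.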

Anything else we need to know?:

Environment:

  • Arktos version (use kubectl version): 0.9.0
  • Cloud provider or hardware configuration: AWS EC2 instance
  • OS (e.g: cat /etc/os-release): Ubuntu 18.04
  • Kernel (e.g. uname -a): 5.6.0-rc2
  • Install tools: ./hack/setup-dev-node.sh
  • Network plugin and version (if this is a network-related bug):
  • Others:
@vinaykul

vinaykul commented Nov 4, 2021

This issue does not occur with Mizar in upstream code (1.21.0), so it is an Arktos-specific issue.

kubeadm join 192.168.1.144:6443 --token c3t6gd.1ena66gozlo967q4 \

--discovery-token-ca-cert-hash sha256:1009a695e930bf66284f356bc51bbf34af087922571b016a33f173946f530a65 

root@ip-192-168-1-144:~# k create -f ~/ndeploy.mizar.yaml 

Warning: apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition

customresourcedefinition.apiextensions.k8s.io/bouncers.mizar.com created

customresourcedefinition.apiextensions.k8s.io/dividers.mizar.com created

customresourcedefinition.apiextensions.k8s.io/droplets.mizar.com created

customresourcedefinition.apiextensions.k8s.io/endpoints.mizar.com created

customresourcedefinition.apiextensions.k8s.io/subnets.mizar.com created

customresourcedefinition.apiextensions.k8s.io/vpcs.mizar.com created

serviceaccount/mizar-operator created

clusterrolebinding.rbac.authorization.k8s.io/mizar-operator created

daemonset.apps/mizar-daemon created

deployment.apps/mizar-operator created

root@ip-192-168-1-144:~# kp

NAME                              READY   STATUS     RESTARTS   AGE   IP              NODE               NOMINATED NODE   READINESS GATES

mizar-daemon-zwnrb                0/1     Init:0/1   0          3s    192.168.1.144   ip-192-168-1-144   <none>           <none>

mizar-operator-644fd89585-hnp4x   1/1     Running    0          3s    192.168.1.144   ip-192-168-1-144   <none>           <none>

root@ip-192-168-1-144:~# kp

NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE               NOMINATED NODE   READINESS GATES

mizar-daemon-zwnrb                1/1     Running   0          78s   192.168.1.144   ip-192-168-1-144   <none>           <none>

mizar-operator-644fd89585-hnp4x   1/1     Running   0          78s   192.168.1.144   ip-192-168-1-144   <none>           <none>

root@ip-192-168-1-144:~#


root@ip-192-168-1-144:~# kv

NAME   IP         PREFIX   VNI   DIVIDERS   STATUS   CREATETIME                   PROVISIONDELAY

vpc0   20.0.0.0   8        1     1          Init     2021-11-04T02:08:15.727310   

root@ip-192-168-1-144:~# kv

NAME   IP         PREFIX   VNI   DIVIDERS   STATUS        CREATETIME                   PROVISIONDELAY

vpc0   20.0.0.0   8        1     1          Provisioned   2021-11-04T02:08:15.727310   20.703013

root@ip-192-168-1-144:~# kb

NAME                                          VPC    NET    IP              MAC                 DROPLET            STATUS        CREATETIME                   PROVISIONDELAY

net0-b-317fcad2-664f-4ee2-9074-ebe08cada3b7   vpc0   net0   192.168.1.144   02:4a:a1:fb:ce:f1   ip-192-168-1-144   Provisioned   2021-11-04T02:08:56.615784   1.310861

root@ip-192-168-1-144:~# kd

NAME                                          VPC    IP              MAC                 DROPLET            STATUS        CREATETIME                   PROVISIONDELAY

vpc0-d-90cf4682-16f6-45b7-b3db-edb0fff7ccec   vpc0   192.168.1.144   02:4a:a1:fb:ce:f1   ip-192-168-1-144   Provisioned   2021-11-04T02:08:36.421403   0.521281

root@ip-192-168-1-144:~# kdr

NAME               MAC                 IP              STATUS        INTERFACE   CREATETIME                   PROVISIONDELAY

ip-192-168-1-144   02:4a:a1:fb:ce:f1   192.168.1.144   Provisioned   eth0        2021-11-04T02:08:16.698067   0.493059

root@ip-192-168-1-144:~#


...




root@ip-192-168-1-23:~# kubeadm join 192.168.1.144:6443 --token c3t6gd.1ena66gozlo967q4 \

> --discovery-token-ca-cert-hash sha256:1009a695e930bf66284f356bc51bbf34af087922571b016a33f173946f530a65 

[preflight] Running pre-flight checks

[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/

[preflight] Reading configuration from the cluster...

[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"

[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"

[kubelet-start] Starting the kubelet

[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...


This node has joined the cluster:

* Certificate signing request was sent to apiserver and a response was received.

* The Kubelet was informed of the new secure connection details.


Run 'kubectl get nodes' on the control-plane to see this node join the cluster.


root@ip-192-168-1-23:~# 




...


root@ip-192-168-1-144:~# kp

NAME                              READY   STATUS     RESTARTS   AGE     IP              NODE               NOMINATED NODE   READINESS GATES

mizar-daemon-5ncxq                0/1     Init:0/1   0          24s     192.168.1.23    ip-192-168-1-23    <none>           <none>

mizar-daemon-zwnrb                1/1     Running    0          3m35s   192.168.1.144   ip-192-168-1-144   <none>           <none>

mizar-operator-644fd89585-hnp4x   1/1     Running    0          3m35s   192.168.1.144   ip-192-168-1-144   <none>           <none>

root@ip-192-168-1-144:~# 

root@ip-192-168-1-144:~# kp

NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE               NOMINATED NODE   READINESS GATES

mizar-daemon-5ncxq                1/1     Running   0          49s   192.168.1.23    ip-192-168-1-23    <none>           <none>

mizar-daemon-zwnrb                1/1     Running   0          4m    192.168.1.144   ip-192-168-1-144   <none>           <none>

mizar-operator-644fd89585-hnp4x   1/1     Running   0          4m    192.168.1.144   ip-192-168-1-144   <none>           <none>

root@ip-192-168-1-144:~# kdr

NAME               MAC                 IP              STATUS        INTERFACE   CREATETIME                   PROVISIONDELAY

ip-192-168-1-144   02:4a:a1:fb:ce:f1   192.168.1.144   Provisioned   eth0        2021-11-04T02:08:16.698067   0.493059

ip-192-168-1-23    02:0c:10:b5:e4:eb   192.168.1.23    Provisioned   eth0        2021-11-04T02:09:51.369691   0.252838

root@ip-192-168-1-144:~# k create -f ~/2netpod.yaml 

pod/netpod1 created

pod/netpod2 created

root@ip-192-168-1-144:~# kp

NAME                              READY   STATUS    RESTARTS   AGE     IP              NODE               NOMINATED NODE   READINESS GATES

mizar-daemon-5ncxq                1/1     Running   0          82s     192.168.1.23    ip-192-168-1-23    <none>           <none>

mizar-daemon-zwnrb                1/1     Running   0          4m33s   192.168.1.144   ip-192-168-1-144   <none>           <none>

mizar-operator-644fd89585-hnp4x   1/1     Running   0          4m33s   192.168.1.144   ip-192-168-1-144   <none>           <none>

netpod1                           1/1     Running   0          8s      20.0.0.45       ip-192-168-1-144   <none>           <none>

netpod2                           1/1     Running   0          8s      20.0.0.25       ip-192-168-1-23    <none>           <none>

root@ip-192-168-1-144:~# k exec -ti netpod1 -- ping -c2 20.0.0.25

PING 20.0.0.25 (20.0.0.25) 56(84) bytes of data.

64 bytes from 20.0.0.25: icmp_seq=1 ttl=64 time=1.10 ms

64 bytes from 20.0.0.25: icmp_seq=2 ttl=64 time=0.209 ms


--- 20.0.0.25 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 3ms

rtt min/avg/max/mdev = 0.209/0.654/1.099/0.445 ms

root@ip-192-168-1-144:~# 

@Sindica

Sindica commented Nov 11, 2021

A similar issue happened for a k8s 1.21 cluster set up with kubeadm:

Setting up k8s 1.21 with kubeadm:
- Initial Mizar setup for the single-node cluster was successful
- Joining the worker node: mizar-daemon on the worker node was in "Init:CrashLoopBackOff" status, and the node was shown as not ready

Success criteria:
- All Mizar daemon/operator pods are in Running status
- All Mizar CRD objects are in Provisioned status
- Two pods created in the system tenant can ping each other
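The first two success criteria above are simple status checks and could be evaluated mechanically. A minimal sketch, assuming the pod and CRD statuses have already been collected elsewhere (e.g. parsed from `kubectl get` output); the cross-pod ping criterion still needs an actual ping test:

```python
def mizar_healthy(pods, crd_objects):
    """Sketch of the first two success criteria: every Mizar daemon/operator
    pod is Running and every Mizar CRD object is Provisioned.
    Inputs are (name, status) pairs, assumed gathered from kubectl output."""
    mizar_pods = [status for name, status in pods if name.startswith("mizar-")]
    return (bool(mizar_pods)
            and all(s == "Running" for s in mizar_pods)
            and all(s == "Provisioned" for _, s in crd_objects))
```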

@vinaykul vinaykul self-assigned this Nov 11, 2021
@Sindica

Sindica commented Nov 30, 2021

To check case 4 (add a worker node into an existing Arktos cluster), I did the following steps:

  1. Checkout arktos branch poc-2022-01-30
  2. On the master node, start processes with "CNIPLUGIN=mizar ./hack/arktos-up.sh -O" and wait until the bouncers are provisioned
  3. On the worker node, run
mkdir /tmp/arktos
  4. Create the file kubelet.kubeconfig in /tmp/arktos
  5. Add the following content to /tmp/arktos/kubelet.kubeconfig:
apiVersion: v1
clusters:
- cluster:
    server: http://ip-172-30-0-14:8080/
  name: local-up-cluster
contexts:
- context:
    cluster: local-up-cluster
    user: local-up-cluster
  name: local-up-cluster
current-context: local-up-cluster
kind: Config
preferences: {}
users:
- name: local-up-cluster
  6. Copy /var/run/kubernetes/client-ca.crt from the master node to /tmp/arktos on the worker
  7. Run "export KUBELET_IP=$(hostname -i); echo $KUBELET_IP"
  8. Run the script "CNIPLUGIN=mizar ./hack/arktos-worker-up.sh"
  9. Verify that both nodes are Ready and the daemonset pods are ready
  10. Create a deployment netpod-deployment-10.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: netpod-deployment
spec:
  replicas: 10
  selector:
    matchLabels:
      app: netpod
  template:
    metadata:
      labels:
        app: netpod
    spec:
      terminationGracePeriodSeconds: 10
      restartPolicy: Always
      containers:
      - name: netctr
        image: mizarnet/testpod
        ports:
        - containerPort: 9001
          protocol: TCP
        - containerPort: 5001
          protocol: UDP
        - containerPort: 7000
          protocol: TCP
  11. Wait until all pods are running. Pods deployed on the master node can ping each other on the same node but cannot ping pods on the worker node. Pods deployed on the worker node cannot ping any other pods.
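The partial connectivity described above (same-node pings work on the master, cross-node pings fail) is easier to see as a full pair-wise matrix. A hypothetical helper that builds the same `kubectl exec ... ping` command used earlier in this issue for every ordered pod pair; the pod names in the usage below are illustrative:

```python
from itertools import permutations

def ping_cmd(src_pod, dst_ip, count=2):
    # Same form as the check used earlier in this issue:
    #   ./cluster/kubectl.sh exec <src> -- ping -c2 <dst-ip>
    return ["./cluster/kubectl.sh", "exec", src_pod, "--",
            "ping", f"-c{count}", dst_ip]

def connectivity_matrix(pods):
    """pods: list of (name, ip, node) tuples. Yields
    (src, dst, same_node, cmd) for every ordered pair, so cross-node
    failures stand out from same-node ones when the commands are run."""
    for (sname, sip, snode), (dname, dip, dnode) in permutations(pods, 2):
        yield sname, dname, snode == dnode, ping_cmd(sname, dip)
```

Running the generated commands and recording which fail would confirm whether the failures split cleanly along the node boundary, as the description above suggests.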

Logs for both daemonset pods are accessible. The worker daemon log looks normal, but the master daemon log has an error:

INFO:root:Creating interface eth-838f731a
ERROR:grpc._server:Exception calling application: (17, 'File exists')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/grpc/_server.py", line 443, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/usr/local/lib/python3.7/site-packages/mizar/daemon/interface_service.py", line 65, in InitializeInterfaces
    self._CreateInterface(interface)
  File "/usr/local/lib/python3.7/site-packages/mizar/daemon/interface_service.py", line 74, in _CreateInterface
    return self._CreateVethInterface(interface)
  File "/usr/local/lib/python3.7/site-packages/mizar/daemon/interface_service.py", line 87, in _CreateVethInterface
    peer=veth_peer, kind='veth')
  File "/usr/local/lib/python3.7/site-packages/pr2modules/iproute/linux.py", line 1395, in link
    msg_flags=msg_flags)
  File "/usr/local/lib/python3.7/site-packages/pr2modules/netlink/nlsocket.py", line 391, in nlm_request
    return tuple(self._genlm_request(*argv, **kwarg))
  File "/usr/local/lib/python3.7/site-packages/pr2modules/netlink/nlsocket.py", line 884, in nlm_request
    callback=callback):
  File "/usr/local/lib/python3.7/site-packages/pr2modules/netlink/nlsocket.py", line 394, in get
    return tuple(self._genlm_get(*argv, **kwarg))
  File "/usr/local/lib/python3.7/site-packages/pr2modules/netlink/nlsocket.py", line 719, in get
    raise msg['header']['error']
pr2modules.netlink.exceptions.NetlinkError: (17, 'File exists')
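NetlinkError 17 is EEXIST ("File exists"): the daemon is asking the kernel to create a veth interface that is already present, likely left over from an earlier run. A generic sketch of the guard such a daemon could apply (this is an illustration of the errno pattern, not Mizar's actual code, and it uses OSError rather than pyroute2's NetlinkError):

```python
import errno

def ensure_created(create_fn, name):
    """Call create_fn(name) and tolerate 'File exists' (errno 17), so a
    restart does not crash on interfaces created by a previous run."""
    try:
        create_fn(name)
        return "created"
    except OSError as e:
        if e.errno == errno.EEXIST:
            # Interface already exists; treat as success.
            return "already-present"
        raise
```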

Note: if there is any issue, use https://github.com/q131172019/arktos/blob/CarlXie_singleNodeArktosCluster/docs/setup-guide/multi-node-dev-scale-up-cluster.md as a reference.

@Sindica

Sindica commented Nov 30, 2021

One more test with the same steps as above; this time one pod on the worker could not start. Error log in the daemonset:

INFO:root:Consuming interfaces for pod: netpod-deployment-8df48867-n2mz6-default-system Current Queue: []
INFO:root:Deleting interfaces for pod netpod-deployment-8df48867-n2mz6-default-system with interfaces []
ERROR:root:Timeout, no new interface to consume! netpod-deployment-8df48867-n2mz6-default-system []
  k8s_pod_name: "netpod-deployment-8df48867-n2mz6"
, cni_params.pod_id k8s_pod_name: "netpod-deployment-8df48867-n2mz6"
, pod_name netpod-deployment-8df48867-n2mz6-default-system
  k8s_pod_name: "netpod-deployment-8df48867-n2mz6"
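The "Timeout, no new interface to consume" message suggests the daemon blocks on a queue of pre-created interfaces that is never filled for this pod. The shape of such a consume loop, as a minimal illustration using a plain queue (an assumption about the design, not Mizar's actual implementation):

```python
import queue

def consume_interface(interfaces, pod_name, timeout=5.0):
    """Block until an interface for pod_name arrives on the queue; give up
    after `timeout` seconds, producing an error like the one logged above."""
    try:
        return interfaces.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(
            f"Timeout, no new interface to consume! {pod_name} []")
```

If the producer side (interface pre-creation) fails or never runs for the pod, the consumer times out with exactly this symptom while the rest of the daemon stays healthy.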

@vinaykul

This should be working now. Closing as per discussion in Dec 29th Network SIG meeting.
