
pod launched by unexpected CNI when the health checking of the agent fails and multus.conf is lost #3758

Merged 1 commit into spidernet-io:main from charts/multus_uninstall on Jul 31, 2024

Conversation

@cyclinder (Collaborator) commented Jul 24, 2024

Thanks for contributing!

What type of PR is this?

  • release/bug

What this PR does / why we need it:

The preStop hook of the agent's container cleans up the Multus CNI config (00-multus.conf). If the container restarts but the pod does not, the config file is not re-generated, and Multus stops working.

This PR moves multus-cni from an init container to a regular container and makes it the first container in the list; see the sketch below.
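
A quick way to sanity-check the new layout (the namespace and DaemonSet name here are assumptions based on the demo output below; adjust for your install):

# Sketch only: confirm multus-cni now runs as a regular container rather than
# an init container. Namespace and resource name are assumed, not taken from
# the chart.
kubectl -n kube-system get ds spiderpool-agent \
  -o jsonpath='{.spec.template.spec.initContainers[*].name}'
# expect: no multus-cni entry here anymore
kubectl -n kube-system get ds spiderpool-agent \
  -o jsonpath='{.spec.template.spec.containers[*].name}'
# expect: multus-cni listed among the regular containers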

Which issue(s) this PR fixes:

Fixes #3755

Special notes for your reviewer:

  1. After restarting the multus-cni container, 00-multus.conf is re-generated:
root@spider-worker:/# crictl ps
CONTAINER           IMAGE               CREATED              STATE               NAME                     ATTEMPT             POD ID              POD
a4d9d4af2fe69       d958cfbb222dd       About a minute ago   Running             spiderpool-agent         1                   4ee679bf58463       spiderpool-agent-tzv5j
d831dc97340f0       c0e8690ae66a1       2 minutes ago        Running             multus-cni               2                   4ee679bf58463       spiderpool-agent-tzv5j
8d75c8c8afcc8       0cdd32d9ecad4       6 minutes ago        Running             spiderpool-controller    0                   63e231c0e00a4       spiderpool-controller-6c96859774-xfk8k
1cb443c9e2275       ded66453eb630       24 hours ago         Running             calico-node              0                   07b5a657f52d2       calico-node-99ns6
c6b519fbbc373       ce18e076e9d4b       24 hours ago         Running             local-path-provisioner   0                   84e082ef2ce9e       local-path-provisioner-6f8956fb48-rrjxj
de187f755a123       cbb01a7bd410d       24 hours ago         Running             coredns                  0                   5ccf34350d4d2       coredns-76f75df574-vrvvw
c3bd1ec34ff92       cbb01a7bd410d       24 hours ago         Running             coredns                  0                   0d54dd25c240c       coredns-76f75df574-qxjwr
70065ffc12219       fa4dee78049db       24 hours ago         Running             kube-proxy               0                   d2acd3475056d       kube-proxy-tzcfr
root@spider-worker:/# ls -l /etc/cni/net.d/
total 16
-rw-r--r-- 1 root root  461 Jul 25 08:11 00-multus.conf
-rw-r--r-- 1 root root  752 Jul 24 07:58 10-calico.conflist
-rw------- 1 root root 2737 Jul 25 01:42 calico-kubeconfig
drwxr-xr-x 2 root root 4096 Jul 25 08:11 multus.d
root@spider-worker:/# crictl stop d831dc97340f0
d831dc97340f0
root@spider-worker:/# ls -l /etc/cni/net.d/
total 16
-rw-r--r-- 1 root root  461 Jul 25 08:14 00-multus.conf
-rw-r--r-- 1 root root  752 Jul 24 07:58 10-calico.conflist
-rw------- 1 root root 2737 Jul 25 01:42 calico-kubeconfig
drwxr-xr-x 2 root root 4096 Jul 25 08:14 multus.d
  2. After restarting the agent's container, 00-multus.conf is unchanged. This can be checked the same way as point 1:
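
The container ID below is the agent container from the crictl ps output above; it will differ per node.

crictl stop a4d9d4af2fe69
ls -l /etc/cni/net.d/00-multus.conf
# expect: the modification time of 00-multus.conf stays the same after the
# agent container comes back up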

@cyclinder cyclinder added release/bug cherrypick-release-v0.8 Cherry-pick the PR to branch release-v0.8. cherrypick-release-v0.9 cherrypick-release-v1.0 Cherry-pick the PR to branch release-v1.0. labels Jul 24, 2024
@weizhoublue (Collaborator) commented:

Currently there are two containers in the pod, so all debugging commands in the CI should change to kubectl logs spiderpool-agent -c xxx to get the expected log.
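
For example (the pod name is taken from the demo output above; the namespace is an assumption):

# Each container's log must now be selected explicitly with -c.
kubectl -n kube-system logs spiderpool-agent-tzv5j -c spiderpool-agent
kubectl -n kube-system logs spiderpool-agent-tzv5j -c multus-cni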

{{- if .Values.multus.multusCNI.extraVolumes }}
{{- include "tplvalues.render" ( dict "value" .Values.multus.multusCNI.extraVolumeMounts "context" $ ) | nindent 12 }}
{{- end }}
{{- end }}
- name: {{ .Values.spiderpoolAgent.name | trunc 63 | trimSuffix "-" }}
@weizhoublue (Collaborator) commented Jul 25, 2024:

Does it make a difference to kubectl logs if this container is shifted to be the first container?

@cyclinder (Collaborator, Author) replied:

Yes, we should make the agent's container the first container.

@cyclinder cyclinder force-pushed the charts/multus_uninstall branch 2 times, most recently from a709cea to a0a4247, on July 25, 2024 08:50
codecov bot commented Jul 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.16%. Comparing base (1138e66) to head (6dae80f).


@@            Coverage Diff             @@
##             main    #3758      +/-   ##
==========================================
- Coverage   81.68%   81.16%   -0.53%     
==========================================
  Files          50       50              
  Lines        4391     4391              
==========================================
- Hits         3587     3564      -23     
- Misses        643      670      +27     
+ Partials      161      157       -4     
Flag        Coverage Δ
unittests   81.16% <ø> (-0.53%) ⬇️

Flags with carried forward coverage won't be shown.

see 1 file with indirect coverage changes

@weizhoublue weizhoublue changed the title charts: avoiding unexpect loss of 00-multus.conf on node pod run in unexpected CNI when the health checking of the agent fails and multus.conf is lost Jul 26, 2024
@weizhoublue (Collaborator) commented:

@ty-dc pay attention to https://github.com/spidernet-io/spiderpool/actions/runs/10094313282/job/27912257643?pr=3758

weizhoublue previously approved these changes Jul 26, 2024
test/Makefile Outdated
@@ -382,6 +383,15 @@ uninstall_spiderpool:
@echo -e "\033[35m [helm uninstall spiderpool] \033[0m"
helm uninstall $(RELEASE_NAME) --wait --debug -n $(RELEASE_NAMESPACE) \
--kubeconfig $(E2E_KUBECONFIG) || { KIND_CLUSTER_NAME=$(E2E_CLUSTER_NAME) ./scripts/debugEnv.sh $(E2E_KUBECONFIG) "detail" ; exit 1 ; } ; \
NODE_LIST=` kind get nodes --name $(E2E_CLUSTER_NAME) `; \
@cyclinder (Collaborator, Author) replied:

OK, I will rebase this after #3716 is merged.

@weizhoublue (Collaborator) replied:

I mean these lines could be removed.

@cyclinder (Collaborator, Author) replied:

Done.

Signed-off-by: cyclinder <qifeng.guo@daocloud.io>
@weizhoublue weizhoublue changed the title pod run in unexpected CNI when the health checking of the agent fails and multus.conf is lost pod launched by unexpected CNI when the health checking of the agent fails and multus.conf is lost Jul 31, 2024
@weizhoublue weizhoublue merged commit cfae10b into spidernet-io:main Jul 31, 2024
56 checks passed
github-actions bot pushed a commit that referenced this pull request Jul 31, 2024
 pod launched by unexpected CNI when the health checking of the agent fails and multus.conf is lost

Signed-off-by: robot <tao.yang@daocloud.io>
Successfully merging this pull request may close these issues.

After restarting spiderpool-agent, the Multus configuration is lost and the IP address changes.