
fix: Spiderpool GC incorrect IP address during statefulset Pod scale up/down, causing IP conflict #3778

Merged
merged 1 commit into main from fix/error-gc
Aug 19, 2024

Conversation

ty-dc
Collaborator

@ty-dc ty-dc commented Jul 26, 2024

Thanks for contributing!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #3751 #3817

Special notes for your reviewer:

Symptoms:

  1. When a statefulSet application undergoes a rapid scale-down followed by a scale-up, the re-created Pods occasionally end up with conflicting IP addresses.
  2. When a statefulSet application first uses Spiderpool IPAM and then stops using it, stale spiderEndpoint objects and IPPool IP records remain even after the GC All process has run.

Root cause

Symptom 1

  1. When a statefulSet undergoes a rapid scale-down followed by a scale-up, handlePodEntryForTracingTimeOut detects that the Pod deletion has timed out, but it does not check whether the Pod's status.IPs is empty (if it is empty, the address has in fact already been reclaimed correctly by cmdDel). Because of the missing check, the reclamation still runs even when status.IPs is empty. If another Pod happens to have started, requested an address from Spiderpool, and been allocated that same IP, the reclamation wrongly wipes Spiderpool's record of it. The result is a Pod IP that Kubernetes records but that no longer exists in the IPPool. Spiderpool then treats the address as free, and when yet another Pod requests an allocation, Spiderpool hands out the same IP again; two Pods now hold the same address, and an IP conflict occurs.

  2. When a statefulSet undergoes a rapid scale-down followed by a scale-up, a Pod Name can match a SpiderEndpoint Name while their UIDs differ. Such a SpiderEndpoint should be reclaimed, but since its old historical IPs and IPPool names cannot be determined, it should be left alone for GC all to handle. The releaseIPPoolIPExecutor method lacked this check and directly used the UID and IPs recorded in the SpiderEndpoint to perform the reclamation; at that point the SpiderEndpoint may record the IP newly allocated to the restarted Pod, which is then reclaimed by mistake. Kubernetes and the IPPool records fall out of sync, and an IP conflict follows.

Symptom 2

When a statefulSet stops using Spiderpool IPAM, Spiderpool does not reclaim the spiderEndpoint and stale data in tracePod_worker; that is left to the gc All code. When execution reaches string(podYaml.UID) != poolIPAllocation.PodUID, the sts Pod may either have restarted or have stopped using Spiderpool IPAM. The original gc all code only checked whether the workload needs a fixed IP (statefulset, kubevirt) and, if so, skipped it outright without any reclamation, leaving the spiderEndpoint and IP behind.

Symptom 3

After a Job-type Pod reaches the Complete state, its IP is reclaimed but its endpoint is not.

Fix:

Symptom 1

  1. When a stateful application undergoes a rapid scale-down and scale-up and podEntry.TracingStopTime has timed out, additionally check whether the current Pod's IPs in Kubernetes are empty. If they are empty, the address has already been released normally through cmdDel and no reclamation is needed; if they are not empty, the reclamation should proceed. This prevents IP addresses from being reclaimed by mistake.

  2. When a stateful application undergoes a rapid scale-down and scale-up and tracePodWorker finds a Pod whose Name matches a SpiderEndpoint's Name but whose UID differs, that SpiderEndpoint should be reclaimed; however, tracePodWorker cannot trace the IPPool Name and IP information the SpiderEndpoint used. The releaseIPPoolIPExecutor method therefore checks endpoint.Status.Current.UID != podCache.UID and, when they differ, leaves the cleanup to gcAll. (The Pod's k8s.v1.cni.cncf.io/networks-status annotation does not record the IPPool Name, so the historical IP records cannot be recovered, and GC all must handle the case.) Both guards are sketched after this list.
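A minimal Go sketch of the two guards described in this list; every function and parameter name here is illustrative, not the actual Spiderpool code:

package main

import "fmt"

// shouldReclaimOnTracingTimeout mirrors fix 1: when the tracing stop time
// has passed, an empty status.IPs on the live Pod means cmdDel has already
// released the address, so the tracer must not reclaim it again.
func shouldReclaimOnTracingTimeout(livePodIPs []string) bool {
    return len(livePodIPs) != 0
}

// shouldDeferToGCAll mirrors fix 2: if the UID recorded in the SpiderEndpoint
// (endpoint.Status.Current.UID) differs from the cached Pod UID, the endpoint
// may already hold the restarted Pod's new IP, so leave the cleanup to GC all.
func shouldDeferToGCAll(endpointUID, cachedPodUID string) bool {
    return endpointUID != cachedPodUID
}

func main() {
    fmt.Println(shouldReclaimOnTracingTimeout(nil))       // false: already released by cmdDel
    fmt.Println(shouldDeferToGCAll("uid-old", "uid-new")) // true: defer to GC all
}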

Symptom 2

When GC All processes the leftover spiderEndpoint and IPPool IPs of a statefulset workload, which needs a fixed IP address, the following scenarios must be handled so that GC all does not reclaim the IP by mistake and break the fixed-IP guarantee.

  1. If a statefulSet Pod has just restarted and is still in the middle of restarting, its Status.PodIPs may be empty. gc all cannot tell whether it is safe to reclaim the IP address and the spiderEndpoint in this state, so for safety the new code skips it.

  2. If the statefulSet no longer uses Spiderpool IPAM, the Pod UID differs from the Pod UID recorded in the IPPool, and Status.PodIPs is not empty, then the Pod has finished restarting and cmdAdd has already configured its IP address. At that point, check whether the SpiderEndpoint UID and the IPPool-recorded UID match the Pod UID. If they do not match, the statefulSet Pod did not obtain its IP from Spiderpool, and GC all should reclaim both the historical IP in the IPPool and the spiderEndpoint; if they match, only the historical IP in the IPPool needs to be reclaimed. The decision flow is sketched after this list.
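A Go sketch of this decision flow, under the assumption that it runs only after the pool-recorded Pod UID has been found to differ from the live Pod UID; all names are illustrative:

package main

import "fmt"

type gcDecision int

const (
    skipForSafety        gcDecision = iota // restart may still be in progress
    reclaimPoolIPOnly                      // Pod is still allocated by Spiderpool
    reclaimIPAndEndpoint                   // Pod no longer uses Spiderpool IPAM
)

// decide assumes string(pod.UID) != poolIPAllocation.PodUID has already been
// established for a fixed-IP workload (statefulset, kubevirt).
func decide(podIPsEmpty bool, endpointUID, poolRecordUID, podUID string) gcDecision {
    if podIPsEmpty {
        // Scenario 1: Status.PodIPs is empty, so the Pod may still be
        // restarting; reclamation cannot be proven safe, skip it.
        return skipForSafety
    }
    if endpointUID != podUID && poolRecordUID != podUID {
        // Scenario 2: cmdAdd has finished, yet neither record matches the
        // live Pod, so Spiderpool no longer allocates for this Pod.
        return reclaimIPAndEndpoint
    }
    // The endpoint matches the live Pod: only the stale historical IP
    // in the IPPool is leftover.
    return reclaimPoolIPOnly
}

func main() {
    fmt.Println(decide(true, "", "", ""))                 // 0: skip for safety
    fmt.Println(decide(false, "uid-a", "uid-b", "uid-c")) // 2: reclaim IP and endpoint
    fmt.Println(decide(false, "uid-c", "uid-b", "uid-c")) // 1: reclaim pool IP only
}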

@ty-dc ty-dc requested a review from weizhoublue as a code owner July 26, 2024 07:43

codecov bot commented Jul 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.28%. Comparing base (a3c940d) to head (f53668e).
Report is 30 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3778      +/-   ##
==========================================
+ Coverage   81.16%   81.28%   +0.11%     
==========================================
  Files          50       50              
  Lines        4391     4408      +17     
==========================================
+ Hits         3564     3583      +19     
+ Misses        670      669       -1     
+ Partials      157      156       -1     
Flag        Coverage Δ
unittests   81.28% <100.00%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines                                  Coverage Δ
...orkloadendpointmanager/workloadendpoint_manager.go     100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

@ty-dc ty-dc changed the title fix: Spiderpool GC incorrect IP address during sts Pod scale up/down,… fix: Spiderpool GC incorrect IP address during statefulset Pod scale up/down, causing IP conflict Jul 26, 2024
@ty-dc ty-dc force-pushed the fix/error-gc branch 2 times, most recently from d636008 to 2aae3f1 Compare July 26, 2024 10:04
@ty-dc ty-dc added release/bug pr/not-ready not ready for merging labels Jul 26, 2024
@ty-dc ty-dc force-pushed the fix/error-gc branch 9 times, most recently from 4d58e97 to 64d00e2 Compare July 31, 2024 14:01
@ty-dc ty-dc requested a review from bzsuni as a code owner July 31, 2024 14:01
@ty-dc ty-dc added cherrypick-release-v0.8 Cherry-pick the PR to branch release-v0.8. cherrypick-release-v0.9 labels Aug 1, 2024
@ty-dc ty-dc force-pushed the fix/error-gc branch 5 times, most recently from 810b1ca to a5faf42 Compare August 16, 2024 10:35
wep, err := GetWorkloadByName(f, podYaml.Namespace, podYaml.Name)
if err != nil {
if api_errors.IsNotFound(err) {
return fmt.Errorf("endpoint %s/%s dose not exist", podYaml.Namespace, podYaml.Name)
Collaborator

@weizhoublue weizhoublue Aug 16, 2024

As noted in an earlier comment: while this function iterates over the ippools in the for loop, do not return midway. Print every normal error, and only report an error when the function finally exits, so that all of the problems can be found.

Collaborator Author

done
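A small Go sketch of the requested pattern: log each per-pool failure as it happens and return one aggregated error at loop exit (errors.Join needs Go 1.20+; all names are illustrative):

package main

import (
    "errors"
    "fmt"
)

// checkAllPools keeps iterating past failures so a single bad pool cannot
// hide problems in the remaining ones, then reports everything at the end.
func checkAllPools(pools []string, check func(string) error) error {
    var errs []error
    for _, pool := range pools {
        if err := check(pool); err != nil {
            fmt.Printf("pool %s: %v\n", pool, err) // print and keep going
            errs = append(errs, fmt.Errorf("pool %s: %w", pool, err))
        }
    }
    return errors.Join(errs...) // nil when every pool passed
}

func main() {
    err := checkAllPools([]string{"v4pool-a", "v4pool-b"}, func(pool string) error {
        if pool == "v4pool-b" {
            return errors.New("endpoint does not exist")
        }
        return nil
    })
    fmt.Println(err)
}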

podYaml, err := f.GetPod(podName, podNS)
if err != nil {
if api_errors.IsNotFound(err) {
return fmt.Errorf("pod %s/%s does not exist", podNS, podName)
Collaborator

@weizhoublue weizhoublue Aug 16, 2024

As noted in an earlier comment: while this function iterates over the ippools in the for loop, do not return midway. Print every IP error, and only report an error when the function finally exits, so that all of the problems can be found.
Failing to find the ep here is clearly a normal error, not an API communication failure.

Collaborator Author

done


podNetworkIPs, err := ParsePodNetworkAnnotation(f, podYaml)
if nil != err {
return fmt.Errorf("failed to parse pod network annotation: %v", err)
Collaborator

@weizhoublue weizhoublue Aug 16, 2024

Without printing the yaml, the pod name, and so on here, a failure is impossible to troubleshoot; there is no way to tell why the yaml could not be parsed.

Collaborator Author

done
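For reference, a hedged sketch of the snippet above with the pod identity and annotations wrapped into the error; the fields used here are assumptions about the surrounding test code:

podNetworkIPs, err := ParsePodNetworkAnnotation(f, podYaml)
if nil != err {
    return fmt.Errorf("failed to parse network annotation for pod %s/%s (annotations: %v): %w",
        podYaml.Namespace, podYaml.Name, podYaml.Annotations, err)
}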

f.Log("Failed: The UID %s recorded in IPPool %s for the Pod does not match the UID %s in the endpoint %s/%s.", poolIPAllocation.PodUID, poolName, tmpIPs[idx].UID, podNS, podName)
} else if tmpIPs[idx].IP != poolIP {
isSanity = false
f.Log("Failed: The IP %s recorded in IPPool %s for the Pod %s/%s does not match the IP %s in the endpoint %s/%s.", poolIP, poolName, podNS, podName, tmpIPs[idx].IP, podNS, podName)
Collaborator

@weizhoublue weizhoublue Aug 16, 2024

Shouldn't these f.Log calls print at the error level?
Here both the success and the failure logs go through f.Log.

Collaborator Author

@ty-dc ty-dc Aug 19, 2024

done

  "msg"="Failed" "error"="the IP 192.158.26.2 recorded in IPPool v4pool-b7b3e87c9dcc33e for the Pod gc-chaos-ns-5056-466058729/gc-chaos-kruise-sts-1-5143-327365450-0 does not match the IP 192.158.26.2 in the endpoint gc-chaos-ns-5056-466058729/gc-chaos-kruise-sts-1-5143-327365450-0"

@ty-dc ty-dc force-pushed the fix/error-gc branch 2 times, most recently from 54a6fab to bb639f5 Compare August 17, 2024 16:01
@weizhoublue
Collaborator

S [SKIPPED] [0.001 seconds]
Chaos Testing of GC [reclaim]
/home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:27
  Chaos Testing of GC [BeforeEach]
  /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:212
    Multiple resource types compete for a single IPPool. In scenarios of creation, scaling up/down, and deletion, GC all can correctly handle IP addresses. [G00012, overlay]
    /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:275

  Timeline >>
  > Enter [BeforeEach] Chaos Testing of GC - /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:212 @ 08/17/24 16:23:58.476
  [SKIPPED] overlay CNI is not installed , ignore this case
  In [BeforeEach] at: /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:214 @ 08/17/24 16:23:58.477
  < Exit [BeforeEach] Chaos Testing of GC - /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:212 @ 08/17/24 16:23:58.477 (1ms)
  << Timeline

@weizhoublue
Collaborator

SUCCESS! - Suite skipped in BeforeSuite -- 0 Passed | 0 Failed | 0 Pending | 12 Skipped

@weizhoublue
Collaborator

SUCCESS! - Suite skipped in BeforeSuite -- 0 Passed | 0 Failed | 0 Pending | 3 Skipped

@weizhoublue
Collaborator

SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 7 Skipped

@weizhoublue
Collaborator

SUCCESS! - Suite skipped in BeforeSuite -- 0 Passed | 0 Failed | 1 Pending | 1 Skipped

@ty-dc
Collaborator Author

ty-dc commented Aug 19, 2024

This case takes a long time, and the underlay suite already has many cases and a long overall runtime, so this case is run in the overlay suite. That shortens the overall CI time.

• [290.750 seconds]
Chaos Testing of GC [reclaim]
/home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:27
  Chaos Testing of GC
  /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:198
    Multiple resource types compete for a single IPPool. In scenarios of creation, scaling up/down, and deletion, GC all can correctly handle IP addresses. [G00012, overlay]
    /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:275

  Timeline >>
  > Enter [BeforeEach] Chaos Testing of GC - /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:212 @ 08/17/24 16:20:53.909
  SPIDERPOOL_GC_DEFAULT_INTERVAL_DURATION: 30 
  succeed to create namespace gc-chaos-ns-1740-6212094 
  succeed to create v4 IPPool v4pool-a15d24dbf181e42 
  succeed to create v6 IPPool v6pool-9e38e34988d1589 
  < Exit [BeforeEach] Chaos Testing of GC - /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:212 @ 08/17/24 16:20:54.994 (1.086s)
  > Enter [It] Multiple resource types compete for a single IPPool. In scenarios of creation, scaling up/down, and deletion, GC all can correctly handle IP addresses. - /home/runner/work/spiderpool/spiderpool/test/e2e/reclaim/chaos_test.go:275 @ 08/17/24 16:20:54.995
  succeed to generate annotations map[ipam.spidernet.io/ippool:{"ipv4":["v4pool-a15d24dbf181e42"],"ipv6":["v6pool-9e38e34988d1589"]} v1.multus-cni.io/default-network:kube-system/macvlan-vlan0] 
  Succeeded in creating gc-chaos-ns-1740-6212094/gc-chaos-deploy-1-2054-995260084 at time 2024-08-17 16:20:59.995798619 +0000 UTC m=+200.277877970 
  Succeeded in creating gc-chaos-ns-1740-6212094/gc-chaos-kruise-sts-1-2054-995274632 at time 2024-08-17 16:21:07.996414739 +0000 UTC m=+208.278494080 

@weizhoublue
Copy link
Collaborator

Please reply to my comment from last Friday.

… causing IP conflict

Signed-off-by: ty-dc <tao.yang@daocloud.io>
Labels
cherrypick-release-v0.8 Cherry-pick the PR to branch release-v0.8. cherrypick-release-v0.9 cherrypick-release-v1.0 Cherry-pick the PR to branch release-v1.0. release/bug

Successfully merging this pull request may close these issues.

After the statefulset quickly experiences scale up/down, an IP conflict occurs in the statefulset Pod.