Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added debug log of AllocatedIPCount of ippool #3926

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ty-dc
Copy link
Collaborator

@ty-dc ty-dc commented Aug 22, 2024

Thanks for contributing!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes ##3771

Special notes for your reviewer:

@ty-dc ty-dc added the pr/not-ready not ready for merging label Aug 22, 2024
Copy link

codecov bot commented Aug 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.84%. Comparing base (b64851c) to head (a9069ca).
Report is 6 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #3926   +/-   ##
=======================================
  Coverage   80.83%   80.84%           
=======================================
  Files          51       51           
  Lines        4514     4516    +2     
=======================================
+ Hits         3649     3651    +2     
  Misses        699      699           
  Partials      166      166           
Flag Coverage Δ
unittests 80.84% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pkg/ippoolmanager/ippool_manager.go 89.38% <100.00%> (+0.09%) ⬆️

@@ -371,6 +379,18 @@ func (s *SpiderGC) executeScanAll(ctx context.Context) {
}
}
}

if *pool.Status.AllocatedIPCount != int64(tempAllocatedIPCount) {
Copy link
Collaborator

@weizhoublue weizhoublue Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这肯定是会产生 新 bug 的,
更新 ipool 都是 基于 resource version, 这个统计 必须是 在 基于 相同的 resource version
前面在 统计 ip 使用量 过程中,一旦 成功 做了 gc ip , resource version 就变了。 再 拿着 老的 基于 老 resource version 统计的 ip用量 来更新 ,如果此时有 ip 分配或者 释放的 并发,或者 ip 池 ip 的添加,这个老的 统计ip 用量 再 刷进去 就是 错误的

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UpdateAllocatedIPCount 判断了更新 apierrors.IsConflict(err) ?如果 resource version 变了,打印日志,更新失败。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

你已经再想想,你统计用量时的version 和你刷新的version 是同一个么?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

的确不是同一个。

@ty-dc ty-dc force-pushed the fix/error-allocatedipcount branch from e1cda70 to d2cedea Compare August 23, 2024 02:16
func (im *ipPoolManager) UpdateAllocatedIPCount(ctx context.Context, poolName string, allocatedIPCount *int64) error {
logger := logutils.FromContext(ctx)

ipPool, err := im.GetIPPoolByName(ctx, poolName, constant.IgnoreCache)
Copy link
Collaborator

@weizhoublue weizhoublue Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个version 有啥用,不是 统计状态时的version,这期间的统计值 变了多少哦,你没弄明白version的意义

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

整个pr思路错了

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已纠正。

@@ -352,6 +359,7 @@ func (s *SpiderGC) executeScanAll(ctx context.Context) {

GCIP:
if flagGCIPPoolIP {
tempAllocatedIPCount--
Copy link
Collaborator

@weizhoublue weizhoublue Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个代码写的复杂了,不就统计下 已用 ip 中的数组成员 数量,就出来了?
也不需要 加来减去的

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

赞同。

@weizhoublue
Copy link
Collaborator

weizhoublue commented Aug 23, 2024

需要找出 用量统计的根因,gc并不能解决真正的问题,只是事后补救,这是两个纬度

@ty-dc
Copy link
Collaborator Author

ty-dc commented Aug 23, 2024

需要找出 用量统计的根因,gc并不能解决真正的问题,只是事后补救,这是两个纬度

image
原本分配 IP 的流程,是在已有的 allcateIPCount 数量++,如果分配地址时,并发进行一个在分配、一个在释放,而释放的时候 AllocatedIPCount 还没有完成 -- 这个动作,而分配的时候拿到的是未 -- 的 AllocatedIPCount ,并进行了 ++ ,从而导致出现了统计出现不正确的现象(猜测,暂未能证实猜想)

@ty-dc ty-dc force-pushed the fix/error-allocatedipcount branch from d2cedea to 42941c7 Compare August 23, 2024 08:30
@@ -586,6 +586,23 @@ var _ = Describe("test ip with reclaim ip case", Label("reclaim"), func() {
// Delete Pod
Expect(frame.DeletePod(podName, namespace)).To(Succeed(), "Failed to delete pod %v/%v\n", namespace, podName)
GinkgoWriter.Printf("succeed to delete pod %v/%v\n", namespace, podName)

// Check whether the dirty IP data is recovered successfully and whether the AllocatedIPCount decreases and meets expectations?
Copy link
Collaborator Author

@ty-dc ty-dc Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

用例是在手动更新 AllocatedIPCount 为一个错误的值,期望在新的改进引入后,分配 IP 或则 释放 IP,或则 gc all 回收异常 IP,都能够纠正这个 AllocatedIPCount 值。

IPPool 的状态是健壮的。

@ty-dc ty-dc added release/bug cherrypick-release-v0.8 Cherry-pick the PR to branch release-v0.8. cherrypick-release-v0.9 cherrypick-release-v1.0 Cherry-pick the PR to branch release-v1.0. and removed pr/not-ready not ready for merging labels Aug 23, 2024
@ty-dc ty-dc force-pushed the fix/error-allocatedipcount branch from 42941c7 to e7ee784 Compare August 23, 2024 09:25
@weizhoublue weizhoublue added pr/not-ready not ready for merging and removed cherrypick-release-v0.8 Cherry-pick the PR to branch release-v0.8. cherrypick-release-v0.9 cherrypick-release-v1.0 Cherry-pick the PR to branch release-v1.0. labels Sep 3, 2024
@ty-dc ty-dc removed the pr/not-ready not ready for merging label Sep 9, 2024
@weizhoublue
Copy link
Collaborator

(1) 这个日志 真有 能帮助 debug 问题根因么
(2)这个 pr 的 title 并不能 fix

@ty-dc ty-dc changed the title fix: the statistics of ippool's status.AllocatedIPCount are inaccurate Added debug log of AllocatedIPCount of ippool Sep 10, 2024
@ty-dc ty-dc removed the release/bug label Sep 10, 2024
@ty-dc ty-dc added the release/none no release note label Sep 10, 2024
@ty-dc
Copy link
Collaborator Author

ty-dc commented Sep 10, 2024

(1) 这个日志 真有 能帮助 debug 问题根因么

当前没有任何日志去获悉 ippool 的 AllocatedIPCount 数量在什么时候开始出现异常的。AllocatedIPCount++,AllocatedIPCount-- 只出现在文中两处,补充这样的日志,当 IPPool 分配与释放 IP 打印当前的 AllocatedIPCount,能够知道在哪一次分配和释放出现的记录错误的问题。

@weizhoublue
Copy link
Collaborator

weizhoublue commented Sep 12, 2024

(1) 这个日志 真有 能帮助 debug 问题根因么

当前没有任何日志去获悉 ippool 的 AllocatedIPCount 数量在什么时候开始出现异常的。AllocatedIPCount++,AllocatedIPCount-- 只出现在文中两处,补充这样的日志,当 IPPool 分配与释放 IP 打印当前的 AllocatedIPCount,能够知道在哪一次分配和释放出现的记录错误的问题。

你再想想,你这行 日志 真的能 进行 生产监控么
一直在这里 debug 级别 打印 数字

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release/none no release note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants