Use backoff to tolerate race condition. #1894
Conversation
container/containerd/handler.go
Outdated
```go
// `ContainerStatus` only returns result after `StartContainer` finishes.
var taskPid uint32
backoff := 100 * time.Millisecond
for retry := 5; retry > 0; retry-- {
```
So after 5 retries we continue on? Will we not get any metrics in this case? Would it be better just to return an error so it doesn't fail silently?
... my bad
Done
Signed-off-by: Lantao Liu <lantaol@google.com>
f484c62 to d5ee05f
```go
}
retry--
if !errdefs.IsNotFound(err) || retry == 0 {
	return nil, err
```
If the retry == 0 case is hit, we will return a nil err. This will probably result in a nil pointer somewhere down the road. Can we return a new error for this case?
No, we check `err == nil` before this, right?
ah, right
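For context, here is a rough sketch of how the loop reads once the two fragments above are stitched together (not the verbatim cadvisor code; `queryTaskPid` is a hypothetical stand-in for the containerd task lookup). The `err == nil` check exits the loop before the retry branch is evaluated, so the error returned when `retry` reaches 0 is always non-nil:

```go
for {
	taskPid, err = queryTaskPid(ctx, containerID) // hypothetical helper, not the actual cadvisor call
	if err == nil {
		break // success: the branch below is never reached with a nil error
	}
	retry--
	// Only reachable when err != nil, so this never returns a nil error.
	if !errdefs.IsNotFound(err) || retry == 0 {
		return nil, err
	}
	time.Sleep(backoff)
}
```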
lgtm
LGTM
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Update cadvisor to v0.29.1

Update cadvisor to v0.29.1 to include a bug fix for containerd integration. google/cadvisor#1894

**Release note**:
```release-note
none
```
I keep seeing this error in the kubelet log:
The reason is that the container cgroup is created in the middle of task creation, and there is a race condition where cadvisor sees the cgroup, but the corresponding task hasn't been fully created yet in containerd.
There is no such problem for docker, because docker has an internal lock to make sure `Inspect` only returns after container start has finished. CRI-Containerd and, I believe, cri-o do the same thing. Actually, cadvisor is relying on container runtime internal implementation details here. However, here we are talking to containerd directly, which still has this race condition.
This PR adds a retry with backoff to avoid this problem for now, and we should come up with a better fix next release.
I've validated this PR, and it works for me.
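For illustration, a self-contained sketch of the retry-with-backoff approach described above. `lookupTaskPid` and `taskPidWithRetry` are placeholder names rather than the actual cadvisor functions, and the doubling of the backoff between attempts is an assumption of this sketch, not something stated in the PR:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/containerd/containerd/errdefs"
)

// lookupTaskPid is a placeholder for the containerd call that returns the
// PID of a container's task; here it always fails with NotFound, which is
// what the real call returns while the task is still being created.
func lookupTaskPid(ctx context.Context, id string) (uint32, error) {
	return 0, errdefs.ErrNotFound
}

// taskPidWithRetry retries the lookup a bounded number of times, sleeping
// between attempts, so that a cgroup observed in the middle of task creation
// does not immediately cause a hard failure.
func taskPidWithRetry(ctx context.Context, id string) (uint32, error) {
	backoff := 100 * time.Millisecond
	retry := 5
	for {
		pid, err := lookupTaskPid(ctx, id)
		if err == nil {
			return pid, nil
		}
		retry--
		// Give up on any error other than NotFound, or once retries are exhausted.
		if !errdefs.IsNotFound(err) || retry == 0 {
			return 0, fmt.Errorf("getting task pid for %q: %w", id, err)
		}
		time.Sleep(backoff)
		backoff *= 2 // assumption: double the wait between attempts
	}
}

func main() {
	_, err := taskPidWithRetry(context.Background(), "example-container")
	fmt.Println(errors.Is(err, errdefs.ErrNotFound)) // true with the placeholder above
}
```

Bounding the retries keeps a genuinely missing task from stalling metric collection indefinitely, while the short sleeps give containerd time to finish creating the task.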
/cc @abhi @dashpole
Signed-off-by: Lantao Liu <lantaol@google.com>