
Use backoff to tolerate race condition. #1894

Merged 1 commit into google:master on Feb 23, 2018

Conversation

@Random-Liu (Member) commented on Feb 23, 2018:

I keep seeing this error in the kubelet log:

```
Feb 23 09:31:05 workstation kubelet[11956]: I0223 09:31:05.336135   11956 factory.go:105] Error trying to work out if we can handle /kubepods/burstable/pod445bc55c-187c-11e8-bb75-42010af00002/de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373: error inspecting container: Error: No such container: de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373
Feb 23 09:31:05 workstation kubelet[11956]: I0223 09:31:05.336156   11956 factory.go:116] Factory "docker" was unable to handle container "/kubepods/burstable/pod445bc55c-187c-11e8-bb75-42010af00002/de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373"
Feb 23 09:31:05 workstation kubelet[11956]: I0223 09:31:05.336733   11956 factory.go:112] Using factory "containerd" for container "/kubepods/burstable/pod445bc55c-187c-11e8-bb75-42010af00002/de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373"
Feb 23 09:31:05 workstation kubelet[11956]: W0223 09:31:05.337858   11956 manager.go:1178] Failed to process watch event {EventType:0 Name:/kubepods/burstable/pod445bc55c-187c-11e8-bb75-42010af00002/de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373 WatchSource:0}: task de9b277dbb62d6e2bc2e372f190c84e227986acd40f6e2f2af8a115a51061373 not found: not found
```

The reason is that the container cgroup is created in the middle of task creation, so there is a race condition where cadvisor sees the cgroup but the corresponding task hasn't been fully created in containerd yet.

There is no such problem for docker, because docker has an internal lock that makes sure Inspect only returns after container start has finished. CRI-Containerd, and I believe cri-o, do the same thing. cadvisor is actually relying on container runtime internal implementation details here.

However, here we are talking to containerd directly, which still has this race condition.

This PR adds a retry to avoid the problem for now; we should come up with a better fix in the next release.
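
For illustration, the following is a minimal, self-contained sketch of the retry-with-backoff idea, not the exact code merged here: keep retrying the task lookup while containerd reports "not found", sleep between attempts, and give up after a bounded number of tries. The `getTaskPidWithRetry` helper, the fake `lookup` callback, and the doubling backoff are assumptions made for the example; only the use of `errdefs.IsNotFound` and the 100ms starting backoff are taken from the change itself.

```go
// Sketch only: retry a containerd task lookup that can race with cgroup
// creation. The lookup callback is a hypothetical stand-in; only the use
// of errdefs.IsNotFound and the 100ms starting backoff come from the PR.
package main

import (
	"fmt"
	"time"

	"github.com/containerd/containerd/errdefs"
)

// getTaskPidWithRetry keeps retrying lookup while containerd reports
// "not found", sleeping between attempts, and gives up after 5 tries.
func getTaskPidWithRetry(lookup func() (uint32, error)) (uint32, error) {
	backoff := 100 * time.Millisecond
	for retry := 5; retry > 0; retry-- {
		pid, err := lookup()
		if err == nil {
			return pid, nil
		}
		// Bail out on unexpected errors, or when this was the last attempt.
		if !errdefs.IsNotFound(err) || retry == 1 {
			return 0, err
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return 0, fmt.Errorf("task lookup retries exhausted")
}

func main() {
	attempts := 0
	// Fake lookup: "not found" on the first attempt, mimicking the race
	// where the cgroup exists before the containerd task is registered.
	lookup := func() (uint32, error) {
		attempts++
		if attempts < 2 {
			return 0, errdefs.ErrNotFound
		}
		return 4242, nil
	}
	fmt.Println(getTaskPidWithRetry(lookup))
}
```

Bounding the retries keeps a genuinely missing task from stalling the watch-event handler indefinitely, while the backoff gives containerd time to finish registering the task.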

I've validated this PR, and it works for me.

/cc @abhi @dashpole

Signed-off-by: Lantao Liu lantaol@google.com

```go
// `ContainerStatus` only returns result after `StartContainer` finishes.
var taskPid uint32
backoff := 100 * time.Millisecond
for retry := 5; retry > 0; retry-- {
```
Collaborator:

so after 5 retries we continue on? Will we not get any metrics in this case? Would it be better just to return an error so it doesn't fail silently?

Member (Author):

... my bad

Member (Author):

Done

Signed-off-by: Lantao Liu <lantaol@google.com>
@Random-Liu force-pushed the avoid-containerd-race branch from f484c62 to d5ee05f on February 23, 2018 22:04
```go
}
retry--
if !errdefs.IsNotFound(err) || retry == 0 {
	return nil, err
```
Collaborator:

If the retry == 0 case is hit, we will return err = nil. This will probably result in a nil pointer somewhere down the road. Can we return a new error for this case?

Member (Author):

No, we check err == nil before this, right?

Collaborator:

ah, right
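
To make the point settled in this thread concrete, here is a small annotated control-flow sketch (hypothetical names, not the merged diff): the success path returns before the retry bookkeeping, so the branch guarded by `retry == 0` can only ever be reached with a non-nil error.

```go
// Control-flow sketch only (hypothetical names, not the merged diff).
package sketch

import (
	"time"

	"github.com/containerd/containerd/errdefs"
)

func waitForTaskPid(loadTask func() (uint32, error)) (uint32, error) {
	backoff := 100 * time.Millisecond
	retry := 5
	for {
		pid, err := loadTask()
		if err == nil {
			// The success path returns here, before any bookkeeping below.
			return pid, nil
		}
		retry--
		// This branch is therefore only reachable with err != nil, so the
		// function never returns a nil error, even when retries run out.
		if !errdefs.IsNotFound(err) || retry == 0 {
			return 0, err
		}
		time.Sleep(backoff)
	}
}
```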

@dashpole (Collaborator) left a comment:

lgtm

@dashpole merged commit b817801 into google:master on Feb 23, 2018
@Random-Liu deleted the avoid-containerd-race branch on February 23, 2018 22:32
@abhi (Contributor) left a comment:

LGTM

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request on Mar 8, 2018:
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Update cadvisor to v0.29.1

Update cadvisor to v0.29.1 to include a bug fix for containerd integration. google/cadvisor#1894

**Release note**:

```release-note
none
```