This repository has been archived by the owner on Mar 9, 2022. It is now read-only.

reload cni network config if has fs change events #1405

Merged
merged 1 commit into containerd:master from fuweid:me-async-load-cnicnf on Apr 15, 2020

Conversation

fuweid
Member

@fuweid fuweid commented Feb 27, 2020

With Go's RWMutex design, once one goroutine calls Lock, no new
goroutine can acquire a read lock until the existing read locks have
been released.

The original design reloads the cni network config on every single
Status CRI gRPC call. If one RunPodSandbox request holds the read lock
to allocate an IP for too long, all other RunPodSandbox/StopPodSandbox
requests will wait for the RunPodSandbox request to release the read
lock. The Status CRI call will then fail and the kubelet becomes
NotReady.

Reloading the cni network config on every single Status CRI call is
not necessary and also brings about the NotReady situation. To lower
the possibility of NotReady, CRI will now reload the cni network config
only if there are valid fs change events from the cni network config
dir.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
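The RWMutex behavior the message describes is easy to reproduce. Below is a minimal, illustrative Go sketch (not code from this PR): one long-lived reader plus one pending writer is enough to block every later reader, which is exactly how a slow RunPodSandbox plus the per-Status reload could starve the other CRI handlers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // long-lived reader, standing in for a slow RunPodSandbox
	go func() {
		mu.Lock() // pending writer, standing in for the per-Status config reload
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // give the writer time to queue up

	acquired := make(chan struct{})
	go func() {
		mu.RLock() // a new reader, standing in for another CRI request
		mu.RUnlock()
		close(acquired)
	}()

	select {
	case <-acquired:
		fmt.Println("new read lock acquired")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("new RLock is stuck behind the pending Lock")
	}
	mu.RUnlock() // releasing the first reader lets everyone proceed
}
```

This matches the sync package's documented contract: once Lock is pending, later RLock calls wait, so readers are not guaranteed to make progress past a queued writer.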

@fuweid
Member Author

fuweid commented Feb 27, 2020

```
ERROR: gcloud crashed (MetadataServerException): HTTP Error 400: Bad Request
If you would like to report this issue, please run the following command:
  gcloud feedback
```

no idea what it is. 😢

@fuweid
Member Author

fuweid commented Feb 27, 2020

/retest

@fuweid fuweid changed the title Status: async reload latest CNI network conf [wip] Status: async reload latest CNI network conf Feb 27, 2020
@fuweid
Member Author

fuweid commented Feb 27, 2020

Forgot one thing: an RLock will wait while a WLock is pending, so looping to reload the CNI network conf can still cause that NotReady issue. We need inotify to load the conf on demand.
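For reference, a rough sketch of that on-demand approach using github.com/fsnotify/fsnotify (inotify-backed on Linux). The reload function and the event filtering here are illustrative assumptions, not the code this PR eventually merged:

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// reload is a hypothetical stand-in for re-reading the CNI config;
// the real change wires this into the CRI service's network plugin.
func reload(confDir string) {
	log.Printf("reloading cni network config from %s", confDir)
}

func main() {
	confDir := "/etc/cni/net.d" // default CNI conf dir

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatalf("failed to create fsnotify watcher: %v", err)
	}
	defer watcher.Close()

	if err := watcher.Add(confDir); err != nil {
		log.Fatalf("failed to watch %s: %v", confDir, err)
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			// Treat create/write/remove/rename as "valid" changes;
			// a bare chmod does not alter the config content.
			if event.Op&(fsnotify.Create|fsnotify.Write|fsnotify.Remove|fsnotify.Rename) == 0 {
				continue
			}
			reload(confDir)
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Printf("fsnotify error: %v", err)
		}
	}
}
```

With this shape, the config is re-read only when the directory actually changes, so a Status call never has to take the write lock on the hot path.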

@mikebrow
Member

```
ERROR: gcloud crashed (MetadataServerException): HTTP Error 400: Bad Request
If you would like to report this issue, please run the following command:
  gcloud feedback
```

no idea what it is. 😢

@yujuhong @Random-Liu still having problems running infra on gcloud is this a known issue? thoughts?

Member

@mikebrow mikebrow left a comment


I know this is wip :) but thought I'd give some early comments.

pkg/server/service.go (Outdated)
pkg/server/status.go
pkg/server/service.go (Outdated)
@fuweid fuweid force-pushed the me-async-load-cnicnf branch from 80027f4 to b796ebf Compare March 8, 2020 08:38
@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Mar 8, 2020
@fuweid fuweid force-pushed the me-async-load-cnicnf branch from b796ebf to 032812b Compare March 8, 2020 08:39
@fuweid fuweid changed the title [wip] Status: async reload latest CNI network conf reload cni network config if has fs change events Mar 8, 2020
@fuweid fuweid force-pushed the me-async-load-cnicnf branch 2 times, most recently from 8499fd7 to faadc5f Compare March 8, 2020 10:45
@fuweid fuweid force-pushed the me-async-load-cnicnf branch from faadc5f to 59adb48 Compare March 8, 2020 11:09
@fuweid
Member Author

fuweid commented Mar 8, 2020

@mikebrow updated. PTAL. Thanks


It seems that cri-containerd-node-e2e fails because containerd doesn't start. Any idea how to check the log?

@fuweid
Member Author

fuweid commented Mar 9, 2020

/retest

@mikebrow
Member

mikebrow commented Mar 9, 2020

/test pull-cri-containerd-node-e2e

@mikebrow
Member

mikebrow commented Mar 9, 2020

@mikebrow updated. PTAL. Thanks

Any idea how to check the log?

Sorry, I don't. @Random-Liu, how would one get a containerd log dump? Would we have to insert a print of the log at failure?

Member

@mikebrow mikebrow left a comment


looking good! see nits

pkg/server/cni_conf_syncer.go (Outdated)
vendor.conf (Outdated)
pkg/server/service.go (Outdated)
@fuweid fuweid force-pushed the me-async-load-cnicnf branch 2 times, most recently from 36391d9 to 0e29917 Compare March 12, 2020 12:46
@fuweid
Member Author

fuweid commented Mar 30, 2020

ping @Random-Liu, and how do I check the containerd log in e2e? 😂

@fuweid fuweid force-pushed the me-async-load-cnicnf branch from 0e29917 to 6ef19c4 Compare April 3, 2020 03:22
@fuweid
Member Author

fuweid commented Apr 3, 2020

@mikebrow I checked how kubernetes does it: run with extra logs and upload them into the artifacts :) finally got it

@fuweid fuweid force-pushed the me-async-load-cnicnf branch from 6ef19c4 to cf4b4d9 Compare April 3, 2020 03:33
@fuweid fuweid force-pushed the me-async-load-cnicnf branch 7 times, most recently from 84c1bb9 to 19f6bee Compare April 3, 2020 04:22
reload cni network config if has fs change events
@fuweid fuweid force-pushed the me-async-load-cnicnf branch from 19f6bee to 4ce334a Compare April 3, 2020 04:29
```go
	return nil, errors.Wrap(err, "failed to create fsnotify watcher")
}

if err := os.MkdirAll(confDir, 0700); err != nil {
```
Member Author


the previous test failed because the conf dir doesn't exist.
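In context, the fix amounts to creating the directory before registering the watch, since fsnotify's Add returns an error for a missing path. A hedged reconstruction of the surrounding function follows; the function name and wrapping messages are assumptions based on the snippet above, not the merged code:

```go
package server

import (
	"os"

	"github.com/fsnotify/fsnotify"
	"github.com/pkg/errors"
)

// newConfDirWatcher creates the cni conf dir first, because fsnotify's
// Add fails when the watched directory does not exist (which is what
// broke the earlier test run).
func newConfDirWatcher(confDir string) (*fsnotify.Watcher, error) {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return nil, errors.Wrap(err, "failed to create fsnotify watcher")
	}

	if err := os.MkdirAll(confDir, 0700); err != nil {
		return nil, errors.Wrapf(err, "failed to create cni conf dir %s", confDir)
	}

	if err := watcher.Add(confDir); err != nil {
		return nil, errors.Wrapf(err, "failed to watch cni conf dir %s", confDir)
	}
	return watcher, nil
}
```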

@mikebrow
Member

mikebrow commented Apr 3, 2020

@mikebrow I checked how kubernetes does it: run with extra logs and upload them into the artifacts :) finally got it

ah there it is in the artifacts/tmp* directory :-)

Member

@mikebrow mikebrow left a comment


LGTM

@fuweid
Member Author

fuweid commented Apr 15, 2020

ping @mikebrow, is it ok to move on?

@mikebrow mikebrow merged commit d531dc4 into containerd:master Apr 15, 2020
@mikebrow
Member

@fuweid yes we have a fix for the windows cri test ... it has been moved to github actions!

@fuweid fuweid deleted the me-async-load-cnicnf branch April 16, 2020 02:10
@fuweid
Member Author

fuweid commented Apr 16, 2020

@mikebrow great. Thanks!

leoluk pushed a commit to monogon-dev/monogon that referenced this pull request Mar 30, 2021
This unbreaks bbolt (as part of containerd) on Go 1.14+ (see etcd-io/bbolt#201 and
etcd-io/bbolt#220), pulls in my patch to ignore image-defined volumes
(containerd/cri#1504) and gets us some robustness fixes in containerd's CNI/CRI integration
(containerd/cri#1405). This also updates K8s at the same time, since they share a lot of
dependencies and only updating one is very annoying. On the K8s side we mostly get the
standard stream of fixes, and we can drop some patches that are no longer necessary.

One annoyance on the K8s side (though with no impact on functionality) is these messages in the logs of various
components:
```
W0714 11:51:26.323590       1 warnings.go:67] policy/v1beta1 PodSecurityPolicy is deprecated in v1.22+, unavailable in v1.25+
```
They are caused by KEP-1635, but there's no explanation of why this gets logged so aggressively, considering that
operators cannot do anything about it: there's no newer version of PodSecurityPolicy, and you are pretty much
required to use it if you use RBAC.

Test Plan: Covered by existing tests

Bug: T753

X-Origin-Diff: phab/D597
GitOrigin-RevId: f6c447da1de037c27646f9ec9f45ebd5d6660ab0