Alternate, Simplified Async Pull #137

Merged: 46 commits into warm-metal:main on Mar 4, 2024

Conversation

@imuni4fun (Contributor) commented Feb 5, 2024

This PR implements an alternative, simplified approach to synchronous and asynchronous image pulling. It leverages the previous, synchronous code and behavior as much as possible.

Effectively, only the actual pull activity is affected. Before v0.7.0, the pull activity was entirely synchronous and a timeout meant the pull was terminated. Now, the same flow is preserved, except that after the existing logic is evaluated and a decision to pull is made, a check is done to see if the pull is already in progress because it was previously attempted and timed out.

Many factors complicate the simple timeout-based cancellation that this asynchronous pull feature addresses. Proper modularity allows simpler, more robust tests of individual pieces of logic. The changes to existing code are minimal: the new behavior is implemented in a modular way in new files (a new package), and it is more easily tested because more of the system can be mocked.

For another comparison, which does not have to back out as many new changes, see https://github.com/warm-metal/container-image-csi-driver/compare/v0.7.0...imuni4fun:container-image-csi-driver:async-pull-simplified-diff?expand=1
That comparison is missing a few minor tweaks, but it illustrates how minimal the changes to existing code are.

Created issue #147 per a request in the community standup. It shares a little more background on this PR.
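The core idea, reduced to a minimal sketch: deduplicate pulls by image so a caller that times out can re-attach to the still-running pull on its next attempt. The types and function names below are illustrative assumptions, not the PR's actual API.

```go
package remoteimageasync

import (
	"context"
	"sync"
)

// session tracks one in-flight image pull (illustrative only).
type session struct {
	done chan struct{} // closed when the pull finishes
	err  error         // valid to read only after done is closed
}

// AsyncPuller deduplicates pulls by image so a caller that timed out
// can re-attach to the still-running pull on its next attempt.
type AsyncPuller struct {
	mu       sync.Mutex
	sessions map[string]*session
	pull     func(ctx context.Context, image string) error
}

func NewAsyncPuller(pull func(ctx context.Context, image string) error) *AsyncPuller {
	return &AsyncPuller{sessions: map[string]*session{}, pull: pull}
}

// StartPull returns the existing session for image if a pull is already
// running, otherwise it starts one in the background.
func (p *AsyncPuller) StartPull(image string) *session {
	p.mu.Lock()
	defer p.mu.Unlock()
	if s, ok := p.sessions[image]; ok {
		return s // a previous attempt timed out but the pull is still going
	}
	s := &session{done: make(chan struct{})}
	p.sessions[image] = s
	go func() {
		s.err = p.pull(context.Background(), image)
		close(s.done)
		p.mu.Lock()
		delete(p.sessions, image)
		p.mu.Unlock()
	}()
	return s
}

// Wait blocks until the pull completes or the caller's context expires.
// On timeout the pull keeps running; a later call can re-attach to it.
func Wait(ctx context.Context, s *session) error {
	select {
	case <-s.done:
		return s.err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```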

@imuni4fun imuni4fun requested a review from a team as a code owner February 5, 2024 20:41
@imuni4fun (Author) commented:

Note that I still need to update logging to use the chosen logger rather than the fmt.Printf() calls I had used in my linked example of the pattern. Logging has also changed since the initial integration I did pre-v0.7.0. Everything other than a few logging lines is ready for evaluation.

@imuni4fun (Author) commented:

Logging is now updated to use klog. Verbose logging of the AsyncPuller internals can be enabled by increasing log verbosity, though I was having trouble passing this through go test to get that higher-verbosity output.
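For what it's worth, one way to get klog's -v flag into a test binary is to register klog's flags in a TestMain and pass the flag after go test's -args separator. A minimal sketch, assuming the test package does not already define TestMain:

```go
package remoteimageasync

import (
	"flag"
	"os"
	"testing"

	"k8s.io/klog/v2"
)

// TestMain registers klog's flags (-v, -logtostderr, ...) on the default
// flag set so they can be set from the go test command line.
func TestMain(m *testing.M) {
	klog.InitFlags(nil)
	flag.Parse()
	os.Exit(m.Run())
}
```

Flags after -args are passed to the test binary itself, e.g. go test -count=1 -v ./pkg/remoteimageasync/... -args -v=5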

@imuni4fun (Author) commented:

go test -count=1 -v ./pkg/remoteimageasync/...
=== RUN TestPullDuration
pullerMock: starting to pull image nginx:6
pullerMock: starting to pull image nginx:2
pullerMock: starting to pull image nginx:7
pullerMock: starting to pull image nginx:1
pullerMock: starting to pull image nginx:4
pullerMock: starting to pull image nginx:3
pullerMock: starting to pull image nginx:8
pullerMock: finshed pulling image nginx:1
pullerMock: finshed pulling image nginx:2
pullerMock: finshed pulling image nginx:3
pullerMock: finshed pulling image nginx:4
pullerMock: context cancelled
pullerMock: context cancelled
pullerMock: context cancelled
--- PASS: TestPullDuration (5.00s)
=== RUN TestParallelPull
pullerMock: starting to pull image nginx:2
pullerMock: finshed pulling image nginx:2
--- PASS: TestParallelPull (3.00s)
=== RUN TestSerialResumedSessions
pullerMock: starting to pull image nginx:A
pullerMock: finshed pulling image nginx:A
--- PASS: TestSerialResumedSessions (6.00s)
=== RUN TestParallelResumedSessions
pullerMock: starting to pull image nginx:A
pullerMock: finshed pulling image nginx:A
--- PASS: TestParallelResumedSessions (6.00s)
PASS
ok github.com/warm-metal/container-image-csi-driver/pkg/remoteimageasync 20.607s

@mugdha-adhav (Collaborator) commented Feb 8, 2024

Node plugin pod seems to be crashing due to:

container-image-csi-driver-warm-metal-csi-driver-nodeplugizd67g csi-plugin I0208 11:19:14.556750       1 provider.go:82] Docker config file not found: couldn't find valid .dockercfg after checking in [  /root /]
container-image-csi-driver-warm-metal-csi-driver-nodeplugizd67g csi-plugin panic: runtime error: invalid memory address or nil pointer dereference
container-image-csi-driver-warm-metal-csi-driver-nodeplugizd67g csi-plugin [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x19f7b2c]
container-image-csi-driver-warm-metal-csi-driver-nodeplugizd67g csi-plugin
container-image-csi-driver-warm-metal-csi-driver-nodeplugizd67g csi-plugin goroutine 44 [running]:
container-image-csi-driver-warm-metal-csi-driver-nodeplugizd67g csi-plugin github.com/warm-metal/container-image-csi-driver/pkg/remoteimage.puller.Pull({{0x0, 0x0}, {0x22a1800, 0x400017c8c0}, {0x2299df8, 0x40007317a0}}, {0x22b3d90, 0x4000a25770})
container-image-csi-driver-warm-metal-csi-driver-nodeplugizd67g csi-plugin 	/go/src/container-image-csi-driver/pkg/remoteimage/pull.go:47 +0x13c
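The trace points at a nil dereference inside puller.Pull (pkg/remoteimage/pull.go:47); the first struct field printed as {0x0, 0x0} is a zero-valued interface. A minimal sketch of the kind of guard that surfaces this failure cleanly; the imageSvc field and PullImage method here are hypothetical stand-ins, not the driver's actual types:

```go
package remoteimage

import (
	"context"
	"fmt"
)

// Illustrative stand-ins; the real types live in pkg/remoteimage/pull.go.
type imageService interface {
	PullImage(ctx context.Context, ref string) error
}

type puller struct {
	imageSvc imageService // the zero-valued interface seen in the trace (hypothetical name)
	image    string
}

// Pull guards the possibly-nil client so a misconfigured puller returns
// an error instead of the SIGSEGV shown above.
func (p puller) Pull(ctx context.Context) error {
	if p.imageSvc == nil {
		return fmt.Errorf("puller: image service client is nil, cannot pull %s", p.image)
	}
	return p.imageSvc.PullImage(ctx, p.image)
}
```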

@@ -51,8 +51,8 @@ var (
	enableCache = flag.Bool("enable-daemon-image-credential-cache", true,
		"Whether to save contents of imagepullsecrets of the daemon ServiceAccount in memory. "+
			"If set to false, secrets will be fetched from the API server on every image pull.")
	asyncImagePullMount = flag.Bool("async-pull-mount", false,
		"Whether to pull images asynchronously (helps prevent timeout for larger images)")
	asyncImagePullTimeout = flag.Duration("async-pull-timeout", 0,
Review comment (Contributor):

👍

@imuni4fun (Author) commented:

@mugdha-adhav could you kick off the integration tests again? Thanks.

…not updated to remove completionChan depth, and the build targets still succeeded. Should have failed integration tests if executed.
"github.com/stretchr/testify/assert"
)

type chanTestStruct struct {
Review comment (Contributor):

Can we use the actual session object here?

}

// demonstrates session channel structure's pass-by-reference is appropriate
func TestChannelStructContent(t *testing.T) {
Review comment (Contributor):

Same comment as above. If these tests are not directly or indirectly related to the synchronizer code, I think we should put them in a separate file.
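As a standalone illustration of the pass-by-reference property that test asserts (a minimal sketch, independent of the PR's actual session type): sending a pointer through a channel copies the pointer, not the struct, so both ends observe the same underlying value.

```go
package main

import "fmt"

type session struct{ result string }

func main() {
	ch := make(chan *session, 1)
	s := &session{}
	ch <- s // the pointer is copied into the channel, not the struct
	got := <-ch
	got.result = "done"             // write through the received pointer
	fmt.Println(s.result == "done") // true: the sender observes the write
}
```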

@vadasambar (Contributor) commented:

Overall LGTM provided we

@imuni4fun (Author) commented Feb 26, 2024

Summary of discussion with @vadasambar:

I agree that it would be good to build out additional regression tests at higher layers of the onion... in this case at the call to NodePublishVolume() with a mocked containerd being called over gRPC.

I believe this additional testing does not exercise additional code paths beyond one case, where there is interplay between the caller timeout (kubelet) and the actual pull time when the pull policy is always. In that case, an image pull that takes too long and exceeds the 2m caller timeout, but ALSO completes BEFORE the next call to continue (i.e., it completes during the backoff waiting period), will start a new pull on the next call. For example, a pull that reliably takes 2m30s would time out at 2m, finish at 2m30s while the caller is backing off, and then be restarted from scratch by the next attempt. An image in this scenario will never succeed. Because it would never have succeeded previously either (it would just time out and be cancelled), I consider this the same behavior.

The above scenario exercises the pre-existing pull-always logic in a new context where many more scenarios CAN now succeed. Anything that cannot succeed after the change would also not have succeeded before it.

I am happy to grow the regression tests in a future change, but I don't think adding these new layers should block this feature.

@imuni4fun (Author) commented:

@vadasambar @mugdha-adhav could you review? Thanks.

@imuni4fun (Author) commented:

@mugdha-adhav I created a meeting to wrap up tomorrow.

Could you kick off the integration tests again? It looks like it might need another run, but it's not clear from my end.

The last commit was just comments; the previous commit had passing integration tests.

@imuni4fun (Author) commented Feb 27, 2024

> Also why did we remove the metrics test?

I didn't remove the test. I removed the "rendered" version, which is generated from https://github.com/imuni4fun/container-image-csi-driver/blob/4c6d7160fa72408225477aad4b061765753095e8/test/integration/metrics-manifests/test.yaml#L1

The rendered test changes each time and should be ignored. I added it to the .gitignore file in a different PR/issue:
☝️ https://github.com/warm-metal/container-image-csi-driver/pull/148/files#diff-bc37d034bad564583790a46f19d807abfe519c5671395fd494d8cce506c42947

{{- if .Values.enableAsyncPullMount }}
- --async-pull-mount=true
{{- if .Values.enableAsyncPull }}
- --async-pull-timeout=10m
Review comment (Collaborator):

Could we pass this timeout value from helm values?

Review comment (Collaborator):

After discussing with @imuni4fun, we decided that we want to keep enableAsyncPull and add a default timeout value passed in from values.yaml, something like --async-pull-timeout={{ .Values.asyncPullTimeout }}.
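A sketch of what that could look like; the key names (enableAsyncPull, asyncPullTimeout) are assumptions drawn from this discussion, not the chart's final values:

```yaml
# values.yaml (hypothetical defaults)
enableAsyncPull: true
asyncPullTimeout: 10m
```

```yaml
# In the node plugin's container args template (sketch):
{{- if .Values.enableAsyncPull }}
- --async-pull-timeout={{ .Values.asyncPullTimeout }}
{{- end }}
```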

pkg/remoteimage/pull.go (review thread resolved)
@imuni4fun (Author) commented Feb 29, 2024

@mugdha-adhav I pushed the changes called out in review, plus those from @woehrl01 above, along with one small replacement of docker.Named.String() with Puller.ImageWithTag().

The ephemeral-volume-set test passed in a local kind cluster.

Ready for the last integration tests and merge.

@mugdha-adhav merged commit ade923f into warm-metal:main on Mar 4, 2024. 7 checks passed.