Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: async image pull and mount #71

Merged
merged 1 commit into from
Nov 30, 2023
Merged

feat: async image pull and mount #71

merged 1 commit into from
Nov 30, 2023

Conversation

vadasambar
Copy link
Contributor

@vadasambar vadasambar commented Oct 31, 2023

Problem

warm metal driver throws context deadline exceeded errors

  • when used in cluster with high pod churn/activity
  • when the size of the image is too large

This happens because

  • warm metal doesn't resume download of image in the second, third, fourth ... attempt if the first attempt failed. It starts from scratch every time
  • kubelet has a hard timeout of 2m for GRPC requests. Since warm metal serves the request synchronously, images have to be pulled within 2m.

Solution

We pull and mount images asynchronously so that:

  1. Even if the first request from kubelet fails (due to timeout), warm metal continues to pull and mount the image in the background so that next requests from kubelet would succeeed
  2. We expose asynchronous image pulling and mounting as a commandline option so that users can enable/disable it

Tests

With asynchronous pull and mount:
root@cdf7ee254501:/code/cmd/plugin# go test -run 'TestNodePublishVolumeAsync'
I1108 10:40:04.407812     853 mounter.go:45] load 0 snapshots from runtime
I1108 10:40:04.410577     853 server.go:108] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
I1108 10:40:06.419940     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=18b4d43e-1b30-4790-8233-02d8b627e334)
I1108 10:40:06.426363     853 pullexecutor.go:97] pull image "docker.io/library/redis:latest"
E1108 10:40:06.518674     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:06.622587     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=8bf3146b-424c-4981-96e6-cecdcf69abb8)
E1108 10:40:06.722786     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:06.822109     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=c2447ef6-ec61-4426-94e3-e40df32b8318)
E1108 10:40:06.922087     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:07.024885     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=1ba0ef13-b149-47bf-883c-87fbb6793d4b)
E1108 10:40:07.123843     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:07.224659     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=093b8bdd-412b-4a90-a00b-7bc06ee36e3f)
E1108 10:40:07.324687     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:07.420104     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=bcc63fe0-0031-449a-8e42-3a4e3d9ad2c0)
E1108 10:40:07.522951     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:07.624685     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=18383b25-6e39-4742-b0dd-df290692bfd4)
E1108 10:40:07.724119     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:07.820916     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=fa744426-0932-4a5f-a922-107ebe8a26b9)
E1108 10:40:07.920573     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:08.022154     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=10e42084-1393-4cc8-bbef-0aede91b2186)
E1108 10:40:08.124091     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:08.222235     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=16bf770d-ae06-43a8-86b0-21978ef39b12)
E1108 10:40:08.324075     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:08.422891     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=84a75da1-bfc7-411a-ad56-dcb1347191b3)
E1108 10:40:08.522908     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:08.620155     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=271cc66b-017e-4d4f-becc-c759a60b2b9b)
E1108 10:40:08.723800     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:08.820847     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=13e3d579-822a-4f35-9007-c10bb51835ad)
E1108 10:40:08.920222     853 node_server_test.go:123] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:40:08.938543     853 containerd.go:82] image "docker.io/library/redis:latest" unpacked
I1108 10:40:08.941227     853 mounter.go:177] refer read-only snapshot of image "docker.io/library/redis:latest" with key "csi-image.warm-metal.tech-sha256:9075a3a4d047ee389b94ebcce57958440bf0b07dc7de1b69df90b75a57665c28"
I1108 10:40:08.941259     853 mounter.go:116] create snapshot "csi-image.warm-metal.tech-sha256:9075a3a4d047ee389b94ebcce57958440bf0b07dc7de1b69df90b75a57665c28" of image "sha256:9075a3a4d047ee389b94ebcce57958440bf0b07dc7de1b69df90b75a57665c28" and refer it
I1108 10:40:08.941296     853 containerd.go:99] create ro snapshot "csi-image.warm-metal.tech-sha256:9075a3a4d047ee389b94ebcce57958440bf0b07dc7de1b69df90b75a57665c28" for image "sha256:9075a3a4d047ee389b94ebcce57958440bf0b07dc7de1b69df90b75a57665c28" with metadata map[string]string{"containerd.io/gc.root":"2023-11-08T10:40:08Z", "csi-image.warm-metal.tech/target|test-path":"√"}
I1108 10:40:08.943600     853 mounter.go:125] snapshot "csi-image.warm-metal.tech-sha256:9075a3a4d047ee389b94ebcce57958440bf0b07dc7de1b69df90b75a57665c28" is shared by 1 volumes
I1108 10:40:09.023588     853 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=cfb6f2f5-4973-4074-94a3-c0dd49e2c1e7)
I1108 10:40:14.033213     853 mounter.go:211] unmount volume "docker.io/library/redis:latest" at "test-path"
I1108 10:40:14.034725     853 mounter.go:216] try to unref read-only snapshot
I1108 10:40:14.034757     853 mounter.go:155] snapshot "csi-image.warm-metal.tech-sha256:9075a3a4d047ee389b94ebcce57958440bf0b07dc7de1b69df90b75a57665c28" isn't used by other volumes. delete it
I1108 10:40:14.034796     853 containerd.go:189] remove snapshot "csi-image.warm-metal.tech-sha256:9075a3a4d047ee389b94ebcce57958440bf0b07dc7de1b69df90b75a57665c28"
I1108 10:40:14.041244     853 node_server_test.go:94] server was stopped
PASS
ok  	github.com/warm-metal/csi-driver-image/cmd/plugin	9.661s
With synchronous pull and mount:
root@cdf7ee254501:/code/cmd/plugin# go test -run 'TestNodePublishVolumeSync'
I1108 10:41:04.481074    1239 mounter.go:45] load 0 snapshots from runtime
I1108 10:41:04.483854    1239 server.go:108] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
I1108 10:41:06.487828    1239 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=87042018-3ab9-4910-adc8-a2dcc82d85d8)
I1108 10:41:06.496336    1239 pullexecutor.go:70] pull image "docker.io/library/redis:latest"
E1108 10:41:06.589885    1239 node_server_test.go:245] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E1108 10:41:06.589970    1239 utils.go:101] GRPC error: rpc error: code = Aborted desc = unable to pull image "docker.io/library/redis:latest": unable to pull image "docker.io/library/redis:latest": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:41:06.692623    1239 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=72d5a350-d9ea-4a37-9736-aa658fa816b2)
I1108 10:41:06.698974    1239 pullexecutor.go:70] pull image "docker.io/library/redis:latest"
E1108 10:41:06.792714    1239 node_server_test.go:245] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E1108 10:41:06.792754    1239 utils.go:101] GRPC error: rpc error: code = Aborted desc = unable to pull image "docker.io/library/redis:latest": unable to pull image "docker.io/library/redis:latest": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:41:06.892533    1239 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=fce7da24-b806-4cff-a579-d5f0554c068c)
I1108 10:41:06.896913    1239 pullexecutor.go:70] pull image "docker.io/library/redis:latest"
E1108 10:41:06.992312    1239 node_server_test.go:245] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E1108 10:41:06.992328    1239 utils.go:101] GRPC error: rpc error: code = Aborted desc = unable to pull image "docker.io/library/redis:latest": unable to pull image "docker.io/library/redis:latest": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:41:07.092163    1239 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=1047530e-c016-41e0-b30a-dcec76c93e2b)
I1108 10:41:07.097247    1239 pullexecutor.go:70] pull image "docker.io/library/redis:latest"
E1108 10:41:07.192077    1239 node_server_test.go:245] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E1108 10:41:07.192177    1239 utils.go:101] GRPC error: rpc error: code = Aborted desc = unable to pull image "docker.io/library/redis:latest": unable to pull image "docker.io/library/redis:latest": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:41:07.295193    1239 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=027ce974-9185-4a84-896f-e554a9e8ca38)
I1108 10:41:07.298117    1239 pullexecutor.go:70] pull image "docker.io/library/redis:latest"
E1108 10:41:07.395878    1239 utils.go:101] GRPC error: rpc error: code = Aborted desc = unable to pull image "docker.io/library/redis:latest": unable to pull image "docker.io/library/redis:latest": rpc error: code = DeadlineExceeded desc = context deadline exceeded
E1108 10:41:07.395888    1239 node_server_test.go:245] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1108 10:41:07.495071    1239 node_server.go:66] mount request: volume_id:"docker.io/library/redis:latest" target_path:"test-path" volume_capability:<access_mode:<mode:SINGLE_NODE_READER_ONLY > > volume_context:<key:"pullAlways" value:"true" >  %!(EXTRA string=reqId , string=b58498e2-6ce8-4ce9-b853-b495595a0b00)
I1108 10:41:07.502611    1239 pullexecutor.go:70] pull image "docker.io/library/redis:latest"
E1108 10:41:07.594498    1239 node_server_test.go:245] context deadline exceeded; retrying: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E1108 10:41:07.594566    1239 utils.go:101] GRPC error: rpc error: code = Aborted desc = unable to pull image "docker.io/library/redis:latest": unable to pull image "docker.io/library/redis:latest": rpc error: code = DeadlineExceeded desc = context deadline exceeded
... continues

I1108 10:41:41.699399    1239 mounter.go:211] unmount volume "docker.io/library/redis:latest" at "test-path"
I1108 10:41:41.700355    1239 mounter.go:216] try to unref read-only snapshot
I1108 10:41:41.700382    1239 mounter.go:135] target "test-path" is not read-only
I1108 10:41:41.700401    1239 mounter.go:222] delete the read-write snapshot
I1108 10:41:41.700482    1239 containerd.go:189] remove snapshot "csi-image.warm-metal.tech-docker.io/library/redis:latest"
E1108 10:41:41.702751    1239 containerd.go:192] unable to remove the snapshot "csi-image.warm-metal.tech-docker.io/library/redis:latest": snapshot csi-image.warm-metal.tech-docker.io/library/redis:latest does not exist: not found
I1108 10:41:41.702819    1239 node_server_test.go:216] server was stopped
PASS
ok  	github.com/warm-metal/csi-driver-image/cmd/plugin	37.253s

Note that test passing here means, warm metal was not able to mount the image within 37 seconds

@vadasambar vadasambar marked this pull request as ready for review November 8, 2023 10:43
@vadasambar vadasambar changed the title [DON'T REVIEW] feat: wip add tests to reproduce context deadline exceeded problem feat: async image pull and mount Nov 8, 2023
@vadasambar
Copy link
Contributor Author

vadasambar commented Nov 14, 2023

CI is failing with
image

  Warning  FailedMount  46s (x9 over 2m59s)  kubelet            MountVolume.SetUp failed for volume "ro" : rpc error: code = Aborted desc = unable to pull image "docker.io/warmmetal/csi-image-test:simple-fs": unable to pull image "docker.io/warmmetal/csi-image-test:simple-fs": rpc error: code = Unknown desc = Error reading manifest latest in docker.io/library/csi-d1b0aaece3bfd88788deb7ff8e5dd0791be4b8d169246b40be5e19671c925a1c: errors:
denied: requested access to the resource is denied
unauthorized: authentication required
  Warning  FailedMount  46s (x9 over 2m59s)  kubelet  MountVolume.SetUp failed for volume "rw" : rpc error: code = Aborted desc = unable to pull image "docker.io/warmmetal/csi-image-test:simple-fs": unable to pull image "docker.io/warmmetal/csi-image-test:simple-fs": rpc error: code = Unknown desc = Error reading manifest latest in docker.io/library/csi-3e5bc86b2ae66f183c99451bc334180de7b33a82ca6e8961aefcc4340461f2b8: errors:

Check if this error is unrelated to the current PR: #74

@vadasambar
Copy link
Contributor Author

vadasambar commented Nov 14, 2023

CI for #74 passed (CI link) . Looks like code changes in this PR is causing the CI to fail.

image

cmd/plugin/node_server.go Outdated Show resolved Hide resolved
cmd/plugin/node_server.go Outdated Show resolved Hide resolved
cmd/plugin/node_server.go Outdated Show resolved Hide resolved
pkg/mountexecutor/mountexecutor.go Outdated Show resolved Hide resolved
pkg/mountexecutor/mountexecutor.go Outdated Show resolved Hide resolved
pkg/mountexecutor/mountexecutor.go Outdated Show resolved Hide resolved
pkg/mountexecutor/mountexecutor.go Show resolved Hide resolved
Copy link
Contributor

@dinesh dinesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, encouraging. Have few questions.

cmd/plugin/main.go Outdated Show resolved Hide resolved
cmd/plugin/main.go Outdated Show resolved Hide resolved
cmd/plugin/node_server.go Outdated Show resolved Hide resolved
pkg/mountexecutor/mountexecutor.go Show resolved Hide resolved
pkg/mountexecutor/mountexecutor.go Outdated Show resolved Hide resolved
pkg/mountstatus/mountstatus.go Show resolved Hide resolved
pkg/pullexecutor/pullexecutor.go Show resolved Hide resolved
pkg/pullexecutor/pullexecutor.go Show resolved Hide resolved
@vadasambar
Copy link
Contributor Author

vadasambar commented Nov 30, 2023

Tests in node_server_test.go is passing (after all the changes):
image

cmd/plugin/node_server.go Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
@vadasambar

This comment was marked as outdated.

@vadasambar

This comment was marked as outdated.

feat: wip add tests to reproduce `context deadline exceeded` problem

test: add test to call `NodePublishVolume` constantly for 2 minutes until image is downloaded
- this should fail for the current code

feat: wip add logic for async pull
- not working right now

fix: async image pull timing out
- mount is still a bottleneck

feat: wip working async mount
- fails with race conditions infrequently (<- looking at this)

feat: fix concurrency errors
- abstract mounting and pulling logic to separate packages
- add readme for running tests

refactor: remove redundant `uuid` related code

fix: CI lint/fmt errors
- use `defer cancel()` instead of discarding the fn
- fix log

fix: update mount status on finishing sync mount

chore: remove extra whitespace in the log

refactor: use localized error
- to address PR comment and avoid confusion

refactor: define magic numbers as constants

refactor: define mount and pull ctx timeout as constants

refactor: rename `--async-image-pulls` -> `--async-pull-mount`

fix: ci failing because of image pull error

fix: mounter is nil

fix: image not mounting if the image already exists on the node

fix: mounting the same volume more than once doesn't work

refactor: add log if image hasn't been pulled or mount is still in progress

refactor: make `NewNodeServer` call more compact

refactor: defer unlocking mutex instead of explicitly calling unlock at multiple places

refactor: remove leftover `ctx` in print statement

chore: bump version in Makefile and Chart.yaml

chore: revert bump to `v0.8.0`
- causes CI to fail

fix: remove conflict markers in Chart.yaml
- was causing CI to fail

chore: add a space after Chart.yaml
- to remove newline diff in the PR
@mugdha-adhav mugdha-adhav merged commit 7e9517e into warm-metal:master Nov 30, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants