-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Adding taint during image pre-pulling to prevent user pod being scheduled #3011
base: main
Are you sure you want to change the base?
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗 |
Thank you for your work on this @xcompass!! ❤️ 🌻 Can you what you wish to accomplish, and how how it is accomplished in the PR description? Also, the docker image - where is the Dockerfile and its bundled source code defined? This will be a quite advanced PR to review, so we will need all the help we can get to consider this. |
@consideRatio Please checkout the PR description. The docker image is linked in the description as well. Let me know if there is anything else you would like me to add. |
Tagging @yuvipanda @donotpush @Lastcysa for visibility and possible feedbacks |
I'll work on failed tests next a few days. |
Ok, so I absolutely love the idea! I think if we can implement this cleanly, it's a major win to cold-start problems that exist. I too can't seem to find the source for the images, etc though - they should probably live in this repo as well, and be built by chartpress. The two things to test are:
Anyway, thank you so much for working on this! |
Thank you @xcompass, now I understand the strategy and think your PR description helps others understand the strategy. I'd like review effort to remain at a very high level still. I'll switch back to #3010 to help narrow down exactly what you see as the problem that you want to address with this. I don't think the issue description captures that at the moment. |
I would say that importing https://github.com/ubc/taintmanager into https://github.com/jupyterhub/zero-to-jupyterhub-k8s/tree/main/images, and shaping it to be as similar to https://github.com/jupyterhub/zero-to-jupyterhub-k8s/tree/main/images/image-awaiter as possible would also help a lot with getting this reviewed. We have limited go expertise in the jupyterhub community, and making this new go thing be as close to the existing go thing we have will help a lot! |
The trick that I'm happy to see here and I hadn't considered before is that because initContainers execute sequentially, we can use it to:
Which massively simplifies the process vs a full controller as I was thinking of before. |
sorry, I didn't see your comments @consideRatio! I'll let @xcompass provide more info now, as I'd assume they have a more up-to-date understanding of the problem right now than I do :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding those here! I am going to be off for a few days, but left a couple preliminary comments to simplify this further, keeping in mind our ability to maintain this long term with limited go expertise.
I think by falling back to kubeconfig auth if we aren't in cluster, we can get rid of the tiltfile as well as all the extra YAML bits needed for testing and do quick tests locally. We will need to add some tests that exercise a z2jh config with this tainting though.
Excited to see progress on this!
@@ -0,0 +1,25 @@ | |||
# syntax=docker/dockerfile:1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Atleast to start with and to keep the amount of required go knowledge to a minimum, what do you think about doing this build the same way as https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/main/images/image-awaiter/Dockerfile? I think primarily that means using scratch as the final target rather than distroless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason I'm using distroless is to run taintmanager as nonroot user to be more secure. If it is not a concern, I can change it to scratch.
images/taint-manager/taintmanager.go
Outdated
} | ||
|
||
// creates the in-cluster config | ||
config, err := rest.InClusterConfig() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fallback to kubeconfig here if inclusterconfig is not available? That would allow us to test this locally much more easily, and get rid of all the extra YAML bits and tiltfile required to simply test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. out-cluster kubeconfig and node name parameter are added.
42340da
to
6521cb0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! My next question is how do you think we can write a unit / integration test for this here?
@@ -0,0 +1,8 @@ | |||
apiVersion: rbac.authorization.k8s.io/v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we can test this by running the taintmanager locally, are these test files still needed?
@@ -559,6 +559,9 @@ scheduling: | |||
operator: Equal | |||
value: core | |||
effect: NoSchedule | |||
- key: hub.jupyter.org/imagepulling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
image-pull-in-progress maybe?
jupyterhub/schema.yaml
Outdated
required: [enabled] | ||
description: | | ||
Manage the taints during the single user image prepulling. It will add | ||
sepcified taint before image pulling and remove it after it's done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"sepcified" -> "specified"
Did this get dropped? We probably would benefit. |
I'll bring it up-to-date people are still interested in it. |
Yeah i still think this would be pretty useful if we can get it to work correctly! |
The taint manager is a go binary to add/remove taint for a node.
This is to add support for out-cluster development and testing. Also removed Titlefile and deployment manifest as this can be tested outside the cluster
0f4323c
to
c32c90d
Compare
This is to solve the issue in #3010.
This PR is to add two additional init-containers (first one and last one) for image puller. The first init-container is used to add a taint to prevent user pods being scheduled while the single user image is being pulled. The last init-container removes the taint added by the first one so the user pods can be scheduled on the node. Both init-containers are using the same binary called taintmanager with different parameter for adding and removing the taints.
A new service account is also added to grant permission to call K8s API to manage taint on the node. The taintmanager is using in-cluster service account token to access K8s API. The taint can be customized through helm value.