
Add Forensic Container Checkpointing KEP #1990

Merged
merged 1 commit into kubernetes:master on
Feb 2, 2022

Conversation

adrianreber
Contributor

@adrianreber adrianreber commented Sep 17, 2020

This KEP is about Forensic Container Checkpointing

Ref: #2008

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 17, 2020
@k8s-ci-robot
Contributor

Welcome @adrianreber!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @adrianreber. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 17, 2020
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 17, 2020
@kikisdeliveryservice
Member

Hi @adrianreber

All KEPs must have an open issue linked to the PR. Can you please create one, filling out all of the sections, and link it here?

Thanks,
Kirsten

@adrianreber
Contributor Author

@kikisdeliveryservice Thanks for letting me know. Here it is:

#2008

@schrej
Member

schrej commented Sep 23, 2020

According to the template the number in the file name should match the issue number, which would be 2008-checkpoint-restore in this case.

@adrianreber
Contributor Author

@schrej Thanks for pointing that out. Will update this PR soon.

@derekwaynecarr
Member

@adrianreber while it is clear that we may want this type of API in the CRI to facilitate a future apiserver->kubelet<->runtime checkpoint/restore, I am not inclined to add things to the CRI until we know how the higher-order systems would utilize it. Can you maybe explore what it would mean to checkpoint/restore a Pod across hosts to identify the questions that may arise? For example, do init containers re-run on the new host? What is the advantage of checkpointing/restoring a pod versus just scheduling a new pod? Basically, look at the problem from a Kube API level first before going to CRI level.

@mikebrow
Member

mikebrow commented Oct 19, 2020

Interesting subject. Use case examples from the below link:

- Restarting the host machine without stopping/starting containers
- Speeding up the start time of slow start applications
- "Rewinding" processes to an earlier point in time
- "Forensic debugging" of running processes
- live migration

https://github.com/docker/cli/blob/master/experimental/checkpoint-restore.md#use-cases-for-checkpoint--restore

Would go a long way just to make k8s aware of the additional possible runc container states and work them into valid container life cycle tooling/scenarios for k8s users.

With the state of pod and node resource management being shared between kubelet and the container runtimes, checkpoint/restore of a pod and its in-use resources will be somewhat difficult and probably needs to be staged based on dependency chains. See discussions over on the sidecar/container dependency tree for bootup/teardown KEP.

@adrianreber
Contributor Author

I will try to be relatively verbose in my answers to hopefully avoid ambiguity. So I might repeat obvious things.

@adrianreber while it is clear that we may want this type of API in the CRI to facilitate a future apiserver->kubelet<->runtime checkpoint/restore, I am not inclined to add things to the CRI until we know how the higher-order systems would utilize it. Can you maybe explore what it would mean to checkpoint/restore a Pod across hosts to identify the questions that may arise?

Looking first at the technical level. To migrate (checkpoint/restore) a Pod from one host to another, I would say we only need to worry about migrating single containers. A Pod, as far as I have seen it, is a pause container and multiple other containers sharing (some of) the namespaces. Sharing namespaces upon restore, with the help of CRIU, is possible in runc once opencontainers/runc#2583 is merged. So on the lower level we have everything implemented and working.

Migrating a Pod is now nothing more than checkpointing all containers in the Pod (except the pause container), destroying the Pod on the source system, creating a new Pod on the destination system (including a newly created pause container) and restoring all containers into the new Pod on the destination system.

From what I have done so far in Podman and CRI-O, I can say that everything I mentioned is doable. To migrate a container to a new host, the container is created with the same ID as it had before checkpointing, the corresponding image is pulled from the registry (also based on the image ID), config.json is adapted, and then runc restores the container with the help of CRIU. The container can get a new ID if necessary, but, at least in Podman, I use the same ID as during checkpointing if possible.

Migrating a Pod would mean creating a new Pod (based on the properties of the original Pod), creating a pause container, restoring all checkpointed containers into the new Pod and thus joining the corresponding namespaces.

The one thing which I am not yet sure how to solve is how to transfer the checkpoint from the source host to the destination host.
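To make the container-level flow concrete, here is a minimal sketch using Podman's checkpoint/restore commands (the container name, host name and paths are placeholders):

```
# Source host: checkpoint the container and write everything needed for a
# restore into a single archive. Add --leave-running to keep the original
# container alive while checkpointing.
podman container checkpoint --export=/tmp/counter-checkpoint.tar.gz counter

# Transfer the checkpoint archive to the destination host (placeholder name).
scp /tmp/counter-checkpoint.tar.gz dest-host:/tmp/

# Destination host: restore the container from the archive; the container
# image it references is pulled from the registry as needed.
podman container restore --import=/tmp/counter-checkpoint.tar.gz
```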

For example, do init containers re-run on the new host?

If they are still running on the source of the migration, yes, they will continue to run (but not re-run), as they will be migrated like any other container. We know the image the init container is based on, so restore can easily pull it from a registry, and every other part of the state is included in the checkpoint. If they are no longer running in the Pod they will not be migrated. It is different from when a Pod restarts, where, according to the documentation, init containers are run again.

I agree that init containers could sound like a possible problematic point. If the initialization somehow creates a connection between the containers and the host, then it breaks when moving the container to another host. Not sure if this is acceptable, but this is something the user is responsible for: just because it is possible, it still might not be a good idea to do it, depending on the application.

Related to init containers and relations between the host and the Pod/container: Podman, runc and CRIU are able to migrate established TCP connections. This, however, only works if the IP address of the migrated container is the same as that of the checkpointed container. In Podman I am blocking migrations if the IP is not available during restore. CRIU offers the possibility to migrate with established TCP connections, but it also offers to just close all TCP connections, so even if established TCP connections are open they will be closed on restore. The default behaviour is to refuse to checkpoint a process that has established TCP connections unless either the option to keep established connections or the option to close them has been used. I am mentioning this in case it is related to init containers and relations between the host and the Pod/container.
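For illustration, the TCP behaviours described above map to plain CRIU options roughly like this (a sketch only; the PID and image directory are placeholders, and container engines pass the equivalent options through):

```
# Default: criu dump refuses to checkpoint a process that has established
# TCP connections.

# Keep established connections across checkpoint/restore; this only works if
# the restored process ends up with the same IP address.
criu dump -t 1234 -D /tmp/ckpt --tcp-established
criu restore -D /tmp/ckpt --tcp-established

# Alternatively, have previously established connections closed on restore
# instead of being re-established.
criu restore -D /tmp/ckpt --tcp-close
```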

What is the advantage of checkpointing/restoring a pod versus just scheduling a new pod?

Not 100% sure I understand this correctly. This can either mean that checkpointing is not useful at all, because containers are supposed to be stateless and can be stopped and started at any time, with the application automatically adapting and scaling. If this is your question, then it is really hard to answer, because checkpoint/restore is indeed somewhat antithetical to the paradigm that containers are stateless. I guess I can only say that they are not always stateless, and, especially looking at the use cases already listed in the KEP, it is useful from my point of view.

If your question is more about why we should migrate a Pod and all its containers rather than create a new Pod and restore the containers into it, then I think the answer is easy, because, as I described above, this is exactly how I would do it: create a new Pod with a new pause container and restore the containers into that Pod, joining the pause container's namespaces.

Basically, look at the problem from a Kube API level first before going to CRI level.

By providing the ability to checkpoint containers out of running Pods, even with keeping the container running, and the possibility to restore containers into existing Pods all possible use cases should be covered.

Use cases could be:

  • Draining of a node
    • Migrate running containers to another node by moving the Pod (as described above)
    • Especially when looking at long initialization times (databases, JVM): a stateful migration using checkpoint/restore can reduce downtime (or draintime) substantially.
  • Create replicas
    • Same as above: especially when looking at long initialization times (databases, JVM): creating replicas from an already initialized state can be much faster
  • Load balancing
    • This is one of the classic checkpoint/restore use cases next to fault tolerance: a long-running job in the background makes interactive jobs slow. Using checkpoint/restore, this background container can be migrated to another node without losing its state and without restarting from the beginning. This is, from what I have seen, the main use case of checkpoint/restore in Borg. Writing application-level checkpoints can be used to avoid this, but then every application has to do it itself. This is also the classic discussion about whether system-level or application-level checkpointing is more useful: the one (system level) is more complex because it has to deal with a wide range of applications, and the other (application level) needs to be re-implemented in every application.

@adrianreber
Contributor Author

Interesting subject. Use case examples from the below link:

Thanks for listing this. I will try to add additional comments to those examples.

  • Restarting the host machine without stopping/starting containers

This is part of the KEP document and currently already possible with my CRI-O changes.

  • Speeding up the start time of slow start applications

Also a good one. This example, as well as the previous example, can be seen here using Podman: https://www.redhat.com/en/blog/container-migration-podman-rhel

  • "Rewinding" processes to an earlier point in time

Using the keep-running functionality, this is also possible with Podman and should be easy to implement in CRI-O next to my existing changes.

  • "Forensic debugging" of running processes

Based on all the above also not difficult. Especially as CRIU provides tools to convert a checkpoint to a coredump.

  • live migration

This is of course one of the main goals, but, in the end, it is nothing more than checkpoint, transfer and restore.

Would go a long way just to make k8s aware of the additional possible runc container states and work them into valid container life cycle tooling/scenarios for k8s users.

runc has no special states for checkpointed containers. A checkpointed container is just stopped and, if restoring on the runc level, needs to be removed before restore. The same is true on the Podman level. If the container has been started without --rm it needs to be deleted manually before restore or the restore needs to be done with --name to tell Podman to create a new container for the restore. So nothing needs to be done for Kubernetes to map runc states, because they do not exist. There is of course the need to represent checkpointed containers in Kubernetes even if runc does not have it.
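As a sketch of that runc-level sequence (the container ID and image path are placeholders, and runc is assumed to be run from the container's bundle directory):

```
# Write the CRIU images for the running container; by default the container
# is stopped afterwards (--leave-running would keep it alive).
runc checkpoint --image-path /tmp/ckpt mycontainer

# The checkpointed container is simply "stopped", so remove it before restoring.
runc delete mycontainer

# Restore a container with the same ID from the saved images.
runc restore --image-path /tmp/ckpt mycontainer
```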

With the state of pod and node resource management being shared between kubelet and the container runtimes, checkpoint/restore of a pod and its in-use resources will be somewhat difficult and probably needs to be staged based on dependency chains. See discussions over on the sidecar/container dependency tree for bootup/teardown KEP.

Thanks for the pointer. Not yet clear what this means, but I will find out.

Thanks @derekwaynecarr and @mikebrow for your feedback. Let me know (here or on Slack) if my answers are helpful and if there are additional topics which need to be discussed.

@derekwaynecarr
Member

Migrating a Pod is now nothing more than checkpointing all containers in the Pod (except the pause container), destroying the Pod on the source system, creating a new Pod on the destination system (including a newly created pause container) and restoring all containers into the new Pod on the destination system.

This is a simplification, but to express an idea:

  • a pod has volumes (external to container scratch space)
  • a pod has an IP
  • a pod has a set of lifecycle phases (that results in eventual mortality)
  • a pod is bound to a node via an external scheduler (pod.spec.nodeName)
  • a pod consumes extra resources (gpus as example)
  • a pod may consume local resources to the host

If we zoom out and ask 'what is the user goal to do the checkpoint/restore?' we have to ask which of the items above persist and manifest themselves on the new pod, and how (and when) the scheduler schedules that new pod to the new host. I assume at some point in a larger system architecture a pod is 'frozen' and then migrated to another clone of said pod scheduled to a new host. There is a lot that has to happen in the pod API model and lifecycle to facilitate that beyond just the discrete containers.

Maybe narrowing the use case to a particular set of pods would help.

@derekwaynecarr
Member

Maybe a good way to frame this is 'if a workload wanted to do a checkpoint/restore for something that was managed by a StatefulSet workload controller, what is the proposed set of changes to meet that use case?'.

@sftim
Contributor

sftim commented Oct 20, 2020

You could consider checkpointing to be part of the desired state of a Pod. Whether or not that checkpointing succeeded would be part of the actual state of the Pod.

If that's the case, then one way to implement this is to extend the Pod API further. However, I'd prefer to see checkpoints (and later restores) implemented as a CustomResourceDefinition, and make an extension to the kubelet that pays attention to these CRDs. The PodCheckpointRequest object, once created, might also claim a persistent volume and use that to store the checkpoint data as an opaque set of files (or possibly even on block storage).
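Purely as an illustration of the shape such an extension could take (nothing below exists today; the API group, kind and field names are made up for this sketch), a PodCheckpointRequest object might look roughly like this:

```
# Hypothetical object only; the CRD, API group and fields are illustrative.
cat <<'EOF' | kubectl apply -f -
apiVersion: checkpointing.example.k8s.io/v1alpha1
kind: PodCheckpointRequest
metadata:
  name: checkpoint-counters
  namespace: default
spec:
  podName: counters                 # Pod whose containers should be checkpointed
  containers: ["wildfly"]           # optional subset of containers
  storage:
    persistentVolumeClaimName: checkpoint-data   # PVC receiving the checkpoint files
EOF
```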

Would that approach work? If so, the KEP for that would be to extend the kubelet's behavior either to allow a way for plugins to interact with a Pod's containers for checkpointing, or directly implement it.

Later you might also need to add support to the scheduler to know that checkpointed Pods need special treatment, perhaps - that could be via a scheduling plugin.

If this feature can work without changes to the code running in the cluster control plane, then I'd like it to work that way.

@sftim
Contributor

sftim commented Oct 20, 2020

Just musing on this briefly - the kubelet automatically registers “mirror” Pod objects for any static pods it sees in its local configuration. Restoring a pod onto a node could trigger a similar form of Pod creation into the cluster's Kubernetes API - but it does create logistical challenges when that restored Pod has storage volumes and other things. Notably, those are things that, currently, you can't use with static pods.

@mikebrow
Member

mikebrow commented Oct 20, 2020

runc has no special states for checkpointed containers. A checkpointed container is just stopped and, if restoring on the runc level, needs to be removed before restore. The same is true on the Podman level. If the container has been started without --rm it needs to be deleted manually before restore or the restore needs to be done with --name to tell Podman to create a new container for the restore. So nothing needs to be done for Kubernetes to map runc states, because they do not exist. There is of course the need to represent checkpointed containers in Kubernetes even if runc does not have it.

Fair. From a skim of the code, the additional runc state(s) I see are an artifact of the current implementation of the containerd ctr checkpoint command, which uses runc's pause command before executing the checkpoint and then resumes the container afterwards. That may or may not be a valid/valuable pattern/use case for your KEP.

@schrej
Member

schrej commented Oct 21, 2020

I've been working on Pod migration for a few months as I'm writing a master's thesis about it. I haven't shared much so far, partly because I'm not completely satisfied with my results yet and partly because I needed to focus on writing for a while.

Regarding the API extensions I tried to collect my thoughts in this google doc. I'll try to update it with my most recent ideas as it is probably a little bit outdated.
Imho it makes the most sense to clone Pods instead of migrating them, as that fits the current lifecycle much better. If the old Pod then gets deleted it essentially behaves the same way as a migration. I think it also makes sense to extend the Pod spec with either a reference to a Pod that should be cloned, or a reference to a new resource like a PodCloneRequest that manages the clone/migration process. If no extension to the Pod spec is made it would be hard for kubelet to figure out if a Pod is supposed to be a clone.

In general there are a lot of good reasons to have a new resource and a migration controller to go along with it. First of all it makes it easier to trigger and track the migration from a user perspective. There are also things that are hard to do in an elegant way without it, most importantly deleting the old Pod if a migration takes place.

My current approach for transferring the checkpoint between nodes is a mounted volume. As runc writes the checkpoint to disk this is probably one of the fastest ways to transfer it. A migration controller could use PersistentVolumeClaims to mount a volume on both nodes for the migration.

Maybe a good way to frame this is 'if a workload wanted to do a checkpoint/restore for something that was managed by a StatefulSet workload controller, what is the proposed set of changes to meet that use case?'.

I would suggest to do this with two controllers. I'd extend StatefulSet or implement e.g. a MigratingStatefulSet that either

  • controls multiple MigratingPods which manage one Pod each that get migrated on spec change or instead of deletion. This would also make it easy to check constraints if any are defined.
  • manages Pods directly and uses a PodClone whenever a Pod needs to be moved around

However, I'd prefer to see checkpoints (and later restores) implemented as a CustomResourceDefinition, and make an extension to the kubelet that pays attention to these CRDs. The PodCheckpointRequest object, once created, might also claim a persistent volume and use that to store the checkpoint data as an opaque set of files (or possibly even on block storage).

While I agree this is useful for some of the use cases, e.g. auto-scaling and faster Pod startup in general, it's probably not ideal for migration as this would be slower than moving containers directly. In addition, ways of migration with less downtime (CRIU supports pre- and post-copy) can't really be implemented that way. If migration is implemented, short downtime should be one of the goals as that's one of the main reasons for migration.

In order to create and store checkpoints, a Checkpoint resource could be introduced with a reference to a Pod that should be checkpointed. The added reference in the Pod spec could then also accept a reference to a Checkpoint from which to restore the containers. I'd only use this for scaling/fast startup, not for migration though.

For example, do init containers re-run on the new host?

I would argue to require a Pod to be finished with init containers before it can be migrated. If it's still running init containers it can probably also just be deleted and recreated. (This of course depends on what init containers get used for).

This, however, only works if the IP address of the migrated container is the same as that of the checkpointed container.

From what I've seen this is very tricky. Many networking plugins define a CIDR per node, which wouldn't allow migrating the IP. I'm currently using a Service to abstract the changing IP, but that does break open TCP connections.

There is a lot that has to happen in the pod API model and lifecycle to facilitate that beyond just the discrete containers.

For basic migration it's actually not that hard. I'd suggest defining certain constraints for Pods that should be able to migrate, e.g. only ReadWriteMany volumes, no extra/local resources, etc. Especially Pods that use special/local resources might be extremely hard to migrate.

I've implemented a very rough but working prototype for Pod migration. I've tried to explain how to set it up here; as I did that rather quickly, it's probably missing a few things. The binaries are also not up to date anymore. If you're interested I could do a demo, though.

I'll also share the thesis with the community once it's completed (and I'm happy with it) at the end of December. In case any of you would be open to reading an earlier version and providing feedback, that would be a tremendous help, also because I don't have a lot of experience with Kubernetes yet. Please reach out on Slack if you'd like to.

As the things I've said here will also appear in the thesis, please point out anything that doesn't seem right; that also helps a lot!

There is also an issue regarding Pod migration btw: kubernetes/kubernetes#3949

@adrianreber
Contributor Author

I am happy to see input from @schrej, because it seems he really thought about the high level view much more than me. I was always looking at the low-level implementation details.

One, for me, very interesting result of this discussion is that everyone except me is mostly talking about Pod migration. For me it was always more about container migration, which would be limited by the Pod definition or policy. Right now I am thinking it is probably more or less the same thing, just viewed from a different angle. What worries me most when talking about Pods instead of containers is that it seems to limit the possibilities of checkpoint/restore to migration only. Doing a fast startup of an initialized and checkpointed container using restore seems to be missing from the Pod migration discussion.

One of my goals in moving forward with checkpoint/restore/migration was to focus on a limited set of features in the first step. That was also the reason I was only proposing checkpoint/restore at the API level. I understand now that introducing changes at the API level requires a thought-out high-level use case, which makes sense, so as not to introduce low-level API changes for nothing.

Looking at the code @schrej has written, it is indeed very rough 😉 (just like @schrej mentioned), but maybe his thoughts on the Pod API can be very helpful to move this forward. There are now two implementations moving the migration topic forward: @schrej's and mine. On the lower, CRI-API, level the implementations are similar, which is probably a good sign that it is not totally wrong.

Going back to the KEP process level: If @schrej would provide a KEP describing the Pod API changes he has so far and referencing this KEP, would that be a good step forward? Combining our two efforts into one, would that help? One effort, two KEPs.

Would that work for you @schrej? I would be happy to help out with a KEP describing the Pod API changes and also doing the implementation.

@rst0git rst0git left a comment

LGTM

@adrianreber adrianreber force-pushed the checkpoint-restore branch 2 times, most recently from ab821b9 to 8f8cc15 on January 17, 2022 16:56

Although *checkpoint* and *restore* can be used to implement container
migration this KEP is only about enabling the forensic use case. Checkpointing
a pod is not part of this proposal and left for future enhancements.
Member

Checkpointing a pod may be required for the forensic use case in the case of sandboxed (e.g. micro VM or gVisor) pods. I think it's fine to leave that out of the initial scope, but just wanted to name it.

Contributor Author

Right. Checkpointing a pod is not a problem and we demonstrated it in one of the previous proof of concept implementations. We just left it out for this KEP as we only wanted to focus on the things we want to do.

the corresponding code pull request is extended to support *checkpoint* and
*restore* in the CRI API. The reason to add *restore* to the CRI API without
implementing it in the kubelet is to make development and especially testing
easier on the container engine level.
Member

Is there a corresponding proposal for the implementation of the API in cri-o and containerd? I think it would be helpful to get CRI implementations on board before dictating the API.

Contributor Author

Yes: cri-o/cri-o#4199

We used the CRI-O implementation to checkpoint and restore containers and pods in one of the early proof of concept implementations.

There is also: kubernetes-sigs/cri-tools#662


The goal of this KEP is to introduce *checkpoint* and *restore* to the CRI API.
This includes extending the *kubelet* API to support checkpointing single
containers with the forensic use case in mind.
Member

What is the scope of what's included in the checkpoint? Memory snapshot? Writeable layer snapshot? RO snapshot? What about volumes?

Contributor Author

The simple answer is: everything that is in the container.

Everything external (devices, mounts, volumes) is not.

Volumes are special because they exist as unnamed volumes (I hope that is the right name) that are included in our current Podman implementation. External volumes are not included.

Everything external needs additional handling. Right now this is manual, but it could become part of Kubernetes at some point. Whether it can be migrated to a new location depends on the type of external resource.

API:

```
curl -skv -X POST "https://localhost:10250/checkpoint/default/counters/wildfly"
```
Member

This should be seen as a highly privileged operation. How will it be authorized?

This seems on par with exec permissions, but exec uses the highly overloaded proxy subresource for the authorization check. I'd prefer to use a new dedicated subresource, preferably scoped to the container for this. I'm not aware of us doing this anywhere else, but it would be interesting if it required 2 authorization checks: that you have access to the checkpoint subresource on the node (since it's hitting a node API directly), and the checkpoint subresource on the pod.

Contributor Author

This should be seen as a highly privileged operation. How will it be authorized?

Not sure I understand this question correctly. To do this operation the normal kubelet API access restrictions apply. I can only do it without authorization if I set KUBELET_AUTHORIZATION_WEBHOOK="false" and KUBELET_FLAGS="--anonymous-auth=true --authorization-mode=AlwaysAllow". Isn't that enough? Are you thinking of additional checks?

Member

Kubelet authorization (without AlwaysAllow) works by sending a SubjectAccessReview request to the apiserver, which includes (among other things) the resource being checked (currently always node) and a subresource. You can find the mapping for the existing Kubelet API here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/auth.go#L59-L112

At the bottom of that link, you'll see that we use a dedicated subresource for some requests (logs, stats & metrics) but everything else just uses the highly overloaded proxy subresource.

I'd like for all new Kubelet APIs to use a new dedicated subresource (e.g. checkpoint). My comment about checking the subresource against the node & pod probably doesn't make sense as a 1-off (maybe something to consider if we ever overhaul the Kubelet API)
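As a sketch of what a dedicated subresource buys us (the `checkpoint` subresource name is only the example given above, and the exact verb depends on how the kubelet authorizer maps the HTTP request), access could then be granted with an ordinary RBAC rule instead of the broad `nodes/proxy` permission:

```
# Hypothetical RBAC for a dedicated nodes/checkpoint subresource.
kubectl create clusterrole node-checkpointer \
  --verb=create --resource=nodes/checkpoint
kubectl create clusterrolebinding alice-node-checkpointer \
  --clusterrole=node-checkpointer --user=alice
```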

Member

@tallclair thanks for the feedback. For the forensic use case, the initial desire was root on the node but not root in the cluster.

@adrianreber I think the net of this is we need to ensure that the test case for this function does not redirect back to proxy, but is its own dedicated subresource. For beta, we should add a section on authorization.

see:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/auth_test.go#L111

@adrianreber
Contributor Author

@tallclair Thanks for taking a look at this KEP. For me it seems that most questions you asked have already been asked during the 1.5 years lifetime of this KEP. It is probably hard to find everything in more than 100 comments.

Please have a look at my answers and if they answer your questions please mark the conversations as resolved so that I know if the answers work from your side.

At some point this KEP was really huge and contained an answer for every possible question, but we removed most things to focus on the points we want to achieve.

If you have any further questions I am happy to answer them all.

### Test Plan

For alpha:
- Unit tests available
Member

There also should be an e2e test for this for alpha. I'm not sure how we would unit test the new endpoint.

Contributor Author

The reason why I went for unit tests only is that none of the CRI implementations provide the necessary functionality yet, so we would get a "NotImplemented" back. I can do that. Would that work?

Member

Ack, it would give an extension point for the future so not critical.


###### How can this feature be enabled / disabled in a live cluster?

Using the feature gate `ContainerCheckpointRestore` the feature can be enabled.
Member

Please use the template:

###### How can this feature be enabled / disabled in a live cluster?
<!--
Pick one of these and delete the rest.
-->
- [ ] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name:
  - Components depending on the feature gate:
- [ ] Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control
    plane?
  - Will enabling / disabling the feature require downtime or reprovisioning
    of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).

The feature gate only applies to the kubelet, correct?

Contributor Author

Yes, the feature gate only applies to the kubelet.
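For illustration, toggling a kubelet-only feature gate is a per-node operation, e.g. (flag form shown; the gate can equally be set via `featureGates` in the kubelet configuration file, and it was later renamed to CheckpointContainer, see below):

```
# Enable on one node by adding the gate to the kubelet's arguments (or to
# `featureGates` in the KubeletConfiguration file used via --config), e.g.:
#
#   --feature-gates=ContainerCheckpointRestore=true
#
# then restart the kubelet on that node:
sudo systemctl restart kubelet

# Disabling the feature is the same change with "=false" plus another restart.
```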

###### Are there any tests for feature enablement/disablement?

Unit tests will temporarily enable the `ContainerCheckpointRestore` feature gate
to ensure that the unit tests are always running.
Member

This question is about testing turning the feature on and off, not merely testing with the feature gate on.

This will be covered so long as you complete manual testing of turning the feature gate on and off on an individual node.


## Drawbacks

Not aware of any.
Member

Are there any performance or security implications of this feature?

Contributor Author

added

@adrianreber
Contributor Author

@ehashman Thanks a lot for taking a look at this KEP.

I tried to address a couple of your review comments. I was unsure about the feature gate name, but I changed it now to just CheckpointContainer. That sounds indeed more correct.

About e2e testing: if I get a NotImplemented back from the CRI implementation, is that good enough? Once the CRIs add support for it, the tests will need to be adapted as they will return something positive (I hope).

I am not sure I filled out the Scalability section correctly.

Please take another look if you have a chance and let me know if my changes were not correct or if I need to provide more information.

Member

@ehashman ehashman left a comment

/approve
for PRR


###### Are there any tests for feature enablement/disablement?

Currently no.
Member

Make sure you manually test this. (i.e. run kubelet with feature gate on, test feature, turn feature gate off and restart it, test that it's disabled)

Signed-off-by: Adrian Reber <areber@redhat.com>
@mrunalp
Contributor

mrunalp commented Feb 1, 2022

/assign @dchen1107

@mrunalp
Contributor

mrunalp commented Feb 1, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2022
Member

@derekwaynecarr derekwaynecarr left a comment

/lgtm
/approve

this can be refined further, but is good to proceed on implementation.

use case has been discussed extensively in sig.


For alpha:
- Unit tests available

For beta:
Member

Please add detail on checkpoint authorization; we will need to restrict access to the kubelet API resource.

On the container runtime, the actual checkpoint is stored in a location that is restricted, but prior to beta, we need clear security practices documented.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianreber, derekwaynecarr, ehashman, rst0git

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2022
@k8s-ci-robot k8s-ci-robot merged commit b02f428 into kubernetes:master Feb 2, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 2, 2022
@adrianreber
Contributor Author

I want to thank everyone who was involved getting this reviewed and now merged. Thanks!

The corresponding code pull request is ready for review at kubernetes/kubernetes#104907

I updated the pull request to include e2e tests and renamed the feature gate as suggested by @ehashman.

@jpellman

jpellman commented Mar 24, 2022

Hi @adrianreber,

I really enjoyed your Kubecon talk from last October, and while I understand that this PR is very much closed, I thought I might add another use case that could rely upon the Pod checkpoint/restore feature described above (in addition to the load balancing, node draining, and replica creation use cases that you noted).

In the HPC world, the SLURM scheduler includes time slicing functionality that gives a cluster the ability to alternate between two or more batch jobs on a node by using the SIGSTOP and SIGCONT signals. The SLURM scheduler refers to this as gang scheduling (though I believe this may be a bit of a misnomer; it's not related to communication between processes/threads from what I can tell). The goal of this functionality is to improve batch job throughput by ensuring that short-running jobs aren't blocked from running by longer jobs.

In certain contexts, I could see something like the SLURM gang scheduler being useful in the Kubernetes world. Namely, if a Kubernetes cluster is oriented primarily towards data processing and makes heavy use of the Job resource, having it so that kube-scheduler could periodically juggle between the Pods backing various Jobs by freezing/unfreezing them via CRIU might similarly increase Job throughput (assuming that the Pods being frozen/unfrozen have identical predicates).

It might be argued that with certain Kubernetes distributions providing the ability to autoscale resources that this sort of Pod time slicing is not necessary, but autoscaling might not always be practical or desirable. For instance, if a Kubernetes cluster were running on a number of low power PCs (e.g., Raspberry Pis) in an area with limited internet connectivity (e.g., Antarctica), autoscaling wouldn't be practical; instead economizing usage of fixed resources would be preferable.

Hopefully what I've said here isn't totally crazy and doesn't create unnecessary noise. It's just something that occurred to me when I was watching your talk.

@adrianreber
Contributor Author

@jpellman A couple of years ago, when I started to look into checkpoint/restore, it was also in an HPC environment, so I am aware of different HPC use cases.

Thanks for bringing up additional use cases. I will try to keep them in mind when working on this in the future.

@xial-thu

xial-thu commented Jul 8, 2022

So looking forward to this feature!!! I have been hoping for CRIU + k8s for several years now.

I would like to add a use case from the HPC world: training jobs (such as MPI jobs, PyTorch jobs, etc.) are usually fragile in the face of disasters, stateful (especially model states and optimizer states on GPU), and not scalable at all. They may last days to weeks. The state-of-the-art AI + k8s solutions are not ideal.

First, any disastrous error (kernel panic, power outage, even an OOM kill) makes the training job fail outright.

Second, jobs are not scalable. If you create a job with N pods, each using Y GPUs, the size is static; you cannot resize it at runtime. Kubernetes offers elasticity, but AI workloads cannot benefit from it.

Finally, GPU virtualization is not mature, so GPUs are usually still passed through to the container. As a result, in large clusters, resource fragmentation is a huge problem. For example, a server with 8 GPUs now holds 6 jobs, each using 1 GPU. The remaining 2 GPUs are fragmented resources and cannot serve an 8-GPU training job. Researchers complain about fragmentation, but they get really mad if you stop their job and move it to another host.

Let's raise a bigger problem: we take it for granted that AI on cloud (which usually means k8s) is a trend. But AI on cloud is not the same as simply putting AI workloads in containers and developing some operators.

"Count on the framework", people may say, hoping PyTorch will take care of everything. Yes, PyTorch is on the way.

A training framework may do periodic checkpointing (not container checkpointing) and save the data on a shared file system. The job can then recover from that checkpoint. This way the job only loses part of its state, because checkpointing occurs at the end of an epoch.

PyTorch also supports elastic training, making resizing at runtime possible.

But are there alternatives? What can cloud providers do to accelerate the pace?

Microsoft has released its work Singularity, an AI infrastructure that enables cross-region scheduling, live migration and job scaling. But a first problem is that k8s does not support checkpointing!

With this work, I see the possibility of building a truly elastic AI infrastructure. Jobs can be checkpointed and restored without loss, and the scheduler can do housekeeping to avoid fragmentation. AI is far from cloud native, but I'm happy to see the gaps being filled one by one. Hopefully one day we can really enjoy cloud-native AI!

@elchead

elchead commented Jul 24, 2022

(quoting @xial-thu's comment above in full)

I second your excitement about the potential of this for HPC jobs. As part of my master's thesis (available soon), I also evaluated the potential of migration to increase the resource utilization of HPC jobs with unknown resource requirements, and it seems feasible even for huge containers of more than 100 GB. But I want to add that this proposal is only a first tiny step toward it. I think it will still take some time until there is enough functionality for it to be usable in practice (it took almost one year to merge this minimal PR). For the time being, I think we still need to rely on custom solutions. Some industry solutions are also available already:
https://cloudify.co/blog/migrating-pods-containerized-applications-nodes-kubernetes-cluster-using-cloudify/
https://partners-intl.aliyun.com/help/en/container-service-for-kubernetes/latest/back-up-and-restore-applications?spm=a2c63.p38356.0.0.49c9466avUHd1s#task-1993430
