
Pod lifecycle checkpointing #3949

Open
bgrant0607 opened this issue Jan 29, 2015 · 34 comments
Labels
area/pod-lifecycle Issues or PRs related to Pod lifecycle kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@bgrant0607
Member

Filing this issue for discussion and tracking, since it has come up a number of times.

Starting with background:

Pods are scheduled, started, and eventually terminate. They are replaced with new pods by a replication controller (or some other controller, once we add more controllers). That's both the reality and the model. Today pods are replaced reactively, but eventually controllers will replace pods proactively for planned moves. We currently do not preempt pods in order to schedule other pods, and likely won't for some time.

Currently, new pods have no obvious relationship to the pods they replace. They have different names, different uids, different IP addresses, different hostnames (since we set the pod hostname to pod name), and newly initialized volumes.

Replication controllers themselves are not durable objects. They are tied to deployments. New deployments create new replication controllers. This simplifies sophisticated deployment and rollout strategies without making simple scenarios complex. Both rollout tools/components and auto-scaling will deal with groups of replication controllers.

Naming/discovery is addressed using services, DNS, and the Endpoints API. The evolution of these mechanisms is being discussed in #2585.

This is a flexible model that facilitates transparency, simplifies handling of inevitable distributed-systems scenarios, facilitates high availability, and facilitates dynamic deployment and scaling.

But the model is not without issues. The main ones are:

  1. Data durability
  2. Self-discovery
  3. Work/role assignment

Data durability is being discussed in the persistent storage proposal #3318. We will also need to address it for local storage, but local storage is less relevant to "migration", anyway, since it's not feasible to migrate. For remote storage, it will be possible to detach and reattach the devices to new pods/hosts.

Self-discovery: Pods know their IP addresses, but currently do not know the names or IPs of the services targeting them. This will be solved by the service redesign #2585 and the downward API #386.

Work/role assignment: We encourage dynamic role assignment: master election, fine-grain locking, sharding, task queues, pubsub, etc. That said, some servers are "pet-like", particularly those requiring large amounts of persistent storage. Many of these are replicated and/or sharded, with application-specific clustering implementations that tie together names/addresses and persistent data. We've discussed a concept tentatively called "nominal services" #260 to stably assign names and IP addresses to individual pods, and we aim to address that in the service redesign #2585.

So, do we need "pod migration", and, if so, what should it mean? I think it minimally should mean that the replacement pod has the same hostname, IP address, and storage.

We should aim to minimize disruption for high-availability servers. What could we do, besides the things planned above?

  • Don't use the pod name as the host name. Associate a name with the IP address instead (e.g., by hashing the address). The names of pods created by replication controllers aren't currently predictable, so this wouldn't be a regression.
  • Migrate the pod IP address. Currently pod IP addresses are statically partitioned among hosts and are not migratable. This would likely be problematic on some cloud providers with the way we're currently configuring routing, but could be done with an overlay network.
  • Lifecycle hooks pre- and post-migration.
  • Actual live state transfer via CRIU (http://criu.org/Docker).
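The hostname idea in the first bullet could be sketched roughly as follows. This is purely an illustration of "derive a stable name from the address", not the Kubernetes implementation; the prefix and digest length are arbitrary choices:

```python
import hashlib


def stable_hostname(pod_ip: str, prefix: str = "pod") -> str:
    """Derive a stable, DNS-safe hostname from a pod IP address.

    Illustrative sketch only: as long as the IP address migrates with
    the pod, the derived name stays the same across replacements.
    """
    digest = hashlib.sha1(pod_ip.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}-{digest}"


# The same address always maps to the same name:
assert stable_hostname("10.0.1.7") == stable_hostname("10.0.1.7")
```

Any deterministic, collision-resistant mapping would do; the point is only that the name follows the address rather than the pod object.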

With respect to Kubernetes objects, the "migrated" pod would still be a new pod with a new name and uid. The orchestration of the migration would be performed by a controller -- possibly an enhanced replication controller, perhaps in collaboration with a network controller to move the address, similar to the separation of concerns in the persistent storage proposal. During the migration process, the old and new pods would coexist, and that coexistence would be visible to clients of the Kubernetes API, but the application being migrated and its clients would not need to be aware of the migration.
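As a rough illustration of that orchestration, the controller's steps could look like the sketch below. Everything here is hypothetical plain Python, not Kubernetes API; the hook names, the CRIU checkpoint, and the address handoff are stand-ins for the mechanisms discussed above:

```python
def migrate_pod(old_pod: dict, target_node: str) -> dict:
    """Hypothetical sketch of the order a migration controller might follow.

    The old and new pods coexist until the final delete step; the IP
    address is carried over (in reality by a cooperating network
    controller), so the application's clients see no change.
    """
    events = []
    events.append(("pre-migration-hook", old_pod["name"]))

    # Capture live state (e.g. via CRIU) before creating the replacement.
    checkpoint = {"state": f"checkpoint-of-{old_pod['name']}"}

    # The replacement is a new pod object: new name, new uid, same address.
    new_pod = {
        "name": old_pod["name"] + "-replacement",
        "node": target_node,
        "ip": old_pod["ip"],  # address moved, not reallocated
        "restored_from": checkpoint["state"],
    }
    events.append(("post-migration-hook", new_pod["name"]))

    # Only now does the old pod go away; until here both were visible.
    events.append(("delete", old_pod["name"]))
    return {"new_pod": new_pod, "events": events}
```

The sketch only fixes the ordering (hook, checkpoint, restore, hook, delete); the actual controller would of course drive real API objects through these states.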

/cc @smarterclayton @thockin @alex-mohr

@bgrant0607 bgrant0607 added kind/design Categorizes issue or PR as related to design. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Jan 29, 2015
@thockin
Member

thockin commented Jan 30, 2015

Fixing the hostname seems like an obvious change. We should consider where else that information might pop up. In a different issue we are discussing allowing users to expose their own pod UID and name as custom env vars. This would still be safe as long as we don't do live migration. We solved this internally by allocating a virtual UID at pod creation time which travels across migrations (live or otherwise), but is allocated as a UUID. When a migrating controller knows that a new pod is a migration, it sets the VUID of the new pod.

I don't know if we will ever really get to pervasive live migration, but I hope so.
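The virtual-UID scheme described above could look roughly like this. This is a hedged sketch, not Kubernetes code; the `Pod` type and `migrate` function are hypothetical stand-ins:

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Pod:
    # The real uid changes on every replacement; the vuid is the
    # stable identity that survives migrations.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))
    vuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def migrate(old: Pod) -> Pod:
    """A migrating controller creates a new pod (fresh uid) but,
    knowing this is a migration, carries the old pod's VUID over."""
    new = Pod()          # fresh uid and a default vuid at creation time
    new.vuid = old.vuid  # the VUID travels across the migration
    return new


old = Pod()
new = migrate(old)
assert new.uid != old.uid and new.vuid == old.vuid
```

A pod created outside a migration simply keeps its default VUID, so in the common case the two identifiers coincide in lifetime even though they differ in value.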


@bgrant0607
Member Author

I'd make the requirement that anything using the Kubernetes API to introspect or manage their own pods would need to be migration-aware. They could use the post-migration hook to get the pod's new name.

Why would someone want the uid? What issue is that? (Other than #386.)

@smarterclayton
Contributor

They want the uid as a unique instance identifier. We could give them anything, but self-registration into an external system is one use case (I'm pod foo, serving X, here's my unique identifier).

@gaocegege
Contributor

Hi, I'm interested in pod live migration.

Is there any progress? Docker has built-in C/R (checkpoint/restore) operations in experimental mode; could we develop a feature based on that?

@thockin
Member

thockin commented Apr 19, 2017 via email

@warmchang
Contributor

Pod live migration? I don't think this is necessary.
You can use a Service to shield clients from the destruction and rebirth of pods.

@ktosiek

ktosiek commented Aug 18, 2017

@warmchang Live migration would let one reduce downtime for services that don't support failover and have a long startup time.
JIRA and Jenkins would be examples of such services.

You can run them under a single-instance StatefulSet (not under RC, as they do not guarantee stopping the old instance), but then you'll see those slow restarts (potentially multiple ones) on a rolling cluster reboot.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@schrej
Member

schrej commented Nov 11, 2020

For evaluation I've implemented a migration controller and a MigratingPod kind, which allows migrating pods on deletion.
The operator is available at schrej/podmigration-operator. (As with the rest of my PoC, this is very rough as well.)
It requires my modified Kubernetes (the latest modified version of kubelet!) and containerd with a modified CRI to work, of course; see my previous comment.

Here's a small demo gif showing the migration of a simple pod with a "stateful container".
(demo gif: migratingpod)

This is implemented with a finalizer that's also respected by kubelet, so the Pod doesn't get terminated immediately.
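As an illustration of the finalizer approach, the Pod object could carry something like the fragment below. The finalizer name here is hypothetical (not necessarily the one the operator actually uses); the patched kubelet would delay termination while the finalizer is present:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stateful-app
  finalizers:
    # Hypothetical finalizer: the migration controller removes it only
    # after the container state has been checkpointed, so a kubelet
    # patched to respect it does not terminate the pod immediately
    # when the pod is deleted.
    - podmigration.example.com/checkpoint
spec:
  containers:
    - name: app
      image: example/stateful-app
```

This reuses the standard Kubernetes deletion flow: the pod enters Terminating on delete, but stays running until the finalizer is cleared.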

@vutuong

vutuong commented Dec 23, 2020

  1. Hi everyone, thanks to @schrej's and @adrianreber's work; I hope this feature lands in k8s soon. For anyone who wants to try @schrej's setup but runs into problems, here are my ansible-based automation tool and installation document for building a K8s cluster to try out pod migration:
  2. However, in @schrej's work, migrating a Pod briefly works like this: create a new Pod with Spec.clonePod = oldPod, and the new Pod is restored as soon as checkpointing of the old Pod completes (the old Pod is then deleted). So if I only want to checkpoint my Pod and save it for a later restore (not restore it immediately), so that I can restore as many pods as I need from one checkpoint, that does not seem to be supported in @schrej's work.
     Thanks to @schrej's help, I extended his work to decouple checkpoint and restore. My initial idea [1] is to use pod annotations to trigger checkpoint/restore as follows:
  • To checkpoint a running Pod, there are two options:
$ kubectl annotate pod [POD_NAME] snapshotPolicy="checkpoint" snapshotPath="/your/path/"
$ kubectl checkpoint [POD_NAME] [CHECKPOINT_PATH]
  • To restore a new Pod from an existing checkpoint, create a new Pod with these annotations:
...
  annotations:
    snapshotPolicy: "restore"
    snapshotPath: "/your/path/"
...
  • For live migration (or perhaps cold migration), there are two options: set Spec.clonePod = oldPod (just like @schrej's work), or use my option:
$ kubectl migrate [POD_NAME] [DEST_HOST]
  3. The video-streaming pod migration demo:
  4. Finally, I didn't really measure the restore delay, but it doesn't look as fast as I expected (see the demo). Is this a limitation of CRIU or containerd?

@wgahnagl
Contributor

/close
no longer relevant, reopen if it becomes relevant again 👍

@k8s-ci-robot
Contributor

@wgahnagl: Closing this issue.

In response to this:

/close
no longer relevant, reopen if it becomes relevant again 👍

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ehashman
Member

/reopen
/title Pod lifecycle checkpointing

This has been raised recently at SIG Node meetings with some demos. kubernetes/enhancements#1990 is not yet implementable but reopening to keep this on backlog.

(closed due to confusion about title/first comment)

@k8s-ci-robot k8s-ci-robot reopened this Jun 24, 2021
@k8s-ci-robot
Contributor

@ehashman: Reopened this issue.

In response to this:

/reopen
/title Pod lifecycle checkpointing

This has been raised recently at SIG Node meetings with some demos. kubernetes/enhancements#1990 is not yet implementable but reopening to keep this on backlog.

(closed due to confusion about title/first comment)


@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 24, 2021
@gjkim42
Member

gjkim42 commented Jun 24, 2021

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 24, 2021
@gjkim42
Member

gjkim42 commented Jun 24, 2021

/remove-triage accepted

Sorry for the confusion. It seems that sig-node has not yet decided to accept this.

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 24, 2021
@k8s-ci-robot
Contributor

@bgrant0607: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 24, 2021
@ehashman
Member

/retitle Pod lifecycle checkpointing

@k8s-ci-robot k8s-ci-robot changed the title Pod migration Pod lifecycle checkpointing Jun 25, 2021
@mmiranda96
Contributor

/remove-priority awaiting-more-evidence
/triage needs-information
/kind feature
/remove-kind design

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. kind/feature Categorizes issue or PR as related to a new feature. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. kind/design Categorizes issue or PR as related to design. labels Jun 25, 2021
@thockin thockin added the area/pod-lifecycle Issues or PRs related to Pod lifecycle label Feb 19, 2024