Proposal
The ability to live-migrate Docker containers from one node to another. This would allow more or less seamless transitions when draining stateful services off one node and moving them to another, possibly incurring only a short hang depending on container size, transfer speed, and so on.
Some initial research suggests this is possible, as shown by tools such as CRIU, and there are other resources discussing how it could potentially be done. However, I'm not sure how much of that applies to the way Nomad works, or how feasible this even is with Nomad.
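For anyone unfamiliar with CRIU: checkpointing a process tree from Go is fairly compact. Below is a minimal sketch, assuming the v4 go-criu RPC API (signatures changed in later versions), CRIU installed on the node, and root privileges; the PID and image directory are placeholders:

```go
package main

import (
	"log"
	"os"

	"github.com/checkpoint-restore/go-criu"
	"github.com/checkpoint-restore/go-criu/rpc"
	"github.com/golang/protobuf/proto"
)

func main() {
	// Directory CRIU will write the checkpoint images into (placeholder path).
	imgDir, err := os.Open("/tmp/criu-images")
	if err != nil {
		log.Fatal(err)
	}
	defer imgDir.Close()

	opts := rpc.CriuOpts{
		Pid:            proto.Int32(12345), // placeholder PID of the process tree to dump
		ImagesDirFd:    proto.Int32(int32(imgDir.Fd())),
		LeaveRunning:   proto.Bool(false), // stop the original process after dumping
		TcpEstablished: proto.Bool(true),  // include established TCP connections
		LogFile:        proto.String("dump.log"),
	}

	c := criu.MakeCriu()
	if err := c.Dump(opts, criu.NoNotify{}); err != nil {
		log.Fatalf("criu dump failed: %v", err)
	}
	// The images in /tmp/criu-images could then be copied to another node
	// and restored there with the symmetric c.Restore(opts, criu.NoNotify{}).
}
```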
Use-cases
In my setup I'm using the docker driver to run a couple of stateful services (currently Minecraft servers, but they could be other game servers or similar), and I want the ability to migrate a running service from one node to another, ideally without a restart, since restarts are disruptive to my end users.
Attempted Solutions
I'm currently using the standard migrate features, which simply restart the service, because I can't have two instances of the same server running. The disruption isn't huge, since the restart is often quick, but it's still annoying and forces me to check for connected users before deciding to drain a node.
(Cross-linking #2323, the QEMU counterpart to this Docker/Linux-container specific proposal.)
Thanks for filing an issue @Lol3rrr! This would be an exciting thing to support, but it is not on our roadmap today.
Implementation Notes
Migration
While the ephemeral_disk.migrate feature is no longer as useful now that CSI support has gone GA, the PrevAllocMigrator does what CRIU would probably require: it blocks starting a replacement allocation until state has been transferred from the previous allocation.
So the lifecycle plumbing should mostly be in place!
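To illustrate, the hook a CRIU-based migration would need is roughly the shape below. The names are hypothetical, not Nomad's actual PrevAllocMigrator interface; it just shows "block the replacement until the checkpoint has arrived, then restore":

```go
// Hypothetical sketch of the lifecycle hook; not Nomad's real interface.
package migrate

import "context"

// StateFetcher pulls checkpoint images from the node that ran the
// previous allocation (hypothetical helper).
type StateFetcher interface {
	Fetch(ctx context.Context, destDir string) error
}

// WaitAndRestore blocks the replacement task from starting until the
// checkpoint has been transferred, then restores the process from it.
func WaitAndRestore(ctx context.Context, f StateFetcher, imageDir string,
	restore func(imageDir string) error) error {
	if err := f.Fetch(ctx, imageDir); err != nil {
		return err // covers cancellation if the drain is aborted
	}
	return restore(imageDir)
}
```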
Opt-in
We would probably want it to be opt-in, not only for backward compatibility but also because many services want to start from scratch in each new allocation. It could start life as a per-driver implementation in task.config stanzas, but if a few runtimes support it then perhaps a task.live_migrate { ... } stanza makes sense. I assume it will need to be a stanza and not a simple boolean, since there are a lot of parameters involved and at least a few would likely need to be user-tunable: https://pkg.go.dev/github.com/checkpoint-restore/go-criu@v4.0.0+incompatible/rpc#CriuOpts
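To make that concrete, here is a hypothetical mapping from a few user-tunable stanza fields onto CriuOpts; the stanza shape and all names below are invented for illustration:

```go
// Hypothetical translation layer; names invented for illustration.
package migrate

import (
	"github.com/checkpoint-restore/go-criu/rpc"
	"github.com/golang/protobuf/proto"
)

// LiveMigrateConfig mirrors a hypothetical `live_migrate { ... }` stanza.
type LiveMigrateConfig struct {
	TCPEstablished bool   // checkpoint established TCP connections
	FileLocks      bool   // checkpoint held file locks
	LogLevel       int32  // CRIU verbosity
	LogFile        string // CRIU log file name
}

// ToCriuOpts fills in only the user-facing knobs; the driver would set
// the rest (PID, images directory, and so on) itself.
func (c *LiveMigrateConfig) ToCriuOpts() rpc.CriuOpts {
	return rpc.CriuOpts{
		TcpEstablished: proto.Bool(c.TCPEstablished),
		FileLocks:      proto.Bool(c.FileLocks),
		LogLevel:       proto.Int32(c.LogLevel),
		LogFile:        proto.String(c.LogFile),
	}
}
```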
I would be curious to what extent I could actively help here.
I'm not really familiar with the Nomad codebase, but I'd be more than willing to start diving in and working on this, if that's even possible at the moment.