[Feature] Docker live migration #13785

Open
Lol3rrr opened this issue Jul 15, 2022 · 2 comments

Comments

@Lol3rrr

Lol3rrr commented Jul 15, 2022

Proposal

The ability to live-migrate Docker containers from one node to another. This would allow more or less seamless transitions when draining stateful services from a node, possibly incurring only a short hang depending on state size, transfer speed, etc.

After some initial research, this seems possible, as shown by tools such as CRIU, and there are other resources discussing how it could be done. However, I'm not sure how much of this applies to the way Nomad handles migration, or whether it is possible with Nomad at all.

Use-cases

In my setup I'm using the Docker driver to run a couple of stateful services, currently Minecraft servers, though they could be other game servers or similar. I want to be able to migrate a running service from one node to another, ideally without a restart, as that would be disruptive to my end users.

Attempted Solutions

I'm currently just using the standard migrate feature, which restarts the service because I can't have two instances of the same server running at once. The disruption is not huge, since the service usually restarts rather quickly, but it is still annoying and forces me to check for active users before deciding to drain a node.

@schmichael
Member

(Cross-linking #2323, the QEMU counterpart to this Docker/Linux-container specific proposal.)

Thanks for filing an issue @Lol3rrr! This would be an exciting thing to support, but it is not on our roadmap today.

Implementation Notes

Migration

While the ephemeral_disk.migrate feature is no longer as useful since CSI support went GA, the PrevAllocMigrator does what CRIU would probably require: block starting a replacement allocation until state is transferred from the previous allocation.

So the lifecycle plumbing should mostly be in place!
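To make the "block starting a replacement allocation until state is transferred" idea concrete, here is a minimal Go sketch. It does not use Nomad's actual PrevAllocMigrator API: `stateMigrator`, `startAfterMigration`, `recordingMigrator`, and `runDemo` are all hypothetical names invented for illustration, and the "transfer" is simulated with a short sleep.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stateMigrator is a hypothetical stand-in for the role described above:
// it moves state from the previous allocation and returns when it is done.
type stateMigrator interface {
	Migrate(ctx context.Context) error
}

// startAfterMigration blocks until the migrator finishes, then starts the
// replacement. If the transfer fails or the context ends first, start is
// never called.
func startAfterMigration(ctx context.Context, m stateMigrator, start func() error) error {
	done := make(chan error, 1)
	go func() { done <- m.Migrate(ctx) }()

	select {
	case err := <-done:
		if err != nil {
			return fmt.Errorf("state transfer failed, not starting replacement: %w", err)
		}
		return start()
	case <-ctx.Done():
		return ctx.Err()
	}
}

// recordingMigrator is a fake that simulates a transfer and records events.
type recordingMigrator struct {
	events *[]string
	fail   bool
}

func (r recordingMigrator) Migrate(ctx context.Context) error {
	time.Sleep(10 * time.Millisecond) // pretend to copy a checkpoint image
	if r.fail {
		return errors.New("transfer interrupted")
	}
	*r.events = append(*r.events, "migrate")
	return nil
}

// runDemo wires the fake migrator to a fake start hook and returns the
// order in which the two steps ran.
func runDemo(fail bool) ([]string, error) {
	var events []string
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := startAfterMigration(ctx, recordingMigrator{events: &events, fail: fail}, func() error {
		events = append(events, "start")
		return nil
	})
	return events, err
}

func main() {
	events, err := runDemo(false)
	fmt.Println(events, err)
}
```

A CRIU-based migrator would presumably plug in where the fake sits, with the checkpoint image as the state being transferred, but that mapping is an assumption rather than something the existing code is known to support.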

Opt-in

We would probably want it to be opt-in, not only for backward compatibility but also because a lot of services want to start from scratch for each new allocation. It could start life as a per-driver implementation in task.config stanzas, but if a few runtimes support it then perhaps a task.live_migrate { ... } stanza makes sense. I assume it will need to be a stanza and not a simple boolean since there are a lot of parameters involved, and I have to think at least a few would need to be user-tunable: https://pkg.go.dev/github.com/checkpoint-restore/go-criu@v4.0.0+incompatible/rpc#CriuOpts
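As a rough illustration of why a stanza might fit better than a boolean, the following Go sketch shows one possible shape for the decoded options. Nothing here exists in Nomad: `LiveMigrateConfig` and every field name are hypothetical. `TCPEstablished` and `FileLocks` are assumed to correspond to the similarly named CRIU options, and `Timeout` is an invented tunable.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// LiveMigrateConfig is a hypothetical decoded form of a task.live_migrate
// stanza. None of these fields exist in Nomad today.
type LiveMigrateConfig struct {
	Enabled        bool          // opt-in: false keeps the current restart-on-migrate behavior
	TCPEstablished bool          // assumed to map to CRIU's tcp-established option
	FileLocks      bool          // assumed to map to CRIU's file-locks option
	Timeout        time.Duration // hypothetical upper bound on checkpoint plus transfer
}

// Validate rejects configurations that opt in without bounding the migration.
func (c LiveMigrateConfig) Validate() error {
	if !c.Enabled {
		return nil // feature is off, nothing to check
	}
	if c.Timeout <= 0 {
		return errors.New("live_migrate: timeout must be positive when enabled")
	}
	return nil
}

func main() {
	cfg := LiveMigrateConfig{Enabled: true, TCPEstablished: true, Timeout: 30 * time.Second}
	fmt.Println(cfg.Validate())
}
```

The zero value leaves the feature disabled, which matches the opt-in requirement: existing jobs would behave exactly as before unless they set the stanza.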

@Lol3rrr
Author

Lol3rrr commented Jul 23, 2022

I would be curious as to what extent I could actively help here.
I'm not really familiar with the Nomad codebase, but I would be more than willing to dive in and start working on this, if that is even possible currently.

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

2 participants