Proposal
The ability to live-migrate Docker containers from one node to another. This would allow more or less seamless transitions when draining stateful services off one node and moving them to another, possibly incurring only a short hang depending on container size, transfer speed, and so on.
Some initial research suggests this is possible, as shown by tools such as CRIU, and there are other resources discussing how it could potentially be done. However, I'm not sure how much of that applies to the way Nomad works, or how feasible this even is with Nomad.
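For anyone unfamiliar with CRIU: checkpointing a process tree from Go is fairly compact. Below is a minimal sketch, assuming the v4 go-criu RPC API (signatures changed in later versions), CRIU installed on the node, and root privileges; the PID and image directory are placeholders:

```go
package main

import (
	"log"
	"os"

	"github.com/checkpoint-restore/go-criu"
	"github.com/checkpoint-restore/go-criu/rpc"
	"github.com/golang/protobuf/proto"
)

func main() {
	// Directory CRIU will write the checkpoint images into (placeholder path).
	imgDir, err := os.Open("/tmp/criu-images")
	if err != nil {
		log.Fatal(err)
	}
	defer imgDir.Close()

	opts := rpc.CriuOpts{
		Pid:            proto.Int32(12345), // placeholder PID of the process tree to dump
		ImagesDirFd:    proto.Int32(int32(imgDir.Fd())),
		LeaveRunning:   proto.Bool(false), // stop the original process after dumping
		TcpEstablished: proto.Bool(true),  // include established TCP connections
		LogFile:        proto.String("dump.log"),
	}

	c := criu.MakeCriu()
	if err := c.Dump(opts, criu.NoNotify{}); err != nil {
		log.Fatalf("criu dump failed: %v", err)
	}
	// The images in /tmp/criu-images could then be copied to another node
	// and restored there with the symmetric c.Restore(opts, criu.NoNotify{}).
}
```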
Use-cases
In my setup I'm using the docker driver to run a couple of stateful services (currently Minecraft servers, but they could be other game servers or similar), and I want the ability to migrate a running service from one node to another, ideally without a restart, since restarts are disruptive to my end users.
Attempted Solutions
I'm currently using the standard migrate features, which simply restart the service, because I can't have two instances of the same server running. The disruption isn't huge, since the restart is often quick, but it's still annoying and forces me to check for connected users before deciding to drain a node.
(Cross-linking #2323, the QEMU counterpart to this Docker/Linux-container specific proposal.)
Thanks for filing an issue @Lol3rrr! This would be an exciting thing to support, but it is not on our roadmap today.
Implementation Notes
Migration
While the ephemeral_disk.migrate feature is no longer as useful now that CSI support has gone GA, the PrevAllocMigrator does what CRIU would probably require: it blocks starting a replacement allocation until state has been transferred from the previous allocation.
So the lifecycle plumbing should mostly be in place!
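To illustrate, the hook a CRIU-based migration would need is roughly the shape below. The names are hypothetical, not Nomad's actual PrevAllocMigrator interface; it just shows "block the replacement until the checkpoint has arrived, then restore":

```go
// Hypothetical sketch of the lifecycle hook; not Nomad's real interface.
package migrate

import "context"

// StateFetcher pulls checkpoint images from the node that ran the
// previous allocation (hypothetical helper).
type StateFetcher interface {
	Fetch(ctx context.Context, destDir string) error
}

// WaitAndRestore blocks the replacement task from starting until the
// checkpoint has been transferred, then restores the process from it.
func WaitAndRestore(ctx context.Context, f StateFetcher, imageDir string,
	restore func(imageDir string) error) error {
	if err := f.Fetch(ctx, imageDir); err != nil {
		return err // covers cancellation if the drain is aborted
	}
	return restore(imageDir)
}
```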
Opt-in
We would probably want it to be opt-in, not only for backward compatibility but also because many services want to start from scratch in each new allocation. It could start life as a per-driver implementation in task.config stanzas, but if a few runtimes support it then perhaps a task.live_migrate { ... } stanza makes sense. I assume it will need to be a stanza and not a simple boolean, since there are a lot of parameters involved and at least a few would likely need to be user-tunable: https://pkg.go.dev/github.com/checkpoint-restore/go-criu@v4.0.0+incompatible/rpc#CriuOpts
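To make that concrete, here is a hypothetical mapping from a few user-tunable stanza fields onto CriuOpts; the stanza shape and all names below are invented for illustration:

```go
// Hypothetical translation layer; names invented for illustration.
package migrate

import (
	"github.com/checkpoint-restore/go-criu/rpc"
	"github.com/golang/protobuf/proto"
)

// LiveMigrateConfig mirrors a hypothetical `live_migrate { ... }` stanza.
type LiveMigrateConfig struct {
	TCPEstablished bool   // checkpoint established TCP connections
	FileLocks      bool   // checkpoint held file locks
	LogLevel       int32  // CRIU verbosity
	LogFile        string // CRIU log file name
}

// ToCriuOpts fills in only the user-facing knobs; the driver would set
// the rest (PID, images directory, and so on) itself.
func (c *LiveMigrateConfig) ToCriuOpts() rpc.CriuOpts {
	return rpc.CriuOpts{
		TcpEstablished: proto.Bool(c.TCPEstablished),
		FileLocks:      proto.Bool(c.FileLocks),
		LogLevel:       proto.Int32(c.LogLevel),
		LogFile:        proto.String(c.LogFile),
	}
}
```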
I would be curious to what extent I could actively help here.
I'm not really familiar with the Nomad codebase, but I'd be more than willing to start diving in and working on this, if that's even possible at the moment.