
Add a new metafora command Drain : to support releasing all work. #102

Closed
epsniff opened this issue Jan 27, 2015 · 7 comments
@epsniff
Contributor

epsniff commented Jan 27, 2015

Ref: kubernetes/kubernetes#3819

To support the Kubernetes Lifecycle PreStop hook, we are going to add a new command to metafora that stops accepting new tasks and releases all current tasks. The drain command should only return once all work has been released (maybe within a timeout). This would allow frameworks like Kubernetes to send us a signal to release work and prepare to shut down.

@epsniff epsniff changed the title Add a new command called Drain : to support releasing all work. Add a new metafora command Drain : to support releasing all work. Jan 27, 2015
@epsniff epsniff self-assigned this Jan 27, 2015
@schmichael
Contributor

> The drain command should only return once all work has been released (maybe within a timeout).

This might be dangerous since the main loop blocks during command execution. Since drain freezes the consumer, that means the main loop only handles this subset of events:

https://github.com/lytics/metafora/blob/master/metafora.go#L190-L211

So I guess that means while draining (1) stop signals are ignored and (2) further commands are ignored.

I think this is ok and possibly even ideal! However, it's something to keep in mind as blocking the main loop for long periods of time makes me nervous. :)

@epsniff
Contributor Author

epsniff commented Jan 27, 2015

Good point. I also need to consider how to handle this in the client, which is the code that needs to block. Currently all commands issued by our client are async, and for this we need to send a signal back once the drain is finished.

Or just fake it by having the client do polling :p

@epsniff
Contributor Author

epsniff commented Jan 27, 2015

maybe we should complete #101 as part of this, and only support a blocking drain via a direct command call...

@schmichael
Contributor

Hm, good point. We have no command "response" mechanism and our etcd client's SubmitCommand method just fires off the command into etcd and moves on.

Implementation

Instead of deleting commands when they're received we could CompareAndSwap them with command state:

  • Client submits it with state: executable
  • Consumer picks it up and sets it to executing
  • Post-execution the state is set to completed or failed depending on results. Maybe add an extra "result": <string> key/value pair to the JSON at this point?
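
A minimal sketch of what such a command document and its states could look like in Go; the type and field names here are purely illustrative, not part of metafora's actual API:

```go
// Hypothetical command document stored at the command's etcd node.
// Each state change would be written with CompareAndSwap against the
// previous value so a stale writer can't clobber a newer state.
type CommandState string

const (
	StateExecutable CommandState = "executable" // written by the client on submit
	StateExecuting  CommandState = "executing"  // CAS'd by the consumer on pickup
	StateCompleted  CommandState = "completed"  // set post-execution on success
	StateFailed     CommandState = "failed"     // set post-execution on error
)

type CommandDoc struct {
	Command string       `json:"command"`          // e.g. "drain", "freeze"
	State   CommandState `json:"state"`
	Result  string       `json:"result,omitempty"` // optional post-execution detail
}
```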

Thinking through what this would mean for commands:

  • drain would stay executing until all tasks are stopped and then transitions to completed.
  • freeze/unfreeze would basically transition directly to completed.
  • stop_task could work like drain but on a single task.
  • balance would probably work like drain as well since it potentially releases tasks.

I actually can't think of a case where we'd use failed. I think timeouts should be implemented purely on the client side (watch the command node for N seconds before just giving up and moving on).
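
The client-side timeout could be as dumb as the sketch below. It builds on the hypothetical CommandDoc above; getState is an assumed helper that fetches and decodes the command node, and a real client would probably use an etcd watch instead of polling:

```go
// waitForCommand polls the command node until it reaches a terminal state
// or the timeout expires.
func waitForCommand(path string, timeout time.Duration,
	getState func(string) (CommandState, error)) error {

	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		state, err := getState(path)
		if err != nil {
			return err
		}
		if state == StateCompleted || state == StateFailed {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("command %s did not complete within %s", path, timeout)
}
```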

The response command node should have a TTL so it's garbage collected at some point. An hour? A day? I'm not sure.

Issues

It's just a lot of complexity.

Workaround

The HTTP introspection endpoint returns what tasks are running on the node, so polling that is an option... just kinda ugly.
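
Something along these lines; the endpoint path and response shape are assumptions since they aren't specified here:

```go
// waitForNoTasks polls the node's introspection endpoint until no tasks
// remain or the deadline passes.
func waitForNoTasks(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url) // e.g. "http://localhost:9999/tasks" (assumed path)
		if err != nil {
			return err
		}
		var tasks []string // assumed: endpoint returns a JSON array of task IDs
		err = json.NewDecoder(resp.Body).Decode(&tasks)
		resp.Body.Close()
		if err != nil {
			return err
		}
		if len(tasks) == 0 {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("tasks still running after %s", timeout)
}
```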

Not sure if I'm 👍 or 👎 on any particular option at this point.

@epsniff
Contributor Author

epsniff commented Jan 27, 2015

I like the idea, but not the complexity like you said. :( hmmm

I think we should expose a Drain or ExecuteCommand interface on the coordinator or metafora. That Drain can be blocking with a TTL. We can also add a drain to the client, but it would be async.

My suggestion is driven by a specific use case though. Our current use case is for the node to drain itself, not for our task/work coordinator to issue drains remotely. I see this as a special case of Shutdown, instead of a new "remote" command.
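
The local blocking call could look roughly like this; the method names and helpers are illustrative only, nothing here exists in metafora today:

```go
// Drain stops accepting new tasks, releases everything currently claimed,
// and blocks until the work is gone or the ttl expires. Purely a sketch
// of the interface being described, not an implementation.
func (c *Consumer) Drain(ttl time.Duration) error {
	c.freeze()     // hypothetical: stop claiming new tasks
	c.releaseAll() // hypothetical: ask running handlers to release

	select {
	case <-c.allReleased(): // hypothetical chan closed when no tasks remain
		return nil
	case <-time.After(ttl):
		return errors.New("drain timed out before all tasks were released")
	}
}
```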

@schmichael
Contributor

> I think we should expose a Drain or ExecuteCommand interface on the coordinator or metafora.
> ...
> I see this as a special case of Shutdown, instead of a new "remote" command.

I like this. Really Consumer#Shutdown is just as much a "command" as Drain/Freeze/Unfreeze. Why do we expose it as a method and force the other commands to go through the coordinator? I think the entire command API might be misguided right now.

I think the only time we've used Commands at Lytics is when I manually issued Freeze/Unfreeze commands via curl+etcd to manage a node during deployment. Being able to Freeze/Unfreeze via a local HTTP endpoint or even SIGUSR1/SIGUSR2 or something would have been much more convenient.
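
For example, the signal-based version could be wired up like this, assuming Freeze/Unfreeze methods on the Consumer (which is exactly what's being proposed here, not what exists today):

```go
// handleFreezeSignals toggles the node between frozen and unfrozen on
// SIGUSR1/SIGUSR2. Freeze/Unfreeze are the hypothetical Consumer methods
// discussed above.
func handleFreezeSignals(consumer *metafora.Consumer) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1, syscall.SIGUSR2)
	go func() {
		for sig := range sigs {
			switch sig {
			case syscall.SIGUSR1:
				consumer.Freeze() // stop claiming new tasks
			case syscall.SIGUSR2:
				consumer.Unfreeze() // resume claiming tasks
			}
		}
	}()
}
```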

So now the question is: Do we tack on a Drain method onto the Consumer or use an ExecuteCommand() method? Does the method monkey with the consumer's state directly or issue commands to the main loop?

After our discussion in person I think we both like the idea of killing commands as they exist today and just putting methods on the Consumer that directly alter Consumer state (falling back to communicating with the main loop via chans if the state manipulation proves too complex/error prone).
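
If direct state manipulation does prove too error prone, the chan fallback could look something like this (names are illustrative):

```go
// Hypothetical fallback: exported methods never touch consumer state
// directly; they hand a request to the main loop, which owns all state.
type stateReq struct {
	frozen bool
	done   chan struct{}
}

func (c *Consumer) Freeze() {
	req := stateReq{frozen: true, done: make(chan struct{})}
	c.stateCh <- req
	<-req.done // block until the main loop has applied the change
}

// ...and in the main loop's select:
//   case req := <-c.stateCh:
//       c.frozen = req.frozen // only the main loop mutates state
//       close(req.done)
```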

👍 👍 👍 👍

@schmichael
Contributor

Closing this ticket as it's pretty outdated. stop_task is best handled by the statemachine and drain is best handled through simply calling Consumer#Shutdown via a signal or HTTP handler.
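
In other words, the drain-on-shutdown flow ends up as simple as the sketch below, assuming Consumer#Shutdown blocks until handlers have released their work as described above (the HTTP path is just an example, and the server still needs to be started separately):

```go
// wireShutdown shuts the consumer down when SIGTERM/SIGINT arrives or when
// an operator hits a local drain endpoint.
func wireShutdown(consumer *metafora.Consumer) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sigs
		consumer.Shutdown()
	}()

	http.HandleFunc("/drain", func(w http.ResponseWriter, r *http.Request) {
		consumer.Shutdown()
		fmt.Fprintln(w, "drained")
	})
}
```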
