
Add a new metafora command Drain : to support releasing all work. #102

Closed
epsniff opened this issue Jan 27, 2015 · 7 comments
@epsniff
Contributor

epsniff commented Jan 27, 2015

Ref: kubernetes/kubernetes#3819

To support the Kubernetes Lifecycle PreStop hook, we are going to add a new command to metafora that stops accepting new tasks and releases all current tasks. The drain command should only return once all work has been released (maybe within a timeout). This would allow frameworks like Kubernetes to send us a signal to release work and prepare to shut down.

@epsniff epsniff changed the title Add a new command called Drain : to support releasing all work. Add a new metafora command Drain : to support releasing all work. Jan 27, 2015
@epsniff epsniff self-assigned this Jan 27, 2015
@schmichael
Contributor

> The drain command should only return once all work has been released (maybe within a timeout).

This might be dangerous since the main loop blocks during command execution. Since drain freezes the consumer, that means the main loop only handles this subset of events:

https://github.com/lytics/metafora/blob/master/metafora.go#L190-L211

So I guess that means while draining (1) stop signals are ignored and (2) further commands are ignored.

I think this is ok and possibly even ideal! However, it's something to keep in mind as blocking the main loop for long periods of time makes me nervous. :)

@epsniff
Contributor Author

epsniff commented Jan 27, 2015

Good point. I also need to consider how to handle this in the client, which is the code that needs to block. Currently all commands issued by our client are async, and for this we need to send a signal back once the drain is finished.

Or just fake it by having the client do polling :p

@epsniff
Contributor Author

epsniff commented Jan 27, 2015

maybe we should complete #101 as part of this, and only support a blocking drain via a direct command call...

@schmichael
Contributor

Hm, good point. We have no command "response" mechanism and our etcd client's SubmitCommand method just fires off the command into etcd and moves on.

Implementation

Instead of deleting commands when they're received we could CompareAndSwap them with command state:

  • Client submits it with state: executable
  • Consumer picks it up and sets it to executing
  • Post-execution the state is set to completed or failed depending on results. Maybe add an extra "result": <string> key/value pair to the JSON at this point?
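
A minimal sketch of what such a command document and its states could look like in Go; the type and field names here are purely illustrative, not part of metafora's actual API:

```go
// Hypothetical command document stored at the command's etcd node.
// Each state change would be written with CompareAndSwap against the
// previous value so a stale writer can't clobber a newer state.
type CommandState string

const (
	StateExecutable CommandState = "executable" // written by the client on submit
	StateExecuting  CommandState = "executing"  // CAS'd by the consumer on pickup
	StateCompleted  CommandState = "completed"  // set post-execution on success
	StateFailed     CommandState = "failed"     // set post-execution on error
)

type CommandDoc struct {
	Command string       `json:"command"`          // e.g. "drain", "freeze"
	State   CommandState `json:"state"`
	Result  string       `json:"result,omitempty"` // optional post-execution detail
}
```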

Thinking through what this would mean for commands:

  • drain would stay executing until all tasks are stopped and then transitions to completed.
  • freeze/unfreeze would basically transition directly to completed.
  • stop_task could work like drain but on a single task.
  • balance would probably work like drain as well since it potentially releases tasks.

I actually can't think of a case where we'd use failed. I think timeouts should be implemented purely on the client side (watch the command node for N seconds before just giving up and moving on).
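
The client-side timeout could be as dumb as the sketch below. It builds on the hypothetical CommandDoc above; getState is an assumed helper that fetches and decodes the command node, and a real client would probably use an etcd watch instead of polling:

```go
// waitForCommand polls the command node until it reaches a terminal state
// or the timeout expires.
func waitForCommand(path string, timeout time.Duration,
	getState func(string) (CommandState, error)) error {

	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		state, err := getState(path)
		if err != nil {
			return err
		}
		if state == StateCompleted || state == StateFailed {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("command %s did not complete within %s", path, timeout)
}
```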

The response command node should have a TTL so it's garbage collected at some point. An hour? A day? I'm not sure.

Issues

It's just a lot of complexity.

Workaround

The HTTP introspection endpoint returns what tasks are running on the node, so polling that is an option... just kinda ugly.
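
Something along these lines; the endpoint path and response shape are assumptions since they aren't specified here:

```go
// waitForNoTasks polls the node's introspection endpoint until no tasks
// remain or the deadline passes.
func waitForNoTasks(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url) // e.g. "http://localhost:9999/tasks" (assumed path)
		if err != nil {
			return err
		}
		var tasks []string // assumed: endpoint returns a JSON array of task IDs
		err = json.NewDecoder(resp.Body).Decode(&tasks)
		resp.Body.Close()
		if err != nil {
			return err
		}
		if len(tasks) == 0 {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("tasks still running after %s", timeout)
}
```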

Not sure if I'm 👍 or 👎 on any particular option at this point.

@epsniff
Contributor Author

epsniff commented Jan 27, 2015

I like the idea, but not the complexity like you said. :( hmmm

I think we should expose a Drain or ExecuteCommand interface on the coordinator or metafora. That Drain can be blocking with a TTL. We can also add a drain to the client, but it would be async.

My suggestion is driven by a specific use case though. Our current use case is for the node to drain itself, not for our task/work coordinator to issue drains remotely. I see this as a special case of Shutdown, instead of a new "remote" command.
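
The local blocking call could look roughly like this; the method names and helpers are illustrative only, nothing here exists in metafora today:

```go
// Drain stops accepting new tasks, releases everything currently claimed,
// and blocks until the work is gone or the ttl expires. Purely a sketch
// of the interface being described, not an implementation.
func (c *Consumer) Drain(ttl time.Duration) error {
	c.freeze()     // hypothetical: stop claiming new tasks
	c.releaseAll() // hypothetical: ask running handlers to release

	select {
	case <-c.allReleased(): // hypothetical chan closed when no tasks remain
		return nil
	case <-time.After(ttl):
		return errors.New("drain timed out before all tasks were released")
	}
}
```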

@schmichael
Contributor

> I think we should expose a Drain or ExecuteCommand interface on the coordinator or metafora.
> ...
> I see this as a special case of Shutdown, instead of a new "remote" command.

I like this. Really Consumer#Shutdown is just as much a "command" as Drain/Freeze/Unfreeze. Why do we expose it as a method and force the other commands to go through the coordinator? I think the entire command API might be misguided right now.

I think the only time we've used Commands at Lytics is when I manually issued Freeze/Unfreeze commands via curl+etcd to manage a node during deployment. Being able to Freeze/Unfreeze via a local HTTP endpoint or even SIGUSR1/SIGUSR2 or something would have been much more convenient.
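
For example, the signal-based version could be wired up like this, assuming Freeze/Unfreeze methods on the Consumer (which is exactly what's being proposed here, not what exists today):

```go
// handleFreezeSignals toggles the node between frozen and unfrozen on
// SIGUSR1/SIGUSR2. Freeze/Unfreeze are the hypothetical Consumer methods
// discussed above.
func handleFreezeSignals(consumer *metafora.Consumer) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1, syscall.SIGUSR2)
	go func() {
		for sig := range sigs {
			switch sig {
			case syscall.SIGUSR1:
				consumer.Freeze() // stop claiming new tasks
			case syscall.SIGUSR2:
				consumer.Unfreeze() // resume claiming tasks
			}
		}
	}()
}
```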

So now the question is: Do we tack on a Drain method onto the Consumer or use an ExecuteCommand() method? Does the method monkey with the consumer's state directly or issue commands to the main loop?

After our discussion in person I think we both like the idea of killing commands as they exist today and just putting methods on the Consumer that directly alter Consumer state (falling back to communicating with the main loop via chans if the state manipulation proves too complex/error prone).
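
If direct state manipulation does prove too error prone, the chan fallback could look something like this (names are illustrative):

```go
// Hypothetical fallback: exported methods never touch consumer state
// directly; they hand a request to the main loop, which owns all state.
type stateReq struct {
	frozen bool
	done   chan struct{}
}

func (c *Consumer) Freeze() {
	req := stateReq{frozen: true, done: make(chan struct{})}
	c.stateCh <- req
	<-req.done // block until the main loop has applied the change
}

// ...and in the main loop's select:
//   case req := <-c.stateCh:
//       c.frozen = req.frozen // only the main loop mutates state
//       close(req.done)
```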

👍 👍 👍 👍

@schmichael
Contributor

Closing this ticket as it's pretty outdated. stop_task is best handled by the statemachine and drain is best handled through simply calling Consumer#Shutdown via a signal or HTTP handler.
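
In other words, the drain-on-shutdown flow ends up as simple as the sketch below, assuming Consumer#Shutdown blocks until handlers have released their work as described above (the HTTP path is just an example, and the server still needs to be started separately):

```go
// wireShutdown shuts the consumer down when SIGTERM/SIGINT arrives or when
// an operator hits a local drain endpoint.
func wireShutdown(consumer *metafora.Consumer) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sigs
		consumer.Shutdown()
	}()

	http.HandleFunc("/drain", func(w http.ResponseWriter, r *http.Request) {
		consumer.Shutdown()
		fmt.Fprintln(w, "drained")
	})
}
```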
