Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More ergonomic support for graceful shutdown #147

Open
merrellb opened this issue May 2, 2017 · 40 comments
Open

More ergonomic support for graceful shutdown #147

merrellb opened this issue May 2, 2017 · 40 comments

Comments

@merrellb
Copy link

merrellb commented May 2, 2017

[Edit: This issue started as a discussion of the API for cancel scope shielding, motivated by @merrellb using it to catch Cancelled errors and do clean up activities like sending "goodbye" messages in #143, but then morphed into a more general discussion of how trio could help users implement "graceful shutdown" functionality in general, so let's make it the standard issue for that. Original text follows. – @njsmith]


The documentation suggests assigning a cancel scope and setting shield as two separate steps:

with move_on_after(CLEANUP_TIMEOUT) as cleanup_scope:
    cleanup_scope.shield = True
    await conn.send_goodbye_msg()

However trio.open_cancel_scope has a shield argument and it seems like it would also make sense with the convenience functions (trio.move_on_after, trio.move_on_at, trio.fail_after and trio.fail_at). eg:

with move_on_after(CLEANUP_TIMEOUT, shield=True):
    await conn.send_goodbye_msg()

I would be happy to make the pull-request if this change makes sense.

@njsmith
Copy link
Member

njsmith commented May 7, 2017

For the record, the reason I didn't do this in the first place is that I feel like shield should be a pretty obscure and rarely used feature. It's​ very easy to use in a way that breaks your program in subtle ways (e.g. by making it unresponsive to rare network failure conditions), it breaks the rule that timeout policy should be set as far up the stack as possible, and I don't want it to become an attractive nuisance. So I included it in open_cancel_scope since that's a kind of low level "raw" API, but left it out of the high level timeout apis.

Maybe it will turn out the I'm wrong and shield really is something that gets used a lot, but I can't help but feel that that would indicate that we've messed up somehow.

In the mean time, I'm sort of feeling like maybe we should remove the kwarg from open_cancel_scope :-). It should be a bit annoying to use...

While we're at it, here's another idea: maybe we can have a way to say "yes, I know I'm cancelled, but please stop raising it unless an operation actually blocks". Like a shield with a zero second per operation timeout. So you could attempt to do things like send a goodbye message, but if it doesn't succeed immediately, oh well, let's move on. It would at least remove the issue of having to pick some magic timeout value way down inside the code...

... Though, I guess this doesn't actually help for your use case with websockets, since the main reason websockets have a goodbye message is so the initiator can then wait until they see the corresponding goodbye message from their peer, and know that both sides successfully processed all the data that was sent. (Since otherwise TCP can lose some data at the end of a connection.) Sending a goodbye message and then immediately closing the socket is pretty useless, but that's all my idea above would allow for.

Otoh I'm not entirely sure I buy that it makes sense to send websocket goodbye messages on control-C in the first place :-). "Potential data loss" sounds bad, but here it just means that the connection might act like you hit control-C a little earlier than you actually did... and who's to say that's not what happened? (Unless you're carefully monitoring the connection state and choosing when to hit control-C with millisecond precision, which seems like a weird hobby.) But maybe the problem is more general – certainly there are a lot of protocols and applications that have some concept of a graceful shutdown.

@merrellb
Copy link
Author

merrellb commented May 8, 2017

At the application level, I think a graceful shutdown might be fairly common. For example, I may want to make a call to a local database (ie something with time constraints I understand fairly well), storing the last record acknowledged by a remote WebSocket client. With nursery cancellation as the best way to manage a group of tasks, I think this may come up in many circumstances other than control-C (eg explicitly deciding to close a single connection)

I am not sure I understand your concerns about adding shield to move_on_after. Isn't that exactly what we should be wrapping our graceful connection shutdown code within? I would almost want it to be automatically enabled when exiting a canceled scope :-) . With a server spawning a nursery per incoming connection, how could we handle this at a higher level?

@njsmith
Copy link
Member

njsmith commented May 9, 2017

I think there are two things making me uneasy here:

  • I don't feel like I have a very clear picture of what the goal here should be. Cancelling work is inherently a messy process, networks are unreliable, and in a distributed systems you rarely have complete information. Like in your example, you say you want to record the last record acknowledged by the remote WebSocket client, but for all you know there might be another acknowledgement that they already sent, just it's still sitting in a router somewhere and you haven't seen it yet. (This is the Byzantine generals problem.) I totally understand the desire to make this kind of thing as reliable as possible! I want that too. But in my experience, the first and most important step in making a distributed system reliable is to have a clear picture of what we {have to / are willing to} give up. Complexity is the number 1 enemy of reliability, so papering over individual problems as they arise just doesn't work. And even in your example of making a last call to a local database, I don't know how to judge whether the complexity cost is worth it. If the control-C had arrived 0.1 seconds earlier, then you wouldn't even know that there was a last record to write, you'd just have exited, and apparently that would have been OK – so is it really worth bending over backwards to handle it? Maybe it is, I don't know.

  • As a matter of software design, I don't like the idea of low-level code imposing policy around timeouts and shutdowns. What if later on you decide that actually, on control-C exiting in 2 seconds is not good enough, you really need to exit in 1 second (or vice-versa). Do you have to go around and modify all the different places in your code that clean up resources?

    What if you're using several libraries, that each have their own internal policy on how long they'll hold out when you try to cancel them? What if you really want to shut down ASAP? Or on the other hand, what if one library wants to wait for 5 seconds and another wants to wait for 10 seconds, so the whole system ends up waiting for 10 seconds and that's fine, but the first library gave up after 5 seconds, and loses data that could have been saved if it had known that it actually had 10 seconds to work with?

These two concerns are closely related, because if we're hard-coding timeouts into little pieces of the code that can't see the big picture, then we're very likely to make the wrong trade-offs; OTOH if timeouts are being configured as late as possible and at as high a level as possible, then it's much more plausible that we can make the right trade-offs.

I almost feel like what we want is for a top-down cancellation to be able to specify a "grace period" – so the code that calls cancel() can also specify how long it's willing to wait for the cancellation to complete. The problem then would be in communicating this to the code being cancelled. The simple version would be:

  • by default, code is immediately cancelled
  • but code that wants to allow for graceful shutdown can have something like a "soft shield" that shields against cancellation during the grace period, but not after the grace period expires
  • and there's some mechanism that code running under the "soft shield" can check whether a cancellation has occurred, so that it knows that it should shut down the connection or whatever, e.g. doing if currently_cancelled(): ... or something.

That's enough for cases where the connection is active, so e.g. we can imagine a HTTP server that every time it goes to send a response, it checks for currently_cancelled() and if so then it marks the response as being the last one of this connection and immediately closes it afterwards. But what it's missing is some way for the code that wants to support graceful cleanup to get notified when a cancellation happens – like if an HTTP server or your websocket server is sitting idle waiting for something to arrive, then when it's cancelled it should immediately wake up and shut down the connection without having to poll currently_cancelled().

One obvious way to do this would be to send in an exception, and then code that wants to handle it gracefully can catch it and do whatever clever thing it wants. But... in general, you can't necessarily catch a Cancelled exception and do something clever with it. (This is an important point that I missed before, but is relevant to what you're trying to do in general!). The mere fact of delivering a Cancelled exception can itself break your connection irrecoverably. Like, consider if you were in the middle of doing a sendall to transmit a websocket message frame, and that got interrupted – you can't necessarily catch that exception and send a goodbye frame, because the previous frame might have been only half-transmitted and now you've lost synchronization. A Cancelled exception is better behaved than a generic KeyboardInterrupt, but it's still a bomb going off in the middle of your program state.

Maybe it's enough to make the soft-shield state toggleable, so you could do something like:

async def handler(ws_conn):
    with soft_shield_enabled:
        while True:
            try:
                with soft_shield_disabled:
                    msg = await ws_conn.next_message()
            except Cancelled:
                await do_cleanup(ws_conn)
                return
            await ws_conn.send_message(make_response(msg))

...Need to think about this some more. I feel like there's something here, but it isn't fully gelled yet :-).

@njsmith
Copy link
Member

njsmith commented May 9, 2017

Note: if we do promote graceful cancellation to a first-class concept, then I guess we'll want to rethink restrict_keyboard_interrupt_to_checkpoints. Maybe something like an argument that (a) makes it so KI cancels everything (i.e. actually triggers a cancellation inside init), and (b) lets you specify the grace period on this cancellation?

@merrellb
Copy link
Author

merrellb commented May 9, 2017

  • I am thinking along the lines of long running WebSockets where I am not too worried about losing a few messages but rather several hours of accumulated state. A largely "in memory" scenario with lazy persistence for performance reasons. Being able to save a snapshot on disconnect is definitely worth some complexity vs potentially replaying hours of logs/journals.
  • As you can imagine I do like the idea of graceful cancellation being a first-class concept :-). I understand how it can be messy but I don't see why it needs to be messy for common cases, especially when one controls both ends of the connection.
  • I like your idea of a graceful shutdown period (from above) and soft-shield.
  • Do we also need to worry about partial receives when canceling? I could see potentially getting out of sync if we are expecting an ack as part of our shutdown. Is there some way to cancel only if the receive is idle?

@njsmith
Copy link
Member

njsmith commented May 10, 2017

I am thinking along the lines of long running WebSockets where I am not too worried about losing a few messages but rather several hours of accumulated state. A largely "in memory" scenario with lazy persistence for performance reasons. Being able to save a snapshot on disconnect is definitely worth some complexity vs potentially replaying hours of logs/journals.

I'm dubious about the wisdom of getting into the state where you have hours of valuable data accumulated in memory in the first place – maybe you should flush snapshots regularly on a timer? But that said, assuming there's a good reason to do things this way, I think you'd really want to implement this as a task that listens for control-C and then flushes a snapshot. There's nothing wrong with doing that if you have needs that are a little more unusual/extreme, and I think "I have hours of valuable data that I need to make sure get flushed to a database over an async connection while shutting down" counts.

Do we also need to worry about partial receives when canceling? I could see potentially getting out of sync if we are expecting an ack as part of our shutdown. Is there some way to cancel only if the receive is idle?

Yeah, you'd have to be careful that the code you put inside the soft_shield_disabled block (like ws_conn.next_message in the example above) is written in a cancellation-safe manner. So for example if it looks like:

    async def next_message(self):
        while True:
            message = self._proto_state.get_next_message()
            if message is not None:
                return message
            else:
                # Receive some data and then try again
                data = await self._transport.recv(BUFSIZE)
                self._proto_state.receive_bytes(data)

then we're pretty much OK – the only await is in the call to self._transport.recv, and if it raises Cancelled then we'll exit early and leave behind any data that we already received in _proto_state's internal buffer to finish reading next time. So as long recv itself is cancellation safe then next_message will be too. And SocketType.recv is cancellation-safe.

However... this gets tricky for some protocols.

If self._transport is an SSL-wrapped socket instead of a regular socket, then it's possible that calling SSLStream.recv will need to send data on the underlying socket (because of a horrible thing called renegotiation). The version of SSLStream in my branch (#107) definitely does not guarantee that the connection is still usable after recv is cancelled.

A similar issue arises at the protocol level. For websockets, the next_message method looks to the user like it's just reading, but if it ends up reading a ping frame then it might want to send back a pong. And then if this gets cancelled half-way through then again we've lost sync.

Solving these problems seems really hard :-(. The way these protocols work, once you start trying to send something then you're really committed to doing so – like you just can't stop in the middle of a websocket Pong frame and switch to a Close frame. (For OpenSSL you're even committing to retrying the same OpenSSL API call repeatedly; it's not enough to make sure that the bytes it gave you eventually get sent. But let's ignore that for right now.)

So what would be annoying but possible is to switch the sending to always use a send-style API call (which can do partial sends and reports how much data was sent) instead of a sendall-call, and only remove bytes from the internal protocol object's outgoing buffer when send accepts them. (I wasn't planning to provide send in the high-level stream interface at all but let's assume we have it.) This would have the effect that if one operation gets interrupted during a send, then the next operation on the same object would pick it up and finish it before doing whatever else it needs to do. So e.g. if we get cancelled in the middle of sending a Ping, then when we go to send the Close it'll send the second half of the Ping + the Close.

Possibly this would become simpler if we change the semantics of sendall (in the high level stream interface, not for the raw socket interface) so that on cancellation it holds onto the data and automatically prepends it to the next call to sendall? I'm wary because this moves us back towards the infinite buffers of twisted-like systems and somewhat violates the no-implicit-concurrency rule, but if it's restricted to cancellations, where currently the only thing you can do is just close the connection, then maybe it's good. Basically this would be implementing the buffering logic inside the stream object instead of the protocol object. ...though, maybe it wouldn't work with SSL, because we could get in a state where we need to call sendall(b"") to flush out the internal send buffer before anything will work, but we don't know it and block waiting to receive instead.


Another possibility that we add more shielding states. Specifically, a task could be in the four states listed down the left column here, and the table shows what checkpoints do in each case:

Soft-cancelled Hard-cancelled
Default raise Cancelled raise Cancelled
Soft-shield, exceptions OK raise CancelledIsComing raise Cancelled
Soft-shield, exceptions not OK nothing raise Cancelled
Hard shield nothing nothing

But we don't set these states directly; instead, there's are two context managers: one that marks its contents as "soft-cancellation enabled", and one that marks its contents as "state may become corrupted if a cancellation happens". Then the states are:

  • if there are any shield=True scopes on the stack below the cancelled scope, then we're in state "Hard shield"

  • else, if there is at least one "soft-cancellation enabled" manager in effect, and no "state may become corrupted if a cancellation happens" managers in effect, then we're in state "Soft-shield, exceptions OK"

  • else, if there is at least one "soft-cancellation enabled" manager in effect, and at least one "state may become corrupted if a cancellation happens" managers in effect, then we're in state "Soft-shield, exceptions not OK"

  • else, we're in state "Default"

So... this is really complicated. But the idea is that in the end, a method like SSLStream.recv puts the "state may become corrupted" manager around its calls to sendall, and by default this doesn't do anything, but now it can guarantee that if some code that supports soft-cancellation calls it, and a soft-cancellation happens, then SSLStream.recv will leave the stream in a consistent state.

Hmm.

@njsmith njsmith changed the title Add shield argument to move_on_after and other convenience functions More ergonomic support for graceful shutdown May 10, 2017
@merrellb
Copy link
Author

Can we have a "task that listens for control-C"? I thought this was usually caught by the outer nursery itself after all tasks had canceled. While control-C is an important special case, I would think a problem with a single connection and the cancellation of its associated nursery would be the more routine scenario. I would want a graceful shutdown of these intermediate nurseries, giving a chance to persist a snapshot of the connection's associated application state.

Of the options you mention, the four shielding state approach sounds the most elegant although I realize this is beginning to also threaten the elegant simplicity you currently have :-) Still,l I do like this idea of providing extra protection where dragons lie while adding soft-cancellation to give time for a graceful shutdown (although that is easy for the guy not implementing it to say!)

@njsmith
Copy link
Member

njsmith commented May 10, 2017

Can we have a "task that listens for control-C"? I thought this was usually caught by the outer nursery itself after all tasks had canceled.

By default, KeyboardInterrupt gets delivered to some totally arbitrary place. In practice, it will often end up being delivered to the main task, which is sitting in the nursery __aexit__, and gets handled like any exception that happens inside the nursery scope (i.e., triggers a cancellation of everything in that nursery, and then gets propagated). The reason this is common is that it's what trio does if the control-C happens at a moment when the program is idle (no tasks currently running). But there's no guarantee of this – by default, it could be delivered into the middle of some delicate connection management code or whatever.

This blog post goes into a lot more detail about the trade-offs of different ways of handling control-C. In particular, there's discussion of why the default KeyboardInterrupt behavior (in both regular Python and in trio) is pretty much incompatible with guaranteeing a correct graceful cleanup, and also an example of how to write task that listens for control-C. In the example it just does raise KeyboardInterrupt, but it could just as easily do an await program_state.take_snapshot() first and then cancel everything, or any other arbitrary thing you want.

the four shielding state approach sounds the most elegant although I realize this is beginning to also threaten the elegant simplicity you currently have :-)

Yeah, I'm not sure... some kind of graceful shutdown is a pretty common/fundamental desire, so maybe it's worth it? But this is an area where trio's already ahead of the competition, and it's extremely tricky, and novel, so I'll probably let it stew for a while before making any decisions.

One thing that did occur to me is that I've already been thinking that in code like SSLStream that's "fragile" against cancellation (i.e., it can cause the object's internal state to become corrupted and unusable), it might be a good idea to have some guard that force-closes the stream on cancellation, like:

async def recv(self, max_bytes):
    try:
        # ... complicated stuff here ...
    except Cancelled:
        # our state is broken, but at least we can make sure that any attempt to use it fails fast
        self.forceful_close()
        raise

If I wrapped that up into a context manager like with cleanup_on_cancelled(self.forceful_close): ..., then that context manager could also set the flag that prevents soft cancellations from being delivered inside it if soft-cancellation handling is enabled – since these are two strategies for handling this kind of fragile state, they're needed in the same places.

Hmm.

@njsmith
Copy link
Member

njsmith commented May 10, 2017

I wonder if we should care that in principle someone might want to do some sort of graceful shutdown but doesn't care about the integrity of the stream they're reading from right now, it's some other object that needs cleanup.

@njsmith
Copy link
Member

njsmith commented Jul 10, 2017

This seems like a relevant paper for the whole "cancellation fragility" issue: Kill-Safe Synchronization Abstractions

@njsmith
Copy link
Member

njsmith commented Aug 10, 2017

Was thinking about this again in the shower today. It seems like could maybe simplify the whole "soft shield" idea by making the default state be soft-shielded. That is, the semantics would be something like:

cancel_scope.graceful_cancel(timeout), cancel_scope.graceful_cancel_at(time): set the deadline to min(current_deadline, implied_deadline), and also set the gracefully_cancelled flag

Code that's cancellation-oblivious gets the full time to complete. This seems pretty reasonable -- it's a bit annoying for code that's cancellation oblivious and runs forever, because it will use up the full graceful period before getting killed, but it's definitely a better default for any code that normally runs for a finite amount of time, and if infinite loops want good graceful cancellation then they generally need to do something special anyway.

If you don't want to set a timeout, you can use inf, but probably almost everyone wants to set some timeout so might as well make it an opt-out thing in the API.

Then code that wants to do something clever on graceful cancellation can explicitly register some interest -- I guess by flipping a flag to say "please consider graceful cancellations to be cancellations". (That's sufficient to implement wait_for_graceful_cancellation().) We might also want a currently_in_grace_period() function to poll that state – not sure.

Certainly that's enough for an HTTP server that wants to stop accepting new keepalive requests when the graceful shutdown is triggered. For the case of the websocket server that wants to send a goodbye over TLS... well, one option is to say that grace periods just don't mix well with renegotiation, sorry. In general renegotiation and similar is even less well-supported than I realized when I wrote the above. Or we could make SSLStream.receive_some guaranteed to survive cancellation; that's #198. Or we could have the ability to toggle the "please consider graceful cancellations to be cancellations" back to disabled, when calling transport_stream.send_all() from inside ssl_stream.recv_some.

@njsmith
Copy link
Member

njsmith commented Aug 10, 2017

The terminology here could use some work... graceful is on the tip of my tongue because I just stripped it out of AsyncResource, but we still have trio.aclose_forcefully, which is pretty unrelated. Grace period? cancel_with_grace_period(timeout)?

@njsmith
Copy link
Member

njsmith commented Aug 21, 2017

Another thing I just realized: for the minimal version of this idea where setting a grace period acts as (a) setting an overall deadline, and (b) setting a flag indicating that everyone should hurry up, and (c) letting specific pieces of code opt-in to receiving a cancellation when this flag is set, then it's actually totally possible to prototype as a third-party library. Having it built in would potentially have extra advantages, specifically in terms of letting the built-in server code integrate and letting SSLStream do clever things with renegotiation, and also being a standard thing that the whole ecosystem can coordinate around, but I don't think those are necessary for a prototype to be useful.

@njsmith
Copy link
Member

njsmith commented Dec 23, 2017

There's some interesting discussion of graceful shutdown in Martin Sustrik's Structured Concurrency post, from a rather different perspective.

@cjerdonek
Copy link
Contributor

cjerdonek commented Jan 15, 2018

A few thoughts that could be useful (after seeing the thread here: https://mail.python.org/pipermail/async-sig/2018-January/000437.html):

First, I'd make separate the distinction of what "knobs" a library API might want to expose to its users. These could be things like "finish in this total amount of time no matter what (including clean-up)," "allow x time for clean-up step y (no matter what)," or "aim to finish in this amount of time (excluding clean-up)." Then it would be up to the library author to use that information as it sees fit. For example, if a user specified 30 seconds absolute total time and 10 seconds for clean-up, then the library would compute and allot 20 seconds for the "success" case. Or the library could even tell the user their knob values are incompatible. Lastly, it would be up to the framework (trio in this case) to expose to the library author the primitives to support using that information. One consequence of this way of thinking is that it seems like it would be best if the framework didn't even have to know about concepts like "shutdown" and "graceful." Instead, it could expose lower level operations like increasing or decreasing the allotted time, or execute this block of code within this computed amount of time, and it would be up to the library author to map those operations to support graceful shutdown, etc.

@njsmith
Copy link
Member

njsmith commented Jan 15, 2018

I think once you get into defining a set of knobs whose semantics are specific to the particular function being called, then those are what we call function arguments :-).

In general, there's a big trade-off between semantic richness and universality -- if Trio adds something to the core cancellation semantics, then everyone using trio has to deal with it to some extent, whether it makes sense in their case or not. This can be good – the reason trio has core cancellation semantics in the first place is that libraries that don't support cancellation/timeouts are essentially impossible to use correctly, and getting it right requires some global coordination across different libraries, so it's actually good that we force everyone to think about it and provide a common vocabulary for doing so.

The need for universality definitely constrains what we can do though. We can't in general phrase things as "make sure to be finished by time X", because this requires being able to estimate ahead of time how long cleanup operations will take, and there's no way to do that for arbitrary code. And just in general I don't want to provide tons of bells and whistles that nobody uses, or are only useful in corner cases.

The semantic difference between regular execution and a "soft/graceful cancel" is that you're saying you want things to finish up, but allowing them to take their natural time. This is particularly relevant for operations that run forever, like a server accept loop – for them a graceful cancel is basically the same as a cancel. But for an operation with (hopefully) bounded time, like running an HTTP request handler, it can keep going. So it's... at least kinda general? And it definitely has some important use cases (HTTP servers!). But I'm not sure yet whether it's a good generic abstraction or not.

@njsmith
Copy link
Member

njsmith commented Aug 21, 2018

Another thing I just realized: for the minimal version of this idea where setting a grace period acts as (a) setting an overall deadline, and (b) setting a flag indicating that everyone should hurry up, and (c) letting specific pieces of code opt-in to receiving a cancellation when this flag is set, then it's actually totally possible to prototype as a third-party library

Prototype here in case anyone wants to experiment with this: https://gist.github.com/njsmith/1c83788289aaed49e091c8281d85a85e

(Please do, that would be very helpful for figuring otu if it's something we really want to do :-))

oremanj added a commit to oremanj/trio that referenced this issue Feb 6, 2019
Relevant to python-trio#886, python-trio#606, python-trio#285, python-trio#147, python-trio#70, python-trio#58, maybe others.

I was continuing my effort to shoehorn linked cancel scopes and graceful cancellation into `CancelScope` earlier today and it was feeling too much of a mess, so I decided to explore other options. This PR is the result. It makes major changes to Trio's cancellation internals, but barely any to Trio's cancellation semantics -- all tests pass except for one that is especially persnickety about `cancel_called`. No new tests or docs yet as I wanted to get feedback on the approach before polishing.

An overview:
* New class `CancelBinding` manages a single lexical context (a `with` block or a task) that might get a different cancellation treatment than its surroundings. "All plumbing, no policy."
* Each cancel binding has an effective deadline, a _single_ task, and links to parent and child bindings. Each parent lexically encloses its children. The only cancel bindings with multiple children are the ones immediately surrounding nurseries, and they have one child binding per nursery child task plus maybe one in the nested child.
* Each cancel binding calculates its effective deadline based on its parent's effective deadline and some additional data. The actual calculation is performed by an associated `CancelLogic` instance (a small ABC).
* `CancelScope` now implements `CancelLogic`, providing the deadline/shield semantics we know and love. It manages potentially-multiple `CancelBinding`s.
* Cancel stacks are gone. Instead, each task has an "active" (innermost) cancel binding, which changes as the task moves in and out of cancellation regions. The active cancel binding's effective deadline directly determines whether and when `Cancelled` is raised in the task.
* `Runner.deadlines` stores tasks instead of cancel scopes. There is no longer a meaningful state of "deadline is in the past but scope isn't cancelled yet" (this is what the sole failing test doesn't like). If the effective deadline of a task's active cancel binding is non-infinite and in the future, it goes in Runner.deadlines. If it's in the past, the task has a pending cancellation by definition.

Potential advantages:
* Cancellation becomes extensible without changes to _core, via users writing their own CancelLogic and wrapping a core CancelBinding(s) around it. We could even move CancelScope out of _core if we want to make a point.
* Nursery.start() is much simpler.
* Splitting shielding into a separate object from cancellation becomes trivial (they'd be two kinds of CancelLogic).
* Most operations that are performed frequently take constant time: checking whether you're cancelled, checking what your deadline is, entering and leaving a cancel binding. I haven't benchmarked, so it's possible we're losing on constant factors or something, but in theory this should be faster than the old approach.
* Since tasks now have well-defined root cancel bindings, I think python-trio#606 becomes straightforward via providing a way to spawn a system task whose cancel binding is a child of something other than the system nursery's cancel binding.

Caveats:
* We call `current_time()` a lot. Not sure if this is worth worrying about, and could probably be cached if so.
* There are probably bugs, because aren't there always?

Current cancel logic:
```
    def compute_effective_deadline(
        self, parent_effective_deadline, parent_extra_info, task
    ):
        incoming_deadline = inf if self._shield else parent_effective_deadline
        my_deadline = -inf if self._cancel_called else self._deadline
        return min(incoming_deadline, my_deadline), parent_extra_info
```
Want to support a grace period? I'm pretty sure it would work with something like
```
    def compute_effective_deadline(
        self, parent_effective_deadline, parent_extra_info, task
    ):
        parent_cleanup_deadline = parent_extra_info.get("effective_cleanup_deadline", parent_effective_deadline)
        if self._shield:
            parent_effective_deadline = parent_cleanup_deadline = inf
        my_cleanup_start = min(self._deadline, self._cancel_called_at)
        merged_cleanup_deadline = min(parent_cleanup_deadline, my_cleanup_start + self._grace_period)
        my_extra_info = parent_extra_info.set("effective_cleanup_deadline", merged_cleanup_deadline)
        if self._shield_during_cleanup:
            effective_deadline = merged_cleanup_deadline
        else:
            effective_deadline = min(parent_effective_deadline, my_cleanup_start)
        return effective_deadline, my_extra_info
```
Maybe that's not quite _simple_ but it is miles better than what I was looking at before. :-)
@oremanj
Copy link
Member

oremanj commented Feb 11, 2019

I think that if we add this concept of "service nurseries" or some kind of way to control the order that cancellation gets distributed, and remove the bits of your "cancel scopes have grace periods" proposal that overlap, then your proposal collapses down to become equivalent to my earlier proposal? (Maybe?) Does that sound at all plausible?

Yep, pretty much. The devil is in the details, but then, that's always true. :-)

I no longer think "graceful nurseries" as discussed above are a good idea -- what we Really Want (TM) is more like the "service nurseries" idea plus the base graceful cancellation support, and that's much much easier to think about. So, what's in "base graceful cancellation support"?

  • When you explicitly call cancel(), you should be able to specify a grace period that goes along with that cancellation, like you proposed above. The cancellation takes immediate effect for code not within a "soft shield" (which I've been calling with trio.shield_during_cleanup():), and takes effect after the grace period for code within a "soft shield".
  • If you set a deadline on a cancel scope, you probably also want the ability to set a grace period that goes along with that deadline. When the deadline expires, it's like calling scope.cancel(grace_period=scope.grace_period).
    • Open question: supposing this is in fact just an attribute of the cancel scope called grace_period, should it also be the default if you call cancel() without specifying a grace period? Or should cancel() without an explicit argument always do a hard cancel, and cancel_scope.grace_period only applies to cancellations that occur due to deadline expiry?
  • Should it be possible to specify the grace period that gets used when a nursery cancels its tasks to propagate an exception? Maybe not -- we want exception propagation to be fast.
  • Should it be possible to specify the grace period that gets used when a Ctrl+C is delivered? This one seems pretty useful, which creates some question relative to the previous bullet, since KeyboardInterrupt is an exception.
  • Should it be possible to specify that every cancellation, anywhere in a Trio program, uses a certain grace period unless otherwise specified? Maybe there is a default-grace-period contextvar?
  • Sometimes there are cleanup tasks that absolutely must finish. You can use a regular, "hard" shield for these. But, "this is a bounded-time operation that you have to wait for" is more expressive than just setting a shield -- in particular, it could be used to give other tasks more cleanup time too. (This is your "library A wants 5 seconds, library B wants 10 seconds, so you take 10 seconds, but then it doesn't make sense to cut library A off at 5 seconds" example from upthread.) We can't just interpret all shields as extending the cleanup period, because that could lead to deadlocks when using shields for cancel ordering (like in my service nursery example). Should we have a separate way to say "this code needs at least N seconds for cleanup, even if the grace period of the cancellation that hits it is smaller than that"?

I no longer think the "grace period inheritance" bit in my original proposal is worth the complexity.

@njsmith
Copy link
Member

njsmith commented Feb 11, 2019

  1. I think for most purposes, it's going to be nicer to make a soft-cancel be "opt-in" instead of "opt-out". I.e., by default it does nothing, unless you do with promote_soft_cancel_to_hard or something like that. For example, consider an HTTP/1.1 server: when you soft-cancel it, there are specific points that should shut down immediately (the accept loop; the bit of the protocol parsing code that blocks to wait for the next request to start). There's also a huge, arbitrarily complex amount of code that you don't want to cancel, i.e., everything inside the user's request handler. But it's possible that some specific pieces of the user's request handler do want to opt-in to soft-cancellation notification, for example if the handler's implementing a "long poll".

  2. I'm having trouble thinking of situations where you want to say "don't soft cancel yet, but do in X seconds". Usually it's more like "we just got SIGTERM, so soft cancel now, and hard cancel in X seconds". If so, then I think that allows for a lot of simplification? I'm imagining that CancelScope grows a .soft_cancel() method. The deadline continues to trigger a hard cancel if it expires. So a "clean shutdown with grace period" is just (a) calling .soft_cancel(), (b) setting the deadline appropriately. Are there any realistic cases this doesn't cover?

  3. Should it be possible to specify the grace period that gets used when a Ctrl+C is delivered? This one seems pretty useful, which creates some question relative to the previous bullet, since KeyboardInterrupt is an exception.

    I don't think so. There are two cases. The first case is for programs that leave control-C handling on its default settings. For them, we deliver KeyboardInterrupt to whatever code is running. This is important because it's how we can break out of while True: pass, where we might not be able to deliver a cancellation at all. But I can't see any reasonable way to combine this kind of delivery with a grace period. (Something something background thread + pthread_cancel... let's just not go there.) The second case is programs that care about shutting down carefully. They register a signal handler, and when SIGINT arrives they run whatever code they want. So if they want a grace period they can set it themselves.

  4. Thinking about it more... I realized I have a doubt about the "service nursery" idea. For trio-asyncio, OK, the rationale is pretty clear, your actual logical call stack is chopped up into a bunch of tasks across the nursery, so cancelling all those pieces at once is not going to do what you want. But for most cases, like trio-websocket... it seems intuitive that the background tasks should stick around as long as the main task does, so they can provide services. But... generally consuming those services involves executing a checkpoint, because, y'know, inter-task communication. So how can the main task tell if the background services are there or not, once the main task is cancelled? It can't talk to them. Of course this is only approximately true, there's Event.set and shielding and whatever, but I mean, as a general rule of thumb, am I wrong?

@oremanj
Copy link
Member

oremanj commented Feb 11, 2019

I think we might be talking about two different things when we talk about "soft cancellation":

  1. stop the infinite loops, leave everything else alone (for some amount of time)
  2. cancel like normal, but allow the code that executes while unwinding from the cancellation to block (for some amount of time)

I think these are both useful in different circumstances, but I'm more interested in type 2 for the purposes of this thread, because it's a lot harder to support without handling in _core. (Type 1 does great with the GracefulShutdownManager you posted; the main benefit to standardizing it is that all the server libraries can use the same one.) Also, I sort of get the sense that type 2 is more urgently desired by the Median Trio User at this point? @merrellb's desire that spawned this thread was type 2, as is the example in the Trio docs that demonstrates shielding, as are most of the cases I've run into.

Type 1 wants the default to be "soft cancellation does nothing", and you mark the places that should respond to a soft cancellation (an appropriate point in each the infinite loops); Type 2 wants it to be "soft cancellation is cancellation", and you mark the places that should get the extra time (to a first approximation, all the send_all calls and the __aexit__ and finally blocks). Apart from the defaults, I don't think they actually need different underlying mechanisms, but if we only had to support Type 1 those mechanisms could be substantially similar, as you point out.

Type 2 graceful cancellation at a high level probably looks like with move_on_after(30, grace_period=5):. This is the sort of interface we might want for:

  • run this subprocess, send it SIGTERM if it's not done in 30 seconds, send it SIGKILL if it doesn't exit from SIGTERM in 5 seconds
  • do this RPC call; if it doesn't finish in 30 seconds, tell our peer to cancel it, and wait up to 5 seconds for the response that says "yup, this was cancelled and no state was changed", in the absence of which we have to assume an indeterminate state/network partition and can't proceed with the thing we wanted to do
  • do this request-response interaction on an SSLStream; give up after waiting 30 seconds for the response, but if we can avoid breaking the stream by waiting up to 5 seconds longer for a send_all to finish, do that please

I think this is pretty generally useful?

[Ctrl+C handling]

Agreed, it does seem like the best option is to say "if you want graceful shutdown on Ctrl+C, use open_signal_receiver and write the soft_cancel() or whatever-it-is yourself".

[service nurseries] So how can the main task tell if the background services are there or not, once the main task is cancelled? It can't talk to them.

Yeah, service nurseries are only useful if the body of the async with block is also using Type 2 graceful cancellation, such that it might have need of the services after the cancellation is first raised. (Presumably a Type 1 graceful cancellation wouldn't affect the services in the first place.)

@oremanj
Copy link
Member

oremanj commented Feb 11, 2019

So far I've been imagining Type 2 graceful cancellation support using the tri-state cancel scope model (not cancelled / cancelled but shieldable by "soft-shield" scopes ie "cancelled pending cleanup" / cancelled and only shieldable by "full-shield" scopes ie "fully cancelled"). It occurs to me that we might also model it as an edge-triggered cancellation now (each task receives one Cancelled at its next checkpoint that is not soft shielded) plus a traditional level-triggered cancellation after the grace period expires. Maybe that would be easier than supporting two phases of level-triggered cancellation?

@njsmith
Copy link
Member

njsmith commented Feb 11, 2019

Huh. Yeah, I'm definitely thinking of "Type 1" shutdown. I've never encountered the other type before. I'm trying to figure out if I believe in it as a single coherent category or not :-).

What makes me nervous is in your examples, they all feel very specific to the details of the situation. Trio's cancel scopes shouldn't know the difference between SIGTERM and SIGKILL (though maybe Process objects should expose some configuration for this?), and they definitely don't know how your distributed RPC system copes with network partition. (See also this post up-thread: #147 (comment)) And the SSLStream example seems artificial – if my peer is so uncooperative they aren't even draining their receive buffer then how is waiting 5 seconds going to help? better to throw this connection away and get a better one!

OTOH everyone knows what an infinite loop is :-) (which is a nice characterization btw, thank you).

If I hand you some random third-party library, like I dunno, a complex HTTP app running under hypercorn, then what do you expect it to do if you send a "type 2 soft cancel"? what kind of guarantees would you expect it to give, that you could rely on, and file bugs if they weren't met?

I may just be suffering from a lack of imagination, because I haven't seen systems that supported these more powerful concepts... do you have any more worked-out examples of where you've run into this, as a Not Exactly Median But Let's Call It Close Enough Trio User (or "NEMBLCICETU" for short)?

@oremanj
Copy link
Member

oremanj commented Feb 11, 2019

The common thread I'm trying to point at with the "type 2 soft cancellation" is the idea that sometimes, while you can just tear everything down forcefully, and you need to be able to do so correctly to deal with misbehaving peers/networks/etc, you will get better outcomes if you can take a moment to tear things down non-forcefully when you get cancelled. And this probably just adds to your total deadline in practice if you hit the deadline because of a network problem, but maybe the slowdown was for some other reason and the network operations required for your graceful closure will complete quickly.

Trio's cancel scopes shouldn't know the difference between SIGTERM and SIGKILL

Definitely! But if Trio's cancel scopes provide a common vocabulary for saying "start sweeping the floors and telling customers we're closing in X seconds, and actually shove them out the door Y seconds after that if they haven't left yet", it's natural for phased shutdown of a process to use that vocabulary. (As for why we should care about supporting SIGTERM-wait-SIGKILL: it gives the child process time to clean up its resources, its own child processes, etc.)

and they definitely don't know how your distributed RPC system copes with network partition.

Nope -- but if you can distinguish "this call timed out but the cancellation was processed normally" from "this call timed out and also I didn't hear back about my cancellation request in a reasonable amount of time", that's a useful input into the application-level code for deciding that a network partition has happened and coping with it.

Or maybe we're not thinking in terms of network partitions at all, but we do want to see any exception that might have been raised by the remote side as a result of buggy handling of the cancellation.

And the SSLStream example seems artificial – if my peer is so uncooperative they aren't even draining their receive buffer then how is waiting 5 seconds going to help? better to throw this connection away and get a better one!

I was using the same grace period for all the examples, but this example probably makes more sense with a tiny one -- do we really not want to wait even 100 ms for the send queue to drain in this case?

If I hand you some random third-party library, like I dunno, a complex HTTP app running under hypercorn, then what do you expect it to do if you send a "type 2 soft cancel"? what kind of guarantees would you expect it to give, that you could rely on, and file bugs if they weren't met?

I would expect it to document its behavior in the soft-cancelled state, like "existing requests will be allowed to complete but we'll stop accepting new requests and we'll send CLOSE on all websocket connections once we're done sending queued messages", or whatever. I wouldn't use soft-cancel with a library that didn't define semantics for it, unless I was OK with getting hard-cancel instead. But I also wouldn't use "type 1" soft cancel with a library that didn't define semantics for it, because if it's oblivious then I'm just sleeping for a while before doing a hard cancel, and that's a waste of time.

Some more examples of "type 2 soft cancel" being useful, off the top of my head:

  • I have an SSL connection with some internal service over the public Internet and we're paranoid about security. If I just break the connection, it will hit a log file somewhere and someone will spend time looking into it to rule out the possibility of malicious interference. If I send a CLOSE_NOTIFY, I can save a human some time.

  • I'm talking to a financial institution using FIX, and they will call me on the phone and scold me about my buggy software if I don't send them the proper logout message and wait for their reply before I close the connection. Even if I'm closing the connection because they didn't respond to some request I sent in a reasonable amount of time. (Yes, this really happens.)

  • I'm implementing a "resource server": it knows about various pieces of hardware that my clients might want to run tests on, and when one of them asks me for a place to run their frobnicator, I can tell them "OK, you get WhizBang3" and it will go start mucking with that piece of hardware according to its needs. If I'm shutting down the resource server, or kicking off this client because somebody more important needs WhizBang3, or taking down WhizBang3 for maintenance, it's pretty natural to represent those things as cancellations as various levels... but I would really like to turn that into a request to the client to remove its frobnicator from WhizBang3, which response I will wait some time for, so that it's immediately in a state where I can hand it off to the next person who wants to run a frobnicator, if applicable.

@oremanj
Copy link
Member

oremanj commented Feb 11, 2019

Also, it's worth mentioning that I haven't personally encountered any cases where I needed to set both a deadline and a grace period, but I think that's largely because I'm not using deadlines very much at all in the stuff I've been writing - it's all on reliable internal networks, and cancellation generally occurs on demand. But where I did have occasion to set a grace period, I wanted "not soft shielded" to be the default.

@njsmith
Copy link
Member

njsmith commented Feb 17, 2019

There's been some discussion around these topics happening in the structured concurrency sub-forum: https://trio.discourse.group/t/graceful-shutdown/93

@njsmith
Copy link
Member

njsmith commented Feb 17, 2019

@oremanj I think I somewhat understand what you're saying here, but I'm still struggling to convince myself that a generic global thing like an ambient soft-cancelled state can handle all these heterogenous cases in an appropriate way.

I'm talking to a financial institution using FIX, and they will call me on the phone and scold me about my buggy software if I don't send them the proper logout message and wait for their reply before I close the connection. Even if I'm closing the connection because they didn't respond to some request I sent in a reasonable amount of time. (Yes, this really happens.)

Oh dear.

For the well-behaved case, it seems like you can handle this pretty easily using local knowledge?

with move_on_after(REQUEST_TIMEOUT) as cancel_scope:
    await do_request(...)
if cancel_scope.was_cancelled:
    with move_on_after(LOGOUT_TIMEOUT):
        await do_logout(...)

The nastier case is if we get an outside cancellation, like from a sibling task crashing. But I guess in your synchronous version of this code, you also have some behavior that happens if there's an unexpected exception (e.g. KeyboardInterrupt or NameError from a typo or whatever). Either you drop the connection and take the phone call, which is trivial, or else you have a finally block that makes a last-ditch attempt to logout, which in Trio is equivalent to a finally block with a shield inside it, right?

implementing a "resource server"

I'm guessing that when you shut down the server, then sometimes you'll want to kick everyone off, and other times you'll want to keep the current allocations in place (e.g. an in-place upgrade with some kind of hand-off between the old and new versions). And it would be pretty awkward if you were trying to do a in-place upgrade, but the soft-cancel accidentally kicked everyone off, so you'll need some kind of out-of-band signalling mechanism here anyway?

And for the policy decisions like the current WhizBang3 user getting pre-empted, or taking down WhizBang3 for maintenance, then it seems like the policy engine making these decisions has to have an intimate relationship with the code that sends the "please remove your frobnicator" messages. Can't they use one of Trio's many fine communication channels? To me the key argument for having soft-cancellation built-in is would be to allow coordination between cancellers and cancellees who don't otherwise know about each other (e.g., signal handlers and accept loops).

I'm also trying to imagine: eventually we will write some docs, like "So you want to add graceful shutdown support to your server!" With the "type 1" design, it's something like:

  1. Add a task somewhere that does:

     async with trio.open_signal_receiver(SIGTERM) as receiver:
         async for _ in receiver:
             root_cancel_scope.soft_cancel(hard_timeout=10)
  2. Go find the places where work enters your system or you block indefinitely, and add with upgrade_soft_cancel_to_hard_cancel around them. (Optional; for many typical servers, the accept loops will already have this and nothing more needs to be done.)

  3. You're done!

If we make soft-cancel an opt-out thing, then I fear this documentation will become much more complicated...

@nmalaguti
Copy link
Contributor

I was thinking about your Type 1 and Type 2 cancellation types, and I think they can both be represented by the prototype I made in #941.

The "softness" or "gracefulness" is inherited from parent scopes as long as intervening scopes don't explicitly set it.

For Type 1, you would do something like:

async def accept_loop():
    while True:
        with trio.CancelScope(graceful=True) as cancel_scope:
            # simulate accept
            await trio.sleep(2)

        if cancel_scope.cancelled_caught:
            # do cleanup behavior for accept cancellation
            # this will immediately exit for hard cancel
            # e.g. send close to a network peer
            await trio.sleep(1)
            break

        # simulate handling request
        await trio.sleep(5)

    # general cleanup could go here

For Type 2:

async def soft_cancel_after(timeout, cancel_scope):
    logging.info('calling soft cancel in %ss', timeout)
    with trio.move_on_after(timeout):
        await trio.sleep_forever()
    logging.info('soft cancel called')
    cancel_scope.graceful_cancel()

@asynccontextmanager
@async_generator
async def move_on_after_with_grace(timeout, grace_period):
    # make child scopes graceful
    logging.info('calling hard cancel in %ss', timeout + grace_period)
    with trio.CancelScope(graceful=True, deadline=trio.current_time() + timeout + grace_period) as cancel_scope:
        async with trio.open_nursery() as nursery:
            nursery.start_soon(soft_cancel_after, timeout, cancel_scope)
            try:
                await yield_(cancel_scope)
            finally:
                cancel_scope.cancel()

async def main():
    timeout = 10
    grace_period = 5
    try:
        async with move_on_after_with_grace(timeout, grace_period) as cancel_scope:
            logging.info('about to call do_send()')
            await do_send()
            logging.info('done calling do_send()')
    finally:
        if cancel_scope.cancelled_caught:
            logging.info('do_send cancelled')

async def do_send():
    while True:
        # do some normal stuff you want to cancel early here
        logging.info('do_send normal 2')
        await trio.sleep(2)
        with trio.CancelScope(graceful=False):  # protect from early cancel
            # simulate send all
            logging.info('do_send send_all 5')
            await trio.sleep(5)

trio.run(main)

There's probably something that could be done around soft_deadline in CancelScope to remove the need for async with move_on_after_with_grace() but I feel like both are solved by allowing code to opt-in and override the gracefulness of scopes.

Or maybe my examples are too simplistic to expose the issues with this design.

I'm guessing that when you shut down the server, then sometimes you'll want to kick everyone off, and other times you'll want to keep the current allocations in place (e.g. an in-place upgrade with some kind of hand-off between the old and new versions). And it would be pretty awkward if you were trying to do a in-place upgrade, but the soft-cancel accidentally kicked everyone off, so you'll need some kind of out-of-band signalling mechanism here anyway?

I feel like you're starting to describe supervision trees from Elixir/Erlang and all they ways they can decide what processes to restart when one dies. You might be able to model OTP and supervision trees, but I'm not sure how much of the actor concurrency model you'd need to replicate to maintain the same invariants. It's certainly powerful though. If you could make it as easy to use in the simple case as making coroutines and awaiting them, that would be a big win in my mind.

@njsmith
Copy link
Member

njsmith commented Feb 20, 2019

I feel like you're starting to describe supervision trees from Elixir/Erlang and all they ways they can decide what processes to restart when one dies.

Erlang supervisors seem pretty different to me, because their policies are totally local. In Trio terms, they're like assigning restart policies to specific nurseries. That's an interesting thing, and one I'm hoping we'll see trio libraries experimenting with, but it's a different thing from cancellation, which is all about coordination between far-flung code that may not know about each other at all.

@nmalaguti
Copy link
Contributor

I guess I was thinking about the in-place upgrade vs hand-off behavior and how different tasks could be slowly recycled out to be the new task implementations. But you're right, the cancellation behavior doesn't have a good analog.

I was trying to think of ways you could abuse CancelScopes and Cancelled exceptions to broadcast arbitrary messages. The problem is that the only way to interrupt code that isn't aware of your custom message type is to cancel it and have it bubble up until it hits a scope that is aware. It's a blunt instrument. We can extend it for different kinds of cancellations. Instead of just hard and soft, maybe make some different levels and scopes can filter everything below their level, similar to logging levels and filters. But at the end of the day the only way to interrupt a suspended task and restore control at a desired point is to cancel everything below that point.

If you want to run an accept loop and also handle a signal from elsewhere in your app you can use a global Event and race between the accept and Event. But that assumes you don't intend to shutdown, maybe you just want to reload a config from disk and restart your accept loop on a new port but keep all of the existing connections open. That isn't a cancellation.

If there's code you don't control that is running the accept loop and you want to cancel it to pick up the new port but don't want to close the existing connections (which will certainly be closed if control is returned from whatever inner nursery is managing them), if the library hasn't given you a way to send a message inside or let you supply the nursery the workers will spawn in, you don't have many options. There are lots of patterns one could follow/abuse given multiple cancellation levels, but ultimately you'd be risking cancelling things you didn't intend to if you weren't really planning on shutting down.

In Erlang, if the supervision tree was setup in an amenable way, you could just send a message to the accept process and have it die and be restarted (provided that picks up the new configuration). Since tasks in trio aren't addressable the best you can do is cancel everything. With graceful cancel you can stop accepting for a while and wait for existing connections to drain and then restart (there will be some downtime), but that's hardly ideal.

If there was a library like Blinker for the trio ecosystem, library authors could at least expose signals to callers without having to pass control objects from outside, but I think every case will be slightly different and special.

To me, graceful cancellation is only meaningful if you will cancel the whole scope in the near future. Anything else isn't really cancellation, it's signaling. And while trio or a library could potentially improve signaling between tasks, it isn't the topic we've been discussing.

@smurfix
Copy link
Contributor

smurfix commented Feb 21, 2019 via email

@smurfix
Copy link
Contributor

smurfix commented Feb 21, 2019 via email

@nmalaguti
Copy link
Contributor

Use the existing Blinker library to push messages to recipient queues ^W channels; start tasks with an async-for loop to read, maybe-dispatch, and process them.

Even simpler:

import trio
from blinker import Signal, signal
from typing import Union


accept_loop_signal = Signal()


class NamedCancelScope(trio.CancelScope):

    __slots__ = ['_was_signalled']

    def __init__(self, *, name: Union[Signal, str], **kwargs):
        super().__init__(**kwargs)

        if not isinstance(name, Signal):
            name = signal(name)

        self._was_signalled = False
        name.connect(self._signal_cancel)

    @property
    def was_signalled(self):
        return self._was_signalled

    def _signal_cancel(self, _sender):
        self._was_signalled = True
        self.cancel()


def cancel_named(name: Union[Signal, str]):
    if not isinstance(name, Signal):
        name = signal(name)

    name.send()

# Example Usage:

async def accept_loop():
    with NamedCancelScope(name=accept_loop_signal) as cancel_scope:
        await trio.sleep_forever()

    if cancel_scope.was_signalled:
        print('signalled')
    if cancel_scope.cancelled_caught:
        print('cancelled')


async def accept_loop2():
    with NamedCancelScope(name=accept_loop_signal) as cancel_scope:
        await trio.sleep_forever()

    if cancel_scope.was_signalled:
        print('signalled2')
    if cancel_scope.cancelled_caught:
        print('cancelled2')


async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(accept_loop)
        nursery.start_soon(accept_loop2)
        await trio.sleep(3)
        cancel_named(accept_loop_signal)

if __name__ == '__main__':
    trio.run(main)

All scopes that share the same name will be cancelled. This doesn't allow you to only cancel the named scopes below your current scope, but I'm not sure if that is an actual use case. There are also things you could try to do around setting deadlines in named scopes (set them all so they each get the same time to finish their work), but again I think a specific use case would better inform the design than trying to support everything.

@oremanj
Copy link
Member

oremanj commented Jan 12, 2020

There was some good discussion of graceful shutdown issues in gitter today, starting at https://gitter.im/python-trio/general?at=5e1a59a3a859c14fa1ccfd72.

@arthur-tacca
Copy link
Contributor

Maybe it's not helpful to reply to a 5 year old comment, but to answer this question:

Can we have a "task that listens for control-C"?

The answer is certainly yes. I can think of two options:

  1. This is the simpler option but I'm not 100% sure it's reliable (I suppose not given that no one else mentioned it): Just set your own signal handler, like this:
def handle_keyboard_interrupt(nursery: trio.Nursery, signum, frame):
    nursery.cancel_scope.cancel()

You would set it up by passing functools.partial(handle_keyboard_interrupt, nursery) to signal.signal(signal.SIGINT, ...). If you want to still be able to forcefully kill your program sometimes you can add a line that passes signal.default_int_handler to signal.signal(...) so later uses of ctrl+c have their traditional affect.

  1. This is a bit more faff but I'm sure it works: Instead of running trio in the main thread as usual, run it in a separate thread, with a very simple main thread that just starts the trio thread and waits for KeyboardInterrupt; when it gets it, use trio_token.run_sync_soon() to notify the trio thread. There are a couple of annoying edge cases (e.g. what if you get a KeyboardInterrupt before the trio thread has run far enough for you to fetch the trio_token yet) but they are solveable.

It seems to me all the facilities for ergonomic shutdown are already present in Trio (but maybe they were added since this discussion mainly happened, or maybe my applications are not as complex as others). A technique that could often work is to use two nurseries at the top level, one which represents the whole run of the program (including shutdown), and one that just represents normal operation (not including shutdown). When you want to shutdown, cancel the inner nursery (and set a deadline on the outer one) and signal to the remaining tasks in some other way. That way, there is no need for shielding because cancelling is not the mechanism used to tell tasks about graceful shutdown - if a task gets cancelled then it should always give up immediately, because the opportunity to shutdown gracefully has already passed.

Here's an example of how it would look. It accepts incoming connections only during the inner nursery, but handlers for those connections are run in the outer nursery and a message is sent to those handlers to notify them they should shut down (I'm assuming a memory channel is open for each one anyway and a list is kept up to date in some accessible place - that depends on the application).

async with trio.open_nursery() as outer_nursery:
    async with trio.open_nursery() as inner_nursery:
        signal.signal(signal.SIGINT, partial(handle_keyboard_interrupt, inner_nursery))
        inner_nursery.start_soon(partial(trio.serve_tcp, ..., handler_nursery=outer_nursery))
    
    # Somehow notify your connection handlers that they should shut down, e.g.:
    for channel in handler_channels: # Stashed away elsewhere
        channel.send_nowait(None)  # Or whatever to indicate shutdown
    
    outer_nursery.cancel_scope.deadline = trio.current_time() + 3  # Grace of 3 seconds

Personally, I find it easier to understand doing it at the application level like this, rather than using a facility baked into Trio even if it existed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants