There are actually two separate bugs here that (I believe) require two separate fixes, but I am putting them both in this issue since they are so closely related (and to avoid spamming the issue tracker).
1. On asyncio, a child task can finish with an unhandled exception without cancelling its group's scope.
2. On asyncio, there is a delay between a child task finishing with an unhandled exception and its group's scope getting cancelled. This does not match the behavior on Trio. Per Trio's documentation:

   > If any task inside the nursery finishes with an unhandled exception, then the nursery *immediately* cancels all the tasks inside the nursery. (emphasis added)
How can we reproduce the bug?
Here are two tests. Both fail on asyncio and pass on Trio. The first test is just a weaker version of the second, but I have separated the two here because I think it is easier to follow that way.
The weaker case:
```python
# Easy mode: `tg.cancel_scope` gets shielded many event loop cycles after the child task
# exits with an exception.
async def test_shield_after_task_failed_weak() -> None:
    async def taskfunc() -> None:
        raise Exception("child task failed")

    with pytest.raises(BaseExceptionGroup) as exc:
        with CancelScope() as outer_scope:
            async with create_task_group() as tg:
                outer_scope.cancel()
                tg.start_soon(taskfunc)
                with CancelScope(shield=True):
                    await wait_all_tasks_blocked()
                    # Wait at least one more scheduling round to ensure that taskfunc's
                    # task_done on asyncio has finished.
                    await sleep(0.1)

                tg.cancel_scope.shield = True

    assert tg.cancel_scope.cancel_called
    assert len(exc.value.exceptions) == 1
    assert str(exc.value.exceptions[0]) == "child task failed"
```
This weaker test is a regression test only for bug (1); it does not test for bug (2).
To fix this case I believe we just need to change the `.cancel` call in `task_done` to be unconditional, in `anyio/src/anyio/_backends/_asyncio.py`, lines 733 to 734 in d1aea98.
The stronger case:

```python
# Hard mode: tg.cancel_scope gets shielded after the child task exits with an exception
# but before its task_done (on asyncio) runs.
async def test_shield_after_task_failed() -> None:
    taskfunc_exited = anyio.Event()

    async def taskfunc() -> None:
        try:
            raise Exception("child task failed")
        finally:
            taskfunc_exited.set()
            # (Technically we are setting this event slightly before taskfunc exits, but
            # this won't wake any waiters until the event loop's next batch of task
            # steps/callbacks, which will be the first batch to run after taskfunc has
            # exited.)

    with pytest.raises(BaseExceptionGroup) as exc:
        with CancelScope() as outer_scope:
            async with create_task_group() as tg:
                outer_scope.cancel()
                tg.start_soon(taskfunc)
                with CancelScope(shield=True):
                    # Trio documents: "If any task inside the nursery finishes with an
                    # unhandled exception, then the nursery *immediately* cancels all
                    # the tasks inside the nursery" (emphasis added). So when the event
                    # loop drives taskfunc's coro one step forward and it exits with an
                    # exception, tg.cancel_scope must get cancelled *immediately*, i.e.
                    # it must get cancelled before any other tasks/callbacks get to run
                    # a step.
                    await taskfunc_exited.wait()

                tg.cancel_scope.shield = True

    assert tg.cancel_scope.cancel_called
    assert len(exc.value.exceptions) == 1
    assert str(exc.value.exceptions[0]) == "child task failed"
```
This stronger case is a regression test for both bug (1) and bug (2).
(It's a bit hard to read because the test has to be pretty careful to be able to deterministically hit the problematic scheduling order.)
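The scheduling order that the stronger test relies on can be observed with plain asyncio, independent of AnyIO. The following is only a minimal sketch under that assumption: the names `waiter`, `done_callback` and `events` are illustrative, and the done callback merely stands in for AnyIO's `task_done`. It suggests that a task woken by an event set in the failing task's `finally` block gets to run a step before the failing task's done callback runs.

```python
import asyncio


async def main() -> list[str]:
    events: list[str] = []
    taskfunc_exited = asyncio.Event()

    async def taskfunc() -> None:
        try:
            raise Exception("child task failed")
        finally:
            events.append("taskfunc finally ran")
            taskfunc_exited.set()

    async def waiter() -> None:
        await taskfunc_exited.wait()
        events.append("waiter resumed")

    def done_callback(task: "asyncio.Task[None]") -> None:
        # Stand-in for a task_done-style callback (illustrative only).
        events.append("done callback ran")
        task.exception()  # retrieve the exception so asyncio does not log it

    waiter_task = asyncio.create_task(waiter())
    await asyncio.sleep(0)  # let the waiter block on the event first

    child = asyncio.create_task(taskfunc())
    child.add_done_callback(done_callback)

    await waiter_task
    await asyncio.sleep(0.01)  # give the done callback time to run
    return events


events = asyncio.run(main())
print(events)
```

Here the waiter's wake-up is queued by `Event.set()` before the task finishes, while the done callback is queued only once the task has finished, so the waiter runs in the window between the failure and the callback.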
To fix this case I believe we need to move the exception-checking logic out of `task_done` and into a wrapper function:
```python
# in _spawn:
@wraps(func)
async def wrapper():
    try:
        return await func(...)
    finally:
        # task_done's exception-checking logic goes here, i.e. (simplified pseudocode)
        if exception:
            self.cancel_scope.cancel()
```
This way the `.cancel` call will happen immediately when the task finishes with an exception, as Trio documents, rather than the `.cancel` being postponed one scheduling batch later.
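To illustrate why a wrapper reacts sooner than a done callback, here is a small sketch in plain asyncio. It is not AnyIO's `_spawn` implementation; `wrapper`, `observer` and `done_callback` are hypothetical names used only for this demonstration. The wrapper's `except` block runs in the same task step in which the exception propagates, whereas the done callback is only queued at that point and runs after other ready tasks have taken a step.

```python
import asyncio


async def main() -> list[str]:
    events: list[str] = []

    async def taskfunc() -> None:
        raise Exception("child task failed")

    async def wrapper() -> None:
        try:
            await taskfunc()
        except BaseException:
            # A real fix would cancel the group's scope here (assumption).
            events.append("wrapper saw exception")
            raise

    async def observer() -> None:
        # Any other task that is ready to run in the same scheduling batch.
        events.append("observer ran")

    def done_callback(task: "asyncio.Task[None]") -> None:
        events.append("done callback ran")
        task.exception()  # retrieve the exception so asyncio does not log it

    child = asyncio.create_task(wrapper())
    child.add_done_callback(done_callback)
    observer_task = asyncio.create_task(observer())

    await observer_task
    await asyncio.sleep(0.01)  # give the done callback time to run
    return events


events = asyncio.run(main())
print(events)
```

In this sketch the observer task runs between the wrapper seeing the exception and the done callback firing, which is the gap that reacting inside the wrapper is meant to close.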
Note: this test can also be written equivalently (I believe) using `wait_all_tasks_blocked` in order to hit the problematic scheduling order rather than using an event as I did above. I find the event-based version preferable (easier to understand), but the following is (I think) an equivalent way to implement this test:

```python
async def test_shield_after_task_failed2() -> None:
    async def taskfunc() -> None:
        raise Exception("child task failed")

    with pytest.raises(BaseExceptionGroup) as exc:
        with CancelScope() as outer_scope:
            async with create_task_group() as tg:
                outer_scope.cancel()
                tg.start_soon(taskfunc)
                with CancelScope(shield=True):
                    # Trio documents: "If any task inside the nursery finishes with an
                    # unhandled exception, then the nursery *immediately* cancels all
                    # the tasks inside the nursery" (emphasis added). So when the event
                    # loop drives taskfunc's coro one step forward and it exits with an
                    # exception, tg.cancel_scope must get cancelled *immediately*, i.e.
                    # it must get cancelled before any other tasks/callbacks get to run
                    # a step.
                    await wait_all_tasks_blocked()

                tg.cancel_scope.shield = True

    assert tg.cancel_scope.cancel_called
    assert len(exc.value.exceptions) == 1
    assert str(exc.value.exceptions[0]) == "child task failed"
```
Things to check first
I have searched the existing issues and didn't find my bug already reported there
I have checked that my bug is still present in the latest release
AnyIO version
master (d1aea98), as well as the latest #774 (f9a1e1a)
Python version
3.12.6, CPython