Child tasks don't cancel their group's scope correctly when they raise an exception on asyncio #787

gschaffner · 2024-09-12T06:36:52Z

Things to check first

I have searched the existing issues and didn't find my bug already reported there
I have checked that my bug is still present in the latest release

AnyIO version

master (d1aea98), as well as the latest #774 (f9a1e1a)

Python version

3.12.6, CPython

What happened?

There are actually two separate bugs here that (I believe) require two separate fixes, but I am putting them both in this issue since they are so closely related (and to avoid spamming the issue tracker).

On asyncio, a child task can finish with an unhandled exception without cancelling its group's scope.
On asyncio, there is a delay between a child task finishing with an unhandled exception and its group's scope getting cancelled. This does not match the behavior on Trio. Per Trio's documentation:

If any task inside the nursery finishes with an unhandled exception, then the nursery immediately cancels all the tasks inside the nursery. (emphasis added)

How can we reproduce the bug?

Here are two tests. They all fail on asyncio and pass on trio. The first test is just a weaker version of the second test, but I have separated the two here because I think it is easier to follow that way.

The weaker case:

# Easy mode: `tg.cancel_scope` gets shielded many event loop cycles after the child task
# exits with an exception.
async def test_shield_after_task_failed_weak() -> None:
    async def taskfunc() -> None:
        raise Exception("child task failed")

    with pytest.raises(BaseExceptionGroup) as exc:
        with CancelScope() as outer_scope:
            async with create_task_group() as tg:
                outer_scope.cancel()
                tg.start_soon(taskfunc)
                with CancelScope(shield=True):
                    await wait_all_tasks_blocked()
                    # Wait at least one more scheduling round to ensure that taskfunc's
                    # task_done on asyncio has finished.
                    await sleep(0.1)
                tg.cancel_scope.shield = True
                assert tg.cancel_scope.cancel_called

    assert len(exc.value.exceptions) == 1
    assert str(exc.value.exceptions[0]) == "child task failed"

This weaker test is a regression test only for bug (1); it does not test for bug (2).

To fix this case I believe we just need to change the .cancel call in task_done to be unconditional here:

anyio/src/anyio/_backends/_asyncio.py

Lines 733 to 734 in d1aea98

    
           if not self.cancel_scope._parent_cancelled(): 
        
               self.cancel_scope.cancel()

The stronger case:

# Hard mode: tg.cancel_scope gets shielded after the child task exits with an exception
# but before its task_done (on asyncio) runs.
async def test_shield_after_task_failed() -> None:
    taskfunc_exited = anyio.Event()

    async def taskfunc() -> None:
        try:
            raise Exception("child task failed")
        finally:
            taskfunc_exited.set()
            # (Technically we are setting this event slightly before taskfunc exits, but
            # this won't wake any waiters until the event loop's next batch of task
            # steps/callbacks, which will be the first batch to run after taskfunc has
            # exited.)

    with pytest.raises(BaseExceptionGroup) as exc:
        with CancelScope() as outer_scope:
            async with create_task_group() as tg:
                outer_scope.cancel()
                tg.start_soon(taskfunc)
                with CancelScope(shield=True):
                    # Trio documents: "If any task inside the nursery finishes with an
                    # unhandled exception, then the nursery *immediately* cancels all
                    # the tasks inside the nursery" (emphasis added). So when the event
                    # loop drives taskfunc's coro one step forward and it exits with an
                    # exception, tg.cancel_scope must get cancelled *immediately*, i.e.
                    # it must get cancelled before any other tasks/callbacks get to run
                    # a step.
                    await taskfunc_exited.wait()
                tg.cancel_scope.shield = True
                assert tg.cancel_scope.cancel_called

    assert len(exc.value.exceptions) == 1
    assert str(exc.value.exceptions[0]) == "child task failed"

This stronger case is a regression test for both bug (1) and bug (2).

(It's a bit hard to read because the test has to be pretty careful to be able to deterministically hit the problematic scheduling order.)

To fix this case I believe we need to move the exception-checking logic out of task_done and into a wrapper function:

# in _spawn:
@wraps(func)
async def wrapper():
    try:
        return await func(...)
    finally:
        # task_done's exception-checking logic goes here, i.e. (simplified pseudocode)
        if exception:
            self.cancel_scope.cancel()

This way the .cancel call will happen immediately when the task finishes with an exception, as Trio documents, rather than the .cancel being postponed one scheduling batch later.

(Note, this test can also be written equivalently (I believe) using wait_all_tasks_blocked in order to hit the problematic scheduling order rather than using an event as I did above. I find the event-based version preferable (easier to understand), but the following is (I think) an equivalent way to implement this test:

async def test_shield_after_task_failed2() -> None:
    async def taskfunc() -> None:
        raise Exception("child task failed")

    with pytest.raises(BaseExceptionGroup) as exc:
        with CancelScope() as outer_scope:
            async with create_task_group() as tg:
                outer_scope.cancel()
                tg.start_soon(taskfunc)
                with CancelScope(shield=True):
                    # Trio documents: "If any task inside the nursery finishes with an
                    # unhandled exception, then the nursery *immediately* cancels all
                    # the tasks inside the nursery" (emphasis added). So when the event
                    # loop drives taskfunc's coro one step forward and it exits with an
                    # exception, tg.cancel_scope must get cancelled *immediately*, i.e.
                    # it must get cancelled before any other tasks/callbacks get to run
                    # a step.
                    await wait_all_tasks_blocked()
                tg.cancel_scope.shield = True
                assert tg.cancel_scope.cancel_called

    assert len(exc.value.exceptions) == 1
    assert str(exc.value.exceptions[0]) == "child task failed"

)

The text was updated successfully, but these errors were encountered:

gschaffner added the bug Something isn't working label Sep 12, 2024

gschaffner mentioned this issue Sep 12, 2024

Fixed TaskGroup and CancelScope exit issues on asyncio #774

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Child tasks don't cancel their group's scope correctly when they raise an exception on asyncio #787

Child tasks don't cancel their group's scope correctly when they raise an exception on asyncio #787

gschaffner commented Sep 12, 2024 •

edited

Loading

Child tasks don't cancel their group's scope correctly when they raise an exception on asyncio #787

Child tasks don't cancel their group's scope correctly when they raise an exception on asyncio #787

Comments

gschaffner commented Sep 12, 2024 • edited Loading

Things to check first

AnyIO version

Python version

What happened?

How can we reproduce the bug?

gschaffner commented Sep 12, 2024 •

edited

Loading