Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Child tasks don't cancel their group's scope correctly when they raise an exception on asyncio #787

Open
2 tasks done
gschaffner opened this issue Sep 12, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@gschaffner
Copy link
Collaborator

gschaffner commented Sep 12, 2024

Things to check first

  • I have searched the existing issues and didn't find my bug already reported there

  • I have checked that my bug is still present in the latest release

AnyIO version

master (d1aea98), as well as the latest #774 (f9a1e1a)

Python version

3.12.6, CPython

What happened?

There are actually two separate bugs here that (I believe) require two separate fixes, but I am putting them both in this issue since they are so closely related (and to avoid spamming the issue tracker).

  1. On asyncio, a child task can finish with an unhandled exception without cancelling its group's scope.

  2. On asyncio, there is a delay between a child task finishing with an unhandled exception and its group's scope getting cancelled. This does not match the behavior on Trio. Per Trio's documentation:

    If any task inside the nursery finishes with an unhandled exception, then the nursery immediately cancels all the tasks inside the nursery. (emphasis added)

How can we reproduce the bug?

Here are two tests. They all fail on asyncio and pass on trio. The first test is just a weaker version of the second test, but I have separated the two here because I think it is easier to follow that way.

The weaker case:

# Easy mode: `tg.cancel_scope` gets shielded many event loop cycles after the child task
# exits with an exception.
async def test_shield_after_task_failed_weak() -> None:
    async def taskfunc() -> None:
        raise Exception("child task failed")

    with pytest.raises(BaseExceptionGroup) as exc:
        with CancelScope() as outer_scope:
            async with create_task_group() as tg:
                outer_scope.cancel()
                tg.start_soon(taskfunc)
                with CancelScope(shield=True):
                    await wait_all_tasks_blocked()
                    # Wait at least one more scheduling round to ensure that taskfunc's
                    # task_done on asyncio has finished.
                    await sleep(0.1)
                tg.cancel_scope.shield = True
                assert tg.cancel_scope.cancel_called

    assert len(exc.value.exceptions) == 1
    assert str(exc.value.exceptions[0]) == "child task failed"

This weaker test is a regression test only for bug (1); it does not test for bug (2).

To fix this case I believe we just need to change the .cancel call in task_done to be unconditional here:

if not self.cancel_scope._parent_cancelled():
self.cancel_scope.cancel()

The stronger case:

# Hard mode: tg.cancel_scope gets shielded after the child task exits with an exception
# but before its task_done (on asyncio) runs.
async def test_shield_after_task_failed() -> None:
    taskfunc_exited = anyio.Event()

    async def taskfunc() -> None:
        try:
            raise Exception("child task failed")
        finally:
            taskfunc_exited.set()
            # (Technically we are setting this event slightly before taskfunc exits, but
            # this won't wake any waiters until the event loop's next batch of task
            # steps/callbacks, which will be the first batch to run after taskfunc has
            # exited.)

    with pytest.raises(BaseExceptionGroup) as exc:
        with CancelScope() as outer_scope:
            async with create_task_group() as tg:
                outer_scope.cancel()
                tg.start_soon(taskfunc)
                with CancelScope(shield=True):
                    # Trio documents: "If any task inside the nursery finishes with an
                    # unhandled exception, then the nursery *immediately* cancels all
                    # the tasks inside the nursery" (emphasis added). So when the event
                    # loop drives taskfunc's coro one step forward and it exits with an
                    # exception, tg.cancel_scope must get cancelled *immediately*, i.e.
                    # it must get cancelled before any other tasks/callbacks get to run
                    # a step.
                    await taskfunc_exited.wait()
                tg.cancel_scope.shield = True
                assert tg.cancel_scope.cancel_called

    assert len(exc.value.exceptions) == 1
    assert str(exc.value.exceptions[0]) == "child task failed"

This stronger case is a regression test for both bug (1) and bug (2).

(It's a bit hard to read because the test has to be pretty careful to be able to deterministically hit the problematic scheduling order.)

To fix this case I believe we need to move the exception-checking logic out of task_done and into a wrapper function:

# in _spawn:
@wraps(func)
async def wrapper():
    try:
        return await func(...)
    finally:
        # task_done's exception-checking logic goes here, i.e. (simplified pseudocode)
        if exception:
            self.cancel_scope.cancel()

This way the .cancel call will happen immediately when the task finishes with an exception, as Trio documents, rather than the .cancel being postponed one scheduling batch later.

(Note, this test can also be written equivalently (I believe) using wait_all_tasks_blocked in order to hit the problematic scheduling order rather than using an event as I did above. I find the event-based version preferable (easier to understand), but the following is (I think) an equivalent way to implement this test:

async def test_shield_after_task_failed2() -> None:
    async def taskfunc() -> None:
        raise Exception("child task failed")

    with pytest.raises(BaseExceptionGroup) as exc:
        with CancelScope() as outer_scope:
            async with create_task_group() as tg:
                outer_scope.cancel()
                tg.start_soon(taskfunc)
                with CancelScope(shield=True):
                    # Trio documents: "If any task inside the nursery finishes with an
                    # unhandled exception, then the nursery *immediately* cancels all
                    # the tasks inside the nursery" (emphasis added). So when the event
                    # loop drives taskfunc's coro one step forward and it exits with an
                    # exception, tg.cancel_scope must get cancelled *immediately*, i.e.
                    # it must get cancelled before any other tasks/callbacks get to run
                    # a step.
                    await wait_all_tasks_blocked()
                tg.cancel_scope.shield = True
                assert tg.cancel_scope.cancel_called

    assert len(exc.value.exceptions) == 1
    assert str(exc.value.exceptions[0]) == "child task failed"

)

@gschaffner gschaffner added the bug Something isn't working label Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant