Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make outputs go to correct cell when generated in threads/asyncio #1186

Merged
merged 6 commits into from
Jan 16, 2024

Conversation

krassowski
Copy link
Member

@krassowski krassowski commented Dec 21, 2023

References

Code changes

  • Use contextvars to retrieve the parent header of emitting asyncio task
  • Use thread identity to retrieve the parent header of emitting thread
  • Store iostream buffer per parent header rather than in a single buffer

User facing changes

Before After
before after

Backward-incompatible changes

None

@Zsailer Zsailer added the bug label Dec 21, 2023
@krassowski krassowski force-pushed the fix/make-outputs-go-to-right-cells branch 2 times, most recently from 5dae96f to 10c149c Compare December 21, 2023 20:18
Copy link
Collaborator

@davidbrochart davidbrochart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive PR @krassowski, that's great.
Just to confirm, this wouldn't work with threads that are not created from Python, right? For instance in a C extension.

@krassowski
Copy link
Member Author

Yes, writing from threads which were created outside of Python will still write to the most recently executed cell, as will writing directly to OS-level file descriptor corresponding to stdout/stderr.

if isinstance(stream, OutStream):
stream._thread_parents[self.ident] = parent_header

threading.Thread.start = start_closure # type:ignore[method-assign]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if there could be a way to not monkey-patch threading.Thread.start.
In akernel it's the print function that is monkey-patched, so this PR goes further because it works for lower-level writes to stdout/stderr, but with monkey-patching there is always a risk that another library changes the same object, and we cannot know that our change will be the last applied.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding as per https://bugs.python.org/issue14073 is that there is no other way, but we could mention our use case as another situation motivating introduction of start/exit hooks/callbacks to threads (which is something Python committers have considered in the past but I presume it was not a sufficiently high priority until now).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We crossed 'comments', see my other thread on this exact line for (what I think is) a better solution.

Copy link
Contributor

@maartenbreddels maartenbreddels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

if isinstance(stream, OutStream):
stream._thread_parents[self.ident] = parent_header

threading.Thread.start = start_closure # type:ignore[method-assign]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it thread-safe. If a thread starts a new thread, that new thread will start outputting in the last executed cell.
I've taken a different approach in Solara, where I patch the ctor and run from the start:
https://github.com/widgetti/solara/blob/0ac4767a8c5f8c8b221cafe41fa9ac36270adbe8/solara/server/patch.py#L336

I think this approach might work better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarize the strategy, the ctor (__init__) would take a reference to the parent_header, and in run, that parent header is then set in a thread_local storage (in your case _thread_parents).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a thread starts a new thread, that new thread will start outputting in the last executed cell.

Correct (there is no such a problem with asyncio side of things).

I think this approach might work better.

Thank you for the link, I will take a look!

Copy link
Member Author

@krassowski krassowski Dec 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a closer look at solara, current_context takes the role of _thread_parents, right? This does not seem to differ that much different. Your approach of patching earlier on initialization rather than startup has the advantage that it will work with:

from threading import Thread
from time import sleep

def child_target():
    for i in range(iterations):
        print(i, end='', flush=True)
        sleep(interval)

def parent_target():
    thread = Thread(target=child_target)
    sleep(interval)
    thread.start()

Thread(target=parent_target).start()

but still not with:

def parent_target():
    sleep(interval)
    Thread(target=child_target).start()

Thread(target=parent_target).start()

do I see this right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end my implementation converged to overriding the same methods as yours after all in e1258de :)

Comment on lines 450 to 455
self._parent_header: contextvars.ContextVar[Dict[str, Any]] = contextvars.ContextVar(
"parent_header"
)
self._parent_header.set({})
self._thread_parents = {}
self._parent_header_global = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why the ContextVar and the _thread_parents dict? Is a threading.local not more standard?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because contextvar works well for asyncio edge cases. There is some more explanation in the PEP 567, but the gist is:

Thread-local variables are insufficient for asynchronous tasks that execute concurrently in the same OS thread. Any context manager that saves and restores a context value using threading.local() will have its context values bleed to other code unexpectedly when used in async/await code.

That said, I will take another look at using threading.local instead of _thread_parents.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that we need ContextVar now, because also when we create new tasks we want to output to the right output cell, didn't think of that!

I think if we use ContextVar, we do not need the thread local storage, it's a superset of threading.local. In combination with overriding the Thread ctor, we can drop _thread_parents.

@krassowski krassowski force-pushed the fix/make-outputs-go-to-right-cells branch from 34867ac to e1258de Compare December 27, 2023 20:25
@krassowski
Copy link
Member Author

As of now this PR also handles nested threads (started after execution of other cells) as well.

@krassowski
Copy link
Member Author

Hi folks, just bumping this PR as there was no further review/comments in the past three weeks. Any suggestions, or is it good to merge?

@blink1073
Copy link
Contributor

I'll merge tomorrow if there are no further comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Output widget context manager prints to wrong cell (if used in a Thread) Output appears in wrong cell
5 participants