Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow printing task backtraces via profiling peek mechanism #56043

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Drvi
Copy link
Contributor

@Drvi Drvi commented Oct 8, 2024

This would be useful e.g. for debugging stuck tests run via ReTestItems.jl (which uses multiple Julia processes to run the tests)

base/Base.jl Outdated
# Needs to be in Base because Profile is no longer loaded on boot
function profile_printing_listener(cond::Base.AsyncCondition)
profile = nothing
try
while _trywait(cond)
profile = @something(profile, require_stdlib(PkgId(UUID("9abbd945-dff8-562f-b5e8-e1ebf5ef1b79"), "Profile")))::Module
invokelatest(profile.peek_report[])
if Base.get_bool_env("JULIA_PROFILE_PEEK_TASK_BACKTRACES", false) === true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth making this the default even

Suggested change
if Base.get_bool_env("JULIA_PROFILE_PEEK_TASK_BACKTRACES", false) === true
if Base.get_bool_env("JULIA_PROFILE_PEEK_TASK_BACKTRACES", true) === true

It would also potentially be great to put this in signals-unix.c : signal_listener directly, so it operates on fatal signals too

Copy link
Contributor Author

@Drvi Drvi Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the PR so that we print the backtraces by default 👍 As for calling this from the signal handler directly -- do you mean we should do also print these for different signals than SIGUSR1/SIGINFO? I think I'd need some guidance with that as I'm not sure which signals would be suitable to be included.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: at least for us, with our huge sysimg, this can be really slow. Like >5 minutes slow.

So if that impacts others too, do we want to do this by default?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does system image size affect this? Yes, that would be the list of all "critical" signals defined there. There's already a place we call jl_print_bt_entry_codeloc on the threads, so it would just tack on right after that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We believe the reason is because symbolizing the traces is super slow with our giant binary size. But to be honest i think we don't know why it's so slow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha, that does make sense. Symbolizing something big (such as the LLVM debug info when asserts are on) can take quite some time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although that happens in parallel with the runtime, so I think that still might be okay, for something the user has to request pretty explicitly

# Needs to be in Base because Profile is no longer loaded on boot
function profile_printing_listener(cond::Base.AsyncCondition)
profile = nothing
try
while _trywait(cond)
profile = @something(profile, require_stdlib(PkgId(UUID("9abbd945-dff8-562f-b5e8-e1ebf5ef1b79"), "Profile")))::Module
invokelatest(profile.peek_report[])
if Base.get_bool_env("JULIA_PROFILE_PEEK_TASK_BACKTRACES", false) === true
println(stderr, "Printing Julia task backtraces...")
ccall(:jl_print_task_backtraces, Cvoid, ())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prints to a different abstraction over stderr than the other println calls here, so may need to explicitly call flush around it to ensure the output doesn't get jumbled

Aside, it can also be better to use print(stderr, "<text>\n") instead of println where reliable async behavior is desired, since it ensures the whole text prints in the same syscall

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very helpful, thanks!

@Drvi Drvi force-pushed the td-profile-peek-task-backtraces branch from db82a3c to 460bab6 Compare October 9, 2024 12:56
@Drvi Drvi force-pushed the td-profile-peek-task-backtraces branch from 460bab6 to 0998426 Compare October 9, 2024 12:56
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditional approval, since it would probably be better to print this async (in signal_listener) immediately instead of waiting for 1 second and then printing just before the next next yield point finally returns to the current Task (here), which could be a very long delay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants