-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements #19
Improvements #19
Conversation
Fixes an error in GC percentage reporting. Also adds ability to report GC time per domain.
match lifecycle_event with | ||
| Runtime_events.EV_RING_START -> start_time := Unix.gettimeofday () | ||
| Runtime_events.EV_RING_STOP -> end_time := Unix.gettimeofday () | ||
| Runtime_events.EV_RING_START -> wall_time.start_time <- Unix.gettimeofday () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea that for domain 0 we measure the time with RING_START and RING_STOP, and DOMAIN_SPAWN and DOMAIN_TERMINATE for other domains?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the idea. Currently, EV_DOMAIN_SPAWN
and EV_DOMAIN_TERMINATE
are recorded at caml_domain_spawn
and domain_terminate
respectively, so they're not logged for Domain 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
Have you run this PR on the solver-service PR and do the observed numbers match the results observed using perf
. Essentially, can we independently check that the results produced by the PR is correct by checking the results against what's observed in perf
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any updates on this? Just want to make sure that our reported results can be validated independently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should validate the results against perf/usr/bin/time
. I'll post the numbers here.
I got sucked into a heisenbug I found on this PR. I'll put it somewhere when I get it to segfault reliably. My hunch is that there's some bad interaction happening with lifecycle events + core, I'm not able to pin point the exact reason yet. I'm almost certain the bug is not in olly, it happens even when we do very little stuff at domain_spawn and domain_terminate. I've looked at the runtime code and don't see anything odd there either. Will dig into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put the bug reproduction here - https://github.com/Sudha247/olly-bug-repro. Will update the PR later this week to avoid the bug.
cc @sadiqj, in case you see something obviously off.
Having tested this with some examples, I think computing CPU time as sum of all domain times is incorrect, as it ends up accounting for CPU time + Sys time, which includes the kernel time. I've switched to
One a side note, the bug I encountered above is seemingly gone in the latest patch. The current patch is not related to #20 (which needs to fixed separately). |
93c40c9
to
b5b76fc
Compare
The CI is happy, I'll merge this now and cut a new release soon. |
CHANGES: * Custom events for json (tarides/runtime_events_tools#24, @Sudha247) * Improvements to correct gc-stats (tarides/runtime_events_tools#19, @Sudha247) * olly trace: ingest custom events starting from OCaml 5.1 (tarides/runtime_events_tools#17, @TheLortex)
CHANGES: * Custom events for json (tarides/runtime_events_tools#24, @Sudha247) * Improvements to correct gc-stats (tarides/runtime_events_tools#19, @Sudha247) * olly trace: ingest custom events starting from OCaml 5.1 (tarides/runtime_events_tools#17, @TheLortex)
This patch has a few improvements to make the output of
olly gc-stats
more accurate and provide additional information.README
is updated to reflect these.GC percentage
It fixes a bug in computing the percentage GC time. Previously the percentage was computed against wall time. That works for single core programs, but is incorrect for multi-domain programs. Percentage is now calculated against CPU time. CPU time is the sum total of individual times domains are alive, this is computed with. I discussed this offline with @kayceesrk and this seemed like the accurate way to do this for OCaml 5 programs.
Per-domain GC time
olly
shows the overall time spent on GC. This is useful, however I thought it'd also be useful to see the time spent on GC at the granularity of domains while looking at the numbers at Eio port of ocurrent's solver service ocurrent/solver-service#71. This could possibly be useful to tune the GC as well.gc-stats
shows the time spent on GC by the domains that were alive in the program.Tests
I've removed the menhir test and added a standalone test without any library dependencies. I originally added the menhir test to have an easily executable test without the hassle of building an executable before the dune rule is active to run olly binary. The new test, adapted from binary trees benchmark performs a lot of allocations and runs on multiple domains. menhir test is no longer necessary, IMO.