
Improvements #19

Merged · 7 commits into main from improvements · Sep 22, 2023

Conversation

Sudha247
Member

This patch has a few improvements to make the output of olly gc-stats more accurate and to provide additional information. The README is updated to reflect these.

GC percentage

It fixes a bug in computing the GC time percentage. Previously the percentage was computed against wall time. That works for single-core programs, but is incorrect for multi-domain programs. The percentage is now calculated against CPU time, taken as the sum of the individual times for which domains are alive. I discussed this offline with @kayceesrk, and this seemed like the accurate way to do it for OCaml 5 programs.
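The fix boils down to changing the denominator of the percentage. A minimal sketch (the function name and inputs are hypothetical, not olly's actual internals):

```ocaml
(* Sketch: GC percentage computed against CPU time rather than wall time.
   [gc_time] and [cpu_time] are hypothetical inputs, in seconds;
   [cpu_time] stands for the total time summed over all domains. *)
let gc_percentage ~gc_time ~cpu_time =
  if cpu_time > 0. then 100. *. gc_time /. cpu_time else 0.

let () =
  (* e.g. 1.2 s of GC over 4.8 s of CPU time across all domains *)
  Printf.printf "GC overhead: %.1f%%\n" (gc_percentage ~gc_time:1.2 ~cpu_time:4.8)
```

With wall time as the denominator, a 4-domain program spending 1.2 s in GC over 1.2 s of wall time would misleadingly report 100%.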

Per-domain GC time

olly shows the overall time spent on GC. This is useful; however, while looking at the numbers for the Eio port of ocurrent's solver service (ocurrent/solver-service#71), I thought it would also be useful to see the time spent on GC at the granularity of individual domains. This could help with tuning the GC as well. gc-stats now shows the time spent on GC by each domain that was alive in the program.
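The per-domain breakdown amounts to keying GC time by domain id instead of a single accumulator. A small sketch (all names here are hypothetical, not olly's actual internals):

```ocaml
(* Sketch: accumulate seconds spent in GC per domain, keyed by the
   runtime's integer domain id. *)
let domain_gc_time : (int, float) Hashtbl.t = Hashtbl.create 8

let add_gc_time domain_id dt =
  let cur =
    Option.value ~default:0. (Hashtbl.find_opt domain_gc_time domain_id)
  in
  Hashtbl.replace domain_gc_time domain_id (cur +. dt)

let () =
  (* simulated GC pauses: domain 0 twice, domain 1 once *)
  add_gc_time 0 0.5;
  add_gc_time 1 0.3;
  add_gc_time 0 0.2;
  Hashtbl.iter
    (fun d t -> Printf.printf "Domain %d: %.2f s in GC\n" d t)
    domain_gc_time
```

Summing the table's values recovers the overall GC time that olly already reports.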

Tests

I've removed the menhir test and added a standalone test without any library dependencies. I originally added the menhir test to have an easily runnable test without the hassle of building an executable before the dune rule that runs the olly binary is active. The new test, adapted from the binary trees benchmark, performs a lot of allocations and runs on multiple domains. The menhir test is no longer necessary, IMO.

Sudha247 added 3 commits July 21, 2023 11:22
Fixes an error in GC percentage reporting. Also adds the ability to report GC time per domain.
```ocaml
match lifecycle_event with
| Runtime_events.EV_RING_START -> start_time := Unix.gettimeofday ()
| Runtime_events.EV_RING_STOP -> end_time := Unix.gettimeofday ()
| Runtime_events.EV_RING_START -> wall_time.start_time <- Unix.gettimeofday ()
```
Collaborator


Is the idea that for domain 0 we measure the time with RING_START and RING_STOP, and DOMAIN_SPAWN and DOMAIN_TERMINATE for other domains?

Member Author


That's the idea. Currently, EV_DOMAIN_SPAWN and EV_DOMAIN_TERMINATE are recorded at caml_domain_spawn and domain_terminate respectively, so they're not logged for Domain 0.
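The accounting scheme described here (domain 0 bounded by ring start/stop, other domains by spawn/terminate) can be sketched as follows. This uses a local stand-in for `Runtime_events.lifecycle` so it runs without the runtime_events library, and all names are hypothetical:

```ocaml
(* Stand-in for the relevant Runtime_events.lifecycle constructors. *)
type lifecycle = EV_RING_START | EV_RING_STOP | EV_DOMAIN_SPAWN | EV_DOMAIN_TERMINATE

(* When each domain came alive, and the running total of domain-alive time. *)
let alive_since : (int, float) Hashtbl.t = Hashtbl.create 8
let total_alive = ref 0.

let close domain_id now =
  match Hashtbl.find_opt alive_since domain_id with
  | Some t0 ->
      total_alive := !total_alive +. (now -. t0);
      Hashtbl.remove alive_since domain_id
  | None -> ()

let on_lifecycle domain_id now = function
  | EV_RING_START -> Hashtbl.replace alive_since 0 now  (* domain 0 only *)
  | EV_RING_STOP -> close 0 now
  | EV_DOMAIN_SPAWN -> Hashtbl.replace alive_since domain_id now
  | EV_DOMAIN_TERMINATE -> close domain_id now

let () =
  (* Simulated timeline: domain 0 alive 0..5 s, domain 1 alive 1..3 s. *)
  on_lifecycle 0 0.0 EV_RING_START;
  on_lifecycle 1 1.0 EV_DOMAIN_SPAWN;
  on_lifecycle 1 3.0 EV_DOMAIN_TERMINATE;
  on_lifecycle 0 5.0 EV_RING_STOP;
  Printf.printf "total domain-alive time: %.1f s\n" !total_alive
```

In the real tool the timestamps would come from the event callbacks rather than a simulated timeline.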

Collaborator

@kayceesrk Jul 21, 2023


Ok.

Have you run this PR on the solver-service PR, and do the observed numbers match the results observed using perf? Essentially, can we independently verify that the results produced by this PR are correct by checking them against what perf reports?

Collaborator


Any updates on this? Just want to make sure that our reported results can be validated independently.

Member Author


Yes, we should validate the results against perf and /usr/bin/time. I'll post the numbers here.

I got sucked into a heisenbug I found on this PR. I'll put it somewhere once I get it to segfault reliably. My hunch is that there's some bad interaction between lifecycle events and core; I'm not able to pinpoint the exact reason yet. I'm almost certain the bug is not in olly, since it happens even when we do very little at domain_spawn and domain_terminate. I've looked at the runtime code and don't see anything odd there either. Will dig into this.

Member Author


I've put the bug reproduction here - https://github.com/Sudha247/olly-bug-repro. Will update the PR later this week to avoid the bug.

cc @sadiqj, in case you see something obviously off.

@Sudha247
Member Author

Having tested this on some examples, I think computing CPU time as the sum of all domain times is incorrect, as it ends up accounting for CPU time plus system time, which includes time spent in the kernel. I've switched to Unix.times for capturing the CPU time. Unix.times is supposed to work with both new processes and domains. Here's the comparison with /usr/bin/time and perf stat on two GC-heavy examples: the solver service and binary trees.

**Binarytrees**

| Domains | /usr/bin/time CPU (s) | /usr/bin/time Wall (s) | perf stat CPU (s) | perf stat Wall (s) | olly gc-stats CPU (s) | olly gc-stats Wall (s) |
|---|---|---|---|---|---|---|
| 1 | 3.78 | 3.81 | 4.101 | 4.163 | 4.11 | 4.11 |
| 2 | 4.49 | 2.69 | 4.97 | 3.03 | 5.15 | 3.01 |
| 3 | 5.82 | 3.25 | 5.75 | 3.22 | 3.12 | 5.65 |
| 4 | 6.37 | 3.18 | 6.37 | 3.21 | 6.64 | 3.32 |

**Solver service**

| Domains | /usr/bin/time CPU (s) | /usr/bin/time Wall (s) | perf stat CPU (s) | perf stat Wall (s) | olly gc-stats CPU (s) | olly gc-stats Wall (s) |
|---|---|---|---|---|---|---|
| 2 | 58.03 | 46.45 | 57.132 | 46.106 | 60.5 | 46.44 |
| 3 | 62.88 | 32.8 | 60.83 | 29.85 | 61.77 | 32.09 |
| 4 | 68.58 | 28.41 | 67.96 | 27.95 | 73.56 | 30.01 |
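The Unix.times approach mentioned above can be sketched as follows. Unix.times reports the process's user and system CPU time, which covers all domains; the helper name here is hypothetical, and the snippet requires linking the unix library:

```ocaml
(* Sketch: capture process CPU time (user + system) with Unix.times. *)
let cpu_time () =
  let t = Unix.times () in
  t.Unix.tms_utime +. t.Unix.tms_stime

let () =
  let t0 = cpu_time () in
  (* burn a little CPU so the delta is observable *)
  let r = ref 0 in
  for i = 1 to 10_000_000 do r := !r + i done;
  let t1 = cpu_time () in
  Printf.printf "CPU time used: %.3f s (r = %d)\n" (t1 -. t0) !r
```

Unlike summing domain-alive wall times, this matches what /usr/bin/time and perf stat report as CPU time.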

On a side note, the bug I encountered above seems to be gone in the latest patch. The current patch is not related to #20 (which needs to be fixed separately).

@Sudha247
Member Author

The CI is happy, I'll merge this now and cut a new release soon.

@Sudha247 Sudha247 merged commit 843240b into main Sep 22, 2023
@Sudha247 Sudha247 deleted the improvements branch September 22, 2023 14:06
Sudha247 added a commit to Sudha247/opam-repository that referenced this pull request Sep 27, 2023
CHANGES:

* Custom events for json (tarides/runtime_events_tools#24, @Sudha247)
* Improvements to correct gc-stats (tarides/runtime_events_tools#19, @Sudha247)
* olly trace: ingest custom events starting from OCaml 5.1 (tarides/runtime_events_tools#17, @TheLortex)
nberth pushed a commit to nberth/opam-repository that referenced this pull request Jun 18, 2024