
Gc subcommand #10

Merged
merged 15 commits into tarides:main on Feb 24, 2023
Conversation

Sudha247 (Member)

Tracks the total time spent in major and minor GC, partially addressing #8. I've tested it on single-domain programs, comparing the time reported here against the trace file. Leaving it as a draft, as this needs more testing on multi-domain programs.
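The accumulation can be sketched with OCaml 5's `Runtime_events` API. This is a minimal illustration, not olly's actual code: the phase set tracked here (`EV_MAJOR` and `EV_MINOR`) and the helper names are mine, and the real subcommand handles more phases and nested events.

```ocaml
(* Sketch: accumulate per-domain GC time from runtime events.
   Assumes OCaml 5's Runtime_events; the tracked phases are illustrative. *)
let gc_ns : (int, int64) Hashtbl.t = Hashtbl.create 8   (* domain -> ns in GC *)
let open_ts : (int, int64) Hashtbl.t = Hashtbl.create 8 (* domain -> begin ts *)

let is_gc_phase = function
  | Runtime_events.EV_MAJOR | Runtime_events.EV_MINOR -> true
  | _ -> false

let runtime_begin dom ts phase =
  if is_gc_phase phase then
    Hashtbl.replace open_ts dom (Runtime_events.Timestamp.to_int64 ts)

let runtime_end dom ts phase =
  if is_gc_phase phase then
    match Hashtbl.find_opt open_ts dom with
    | None -> ()
    | Some t0 ->
        let dt = Int64.sub (Runtime_events.Timestamp.to_int64 ts) t0 in
        let total = Option.value (Hashtbl.find_opt gc_ns dom) ~default:0L in
        Hashtbl.replace gc_ns dom (Int64.add total dt);
        Hashtbl.remove open_ts dom

(* olly registers handlers like these via Runtime_events.Callbacks.create
   and polls a cursor on the child's ring buffer until the child exits. *)
let _callbacks = Runtime_events.Callbacks.create ~runtime_begin ~runtime_end ()
```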

@Sudha247 (Member, Author)

Results on single-domain programs:

$ /usr/bin/time _build/default/benchmarks/multicore-numerical/binarytrees5_multicore.exe 1 22
20.79user 0.25system 0:21.09elapsed 99%CPU (0avgtext+0avgdata 505476maxresident)k
0inputs+0outputs (0major+125785minor)pagefaults 0swaps
$ olly gc "_build/default/benchmarks/multicore-numerical/binarytrees5_multicore.exe 1 22"
Time in GC: 13.141230

Corresponding trace:
[trace screenshot]

@kayceesrk (Collaborator) commented Feb 18, 2023

Thanks for the contribution.

A few questions and comments:

  • The sum of the major + stw_leader wall durations is 12.1 seconds. Time in GC is reported to be 13.1. Where does the discrepancy come from?
  • "Time in GC: 13.141230" please add units. Also, the absolute time doesn't make sense without the total time. Running the program twice will lead to differences, which makes the number hard to interpret. My suggestion would be to look at how GHC and MLton (and other languages) report GC stats. They usually report them as "total time: xxx s, gc time: xxx s (xxx % of total time)".
  • One thing we can do for multi-threaded programs is to report the GC time for each of the domains separately (perhaps in a follow up PR). This will tell us whether the GC work is unbalanced.
  • I think it is useful to reconsider whether a new subcommand needs to be introduced. I feel we can include all the GC stats under a single subcommand, gc-stats. Concretely, let's do away with the latency subcommand and fold it into gc-stats along with the time in GC. From the user's perspective, fewer subcommands is better UX.
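The GHC/MLton-style report suggested above reduces to a small formatting helper. A sketch, with helper names that are mine, assuming wall and GC times in seconds are already computed:

```ocaml
(* Sketch of a GHC/MLton-style GC report: absolute times plus the
   GC share of total time. [wall_s] and [gc_s] are in seconds. *)
let gc_pct ~wall_s ~gc_s = 100. *. gc_s /. wall_s

let report ~wall_s ~gc_s =
  Printf.printf "total time: %.2f s, gc time: %.2f s (%.1f%% of total time)\n"
    wall_s gc_s (gc_pct ~wall_s ~gc_s)

let () = report ~wall_s:20.45 ~gc_s:13.00
(* prints: total time: 20.45 s, gc time: 13.00 s (63.6% of total time) *)
```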

@Sudha247 (Member, Author)

Thanks a lot for the comments @kayceesrk!

The sum of the major + stw_leader wall durations is 12.1 seconds. Time in GC is reported to be 13.1. Where does the discrepancy come from?

You're right, I was wondering that myself. The results are from two different runs; I think that's the source of the discrepancy. I'll double-check this.

"Time in GC: 13.141230" please add units. Also, the absolute time doesn't make sense without the total time. Running the program twice will lead to differences, which makes the number hard to interpret.

Absolutely! I've merged the GC time reporting into the latency command and renamed it gc-stats. It makes sense to have all of this in a single command, which also addresses your fourth point.

One thing we can do for multi-threaded programs is to report the GC time for each of the domains separately (perhaps in a follow up PR). This will tell us whether the GC work is unbalanced.

Sounds like a good idea. I'll try to do this in a follow-up PR.


I've added a sequential version of the binarytrees benchmark for testing. The dune rule for running olly itself is commented out for now, as it throws a Unix.ENOENT exception. It's possibly due to a path mismatch; I'll try to figure this out.

@sadiqj could you enable ocaml-ci for the repo please?

@Sudha247 (Member, Author)

The results look like:

$ _build/default/bin/olly.exe gc-stats _build/default/test/binarytrees.exe 
stretch tree of depth 23         check: 16777215
4194304  trees of depth 4        check: 130023424
1048576  trees of depth 6        check: 133169152
262144   trees of depth 8        check: 133955584
65536    trees of depth 10       check: 134152192
16384    trees of depth 12       check: 134201344
4096     trees of depth 14       check: 134213632
1024     trees of depth 16       check: 134216704
256      trees of depth 18       check: 134217472
64       trees of depth 20       check: 134217664
16       trees of depth 22       check: 134217712
long lived tree of depth 22      check: 8388607

Execution times:
#[Wall-time (s):        20.45,   GC-time (s):   13.00]

GC latency profile:
#[Mean (ms):    0.84,    Stddev (ms):   1.33]
#[Min (ms):     0.00,    max (ms):      16.12]

Percentile       Latency (ms)
25.0000          0.00
50.0000          0.09
60.0000          0.45
70.0000          0.97
75.0000          1.27
80.0000          1.87
85.0000          2.01
90.0000          2.35
95.0000          3.31
96.0000          4.93
97.0000          5.39
98.0000          5.55
99.0000          5.69
99.9000          6.50
99.9900          13.21
99.9990          16.12
99.9999          16.12
100.0000         16.12
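A percentile table like the one above can be produced from the sorted latency samples with a nearest-rank lookup. A sketch (olly's exact interpolation may differ; the function name is mine):

```ocaml
(* Nearest-rank percentile over sorted samples in ms: for percentile p,
   take the ceil(p/100 * n)-th sample (1-based), clamped to the array. *)
let percentile (sorted : float array) (p : float) =
  let n = Array.length sorted in
  let rank = int_of_float (ceil (p /. 100. *. float_of_int n)) in
  sorted.(max 0 (min (n - 1) (rank - 1)))
```

For example, with samples `[| 0.0; 0.09; 2.35; 16.12 |]`, `percentile` at 50.0 returns 0.09 and at 100.0 returns 16.12.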

@kayceesrk (Collaborator)

Minor comment about formatting the output. Rather than

Execution times:
#[Wall-time (s):        20.45,   GC-time (s):   13.00]

It may be useful to do:

Execution times:
  Wall time (s):                20.45 
  GC time (s):                  13.00
  GC overhead (% of wall time): xx.xx

@kayceesrk (Collaborator)

Is the PR ready for review? :-)

@Sudha247 Sudha247 marked this pull request as ready for review February 22, 2023 06:23
@Sudha247 (Member, Author)

I wanted to test on multi-domain programs before marking the PR ready for review. I have done that, and I think it's ready for review. :)

A couple of questions regarding multi-domain programs:

  • Is interrupt_remote a more accurate measure of GC time in multi-domain programs than stw_handler? I presume stw_handler is where the actual work is done, but a thread at interrupt_remote is still unable to run mutator code.
  • Calculating the GC percentage against wall time may not make sense for multi-domain programs, as the total GC time is summed across all domains. Should we display CPU time as well and calculate the GC time percentage against that?
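One way to frame the second question: if per-domain GC times are summed, the natural denominator is CPU time rather than wall time. A sketch, assuming CPU time is approximated as wall time times the number of domains (which overestimates when domains sit idle; the function name is mine):

```ocaml
(* GC overhead against approximate CPU time for multi-domain programs.
   [gc_total_s] is GC time summed across all domains; CPU time is
   approximated as wall time x number of domains. *)
let gc_overhead_pct ~wall_s ~num_domains ~gc_total_s =
  let cpu_s = wall_s *. float_of_int num_domains in
  100. *. gc_total_s /. cpu_s
```

With this approximation, 8 s of summed GC time over a 10 s run on 4 domains is a 20% overhead, rather than the misleading 80% a wall-time denominator would suggest.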

@kayceesrk (Collaborator)

Thanks for the update. Can you please take into account the conversation that we have had on a different channel?

The latency calculation explicitly only tracks the topmost layers of the nested events. The relevant GC event for non-leader stw domains is
@Sudha247 (Member, Author)

Thanks @kayceesrk, I've addressed the comments from our offline conversation. I've also updated the README and added a test that runs gc-stats on menhir.

Comparing constructors is faster. Also move the GC phase check outside Hashtable for performance
@kayceesrk kayceesrk merged commit 488ebec into tarides:main Feb 24, 2023
@kayceesrk (Collaborator)

Thanks for the work!

Sudha247 added a commit to Sudha247/opam-repository that referenced this pull request May 31, 2023
CHANGES:

* Fix dependencies (tarides/runtime_events_tools#14, @Sudha247)
* Improve JSON output produced by olly gc-stats (tarides/runtime_events_tools#13, @punchagan)
* Mention Fuchsia format in the README (tarides/runtime_events_tools#11, @Sudha247)
* Gc subcommand (tarides/runtime_events_tools#10, @Sudha247)
* Add Fuchsia Trace Format output to olly (tarides/runtime_events_tools#6, @tomjridge)
* Added --output option to redirect olly printing (tarides/runtime_events_tools#5, @ElectreAAS)
* Added json printing option (tarides/runtime_events_tools#4, @ElectreAAS)