Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utilise effect handlers #51

Merged
merged 9 commits into from
Nov 27, 2021
Merged

Utilise effect handlers #51

merged 9 commits into from
Nov 27, 2021

Conversation

kayceesrk
Copy link
Contributor

@kayceesrk kayceesrk commented Oct 12, 2021

This PR uses effect handlers to create tasks. The use of effect handlers will fix the issue described in ocaml-multicore/ocaml-multicore#670 (comment). The new test test_deadlock.ml precisely captures the scenario explained in the comment. This test program (after removal of T.run) will deadlock on domainslib.0.3.1 but runs to completion on this branch.

Unfortunately, this introduces a breaking change; the computations need to be enclosed in a run function due to the use of effect handlers. I am also yet to do performance benchmarking of this change.

This PR also makes domainslib work only with 4.12+domains (and OCaml 5.00); the stdlib EffectHandlers module is not made available on 4.12.0+domains+effects. In order to address this, the PR ocaml-multicore/ocaml-multicore#689 adds the EffectHandlers module to 4.12.0+domains+effects.

@kayceesrk
Copy link
Contributor Author

CC @ctk21 @edwintorok @Sudha247.

@kayceesrk
Copy link
Contributor Author

Here are performance results from turing (24 isolated x86 cores in a 28-core machine). The variant 4.12+domains uses the domainslib from this PR, and 4.12+domains+effects uses domainslib.0.3.0. The results show that there is negligible performance impact to using effect handlers in domainslib.

image

Copy link
Contributor

@edwintorok edwintorok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this, see comments inline.

lib/task.ml Outdated Show resolved Hide resolved
lib/task.ml Outdated Show resolved Hide resolved
lib/task.ml Outdated Show resolved Hide resolved
lib/task.ml Outdated Show resolved Hide resolved
lib/task.ml Outdated Show resolved Hide resolved
@kayceesrk
Copy link
Contributor Author

kayceesrk commented Oct 14, 2021

The commit d023dc2 depends on ocaml-multicore/ocaml-multicore#693.

With this feature, this program, which has an exception being thrown from await prints a backtrace that includes both the awaiting and the awaited task.

Raised at Stdlib.failwith in file "stdlib.ml", line 29, characters 17-33
Called from Backtrace.foo in file "test/backtrace.ml", line 6, characters 11-27 
Called from Backtrace.bar in file "test/backtrace.ml", line 14, characters 4-9
Called from Domainslib__Task.do_task in file "lib/task.ml", line 40, characters 23-29
Re-raised at Stdlib__effectHandlers.Deep.discontinue_with_backtrace.(fun) in file "effectHandlers.ml", line 41, characters 4-38
Called from Domainslib__Task.await in file "lib/task.ml", line 61, characters 6-44
Called from Backtrace.main.(fun) in file "test/backtrace.ml", line 23, characters 4-18
Called from Domainslib__Task.do_task in file "lib/task.ml", line 40, characters 23-29
Re-raised at Domainslib__Task.run.loop in file "lib/task.ml", line 101, characters 23-57
Called from Backtrace.main in file "test/backtrace.ml", line 21, characters 2-135
Called from Backtrace in file "test/backtrace.ml", line 28, characters 6-13

"test/backtrace.ml", line 23 is the await call.

@kayceesrk
Copy link
Contributor Author

There were a couple of valid concerns raised by @ctk21 about the current implementation:

  1. Work-first or help-first: Under work-first policy, a newly spawned async will be executed by a domain immediately, leaving its continuation to be stolen. Under help-first policy, a newly spawned async will be pushed to the work-stealing queue, and the domain continues executing the original task. Cilk goes for work-first policy, whereas this PR is help-first. It is easy to support both policies (by taking an optional parameter for the policy in async), but we should decide which policy is the one to go for by default. Previous studies indicate that no one policy is better for all the cases [1].
  2. Task hopping between domains: When a task blocks on an await and is woken up, it is put back into the work-stealing queue. This means that the task may switch to a different domain on the same pool at await. This will make it difficult to program with domain-local state. Should an async task be allowed to be pinned to a domain?
  3. Top-level run function: The need to enclose the computations in top-level run function is unwieldy, but looks necessary.

[1] https://www.cs.rice.edu/~yguo/pubs/PID824943.pdf

@edwintorok
Copy link
Contributor

I assume that once a future version of OCaml gets an effect type-system we'd be able to determine at compile time when a handler is missing.

Meanwhile not sure whether there is a type-level solution that is not too invasive (don't want to force ending up with a monad or other solution that requires plumbing a parameter through everything). Perhaps a compromise would be to include a lightweight (let*) and (and*) monad for more easily converting from Lwt style code (I think that let* can be a simple unit -> 'a delay, and only and would actually need to launch things async), and then a way to 'run' that monad would be to call it with the function that installs the effect handler. But that is probably best done as a separate PR.

lib/task.ml Outdated Show resolved Hide resolved
type _ eff += Wait : 'a promise * task_chan -> unit eff

let get_pool_data p =
match Atomic.get p with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be multiple semantics to pool teardown (might be useful to have a function that just waits/drains the active tasks without causing exceptions to be raised), but once you reach teardown it could just go through the queue of all active tasks and "cancel" them by setting their result to be an exception of cancelled (obviously can't cancel any that are actively running, the await might have some critical "finally" statements to clean up resources). But it might allow removing this Atomic.get if some kind of cancelation is implemented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It would be useful to convert this into an issue that we tackle in a later PR.

@edwintorok
Copy link
Contributor

edwintorok commented Oct 14, 2021

I've tested this on my original program that failed previously, and it works now.

(It is an implementation of find PATH -size +1c -type f, but using parallel for loops and openat/fstatat to walk the tree.
The Rust implementation in fd is quite fast too (although at times it is outperformed by the single threaded find on something like /usr due to find using openat/fstatat and fd using only opendir and lstat thus having to look up path components over and over again on deep trees).
My OCaml implementation uses openat/fstatat and it seems to beat both find and fd on walking my homedir with a warm cache (with about 13317085 non-empty files in it), looks like quite a good test for IO / syscalls intensive activity:

dune exec --profile=release ./testme.exe /home/edwin > /dev/null  199.19s user 52.84s system 2318% cpu 10.870 total
dune exec --profile=release ./testme.exe /home/edwin > /dev/null  193.58s user 52.49s system 2318% cpu 10.612 total
fd -HI -t f -S +1b -c never . /home/edwin > /dev/null  59.72s user 264.94s system 2182% cpu 14.876 total
find /home/edwin -size +1c -type f 2> /dev/null  4.02s user 19.68s system 99% cpu 23.846 total
13317085

Interestingly the OCaml implementation spends a lot more time in userspace and a lot less time in kernelspace (and less time overall).

With cold caches the benefits of parallelism become more obvious:

sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
dune exec --profile=release ./testme.exe /home/edwin > /dev/null  59.28s user 111.25s system 1104% cpu 15.436 total
sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
fd -HI -t f -S +1b -c never . /home/edwin > /dev/null  49.85s user 209.28s system 975% cpu 26.564 total
sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
find /home/edwin -size +1c -type f 2> /dev/null  5.62s user 52.82s system 33% cpu 2:52.89 total

The speed of Lwt and Luv is rather disappointing though, with warm caches and with not even printing the filenames but just counting them I get something that uses 185% CPU but barely achieves a speedup over the sequential find (this is even if I implement Lwt.Unix.openat):

OCAMLRUNPARAM=b dune exec --profile=release ./testmelwt.exe /home/edwin  65.79s user 176.13s system 185% cpu 2:10.39 total

(I don't have a working Luv version at hand though).

Once I've cleaned up my code I'll post it online for comparison, I was trying to keep the multicore and Lwt/Luv versions quite similar in implementation. It looks like having the ability to execute OCaml code as soon as we get a result from the syscal on the same thread makes a huge difference (otherwise a lot of time is spent sending jobs between threads and doing syscalls to put threads to sleep and wake them again).

@kayceesrk
Copy link
Contributor Author

@ctk21 wanted the results of the test/task_throughput.ml benchmark. Here are the results.

On master (which doesn't use effect handlers):

$ dune exec test/task_throughput.exe 8 32768 32768
n_iterations: 32768   n_units: 32768  n_domains: 8
Timings (ns): n=32768  mean=26277.7
 (       0,       32):      0
 [      32,       64):      0
 [      64,      128):      0
 [     128,      256):      0
 [     256,      512):      0
 [     512,     1024):      0
 [    1024,     2048):      0
 [    2048,     4096):      0
 [    4096,     8192):      0
 [    8192,    16384):      0
 [   16384,    32768):  31811
 [   32768,    65536):    931
 [   65536,   131072):     23
 [  131072,   262144):      2
 [  262144,   524288):      0
 [  524288,  1048576):      1
 [ 1048576,  2097152):      0
 [ 2097152,  4194304):      0
 [ 4194304,  8388608):      0
 [ 8388608,      Inf):      0

with GC stats:

allocated_words: 39466833
minor_words: 39464751
promoted_words: 20624
major_words: 22706
minor_collections: 239
major_collections: 6
forced_major_collections: 0
heap_words: 190448
top_heap_words: 210928
mean_space_overhead: 2528.876316

On this branch:

$ dune exec test/task_throughput.exe 8 32768 32768
dune exec test/task_throughput.exe 8 32768 32768
n_iterations: 32768   n_units: 32768  n_domains: 8
Timings (ns): n=32768  mean=32750.8
 (       0,       32):      0
 [      32,       64):      0
 [      64,      128):      0
 [     128,      256):      0
 [     256,      512):      0
 [     512,     1024):      0
 [    1024,     2048):      0
 [    2048,     4096):      0
 [    4096,     8192):      0
 [    8192,    16384):      0
 [   16384,    32768):  21733
 [   32768,    65536):  10877
 [   65536,   131072):    151
 [  131072,   262144):      5
 [  262144,   524288):      0
 [  524288,  1048576):      2
 [ 1048576,  2097152):      0
 [ 2097152,  4194304):      0
 [ 4194304,  8388608):      0
 [ 8388608,      Inf):      0

with GC stats:

allocated_words: 154131213
minor_words: 154129199
promoted_words: 79934
major_words: 81948
minor_collections: 779
major_collections: 11
forced_major_collections: 0
heap_words: 227312
top_heap_words: 231408
mean_space_overhead: 1517.998989

The effects version has lower latency than master. The effects version also tends to allocate ~4x more than the master version, almost all of it in the minor heap. It would be worthwhile going through the code to reduce allocations.

@jberdine
Copy link

2. Task hopping between domains: When a task blocks on an await and is woken up, it is put back into the work-stealing queue. This means that the task may switch to a different domain on the same pool at await. This will make it difficult to program with domain-local state. Should an async task be allowed to be pinned to a domain?

Is a valid way of thinking about this question that if a program needs "task-local" state, then:

  1. with tasks that are not pinned to domains, the program needs to implement task-local state itself;
  2. with tasks that can be pinned to domains, then domain-local state will also be task-local, and the program has an easy job?

I guess that supporting pinning is somewhat complex, and that it adds scheduling constraints that can reduce performance. On the other hand, if task-local state is needed, would you expect that it might be the most efficient implementation strategy?

@avsm
Copy link
Contributor

avsm commented Nov 1, 2021

@edwintorok wrote:

It looks like having the ability to execute OCaml code as soon as we get a result from the syscal on the same thread makes a huge difference (otherwise a lot of time is spent sending jobs between threads and doing syscalls to put threads to sleep and wake them again).

Does your reimplementation of find use the eio library? That will use io_uring on Linux and should have very little syscall activity indeed. I'd be interested in seeing that vs a direct syscall implementation.

@edwintorok
Copy link
Contributor

edwintorok commented Nov 1, 2021 via email

lib/task.mli Outdated Show resolved Hide resolved
lib/task.mli Outdated Show resolved Hide resolved
@avsm
Copy link
Contributor

avsm commented Nov 1, 2021

resolved the conflicts against trunk and deprecation warnings

@talex5
Copy link

talex5 commented Nov 15, 2021

Not yet, but it looks like io_uring has evolved and can now do openat2 and statx calls too. Would be interesting to patch eio to support those

Note that ocaml-uring and eio already support openat2: https://github.com/ocaml-multicore/ocaml-uring/blob/main/lib/uring/uring.mli#L106

Patches adding statx are welcome too...

Copy link
Contributor

@ctk21 ctk21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a closer look at the code and it looks reasonable to me.

My open ended comments reported by KC from earlier still stand, but I suspect we might be better served by merging and trying it out.

Purely administrative, but when we tag and release the next version of domainslib, we should bump the major version given the nature of the breaking changes.

@kayceesrk
Copy link
Contributor Author

kayceesrk commented Nov 25, 2021

Thanks for the review @ctk21!

@Sudha247 what are your thoughts about the proposal to bump major version. What are the downstream packages that need to be updated?

  • Sandmark
  • Lwt_domain (?)

@Sudha247
Copy link
Contributor

Sudha247 commented Nov 25, 2021

Agreed, we should bump the major version owing to the significance of this change.

You're right, Sandmark and Lwt_domain needs to be updated at the very least, we can do it immediately after the release. The docs need to be updated too!

@kayceesrk
Copy link
Contributor Author

The docs need to be updated too!

The domainslib API docs are already updated. Which docs are you referring to @Sudha247?

@Sudha247
Copy link
Contributor

Ah, I meant https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml :)

Also, after testing this PR with Lwt_domain , it is fine as is and doesn't require any changes, i.e. it works the same with domainslib.0.3.2 and this branch. So, we'll need to update only Sandmark and the above tutorial. Both of which can be done after the new release. I think we are good to merge this one.

@kayceesrk
Copy link
Contributor Author

Thanks. Merging now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants