Utilise effect handlers #51

kayceesrk · 2021-10-12T12:17:59Z

This PR uses effect handlers to create tasks. The use of effect handlers will fix the issue described in ocaml-multicore/ocaml-multicore#670 (comment). The new test test_deadlock.ml precisely captures the scenario explained in the comment. This test program (after removal of T.run) will deadlock on domainslib.0.3.1 but runs to completion on this branch.

Unfortunately, this introduces a breaking change; the computations need to be enclosed in a run function due to the use of effect handlers. I am also yet to do performance benchmarking of this change.

This PR also makes domainslib work only with 4.12+domains (and OCaml 5.00); the stdlib EffectHandlers module is not made available on 4.12.0+domains+effects. In order to address this, the PR ocaml-multicore/ocaml-multicore#689 adds the EffectHandlers module to 4.12.0+domains+effects.

kayceesrk · 2021-10-13T04:36:47Z

CC @ctk21 @edwintorok @Sudha247.

kayceesrk · 2021-10-13T12:22:04Z

Here are performance results from turing (24 isolated x86 cores in a 28-core machine). The variant 4.12+domains uses the domainslib from this PR, and 4.12+domains+effects uses domainslib.0.3.0. The results show that there is negligible performance impact to using effect handlers in domainslib.

edwintorok

Thanks a lot for this, see comments inline.

lib/task.ml

kayceesrk · 2021-10-14T07:59:03Z

The commit d023dc2 depends on ocaml-multicore/ocaml-multicore#693.

With this feature, this program, which has an exception being thrown from await prints a backtrace that includes both the awaiting and the awaited task.

Raised at Stdlib.failwith in file "stdlib.ml", line 29, characters 17-33
Called from Backtrace.foo in file "test/backtrace.ml", line 6, characters 11-27 
Called from Backtrace.bar in file "test/backtrace.ml", line 14, characters 4-9
Called from Domainslib__Task.do_task in file "lib/task.ml", line 40, characters 23-29
Re-raised at Stdlib__effectHandlers.Deep.discontinue_with_backtrace.(fun) in file "effectHandlers.ml", line 41, characters 4-38
Called from Domainslib__Task.await in file "lib/task.ml", line 61, characters 6-44
Called from Backtrace.main.(fun) in file "test/backtrace.ml", line 23, characters 4-18
Called from Domainslib__Task.do_task in file "lib/task.ml", line 40, characters 23-29
Re-raised at Domainslib__Task.run.loop in file "lib/task.ml", line 101, characters 23-57
Called from Backtrace.main in file "test/backtrace.ml", line 21, characters 2-135
Called from Backtrace in file "test/backtrace.ml", line 28, characters 6-13

"test/backtrace.ml", line 23 is the await call.

kayceesrk · 2021-10-14T10:34:12Z

There were a couple of valid concerns raised by @ctk21 about the current implementation:

Work-first or help-first: Under work-first policy, a newly spawned async will be executed by a domain immediately, leaving its continuation to be stolen. Under help-first policy, a newly spawned async will be pushed to the work-stealing queue, and the domain continues executing the original task. Cilk goes for work-first policy, whereas this PR is help-first. It is easy to support both policies (by taking an optional parameter for the policy in async), but we should decide which policy is the one to go for by default. Previous studies indicate that no one policy is better for all the cases [1].
Task hopping between domains: When a task blocks on an await and is woken up, it is put back into the work-stealing queue. This means that the task may switch to a different domain on the same pool at await. This will make it difficult to program with domain-local state. Should an async task be allowed to be pinned to a domain?
Top-level run function: The need to enclose the computations in top-level run function is unwieldy, but looks necessary.

[1] https://www.cs.rice.edu/~yguo/pubs/PID824943.pdf

edwintorok · 2021-10-14T19:39:38Z

I assume that once a future version of OCaml gets an effect type-system we'd be able to determine at compile time when a handler is missing.

Meanwhile not sure whether there is a type-level solution that is not too invasive (don't want to force ending up with a monad or other solution that requires plumbing a parameter through everything). Perhaps a compromise would be to include a lightweight (let*) and (and*) monad for more easily converting from Lwt style code (I think that let* can be a simple unit -> 'a delay, and only and would actually need to launch things async), and then a way to 'run' that monad would be to call it with the function that installs the effect handler. But that is probably best done as a separate PR.

lib/task.ml

edwintorok · 2021-10-14T19:56:09Z

lib/task.ml

+type _ eff += Wait : 'a promise * task_chan -> unit eff
+
+let get_pool_data p =
+  match Atomic.get p with


I think there might be multiple semantics to pool teardown (might be useful to have a function that just waits/drains the active tasks without causing exceptions to be raised), but once you reach teardown it could just go through the queue of all active tasks and "cancel" them by setting their result to be an exception of cancelled (obviously can't cancel any that are actively running, the await might have some critical "finally" statements to clean up resources). But it might allow removing this Atomic.get if some kind of cancelation is implemented.

Agreed. It would be useful to convert this into an issue that we tackle in a later PR.

edwintorok · 2021-10-14T22:21:58Z

I've tested this on my original program that failed previously, and it works now.

(It is an implementation of find PATH -size +1c -type f, but using parallel for loops and openat/fstatat to walk the tree.
The Rust implementation in fd is quite fast too (although at times it is outperformed by the single threaded find on something like /usr due to find using openat/fstatat and fd using only opendir and lstat thus having to look up path components over and over again on deep trees).
My OCaml implementation uses openat/fstatat and it seems to beat both find and fd on walking my homedir with a warm cache (with about 13317085 non-empty files in it), looks like quite a good test for IO / syscalls intensive activity:

dune exec --profile=release ./testme.exe /home/edwin > /dev/null  199.19s user 52.84s system 2318% cpu 10.870 total
dune exec --profile=release ./testme.exe /home/edwin > /dev/null  193.58s user 52.49s system 2318% cpu 10.612 total
fd -HI -t f -S +1b -c never . /home/edwin > /dev/null  59.72s user 264.94s system 2182% cpu 14.876 total
find /home/edwin -size +1c -type f 2> /dev/null  4.02s user 19.68s system 99% cpu 23.846 total
13317085

Interestingly the OCaml implementation spends a lot more time in userspace and a lot less time in kernelspace (and less time overall).

With cold caches the benefits of parallelism become more obvious:

sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
dune exec --profile=release ./testme.exe /home/edwin > /dev/null  59.28s user 111.25s system 1104% cpu 15.436 total
sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
fd -HI -t f -S +1b -c never . /home/edwin > /dev/null  49.85s user 209.28s system 975% cpu 26.564 total
sync; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; sync
find /home/edwin -size +1c -type f 2> /dev/null  5.62s user 52.82s system 33% cpu 2:52.89 total

The speed of Lwt and Luv is rather disappointing though, with warm caches and with not even printing the filenames but just counting them I get something that uses 185% CPU but barely achieves a speedup over the sequential find (this is even if I implement Lwt.Unix.openat):

OCAMLRUNPARAM=b dune exec --profile=release ./testmelwt.exe /home/edwin  65.79s user 176.13s system 185% cpu 2:10.39 total

(I don't have a working Luv version at hand though).

Once I've cleaned up my code I'll post it online for comparison, I was trying to keep the multicore and Lwt/Luv versions quite similar in implementation. It looks like having the ability to execute OCaml code as soon as we get a result from the syscal on the same thread makes a huge difference (otherwise a lot of time is spent sending jobs between threads and doing syscalls to put threads to sleep and wake them again).

kayceesrk · 2021-10-15T04:32:26Z

@ctk21 wanted the results of the test/task_throughput.ml benchmark. Here are the results.

On master (which doesn't use effect handlers):

$ dune exec test/task_throughput.exe 8 32768 32768
n_iterations: 32768   n_units: 32768  n_domains: 8
Timings (ns): n=32768  mean=26277.7
 (       0,       32):      0
 [      32,       64):      0
 [      64,      128):      0
 [     128,      256):      0
 [     256,      512):      0
 [     512,     1024):      0
 [    1024,     2048):      0
 [    2048,     4096):      0
 [    4096,     8192):      0
 [    8192,    16384):      0
 [   16384,    32768):  31811
 [   32768,    65536):    931
 [   65536,   131072):     23
 [  131072,   262144):      2
 [  262144,   524288):      0
 [  524288,  1048576):      1
 [ 1048576,  2097152):      0
 [ 2097152,  4194304):      0
 [ 4194304,  8388608):      0
 [ 8388608,      Inf):      0

with GC stats:

allocated_words: 39466833
minor_words: 39464751
promoted_words: 20624
major_words: 22706
minor_collections: 239
major_collections: 6
forced_major_collections: 0
heap_words: 190448
top_heap_words: 210928
mean_space_overhead: 2528.876316

On this branch:

$ dune exec test/task_throughput.exe 8 32768 32768
dune exec test/task_throughput.exe 8 32768 32768
n_iterations: 32768   n_units: 32768  n_domains: 8
Timings (ns): n=32768  mean=32750.8
 (       0,       32):      0
 [      32,       64):      0
 [      64,      128):      0
 [     128,      256):      0
 [     256,      512):      0
 [     512,     1024):      0
 [    1024,     2048):      0
 [    2048,     4096):      0
 [    4096,     8192):      0
 [    8192,    16384):      0
 [   16384,    32768):  21733
 [   32768,    65536):  10877
 [   65536,   131072):    151
 [  131072,   262144):      5
 [  262144,   524288):      0
 [  524288,  1048576):      2
 [ 1048576,  2097152):      0
 [ 2097152,  4194304):      0
 [ 4194304,  8388608):      0
 [ 8388608,      Inf):      0

with GC stats:

allocated_words: 154131213
minor_words: 154129199
promoted_words: 79934
major_words: 81948
minor_collections: 779
major_collections: 11
forced_major_collections: 0
heap_words: 227312
top_heap_words: 231408
mean_space_overhead: 1517.998989

The effects version has lower latency than master. The effects version also tends to allocate ~4x more than the master version, almost all of it in the minor heap. It would be worthwhile going through the code to reduce allocations.

jberdine · 2021-10-15T10:33:59Z

2. Task hopping between domains: When a task blocks on an await and is woken up, it is put back into the work-stealing queue. This means that the task may switch to a different domain on the same pool at await. This will make it difficult to program with domain-local state. Should an async task be allowed to be pinned to a domain?

Is a valid way of thinking about this question that if a program needs "task-local" state, then:

with tasks that are not pinned to domains, the program needs to implement task-local state itself;
with tasks that can be pinned to domains, then domain-local state will also be task-local, and the program has an easy job?

I guess that supporting pinning is somewhat complex, and that it adds scheduling constraints that can reduce performance. On the other hand, if task-local state is needed, would you expect that it might be the most efficient implementation strategy?

avsm · 2021-11-01T15:56:44Z

@edwintorok wrote:

It looks like having the ability to execute OCaml code as soon as we get a result from the syscal on the same thread makes a huge difference (otherwise a lot of time is spent sending jobs between threads and doing syscalls to put threads to sleep and wake them again).

Does your reimplementation of find use the eio library? That will use io_uring on Linux and should have very little syscall activity indeed. I'd be interested in seeing that vs a direct syscall implementation.

edwintorok · 2021-11-01T16:17:50Z

On 1 November 2021 15:56:56 GMT, Anil Madhavapeddy ***@***.***> wrote:

***@***.*** wrote:

> It looks like having the ability to execute OCaml code as soon as we get a result from the syscal on the same thread makes a huge difference (otherwise a lot of time is spent sending jobs between threads and doing syscalls to put threads to sleep and wake them again). Does your reimplementation of find use the `eio` library? That will use io_uring on Linux and should have very little syscall activity indeed. I'd be interested in seeing that _vs_ a direct syscall implementation.

Not yet, but it looks like `io_uring` has evolved and can now do `openat2` and `statx` calls too. Would be interesting to patch `eio` to support those and then try an eio/domainslib hybrid that uses domainslib tasks only for readdir (which has no urging equivalent yet).

…

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

lib/task.mli

avsm · 2021-11-01T18:35:37Z

resolved the conflicts against trunk and deprecation warnings

talex5 · 2021-11-15T09:41:25Z

Not yet, but it looks like io_uring has evolved and can now do openat2 and statx calls too. Would be interesting to patch eio to support those

Note that ocaml-uring and eio already support openat2: https://github.com/ocaml-multicore/ocaml-uring/blob/main/lib/uring/uring.mli#L106

Patches adding statx are welcome too...

ctk21

I've taken a closer look at the code and it looks reasonable to me.

My open ended comments reported by KC from earlier still stand, but I suspect we might be better served by merging and trying it out.

Purely administrative, but when we tag and release the next version of domainslib, we should bump the major version given the nature of the breaking changes.

kayceesrk · 2021-11-25T06:01:15Z

Thanks for the review @ctk21!

@Sudha247 what are your thoughts about the proposal to bump major version. What are the downstream packages that need to be updated?

Sandmark
Lwt_domain (?)

Sudha247 · 2021-11-25T06:48:59Z

Agreed, we should bump the major version owing to the significance of this change.

You're right, Sandmark ~~and Lwt_domain~~ needs to be updated at the very least, we can do it immediately after the release. The docs need to be updated too!

kayceesrk · 2021-11-26T11:25:51Z

The docs need to be updated too!

The domainslib API docs are already updated. Which docs are you referring to @Sudha247?

Sudha247 · 2021-11-26T18:06:04Z

Ah, I meant https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml :)

Also, after testing this PR with Lwt_domain , it is fine as is and doesn't require any changes, i.e. it works the same with domainslib.0.3.2 and this branch. So, we'll need to update only Sandmark and the above tutorial. Both of which can be done after the new release. I think we are good to merge this one.

kayceesrk · 2021-11-27T10:12:18Z

Thanks. Merging now.

kayceesrk force-pushed the effect_handlers branch from 6b463c5 to f840535 Compare October 13, 2021 03:00

kayceesrk mentioned this pull request Oct 13, 2021

Add EffectHandlers module ocaml-multicore/ocaml-multicore#689

Merged

Utilise effect handlers

9348e59

kayceesrk force-pushed the effect_handlers branch from f840535 to 9348e59 Compare October 13, 2021 06:33

edwintorok reviewed Oct 13, 2021

View reviewed changes

lib/task.ml Outdated Show resolved Hide resolved

lib/task.ml Outdated Show resolved Hide resolved

lib/task.ml Outdated Show resolved Hide resolved

lib/task.ml Outdated Show resolved Hide resolved

lib/task.ml Outdated Show resolved Hide resolved

kayceesrk added 2 commits October 14, 2021 09:23

Prevent tasks blocking on await from hopping pools. Optimisations.

2755c84

Backtraces from await

d023dc2

edwintorok reviewed Oct 14, 2021

View reviewed changes

edwintorok approved these changes Oct 14, 2021

View reviewed changes

edwintorok mentioned this pull request Oct 14, 2021

Multi_channel: allow more than one instance per program with different configurations #50

Merged

kayceesrk added 2 commits October 15, 2021 09:32

Make await non-recursive

db22e17

Fix throughput test

b34a56b

kayceesrk mentioned this pull request Oct 20, 2021

Domainslib nesting issue ocaml-multicore/ocaml-multicore#670

Closed

avsm reviewed Nov 1, 2021

View reviewed changes

lib/task.mli Outdated Show resolved Hide resolved

avsm reviewed Nov 1, 2021

View reviewed changes

lib/task.mli Outdated Show resolved Hide resolved

Merge branch 'master' into effect_handlers

390c489

use Domain.cpu_relax instead of deprecated Domain.Sync.cpu_relax

6024887

avsm force-pushed the effect_handlers branch from aefa115 to 6024887 Compare November 1, 2021 18:44

Fix typo in task ocamldoc

0d7fd11

Fix clarity in Task ocamldoc

0c0f8fa

ctk21 reviewed Nov 15, 2021

View reviewed changes

kayceesrk merged commit df30722 into master Nov 27, 2021

kayceesrk mentioned this pull request Dec 22, 2021

Task.await deadlock (task finishes but await never returns) #47

Closed

jmid mentioned this pull request Dec 22, 2021

Getting segfault/undefined behavior using multicore with custom blocks ocaml-multicore/ocaml-multicore#810

Closed

This was referenced Dec 23, 2021

[new release] domainslib (0.4.0) ocaml/opam-repository#20298

Merged

Task library slowdown if the number of domains is greater than 8 #8

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Utilise effect handlers #51

Utilise effect handlers #51

kayceesrk commented Oct 12, 2021 •

edited

Loading

kayceesrk commented Oct 13, 2021

kayceesrk commented Oct 13, 2021

edwintorok left a comment

kayceesrk commented Oct 14, 2021 •

edited

Loading

kayceesrk commented Oct 14, 2021

edwintorok commented Oct 14, 2021

edwintorok Oct 14, 2021

kayceesrk Oct 15, 2021

edwintorok commented Oct 14, 2021 •

edited

Loading

kayceesrk commented Oct 15, 2021

jberdine commented Oct 15, 2021

avsm commented Nov 1, 2021

edwintorok commented Nov 1, 2021 via email

avsm commented Nov 1, 2021

talex5 commented Nov 15, 2021

ctk21 left a comment

kayceesrk commented Nov 25, 2021 •

edited

Loading

Sudha247 commented Nov 25, 2021 •

edited

Loading

kayceesrk commented Nov 26, 2021

Sudha247 commented Nov 26, 2021

kayceesrk commented Nov 27, 2021

Utilise effect handlers #51

Utilise effect handlers #51

Conversation

kayceesrk commented Oct 12, 2021 • edited Loading

kayceesrk commented Oct 13, 2021

kayceesrk commented Oct 13, 2021

edwintorok left a comment

Choose a reason for hiding this comment

kayceesrk commented Oct 14, 2021 • edited Loading

kayceesrk commented Oct 14, 2021

edwintorok commented Oct 14, 2021

edwintorok Oct 14, 2021

Choose a reason for hiding this comment

kayceesrk Oct 15, 2021

Choose a reason for hiding this comment

edwintorok commented Oct 14, 2021 • edited Loading

kayceesrk commented Oct 15, 2021

jberdine commented Oct 15, 2021

avsm commented Nov 1, 2021

edwintorok commented Nov 1, 2021 via email

avsm commented Nov 1, 2021

talex5 commented Nov 15, 2021

ctk21 left a comment

Choose a reason for hiding this comment

kayceesrk commented Nov 25, 2021 • edited Loading

Sudha247 commented Nov 25, 2021 • edited Loading

kayceesrk commented Nov 26, 2021

Sudha247 commented Nov 26, 2021

kayceesrk commented Nov 27, 2021

kayceesrk commented Oct 12, 2021 •

edited

Loading

kayceesrk commented Oct 14, 2021 •

edited

Loading

edwintorok commented Oct 14, 2021 •

edited

Loading

kayceesrk commented Nov 25, 2021 •

edited

Loading

Sudha247 commented Nov 25, 2021 •

edited

Loading