
Add batched I/O for CO OT #13

Merged: 1 commit merged into markkurossi:master on Jul 25, 2023
Conversation

@rozag (Contributor) commented on Jul 25, 2023

Backstory

I've been working on a server that acts as a garbler for several quite large circuits. Its performance was unexpectedly poor, and the profiler made it clear that most of the time is spent waiting on connection reads/writes. After digging some more I noticed that ot.CO performs a lot of very small flushes. In my case roughly 20 MiB of data is transferred over the TCP connection, which would be fine as a few huge batches, but ot.CO's Send and Receive were writing 40-72 bytes, flushing, and waiting for the "response" to arrive before sending the next part. The ot/co.go header mentions that the implementation is based on emp-toolkit, and emp-toolkit does separate the stages: write everything first, then flush, then wait for the response. CO instead runs a send-flush-read loop, which makes p2p.Conn flush these small batches of data.
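
For illustration, here is a rough sketch of that pattern (hypothetical names, not the actual ot.CO code; conn, SendData, and ReceiveData are made up):

```go
// Hypothetical sketch of the old pattern: every element writes a tiny
// message, flushes it, and blocks on the peer's answer before the next
// iteration, so each element pays a full network round trip.
type conn interface {
	SendData(p []byte) error
	Flush() error
	ReceiveData() ([]byte, error)
}

func transferOneByOne(c conn, msgs [][]byte) ([][]byte, error) {
	resps := make([][]byte, 0, len(msgs))
	for _, m := range msgs {
		if err := c.SendData(m); err != nil { // ~40-72 bytes per message
			return nil, err
		}
		if err := c.Flush(); err != nil { // tiny flush on every iteration
			return nil, err
		}
		r, err := c.ReceiveData() // block until the peer answers
		if err != nil {
			return nil, err
		}
		resps = append(resps, r)
	}
	return resps, nil
}
```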

Fix

I propose using the same stage-separation approach: basically, we write everything first and flush only once after everything is written. The downside is that we loop over the data twice, but the work is still O(n). I've already tested this fix with my server; observed behavior: 4-5 minutes of execution before, 12-14 seconds after, which is a huge difference.
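
The same hypothetical sketch with the stage separation applied: write everything, flush once, then read everything (again, made-up names, not the actual diff):

```go
func transferBatched(c conn, msgs [][]byte) ([][]byte, error) {
	// Stage 1: write every message; the data just accumulates in the
	// connection's write buffer.
	for _, m := range msgs {
		if err := c.SendData(m); err != nil {
			return nil, err
		}
	}
	// A single flush for the whole batch instead of one per message.
	if err := c.Flush(); err != nil {
		return nil, err
	}
	// Stage 2: read all responses in order.
	resps := make([][]byte, 0, len(msgs))
	for range msgs {
		r, err := c.ReceiveData()
		if err != nil {
			return nil, err
		}
		resps = append(resps, r)
	}
	return resps, nil
}
```

Both peers have to agree on this write-then-read ordering (which is what emp-toolkit already does), otherwise they could deadlock on full buffers.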

Tests

$ cd ot && go test -v --count 1
Output
=== RUN   TestCO
--- PASS: TestCO (0.00s)
=== RUN   TestLabel
--- PASS: TestLabel (0.00s)
=== RUN   TestOTCO
--- PASS: TestOTCO (0.01s)
=== RUN   TestOTRSA
--- PASS: TestOTRSA (0.95s)
=== RUN   TestPipe
--- PASS: TestPipe (0.00s)
PASS
ok      github.com/markkurossi/mpc/ot   1.328s

Benchmarks

$ cd ot && go test --bench .

This change doesn't seem to influence the current benchmarks much:

Before
goos: darwin
goarch: amd64
pkg: github.com/markkurossi/mpc/ot
cpu: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
BenchmarkCO-8                       3817            277631 ns/op
BenchmarkLabelMul2-8            542378509                2.217 ns/op
BenchmarkLabelMul4-8            522979458                2.205 ns/op
BenchmarkLabelXor-8             1000000000               0.3096 ns/op
BenchmarkOTCO_1-8                   4683            238019 ns/op
BenchmarkOTCO_8-8                   1036           1133589 ns/op
BenchmarkOTCO_16-8                   537           2197241 ns/op
BenchmarkOTCO_32-8                   278           4344838 ns/op
BenchmarkOTCO_64-8                   139           8550312 ns/op
BenchmarkOTRSA_2048_1-8               97          12104427 ns/op
BenchmarkOTRSA_2048_8-8               12          97815929 ns/op
BenchmarkOTRSA_2048_64-8               2         779272778 ns/op
BenchmarkRSA512-8                   3300            349835 ns/op
BenchmarkRSA1024-8                   619           1904981 ns/op
BenchmarkRSA2048-8                    96          12051810 ns/op
PASS
ok      github.com/markkurossi/mpc/ot   23.257s
After
goos: darwin
goarch: amd64
pkg: github.com/markkurossi/mpc/ot
cpu: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
BenchmarkCO-8                       4022            278031 ns/op
BenchmarkLabelMul2-8            541403588                2.209 ns/op
BenchmarkLabelMul4-8            547191450                2.188 ns/op
BenchmarkLabelXor-8             1000000000               0.3042 ns/op
BenchmarkOTCO_1-8                   4851            243142 ns/op
BenchmarkOTCO_8-8                    739           1596737 ns/op
BenchmarkOTCO_16-8                   370           3189234 ns/op
BenchmarkOTCO_32-8                   182           6561498 ns/op
BenchmarkOTCO_64-8                    93          12940329 ns/op
BenchmarkOTRSA_2048_1-8               98          12186653 ns/op
BenchmarkOTRSA_2048_8-8               12          96631495 ns/op
BenchmarkOTRSA_2048_64-8               2         772423033 ns/op
BenchmarkRSA512-8                   3055            353650 ns/op
BenchmarkRSA1024-8                   634           1876732 ns/op
BenchmarkRSA2048-8                    99          12032335 ns/op
PASS
ok      github.com/markkurossi/mpc/ot   23.513s

At first I wanted to extend ot.Pipe with something like NewLatencyPipe(time.Duration) and to time.Sleep on e.g. each Write, to better represent my case of flushing often over a real TCP connection. The problem is that Pipe doesn't really use Flush; it simply reads/writes on demand. A p2p.Conn over something pipe-like with this latency emulation would probably be a more representative benchmark, but io.Pipe won't do here because it is unbuffered, so we'd deadlock. djherbis/nio has a nice buffered pipe that would probably work, but adding a dependency just for a benchmark seems like overkill. So I decided to leave the benchmarks unchanged for now, but I'd love to hear your opinion on this. If you find the proposed approach reasonable, I can do it.
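
For reference, the kind of wrapper I had in mind (a rough sketch only, not part of this PR; latencyConn and its fields are made-up names):

```go
import (
	"io"
	"time"
)

// latencyConn is a hypothetical io.ReadWriter wrapper that sleeps on every
// Write to emulate round-trip latency over a real TCP connection. A
// benchmark built on top of it would make the flush-per-message pattern
// visibly slower than the batched one.
type latencyConn struct {
	rw      io.ReadWriter
	latency time.Duration
}

func (c *latencyConn) Write(p []byte) (int, error) {
	time.Sleep(c.latency) // simulated per-write network delay
	return c.rw.Write(p)
}

func (c *latencyConn) Read(p []byte) (int, error) {
	return c.rw.Read(p)
}
```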

@markkurossi merged commit 68e0663 into markkurossi:master on Jul 25, 2023
2 checks passed
@markkurossi (Owner) commented:

Thanks for the pull request, it makes sense. My initial idea was to optimize memory allocations by doing each bit OT separately, but of course this write-sync-read loop kills performance.

No need to simulate the latency on the benchmark, I would rather avoid the extra dependency.

@rozag deleted the ot-batching branch on July 25, 2023 at 13:30