slow anonymous downloads: Crypto CPU bottleneck #1882

synctext · 2016-01-20T16:28:46Z

The CPU seams to be the reason for slow <1 MByte/sec anonymous downloads.

possible problem
Running crypto on twisted thread blocks all other Tribler activity. Unclear if we needs 256bit GCM mode. Anything that checks a signature, decrypt a message, etc. needs to be traced.

possible long-term solution
Separate thread for the tunnel community or even an isolated thread for each community.
Low hanging fruit: parallel execution of relaying, make it multi-threaded. Real threads: Twisted reactor versus import multiprocessing...

goal

benchmark raw openssl C code GCM MBps.
Create a minimal benchmark comparing current situation in Tribler with alternatives. Not re-using Tribler code, but a performance test processing 10.000 UDP

Only generating packets and decrypting one layer, then another layer. Current approach.
Doing the same on a threadpool within Twisted.
Separate Python OS process, communicate using a ringbuffer. Twisted has build-in support for this, treating it like just another thread.

EDIT: use 10k UDP packets through exit node as benchmark.

whirm · 2016-01-20T17:30:14Z

@synctext I don't think this is related to 6.6

synctext · 2016-04-19T15:56:06Z

Download performance insight

Within the roadmap for high-performance privacy it is important to identify bottlenecks. We do not know if the CPU, IO, database IO, buffering, event handling, queue waiting, threading, or networking is the bottleneck.

The following experiment determines the crypto performance of the HiddenTunnelCommunity at the downloading side.

Step-by-step we build and measure the raw networking throughput expressed in MByte/second in various setups. First is the minimal Dispersy NULL receiver, dropping packets immediately coming from several senders. Next is receiving the packet on the network thread and creating a Twisted event to handle this packet; again by simply dropping it. Third is using the on-data approach of the hidden community. Fourth experiment is benchmarking a single layer of Tor decryption at the receiver. Finally, the full tunnel community.

synctext · 2016-05-18T14:11:34Z

Connected to: #2137, #2215 #2216, #2217

Note: 4th item for thesis: torrent checker dramatic improvements/repair.

synctext · 2016-05-26T09:35:44Z

Approach: our exit node helpers are maxed out in CPU. It will be extremely valuable to profile them, identify hotspots, test improvements, and benchmark various changes in the field.
This could move our donated network capacity from 9 TByte/day to perhaps even 15TByte/day.

lfdversluis · 2016-05-26T09:37:20Z

I would love to profile them yes. I already have a hunch where bottlenecks are, but it would be great having a profiler output data that can be visualized by e.g. Vallgrind to verify and confirm the bottlenecks.

synctext · 2016-06-23T18:31:26Z

Anonymous downloads are quite slow. As of V6.6 pre-release version we have with a solid i7 CPU core at 125% only 330 KByte/sec. Confirms @whirm his view. Both the tunnel helpers and anon downloader are CPU constrained.

qstokkink · 2016-06-24T06:47:56Z

I have made some measurements for the endpoint.py throughput per logging level:
https://gist.github.com/qstokkink/57203ced56dccd95287193d43e7e6a34

Long story short: choosing the wrong logging level can cost you around 400 MB/s throughput.

whirm · 2016-06-24T07:52:57Z

@qstokkink hah. Just yesterday I was talking about removing unnecessary log calls when the big cleanup happens with @devos50.

qstokkink · 2016-08-25T09:18:57Z

Ok, I finally have a design I am happy with and which will provide a significant speed increase.

The main issues I faced were the following:

Data passing through Dispersy clogs up Dispersy and thus the main process.
As per (1), simply offloading received packets to subprocesses will only provide some benefit in encryption/decryption and add overhead for forwarding these packets.
Forwarding peers' circuits to child processes (ergo the set-up happens in the main process), with their own assigned port, would require a lot of extra logic in the Tunnel Community.

So the shocking design I came up with (working with Dispersy, instead of against it, for a change), is as follows:

The main process does not join the Dispersy community.
Instead, all child processes (subprocesses) join the community, using their own Dispersy on their own process.

This means that the main process controls the circuits (the amount, their statistics and stuff) but no longer needs to be the man in the middle for data passing through. In turn, this means that the main process' Dispersy is freed up and therefore becomes a lot faster. An added benefit of this approach, is that it is a lot easier to implement than the options described in (2) and (3) too, leaving the TunnelCommunity almost completely the same.

These are the reasons for why I believe this to be the superior design, instead of the one of (2) as discussed during the last group session/meeting. Feel free to provide comments or critiques.

synctext · 2016-08-25T14:42:39Z

wow, wait, what? @qstokkink So one Dispersy instance per community. So we fork a new Python process for each community, with their own IPv4 listen socket, own database, and walker.

Is the process doing the setup on a different listen port then the Tunnel community itself? How does this work with UDP puncturing. Are we still puncturing the correct listen socket?

Sounds quite interesting. I would suggestion doing quick prototyping and get the Tunnel community in it's own Dispersy process.

qstokkink · 2016-08-25T20:09:20Z

@synctext Correct: one TunnelCommunity for one Dispersy with one unique listen port for one subprocess/forked process. Note that this will also be the only community loaded for a subprocess.

This should work perfectly fine with the UDP puncturing.

Because all of the TunnelCommunity messages are sent directly through the endpoint with NoAuthentication() the database of each subprocess is hardly used (only for Dispersy internals). In spite of this, there is definitely a case to be made for having a shared set of walker candidates and a shared database between subprocesses, as you suggested.
For the first version/prototype of this I will not implement this however, the result should already be very dramatic without these optimizations.

devos50 · 2016-08-25T20:15:04Z

@qstokkink sounds like an interesting design. I'm curious to know how this works out. I'm a huge supporter of splitting Tribler in various processes, however, we should be careful that we are not overcomplicating communication between processes since I think it's important that (external) developers can easily modify the communication protocol used between the Dispersy processes.

Another advantage of splitting it up in processes is separation of concerns: developers can focus on a single part of the system (and in the future, even implement a small part of the system in another language).

With this design, we can utilise all these additional cores in the CPU 👍

qstokkink · 2016-08-26T04:53:46Z

@devos50 If anything, this should even simplify communications. The three messages being transferred are the following:

Main process to subprocess: create(<>)
Subprocess to main process: stats(<>)
Main process to subprocess: destroy()

That's it.

EDIT: Sorry, there is a fourth: the notifications.

synctext · 2016-08-26T05:01:23Z

morning, sooo....
When selecting hops for a tunnel, we need a UDP punctured connection to it. Thus the Tunnel community needs to do all Create messages from its own external visible external IP:port.
correct?

qstokkink · 2016-08-26T05:54:33Z

@synctext Correct, each subprocess will need to be punctured separately. If this is really a problem, the design can also use a single port: some very nasty code in endpoint.py already takes care of this.

Do note that using the single port will already hit the performance pretty hard.

synctext · 2016-08-26T11:41:53Z

soooo v2...
The Tunnel community has a certain UDP punctured peers. It can tunnel through them. The main Dispersy then needs to either take that into account when selecting the hops or let the Tunnel community do everything autonomously.

qstokkink · 2016-10-30T10:59:52Z

Finished the comments/style corrections and sanity check. Almost there.. I might even have the PR done by tonight. The TODO list has become quite short.

devos50 · 2016-10-30T14:38:47Z

@qstokkink looking forward to it!

Note that you don't have to squash everything into a single commit. Please make a logical units of works that make sense and make sure your commit history is clean (consists of distinct changes so no fixes on fixes).

qstokkink · 2016-10-30T20:44:47Z

Woops, not happening tonight, I accidentally merged in some devel changes.. 😢 devos50/tribler@single_config...qstokkink:tunnelprocess_rc : I'll have to fix it up tomorrow.

qstokkink · 2016-10-31T08:24:07Z

Ok, I know this needs no further hyping up, but this is pretty cool:

EDIT: Oh and I almost forgot to mention it: but this is casually running 18 circuits.

synctext · 2016-10-31T10:58:35Z

Repeating: #2106 (comment)
Thesis material:

Understanding performance and emergent behavior using Visual Dispersy (correctness)
Improving performance and community coding in Dispersy using Protocol buffers (general performance)
Exploiting multi-core architectures for Dispersy speedups (multi-core performance)

Together:
Multi-core architecture for anonymous Internet streaming

First priority: PooledTunnelCommunity stable 👏

synctext · 2016-10-31T11:12:25Z

Further future (by Yawning Angel): #1066 (comment)

synctext · 2016-11-01T16:37:49Z

Related work fro thesis from MIT Riffle: An Efficient Communication System With Strong Anonymity, uses central servers, lacks incentive mechanism.

qstokkink · 2016-12-15T13:58:01Z

The first clinical Gumby experiment has finished, the new 1-hop max download speed with the pooled implementation seems to be around 18 MB/s for a consumer quad-core.

devos50 · 2016-12-15T14:11:48Z

@qstokkink nice results, so you are using hidden seeding and not plain seeding?

qstokkink · 2016-12-15T14:45:57Z

@devos50 This is plain seeding, the y-axis is me being too lazy to edit the label.

I also ran this on the 48-core machine (ergo 48 processes per peer) with the same amount of peers (8) and the results are pretty interesting (ignore the 2-hops, which fell back to non-anonymous downloading):

In this case circuits actually seem to fight over data transmission (due to the round-robin implementation), causing the whole thing to slow down. Also note the high base-cost in download speed.

EDIT: By the way, is bbq still O.K.? The above experiment had 1152 circuits and 384 processes running concurrently on the same machine.

synctext · 2016-12-15T19:20:47Z

18 MByte/sec. Our users are going to love that. Real strange scaling to 48cores. great thesis material.

qstokkink · 2016-12-16T07:36:16Z

@synctext Well from the point of the 2 seeders they have to serve 6 times as many downloads/leechers. Because of the download mechanism's overhead, it is better for a seeder to serve one leecher at high speed than multiple leechers with lower speed.

This problem should disappear if the amount of seeders scales up with the amount of deployed proxies.

qstokkink · 2016-12-29T14:21:43Z

Alright, so, I just managed to lock up a 48-core Linux machine with my Gumby experiment.

We might need to be careful about allowing people to manage the amount of processes they spawn.

synctext · 2016-12-29T15:03:14Z

:-) How do you lock a 64GByte, 48-core machine? What resource ran out?
When Tribler bsc and msc students try to mess around with process spawning, it's OK if they get bitten occasionally. Guess we don't want expose this without "seatbelts" within the Tribler Setting menu.

qstokkink · 2016-12-29T17:15:21Z

@synctext I have no way of knowing what resource ran out (probably either CPU or sockets ran out). Since it hasn't come back online yet, I can only assume it was the CPU and the building has burned down.

And, yes we might have to put a warning symbol above certain settings, or along a slider.
😃💻
⬇️
💀🔥

devos50 · 2016-12-29T21:30:49Z

@synctext @qstokkink I will probably go to EWI tomorrow to reboot the thing :)

devos50 · 2016-12-30T10:53:37Z

@qstokkink the building did not burn down and bbq is up again :) (took me some effort to get access since my card was not working correctly...)

qstokkink · 2016-12-30T11:46:31Z

@devos50 Thanks! I'll run a less flamboyant experiment from now on (which I know it can handle).

synctext · 2017-01-05T13:46:03Z

latest thesis results: MSc_Thesis_v2.pdf

synctext · 2017-02-01T14:03:06Z

Cardinal & Closing graph of thesis: fast anon download.

or extra chapter with multi-core Javascript + fancy homomorphic crypto math!

When using the 48-core BBQ and entire DAS5 together it should be possible to show nice anon download speed graphs. Goal is to have performance towards 48x on our 48-core download machine. DAS5 then acts as a dedicated seeding and relaying cluster. Showstopper (as always) : Gumby

qstokkink · 2017-02-03T13:12:39Z

After some additional runs and behavior analysis, it seems like organising the DAS5-bbq experiment will require some fundamental code changes in the pooled tunnel code. Therefore, I have decided to definitively pull the plug on that and focus on the fancy crypto chapter instead.

synctext · 2017-02-03T21:50:49Z

in science formulas get more respect then running code..

synctext · 2017-02-14T13:02:39Z

MSc_Thesis_v4.pdf
Update: New chapter with solid formulas:

1.2 4th research questions, can we use Homomorphic cryptography to radically improve performance and/or security
chapter 2 after intro stuff "in the upcoming chapters we focus on enhancing the architecture, final chapter we replace crypto"
Fig 7.3: more datapoints for straight line evidence
Chapter 7: more mathematical depth

qstokkink · 2017-02-18T11:12:36Z

(First) Release candidate:
MSc_Thesis_v5.pdf

synctext · 2017-02-20T12:56:10Z

Good story flow!

Define all symbols of equation 5.2.1 and beyond, for instance, E⊕(m1), H(M | m1, m2, ..., mk−1), .
Section 5.4.1, clarify. Start point and ending of circuits is concealed.
"presented the underappreciated technique of message serialization". We show a 80% CPU reduction.
please "go overboard" with formulas and repeat prior solid definition work!
Figure 7.3: Download speeds for differing amounts of processes, KBytes/sec, Kbits/sec
"up to 18 Mbps in this paper", observed once under favorable conditions.
remove modesty: 1st line of Conclusions. This thesis presents a solution to the anonymous streaming problem.
remove paragraph: Finally, this chapter will end by discussing two limitations
differnt
combine "Conclusions and Future Work" with compact, to the point writing.

qstokkink · 2017-02-21T13:37:54Z

Fixed in second release candidate:
MSc_Thesis_v6.pdf

P.S. Apparently differnt passes the spelling checker 😕

synctext · 2017-02-21T16:35:44Z

Good, 6 pages with the math fundamentals.

synctext · 2018-02-26T13:32:26Z

This 2016 ticket did not yet focus on the latency and uTP protocol. #2620 is dedicated to this. This 2016 ticket documents important ideas from the multi-core and crypto CPU load. Note, we still have stuff open. closing.

synctext added the type: MSc Thesis Work label Jan 20, 2016

synctext assigned lfdversluis Jan 20, 2016

synctext added this to the V6.6 milestone Jan 20, 2016

synctext modified the milestones: V6.6 WX3, V6.7 credits Feb 16, 2016

synctext modified the milestones: Backlog, V6.7 credits Apr 19, 2016

synctext changed the title ~~Crypto performance experiment~~ slow anonymous downloads: Crypto CPU bottleneck Jun 23, 2016

synctext assigned qstokkink Jun 23, 2016

synctext unassigned lfdversluis Jun 23, 2016

This was referenced Jun 23, 2016

Perfect Darknet Roadmap Ticket #3

Open

attack-resilient micro-economy for media #1

Open

qstokkink mentioned this issue Nov 1, 2016

WIP: TunnelCommunity multiprocessing #2607

Closed

synctext closed this as completed Feb 26, 2018

slow anonymous downloads: Crypto CPU bottleneck #1882

slow anonymous downloads: Crypto CPU bottleneck #1882

Comments

synctext commented Jan 20, 2016 • edited Loading

whirm commented Jan 20, 2016

synctext commented Apr 19, 2016 • edited Loading

synctext commented May 18, 2016 • edited Loading

synctext commented May 26, 2016

lfdversluis commented May 26, 2016 • edited Loading

synctext commented Jun 23, 2016 • edited Loading

qstokkink commented Jun 24, 2016

whirm commented Jun 24, 2016

qstokkink commented Aug 25, 2016

synctext commented Aug 25, 2016

qstokkink commented Aug 25, 2016

devos50 commented Aug 25, 2016

qstokkink commented Aug 26, 2016 • edited Loading

synctext commented Aug 26, 2016

qstokkink commented Aug 26, 2016 • edited Loading

synctext commented Aug 26, 2016

qstokkink commented Oct 30, 2016 • edited Loading

devos50 commented Oct 30, 2016

qstokkink commented Oct 30, 2016

qstokkink commented Oct 31, 2016 • edited Loading

synctext commented Oct 31, 2016

synctext commented Oct 31, 2016

synctext commented Nov 1, 2016

qstokkink commented Dec 15, 2016

devos50 commented Dec 15, 2016

qstokkink commented Dec 15, 2016 • edited Loading

synctext commented Dec 15, 2016

qstokkink commented Dec 16, 2016

qstokkink commented Dec 29, 2016 • edited Loading

synctext commented Dec 29, 2016

qstokkink commented Dec 29, 2016

devos50 commented Dec 29, 2016

devos50 commented Dec 30, 2016

qstokkink commented Dec 30, 2016

synctext commented Jan 5, 2017

synctext commented Feb 1, 2017 • edited Loading

qstokkink commented Feb 3, 2017

synctext commented Feb 3, 2017

synctext commented Feb 14, 2017

qstokkink commented Feb 18, 2017

synctext commented Feb 20, 2017 • edited Loading

qstokkink commented Feb 21, 2017 • edited Loading

synctext commented Feb 21, 2017

synctext commented Feb 26, 2018

synctext commented Jan 20, 2016 •

edited

Loading

synctext commented Apr 19, 2016 •

edited

Loading

synctext commented May 18, 2016 •

edited

Loading

lfdversluis commented May 26, 2016 •

edited

Loading

synctext commented Jun 23, 2016 •

edited

Loading

qstokkink commented Aug 26, 2016 •

edited

Loading

qstokkink commented Aug 26, 2016 •

edited

Loading

qstokkink commented Oct 30, 2016 •

edited

Loading

qstokkink commented Oct 31, 2016 •

edited

Loading

qstokkink commented Dec 15, 2016 •

edited

Loading

qstokkink commented Dec 29, 2016 •

edited

Loading

synctext commented Feb 1, 2017 •

edited

Loading

synctext commented Feb 20, 2017 •

edited

Loading

qstokkink commented Feb 21, 2017 •

edited

Loading