Use asyncio for TCP/TLS comms #5450
Conversation
This is a WIP PR for using asyncio instead of tornado for TCP/TLS comms. There are a few goals here:
- Reduce our dependency on tornado, in favor of the builtin asyncio support.
- Lower latency for small/medium sized messages. Right now the tornado call stack in the IOStream interface increases our latency for small messages. We can do better here by making use of asyncio protocols.
- Equal performance for large messages. We should be able to make zero-copy reads/writes just as before. In this case I wouldn't expect a performance increase from asyncio for large (10 MiB+) messages, but we also shouldn't see a slowdown.
- Improved TLS performance. The TLS implementation in asyncio (and more so in uvloop) is better optimized than the implementation in tornado. We should see a measurable performance boost here.
- Reduced GIL contention when using TLS. Right now a single write or read through tornado drops and reacquires the GIL several times. If there are busy background threads, this can lead to the IO thread having to wait for the GIL mutex multiple times, leading to lower IO performance. The implementations in asyncio/uvloop don't drop the GIL except for IO (instead of also dropping it for TLS operations), which leads to fewer chances of another thread picking up the GIL and slowing IO.
This PR is still a WIP, and still has some TODOs:
- Pause the read side of the protocol after a certain limit. Right now the protocol will buffer in memory without a reader, providing no backpressure.
- The tornado comms version includes some unnecessary length info in its framing that I didn't notice when implementing the asyncio version originally. As is, the asyncio comm can't talk to the tornado comm. We'll want to fix this for now, as it would break cross-version support (but we can remove the excess framing later if/when we make other breaking changes).
- Right now the asyncio implementation is slower than expected for large frames. Need to profile and debug why.
- Do we want to keep both the tornado and asyncio implementations around for a while behind a config knob, or do we want to fully replace the implementation with asyncio? I tend towards the latter, but may add an interim config in this PR (to be ripped out before merge) to make it easier for users to test and benchmark this PR.
This is by no means ready for review yet; I'm just getting it up to save it somewhere not on my computer. If people want to try it out, though (it works, but has known TODOs), please feel free. cc @gjoseph92 and @jakirkham, both of whom may be interested. |
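For readers less familiar with asyncio protocols, here is a minimal, hypothetical sketch of the protocol-style framing approach described above. The class name and framing format are illustrative only and are not this PR's implementation:

```python
import asyncio
import struct


class FramedProtocol(asyncio.Protocol):
    """Length-prefixed framing on top of an asyncio Protocol (illustrative).

    data_received is invoked directly by the event loop, avoiding the extra
    coroutine/Future hops that tornado's IOStream adds per read.
    """

    def __init__(self):
        self._buffer = bytearray()
        self._messages = asyncio.Queue()

    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        self._buffer.extend(data)
        # Peel off as many complete [4-byte length][payload] frames as possible.
        while len(self._buffer) >= 4:
            (length,) = struct.unpack_from("<I", self._buffer)
            if len(self._buffer) < 4 + length:
                break
            frame = bytes(self._buffer[4:4 + length])
            del self._buffer[:4 + length]
            self._messages.put_nowait(frame)

    def write(self, frame: bytes):
        self.transport.write(struct.pack("<I", len(frame)) + frame)

    async def read(self) -> bytes:
        return await self._messages.get()
```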
Also PR ( MagicStack/uvloop#445 ) may be relevant for us when sending multiple buffers |
- Work around a cpython bug that prevents listening on all interfaces when also using a random port.
- Assorted other small cleanups
- Make the asyncio protocol compatible with the tornado protocol. This adds back an unnecessary header prefix, which can be removed later if desired.
- Add an environment variable for enabling asyncio-based comms. They are now disabled by default.
The builtin asyncio socket transport makes lots of copies, which can slow down large writes. To get around this, we implement a hacky wrapper for the transport that removes the use of copies (at the cost of some more bookkeeping).
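For intuition, here is a rough sketch of the bookkeeping such a wrapper needs. The names are illustrative; the actual `_ZeroCopyWriter` in this PR hooks into the event loop's transport machinery and is more involved:

```python
from collections import deque


class NoCopyWriteBuffer:
    """Illustrative only. Instead of concatenating outgoing frames into one
    bytearray (which copies every payload), keep memoryviews of the original
    buffers and send them directly, slicing on partial sends."""

    def __init__(self, sock):
        self.sock = sock          # non-blocking socket owned by the event loop
        self.buffers = deque()

    def write(self, data):
        self.buffers.append(memoryview(data).cast("B"))

    def flush_ready(self):
        """Call when the event loop reports the socket as writable."""
        while self.buffers:
            buf = self.buffers[0]
            try:
                n = self.sock.send(buf)
            except BlockingIOError:
                return
            if n < len(buf):
                self.buffers[0] = buf[n:]   # keep the unsent tail, no copy
                return
            self.buffers.popleft()
```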
Ok, this all works now and is generally (with some caveats) faster than the existing tornado comms. It's surprisingly hard to isolate the comms from everything else (e.g. the protocol) - here's the benchmark I've been using:
Benchmark Script
import argparse
import os
from dask.utils import format_time, format_bytes
from distributed.comm import listen, connect, CommClosedError
from distributed.protocol.serialize import Serialized
import asyncio, time
async def handle_comm(comm):
try:
while True:
msg = await comm.read()
await comm.write(msg=msg)
except CommClosedError:
pass
class Bench:
def __init__(self, n_bytes, n_frames, seconds):
self.n_bytes = n_bytes
self.n_frames = n_frames
self.seconds = seconds
self.running = True
async def _loop(self):
frames = [os.urandom(self.n_bytes) for _ in range(self.n_frames)]
msg = Serialized({}, frames)
async with listen("tcp://127.0.0.1:8080", handle_comm, deserialize=False):
comm = await connect("tcp://127.0.0.1:8080", deserialize=False)
start = time.time()
nmsgs = 0
while self.running:
await comm.write(msg=msg)
await comm.read()
nmsgs += 1
end = time.time()
print(f"{format_time((end - start) / nmsgs)} roundtrip")
await comm.close()
async def _stopper(self):
await asyncio.sleep(self.seconds)
self.running = False
async def entry(self):
asyncio.create_task(self._stopper())
await self._loop()
def run(self):
asyncio.get_event_loop().run_until_complete(self.entry())
def main():
parser = argparse.ArgumentParser(description="Benchmark channels")
parser.add_argument(
"--frames",
"-f",
default=1,
type=int,
help="Number of frames per message",
)
parser.add_argument(
"--bytes", "-b", default=1000, type=float, help="total payload size in bytes"
)
parser.add_argument(
"--seconds", "-s", default=5, type=int, help="bench duration in secs"
)
parser.add_argument("--uvloop", action="store_true", help="Whether to use uvloop")
args = parser.parse_args()
if args.uvloop:
import uvloop
uvloop.install()
print(
f"Benchmarking -- n_frames = {args.frames}, "
f"bytes_per_frame = {format_bytes(args.bytes)}, "
f"uvloop={args.uvloop}"
)
Bench(int(args.bytes), args.frames, args.seconds).run()
if __name__ == "__main__":
    main()

A summary of results:
Tornado based comms
Asyncio based comms
So we're generally faster for single frame messages. However! When writing multiple frames per message, we're slower:
I've tracked this down to this line here: distributed/distributed/comm/tcp.py Line 208 in 06ba74b
The tornado based comms are allocating a single large array for all frames in a message, then filling that in. In the asyncio comm, I allocate a separate array for each frame. The allocation overhead of creating multiple Note that to get the perf on write we had to muck with asyncio's internals. The default socket transport in asyncio will make unnecessary copies on write (hence the |
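To make the trade-off concrete, here is a small illustrative sketch (not code from either comm) of the two receive-buffer strategies being compared:

```python
# Illustrative only -- two strategies for allocating receive buffers
# for a message's frames, given their lengths.

def read_frames_separate(frame_lengths):
    # One allocation per frame: more small mallocs, but each frame owns
    # its own memory and can be freed independently.
    return [bytearray(n) for n in frame_lengths]


def read_frames_single(frame_lengths):
    # One big allocation; frames are zero-copy views into it (the approach
    # the tornado-based comm takes). Cheaper to allocate, but the big buffer
    # stays alive as long as any single frame does.
    buf = memoryview(bytearray(sum(frame_lengths)))
    views, offset = [], 0
    for n in frame_lengths:
        views.append(buf[offset:offset + n])
        offset += n
    return views
```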
Just curious, what are the results with
This is interesting, and at least deserves an xref to #5208. @jakirkham can speak to this a lot more, but because we currently split frames in our protocol, often 1 frame < 1 output object that needs contiguous memory (such as a NumPy array). So we do want each block of sub-frames to all use the same contiguous memory, so we can zero-copy deserialize them. |
Ok, here's a more complete benchmark script. It now supports multiple clients, enabling/disabling uvloop, and enabling/disabling TLS.
Benchmark Script
import argparse
import asyncio
import datetime
import os
from concurrent.futures import ProcessPoolExecutor
from distributed.comm import listen, connect, CommClosedError
from distributed.protocol.serialize import Serialized
DIR = os.path.abspath(os.path.dirname(__file__))
CERT = os.path.join(DIR, "bench-cert.pem")
KEY = os.path.join(DIR, "bench-key.pem")
def ensure_certs():
if not (os.path.exists(KEY) and os.path.exists(CERT)):
from cryptography import x509
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID
key = rsa.generate_private_key(
public_exponent=65537, key_size=2048, backend=default_backend()
)
key_bytes = key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.PKCS8,
encryption_algorithm=serialization.NoEncryption(),
)
subject = issuer = x509.Name(
[x509.NameAttribute(NameOID.COMMON_NAME, "ery-bench")]
)
now = datetime.datetime.utcnow()
cert = (
x509.CertificateBuilder()
.subject_name(subject)
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(now)
.not_valid_after(now + datetime.timedelta(days=365))
.sign(key, hashes.SHA256(), default_backend())
)
cert_bytes = cert.public_bytes(serialization.Encoding.PEM)
with open(CERT, "wb") as f:
f.write(cert_bytes)
with open(KEY, "wb") as f:
f.write(key_bytes)
def get_ssl_context():
import ssl
context = ssl.SSLContext(ssl.PROTOCOL_TLS)
context.load_cert_chain(CERT, KEY)
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
return context
class BenchProc:
def __init__(self, n_bytes, n_frames, n_seconds, n_clients, use_tls):
self.n_bytes = n_bytes
self.n_frames = n_frames
self.n_seconds = n_seconds
self.n_clients = n_clients
self.use_tls = use_tls
async def client(self, address, comm_kwargs, msg):
comm = await connect(address, deserialize=False, **comm_kwargs)
while self.running:
await comm.write(msg)
await comm.read()
await comm.close()
def stop(self):
self.running = False
async def run(self):
self.running = True
if self.use_tls:
kwargs = {"ssl_context": get_ssl_context()}
prefix = "tls"
else:
kwargs = {}
prefix = "tcp"
address = f"{prefix}://127.0.0.1:8080"
msg = Serialized({}, [os.urandom(self.n_bytes) for _ in range(self.n_frames)])
loop = asyncio.get_running_loop()
loop.call_later(self.n_seconds, self.stop)
tasks = [
asyncio.create_task(self.client(address, kwargs, msg))
for _ in range(self.n_clients)
]
await asyncio.gather(*tasks, return_exceptions=True)
def bench_proc_main(n_bytes, n_frames, n_seconds, n_clients, use_tls):
bench = BenchProc(n_bytes, n_frames, n_seconds, n_clients, use_tls)
asyncio.run(bench.run())
async def run(n_bytes, n_frames, n_seconds, n_procs, n_clients, use_tls):
if use_tls:
ensure_certs()
kwargs = {"ssl_context": get_ssl_context()}
prefix = "tls"
else:
kwargs = {}
prefix = "tcp"
address = f"{prefix}://127.0.0.1:8080"
loop = asyncio.get_running_loop()
count = 0
async def handle_comm(comm):
nonlocal count
try:
while True:
msg = await comm.read()
await comm.write(msg=msg)
count += 1
except CommClosedError:
pass
async with listen(address, handle_comm, deserialize=False, **kwargs):
with ProcessPoolExecutor(max_workers=n_procs) as executor:
tasks = [
loop.run_in_executor(
executor,
bench_proc_main,
n_bytes,
n_frames,
n_seconds,
n_clients,
use_tls,
)
for _ in range(n_procs)
]
await asyncio.gather(*tasks)
print(f"{count / n_seconds} RPS")
print(f"{n_seconds / count * 1e6} us per request")
print(f"{n_bytes * count / (n_seconds * 1e6)} MB/s each way")
def main():
parser = argparse.ArgumentParser(description="Benchmark channels")
parser.add_argument(
"--procs", "-p", default=1, type=int, help="Number of client processes"
)
parser.add_argument(
"--concurrency",
"-c",
default=1,
type=int,
help="Number of clients per process",
)
parser.add_argument(
"--frames",
"-f",
default=1,
type=int,
help="Number of frames per message",
)
parser.add_argument(
"--bytes", "-b", default=1000, type=float, help="total payload size in bytes"
)
parser.add_argument(
"--seconds", "-s", default=5, type=int, help="bench duration in secs"
)
parser.add_argument("--uvloop", action="store_true", help="Whether to use uvloop")
parser.add_argument("--tls", action="store_true", help="Whether to use TLS")
args = parser.parse_args()
if args.uvloop:
import uvloop
uvloop.install()
n_bytes = int(args.bytes)
print(
f"processes = {args.procs}, "
f"concurrency = {args.concurrency}, "
f"bytes = {n_bytes}, "
f"frames = {args.frames}, "
f"seconds = {args.seconds}, "
f"uvloop = {args.uvloop}, "
f"tls = {args.tls}"
)
asyncio.run(
run(n_bytes, args.frames, args.seconds, args.procs, args.concurrency, args.tls)
)
if __name__ == "__main__":
    main()

There are many different axes to benchmark on; instead of posting complete results I'll just summarize the current status below. If you're interested in specific numbers, I encourage you to pull this branch locally and run the benchmark yourself.
Does uvloop help?
Kind of? It really depends on the bench flags used.
Are the asyncio comms faster than the tornado comms (no TLS)?
Generally yes. This is especially true for small messages - as the message size increases, the performance difference decreases. Here we benchmark with uvloop exclusively to show the maximum payoff. Without uvloop, asyncio yields at most 20% lower RPS (for small messages), and is slightly faster for large messages. Some numbers are below, but in general expect the asyncio comms to be 2x faster than the tornado comms for messages <= 1e4 bytes, decreasing to equal performance for messages of 1e7 bytes.
msg size = 1e3 bytes
msg size = 1e4 bytes
msg size = 1e5 bytes
msg size = 1e6 bytes
msg size = 1e7 bytes
Are the asyncio comms faster than the tornado comms (with TLS)?
It depends. For small messages, yes. For large messages, no.
I haven't figured out why yet, but I suspect it's unnecessary buffer copying (probably on the
Note that if uvloop is not used, then the asyncio comm performance tanks even further. This is due to a few things:
With some work (mostly copying and modifying code out of asyncio in the stdlib to reduce copies) we could likely close the gap between uvloop & standard asyncio, but I don't think that work would be a good use of time.
msg size = 1e3 bytes
msg size = 1e4 bytes
msg size = 1e5 bytes
msg size = 1e6 bytes
Assorted Thoughts
|
This is unfortunate, and would be good to fix. For messages that contain multiple buffers (e.g. a pandas dataframe composed of multiple numpy arrays), currently all numpy arrays will be backed by the same memoryview after deserialization. If an operation returns a single column, this would keep the whole memoryview in memory a lot longer than expected, and may masquerade as a memory leak. Frame splitting doesn't belong at the protocol level and would be better handled by the transports affected by it IMO. |
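To illustrate the retention issue described above (a NumPy array built on such a view pins the backing buffer in exactly the same way; sizes here are arbitrary):

```python
# Sizes are arbitrary, purely to illustrate the retention problem.
buf = bytearray(100_000_000)   # one ~100 MB receive buffer for a whole message
frames = memoryview(buf)
column = frames[:8_000]        # one small deserialized piece, zero-copy

del buf, frames
# `column` is only an 8 kB view, but it keeps the entire 100 MB bytearray
# alive via its underlying object reference:
print(column.nbytes)           # 8000
print(len(column.obj))         # 100000000
```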
TLS?
In case of uncertainty between no-TLS and TLS, we should emphasize TLS in this analysis. I expect most large scale users care about TLS. Large scale users will also be the ones who care about performance the most. I don't see anybody paying money for large scale deployments and high perf clusters who wouldn't want to protect their cluster with TLS. Although, running the benchmarks myself, I'm a bit shocked about how much we pay for TLS for large payloads... (I measure a drop to ~30% throughput using TLS with messages at 1e6 bytes)
How realistic are the benchmarks?
My initial thoughts about these benchmarks are that we have two different usages (actually three) of our comms and we might want to discuss them distinctly. I'm trying to discuss the edge cases where I see the system under heavy stress, i.e. the worst case.
Worker<->Worker communication to fetch data / send_recv
The worst case here is that we have many dependencies to fetch from many different workers. We already apply batching until we reach a size of I would assume that something like This particular command gives me almost identical measurements for new vs old. In this example, the main process would be the worker and it's communicating to ten other workers (processes). In this approximation we assume the remote workers are not under super heavy load. The concurrency of two is merely there for stressing out the main process.
Scheduler<->Worker administrative stream using
|
Final remark: I think we should put those benchmarks into ASV or something similar for the sake of preservation. This is likely not the last time we're going to do this. |
My recollection was that we started this work because we observed GIL-holding issues in shuffle workloads, and also surmised that this might affect scheduler performance (also odd GIL issues there). The benchmarks that I'm seeing here don't seem related to that work. Should we assess impact there? |
My understanding of the relation to GIL is:
But we've also talked about switching to asyncio as a benefit unto itself, since it reduces the technical debt of depending on tornado.
If we wanted to benchmark how much we could improve GIL convoy effect performance with asyncio vs tornado, then I think we'd want to use both @jcrist's patched uvloop, and patch cpython to do the same, then rerun these benchmarks while a GIL-holding thread is running in the background. Intuitively though, I think GIL-convoy-effect performance is orthogonal to the asyncio vs tornado question, and even to uvloop vs pure asyncio. The problem is just that a nonblocking
So basically, when the convoy effect is in play, I'd expect any performance difference between asyncio/uvloop/tornado to be washed away. And if a low-level patch eliminates the convoy effect across the board, then I'd expect the performance differences we see in these current benchmarks to remain about the same whether or not there's a GIL-holding background thread. Of course, if one library (uvloop) can eliminate the convoy effect while another (asyncio) can't, then that would become the dominant factor and any asyncio-vs-tornado effect might fade out—except if tornado doesn't use uvloop's sockets/SSL in a way that lets it take advantage of that (which IIUC maybe it doesn't?). |
I agree with everything @gjoseph92 said above. I made a quick modification to the benchmark script above to measure the effect of the GIL's interaction with the comms (source here: https://gist.github.com/jcrist/c6336718edaabde21f2e1c269ac2bc88). This adds two new flags, --gil-hold-time and --gil-release-time. If --gil-hold-time is set, a background thread is started that loops forever, holding the GIL for the hold time and releasing it for the release time (so if only --gil-hold-time is set, the background thread holds the GIL forever). Note that the GIL-holding thread only runs in the server, slowing down responses only.
The results are interesting. First, some background on how cpython's GIL currently works. Given a process with two threads (like the benchmark script), where one thread hogs the GIL and the other does some IO:
- The background thread should acquire the GIL every time the IO thread releases it.
- The background thread shouldn't release the GIL until the switch interval is hit (default 0.005 s).
This means that we'd expect the IO thread to support releasing the GIL a maximum of 200 times a second. For asyncio & tornado (where each request has a socket.recv and a socket.send), we'd expect to cap around 100 RPS. Uvloop doesn't use the stdlib socket module, but does release and acquire the GIL as necessary during each iteration of its event loop.
For simplicity, we'll benchmark with a background thread that never releases the GIL (worst case scenario). We'll test 3 configurations:
- tornado comms
- asyncio comms
- asyncio comms with uvloop
One Client, no TLS
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=0 python bench.py --gil-hold-time=1
processes = 1, concurrency = 1, bytes = 1000, frames = 1, seconds = 5, uvloop = False, tls = False
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
96.52289783563648 RPS
10360.235989835746 us per request
0.09652289783563649 MB/s each way
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=1 python bench.py --gil-hold-time=1
processes = 1, concurrency = 1, bytes = 1000, frames = 1, seconds = 5, uvloop = False, tls = False
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
97.04507931799715 RPS
10304.489491148766 us per request
0.09704507931799715 MB/s each way
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=1 python bench.py --gil-hold-time=1 --uvloop
processes = 1, concurrency = 1, bytes = 1000, frames = 1, seconds = 5, uvloop = True, tls = False
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
97.2629619923594 RPS
10281.405989656743 us per request
0.09726296199235938 MB/s each way
All 3 are stuck at ~100 RPS, matching the theory from above (each request releases the GIL twice, each release takes 0.005 seconds to reacquire). Now watch what happens if we increase the number of active clients:
6 clients, no TLS
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=0 python bench.py -p3 -c2 --gil-hold-time=1
processes = 3, concurrency = 2, bytes = 1000, frames = 1, seconds = 5, uvloop = False, tls = False
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
99.26745186019129 RPS
10073.795400816818 us per request
0.09926745186019129 MB/s each way
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=1 python bench.py -p3 -c2 --gil-hold-time=1
processes = 3, concurrency = 2, bytes = 1000, frames = 1, seconds = 5, uvloop = False, tls = False
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
84.39836410347574 RPS
11848.570889051362 us per request
0.08439836410347575 MB/s each way
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=1 python bench.py -p3 -c2 --gil-hold-time=1 --uvloop
processes = 3, concurrency = 2, bytes = 1000, frames = 1, seconds = 5, uvloop = True, tls = False
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
290.9671084955164 RPS
3436.8145773267333 us per request
0.2909671084955164 MB/s each way
In this case both tornado and asyncio are still capped at ~100 RPS, but uvloop is higher at 290 RPS. This surprised me upon first glance, but upon closer reading of uvloop's code it's due to uvloop being able to avoid excess GIL drop & reacquire calls during a single event loop iteration. This is only possible with multiple clients, as the request-response model means the event loop only has one thing to read/write every loop per socket. In practice a fully-blocking background thread is likely rare, but this does mean that uvloop handles this case better (though still not well). Tornado is not able to take advantage of this behavior of uvloop (since it uses raw socket operations), so no benefits there.
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=0 python bench.py -p3 -c2 --gil-hold-time=1 --uvloop
processes = 3, concurrency = 2, bytes = 1000, frames = 1, seconds = 5, uvloop = True, tls = False
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
96.68774637809385 RPS
10342.572223056446 us per request
0.09668774637809385 MB/s each way
1 client, TLS
Lastly, we test TLS behavior. Originally I thought that asyncio would handle this case better - I had misread tornado's SSL handling code and thought that every write tornado made with ssl enabled would release the GIL twice (once in ssl code, once in socket code), but this doesn't appear to be the case. The benchmarks show that actually asyncio code must release the GIL twice for every request, halving performance.
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=0 python bench.py --tls --gil-hold-time=1
processes = 1, concurrency = 1, bytes = 1000, frames = 1, seconds = 5, uvloop = False, tls = True
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
96.0717432132376 RPS
10408.887843124005 us per request
0.0960717432132376 MB/s each way
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=1 python bench.py --tls --gil-hold-time=1
processes = 1, concurrency = 1, bytes = 1000, frames = 1, seconds = 5, uvloop = False, tls = True
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
38.73910709375392 RPS
25813.708033586412 us per request
0.03873910709375392 MB/s each way
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=1 python bench.py --tls --gil-hold-time=1 --uvloop
processes = 1, concurrency = 1, bytes = 1000, frames = 1, seconds = 5, uvloop = True, tls = True
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
39.254480500048835 RPS
25474.798985016652 us per request
0.03925448050004884 MB/s each way
The previously described uvloop behavior helps a bit here with multiple clients, but it's still disadvantaged:
$ DISTRIBUTED_USE_ASYNCIO_FOR_TCP=1 python bench.py --tls --gil-hold-time=1 --uvloop -p10
processes = 10, concurrency = 1, bytes = 1000, frames = 1, seconds = 5, uvloop = True, tls = True
gil_hold_time = 1.00 s, gil_release_time = 0.00 us
61.14442014312276 RPS
16354.722109707918 us per request
0.061144420143122755 MB/s each way
I haven't dug in yet to figure out where in the asyncio/uvloop code the GIL is being released twice. I suspect the write buffer size is smaller, so we're making more transport.write calls than tornado does, but that's just a guess. |
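For reference, a rough sketch of the GIL-hogging background thread and the switch-interval arithmetic described above. This approximates the `--gil-hold-time`/`--gil-release-time` behavior; it is not the benchmark's actual code:

```python
import sys
import threading
import time


# Rough sketch of the --gil-hold-time background thread: spin in pure
# Python to hold the GIL, optionally sleeping to release it.
def gil_hog(hold_time=1.0, release_time=0.0):
    while True:
        end = time.monotonic() + hold_time
        while time.monotonic() < end:
            pass                      # busy loop: keeps the GIL
        if release_time:
            time.sleep(release_time)  # sleeping releases the GIL


threading.Thread(target=gil_hog, daemon=True).start()

# The ~100 RPS ceiling: the IO thread only gets the GIL back after waiting a
# full switch interval, and each request releases the GIL twice
# (socket.recv + socket.send).
switch = sys.getswitchinterval()  # 0.005 s by default
print(1 / switch)                 # ~200 GIL handoffs per second
print(1 / switch / 2)             # ~100 requests per second
```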
Thanks for this. It might be useful to see how these changes affect the scheduler performance and shuffle benchmarks directly. I appreciate that this is possibly a pain, but it might shine more representative light on the situation. |
Have caught up on reading this thread. Though not quite all of the code yet. Still digesting. So this may be adding more dimensions than we want to think about atm, though I recall with ery UNIX domain sockets were explored. This has also come up here ( #3630 ) as well. IIRC these were faster than TCP. Is this something we've explored? No is a fine answer (and understand if other things take priority) |
I've made a patch to CPython here. This results in negligible slowdown of asyncio when mixing IO and CPU intensive threads, provided that IO is mostly saturated. Anytime the asyncio thread would block (waiting for new IO events) the background thread reclaims the GIL and the IO thread will have to wait to reacquire it. This makes sense to me - do IO work when possible (to the detriment of the CPU work), then do CPU work when idle (note that the CPU thread can still steal the GIL after the switch interval if the IO thread doesn't release it by then). Numbers:
Since this patch will take a while to get into cpython (and/or may never get in), I don't think it's worth spending a ton more time on it (I spent an hour or so this morning looking into this), but it was an interesting experiment. |
Agreed. This is flagged as a TODO in our internal queue, but it's good to flag it here as well.
This is unrelated, but also not terribly hard to support in asyncio. Unix sockets generally have lower latency on supported platforms, but on osx (and maybe others) they have significantly lower bandwidth than using TCP with loopback. So as to whether they're faster, the answer would be "it depends". They also only work for processes communicating on a single machine, which would be relevant for |
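For illustration only, a minimal sketch of what UNIX-socket comms could look like with asyncio's builtin support (the path and structure here are hypothetical; this is not part of this PR):

```python
import asyncio


async def main():
    # Hypothetical path; a real comm would derive this from the address scheme.
    path = "/tmp/dask-bench.sock"

    async def echo(reader, writer):
        writer.write(await reader.readline())
        await writer.drain()
        writer.close()

    server = await asyncio.start_unix_server(echo, path=path)
    async with server:
        reader, writer = await asyncio.open_unix_connection(path=path)
        writer.write(b"ping\n")
        await writer.drain()
        print(await reader.readline())  # b'ping\n'
        writer.close()
        await writer.wait_closed()


asyncio.run(main())  # Unix-only; Windows lacks AF_UNIX support in asyncio
```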
@jcrist this CPython GIL experiment is really cool. Would it be possible to also report how many cycles the GIL-holding thread made it through? How much is the CPU thread's performance degraded by improving IO performance? I'm hoping that it's imbalanced (more IO doesn't hurt CPU much); that would make a patch like this much more likely to get into uvloop/cpython someday. |
Ok, this should be ready for review. The windows test failures are a known flaky test. I'm not sure about the osx failures, they seem unrelated and don't fail every time. |
I've reviewed a bit over half of this so far; _ZeroCopyWriter coming up tomorrow. This is really cool!
I hadn't realized until reading this that you're now loading each frame into its own buffer, instead of one big buffer per message like the current TCP comm. I'm curious how much this is affecting the performance numbers, beyond just tornado vs asyncio. I would imagine it's a disadvantage compared to current TCP (more small mallocs). This is definitely the right way to go once we get frame-splitting out of serialization though.
Was rereading your comment above ( #5450 (comment) ) Jim about handling a large number of connections and this is probably already on your radar at Coiled, but one thing that comes up is users wanting to spin up large numbers of workers. For example ( #3691 ). ATM things don't behave particularly well there. Though that may be for all sorts of reasons. That said, faster/more reliable handling of more connections seems like an important first step. If |
Due to flaws in distributed's current serialization process, it's easy for a message to be serialized as hundreds/thousands of tiny buffers. We now coalesce these buffers into a series of larger buffers following a few heuristics. The goal is to minimize copies while also minimizing socket.send calls. This also moves to using `sendmsg` for writing from multiple buffers. Both optimizations seem critical for some real world workflows.
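A simplified sketch of the coalescing idea (the real `coalesce_buffers` heuristics and thresholds in this PR differ):

```python
def coalesce_buffers(buffers, small_cutoff=2048):
    """Sketch only: copy runs of small buffers into one combined bytes object,
    but pass large buffers through untouched so they can still be written
    zero-copy."""
    out, pending = [], []

    def flush():
        if len(pending) == 1:
            out.append(pending[0])
        elif pending:
            out.append(b"".join(pending))  # copies, but only small buffers
        pending.clear()

    for buf in buffers:
        if len(buf) < small_cutoff:
            pending.append(buf)
        else:
            flush()
            out.append(buf)  # large buffer: no copy
    flush()
    return out
```

The coalesced list can then be handed to `socket.sendmsg`, which writes from multiple buffers in a single syscall instead of one `send` per buffer.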
_ZeroCopyWriter is cool!
During extraction of `coalesce_buffers` a bug handling memoryviews of non-single-byte formats was introduced. This fixes that.
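A small example of why non-single-byte memoryviews need care in byte-oriented buffer handling:

```python
import array

# For a memoryview with a multi-byte format, len() counts items rather than
# bytes, so byte-oriented bookkeeping (lengths, slicing, concatenation) goes
# wrong unless the view is first cast to single bytes.
buf = memoryview(array.array("d", [1.0, 2.0, 3.0]))  # format "d": 8 bytes/item

print(len(buf), buf.nbytes)    # 3 24
flat = buf.cast("B")           # reinterpret as raw bytes
print(len(flat), flat.nbytes)  # 24 24 -- now safe to handle byte-by-byte
```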
- Adds write backpressure to the `_ZeroCopyWriter` patch.
- Sets the high water mark for all transports - the default can be quite low in some implementations.
- Respond to some feedback (still have more to do).
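A minimal sketch of how write backpressure and the high water mark interact in an asyncio protocol (illustrative; not the PR's actual comm class):

```python
import asyncio


class BackpressuredProtocol(asyncio.Protocol):
    """Illustrative flow-controlled writer, not the PR's actual comm class."""

    def connection_made(self, transport):
        self.transport = transport
        # Raise the high water mark; the default can be quite low in some
        # event loop implementations.
        transport.set_write_buffer_limits(high=16 * 1024 * 1024)
        self._can_write = asyncio.Event()
        self._can_write.set()

    def pause_writing(self):
        # Called by the transport when its buffer exceeds the high water mark.
        self._can_write.clear()

    def resume_writing(self):
        # Called when the buffer drains below the low water mark.
        self._can_write.set()

    async def write(self, frame: bytes):
        await self._can_write.wait()   # apply backpressure to the writer
        self.transport.write(frame)

    def data_received(self, data):
        pass  # reads not shown in this sketch
```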
- Fix windows support (hopefully).
- Lots of comments and docstrings.
- A bit of simplification of the buffer management.
- Closed comms always raise `CommClosedError` instead of the underlying OS error.
- Simplify buffer peek implementation.
- Decrease threshold for always concatenating buffers. With `sendmsg` there's less pressure to do so, as scatter IO is equally efficient.
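A sketch of the error-normalization idea in the first bullet (the helper shape is illustrative, not the PR's exact code):

```python
from distributed.comm import CommClosedError


# Callers should only ever see CommClosedError when the connection has gone
# away, never a raw OSError from the socket layer.
async def read_with_normalized_errors(do_read):
    try:
        return await do_read()
    except OSError as exc:  # covers ConnectionResetError, BrokenPipeError, ...
        raise CommClosedError(f"in comm: {exc!r}") from exc
```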
@jcrist just curious, why'd you get rid of SENDMSG_MAX_SIZE? Just because there's not really extra cost to giving sendmsg more data than can fit into the socket's buffer?
Yeah. The main point was to reduce our processing cost for prepping the buffers for each |
This keeps failing on some unrelated windows tests that I think are flaky (I've seen them fail elsewhere). I'm not sure how the changes here would cause these failures, as the comms used in those cases are still the tornado ones. Anyway, I believe this to be good to merge now. |
The recent test failures on main are visualized at https://ian-r-rose.github.io/distributed/test_report.html. Can you find the flaky test there? I would defer approval to @gjoseph92 |
One of the test failures is |
I see the first failure for
Edit: it seems it's a brand new test from @crusaderky
Thanks for working on this Jim! 😄
Tried to do a more thorough review now that I have a bit more time to do so. Apologies for getting to this late.
I believe this to be ready for merge. |
You've got a LGTM from M!
I'm very excited about this. Even if there are issues—which I don't think there are after a number of reads over—there's really no risk in merging this, so let's get it in. This is a nice stepping-stone for a number of things we'd like to improve (GIL convoy effect, framing in serialization, serialization in general, proper uvloop support, ...)
Will merge EOD if no comments |
Acknowledged; I'm looking it up |
Fixes #4513.