Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce performance measurements #604

Open
sorcio opened this issue Aug 14, 2018 · 34 comments
Open

Introduce performance measurements #604

sorcio opened this issue Aug 14, 2018 · 34 comments

Comments

@sorcio
Copy link
Contributor

sorcio commented Aug 14, 2018

Obligatory background: we all agree (or at least nodded our heads in agreement when this was mentioned many times in the past) that Trio has a number of priorities over speed of execution. Optimizing for speed or any other performance metric should never hinder simplicity, "humanness", correctness, maintainability, predictability, security, debuggability, or other design principles. Premature optimization might be the root of some evil, and focusing on efficiency carries the risk of getting in the way of innovation. Even just identifying what is performance critical takes a lot of effort and is a moving target.

Given all of this, I believe we should provide some tools to make sure that Trio doesn't add too much overhead and allows efficient user code to run efficiently. This is more complicated than "create a bunch of benchmarks" but I want to start by creating a bunch of benchmarks, at least to start thinking about this. I had found myself caring about performance in a few cases and I want to do more work in that direction. It's relevant to me and it might be to someone else so it would be nice to share.

When I say a bunch of benchmarks I think of something that should reproduce a realistic workload for a Trio application. I don't think we are in a position to actually understand what a "realistic workload" is. I have a couple examples that I can derive from things I do for my paid job, and we can think of some more more (including something from projects you are working on, porting asyncio benchmarks and examples, adding trio-asyncio examples, etc). We can also produce some microbenchmarks. Also "performance" is a broad term that can mean many different things. In an I/O application it can mean throughput, latency, bandwidth usage, fairness, CPU impact, memory impact, or anything I did not think about. Covering all of these aspects can be hard (although sometimes some is a good proxy for others, e.g. a microbenchmark for execution time of some operation may provide hints about CPU, latency and throughput).

In terms of process, I don't think that performance measurements should get in the way of approving a PR or anything like that. Yeah, it can be nice to have comparisons but, really, I don't think that we are anywhere close to ready to make performance part of the development process. Benchmarks are lies that can have a mild informative value that is often hard to interpret. Still, like a good test suite, a benchmark suite can sometimes make it obvious that something is a regression, or an improvement, and it can be improved over time to provide better coverage of real-world use cases.

Anything we can learn from other projects? Tools (benchmark runners and the such) we can borrow? General feedback?

@njsmith would you see the trio repo as the space for this or would it better live in a different project?

@njsmith
Copy link
Member

njsmith commented Aug 14, 2018

Yeah, ok, you're right :-). I've had some bad experiences with benchmark-driven development before, and it's going to be difficult to figure out what benchmarks are actually useful, but speed is important and data is valuable. I would strongly prefer "macrobenchmarks" whenever possible, i.e. let's serve some webpages or transfer some large files or something that at least looks like solving a real problem.

The two tools I know of for tracking benchmarks over time are airspeed velocity (used by e.g. SciPy) and codespeed (used by e.g. speed.python.org). There may be others. I haven't used either seriously myself. My impression is that asv is quicker to get started with, though for serious usage then any solution will need stable infrastructure to do the actual benchmark running. (Travis is not great for this.)

Benchmarks should certainly live in the python-trio/ org. I don't know whether it's better to put them in the trio repo or make a new repo – there are technical tradeoffs because you generally want to run benchmarks across multiple versions of trio. Whatever tool we use may also have a preference.

It would be nice if this infrastructure could be somehow shared or reused across projects. Maybe we want to track the speed of trio-asyncio? Both because we care about trio-asyncio, and because it's a way to get macrobenchmarks for trio itself.

@oremanj
Copy link
Member

oremanj commented May 1, 2019

A Trio benchmark suite would probably want to include some thread benchmarks too, e.g. as discussed in #709.

@njsmith
Copy link
Member

njsmith commented Jun 9, 2019

The discussion in this discourse thread convinced me that we have a fair amount of low-hanging fruit that even some simple microbenchmarks could help us flush out. (E.g., I'm pretty sure we're spending way too much time calling random.shuffle(runnable_tasks), but I don't have any benchmark to prove it!) And I'm less nervous about benchmark-driven development than I once was; I think it's pretty clear now what Trio's commitments to features looks like. For example, it's possible that adding a global deadlock detector (#1085) and pluggable scheduler support (#239 (comment)) will have some small cost, and it would be nice to be able to quantify that cost, but we're going to do them anyway.

So I've been thinking about what could be a useful "first benchmark" for trio core stuff. Some desiderata:

  • more than a timeit call, but mostly focused on basic operations like scheduling and TCP sockets
  • cross-platform
  • it'd be handy if it could do comparison benchmarks to other frameworks like asyncio, just to see
  • needs to scale to very large numbers of concurrent connections
  • it should place some emphasis on tail latency/fairness, not just raw throughput
  • it should collect more fine-grained data than averages – maybe dumping raw data for later analysis, or maybe using something like http://hdrhistogram.org/. And ideally we should be able to look at a timeline of performance to see "glitches", not just summaries at the end.

So what I'm thinking is: an echo server benchmark client:

  • makes a tunable number of new connections per second (poisson distributed? something burstier? search term: "bursty arrival rate modelling", sample reference)
  • Each connection sends a tunable (random?) amount of data
  • We measure latency from initiating the connection until the echo'ed arrival of the first byte of data (first byte latency)
  • We measure latency from when the last byte of data is accepted by the network, until it's echo'ed back (last byte latency – this is about how quickly the server can respond if an active client changes its mind about what it wants to do, to measure bufferbloat etc.)
  • We also measure throughput (not sure the best way – arrival of first byte to arrival of last byte? total data sent/received over all clients?)
  • Maybe we allow multiple simultaneous traffic generators at once – so e.g. you can combine a small number of "elephants" that use lots of bandwidth, plus lots of "mice" that show up in large numbers and make brief connections

Possible further extensions:

  • persistent connections that cycle between idle periods and active periods (similar to HTTP traffic)

This is a good talk on pitfalls with measuring latency: https://www.azul.com/presentation/how-not-to-measure-latency-4/

It seems like it would be difficult to implement this using Python, because the work the client has to do is similar to what the server has to do (or even more once you consider all the statistics recording). Probably we need to use some fast compiled language. Rust or Go are the most obvious options. Swift is plausible too. Not C.

@adontz
Copy link

adontz commented Jun 9, 2019

I think, complexity of implementation highly depends on acceptable complexity of setup.

Not that I teach you how to benchmark, but let me share what I would do.

Taking into account how complex is analysis you want to get, I do not think writing anything new in fast language is a good plan. There are lots of ways to make a mistake. I'd rather get some existing load test tool. Tool should be fast enough, but not smart. I am not sure about how complex are your desired load test scenarios, so can't recommend anything. There a lot's of lists of such tools, I'll post one for completeness https://gist.github.com/denji/8333630

Then I'd run N copies of that tool to make sure client side is not bottleneck. What should be N? For Python server and C client two copies are enough I think, but you may try three. I mean ten is too many, clients will be highly underloaded, but make sure client is never more than 50% loaded. For Python-Python I'd go with at least five. Also, you mentioned multiplatforming, so I'll make a quick Windows related comment. In Windows local connections are processed in different optimized way, so clients should not be local to get realistic numbers. Which means clients can be non-Windows.

So, what I would do for analysis is to run tcpdump on client side and write a tool (in Python?) to process multiple tcpdump files from all clients afterwards to calculate all kinds of interesting numbers. Probably writing dump to tmpfs may give better performance. Given a dump file you can measure all kinds of latency observed by client, number of packets send, etc. See all effects of low level TCP optimizations.

If you have graphite/carbon or similar tool installed and if you are familiar with this tool, then you probably can use it too. Speaking of carbon, network protocol is just a UDP packet with text of "metric-path metric-value metric-timestamp" format. Note metric-timestamp field. You can recover metrics from dump, send them to carbon and get nice graphs for free.

@njsmith
Copy link
Member

njsmith commented Jun 9, 2019

There a lot's of lists of such tools, I'll post one for completeness https://gist.github.com/denji/8333630

I've looked at some lists like this, but all the ones I've found are focused on HTTP load testing. Which is certainly interesting, but it means that we'll largely be benchmarking our code for parsing and generating HTTP. I thought an echo server benchmarker would be nicer for isolating details of fairness, task scheduling, etc. But I haven't found any good pre-existing echo server benchmarking tools :-/.

Then I'd run N copies of that tool [...] process multiple tcpdump files from all clients afterwards

That's an interesting idea! It does sound pretty complex to set up and run in a repeatable way – I'm not sure building a tool to do all this would be easier than writing the tool in Rust or something directly?

There's also a subtlety with measuring latency on persistent connections, where you don't necessarily want to measure the latency from the first packet hitting the wire. This is discussed in the "how not to measure latency" talk I linked above. What happens is for a lot of HTTP benchmarkers in particular, each connection executes a loop like:

while True:
    start_time = now()
    send_request(conn)
    read_response(conn)
    request_latency = now() - start_time

Then the problem is that if the server is really slow to respond one time, then the client sits in read_response for a while... and stops generating load! So even if the server flat out freezes for a second+, then that only counts as one slow request per connection. Their suggestion is to instead make a schedule ahead of time of when you want to send the requests, and you measure their latency from when you would have sent the request. That way if one request/response is slow and clogs up the connection for a while, you count that as slowing down all the requests that are queued up behind it waiting for a chance to go. But only the load generator knows when it wanted to send a request; you can't get this information from tcpdump.

I'm not sure how important this is for the kind of benchmarking I'm talking about with an echo server, but it's an interesting thing to think about.

@adontz
Copy link

adontz commented Jun 9, 2019

I've looked at some lists like this, but all the ones I've found are focused on HTTP load testing.

From one side HTTP is dominating protocol and what most people are talking about, no? I hardly remember anyone discussing SMTP load testing. From another side, yeah, parsing HTTP is not very fast, but is it very realistic. What's the point of benchmark without any protocol parsing? Higher numbers, but very synthetic conditions not correlating to any real life load. If you want to focus on network I/O and not parse HTTP in Python, then try httptools library. I use it, it is written in C/Cython and is pretty fast.

It does sound pretty complex to set up and run in a repeatable way – I'm not sure building a tool to do all this would be easier than writing the tool in Rust or something directly?

I totally agree that setup is relatively complex and relatively hardly repeatable. Relative to single command line. I'm pretty sure ugly bash script may do it, if you stick to particular Linux distro. Major advantage for me is that Trio is Python library and if you write your client in Rust, then probably fewer people will even understand how it works, to say nothing about actually helping. And if you postprocess with Python it seems more friendly to Python community you already have. You may post huge dumps taken in well-known controlled environment for people to review them, make decisions, even find bugs. Maybe I'm too optimistic, but I have good experience with this approach. I'd parse some dumps probably :-D. Also, it lets you gather new stats from old experiments.

There's also a subtlety with measuring latency on persistent connections, where you don't necessarily want to measure the latency from the first packet hitting the wire.

Yes, and that's why I'd stick to existing async/threaded tools, which already address this and many other catches; and run multiple clients anyway, just in case.

Their suggestion is to instead make a schedule ahead of time of when you want to send the requests, and you measure their latency from when you would have sent the request.

So, if I got you correctly, just checking. Let's say we want to send one request per two second, five requests total. And usually request takes one second to process. So we:

  • 0th second. We send request 1.
  • 1st second. We receive response 1. Processing time is 1 second.
  • 2nd second. We send request 2.

And here suddenly server freezes for 3 seconds. So

  • 3rd second. Nothing happens.
  • 4th second. Nothing happens.
  • 5th second. Nothing happens.
  • 6th second. We receive response 2. Processing time is 4 seconds. We send request 3.
  • 7th second. We receive response 3. Processing time is 3 seconds, because we wanted to send at 4th second. We send request 4.
  • 8th second. We receive response 4. Processing time is 2 seconds, because we wanted to send at 6th second. We send request 5.
  • 9th second. We receive response 5. Processing time is 1 second.

With schedule (app view) average latency is 2.2s, without schedule (tcpdump view) average latency is 1.6s.

Now, how realistic is this? What real applications need to send network requests (polling?) at specific intervals and suffer from latency.

Web browsers? - no.
E-Mail clients? - they poll, but latency is not critical at all.
Games? - They use bidirectional communication with pub/sub server notifications.

@smurfix
Copy link
Contributor

smurfix commented Jun 9, 2019

What's the point of benchmark without any protocol parsing? Higher numbers, but very synthetic conditions not correlating to any real life load.

The point is to isolate send+receive from protocol handling, so that we get a clearer picture of exactly which areas are affected when we have a regression.

This is not about real-life load. Real-life loads are disparate enough, your real-life load might not correspond to mine in any meaningful way.

Now, how realistic is this? What real applications need to send network requests (polling?) at specific intervals and suffer from latency.

Forget the "specific intervals" part, that's an artefact of this being just one possible way of doing such a test.

However, there is a material difference between "one request is stalled for 3sec but the server is otherwise responsive" and "the server is blocked and requests queue up". We want to design our tests so that we actually see that difference in the results: a lone 3-second spike is indicative of the former while a spike with a long(ish) trail of lower-latency requests behind it probably points to the latter.

One possible application where this would matter: a file server. Or a streaming media server. Or, yes, games. Sure it's bidirectional with pub/sub, but so what? you still want an indication that the opposing RPC has been blown up immediately (or as immediate as feasible) after firing your BFG.

@njsmith
Copy link
Member

njsmith commented Jun 10, 2019

What's the point of benchmark without any protocol parsing? Higher numbers, but very synthetic conditions not correlating to any real life load

I tend to group benchmarks into two categories: ones that focus on answering specific questions about specific known components of a system under precisely controlled conditions ("microbenchmarks"), and ones that try to estimate actual end-to-end performance in a specific real scenario ("macrobenchmarks"). It's very similar to the distinction between white box or "unit" testing versus black box or "integration" testing. It's not a perfect split, benchmarking is about compromises, but it's a useful one.

The echo server benchmark idea is on the microbenchmark side – you're right that it doesn't correspond to any real scenario, but that's not the goal. The goal is to understand the performance characteristics of Trio's scheduling and socket layer, because the better we can understand them, the better job we can do optimizing them for all kinds of workloads.

My concern about wedging some httptools kinda parser into a benchmark like this is that it doesn't seem like it fits well into either category. We don't care about understanding this specific parser's performance characteristics, so it's not helping us microbenchmark. But it also doesn't produce a realistic end-to-end macrobenchmark – for that we'd need to use a real HTTP server like Quart-Trio, and ideally a real web application and a realistic mix of request types.

Major advantage for me is that Trio is Python library and if you write your client in Rust, then probably fewer people will even understand how it works, to say nothing about actually helping. And if you postprocess with Python it seems more friendly to Python community you already have. You may post huge dumps taken in well-known controlled environment for people to review them, make decisions, even find bugs

If we did go the Rust/Go route, then both those languages make it very easy to distribute self-contained binaries that people can download and run on whatever system they have. And probably we'd make the client itself dump large quantities of raw-ish data, and then do any detailed analysis in Python afterwards, because you're right that just makes more sense :-)

Now, how realistic is this? What real applications need to send network requests (polling?) at specific intervals and suffer from latency.

[Note: I realized that the link to the Azul systems talk in my post above was pointing to the wrong version of the talk, that was missing most of the stuff I'm talking about here! I updated the link to point to the right version.]

Well, here's why people started worrying about this: they noticed that for most popular benchmarking tools, if you run say a 10 second test, and your server glitches out so that it's completely frozen and non-responsive for 8 seconds of the test, then... the benchmarking tools say "99.99% of the time, the latency was under 1 ms". Which makes no sense at all. So the problem seems real.

I'm not sure whether their proposed solution is the best one. I think the root issue is: we mostly care about how our HTTP server performs when there are lots of different clients each trying to make a small number of requests. But most benchmarking tools aren't set up to create lots of independent, simultaneous connections. Instead they try to fake it, by having a few connections, and then sending lots of requests on each connection. This works OK in some cases, but it really falls down when measuring tail latencies. Because with the small-connections-with-lots-of-requests approach, when one request exhibits a really long latency, then the benchmark tool just stops issuing requests entirely for a while. But this is unrealistic, because in a real system with lots of independent clients, if you're slow to respond to one client, that doesn't stop requests arriving from other clients.

I think another way to solve the problem is for the benchmarking client to actually create lots and lots of independent connections.

@adontz
Copy link

adontz commented Jun 10, 2019

Well, here's why people started worrying about this: they noticed that for most popular benchmarking tools, if you run say a 10 second test, and your server glitches out so that it's completely frozen and non-responsive for 8 seconds of the test, then... the benchmarking tools say "99.99% of the time, the latency was under 1 ms". Which makes no sense at all. So the problem seems real.

This is because describing everything with one number is what people on It often do, but also wrong; and percentile thing is just easy to understand oversimplification. Whatever I test, I get much more usable (and by usable I mean decision making) results with classic average and standard deviation. And laggish freezing performance significantly increases standard deviation. So since I learned this, I tell a range of [μ - 3σ, μ + 3σ] and not average. If and only if σ is orders of magnitude smaller than μ, average makes any sense. So even with some assumption of distribution it should be at least two numbers.

Reference material: https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule

If you're not familiar with concept, just try playing with AVERAGE and STDDEV.P in Excel or Google Sheets, you will see that you will not able to simulate unnoticeable lag.

@njsmith
Copy link
Member

njsmith commented Jun 10, 2019

IME most good benchmarking tools report results as a set of quantiles or a histogram, which tells you much more about the shape of the distribution than mean/stddev. The 68-95-99.7 rule is definitely not applicable here – it only applies to data that follows a gaussian distribution, and latency distributions are extremely far from being gaussian. The idea of using quantiles/histograms makes sense; the problem is that they're being computed from bad data.

Which isn't to say mean/stddev are useless. And it's true that in this case mean/stddev would show an anomaly, because they're extremely sensitive to small numbers of outliers. But they'll still wildly underestimate the true mean/stddev of your application's latency, because they're being computed from bad data too.

For example, lets say our tool does 1000 requests in the first second, then gets stuck for 8 seconds, then does 1000 more requests. The regular requests take 1 ms, and during the stuck period it does one request that takes 8 seconds. Then our mean latency is 5 ms, with stddev of ~180 ms:

In [4]: a = np.concatenate(([0.001] * 1000, [8], [0.001] * 1000))

In [5]: np.mean(a)                                                              
Out[5]: numpy.float64(0.004997501249375313)

In [7]: np.std(a)                                                               
Out[7]: numpy.float64(0.17877369067487073)

OTOH if our tool had kept up the 1000 requests/second across the whole 10 second measuring period, then we would have gotten a mean of ~3.2 seconds, and stddev of ~2.6 seconds:

In [16]: b = np.concatenate(([0.001] * 1000, np.linspace(8, 0, 8000), [0.001] * 1000))

In [17]: np.mean(b)                                                             
Out[17]: numpy.float64(3.2002)

In [18]: np.std(b)                                                              
Out[18]: numpy.float64(2.6127482899589345)

So the results are still wildly incorrect.

@adontz
Copy link

adontz commented Jun 10, 2019

Histograms are too much information to me. More confusing, than helping.

If I know that one option gives me 5000 on average and another option gives me 7000, that is oversimplification and I demand more information.

If I know that one option gives me 4000-6000 and another option gives me 3000-10000, I have enough information to make a decision. I can think if I prioritize average or worst case. I can think if I will be able to utilize that lucky moments when performance is the best or not. This information is succinct, but also usable.

If I see two histograms, I honestly get a drink, because comparing hundreds of numbers represented with coorditates and colors is not my thing.

You are right about distribution, but hey, if you get mean of 5 with stddev of 180, it is red light. If stddev is not two orders of magnitude less than mean, I become highly suspicious. Either number of measurements is not enough, or system is so unstable it can't be helped (hello AWS EBS I/O). So while you are totally right, I'd prefer simplified, but not oversimplified model, to have something I can compare.

@belm0
Copy link
Member

belm0 commented Jun 15, 2019

I'm a fan of instrumenting portions of the user's actual code via means as lightweight as possible to get performance information. Mostly thinking about CPU performance here, which is quite relevant given that Python is effectively constrained to a single core. Sampling profilers perform well but are not useful for finding execution times, let alone tail durations-- and any real code is going to have multiple complex paths so having a histogram is important. I already have a little PerfTimer context manager to do this-- just slap it around any block of code-- but the current implementation only makes sense for CPU-bound code (non-async, non-IO blocking). I'd love to expand it to be aware of the Trio scheduler so that it's useful in async situations as well.

I have a fast, streaming, approximate histogram implementation (observation takes ~2 usec using numba implementation on my old laptop) which makes no assumption about the input distribution (i.e. no fretting about bin configuration). I've been meaning to open source it.

@belm0
Copy link
Member

belm0 commented Jun 15, 2019

Along the lines of time(), process_time(), thread_time(), it would be useful to have some time measurement primitives which are aware of Trio task scheduling:

trio.nursery_time() - sum of user and Trio internal CPU time of the current task and any child tasks. Like process/thread_time(), it includes system and user CPU time and excludes OS sleep time (though that shouldn't exist in Trio code).

trio.task_time() - sum of user and Trio internal CPU time of the current task. Child tasks are excluded.

@njsmith
Copy link
Member

njsmith commented Jun 15, 2019

@belm0 Unfortunately, the only way to compute per-task CPU time (with or without child tasks) is to call thread_time at every task switch, which is probably too much overhead for something that's only used in rare occasions. I think this would be a good place to use an Instrument?

I'd also be curious to hear more about what you would use this for... a sampling profiler is a good way to see where your CPU time hotspots are in general, and a regular wall-clock timer is good for measuring latency, but I don't (currently) know any reason to measure, e.g., tail CPU time.

@belm0
Copy link
Member

belm0 commented Jun 15, 2019

I think using Instrument is fine.

I do use a sampling profiler, but it's a blunt tool. It will point out gross inefficiencies, but it's not going to help you with a death by 1,000 cuts situation. Specifically, it's not going to help you decide if an algorithm tweak or new implementation of some often-called small function is a win. That's going to be significant if we want guidance on optimizing Trio's scheduling internals. This is why staring at a sampling profiler for days yielded no useful suggestions for Trio optimizations (#943).

I've found it very useful to measure execution time of a block of code of interest-- over the full session of an application, encompassing all inputs and their effect on code paths (and hence needing a histogram). It's a very sensitive measurement, immune to the fickleness of sampling, and lets you clearly see effects of program changes at any level in the instrumented code's scope.

@belm0
Copy link
Member

belm0 commented Jun 24, 2019

Here's a stab at a task timing function. It's based on perf_counter() rather than thread_time(), as the former is about 10x faster.

def task_perf_counter():
    """
    For the current Trio task, return the value (in fractional seconds) of a
    performance counter, i.e. a clock with the highest available resolution to
    measure a short duration.  It includes time elapsed during time.sleep,
    but not trio.sleep.  The reference point of the returned value is
    undefined, so that only the difference between the results of consecutive
    calls is valid.
    """

https://gist.github.com/belm0/4fdb51ce04c5c26f01d58913ae5c43da

@Tronic
Copy link
Contributor

Tronic commented Aug 14, 2019

I put together a small HTTP server benchmark for asyncio, uvloop and trio: https://paste.zi.fi/p/ahbench.py/view

Results on Macbook with i7-4870HQ: https://paste.zi.fi/p/ahbench-results.txt/view

@njsmith
Copy link
Member

njsmith commented Aug 14, 2019

@Tronic huh, neat! It's obviously a very narrow/specific microbenchmark, but those results are pretty promising as a start (the trio socket-level implementation beats the stdlib on throughput!), and it already raises some interesting questions:

  • Is there any low-hanging fruit in the socket version? I'd be interested to see a profile
  • What's slowing down the trio streams, and can we fix it?
  • Why is trio's latency distribution so much worse than all the asyncio variants?

@njsmith
Copy link
Member

njsmith commented Aug 14, 2019

@Tronic would you be okay with stashing those somewhere more permanent, so people reading this thread later will know what we're talking about?

@Tronic
Copy link
Contributor

Tronic commented Aug 14, 2019

@Tronic
Copy link
Contributor

Tronic commented Aug 14, 2019

Interesting find about Trio latencies. I ran the tests again and it isn't just a random hiccup. I guess some tracing would be needed to find out whether it is unfair scheduling or something more severe that is causing it.

Another interesting bit that one encounters implementing the same program against multiple async I/O libraries, is that from the outside and for the main logic they seem quite similar. The devil is in the details: proper cancellation, not getting stuck on some event that never comes, and sockets that Just Do The Right Thing, and this is where Trio really shines.

I've recently worked with all three asyncio socket APIs, as well as curio and trio. Every problem I've encountered with the other modules seems to be taken care of in Trio, and usually briefly mentioned in its documentation or source code comments. Amazing work!

@belm0
Copy link
Member

belm0 commented Aug 14, 2019

It would be useful to show median latency given that variance.

@Tronic
Copy link
Contributor

Tronic commented Sep 10, 2019

It turns out that random.shuffle is not a bottle neck but setting deadlines is something that definitely needs updating. I already removed separate CancelScopes for different deadlines, in favour of adjusting main nursery deadline with new values (a lot faster), but still even resetting the deadline once per HTTP request slows the server down from 7300 req/s to 5800 req/s.

Profiling summary at sanic-org/sanic#1662 (comment)

I did not look too closely into what exactly _might_change_registered_deadline is doing but if its implementation cannot be otherwise optimized, it might be good to defer updating task structures until the deadlines would actually make a difference (i.e. right before canceling an expired task and at suitable scheduler breaks soon enough but not immediately).

There is also considerable overhead in high level streams, making it worth a look if there are any easy ways to make them faster but not less correct. The profiling was based on plain TCP streams that still have fairly simple logic, not SSL.

The good thing is that, to the degree that cProfile could tell me, the core scheduling of Trio is not slowing things down.

@pquentin
Copy link
Member

Interesting results! Did you try using a sampling profiler such as https://github.com/benfred/py-spy to make sure that cProfile isn't skewing your results too much?

@Tronic
Copy link
Contributor

Tronic commented Sep 11, 2019

Nope but reducing the number of deadline-setting points per request led to clear improvement in benchmark scores, so the profiling results seem accurate in that regard, and the results didn't differ too much with and without profiler.

I suppose that one could not use Trio deadlines and instead work this around in application level, having the requests update "last activity" timestamps whenever they do something (reading the clock is quite fast), and running a background task that sleeps 10 s and wakes up periodically to check if other tasks are past their deadlines, asking Trio to cancel those that are expired. But then, this would be reimplementing specifically the kind of thing that Trio is supposed to handle.

@njsmith
Copy link
Member

njsmith commented Sep 12, 2019

It would be nice to see detailed profiles, so we could tell what part of the deadline code was actually taking up time :-).

I already removed separate CancelScopes for different deadlines, in favour of adjusting main nursery deadline with new values (a lot faster),

Just to check, are you on trio 0.12? that rewrote a bunch of the machinery for adding/removing CancelScopes to make it more efficient. And maybe there are still some unnecessary inefficiencies we haven't noticed because we haven't looked at the right profile.

Beyond that – deadline tracking is notorious source of bottlenecks in servers, so I guess it's not a surprise if it shows up here. The problem is that you need to be able to efficiently find expired deadlines, which requires some kind of sorted or heap-like structure, but you also need to be able to efficiently remove deadlines, which requires some kind of random access. It's very difficult to build a data structure that's super-fast at both of these things at once, and very easy to build one that's accidentally super-inefficient in some corner cases. So there are tons of different designs people use – search for e.g. "timer wheel" to see some examples.

Right now, Trio uses a very simple-minded approach, by storing pending deadlines in a sortedcontainers.SortedDict. This has the advantage that it's reasonably efficient in all cases, and it was very simple to implement because we're letting this other library do all the work. It's almost certainly possible to do something more efficient though. It needs a bit of research – e.g. hashed hierarchical timer wheels are nice, but a bit intricate, and the way Python overhead works, simple code that can take advantage of pre-built C data structures often beats intricate code that has to do complicated things in Python, so I don't know whether they're the best approach or not.

One simple approach that might work well would be: Use the stdlib heapq functions to manage the set of pending timers, with entries like (deadline, arbitrary integer to break ties, scope). The heap structure makes it quick to find the next deadline. But, heaps don't support random access, so we need something extra for that. One approach would be: when a deadline changes or a cancel scope goes out of scope, leave the old entry in place. Whenever we pull something from the queue, first validate it by checking that the deadline still matches; if not, that's a stale entry, and we can discard it. That makes it fast to remove entries. But, now there's still one more problem: if we're creating and discarding deadlines fast enough, the queue might grow arbitrarily large, and take up arbitrary amounts of memory holding stale entries. To avoid that, we can keep a count of the total number of stale entries in the queue (this is cheap – the expensive thing is finding the removing the stale entries). We can use this to calculate what percentage of the queue is taken up by stale entries. And if this gets too large, we that can trigger a kind of "garbage collection" run, where we do a single loop over the queue to remove all the stale entries at once.

It's also possible that the _might_change_registered_deadline code is inefficient simply because it's implemented as a context manager... converting it into a regular method might speed things up.

So there's definitely stuff we can do here... is this something you're interested in working on? I think the first step would be to make a PR with some benchmarks. I can speculate about random ideas for speeding things up, but as you can see it's a pretty complicated topic, so we definitely can't predict ahead of time what will work best. Reproducible benchmarks are great because they let us try stuff and see what works, and compare our results.

@Tronic
Copy link
Contributor

Tronic commented Sep 14, 2019

It'll take a while (estimate two weeks) until I can get back to this but I'll see if I can optimize it and make a PR. This week I'm at movie shootings with no access to the machine where I did the profiling, but even the summary shows that too much time is spent both inside of _might_change_registered_deadline and outside it within the deadline setter, so both should be looked at, and possibly also creation of CancelScopes for this purpose needs some work (those I already eliminated from the benchmark code).

@Tronic
Copy link
Contributor

Tronic commented Oct 6, 2019

PR #1234

@Marco-Sulla
Copy link

Excuse me, maybe is not useful for your case, but do you know pyperf? It's used by py devs too.

@njsmith
Copy link
Member

njsmith commented Jun 15, 2020

I accidentally nerd-sniped my friend @nelhage with @Tronic's HTTP benchmarks that he pasted up-thread (and in particular the nasty latency distribution, which was the part that bothered me the most). It turns out that most of the latency issue was an artifact of the benchmark setup: Tronic/async-http-benchmark#1

In the updated numbers there's definitely still some room to improve Trio's throughput and latency, but now it looks like the kind of thing I'd expect and that we can handle through regular profiling/optimization, not some mysterious totally different latency distribution.

@Tronic
Copy link
Contributor

Tronic commented Jun 16, 2020

Thanks. The benchmark was originally intended only for throughput but since you found the latency results interesting as well, it's good to have that fixed (PR merged).

@belm0
Copy link
Member

belm0 commented Jul 1, 2020

plugging my perf-timer tool in case it could be useful somewhere

  • it's for measuring time of a block of code or function
  • it supports sync, threaded, and Trio async contexts
  • it has pluggable observers and can provide std dev as well as percentiles
  • typical observation time on my old laptop is < 2 microseconds for sync case, < 5 microseconds for Trio async. Using the histogram adds ~4 microseconds. (I have ideas for speeding that up, and will also check out HdrHistogram.)

https://github.com/belm0/perf-timer

@jeremysalwen
Copy link

Just chiming in to express an interest in simple benchmarks of nursery creation/deletion overhead.

@stalkerg
Copy link

any updates in such an area?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests