perf: cgi-mode 1700% improvement #933
Conversation
FYI, to make these flame graphs:

```sh
sudo apt install graphviz
go install github.com/uber-archive/go-torch@latest
git clone git@github.com:brendangregg/FlameGraph.git
cd FlameGraph
# start the server and run a load test
go-torch -u http://localhost:2019/debug/pprof/profile -t 10
```

Clicking on them allows you to dig into them and search them as well. Pretty neat.
Nice! I re-ran my benchmarks on this branch, and cgi mode is now even faster than worker mode for a 'Hello World':

cgi before (num_threads 1):
cgi after (num_threads 1):
worker mode (1 worker):

Great job 👍.
I sent an email to internals asking for help/reviews, as I'm not 100% clear on the impact this will have on TSRM.
I'll review in depth when I'm back from vacation, but we have an extensive test suite, so if tests are green I'm pretty sure this will not cause unexpected issues.
Amazing stuff! But I'm still wondering why there's so little improvement when adding more threads/workers. Is the bottleneck something out of our control (e.g. network bandwidth, OS capacity for handling requests, etc.)?

I also wonder if there is performance decay by the time it gets to 40 threads. At the very least, 1, 10, 20, and 40 threads could be informative (or more granular if you're inclined); maybe it's 60000 at 10 threads and then declines due to some sort of congestion.

Finally, what really needs to be tested is something with an actual PHP workload (even if just synthetic with a loop, like @AlliBalliBaba did in previous tests). Right now we're surely testing Caddy itself and its hand-off to PHP more than anything.

@withinboredom mentioned something in the issue that seemed to suggest that opcache (or something similar) wasn't being used properly (it was recompiling scripts on each request, or something like that). Has this been tested/addressed with any of this? Here's the quote
You are ultimately limited by cores/context switching. Adding more threads won't make it faster past a certain point.
I'm working on it :) There's a lot of other stuff going on here too. For example, on the lock-free handles, I submitted a CL to Google for Go: https://go-review.googlesource.com/c/go/+/600875 so that it can be even better, assuming they even want it (it's approximately 3-4x faster than regular handles under contention, so I hope they want it). I'll be throwing this at some production servers soon. It just takes time; I would only have considered this "done" a few hours ago...
Sorry to get you worried, everything is working fine. 😎
I think this is expected; in the 'Hello World' case the C threads do very little work, so 1-2 threads are enough. Most of the work is happening on the Go side, and Go manages its threads automatically. I'll maybe also try to run those TechEmpower benchmarks locally for a slightly less synthetic comparison when I have time.
Sorry if I came across to anyone as impatient, ungrateful, etc... It's quite the opposite! Just eager, and I thought those few questions were worth bringing up.

PS: maybe what I was missing is what a "thread" is in this context, and also how many physical cores are on the test system. My expectation would be decent performance improvements until there's 1 thread per core, but it is quite possible something more fundamental (OS-level network sockets, etc.) is the bottleneck, and more CPU just doesn't help. Still, I do think a few more test runs with more granular changes in thread count would be informative.

Anyway, I'll leave you folks to it. Keep up the great work! This definitely makes FrankenPHP much more "production-ready" and will encourage more usage and contributions.
Great work!!
It depends on the workload, buffers, network, etc.; nobody is going to have exactly the same numbers. You are somewhat correct: one thread per core is the "best", but that assumes a single thread can saturate a core forever. In reality, there are synchronization constructs at play, and a request eventually ends, so for FrankenPHP it's about 2x cores to really be efficient. That way, when a thread hits a synchronization point where it is waiting on another thread, a ready-to-go thread can come along and immediately start doing its work while the other thread waits.

If your workload is small enough and your network is saturated enough, you can probably have far more threads than cores and get more speed. For production workloads, there will be tradeoffs (like everything else). I suspect (but haven't tested) that 1-2x cores provides the lowest latency, while greater than that provides more throughput, up to a point.
I'm electing to remove the memset at startup and keep it in shutdown. Why? Security! Keeping the request data around after a request may result in a leak of sensitive information in a core dump or something. By setting it to zero when we shut down the request, we can ensure no information is left lying around.
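A minimal sketch of that zero-on-shutdown idea; the `request_ctx` struct and field names here are illustrative stand-ins, not FrankenPHP's actual context:

```c
#include <string.h>

/* Hypothetical stand-in for a per-request server context. */
typedef struct {
    char auth_header[128];
    char cookie_buf[256];
} request_ctx;

/* Zero the context at request shutdown so sensitive data from the last
   request cannot surface in a core dump. With this pass in place, the
   startup memset becomes redundant: every request starts from a context
   that was already zeroed by the previous shutdown. */
void shutdown_request(request_ctx *ctx) {
    memset(ctx, 0, sizeof(*ctx));
}
```

One caveat worth noting: if the memory is freed immediately after, compilers may elide a plain `memset` as a dead store; `explicit_bzero` (or C11's `memset_s`) is the hardened variant for that case.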
Here are the TechEmpower benchmarks for the cgi mode, with FPM/Caddy and FPM/nginx as a baseline. The benchmarks are still mostly synthetic, so take them with a grain of salt. Most notably, FrankenPHP is now faster than FPM + Caddy. Nginx still beats Caddy in the plaintext/JSON categories, probably because it's tuned for the 512 users with keep-alive while the Caddy settings are default. I also ran another round to show that the Laravel Octane worker mode is pretty much on par with the Swoole worker mode.
Thanks for the insights, but this all just seems to confirm everything I've been saying: the test results previously shared just don't seem to make sense. They seem to practically saturate with 1 thread, with very marginal improvements from adding 40 threads. More granular testing across different thread counts would be very instructive... but I'm just being an ungrateful broken record at this point. Anyway, congrats on this immense improvement!
The details are important. In this case, we were testing "hello world", so PHP returns in <1ms. That is what allows a single thread to reach such performance. If PHP were to return in, say, 100-200ms to represent a Symfony application, these numbers would be impossible in a single thread. Since we aren't testing the performance of PHP, but of FrankenPHP (the part that turns a Caddy request into a PHP request and back), testing something like "hello world" is more valuable here and shows that there isn't much (if any) blockage. That being said, there is more room for improvement (but not much), and this PR isn't the end.
Thanks!
Thanks for the clarification. It sounds like the "bottleneck" is, as I suggested previously, at the OS/system level: network sockets, etc. (PS: I do intend to actually contribute to debugging, features, etc. at some point, but am preoccupied with other work right now.)
No, it wasn't. The reason I was testing with "hello world" is precisely because it returns so quickly with no variance. If it returns <1ms in worker mode, why doesn't it in cgi-mode? I discovered that it was on the PHP side by dropping logging statements all over the place to record how long things were taking, trying to hunt down whether it was a PHP problem or a Go problem; they should be identical. Then you identify where it is taking so long. It was at this point that a coworker reminded me of flame graphs after I was being too quiet at lunch (I usually don't stop talking). These are in the PR description, and they give a massive hint of what function to dive into. Then it was a matter of stepping into the function with a debugger.
You haven't tried a callgraph yet?
I assume you mean the one built into pprof? It just shows the total time spent in cgo, but doesn't break it down by C function calls, so it isn't clear/useful. |
This one was a doozy to track down... I had a guess it was on the PHP side of things, and it wasn't until I got this flame graph that I was able to confirm it.

But I also knew @dunglas had optimized the heck out of this code. While I was in there, I went with the obvious `todo` and made the `server_context` into a thread-local variable, using the variable as a sentinel value for whether or not to free the `server_context`. If anything, I am most worried about that part... though we do a `memset` to `0` at the beginning of each request... so it is probably fine.

Aaaannnyway. Along the journey to get here, I learned that having a mutex lock at the beginning of each request effectively makes the requests single-threaded. There's another PR that speeds this up even more... but needless to say, when I made the above changes and didn't see the expected performance increase, I had to dig into `ts_resource(0)`... and lo-and-behold: a mutex. /tableflip

Well, from there, I started investigating what is really going on under the hood, and I'm 85% sure this is ok: delete `ts_free_thread` and only call `ts_resource(0)` once per thread... This appears to work fine. That being said, tests aren't passing locally; we will see if they pass here...

Here's the current flame graph: