perf: cgi-mode 1700% improvement #933
Conversation
FYI, to make these flame graphs:

```sh
sudo apt install graphviz
go install github.com/uber-archive/go-torch@latest
git clone git@github.com:brendangregg/FlameGraph.git
cd FlameGraph
# start the server and run a load test
go-torch -u http://localhost:2019/debug/pprof/profile -t 10
```

Clicking on them allows you to dig into them and search them as well. Pretty neat.
Nice! I re-ran my benchmarks on this branch, and cgi mode is now even faster than worker mode for a 'Hello World':

cgi before (num_threads 1):
cgi after (num_threads 1):
worker mode (1 worker):

Great job 👍.
I sent an email to internals asking for help/reviews, as I'm not 100% clear on the impact this will have on TSRM.
I'll review in depth when I'm back from vacation, but we have an extensive test suite, so if tests are green I'm pretty sure this will not cause unexpected issues.
Amazing stuff! But I'm still wondering why there's so little improvement when adding more threads/workers. Is the bottleneck something out of our control (e.g. network bandwidth, OS capacity for handling requests, etc.)?

I also wonder if there is performance decay by the time it gets to 40 threads. At the very least, 1, 10, 20, and 40 threads could be informative (or more granular if you're inclined); maybe it's 60000 at 10 threads and then declines due to some sort of congestion.

Finally, what really needs to be tested is something with an actual PHP workload (even if just synthetic with a loop, like @AlliBalliBaba did in previous tests). Right now we're surely testing Caddy itself and its hand-off to PHP more than anything.

@withinboredom mentioned something in the issue that seemed to suggest that opcache (or something similar) wasn't being used properly (it was recompiling scripts on each request, or something like that). Has this been tested/addressed with any of this? Here's the quote
You are ultimately limited by cores/context switching. Adding more threads won't make it faster past a certain point.
I'm working on it :) There's a lot of other stuff going on here too. For example, on the lock-free handles, I submitted a CL to Google for Go: https://go-review.googlesource.com/c/go/+/600875 so that it can be even better, assuming they even want it (it's approximately 3-4x faster than regular handles under contention, so I hope they want it). I'll be throwing this at some production servers soon. It just takes time; I would only have considered this "done" a few hours ago...
Sorry to get you worried, everything is working fine. 😎
I think this is expected; in the 'Hello World' case the C threads do very little work, so 1-2 threads are enough. Most of the work is happening on the Go side, and Go manages its threads automatically. I'll maybe also try to run those TechEmpower benchmarks locally for a slightly less synthetic comparison when I have time.
Sorry if I came across to anyone as impatient, ungrateful, etc... It's quite the opposite! Just eager, and I thought those few questions were worth bringing up.

PS: maybe what I was missing is what a "thread" is in this context, and also how many physical cores are on the test system. My expectation would be decent performance improvements until there's 1 thread per core, but it is quite possible something more fundamental (OS-level network sockets, etc.) is the bottleneck, and more CPU just doesn't help. Still, I do think a few more test runs with more granular changes in thread count would be informative.

Anyway, I'll leave you folks to it. Keep up the great work! This definitely makes FrankenPHP much more "production-ready" and will encourage more usage and contributions.
Great work!!
It depends on the workload, buffers, network, etc.; nobody is going to have exactly the same numbers. You are somewhat correct: one thread per core is the "best", but that assumes a single thread can saturate a core forever. In reality, there are synchronization constructs at play, and a request eventually ends, so for FrankenPHP it's about 2x cores to really be efficient. That way, when a thread hits a synchronization point where it is waiting on another thread, a ready-to-go thread can come along and immediately start doing its work while the other thread waits.

If your workload is small enough and your network is saturated enough, you can probably have far more threads than cores and get more speed. For production workloads, there will be tradeoffs (like everything else). I suspect (but haven't tested) that 1-2x cores provides the lowest latency, while greater than that provides more throughput, up to a point.
I'm electing to remove the memset at startup and keep it in shutdown. Why? Security! Keeping the request data around after a request may result in a leak of sensitive information in a core dump or something. By setting it to zero when we shut down the request, we can ensure no information is left lying around.
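A minimal sketch of that zero-on-shutdown idea; the `request_ctx` struct and field names here are illustrative stand-ins, not FrankenPHP's actual context:

```c
#include <string.h>

/* Hypothetical stand-in for a per-request server context. */
typedef struct {
    char auth_header[128];
    char cookie_buf[256];
} request_ctx;

/* Zero the context at request shutdown so sensitive data from the last
   request cannot surface in a core dump. With this pass in place, the
   startup memset becomes redundant: every request starts from a context
   that was already zeroed by the previous shutdown. */
void shutdown_request(request_ctx *ctx) {
    memset(ctx, 0, sizeof(*ctx));
}
```

One caveat worth noting: if the memory is freed immediately after, compilers may elide a plain `memset` as a dead store; `explicit_bzero` (or C11's `memset_s`) is the hardened variant for that case.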
Here are the TechEmpower benchmarks for the cgi mode, with FPM/Caddy and FPM/nginx as a baseline. The benchmarks are still mostly synthetic, so take them with a grain of salt. Most notably, FrankenPHP is now faster than FPM + Caddy. Nginx still beats Caddy in the plaintext/JSON categories, probably because it's tuned for the 512 users with keep-alive while the Caddy settings are default. I also ran another round to show that the Laravel Octane worker mode is pretty much on par with the Swoole worker mode.
Thanks for the insights, but this all just seems to confirm everything I've been saying: the test results previously shared just don't seem to make sense. They seem to practically saturate with 1 thread, with very marginal improvements from adding 40 threads. More granular testing across different thread counts would be very instructive... but I'm just being an ungrateful broken record at this point. Anyway, congrats on this immense improvement!
The details are important. In this case, we were testing "hello world", so PHP returns in <1ms. That is what allows a single thread to reach such performance. If PHP were to return in, say, 100-200ms to represent a Symfony application, these numbers would be impossible in a single thread. Since we aren't testing the performance of PHP, but of FrankenPHP (the part that turns a Caddy request into a PHP request and back), testing something like "hello world" is more valuable here and shows that there isn't much (if any) blockage. That being said, there is more room for improvement (but not much), and this PR isn't the end.
Thanks!
Thanks for the clarification. It sounds like the "bottleneck" is, as I suggested previously, at the OS/system level: network sockets, etc. (PS: I do intend to actually contribute to debugging, features, etc. at some point, but am preoccupied with other work right now.)
No, it wasn't. The reason I was testing with "hello world" is precisely because it returns so quickly with no variance. If it returns <1ms in worker mode, why doesn't it in cgi-mode? I discovered that it was on the PHP side by dropping logging statements all over the place to record how long things were taking, trying to hunt down whether it was a PHP problem or a Go problem; they should be identical. Then you identify where it is taking so long. It was at this point that a coworker reminded me of flame graphs after I was being too quiet at lunch (I usually don't stop talking). These are in the PR description, and they give a massive hint of what function to dive into. Then it was a matter of stepping into the function with a debugger.
You haven't tried a callgraph yet?
I assume you mean the one built into pprof? It just shows the total time spent in cgo, but doesn't break it down by C function calls, so it isn't clear/useful. |
This one was a doozy to track down... I had a guess it was on the PHP side of things, and it wasn't until I got this flame graph that I was able to confirm it.

But I also knew @dunglas had optimized the heck out of this code. While I was in there, I went with the obvious `todo` and made the `server_context` into a thread-local variable, using the variable as a sentinel value for whether or not to free the `server_context`. If anything, I am most worried about that part... though we do a `memset` to `0` at the beginning of each request... so it is probably fine.

Aaaannnyway. Along the journey to get here, I learned that having a mutex lock at the beginning of each request effectively makes the requests single-threaded. There's another PR that speeds this up even more... but needless to say, when I made the above changes and didn't see the expected performance increase, I had to dig into `ts_resource(0)`... and lo-and-behold: a mutex. /tableflip

Well, from there, I started investigating what is really going on under the hood, and I'm 85% sure this is ok: delete `ts_free_thread` and only call `ts_resource(0)` once per thread... This appears to work fine. That being said, tests aren't passing locally; we will see if they pass here...

Here's the current flame graph: