Performance regression in zmq_poll #3373
We recently switched the reference implementation of libzmq that we use from 4.2.3 (latest at the time we started development) to 4.3.0, partly to take advantage of some bug fixes that were made in the meantime (and many thanks to the contributors for those, including @bluca and @bjovke!). We have a simple "ping-pong" microbench that we use to test for performance regressions, and we found a rather large regression when we switched: the average latency went from ~150us to ~250us. After some head-scratching and a very helpful conversation with @somdoron we were able to narrow it down to a specific scenario -- viz. when a thread-safe socket is included in the list of sockets being polled. With that clue I've been able to narrow the regression down to the 4.2.4 release, and also to create a repro, which can be found at https://github.com/WallStProg/zmqtests/tree/master/pingpong. The results of our tests are as follows:
That lines up almost perfectly with the results from our in-house microbench, which is to say that zmq_poll takes an additional ~100us as of 4.2.4 when a thread-safe socket is included in the socket list. (This is on CentOS 6.9, 4 x i5-2400 @ 3.10GHz, 16GB RAM.) What next? The advice we got from @somdoron is to re-implement our main dispatch thread using zmq_poller rather than zmq_poll. That makes sense, and since the sockets we poll don't change, it should be relatively straightforward. Presumably there's also a bit of a performance win with zmq_poller as well, but the proof is in the pudding. (We could also replace the CLIENT/SERVER sockets used for IPC signalling, but we'd much prefer sticking with them, as discussed at some length in #2759.) Having said that, there are a few things that I'm curious about:
|
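For reference, below is a minimal sketch of the polling scenario in question: one plain socket and one thread-safe socket in the same zmq_poll() call. Socket types, endpoints, and loop structure are illustrative assumptions (the actual repro is at the link above), and ZMQ_CLIENT requires a libzmq build with the DRAFT APIs enabled.

```cpp
// Minimal sketch of the scenario: a plain socket and a thread-safe socket in
// the same zmq_poll() call.  Socket types and endpoints are illustrative;
// ZMQ_CLIENT requires a libzmq built with DRAFT APIs.
#define ZMQ_BUILD_DRAFT_API
#include <zmq.h>
#include <stdio.h>

int main ()
{
    void *ctx = zmq_ctx_new ();
    void *sub = zmq_socket (ctx, ZMQ_SUB);        // "classic" socket
    void *client = zmq_socket (ctx, ZMQ_CLIENT);  // thread-safe socket

    zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0);
    zmq_connect (sub, "tcp://127.0.0.1:5556");
    zmq_connect (client, "tcp://127.0.0.1:5557");

    zmq_pollitem_t items[] = {
        {sub, 0, ZMQ_POLLIN, 0},
        {client, 0, ZMQ_POLLIN, 0},  // this entry is what triggers the slow path
    };

    //  Dispatch loop: as of 4.2.4 each pass through zmq_poll() pays the
    //  extra cost whenever a thread-safe socket is in the list.
    for (int i = 0; i < 100; i++) {
        int rc = zmq_poll (items, 2, 100);  // timeout in milliseconds
        if (rc < 0)
            break;
        if (items[0].revents & ZMQ_POLLIN)
            printf ("sub readable\n");
        if (items[1].revents & ZMQ_POLLIN)
            printf ("client readable\n");
    }

    zmq_close (client);
    zmq_close (sub);
    zmq_ctx_term (ctx);
    return 0;
}
```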
Were you using DRAFT builds beforehand too? If so that might be because of ab1607f. Now, unless you have a massive number of sockets, looping to check if there's a thread-safe one shouldn't really cause a performance drop -- all the data there has to be loaded anyway to be used in the same function, so it's unlikely to thrash the cache or something like that... but I haven't profiled so I can't really say for sure |
No wait, scratch that - I thought the problem arrived in 4.3.0, but I see now it's 4.2.4, which doesn't have that fix. Try to bisect and see if you can pinpoint the right commit. |
It worked properly in both versions. Thread-safe sockets require creating an FD to be signaled when they are ready to be polled. In the past, an internal FD was created even if the poll didn't include a thread-safe socket. The change is to only create an FD if a thread-safe socket is actually being polled. This change is probably what caused the issue. This is also why zmq_poller can be more efficient: the FD is created once, when the poller is created. |
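To illustrate that point, here is a rough sketch of a dispatch loop using the zmq_poller DRAFT API, where the poller (and with it the internal signaling FD) is created once rather than on every poll call; the socket handles and timeout are placeholders.

```cpp
// Rough sketch of a dispatch loop built on the zmq_poller DRAFT API: the
// poller, and the internal signaling FD needed for thread-safe sockets, is
// created once up front instead of being set up on every poll call.
// 'sub' and 'client' are placeholder socket handles created elsewhere.
#define ZMQ_BUILD_DRAFT_API
#include <zmq.h>
#include <errno.h>

void dispatch_loop (void *sub, void *client)
{
    void *poller = zmq_poller_new ();
    zmq_poller_add (poller, sub, NULL, ZMQ_POLLIN);
    zmq_poller_add (poller, client, NULL, ZMQ_POLLIN);

    zmq_poller_event_t event;
    for (;;) {
        int rc = zmq_poller_wait (poller, &event, 100);  // 100 ms timeout
        if (rc == -1) {
            if (zmq_errno () == EAGAIN)
                continue;   // timed out, poll again
            break;          // context terminated or other error
        }
        if (event.socket == sub) {
            //  handle the plain socket
        } else if (event.socket == client) {
            //  handle the thread-safe socket
        }
    }

    zmq_poller_destroy (&poller);
}
```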
@WallStProg it might be the solution, however, for some, it was actually slower with thread-safe sockets. |
But in that case, what caused the regression between 4.2.3 and 4.2.4? |
@bluca wrote:
The regression exists in 4.3.0 -- we backtracked it to 4.2.4, which is where it first shows up. |
I only suspect this one for the moment: 147fe9e#diff-01df48f9d498dba994077a8675169b88 But it doesn't make a lot of sense. I will be able to run the performance test only tomorrow. |
@sigiesec I agree, as I said, it doesn't make sense. However, it is the only change in the class between the versions. |
Ah great, you will be at the Hackathon as well :) yes, let's give that a look on Thursday! |
Hope everyone had a good time at the hackathon -- hopefully I will be able to attend in the future. In the meantime, just curious if anyone has been able to confirm my results using the code at https://github.com/WallStProg/zmqtests/tree/master/pingpong? |
Sorry, we somehow missed giving this a look at the hackathon, but I just did. Unfortunately, I am not able to reproduce your findings. In general, the results seem to vary a lot, but with just a few runs on my machine with ping -poll -control, I observed the opposite: with 4.2.3, most runs (with some outliers) showed a latency of 280-290us, while with 4.2.4, they showed a latency of 160-170us. Maybe this is attributable to the VM I am running in, but without being able to reproduce, I fear I cannot investigate this any further :( |
Thanks @sigiesec! I'm guessing that the issue may have to do with the VM setup -- host and/or guest (I assume you're running a Linux guest on a Windows host?). I reproduced from source on a different machine, with results consistent with my previous results:
Anyone take a shot at this on native Linux? |
I also tried to run the performance test and got the same performance. @WallStProg can you do a git bisect? It would really help to find the issue. |
Interesting -- never knew about git bisect. Anyway, here's the culprit:
FWIW, this is reliably reproduced (i.e., every time) on CentOS 6 & 7, on both bare-metal and VM. |
OK, so apparently that commit is where the issue comes from. However, the PR seems to be a case of the cure being worse than the disease. From a quick look at the code, it appears that the behavior described in zeromq/czmq#1873 (comment) may have to do with a race condition in the wait method itself:
It is possible for another thread to grab the mutex between the unlock and the subsequent wait on the condition variable, in which case the signal can be missed.
The other question, of course, is why using CLOCK_MONOTONIC as the clock attribute adds something like 100-120 microseconds to the latency. But in my testing (from this blog post) the overhead of the CLOCK_MONOTONIC call is something like 32 nanoseconds. How do we get from 32 nanos to 120 micros? |
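One way to sanity-check the ~32 ns figure on a particular machine is a tight loop around clock_gettime(CLOCK_MONOTONIC), as in the throwaway micro-benchmark below; the result depends heavily on the kernel and on whether the call is serviced by the vDSO or a real syscall.

```cpp
// Throwaway micro-benchmark for the raw cost of clock_gettime(CLOCK_MONOTONIC).
// Results are machine- and kernel-dependent (vDSO vs. real syscall).
#include <stdio.h>
#include <time.h>

int main ()
{
    const long iterations = 10 * 1000 * 1000;
    struct timespec start, now, end;

    clock_gettime (CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        clock_gettime (CLOCK_MONOTONIC, &now);
    clock_gettime (CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                        + (end.tv_nsec - start.tv_nsec);
    printf ("%.1f ns per call\n", elapsed_ns / iterations);
    return 0;
}
```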
I'm not sure which code this is relevant to? There is only one mutex; I'm not sure what the lck, mutex and mtx are here... The condition variable semantics are important: before calling wait, you must enter a mutex; wait releases the mutex, waits for a signal (or timeout), re-enters the mutex, and only then does the function return. So we don't really call mutex->unlock -- that is being done by the condition variable. Nothing we can do about that. I might have an idea for something that will improve performance though; let me write it quickly and we can test. |
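For context, the contract described here is the standard condition-variable pattern; a small self-contained example (not libzmq code) looks like this:

```cpp
// Stand-alone illustration of the condition-variable contract described
// above (not libzmq's code): the caller enters the mutex, wait_for()
// releases it while blocked and re-acquires it before returning.
#include <chrono>
#include <condition_variable>
#include <mutex>

struct waiter_t
{
    std::mutex mtx;
    std::condition_variable cv;
    bool ready = false;

    //  Returns true if signalled, false on timeout.
    bool wait (int timeout_ms)
    {
        std::unique_lock<std::mutex> lock (mtx);  // enter the mutex
        //  wait_for() unlocks 'mtx' while sleeping and relocks it before
        //  returning; the caller never calls unlock() explicitly.
        return cv.wait_for (lock, std::chrono::milliseconds (timeout_ms),
                            [this] { return ready; });
    }

    void signal ()
    {
        {
            std::lock_guard<std::mutex> guard (mtx);
            ready = true;
        }
        cv.notify_one ();
    }
};
```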
@WallStProg can you try the following commit: The trick is that because you are using zmq_poll, the condition variable is only ever invoked with a zero timeout. A timedwait with a zero timeout is only going to release the mutex and immediately try to re-acquire it. We can do this without the condition variable, and skip all the condition variable overhead. |
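The zero-timeout shortcut described above could look roughly like the sketch below; this illustrates the idea rather than the actual commit, and the helper and its names are made up.

```cpp
// Sketch of the zero-timeout shortcut (an illustration of the idea, not the
// actual commit): with a timeout of 0 a timed wait would only release the
// mutex and immediately re-acquire it, so the condition variable can be
// bypassed entirely.
#include <chrono>
#include <condition_variable>
#include <mutex>

//  Hypothetical helper; 'cv' and 'lock' stand in for the socket's
//  synchronization members.  Returns 0 if signalled, -1 on timeout.
inline int timed_wait (std::condition_variable &cv,
                       std::unique_lock<std::mutex> &lock, int timeout_ms)
{
    if (timeout_ms == 0) {
        //  Fast path: briefly release the mutex so other threads (e.g. the
        //  background I/O thread) get a chance to grab it, then re-acquire.
        lock.unlock ();
        lock.lock ();
        return -1;  // behaves like an immediate timeout
    }
    if (cv.wait_for (lock, std::chrono::milliseconds (timeout_ms))
        == std::cv_status::timeout)
        return -1;
    return 0;
}
```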
The original problem, from which this "solution" derives, is at zeromq/czmq#1873 (comment). The stated problem is that the poll ends up spinning with a zero timeout and never receives commands from the background thread.
Yes we do....
No, the condition variable is using a different mutex (the internal mtx) than the one passed in. The reason the original "fix" avoids the race is presumably because using CLOCK_MONOTONIC alters the timing such that the race occurs much less frequently (even though it is still there). Before addressing the performance issue, I'd like to rule out the possibility that there is in fact a race condition in the wait method. |
Oops -- I see we're talking about different things here. My comments refer to the Windows implementation. Let me take a look at the Linux version, which I see now is at lines 289-333 of the file. |
Which lck.unlock are you talking about? There is no race condition -- the fix solved that. The issue was that when you sleep for zero time and never release the mutex, you don't give the background thread a chance to enqueue a command on the pipe. That was the race condition: the application thread always kept the mutex and never released it, looping forever with a zero timeout. The fix was correct and fixed that. |
The race condition is between the application thread and the background thread. The application thread needs the mutex for the safety of the entire socket; the background thread needs it in order to enqueue commands to the application thread. They always race for the mutex. However, if the application thread never releases the mutex, it doesn't give the background thread a chance to enter it and enqueue a command (which tells the application thread that messages can be consumed again, e.g. read-activated). Now, with the bug, the application thread got to the timedwait with a timeout, but the timeout was used incorrectly (my code, by the way) and was actually zero. With a zero timeout, either the OS never released the mutex, or it released it and immediately re-acquired it, beating the background thread -- and it did that in a loop forever, which stopped it from ever getting the command from the background thread. Once the fix was in, timedwait was actually called with a real timeout, giving the background thread plenty of time to enter the mutex and enqueue the command. Hope it is clear now. |
@somdoron -- the patch looks like it solves the performance regression, and also improves performance overall by a small but measurable amount. @sigiesec -- I do think that the C++11/Windows version may have a race condition (as per #3373 (comment)). I'm unable to build or test on Windows, but just want to point this out for your review. Thanks! |
@WallStProg great news, can you report the new numbers? I will make a pull request. We should at least make sure that the Windows custom condition variable is only used on Windows XP. |
Don't have detailed results at the moment, but the latency went from ~167us to ~41us. Will post detailed numbers when I get a chance, probably tomorrow. |
Here are detailed results on two machines -- an HP w/4x i5-2400 CPU @ 3.10GHz running CentOS 6, and a Brix w/4x i7-4770R CPU @ 3.20GHz (w/HT) running CentOS 7. Apparently the slowdown in clock_gettime(CLOCK_MONOTONIC, ...) is not an issue in newer OSes (Ubuntu, Fedora), but our shop is, for better or worse, committed to the RedHat/CentOS ecosystem. |
Good analysis and good fix. Should we close now? |
Yes -- thanks to all and esp. @somdoron for all the help! |
@WallStProg Thanks for pointing out the issue in #3373 (comment). I am not sure, but what you are writing sounds plausible. What I find strange is that there are two mutexes here. I think the only reason is that std::condition_variable::wait(_for) expects a std::unique_lock<std::mutex> as a parameter, but the method is passed a locked zmq::mutex_t. By aligning the types, this should be reducible to a single mutex/lock, and then the race condition goes away as well. |
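A sketch of that single-mutex direction (with hypothetical names, not libzmq's actual classes) might look like this:

```cpp
// Sketch of the single-mutex direction suggested above (hypothetical names,
// not libzmq's actual classes): if the caller already holds a
// std::unique_lock<std::mutex>, the wait can reuse the same mutex that
// protects the socket, and the second internal mutex disappears.
#include <chrono>
#include <condition_variable>
#include <mutex>

class socket_sync_t
{
  public:
    //  Called with the socket mutex already held via 'lock'.
    //  Returns 0 if signalled, -1 on timeout.
    int wait (std::unique_lock<std::mutex> &lock, int timeout_ms)
    {
        if (_cv.wait_for (lock, std::chrono::milliseconds (timeout_ms))
            == std::cv_status::timeout)
            return -1;
        return 0;
    }

    void broadcast () { _cv.notify_all (); }

  private:
    std::condition_variable _cv;
};
```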
I opened #3404 to track this further. |