Performance regression in zmq_poll #3373

Closed

WallStProg opened this issue Jan 24, 2019 · 33 comments

@WallStProg
Contributor

Environment

  • libzmq version (commit hash if unreleased): 4.2.4
  • OS: CentOS 6.9
@WallStProg
Contributor Author

We recently switched the reference implementation of libzmq that we use from 4.2.3 (the latest at the time we started development) to 4.3.0, partly to take advantage of some bug fixes that were made in the meantime. (Many thanks to the contributors for those, including @bluca and @bjovke!)

We have a simple "ping-pong" microbench that we use to test for performance regressions, and we found a rather large regression when we switched: the average latency went from ~150us to ~250us.
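For context, here is a minimal sketch of the kind of ping-pong measurement involved (illustrative only; the actual repro, including the -poll and -control variants, is in the zmqtests repo linked below, and the endpoint and iteration count here are made up):

#include <zmq.h>
#include <chrono>
#include <cstdio>
#include <thread>

static const int iterations = 10000;

int main ()
{
    void *ctx = zmq_ctx_new ();

    //  "Pong" side: echo every message back.
    std::thread pong ([ctx] () {
        void *rep = zmq_socket (ctx, ZMQ_REP);
        zmq_bind (rep, "inproc://pingpong");
        char buf[16];
        for (int i = 0; i < iterations; ++i) {
            int n = zmq_recv (rep, buf, sizeof buf, 0);
            zmq_send (rep, buf, n, 0);
        }
        zmq_close (rep);
    });

    //  "Ping" side: send, wait for the echo, measure round trips.
    void *req = zmq_socket (ctx, ZMQ_REQ);
    std::this_thread::sleep_for (std::chrono::milliseconds (100));  //  let the REP side bind
    zmq_connect (req, "inproc://pingpong");

    char buf[16];
    auto start = std::chrono::steady_clock::now ();
    for (int i = 0; i < iterations; ++i) {
        zmq_send (req, "ping", 4, 0);
        zmq_recv (req, buf, sizeof buf, 0);
    }
    std::chrono::duration<double, std::micro> elapsed =
        std::chrono::steady_clock::now () - start;
    printf ("average round trip: %.1f us\n", elapsed.count () / iterations);

    pong.join ();
    zmq_close (req);
    zmq_ctx_term (ctx);
    return 0;
}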

After some head-scratching and a very helpful conversation with @somdoron we were able to narrow it down to a specific scenario -- viz. when a thread-safe socket is included in the list of sockets being polled.

With that clue I've been able to narrow the regression down to the 4.2.4 release, and also to create a repro, which can be found at https://github.com/WallStProg/zmqtests/tree/master/pingpong.

The results of our tests are as follows:

ping params       4.2.3   4.2.4   (avg latency, µs)
(none)               54      55
-poll                58      60
-poll -control       83     186

That lines up almost perfectly with the results from our in-house microbench, which is to say that zmq_poll takes an additional ~100us as of 4.2.4 when a thread-safe socket is included in the socket list. (This is on CentOS 6.9, 4 x i5-2400 @ 3.10GHz, 16GB RAM.)

Apparently if ZMQ_BUILD_DRAFT_API is defined, the new zmq_poller implementation is used by default, and the "old" zmq_poll implementation is completely unavailable. (Which sort of makes sense, since the old zmq_poll apparently doesn't support thread-safe sockets -- although just because thread-safe sockets are enabled doesn't mean that they'll be used).

What next?

The advice we got from @somdoron is to re-implement our main dispatch thread using zmq_poller, rather than zmq_poll. That makes sense, and since the sockets we poll don't change, it should be relatively straightforward. Presumably there's also a bit of a performance win with zmq_poller as well, but the proof is in the pudding. (We could also replace the CLIENT/SERVER sockets used for IPC signalling, but much prefer sticking with them, as discussed at some length in #2759.)
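For what it's worth, a zmq_poller-based loop for a fixed set of sockets might look roughly like the sketch below. This assumes a libzmq built with draft APIs (ZMQ_BUILD_DRAFT_API defined before including zmq.h); the socket names and shutdown handling are made up for illustration.

#define ZMQ_BUILD_DRAFT_API
#include <zmq.h>

//  Dispatch loop over a regular data socket plus a thread-safe control
//  socket. The poller (and its internal signaling FD) is created once,
//  up front, rather than on every zmq_poll call.
void dispatch_loop (void *data_socket, void *control_socket)
{
    void *poller = zmq_poller_new ();
    zmq_poller_add (poller, data_socket, NULL, ZMQ_POLLIN);
    zmq_poller_add (poller, control_socket, NULL, ZMQ_POLLIN);

    bool running = true;
    while (running) {
        zmq_poller_event_t event;
        if (zmq_poller_wait (poller, &event, -1) == -1)
            break;  //  error handling (EINTR, ETERM) elided
        if (event.socket == control_socket) {
            //  ... handle control/shutdown message, possibly set running = false
        } else {
            //  ... handle data message
        }
    }

    zmq_poller_destroy (&poller);
}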

Having said that, there are a few things that I'm curious about:

  • Did the 4.2.3 code work properly, even with thread-safe sockets? Or was it just a "ticking time bomb"? If that's the case, it seems to make sense to let people know -- others may be innocently using it in a similar way.
  • What specifically is the cause of the regression? Did it fix a bug? (see above)
  • Is there any mechanism to detect performance regressions in any of the CI tests?
    • I confess that I'm completely ignorant of the CI infrastructure used for libzmq, so I don't know what it would take to duplicate the CI tests in a different environment.
    • If there's value in the CI tests, it might make sense to "re-factor" them so they can run outside the CI environment. Again, I have no idea whether that's even possible without Travis, etc.

WallStProg added a commit to WallStProg/zmqtests that referenced this issue Jan 24, 2019
@bluca
Member

bluca commented Jan 24, 2019

Were you using DRAFT builds beforehand too? If so, that might be because of ab1607f.
Before that, with drafts enabled, zmq_poll called the poller implementation directly, but we got reports that performance was much worse, so I changed it to do so only if there are thread-safe sockets being watched; otherwise the old implementation is used.

Now, unless you have a massive number of sockets, looping to check whether there's a thread-safe one shouldn't really cause a performance drop -- all the data there has to be loaded anyway to be used in the same function, so it's unlikely to thrash the cache or anything like that... but I haven't profiled, so I can't really say for sure.

@bluca
Member

bluca commented Jan 24, 2019

No wait, scratch that -- I thought the problem arrived in 4.3.0, but I see now it's 4.2.4, which doesn't have that fix. Try to bisect and see if you can pinpoint the right commit.

@somdoron
Member

Did the 4.2.3 code work properly, even with thread-safe sockets? Or was it just a "ticking time bomb"? If that's the case, it seems to make sense to let people know -- others may be innocently using it in a similar way.

It worked properly, in both versions.

Thread-safe sockets require creating an FD to be signaled when they are ready to be polled.
Because zmq_poll is stateless, this FD has to be created on every zmq_poll call.

In the past, an internal FD was created even if the poll didn't include a thread-safe socket. The change was to only create the FD if a thread-safe socket is actually being polled.

This change is probably what caused the issue.

This is also why zmq_poller can be more efficient: the FD is created once, when the poller is created.

@somdoron
Member

@WallStProg it might be the solution; however, for some users it was actually slower with thread-safe sockets.

@WallStProg
Contributor Author

It worked properly, in both versions.

But in that case, what caused the regression between 4.2.3 and 4.2.4?

@WallStProg
Contributor Author

@bluca wrote:

No wait, scratch that - I thought the problem arrived in 4.3.0, but I see now it's 4.2.4 which doesn't have that fix. Try to bisect and see if you can pin point to the right commit.

The regression exists in 4.3.0 -- we backtracked it to 4.2.4, which is where it first shows up.

@somdoron
Member

I only suspect this one for the moment:

147fe9e#diff-01df48f9d498dba994077a8675169b88

But it doesn't make a lot of sense. I will be able to run the performance test only tomorrow.

@sigiesec
Member

@somdoron In 147fe9e, which you mention, only the code that queries the thread-safe sockopt was extracted into a separate method. I cannot imagine how this could lead to a (perceptible) performance regression, if any.

@somdoron
Member

@sigiesec I agree, as I said, it doesn't make sense. However, it is the only change in the class between the versions.
Anyway, we can check it out together on Thursday :)

@sigiesec
Member

Ah great, you will be at the Hackathon as well :) yes, let's give that a look on Thursday!

@WallStProg
Contributor Author

Hope everyone had a good time at the hackathon -- hopefully I'll be able to attend in the future.

In the meantime, just curious if anyone has been able to confirm my results using code at https://github.com/WallStProg/zmqtests/tree/master/pingpong?

@sigiesec
Member

sigiesec commented Feb 5, 2019

Sorry, we somehow missed giving this a look at the hackathon, but I just did. Unfortunately, I am not able to reproduce your findings. In general, the results seem to vary a lot, but with just a few runs on my machine with ping -poll -control, I observed the opposite: with 4.2.3, most runs (with some outliers) showed a latency of 280-290us, while with 4.2.4 they showed a latency of 160-170us. Maybe this is attributable to the VM I am running in, but without being able to reproduce, I fear I cannot investigate this any further :(

@WallStProg
Contributor Author

Thanks @sigiesec!

I'm guessing that the issue may have to do with the VM setup -- host and/or guest (I assume you're running a Linux guest on a Windows host?).

I did reproduce from source on a different machine, with results consistent with previous results:

ping params       4.2.3   4.2.4   (avg latency, µs)
(none)               47      47
-poll                53      55
-poll -control       50     167

Anyone take a shot at this on native Linux?

@somdoron
Member

somdoron commented Feb 5, 2019

I also tried to run the performance test and got the same performance.
Linux host, no VM.

@WallStProg can you do a git bisect? It will really help to find the issue.

@WallStProg
Contributor Author

Interesting -- never knew about git bisect.

Anyway, here's the culprit:

/home/btorpey/work/zmqtests/libzmq/master/repo $ git bisect bad
3658b2b580c9e8b84058df11cdc1bfd68a5eec14 is the first bad commit
commit 3658b2b580c9e8b84058df11cdc1bfd68a5eec14
Author: Ryan Hileman <lunixbochs@gmail.com>
Date:   Mon Mar 5 08:36:10 2018 -0800

    Problem: pthread condvar timeouts are broken (#2967)
    
    * fix pthread condvar timeouts
    
    This fixes a race condition / hang for threadsafe sockets.
    Context: https://github.com/zeromq/czmq/issues/1873#issuecomment-370279244

:040000 040000 4da8c56bcd21e52fb50d1011f93deaca6e72564d 59aa6055175432f39cab27221c8452968262daa3 M	src
/home/btorpey/work/zmqtests/libzmq/master/repo $ 

FWIW, this is reliably reproduced (i.e., every time) on CentOS 6 & 7, on both bare-metal and VM.

@sigiesec
Member

sigiesec commented Feb 6, 2019

3658b2b

@WallStProg
Contributor Author

OK, apparently the issue with pthread_cond_timedwait being susceptible to the system clock being set back is a thing -- see here and here.

However, the PR seems to be a case of the cure being worse than the disease.

From a quick look at the code it appears that the behavior described in zeromq/czmq#1873 (comment) may have to do with a race condition in the wait method itself:

inline int wait (mutex_t *mutex_, int timeout_)
{
    std::unique_lock<std::mutex> lck (_mtx); // lock mtx
    mutex_->unlock ();                       // unlock mutex_
    ...
    lck.unlock ();   // unlock mtx
    mutex_->lock (); // lock mutex_
    return res;
}

It is possible for another thread to grab the mutex (the _sync member) after the lck.unlock call. Would it not be better to structure the code as follows:

inline int wait (mutex_t *mutex_, int timeout_)
{
    std::unique_lock<std::mutex> lck (_mtx); // lock mtx
    mutex_->unlock ();                       // unlock mutex_
    ...
    mutex_->lock (); // lock mutex_
    lck.unlock ();   // unlock mtx
    return res;
}

The other question, of course, is why using CLOCK_MONOTONIC as the clock attribute adds something like 100-120 microseconds to the latency. But in my testing (from this blog post) the overhead of the CLOCK_MONOTONIC call is something like 32 nanoseconds. How do we get from 32 nanos to 120 micros?
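For anyone who wants to check the raw clock_gettime overhead on their own box, a rough measurement loop along these lines (illustrative only; not the benchmark from the blog post) reports the per-call cost:

#include <time.h>
#include <cstdio>

int main ()
{
    const long iterations = 10 * 1000 * 1000;
    timespec start, now, end;

    clock_gettime (CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; ++i)
        clock_gettime (CLOCK_MONOTONIC, &now);  //  the call being measured
    clock_gettime (CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                        + (end.tv_nsec - start.tv_nsec);
    printf ("clock_gettime(CLOCK_MONOTONIC): %.1f ns/call\n",
            elapsed_ns / iterations);
    return 0;
}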

@somdoron
Member

somdoron commented Feb 6, 2019

I'm not sure which code this is relevant to. There is only one mutex; I'm not sure what lck, mutex_ and _mtx are here...

The condition variable semantics are important: before calling wait you must hold a mutex; wait releases the mutex, waits for a signal (or timeout), re-acquires the mutex, and only then does the function return.

So we don't really call mutex->unlock; that is being done by the condition variable.

Nothing we can do about that.
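For reference, the standard pattern being described looks like this (a generic illustration, not libzmq code): the caller locks the mutex before waiting, and pthread_cond_timedwait releases it while sleeping and re-acquires it before returning.

#include <pthread.h>
#include <time.h>

//  Wait until *flag becomes true or the absolute deadline passes.
//  The mutex is held on entry to and exit from pthread_cond_timedwait;
//  the wait itself releases it so other threads can make progress.
int wait_for_flag (pthread_mutex_t *mutex, pthread_cond_t *cond,
                   const bool *flag, const struct timespec *deadline)
{
    int rc = 0;
    pthread_mutex_lock (mutex);
    while (!*flag && rc == 0)
        rc = pthread_cond_timedwait (cond, mutex, deadline);  //  unlocks, sleeps, relocks
    pthread_mutex_unlock (mutex);
    return rc;  //  0 if signaled, ETIMEDOUT on timeout
}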

I might have an idea for something that will improve performance, though -- let me write it up quickly and we can test.

@somdoron
Member

somdoron commented Feb 6, 2019

@WallStProg can you try the following commit:

somdoron@462dd36

So the trick is: because you are using zmq_poll, the condition variable is only ever invoked with a zero timeout. timedwait with a zero timeout is only going to release the mutex and immediately try to re-acquire it. We can do this without the condition variable, and skip all the condition-variable overhead.
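In other words, something along these lines (an illustrative sketch of the idea, not the actual commit; wait_on_condvar is a hypothetical stand-in for the existing condition-variable path):

inline int wait (mutex_t *mutex_, int timeout_)
{
    if (timeout_ == 0) {
        //  Nothing to sleep on: just release the caller's mutex so the
        //  background thread can enqueue a command, re-acquire it, and
        //  report a timeout, skipping the condition variable entirely.
        mutex_->unlock ();
        mutex_->lock ();
        errno = EAGAIN;
        return -1;
    }
    //  Non-zero timeout: fall through to the normal timed wait.
    return wait_on_condvar (mutex_, timeout_);  //  hypothetical helper
}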

@WallStProg
Contributor Author

I'm not sure which code this is relevant to.

The original problem, from which this "solution" derives, is at zeromq/czmq#1873 (comment). The stated problem is that the _sync member is locked by another thread. This can happen in the existing code (condition_variable_t::wait()) because of the race between lck.unlock and mutex_->lock, and has nothing to do with the semantics of the underlying pthread_cond_wait call.

So we don't really call mutex->unlock,

Yes we do....

that is being done by the condition variable.

No, the condition variable is using a different mutex (lck) that is declared on the stack.

The reason the original "fix" avoids the race is presumably because using CLOCK_MONOTONIC alters the timing such that the race occurs much less frequently (even though it is still there).

Before addressing the performance issue, I'd like to rule out the possibility that there is in fact a race condition in condition_variable_t::wait(). It sure looks like a real race to me.

@WallStProg
Contributor Author

Oops -- I see we're talking about different things here. My comments refer to the wait method defined at lines 133-150 of condition_variable.hpp. I see now that this is the C++11 version of the Windows code -- not the Linux version. I apologize for the confusion, but I continue to maintain that there's a race in that code.

Let me take a look at the Linux version, which I see now is at lines 289-333 of the file.

@somdoron
Member

somdoron commented Feb 6, 2019

Which lck.unlock are you talking about?

There is no race condition. The fix solved that.

The issue was that if you sleep for zero time and never release the mutex, you don't give the background thread a chance to enqueue a command on the pipe. That was the race condition. So the application thread always kept the mutex and never released it, looping forever with a zero timeout.

The fix was correct and fixed that.

@somdoron
Member

somdoron commented Feb 6, 2019

The race condition is between the application thread and the background thread. The application thread needs the mutex for safety of the entire socket. The background thread needs it in order to enqueue commands to the application thread. They always race for the mutex. However, if the application thread never releases the mutex, it doesn't give the background thread a chance to acquire it and enqueue a command (which tells the application thread that messages can be consumed again, e.g. read-activated).

Now, with the bug, the application thread got to the timedwait with a timeout; however, the timeout was incorrectly computed (my code, by the way) and it was actually zero. Once the timeout is zero, maybe what happened is that the OS never released the mutex, or it released it and immediately re-acquired it, beating the background thread -- doing that in a loop forever, which stopped it from getting the command from the background thread. Once the fix was in place, timedwait was actually called with a timeout, giving the background thread plenty of time to acquire the mutex and enqueue the command.

Hope it is clear now.

@WallStProg
Contributor Author

@somdoron -- the patch looks like it solves the performance regression, and also improves performance overall by a small but measurable amount.

@sigiesec -- I do think that the C++11/Windows version may have a race condition (as per #3373 (comment)). I'm unable to build or test on Windows, but just want to point this out for your review.

Thanks!

@somdoron
Member

somdoron commented Feb 6, 2019

@WallStProg great news -- can you report the new numbers? I will make a pull request.
I think it might be related to the kernel version; newer Linux kernels may have better performance for the monotonic clock.

We should at least make sure that the custom Windows condition variable is only used on Windows XP.

@somdoron
Member

somdoron commented Feb 6, 2019

#3395

@WallStProg
Contributor Author

Don't have detailed results at the moment, but the latency went from ~167us to ~41us.

Will post detailed numbers when I get a chance, probably tomorrow.

@WallStProg
Contributor Author

Here are detailed results on two machines -- an HP w/ 4x i5-2400 CPU @ 3.10GHz running CentOS 6, and a brix w/ 4x i7-4770R CPU @ 3.20GHz (w/ HT) running CentOS 7.

[image: detailed latency results for both machines]

Apparently the slowdown in clock_gettime(CLOCK_MONOTONIC, ...) is not an issue on newer OSes (Ubuntu, Fedora), but our shop is, for better or worse, committed to the RedHat/CentOS ecosystem.

@bluca
Member

bluca commented Feb 7, 2019

Good analysis and good fix. Should we close now?

@WallStProg
Contributor Author

Yes -- thanks to all and esp. @somdoron for all the help!

@sigiesec
Member

sigiesec commented Feb 8, 2019

@WallStProg Thanks for pointing out the issue in #3373 (comment). I am not sure, but what you are writing sounds plausible. What I find strange is that there are two mutexes here. I think the only reason is that std::condition_variable::wait(_for) expects a std::unique_lock<std::mutex> as a parameter, but the method is passed a locked zmq::mutex_t. By aligning the types, this should be reducible to a single mutex/lock, and then the race condition goes away as well.
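Something along those lines, for illustration (a rough sketch of the single-lock idea, not the actual change tracked in #3404):

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <cerrno>

class condition_variable_t
{
  public:
    //  The caller passes the std::unique_lock it already holds, so the
    //  condition variable releases and re-acquires that one lock
    //  atomically and no second mutex (or unlock/relock window) is needed.
    int wait (std::unique_lock<std::mutex> &lock_, int timeout_ms_)
    {
        if (timeout_ms_ < 0) {
            _cv.wait (lock_);
            return 0;
        }
        if (_cv.wait_for (lock_, std::chrono::milliseconds (timeout_ms_))
            == std::cv_status::timeout) {
            errno = EAGAIN;
            return -1;
        }
        return 0;
    }

    void broadcast () { _cv.notify_all (); }

  private:
    std::condition_variable _cv;
};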

@sigiesec
Member

sigiesec commented Feb 8, 2019

I opened #3404 to track this further.
