Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem: zmq::signaler_t::send may loop forever #2362

Merged
merged 1 commit into from
Mar 6, 2017
Merged

Problem: zmq::signaler_t::send may loop forever #2362

merged 1 commit into from
Mar 6, 2017

Conversation

10-5
Copy link
Contributor

@10-5 10-5 commented Mar 6, 2017

Solution: restore the wsa_assert statement which was previously removed.

Solution: restore the wsa_assert statement previously removed.
10-5 referenced this pull request Mar 6, 2017
Problem: Assertion failed in zmq::signaler_t::send
@@ -189,6 +189,7 @@ void zmq::signaler_t::send ()
unsigned char dummy = 0;
while (true) {
int nbytes = ::send (w, (char*) &dummy, sizeof (dummy), 0);
wsa_assert (nbytes != SOCKET_ERROR);
if (unlikely (nbytes == SOCKET_ERROR))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the assert above, can nbytes be ever == SOCKET_ERROR?

Copy link
Contributor Author

@10-5 10-5 Mar 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not so sure it won't be. I am watching this in my project, things go pretty well so far. Without this patch, zmq_assert fails very often in my program.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous pull request should be reverted.

version 4.1.5 had this code

    unsigned char dummy = 0;
    int nbytes = ::send (w, (char*) &dummy, sizeof (dummy), 0);
    wsa_assert (nbytes != SOCKET_ERROR);
    zmq_asserwsa_assertt (nbytes == sizeof (dummy));

to me it seems that the problem from at least one of the the reported chrash (#2358) was this

    zmq_assert (nbytes == sizeof (dummy));

and not the wsa_assert

the other bugs mentioned in #2360 , at least one, are from a different platform (ubuntu, not windows)
since this code is not the same on both, I think the problem is somewhere else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems strange, but ok

@bluca bluca merged commit 651f81e into zeromq:master Mar 6, 2017
Copy link
Contributor

@a4z a4z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look at this, but to me it seems that both, send on windows and write on Linux can return without an error, but send less bytes the the expected bytes to send (8 on Linux 1 on Windows)
on Windows this is funny because it basically means that it can send 0 bytes without problem :-)
but it seems to be possible, otherwise there would not be those reports.
but

#2339 reports 4.2.0 line 185, that is windows code, but reported on ubuntu !
so either the reported version, or platform is wrong

#2303 reports no line number, we do not know which of those is the problem

wsa_assert (nbytes != SOCKET_ERROR);
zmq_asserwsa_assertt (nbytes == sizeof (dummy));

#2358 reports the line number and the version 4.1.5
zmq::signaler_t::send() Line 182 + 0x49 bytes

https://github.com/zeromq/zeromq4-1/blob/v4.1.5/src/signaler.cpp#L182

what is for me clearly this line
zmq_assert (nbytes == sizeof (dummy));

so I think the correct solution would be to handle the situation if less bytes than expected have been sent correct.
currently it is asserted that always the requested amount of bytes are sent, but this is clearly not the case

I might create an issue out of this later in the evening, or tomorrow, if no one else has done it meanwhile.

@@ -189,6 +189,7 @@ void zmq::signaler_t::send ()
unsigned char dummy = 0;
while (true) {
int nbytes = ::send (w, (char*) &dummy, sizeof (dummy), 0);
wsa_assert (nbytes != SOCKET_ERROR);
if (unlikely (nbytes == SOCKET_ERROR))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous pull request should be reverted.

version 4.1.5 had this code

    unsigned char dummy = 0;
    int nbytes = ::send (w, (char*) &dummy, sizeof (dummy), 0);
    wsa_assert (nbytes != SOCKET_ERROR);
    zmq_asserwsa_assertt (nbytes == sizeof (dummy));

to me it seems that the problem from at least one of the the reported chrash (#2358) was this

    zmq_assert (nbytes == sizeof (dummy));

and not the wsa_assert

the other bugs mentioned in #2360 , at least one, are from a different platform (ubuntu, not windows)
since this code is not the same on both, I think the problem is somewhere else.

@10-5
Copy link
Contributor Author

10-5 commented Mar 7, 2017

#2339 is on Ubuntu but with wine.

The assertion failed is zmq_assert. But in this problem (at least mine for sure), nbytes is -1, not another value less than sizeof(dummy).

The statement wsa_assert (nbytes != SOCKET_ERROR); doesn't work exactly as it looks. It will check the return value from WSAGetLastError(), and won't fail if it gets 0 or EWOULDBLOCK.

@a4z
Copy link
Contributor

a4z commented Mar 7, 2017

Thanks for the info about wine @nexcvon , this maks at least this clear.
So all 3 Problems are on Windows, one via log as stated in the zmq_assert, the other as reported maybe in the wsa_assert.

All 3 reported indicate that send had a problem. Is ignoring the problem the right solution?

and finally, what does the current code solve?

    while (true) {
        int nbytes = ::send (w, (char*) &dummy, sizeof (dummy), 0);
        wsa_assert (nbytes != SOCKET_ERROR);
        if (unlikely (nbytes == SOCKET_ERROR))
            continue;
        zmq_assert (nbytes == sizeof (dummy));
        break;
    }

it does not solve #2339
and I am unsure if it solves any problem at all, except that the code still looks like it can endless loop.

wsa_assert does not work as you say, it checks if there is some error string, what is very likely to be the case.
https://github.com/zeromq/libzmq/blob/master/src/err.hpp#L74

I really suggest to revert the code to the state as it has been before #2361 ,

  • rephrase the problem that lead to the assertion failure
  • try to reproduce it
  • if possible, create a test case for it
  • come than with a solution that has no complex loop where code reader need to reason about what this does for uncertain amount of time, just to find an potential endless loop.

Please. Thanks!

@10-5
Copy link
Contributor Author

10-5 commented Mar 7, 2017

I can reproduce this problem using pyzmq 16.0.2 (with zmq 4.1.6). The error code is EWOULDBLOCK (I was wrong about it may be 0), so wsa_assert doesn't fail (see https://github.com/zeromq/libzmq/blob/master/src/err.cpp#L95 and https://github.com/zeromq/libzmq/blob/master/src/err.cpp#L119).

I don't known why EWOULDBLOCK is returned, and how it should be handled correctly. Retrying seems to be better than crashing for now, before someone reports a infinite loop.

I am not familiar with sockets. I hope @a4z you can help improve this.

Following is how I reproduce this problem using Python. I will try to get a c++ version working.

Modified zmq::signaler_t::send:

#elif defined ZMQ_HAVE_WINDOWS
    unsigned char dummy = 0;
    while (true) {
        int nbytes = ::send (w, (char*) &dummy, sizeof (dummy), 0);
        if (nbytes == SOCKET_ERROR)
            printf("WSAGetLastError() returns %d\n", WSAGetLastError());
        wsa_assert (nbytes != SOCKET_ERROR);
        if (unlikely (nbytes == SOCKET_ERROR))
            continue;
        zmq_assert (nbytes == sizeof (dummy));
        break;
    }
#else

Python code:

import time
import zmq

url = "tcp://127.0.0.1:22222"
context = zmq.Context.instance()

for i in range(1000):
    print(i)

    router = context.socket(zmq.ROUTER)
    router.bind(url)

    dealer = context.socket(zmq.DEALER)
    dealer.connect(url)

    dealer.send(b"hello")
    msgs = router.recv_multipart()
    assert msgs[1] == b"hello"

    dealer.close()
    router.close()
    time.sleep(0.1)

Output:

0
1
2
3
4
<...>
201
202
203
204
WSAGetLastError() returns 10035
205
206
207
208
<...>
446
447
448
449
WSAGetLastError() returns 10035
450
WSAGetLastError() returns 10035
451
452
453
454
<...>
576
577
578
579
WSAGetLastError() returns 10035
580
WSAGetLastError() returns 10035
581
582
583
584
<...>
560
661
662
663
WSAGetLastError() returns 10035
664
665
666
667
<...>
944
945
946
947
WSAGetLastError() returns 10035
948
WSAGetLastError() returns 10035
949
950
951
WSAGetLastError() returns 10035
952
WSAGetLastError() returns 10035
953
WSAGetLastError() returns 10035
954
955
956
957
<...>
995
996
997
998
999

@a4z
Copy link
Contributor

a4z commented Mar 7, 2017

I am at work with less time unfortunately today, and no windows around me
but I have no idea why your socket might return once every x times it feels bad, but that's leagal.

@nexcvon , could you please test if this solve it?
It's untested / not compiled code, so maybe adoption required
The key is, it is OK to retry, but please not for ever :-)

    unsigned char dummy = 0;  
    
    const int max_retry = 3 ;
    int nbytes  = 0 ;
    for (int i = 0; i < max_retry ; ++i) {
    
        nbytes = ::send (w, (char *) &dummy, sizeof (dummy), 0);
    
        if (nbytes == SOCKET_ERROR);
            continue ;      
      
        if(nbytes == sizeof (dummy))
            break      
    }
    
    wsa_assert (nbytes != SOCKET_ERROR);
    zmq_assert (nbytes == sizeof (dummy));

ps: looking at the code, the Linux version could also need some adoption,
since it it theoretical possible to not send all 8 bytes at once, (MS sends just 1 byte)
in this case we should retry again with the remaining bytes.
but it seems to be super unlikely that this ever happens.
would you like a issue for this @bluca ?
I will not make one as long it does not crash on my machine or I'll fix it someone in the future on the fly

@bluca
Copy link
Member

bluca commented Mar 7, 2017

This is the signaler, so remember that on Linux it's not a TCP socket like on Windows. I honestly couldn't think of any reason why it should succeed to send, but less than 8 bytes.

But if you want to send a PR to check for that special case please feel free to do so, as long as the overall functionality when there's an actual error is not changed.

@a4z
Copy link
Contributor

a4z commented Mar 7, 2017

yes, as unlikely that this ever happen as that 'll touch the linux code now ;-)

@rabrahamkns
Copy link

update on #2358 - I changed our code ( we use 4.1.5) only for signaler_t::send() to the code in this commit, and we do not see the abort anymore. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants