Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPE when testing connection timeout on Windows #468

Closed
maxtomilov opened this issue Sep 27, 2018 · 16 comments
Closed

IPE when testing connection timeout on Windows #468

maxtomilov opened this issue Sep 27, 2018 · 16 comments
Labels
Type: Bug Indicates an unexpected problem or unintended behavior

Comments

@maxtomilov
Copy link
Contributor

connection_timeout_test.zip

Trying to test SRT connection timeout on Windows results with the following output after srt_connect call:

Running main() from gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SRT
[ RUN      ] SRT.ConnectionTimeoutTest
16:14:32.340661!!FATAL!!: SRT.c: CChannel reported ERROR DURING TRANSMISSION - IPE. INTERRUPTING worker anyway.

Code from connection_timeout_test.cpp (sample is attached):

TEST(SRT, ConnectionTimeoutTest) {
    ASSERT_EQ(srt_startup(), 0);

    int yes = 1;
    int no = 0;

    SRTSOCKET m_client_sock = srt_socket(AF_INET, SOCK_DGRAM, 0);
    ASSERT_NE(m_client_sock, SRT_ERROR);

    ASSERT_NE(srt_setsockopt(m_client_sock, 0, SRTO_RCVSYN, &no, sizeof no), SRT_ERROR); // for async connect
    ASSERT_NE(srt_setsockopt(m_client_sock, 0, SRTO_SNDSYN, &no, sizeof no), SRT_ERROR); // for async connect
    ASSERT_NE(srt_setsockflag(m_client_sock, SRTO_SENDER, &yes, sizeof yes), SRT_ERROR);
    ASSERT_NE(srt_setsockopt(m_client_sock, 0, SRTO_TSBPDMODE, &yes, sizeof yes), SRT_ERROR);

    int pollid = srt_epoll_create();
    int epoll_out = SRT_EPOLL_OUT | SRT_EPOLL_ERR;
    srt_epoll_add_usock(pollid, m_client_sock, &epoll_out);

    sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(5555);

    ASSERT_EQ(inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr), 1);

    sockaddr* psa = (sockaddr*)&sa;

    ASSERT_NE(srt_connect(m_client_sock, psa, sizeof sa), SRT_ERROR);


    // Socket readiness for connection is checked by polling on WRITE allowed sockets.

    {
        int rlen = 2;
        SRTSOCKET read[2];

        int wlen = 2;
        SRTSOCKET write[2];

        ASSERT_NE(srt_epoll_wait(pollid, read, &rlen,
                                 write, &wlen,
                                 -1, // -1 is set for debuging purpose.
                                     // in case of production we need to set appropriate value
                                 0, 0, 0, 0), SRT_ERROR);

        ASSERT_EQ(rlen, 1);
        ASSERT_EQ(read[0], m_client_sock);

    }

    ASSERT_NE(srt_epoll_remove_usock(pollid, m_client_sock), SRT_ERROR);
    srt_close(m_client_sock);
    (void)srt_epoll_release(pollid);
    (void)srt_cleanup();
}

Removing WSAECONNRESET handling as a fatal error after WSARecvFrom call in srtcore/channel.cpp seem to fix the issue, but I'm not sure if it is correct for all cases.

@alexpokotilo
Copy link
Contributor

It looks like nobody use SRT in async mode for Windows :)
A few details:

  1. this code worked on 1.2.0 for both Linux and Windows.
  2. this code doesn't work on 1.3.1 and 1.3.0 for Windows. We have not checked which commit broken Windows version.
  3. this code works on Linux for all versions.
    Let us know if we can bring more details to help with analysis

@ethouris
Copy link
Collaborator

I remember what we have introduced that has caused this message to be printed. We are checking the status code of the call to WSARecvFrom and for certain types of errors we state them fatal, while all others are treated as "again" (there has happened something in a large variety of situations, but the only meaningful information for us is that this call did not provide any data, although you can always try again). We didn't do this before - in 1.2.0 version errors from this function were simply ignored (treated as "again"). Not sure if we aren't simply over-eager with this.

If you turn on the debug logging, you should see the line printed by the following instruction. This should show the exact return code:

HLOGC(mglog.Debug, log << CONID() << "(sys)WSARecvFrom: " << SysStrError(err) << " [" << err << "]");

Would be nice if you get this error, probably this is excessive. This is the list of non-success codes treated as fatal errors:

    static const int fatals [] =
    {
        WSAECONNRESET,
        WSAEFAULT,
        WSAEINVAL,
        WSAENETDOWN,
        WSANOTINITIALISED,
        WSA_OPERATION_ABORTED
    };

@maxtomilov
Copy link
Contributor Author

The error is 10054 - WSAECONNRESET.

@maxtomilov
Copy link
Contributor Author

BTW there is a note for WSAECONNRESET on https://docs.microsoft.com/en-us/windows/desktop/api/winsock2/nf-winsock2-wsarecvfrom page: "For a UDP datagram socket, this error would indicate that a previous send operation resulted in an ICMP "Port Unreachable" message." which may be related to the issue.

@ethouris
Copy link
Collaborator

Ok, if you can please test if commenting this out from this above array (it's channel.cpp around line 547, CChannel::recvfrom) fixes the problem. Might be that it should be removed from the list.

Looks weird, really. "Previous send operation resulted in (...)" as a reason for an error for a receiving operation...

Might be that this list should be seriously revised.

@maxtomilov
Copy link
Contributor Author

Commenting out channel.cpp:547 (WSAECONNRESET entry in "fatals" array) fixes the issue for me (I actually mentioned it in the first comment).

@alexpokotilo
Copy link
Contributor

alexpokotilo commented Sep 27, 2018

Thanks for reply @ethouris.
Just to make sure we're on the same page:
This test expects(and this works for all srt versions on Linux for and for old version of Windows) that we will receive read event to async socket and once we try to read we get a fail indicating connection failed.

Ok, if you can please test if commenting this out from this above array (it's channel.cpp around line 547, CChannel::recvfrom) fixes the problem. Might be that it should be removed from the list.

@maxtomilov has already checked this and it works as expected after the fix. Do you think we need to push this fix ?

Looks weird, really. "Previous send operation resulted in (...)" as a reason for an error for a receiving operation...

maybe it's postponed issue Windows kernel can report only on the next recv operation? I'm just guessing:)

Might be that this list should be seriously revised.

We want to release new 1.3.1 version for Nimble Streamer soon. When do you plan to inspect this code ? We have regression and unit test for SRT and they found the problem.

@ethouris
Copy link
Collaborator

Ok, so push the fix, please. And many thanks for finding it.

Please note that whatever things happen visible in the epoll flags, the call to a function that reads a UDP packet is completely asynchronous to it. The packets in SRT first get received from a UDP socket (and stored into the receiver buffer), then they become ready for extraction, when the ACK is sent for these packets (this is the original UDT design, and not even the most desired thing in live mode, but we haven't yet planned any changes here) , and then the TsbPd thread, when the time comes, makes them ready for delivery. The ACK action is being performed in the RcvQ:worker thread (that is, the same that read from the UDP socket) just once per a timely event, and TsbPd is its own thread. But once the error is reported from the UDP receiving function, the connection is immediately broken, no matter what is currently happening in other contexts.

@alexpokotilo
Copy link
Contributor

alexpokotilo commented Sep 28, 2018

@ethouris, we're always ready to find, fix and contribute !
please read your previous message and my reply to it. My guess is that Windows api cannot fail on async send immediately as it's async but gets ICMP reply before SRT engine call the next receive. At this point Windows kernel already knows that previous send failed but it cannot report this error code to send call as it's already finished. And this is why it returns error on the next receive. it can ignore previous send issue or report it in nearest call for the same socket. It's just my guess. It's not related to SRT design at all.

@ethouris
Copy link
Collaborator

Thanks :)

Well, it probably makes sense how the Windows API functions behave, just there's no clear declaration, which exactly errors inform about inability to continue (makes no sense to retry, or retrying will result in the same error anyway), and which are simply informative about the current situation - which might, but need not be crucial to your program. Windows API lacks that clear information, or maybe I wasn't smart enough to find it.

@alexpokotilo
Copy link
Contributor

Maybe it's just my view to Windows api. Could you please bless our commit :)

@ethouris
Copy link
Collaborator

It is blessed.

@rndi
Copy link
Collaborator

rndi commented Sep 28, 2018

I am not sure this is correct conclusion: the behavior of returning an error to recv() due to incoming ICMP error packet that resulted from a send() on the same socket is basically correct. Linux should be doing the same thing, see https://tools.ietf.org/html/rfc1122, section 4.1.3.3. The difference is that Windows returns WSAECONNRESET and Linux should be returning ECONNREFUSED but the handing is SRT code is currently the same for either. Have you confirmed that in your Linux runs you get ICMP error back when running this test?

@maxtomilov
Copy link
Contributor Author

Yes, I see ICMP messages with "Destination unreachable (Port unreachable)" info after each sendmsg() call while running the test, but recvmsg() always returns EAGAIN, not ECONNREFUSED.

@rndi
Copy link
Collaborator

rndi commented Oct 11, 2018

Hello @alexpokotilo ,

So here is a quick test app that demonstrates that ECONNREFUSED can be returned from recvmsg() on Linux. The app attempts to replicate socket calls made by SRT stack when your test case is executed:
test.cpp.txt

Somehow we do not (usually) get to that condition in SRT on Linux but the possibility of this return code remains, as demonstrated by the above sample, and should be handled in the same way as WSAECONNRESET on Windows.

Could you please add similar handling for ECONNREFUSED to your PR? It should result in RST_AGAIN status return on Linux/OSX, same as for WSAECONNRESET on Windows.

Thank you!

@maxtomilov
Copy link
Contributor Author

Hello @rndi,

Pull request has been updated with appropriate changes for handling ECONNREFUSED.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

5 participants