Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: TURN_STATE_GET_CREDENTIALS fails after update from 1.7.2. to 1.10.2 of the SDK #2004

Open
3 tasks done
skndi opened this issue Jun 3, 2024 · 18 comments
Open
3 tasks done
Labels
bug Something isn't working needs-triage v1.10.2

Comments

@skndi
Copy link

skndi commented Jun 3, 2024

Please confirm you have already done the following

  • I have searched the repository for related/existing bug reports
  • I have all the details the issue requires

Please answer the following prompt

  • This issue is replicable using the unmodified sample application

Describe the bug

The state TURN_STATE_GET_CREDENTIALS of the TurnConnectionStateMachine fails due to error 0x58000022.

Expected Behavior

The expected behavior is to get the credentials successfully and discover a viable TURN ICE candidate.

Current Behavior

2024-06-03 12:20:26.645 DEBUG socketConnectionClosed(): Close socket 15
2024-06-03 12:20:26.645 DEBUG connectionListenerReceiveDataRoutine(): recvfrom() failed with errno Connection reset by peer for socket 15
2024-06-03 12:20:26.653 VERBOSE stepStateMachine(): [TURN-0x7f6c054d00] State Machine - Current state: 0x0000000000000004, Next state: 0x0000000000000004, Current local state retry count [1]
, Max local state retry count [0], State transition wait time [0] ms
2024-06-03 12:20:26.653 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000004, Next state: 0x0000000000000004, Current local state retry count [1]
, Max local state retry count [0], State transition wait time [0] ms
2024-06-03 12:20:26.654 WARN socketConnectionSendData(): Warning: Failed to send data. Socket closed already
2024-06-03 12:20:26.654 ERROR iceUtilsSendData(): operation returned status code: 0x58000022
2024-06-03 12:20:26.654 ERROR iceUtilsSendStunPacket(): operation returned status code: 0x58000022
2024-06-03 12:20:26.654 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000004, Next state: 0x0000000000000100, Current local state retry count [0]
, Max local state retry count [0], State transition wait time [0] ms
2024-06-03 12:20:26.654 WARN executeFailedTurnState(): TurnConnection in TURN_STATE_FAILED due to 0x58000022. Aborting TurnConnection
2024-06-03 12:20:26.654 DEBUG stepTurnConnectionStateMachine(): [0x7f6c0863c0] Turn connection state changed from TURN_STATE_GET_CREDENTIALS to TURN_STATE_FAILED.
2024-06-03 12:20:26.654 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000100, Next state: 0x0000000000000080, Current local state retry count [0]
, Max local state retry count [0], State transition wait time [0] ms
2024-06-03 12:20:26.654 DEBUG stepTurnConnectionStateMachine(): [0x7f6c0863c0] Turn connection state changed from TURN_STATE_FAILED to TURN_STATE_CLEAN_UP.
2024-06-03 12:20:26.654 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000080, Next state: 0x0000000000000100, Current local state retry count [0]
, Max local state retry count [0], State transition wait time [0] ms
2024-06-03 12:20:26.654 WARN executeFailedTurnState(): TurnConnection in TURN_STATE_FAILED due to 0x58000022. Aborting TurnConnection
2024-06-03 12:20:26.654 DEBUG stepTurnConnectionStateMachine(): [0x7f6c0863c0] Turn connection state changed from TURN_STATE_CLEAN_UP to TURN_STATE_FAILED.
2024-06-03 12:20:26.654 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000100, Next state: 0x0000000000000100, Current local state retry count [1]
, Max local state retry count [0], State transition wait time [0] ms
2024-06-03 12:20:26.654 WARN executeFailedTurnState(): TurnConnection in TURN_STATE_FAILED due to 0x58000022. Aborting TurnConnection

Reproduction Steps

AWS_KVS_LOG_LEVEL=1 ./kvsWebrtcClientMaster <channel_name> 0 opus h264
and the KVS WebRTC Test Page as viewer, with NAT traversal set to TURN only

WebRTC C SDK version being used

1.10.2

If it was working in a previous version, which one?

1.7.2

Compiler and Version used

GCC 10.2.1

Operating System and version

Embedded aarch64 Linux

Platform being used

aarch64

@skndi skndi added bug Something isn't working needs-triage labels Jun 3, 2024
@disa6302
Copy link
Contributor

disa6302 commented Jun 3, 2024

@skndi ,

Can you provide the complete logs? Has anything changed in you set up?

@skndi
Copy link
Author

skndi commented Jun 7, 2024

Hello @disa6302, of course, here are the logs
https://gist.github.com/skndi/add4220e87540bf719601cd651c1d818
For your second question, we have small adaptations for our specific use case, but to eliminate the possibility of problems from them, I checked with the sample code - the 1.7.2 sample works as expected, the sample from 1.10.2 fails in the TURN_STATE_GET_CREDENTIALS state

@disa6302
Copy link
Contributor

@skndi ,

Thank you for the logs. Although the get-credentials (reason unknown at the moment), want to confirm if you see a connection establishment failure? There are 2 state machines that run in the TURN path, I see logs only for one of them and the other one seems incomplete. Do you more logs than this? Want to check if the ICE agent state machine fails too and what happens to the other TURN state machine loop.

@skndi
Copy link
Author

skndi commented Jun 11, 2024

Hi @disa6302,

The file is opened truncated by default from the gist link and shows only 7000 lines from ~50000, I think that's why you don't see the second TURN state machine logs. There is a link in the gist to the raw file - https://gist.githubusercontent.com/skndi/add4220e87540bf719601cd651c1d818/raw/4d03564900e68ed693bdb0a5d324aa95b1b96616/gistfile1.txt, can you check if it contains all logs you need? From my check I see that both state machines have entered the FAILED state, but they report different errors for it - 0x5a00002c, which is STATUS_TURN_CONNECTION_GET_CREDENTIALS_FAILED and the other one 0x58000022 which is STATUS_SOCKET_CONNECTION_CLOSED_ALREADY. Also at the end of the logs the ICE agent state machine fails with 0x5a00000d - STATUS_ICE_NO_CONNECTED_CANDIDATE_PAIR

@skndi
Copy link
Author

skndi commented Jun 13, 2024

Today I did a git bisect and the commit where the problem starts appearing is the one for the 1.8.0 release - e9f3cab, if that helps your investigation. My next step is to go on the develop branch and perform a bisect there, to see the exact commit which creates the issue.

@disa6302
Copy link
Contributor

@skndi ,

Thank you for the details. What appears to be happening is that the credentials could not be fetched in 5 seconds, which is the timeout on that state. But this timeout existed in v1.7.2. Could you confirm if anything changed for you in the platform?

To test it out, could you also try to increase the value here in 1.10.2 and see what happens?

@skndi
Copy link
Author

skndi commented Jun 14, 2024

Hi @disa6302,

Today I performed a bisect on #1776 which introduces the 1.8.0 release and have found the commit after which the sample stops working - ee5e345. I tried reverting it in the PR branch and the sample worked after that. On Monday I will try if reverting it on v1.10.2 works and will investigate what exactly in the changes makes it stop working on our system.

@skndi
Copy link
Author

skndi commented Jun 21, 2024

Hi,

I have investigated from our side and by a process of elimination reached the conclusion that not calling getaddrinfo in getIpWithHostName() later causes the TURN state machines to fail. In this function if I add a redundant call to getaddrinfo in the branch where parsing the IP address from the TURN server hostname is successful, then ICE succeeds and we get a working stream. I tried researching what getaddrinfo does under the hood and looked into the libc source code, but wasn't able to get an idea why this might be so, since it should only resolve hostnames to IP addresses. Our solution for the moment would be to get back to using getaddrinfo for address resolution. Do you have any insight why this might work that way, since to me it's quite confusing?

@disa6302
Copy link
Contributor

@skndi ,

Thanks for all the deep dive!. This is very interesting. I see this log line in your logs: ICE Server address for 3-74-151-217.t-7d7a0979.kinesisvideo.eu-central-1.amazonaws.com: 3.74.151.217 which indicates successful parsing of IP from the ICE server URL.

What is the IP address that gets logged if you use getaddrinfo? Is it different?

@skndi
Copy link
Author

skndi commented Jun 24, 2024

@disa6302 No, the address found by getaddrinfo is the same. The parsing of the address works fine, but for the later TURN state machine connections to work a call to getaddrinfo must be made, even if it's result is unused. As a test I used the following code:

STATUS getIpWithHostName(PCHAR hostname, PKvsIpAddress destIp)
{
    STATUS retStatus = STATUS_SUCCESS;
    INT32 errCode;
    UINT16 hostnameLen, addrLen;
    PCHAR errStr;
    struct addrinfo *res, *rp;
    BOOL resolved = FALSE;
    struct sockaddr_in* ipv4Addr;
    struct sockaddr_in6* ipv6Addr;
    struct in_addr inaddr;

    CHAR addr[KVS_IP_ADDRESS_STRING_BUFFER_LEN + 1] = {'\0'};

    CHK(hostname != NULL, STATUS_NULL_ARG);
    DLOGI("ICE SERVER Hostname received: %s", hostname);

    hostnameLen = STRLEN(hostname);
    addrLen = SIZEOF(addr);

    // Adding this check in case we directly get an IP address. With the current usage pattern,
    // there is no way this function would receive an address directly, but having this check
    // in place anyways
    if (isIpAddr(hostname, hostnameLen)) {
        MEMCPY(addr, hostname, hostnameLen);
    } else {
        retStatus = getIpAddrFromDnsHostname(hostname, addr, hostnameLen, addrLen);
    }

    // Verify the generated address has the format x.x.x.x
    if (!isIpAddr(addr, hostnameLen) || retStatus != STATUS_SUCCESS) {
        DLOGW("Parsing for address failed for %s, fallback to getaddrinfo", hostname);
        errCode = getaddrinfo(hostname, NULL, NULL, &res);
        if (errCode != 0) {
            errStr = errCode == EAI_SYSTEM ? (strerror(errno)) : ((PCHAR) gai_strerror(errCode));
            CHK_ERR(FALSE, STATUS_RESOLVE_HOSTNAME_FAILED, "getaddrinfo() with errno %s", errStr);
        }
        for (rp = res; rp != NULL && !resolved; rp = rp->ai_next) {
            if (rp->ai_family == AF_INET) {
                ipv4Addr = (struct sockaddr_in*) rp->ai_addr;
                destIp->family = KVS_IP_FAMILY_TYPE_IPV4;
                MEMCPY(destIp->address, &ipv4Addr->sin_addr, IPV4_ADDRESS_LENGTH);
                resolved = TRUE;
            } else if (rp->ai_family == AF_INET6) {
                ipv6Addr = (struct sockaddr_in6*) rp->ai_addr;
                destIp->family = KVS_IP_FAMILY_TYPE_IPV6;
                MEMCPY(destIp->address, &ipv6Addr->sin6_addr, IPV6_ADDRESS_LENGTH);
                resolved = TRUE;
            }
        }
        freeaddrinfo(res);
        CHK_ERR(resolved, STATUS_HOSTNAME_NOT_FOUND, "Could not find network address of %s", hostname);
    }

    else {

        /////////////////////////////////////////////////////////////
        errCode =  getaddrinfo(hostname, NULL, NULL, &res);
        freeaddrinfo(res);
        ////////////////////////////////////////////////////////////

        inet_pton(AF_INET, addr, &inaddr);
        destIp->family = KVS_IP_FAMILY_TYPE_IPV4;
        MEMCPY(destIp->address, &inaddr, IPV4_ADDRESS_LENGTH);
    }

CleanUp:
    CHK_LOG_ERR(retStatus);
    return retStatus;
}

The result of the getaddrinfo call in the else case is not used in any way, but with this code we have a successful stream establishment. So my guess is that getaddrinfo has some side effect, which affects later connections, but I wasn't able to find anything obvious to me in the source code of the function. It's possible there is some quirk in our system also. For our use case it's good enough to go back to just using getaddrinfo, so that's what we decided to do. I'm posting the info in case you want to further research the issue, and since it would be very interesting to me if someone discovered the cause for this.

@m-j-ahmadi
Copy link

Hi,
We have encountered the same problem after updating the SDK from version 1.7.3 to 1.10.2. However, when we downgrade the SDK to 1.9.1, the problem is resolved. Any advice or further updates on this would be greatly appreciated.

@disa6302
Copy link
Contributor

@m-j-ahmadi ,
If 1.9.1 works, perhaps it is not the same issue since the commit @skndi is pointing to has been in the SDK since 1.8.0. Have you tried to increase the value here in 1.10.2 and see what happens? Also, is your system windows too?

@skndi ,
On that note, curious if you tested 1.9.1 as well and saw the same issue.

@samanazadpour
Copy link

Hi @disa6302 I am a colleague of @m-j-ahmadi we were thinking that 1.9.1 works but after testing few times, I could see that sometimes it works and most of the time it does not work. It seems like 1.9.1 does not work as well

@hassanctech
Copy link
Contributor

@skndi Can you please note which hardware platform you're on? You mentioned embedded linux arm64, but a few more details would help are you running linux on an SoC / any further details you can provide?

@samanazadpour
Copy link

@hassanctech we are also experiencing the same issue, not possible to migrate from 1.7.3 to 1.10.2 we had to downgrade and stay with 1.7.3 until we find the solution. we tested on Jetson nano previous version, Jetson Orin new version, and X64 Microsoft Azure VM with Ubuntu 22

@skndi
Copy link
Author

skndi commented Jul 15, 2024

Hi @hassanctech, we are running on an SoC with an Ambarella CV2 chip, the Linux distro is also provided by them.

@skndi
Copy link
Author

skndi commented Jul 18, 2024

As a bit of extra info here is the user agent string user-agent -> AWS-WEBRTC-KVS-AGENT/1.10.2/1.5.2 GCC/10.2.1 Linux/5.4.215 aarch64

@hassanctech
Copy link
Contributor

@skndi Thanks for those details. Can you please share some more details about your setup, are you using openssl or mbedtls? Are you using a static build?

We have a release candidate branch: https://github.com/awslabs/amazon-kinesis-video-streams-webrtc-sdk-c/tree/1.10.3

We've done some tests on Ambarella S5L SoC, specifically with mbedtls option and static build and we are not seeing any degradation of TURN success rates. We now have an cmake option for you to specify the location of the cert.pem on device as well -DKVS_CA_CERT_PATH which you can specify to where the cert will be on the device, including the filename itself. The S5L is definitely less capable than the CV2 chip so I would expect still good performance for you. Can you please try out that branch with mbedtls? I have not tested with openssl yet.

For reference I've used the following to build the SDK:

cmake .. -DBUILD_STATIC_LIBS=ON -DKVS_CA_CERT_PATH=/path/to/cert.pem -DUSE_OPENSSL=OFF -DUSE_MBEDTLS=ON -DBUILD_LIBSRTP_HOST_PLATFORM=x86_64-unknown-linux-gnu -DBUILD_LIBSRTP_DESTINATION_PLATFORM=aarch64-linux-gnu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs-triage v1.10.2
Projects
None yet
Development

No branches or pull requests

6 participants