Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[apps] Added url percent decoding to url query string keys and values. #2015

Merged
merged 1 commit into from
Sep 18, 2023

Conversation

jschultz410
Copy link
Contributor

Apps like srt-live-transmit take their source and destination parameters as urls. Urls require several common characters (e.g. - space) to be percent encoded. This PR adds a simple ASCII / Latin-1 percent decoding function to uriparser.cpp that I applied to the url query string keys and values, so that the url encoding doesn't persist in SRT fields such as streamid.

I don't apply the decoding to the whole url, just the query keys + vals, and I didn't write a matching encoder for re-building a uri out of the decoded components. I'm not even sure this is really the right spot to be doing this. Maybe it should instead just be done when passing the url parameters in to build an actual SRT connection rather than inside the parser's data itself?

Anyway, this PR partially addresses issues #1173 and #1871 and might be useful in a more full fledged approach to handling url encoding in the future.

@codecov-commenter

This comment has been minimized.

@maxsharabayko maxsharabayko added [apps] Area: Test applications related improvements Type: Enhancement Indicates new feature requests labels May 21, 2021
@maxsharabayko maxsharabayko added this to the v1.4.4 milestone May 21, 2021
@jschultz410
Copy link
Contributor Author

jschultz410 commented May 21, 2021

I looked at your unit test that failed. Your test and/or the thing it is testing is likely flaky because the code I changed should have absolutely zero impact on whether or not your thread synchronization and signaling primitives are working as expected.

The failed unit test was:

175: Test command: /home/runner/work/srt/srt/_build/test-srt "--gtest_filter=SyncEvent.WaitForTwoNotifyOne"

@maxsharabayko
Copy link
Collaborator

I looked at your unit test that failed. Your test and/or the thing it is testing is likely flaky because the code I changed should have absolutely zero impact (...)

Yes, please ignore those. They work at the edge of success :)

BTW I had some unfinished unit tests for the URI parser here.
Might be helpful if you could reuse them to test your change.

As I understand, the value part of say streamid=my%20stream is expected to be interpreted as my stream, right?

@jschultz410
Copy link
Contributor Author

jschultz410 commented May 21, 2021

As I understand, the value part of say streamid=my%20stream is expected to be interpreted as my stream, right?

Right. If the streamid's value contains characters that have to be encoded in a url, like a space, then when we actually get down into the SRT library we want the streamid that gets passed into the socket and to the other side to be decoded into utf-8 encoding instead, just like in your example.

I kind of did a hack and just did the decoding as part of the parsing of the uri, which isn't really great as I mentioned, because it doesn't support well things like rebuilding the uri back out of all of its components (i.e. - I didn't re-encode in the reverse direction). It also isn't really clear to me if the components of the uri should be internally stored and made accessible as decoded strings or left as percent encoded.

What is abundantly clear to me, however, is that when the uri parameters get passed into the SRT library to init the socket, etc., at that point those strings definitely should be percent decoded into ASCII / utf-8.

One other weakness here is that I didn't try to tackle percent encoded multi-byte unicode / utf-8 characters. Because we are using string's everywhere, which I believe only are single byte character strings, the necessary changes to truly support unicode characters would likely be much more pervasive in the types throughout the code.

I'll try to take a look at your uriparser unit tests when I can too.

@ethouris
Copy link
Collaborator

ethouris commented May 25, 2021

I think this version looks better:

string uri_decode(const string& in, bool printonly)
{
    char buf[4];

    std::ostringstream out;

    for (size_t i = 0; i < in.size(); ++i)
    {
        char c = in[i];
        if (c == '%' && i + 2 < in.size())
        {
            buf[0] = in[i+1];
            buf[1] = in[i+2];
            buf[2] = 0;
            unsigned hexval = 0;
            int st = sscanf(buf, "%x", &hexval);
            if (st == 1 && (!printonly || isprint(hexval)))
            {
                i += 2; // +1 will be added in the loop ++i
                out << char(hexval);
                continue;
            }
        }

        out << c;
    }

    return out.str();
}

@jschultz410
Copy link
Contributor Author

jschultz410 commented May 25, 2021

I think this version looks better:

Using an ostringstream and going character by character is probably better than doing searches, substr's, and string self additions. I considered doing that as well.

However, doing the sscanf without the checks that the characters are hex digits is more misleading when the input is bad. For example "%2z" will be improperly interpreted by sscanf as hexval == 2 with the bogus 'z' character being silently ignored and skipped and a wrong hexval being extracted and decoded.

This code also continues trying to decode even after a bad encoding sequence has been crossed, which might not be advisable on bad input (e.g. - "%%20").

I'm not sure about the printonly boolean saying un-printables shouldn't be decoded into the returned string and should be left as encoded sequences instead. (Originally, I was going to recommend restructuring the control flow before I realized you wanted to fall back to the main code path when decoding failed. Maybe a comment pointing that out would be good.)

The extra check I had for hexval == 0 can be dropped since std::string can contain nul characters these days.

@ethouris
Copy link
Collaborator

ethouris commented May 25, 2021

However, doing the sscanf without the checks that the characters are hex digits is more misleading when the input is bad. For example "%2z" will be improperly interpreted by sscanf as hexval == 2 with the bogus 'z' character being silently ignored and skipped and a wrong hexval being extracted and decoded.

Right. The condition should be fixed to:

        if (c == '%' && i + 2 < in.size() && isxdigit(in[i+2]))

No need to check the first one. If the first character isn't a hex digit, sscanf will not extract anything and will return 0.

This code also continues trying to decode even after a bad encoding sequence has been crossed, which might not be advisable on bad input.

Not sure exactly what the standard says about syntax errors here. This procedure checks if the % character is followed by two characters that can be deciphered as hex. If not, it just passes on. %%2a will then result in %*.

I'm not sure about the printonly boolean saying un-printables shouldn't be decoded into the returned string and should be left as encoded sequences instead. (Originally, I was going to recommend restructuring the control flow before I realized you wanted to fall back to the main code path when decoding failed. Maybe a comment pointing that out would be good.)

This was because I wasn't sure actually if the non-printable characters should be allowed or not. For printing in the logs or stdout, surely not. But for encoding as a string to be set, sure.

The extra check I had for hexval == 0 can be dropped since std::string can contain nul characters these days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[apps] Area: Test applications related improvements Type: Enhancement Indicates new feature requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants