Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't force query string normalization #1234

Closed
1 task done
Giotino opened this issue May 7, 2020 · 34 comments · Fixed by #1246
Closed
1 task done

Don't force query string normalization #1234

Giotino opened this issue May 7, 2020 · 34 comments · Fixed by #1246
Labels
enhancement This change will extend Got features

Comments

@Giotino
Copy link
Collaborator

Giotino commented May 7, 2020

What problem are you trying to solve?

In an url like http://example.org/random?param=SOMETHING~SOMETHING the special character ~ is percent-encoded before the request, resulting in http://example.org/random?param=SOMETHING%7ESOMETHING which is not supported (decoded) by some HTTP servers.

As described by RFC 3986 in section 2.3
"
URI comparison implementations do not always perform normalization prior to comparison. For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
"

Also in RFC 3986, section 6.2.2.2
"
The percent-encoding mechanism is a frequent source of variance among otherwise identical URIs. In addition to the case normalization issue noted above, some URI producers percent-encode octets that do not require percent-encoding, resulting in URIs that are equivalent to their non-encoded counterparts. These URIs should be normalized by decoding any percent-encoded octet that corresponds to an unreserved character, as described in Section 2.3.
"

The percent-encoding of ~ by got happen because NodeJS follows the "WHATWG URL API" (https://nodejs.org/api/url.html#url_the_whatwg_url_api) which misses ~ from the unreserved characters (https://url.spec.whatwg.org/#interface-urlsearchparams, the Note below the example, and https://url.spec.whatwg.org/#urlencoded-serializing) and, by the way, includes *.

Describe the feature

My proposal is to add a flag to the options to prevent the normalization by skipping the append and delete of "_GOT_INTERNAL_TRIGGER_NORMALIZATION"

Checklist

  • I have read the documentation and made sure this feature doesn't already exist.
@Giotino Giotino changed the title Better percent encoding Better serach normalization May 7, 2020
@Giotino Giotino changed the title Better serach normalization Better search normalization May 7, 2020
@szmarczak
Copy link
Collaborator

Duplicate of #1180 #1202 #1113

@szmarczak
Copy link
Collaborator

Unfortunately we cannot do anything about this :(

@Giotino
Copy link
Collaborator Author

Giotino commented May 8, 2020

Sorry, I hadn't seen the previous issues, but i think that something can be done.

#1180 and #1202 described that behavior as a bug, i don't think of it as bug, NodeJS decided to go with the WHATWG URL specs, and it follows it.

What can be done is giving the user a choice about the standard (or whether the normalization should take place), maybe with a flag into the options.
example:

got('http://example.org/random?param=SOMETHING~SOMETHING', { rfc3986: true })
or
got('http://example.org/random?param=SOMETHING~SOMETHING', { preventSearchNomalization: true })

Another thing that can be done is to only reimplement the URLSearchParams class to generate the URL as described by the RFC3986 (without giving the user a choice).

The last thing that came to my mind is to remove this code (core/index.ts:713)

if (options.url.search) {
  const triggerSearchParameters = "_GOT_INTERNAL_TRIGGER_NORMALIZATION";

  options.url.searchParams.append(triggerSearchParameters, "");
  options.url.searchParams.delete(triggerSearchParameters);
}

That would be a little hack to the Javascript url parser. It seems that when a url is parsed (new URL('SOME_URL')) the query string is not re-generated (normalized), it stays as in SOME_URL, in fact got has the code above to normalize it.
I don't think this last one is a valid solution and I don't think it will be reliable for the future.

Actually the thing is to have a little discussion about what has to be done: leaving things the way they are blaming the NodeJS specs, or giving the user tools to decide by himself.

All that said, I'm willing to help with the implementation of any solution.

@szmarczak
Copy link
Collaborator

Unfortunately the WHATWG URL is used in the Node.js http module. Got just enforces the behavior to be always the same. If we didn't do that, the sometimes you'd end up with ~ and sometimes with %7E.

@Giotino
Copy link
Collaborator Author

Giotino commented May 8, 2020

Ok, but what about leaving the query string as it is when it's inside the URL as the http module do?

got('http://example.org/random?param=SOMETHING~SOMETHING') => /random?param=SOMETHING~SOMETHING

got('http://example.org/random', { searchParams: { param: 'SOMETHING~SOMETHING' } }) => `/random?param=SOMETHING%7ESOMETHING`

@szmarczak
Copy link
Collaborator

That's the behavior we want to avoid. The proper "fix" would be to pass an object instead of a URL instance to the http module.

@Giotino
Copy link
Collaborator Author

Giotino commented May 8, 2020

That's the behavior we want to avoid.

So with got I can't send a request like this:

GET /random?=^~ HTTP/1.1
...

Because /random?=^~ gets normalized to /random?=%5E%7E

Event if it respects the HTTP/1.1 (RFC 2616) and URI (RFC 3986) RFCs?

RFC 3986 section 2.2
"
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
"

RFC 2616 section 6.2.2 describes an HTTP URL as
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

@szmarczak
Copy link
Collaborator

WHATWG URL specs, and it follows it.

The spec is buggy. If you want to get this fixed faster, feel free to send a temporary fix to Node.js. On Got side we just enforce normalization.

@Giotino
Copy link
Collaborator Author

Giotino commented May 8, 2020

My previous example was a little misleading.
What I'm trying to say is that whenever Got find something as URL.search (when parsing the URL) it treats that string as URLSearchParams.

But that's wrong.
If I make a request to an URL like http://example.org/?aaa, Got normalize it to an equivalent of http://example.org/?aaa= and so the server receives /?aaa= as url.

With this last example i think this is going to be a "bug report" instead of a "feature request"

@szmarczak
Copy link
Collaborator

According to this RFC it's still valid because the value is empty:

However, as query components
are often used to carry identifying information in the form of
"key=value" pairs and one frequently used value is a reference to
another URI, it is sometimes better for usability to avoid percent-
encoding those characters.

@Giotino
Copy link
Collaborator Author

Giotino commented May 8, 2020

The RFC states that query components are often (not always) in the form of "key=value".

@szmarczak
Copy link
Collaborator

That's right. But ?key= means the same as ?key. Both have no value.

@Giotino
Copy link
Collaborator Author

Giotino commented May 8, 2020

The problem is that I can't find a single standard that states that the query string must be interpreted as key-value and be in the form of key1=value1&key2=value2 (application/x-www-form-urlencoded)

A lot of servers parse it like that (and also offer access to the raw string) but to me it seems not to be mandatory.
I think most servers do that because it's the preferred (most used) way.

Event the WHATWG URL spec does not states that the query string must be application/x-www-form-urlencoded, but it propose a standard to apply to it.

@Giotino
Copy link
Collaborator Author

Giotino commented May 9, 2020

@szmarczak
Copy link
Collaborator

The spec is buggy. If you want to get this fixed faster, feel free to send a temporary fix to Node.js. On Got side we just enforce normalization.

#1234 (comment)

@Giotino
Copy link
Collaborator Author

Giotino commented May 9, 2020

The spec is fine, but Got tries to enforce application/x-www-form-urlencoded onto the query even if it's not.

On Got side we just enforce normalization.

You are trying to normalize things that should not be normalized (because there's no standard).

@Giotino Giotino changed the title Better search normalization Wrong URL query string handling May 9, 2020
@szmarczak
Copy link
Collaborator

szmarczak commented May 9, 2020

The spec is fine

It's not. Even one of their maintainers says it's broken: https://twitter.com/domenic/status/1257377082704900096

You are trying to normalize things that should not be normalized (because there's no standard).

@sindresorhus One of the workarounds would be to consider the query in the input and the searchParams option mutually exclusive. But that would be a breaking change.

@Giotino
Copy link
Collaborator Author

Giotino commented May 9, 2020

What i really meant with

The spec is fine

was that that's not the problem.

I know this "Feature request" started as "I've a problem with ~, damn WHATWG spec", but after this comment #1234 (comment) , it became a "Bug report" on the incorrect handling of the query string.

I wanted to discuss exhaustively about this issue because, as you said, it would be a breaking change. In particular I think it is going to break the merging of the URL params.

@Giotino
Copy link
Collaborator Author

Giotino commented May 11, 2020

Any news about this issue?

Should this issue be reopened?

@szmarczak szmarczak reopened this May 11, 2020
@szmarczak szmarczak changed the title Wrong URL query string handling decodeURI(url) before sending a request May 11, 2020
@szmarczak szmarczak added the enhancement This change will extend Got features label May 11, 2020
@Giotino
Copy link
Collaborator Author

Giotino commented May 11, 2020

May I ask how decodeURI is going to solve this issue?

I think the only correct solution is to leave the URL intact (if given as a string).

BTW, this issue isn't breaking some random small HTTP server, it's breaking compatibility with things like Akamai's "Adaptive Media Delivery"
Example URL: http://domain.akamaihd.net/my/uri/file.mp4?token=exp=1462383360~data=my_signature_data~hmac=2d5a69e9a3d2b6d304958a05afc2c139045adb71f65115861736be5cba98359d
If the = (and ~) inside the token "value" are percent encoded the URL stops working.

@szmarczak szmarczak changed the title decodeURI(url) before sending a request decodeURIComponent(url) before sending a request May 11, 2020
@szmarczak
Copy link
Collaborator

Sorry, I meant decodeURIComponent.

@Giotino
Copy link
Collaborator Author

Giotino commented May 11, 2020

Sorry, I meant decodeURIComponent.

My doubts remain the same as before.

This time you have %## disappearing from the URL. Also this won't fix the appended =.

First example:
URL: http://example.org/?key=some%legit%value, if you call decodeURIComponent('some%legit%value') the result is an URIError.

Second example:
URL: http://example.org/?some_legit_query_string is going to became http://example.org/?some_legit_query_string= which is different.

@szmarczak
Copy link
Collaborator

First example:

Invalid URL. The % character should be escaped.

Second example:

Will be fixed.

@Giotino
Copy link
Collaborator Author

Giotino commented May 11, 2020

Invalid URL. The % character should be escaped.

It's not invalid, your statement would be true if the format applied to the query string was application/x-www-form-urlencoded, which requires that % character should be escaped.

What about a custom format that don't use percent escapes?
It's still into the specs, as described before HTTP do not have specs about the query string content (except for the # character, as it's the terminator of the query string).

You can argue that application/x-www-form-urlencoded is a spec and Got follows it, but in the homepage I read

Human-friendly and powerful HTTP request library for Node.js

and not

Human-friendly and powerful HTTP (with query string as application/x-www-form-urlencoded) request library for Node.js

@szmarczak
Copy link
Collaborator

The WHATWG URL standard (query string) specifies that only these characters may be unescaped: https://url.spec.whatwg.org/#url-code-points

decodeURIComponent('some%legit%value') throws an error. The thing that is incorrect in the WHATWG spec is that searchParams assumes that the query is x-www-form-urlencoded.

@szmarczak szmarczak changed the title decodeURIComponent(url) before sending a request decodeURIComponent(query) before sending a request May 11, 2020
@Giotino
Copy link
Collaborator Author

Giotino commented May 11, 2020

the WHATWG spec is that searchParams assumes that the query is x-www-form-urlencoded

I don't think that statement is true. (they also make an example against it)

There's why:
There's also search which is the raw string.
Reading the WHATWG spec seems like searchParams is an attempt to read search as x-www-form-urlencoded but I can't find anything about its mandatoriness of that format (meaning that if search it's not x-www-form-urlencoded searchParams has no real meaning)

The WHATWG URL standard (query string) specifies that only these characters may be unescaped

Sorry, I missed it... (it's only written in the implementation of the parser)

@szmarczak
Copy link
Collaborator

I don't think that statement is true. (they also make an example against it)

Wrong example. They're talking about query string, not search params.

Sorry, I missed it... (it's only written in the implementation of the parser)

No problem, I just figured out that too. Was wondering why decodeURIComponent('some%legit%value') was throwing...

@Giotino
Copy link
Collaborator Author

Giotino commented May 11, 2020

Wrong example. They're talking about query string, not search params.

My bad, I phrased it wrong.
I meant that there's still the search value (which represents the query string) and there's nothing about how to parse it (except the escapes), hence WHATWG is not trying to force a spec on the query string, it's only offering a format/tool (URLSearchParams) that can be used or not.

@szmarczak
Copy link
Collaborator

szmarczak commented May 11, 2020

WHATWG is not trying to force a spec on the query string

But there are inconsistencies in the normalization...

@Giotino
Copy link
Collaborator Author

Giotino commented May 11, 2020

But there are inconsistencies in the normalization...

I'm sorry, I can't understand what you are referring to, can you explain

@stevenvachon
Copy link
Contributor

stevenvachon commented May 11, 2020

This occurs with basic auth as well:

const url = new URL('http://host');
url.username = '=user';
url.password = '=pass';
got(new URL(url));
// sends "%3Duser" and "%3Dpass" to `http.request`

@szmarczak
Copy link
Collaborator

I'm sorry, I can't understand what you are referring to, can you explain

You can pass a perfectly valid urlencoded query in new URL(string) and if you call .append('_', '').delete('_') it will be improperly escaped. Hence those issues.

@szmarczak
Copy link
Collaborator

@stevenvachon You need to compute the authorization header on your own for now...

@Giotino
Copy link
Collaborator Author

Giotino commented May 12, 2020

You can pass a perfectly valid urlencoded query in new URL(string) and if you call .append('_', '').delete('_') it will be improperly escaped. Hence those issues.

The URL Class of WHATWG that states that searchParams is readonly, but I can't understand if it means that it can't be swapped (readonly reference) or it can't be "touched" (readonly object).

In any case if you are using searchParams (with append and delete) it means that you treat the query string as x-www-form-urlencoded, hence it's correct to apply that normalization.

If you're using x-www-form-urlencoded there's no need to "keep the URL as it is" because that spec define how to encode/decode that string, so it will be fine.

@Giotino Giotino changed the title decodeURIComponent(query) before sending a request Don't force query string normalization May 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement This change will extend Got features
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants