Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Needs to ignore repeated slashes in file URL #232

Closed
watilde opened this issue Feb 5, 2017 · 25 comments
Closed

Needs to ignore repeated slashes in file URL #232

watilde opened this issue Feb 5, 2017 · 25 comments

Comments

@watilde
Copy link

watilde commented Feb 5, 2017

Hello!

Since the file state doesn't have a way to jump to the special authority slashes state, once the code enters into the file state, repeated slashes in the beginning of a file URL(e.g. file://///foo/bar should be file:///foo/bar) can't be ignored.

To cover the case, it could be done by the following steps:

  1. In scheme state, stop jumping to file state directly before checking special authority slashes state.
  2. In special authority ignore slashes state, add file state check.
  3. In scheme state, add file state check before jumping to special relative or authority state.

Resources:

What do you think?

watilde added a commit to watilde/url that referenced this issue Feb 6, 2017
Repeated slashes should be arranged in not only the http scheme, but
also the file scheme as well.

Tests:

Fixes: whatwg#232
watilde added a commit to watilde/url that referenced this issue Feb 6, 2017
Repeated slashes should be arranged in not only the http scheme, but
also the file scheme as well.

Tests:

Fixes: whatwg#232
watilde added a commit to watilde/url that referenced this issue Feb 6, 2017
Repeated slashes should be arranged in not only the http scheme, but
also the file scheme as well.

Tests:

Fixes: whatwg#232
@annevk
Copy link
Member

annevk commented Feb 7, 2017

It does seem that Chrome and Edge have this behavior, while Firefox and Safari (likely changed I suppose) do not.

I would support changing this based on that. We'll need new tests as well.

@watilde
Copy link
Author

watilde commented Feb 7, 2017

@annevk Thanks for your comment!

Then I will try reading external/google-url(correct?) impl as a reference, and write some tests to WPT. To make a PR to WPT, could I possibility make a PR to this repo with this commit before that?

@annevk
Copy link
Member

annevk commented Feb 7, 2017

Yeah, you can do the PRs independently as you wish. We'll link them all up before landing.

As for what to use as reference, google-url seems good, but testing Edge would be good too. As for what to test, switching between \ and / would be good, and whether the eventual non-//\ code point ends up as path or host. Not sure whether there is much else here.

watilde added a commit to watilde/url that referenced this issue Feb 7, 2017
Repeated slashes should be arranged in not only the http scheme, but
also the file scheme as well.

Tests:

Fixes: whatwg#232
watilde added a commit to watilde/web-platform-tests that referenced this issue Feb 8, 2017
Related links
+ The URL Standard change: whatwg/url#234
+ The original bug report: whatwg/url#232
watilde added a commit to watilde/web-platform-tests that referenced this issue Feb 8, 2017
Related links
+ The URL Standard change: whatwg/url#234
+ The original bug report: whatwg/url#232
watilde added a commit to watilde/url that referenced this issue Feb 8, 2017
Repeated slashes should be ignored in not only the http scheme, but
also the file scheme as well.

Tests: web-platform-tests/wpt#4762

Fixes: whatwg#232
watilde added a commit to watilde/web-platform-tests that referenced this issue Feb 8, 2017
Related links
+ The URL Standard change: whatwg/url#234
+ The original bug report: whatwg/url#232
watilde added a commit to watilde/web-platform-tests that referenced this issue Feb 8, 2017
Related links
+ The URL Standard change: whatwg/url#234
+ The original bug report: whatwg/url#232
watilde added a commit to watilde/web-platform-tests that referenced this issue Feb 8, 2017
Related links
+ The URL Standard change: whatwg/url#234
+ The original bug report: whatwg/url#232
watilde added a commit to watilde/web-platform-tests that referenced this issue Feb 8, 2017
Related links
+ The URL Standard change: whatwg/url#234
+ The original bug report: whatwg/url#232
@rmisev
Copy link
Member

rmisev commented Feb 8, 2017

I think @watilde you must be careful with this issue: also consider case with localhost, for example:
file://localhost///foo/bar

It looks likes Edge and Chrome ignores slashes after localhost. If not, then this can cause a re-parsing issue:

After first pass we will get:
href: file://///foo/bar
pathname: ///foo/bar

If now we try to parse href, then will get:
href: file:///foo/bar
pathname: /foo/bar

Other test cases to try:
file:///localhost///foo/bar
file:////localhost///foo/bar

@watilde
Copy link
Author

watilde commented Feb 9, 2017

@rmisev Thanks for your comment, that's a good point indeed! I will try supporting it, and potentially we need to update some logics around the file host state.

I also find an another case that should ignore some slashes:
input: new URL('////////server/file', 'file:///tmp/mock/path')
href: file:///server/file
pathname: /server/file

So it's not only about when the base is localhost but also includes some of the other base hosts.

@annevk
Copy link
Member

annevk commented Feb 13, 2017

I have another problem that I don't really know a good solution to. If you have file:/// as input and then you set its pathname to //, that would create a URL that cannot be serialized and then parsed into the same URL. In Chrome this can indeed lead to problems: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/4883.

@yutakahirano
Copy link
Member

Let me check my understanding.

According to RFC3986, path can start with multiple slashes only when authority is present. Note that authority can be empty as reg-name can be empty.

      URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

      hier-part   = "//" authority path-abempty
                  / path-absolute
                  / path-rootless
                  / path-empty
...
      authority   = [ userinfo "@" ] host [ ":" port ]
      host        = IP-literal / IPv4address / reg-name
      reg-name    = *( unreserved / pct-encoded / sub-delims )
...
      path-abempty  = *( "/" segment )
      path-absolute = "/" [ segment-nz *( "/" segment ) ]
      path-noscheme = segment-nz-nc *( "/" segment )
      path-rootless = segment-nz *( "/" segment )
      path-empty    = 0<pchar>
      segment       = *pchar
      segment-nz    = 1*pchar
      segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                    ; non-zero-length segment without any colon ":"

      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

So, after seeing "file:",

  • If the parser sees "//":
    authority follows (note that it can be empty), followed by path-abempty which is either empty or starting with (possible multiple) slashes.
  • Otherwise, if the parser sees "/":
    path-absolute follows (including the "/"). Note that path-absolute cannot start with "//".
  • Otherwise:
    ...

Here are some examples:

input host path comment
file:/x/y/z n/a /x/y/z authority is not present
file://x/y/z x /y/z
file://x//y/z x //y/z
file:///x/y/z /x/y/z authority is empty
file:////x/y/z //x/y/z authority is empty
file:/localhost/y/z n/a /localhost/y/z authority is not present
file://localhost/y/z localhost /y/z
file:///localhost/y/z /localhost/y/z authority is empty

Please correct me if I'm wrong.

@annevk
Copy link
Member

annevk commented Feb 15, 2017

That is correct per that RFC, but nobody really follows it closely and the compatibility constrains are likely different too.

@yutakahirano
Copy link
Member

It looks @watilde is proposing to ignore repeating leading slashes in pathname. If we want to do that, we would also want to remove repeating leading slashes when setting pathname, which will solve your problem.

@annevk
Copy link
Member

annevk commented Feb 15, 2017

But only ignore them when authority is empty, or just always ignoring them? I guess file systems do not allow for empty directories, so maybe that is okay?

@yutakahirano
Copy link
Member

If we ignore repeating leading slashes in pathname on file://localhost/ (i.e., "file://localhost////////////path" is parsed as "file:///path"), I think we should ignore them even when authority is not empty. In other words, if we ignore repeating leading slahes in pathname only when authority is empty, I think we should not omit 'localhost'.

@annevk
Copy link
Member

annevk commented Feb 15, 2017

Sounds reasonable to me. (It's very heard to search for "file systems that support directories without name".)

@annevk
Copy link
Member

annevk commented Feb 16, 2017

Always trimming would also work for when you set host to the empty string through the API. (Another case we should make sure to test.)

@watilde
Copy link
Author

watilde commented Feb 17, 2017

@annevk @yutakahirano Sorry for my late response, and that sounds what we're looking for. I will catch up now!

@rmisev
Copy link
Member

rmisev commented Feb 21, 2017

Yet another sample:
file://localhost///a//../..//

Currently it is serialized to:
file://////

It shows what repeated leading slashes must be removed at the end of path parse (path state), when all possible /../ are resolved. At this point localhost is already converted to empty host, so when switching from path state or at EOF:

  • If url’s scheme is "file" and url’s host is empty, then remove all leading empty strings from path excluding last path's string.

This also works for pathname setter.

@annevk
Copy link
Member

annevk commented Mar 10, 2017

@watilde are you still interested in working on this?

@watilde
Copy link
Author

watilde commented Mar 20, 2017

@annevk I may take more time to fix. I would appreciate if you could take a patch :)

annevk added a commit to web-platform-tests/wpt that referenced this issue Mar 22, 2017
Closes #4762 by superseding it.

URL Standard: whatwg/url#232.
annevk added a commit that referenced this issue Mar 22, 2017
@annevk
Copy link
Member

annevk commented Mar 22, 2017

Note that in #278 I went with a slightly different solution than @rmisev proposes since I wanted to go with @yutakahirano's proposal of ignoring the leading slashes even when the URL contains a host.

I also ended up finding an unrelated bug and fixed that too.

annevk added a commit that referenced this issue Mar 28, 2017
Also copy file URL hosts correctly for path-absolute input.

Tests: web-platform-tests/wpt#5195.

Closes #234 by superseding it. Fixes #232.
annevk added a commit to web-platform-tests/wpt that referenced this issue Mar 29, 2017
Closes #4762 by superseding it.

URL Standard: whatwg/url#232.
annevk added a commit that referenced this issue Mar 29, 2017
Also copy file URL hosts correctly for path-absolute input.

Tests: web-platform-tests/wpt#5195.

Closes #234 by superseding it. Fixes #232.
@watilde
Copy link
Author

watilde commented Mar 29, 2017

brilliant! thanks.

@alwinb
Copy link
Contributor

alwinb commented Jul 19, 2018

Could someone explain to me the reasoning behind this decision?

I believe that it was motivated by the desire to interpret file:////host as file://host file:///host, making file URLs and other 'special' URLs more alike. But the result is that now paths of file URLs are different from paths of other special URLs, and less expressive.

I've been working on an URL parser that follows the WhatWG behaviour but separates the parsing and the normalization phases. In that light, the following cases from the web platform tests are especially confusing.

  • /..//localhost//pig against file://lion/ resolves to file://lion/localhost//pig rather than file://lion//localhost//pig.
  • file://localhost//a//../..// resolves to file:/// rather than file:////

Note that /..//localhost//pig against http://lion/ resolves to http://lion//localhost//pig
(and http://localhost//a//../..// resolves to http://localhost///).

Why would you make this trade-off? Is there a motivation that I am missing?

Thank you,

@annevk
Copy link
Member

annevk commented Jul 21, 2018

@alwinb did you read through this issue and the links provided? Skimming it the rationale appears to be related to existing implementations and avoiding reparsing issues.

@alwinb
Copy link
Contributor

alwinb commented Jul 21, 2018

I did read through it, it seems to have gone from wanting to ignore slashes after file:/// via a reparsing issue to this solution.

I want to make a case for keeping path normalization of file URLs and other 'special' URLs the same.
They have been the same (barring drive letters) until this change was made.

However, none of the browsers (or Edge?) implement the specced behaviour.
I tested Chrome canary, Safari preview and Firefox nightly.

Input Firefox and Safari Chrome
/..//localhost//pig against file://lion/ file:////localhost//pig file://lion//localhost//pig (seems right)
file://localhost//a//../..// file:///// file://localhost///
file://localhost////foo file://////foo file://localhost////foo
file:////foo file:////foo file:///foo
  • All agree on the pathName except in the fourth case. (I don't know why Safari and Firefox drop the host in the first.)
  • Only Chrome (Edge?) match the spec in only the fourth case.
  • Firefox and Safari do not ignore slashes and thereby avoid the reparsing issue.
  • Chrome (Edge?) only ignores slashes immediately following file:/// and avoids the reparsing issue by keeping localhost.

Some ideas for alternatives:

  1. Always replace the empty host with localhost.
  2. Do so only in ambiguous cases.
  3. Prefix ambiguous pathNames with ./.
  4. Don't collapse slashes at all.
  5. Come up with something else?

I guess 1, 4 (and 5) seem to be the more reasonable ones.

Thanks,

@annevk
Copy link
Member

annevk commented Jul 22, 2018

Hmm, okay, I guess this warrants another look then. Could you file a new issue with the above findings? And maybe add the specified behavior as a column to the table?

@alwinb
Copy link
Contributor

alwinb commented Jul 23, 2018

I created a new issue, #405

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants