Fix parsing of URL requirements with git+file scheme #264

Closed
wants to merge 1 commit into from


sbidoul
Member

@sbidoul sbidoul commented Jan 27, 2020

Issue this PR fixes:

Requirement("name @ git+file:///data/repo") fails when it should succeed, much like Requirement("name @ git+https://g.c/u/r.git").

Such requirements are allowed by PEP 440.

@sbidoul
Member Author

sbidoul commented Feb 22, 2020

Gentle nudge. See also pypa/pip#7650 (comment) for context.

I believe the test error is transient.

@di
Member

di commented Feb 24, 2020

On a quick glance the test error seems relevant? The failing test is the one you've added in this PR, though not sure why it's only failing for PyPy.

@sbidoul
Member Author

sbidoul commented Feb 25, 2020

@di thanks for looking into this PR.

Right, transient is probably not the right word to qualify the test failure. Indeed the failing test is one that I added and I can reproduce it by running nox -s tests-pypy3 locally. But what is very strange is that the stack trace displayed in the log does not show the code I added to make this test pass. Could it be that the pypy test does not pick up the right code?

@pradyunsg
Member

@sbidoul Consider adding pip install .; cd tests to the nox task for running tests? It might just be a case of picking up the wrong version of packaging?

@sbidoul
Member Author

sbidoul commented Feb 26, 2020

Okay this was actually fixed in #267, so I rebased and it's green now.

@di
Member

di commented Mar 24, 2020

It seems like the actual issue here is the difference in behavior between how urlparse treats an omitted netloc:

>>> urlparse.urlparse('git+file:///data/repo')
ParseResult(scheme='git+file', netloc='', path='/data/repo', params='', query='', fragment='')
>>> urlparse.urlparse('git+file://localhost/data/repo')
ParseResult(scheme='git+file', netloc='localhost', path='/data/repo', params='', query='', fragment='')

And how PEP 440 interprets an omitted netloc:

File URLs take the form of file://<host>/<path>. If the <host> is omitted it is assumed to be localhost and even if the <host> is omitted the third slash MUST still exist. The <path> defines what the file path on the filesystem that is to be accessed.

Which means that using git+file://localhost/data/repo as a workaround should work:

>>> from packaging.requirements import Requirement
>>> Requirement("name @ git+file://localhost/data/repo")
<Requirement('name@ git+file://localhost/data/repo')>

I'm not sure what the right thing to do here is. Should we always treat an omitted netloc as localhost? I don't think so, https:///data/repo is not the same as https://localhost/data/repo.

Should we just consider it localhost for some set of schemes? Maybe file:// and *+file://?

Or should we update the PEP to just not make this assumption?
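As a rough illustration of the second option (treating an omitted netloc as localhost only for file-like schemes), here is a minimal sketch. It is not code from packaging: the helper name `effective_netloc` is hypothetical, it uses Python 3's `urllib.parse`, and matching `file` plus any scheme ending in `+file` is an assumption about what "*+file://" would mean.

```python
from urllib.parse import urlsplit


def effective_netloc(url):
    """Return the netloc, defaulting to 'localhost' for file-like schemes only.

    Hypothetical helper: "file-like" is assumed to mean the scheme is
    'file' or ends with '+file' (e.g. 'git+file').
    """
    parts = urlsplit(url)
    is_file_like = parts.scheme == "file" or parts.scheme.endswith("+file")
    if is_file_like and not parts.netloc:
        return "localhost"
    # For every other scheme an empty netloc stays empty.
    return parts.netloc


print(effective_netloc("git+file:///data/repo"))
print(effective_netloc("https:///data/repo"))
```

With this approach https:///data/repo keeps its empty netloc, so it is not silently turned into https://localhost/data/repo.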

@pradyunsg
Member

Or should we update the PEP to just not make this assumption?

I think it'd be useful to figure out why the PEP makes this assumption. :)

@di
Member

di commented Apr 2, 2020

@ncoghlan @dstufft as the PEP authors, any insight here?

@sgg

sgg commented Apr 24, 2020

@di @pradyunsg I think there are two separate questions here:

  1. What is the semantic meaning of file URIs when there is no netloc component?
  2. Is the file URI parsing behavior in urllib correct?

Question 1

PEP440's assumption that an omitted host (netloc in urllib terms and file-auth in RFC8089 terms) means "local" is to spec in my opinion.

RFC 8089 says this:

A file URI can be dependably dereferenced or translated to a local
file path only if it is local. A file URI is considered "local" if
it has no "file-auth", or the "file-auth" is the special string
"localhost", or a fully qualified domain name that resolves to the
machine from which the URI is being interpreted (Section 2).

One subtle issue w/ PEP440's phrasing is that it says omitted is assumed to be localhost, when it might be more precise to say that "omitted and localhost are assumed to be equivalent" when talking about file URIs.

Question 2

I would argue that urllib's behavior is correct here. urlparse is a generic URL parser and the URL git+file:///data/repo does not contain a netloc, hence netloc is empty.

Furthermore, absent a file URI specific parser, I would think that it is the responsibility of the consumer of the URLs (in this case packaging) to understand that file:///, file://localhost/, file://::1/, etc. are semantically the same.


Note 1 - RFC8089 was first drafted in 2015 and finalized in 2017 so both PEP440 and PEP508 predate this RFC. However, RFC8089 is backwards compatible and the assumption about empty hostnames is sound going back to RFC1738.

Note 2 - RFC 8089 Appendix A lists the differences between 8089 and previous specs and Appendix B lists some examples.
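The "local" rule quoted from RFC 8089 can be sketched as a small predicate on top of the generic parser. This is only an illustration under stated assumptions: `is_local_file_uri` is a hypothetical name, the `vcs+file` prefix handling is an assumption about how packaging-style URLs would be treated, and the third clause of the RFC (a fully qualified domain name that resolves to the current machine) is deliberately left out because it would need a DNS lookup.

```python
from urllib.parse import urlsplit

# Authorities treated as "local" in this sketch: omitted, or the special
# string "localhost". The FQDN-resolves-to-this-machine case is not handled.
LOCAL_FILE_AUTHS = {"", "localhost"}


def is_local_file_uri(url):
    """Return True if url is a file URI (optionally 'vcs+file') that is local."""
    parts = urlsplit(url)
    # 'git+file' -> 'file'; a plain 'file' scheme is returned unchanged.
    base_scheme = parts.scheme.rpartition("+")[2]
    return base_scheme == "file" and parts.netloc.lower() in LOCAL_FILE_AUTHS


print(is_local_file_uri("git+file:///data/repo"))
print(is_local_file_uri("file://example.com/data/repo"))
```

The parser itself stays generic here; the file-specific interpretation lives entirely in the consumer, which is the division of responsibility argued for above.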


Cross-linking #120 since it's related.

@di
Member

di commented Apr 24, 2020

Nice analysis @sgg! I think I agree.

Looks like we could fix this just with:

             if parsed_url.scheme == "file":
                 if urlparse.urlunparse(parsed_url) != req.url:
                     raise InvalidRequirement("Invalid URL given")
-            elif not (parsed_url.scheme and parsed_url.netloc) or (
-                not parsed_url.scheme and not parsed_url.netloc
-            ):
+            elif not parsed_url.scheme:
                 raise InvalidRequirement("Invalid URL: {0}".format(req.url))

However this causes this test to fail:

def test_invalid_url(self):
    with pytest.raises(InvalidRequirement) as e:
        Requirement("name @ gopher:/foo/com")
    assert "Invalid URL: " in str(e.value)
    assert "gopher:/foo/com" in str(e.value)

And based on my understanding, this URL is just as valid here, since there's no restriction on what schemes are valid:

>>> urlparse.urlparse('git+file:///data/repo')
ParseResult(scheme='git+file', netloc='', path='/data/repo', params='', query='', fragment='')
>>> urlparse.urlparse('gopher:/foo/com')
ParseResult(scheme='gopher', netloc='', path='/foo/com', params='', query='', fragment='')

So we should probably remove that test as well, do you agree?
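To make the effect of the diff concrete, the two `elif` conditions can be pulled out into standalone predicates and run against the URLs discussed here. This is a simplified sketch rather than the code in packaging: `old_accepts` and `new_accepts` are hypothetical names, Python 3's `urllib.parse` stands in for the `urlparse` module used above, and the separate branch for the plain `file` scheme is not modelled.

```python
from urllib.parse import urlparse


def old_accepts(url):
    """Condition removed by the diff: both scheme and netloc are required."""
    parsed = urlparse(url)
    rejected = not (parsed.scheme and parsed.netloc) or (
        not parsed.scheme and not parsed.netloc
    )
    return not rejected


def new_accepts(url):
    """Condition added by the diff: only a non-empty scheme is required."""
    return bool(urlparse(url).scheme)


for url in ["git+https://g.c/u/r.git", "git+file:///data/repo", "gopher:/foo/com"]:
    print(url, old_accepts(url), new_accepts(url))
```

Both git+file:///data/repo and gopher:/foo/com have a scheme but an empty netloc, so the two predicates cannot tell them apart: any change of this shape that accepts the first also accepts the second.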

@sgg

sgg commented Apr 27, 2020

@di Acknowledging that there are some known issues/bugs w/ urlparse, I would opt to remove any of the validation logic that goes beyond RFC3986 compliance. Put simply, gopher:/foo/com is a valid URL, and is no sillier than gopher://in_a_tunnel/foo/com :)

If the intent is for setuptools to align with PEP440/508, then my philosophical view is that we should either allow any and all valid URLs in a Requirement, or those PEPs should be further specified with some constraints around URLs. I ended up getting bit by #120 because I was going on what the PEPs said was valid. :/

@sbidoul
Member Author

sbidoul commented Aug 22, 2020

I also agree any valid URL should be accepted. I updated the PR according to @di's suggestion above.

@sbidoul
Member Author

sbidoul commented Aug 23, 2020

@di good point, I added a test with a couple of exotic URLs.

I also added more tests for absolute and relative paths... and I found a potential issue: it now accepts c:\foo\bar as valid, while it rejects (/foo/bar)... not sure what to do with that.
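One way to see where the asymmetry comes from is to look at what the generic parser reports as the scheme for each string. The sketch below assumes Python 3's `urllib.parse` and uses a hypothetical helper name, `scheme_of`; it only inspects the parser output and does not call packaging.

```python
from urllib.parse import urlsplit


def scheme_of(url):
    """Return the scheme the generic URL parser extracts (may be empty)."""
    return urlsplit(url).scheme


for candidate in [r"c:\foo\bar", "c:/foo/bar", "/foo/bar", "mailto:me@example.com"]:
    parts = urlsplit(candidate)
    print(repr(candidate), repr(parts.scheme), repr(parts.netloc), repr(parts.path))
```

A single letter followed by a colon is syntactically a scheme, so a Windows-style path gets the scheme "c" and passes a check that only requires a non-empty scheme, while /foo/bar has no scheme at all and is rejected.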

@sbidoul
Member Author

sbidoul commented Aug 23, 2020

Also, it seems the PEP 508 grammar does allow paths without scheme?

URI_reference = <URI | relative_ref>
URI           = scheme ':' hier_part ('?' query )? ( '#' fragment)?
hier_part     = ('//' authority path_abempty) | path_absolute | path_rootless | path_empty
absolute_URI  = scheme ':' hier_part ( '?' query )?
relative_ref  = relative_part ( '?' query )? ( '#' fragment )?
relative_part = '//' authority path_abempty | path_absolute | path_noscheme | path_empty

The more I look into this the more confused I get..

self._assert_requirement(req, "name", "git+file:///data/repo")

@pytest.mark.parametrize(
    "url", ["gopher:/foo/com", "mailto:me@example.com", "c:/foo/bar"]
Member Author


Note that here, c is a URL scheme, not a drive letter.

@sbidoul
Member Author

sbidoul commented Mar 21, 2021

Rebased and squashed.

@piotr-dobrogost piotr-dobrogost mentioned this pull request Mar 21, 2021
@uranusjr
Member

uranusjr commented Mar 21, 2021

I’m wondering, what is a relative ref supposed to mean? In normal contexts, it means using the current entity’s scheme and replacing the domain, path, etc., but what should such a URL in a requirement resolve against?

It’s likely possible to come up with a reasonable answer for every possible way to install a distribution (wheel, sdist, local VCS, remote VCS, local VCS clone from a remote, an on-disk source tree, etc.), but there would be so much subtlety that I doubt the end result would be useful for many, and I can easily see it becoming a minefield that every tutorial on the internet gets wrong and that causes endless pain for users and packaging tools alike. TBH I’d be much happier if we could just pretend the relative ref thing doesn’t exist until someone actually comes up with a coherent plan (and probably an Informational PEP) on how it can be used.

@ncoghlan
Member

ncoghlan commented Mar 23, 2021

I don't remember the rationale per se, but my assumption would be that we were just ensuring that file URLs were processed the same way browsers, curl, etc process them: three forward slashes implies "localhost" for "file" URLs, since it's an intrinsically local protocol.

@ncoghlan
Member

For the relative reference support in the grammar, re-reading PEP 508 suggests to me that that is a setuptools feature that inadvertently slipped into the spec when the grammar was derived from the pkg_resources one. PEP 440 doesn't allow relative URLs, and PEP 508 doesn't define semantics for them either.

@sbidoul
Member Author

sbidoul commented Apr 2, 2021

Can we say this PR is an improvement in itself, and that refinements to URI parsing / validation could be deferred?

@brettcannon brettcannon added the bug label Jul 1, 2021
@pradyunsg
Member

I think we can say that. :)

@uranusjr
Member

uranusjr commented Oct 16, 2021

I'm only willing to consider this a net improvement if the "exotic URLs" continue to be disallowed. Otherwise we are moving the parsing logic from being too restrictive to being too permissive, which is at best a wash (and arguably worse because it'd be a pain to remove support for those later on if people start to depend on those "features").

@brettcannon
Member

What do we want to do with this PR? There's no attached issue and I'm not enough of an expert when it comes to the URL schemes to know what to do.

@sbidoul
Member Author

sbidoul commented Aug 28, 2022

In #120 (comment) @uranusjr and @brettcannon seemed to be happy with relaxing URL validation in packaging. @sgg says something similar in #264 (comment). That is also my preference, and what is implemented here.

In its current state, this PR, which I just rebased,

This may let invalid URLs that were caught before go through to downstream tools, but I personally don't think this can create much more trouble than harder-to-understand downstream errors when passed exotic URLs.

@sbidoul
Member Author

sbidoul commented Apr 13, 2023

Closing in favor of #684
