Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement glob-like pattern matching #1512

Merged
merged 2 commits into from
Aug 31, 2021

Conversation

MVrachev
Copy link
Collaborator

@MVrachev MVrachev commented Jul 29, 2021

Fixes #1505, #1506

Description of the changes being introduced by the pull request:

Implement glob-like pattern matching - 1505

According to the recently updated version of the specification the shell
style wildcard matching is glob-like (see theupdateframework/specification#174),
and therefore a path separator in a path should not be matched by a
wildcard in the PATHPATTERN.

That's not what happens with fnmatch.fnmatch() which doesn't
see "/" separator as a special symbol.
For example: fnmatch.fnmatch("targets/foo.tgz", "*.tgz") will return
True which is not what glob-like implementation will do.

We should make sure that target_path and the pathpattern contain the
same number of directories and because each part of the pathpattern
could include a glob pattern we should check that fnmatch.fnmatch() is true
on each target and pathpattern directory fragment separated by "/".

Avoid lstrip(os.sep) on target paths - 1506

For targetpath: we don't want to support corner cases such as
file paths starting with separator.
Why this case should be treated especially than any other case where
you have multiple "/" for example "foo//bar/tar.gz"?

For pathpattern: it's recommended that the separator in the pathpattern
should be "/":
see https://theupdateframework.github.io/specification/latest/#targetpath
I believe it could lead to issues for a client implementation if it
supports arbitrary separators - every implementation needs to choose one
and stick with it.
Then, if we decide that "/" is our separator using lstrip on "os.sep" is
wrong, because the os separator from the server could be different that
the one used in the client.

Because of the above arguments, it makes sense to just remove
lstrip on os separators.

Additionally, document in the public API that we only support "/" as a
separator and don't handle corner cases such as leading separators
in either pathpattern or target_filepath.

Signed-off-by: Martin Vrachev mvrachev@vmware.com

Please verify and check that the pull request fulfills the following
requirements
:

  • The code follows the Code Style Guidelines
  • Tests have been added for the bug fix or new feature
  • Docs have been added for the bug fix or new feature

@MVrachev MVrachev force-pushed the glob-pattern-matching branch 2 times, most recently from a631b4e to 4593342 Compare July 29, 2021 17:52
Copy link
Contributor

@sechkova sechkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing the dirnames is a clever trick that matches the most common examples in TUF. But if we want to implement proper "globbing" it should be applied on each of the components of a pathname separately.
Should this be a valid test case too (I don't know if someone would use it 🤷) : ("foo/bar/zoo/k.tgz", "foo/*/zoo/*"),?

Also IIUC, '*' should not match path separator, I've added comments about it.

tuf/ngclient/updater.py Outdated Show resolved Hide resolved
tuf/ngclient/updater.py Outdated Show resolved Hide resolved
tuf/ngclient/updater.py Outdated Show resolved Hide resolved
tests/test_updater_ng.py Outdated Show resolved Hide resolved
tests/test_updater_ng.py Outdated Show resolved Hide resolved
@MVrachev MVrachev force-pushed the glob-pattern-matching branch 2 times, most recently from 7efb6a6 to 57cd50c Compare July 30, 2021 19:10
@MVrachev
Copy link
Collaborator Author

MVrachev commented Jul 30, 2021

After the good review by @sechkova, I decided to change my approach.

As Teodora suggested, I will focus on checking the list of directories returned by os.path.dirname(pathpattern).split("/")
and the list of directories returned by os.path.dirname(targetname).split("/").
If both of them match, I can use fnmatch.fnmatch().
Also, that way I will be able to accept * as a directory name for pathpattern directories and thus this example:
targetname = "foo/bar/zoo/k.tgz" and pathpattern = foo/*/zoo/* will be supported.

Please let me know if any of the reviewers have ideas for a combination of targetname and pathpattern that will return a false positive or false negative

@MVrachev
Copy link
Collaborator Author

MVrachev commented Aug 4, 2021

Added type annotations and better function description for _is_target_in_pathpattern as suggested by @avelichka.

@jku
Copy link
Member

jku commented Aug 4, 2021

This seems to be handling directories as a special case. I don't think you'll be able to make that work: as Teodora said globbing should probably be applied on each of the components of a pathname separately (unless of course the python base modules include a method to handle the whole path).

Think of e.g. these cases to see new issues:

("foo/bar", "f?o/bar"),
("foo/bar", "*o/bar"),

Other quick comments:

  • let's pay attention to os.path usage: both pathpattern and targetpath are (or should be, I think) url fragments and not local filesystem paths. So if we use a path module do we actually want to use posixpath and not os.path?
  • we probably want to do posixpath.abspath() at some point to handle cases like "foo//bar" in input -- but I guess that should happen somewhere closer to the API level, and not in this code

@MVrachev
Copy link
Collaborator Author

MVrachev commented Aug 4, 2021

This seems to be handling directories as a special case. I don't think you'll be able to make that work: as Teodora said globbing should probably be applied on each of the components of a pathname separately (unless of course the python base modules include a method to handle the whole path).

Think of e.g. these cases to see new issues:

("foo/bar", "f?o/bar"),
("foo/bar", "*o/bar"),

I added a new commit in which I change my approach.
In the last commit, I call fnmatch.fnmatch() on each directory in the pathpattern
separately.
Also, I added two unit test cases testing that:

  • ("foo/bar", "f?o/bar") will pass
  • ("foo//bar", "*o/bar") will fail because of the number of slashes.
    I remember we discussed before we don't want to artificially support corner cases.

Other quick comments:

  • let's pay attention to os.path usage: both pathpattern and targetpath are (or should be, I think) url fragments and not local filesystem paths. So if we use a path module do we actually want to use posixpath and not os.path?
  • we probably want to do posixpath.abspath() at some point to handle cases like "foo//bar" in input -- but I guess that should happen somewhere closer to the API level, and not in this code

We had a discussion with @joshuagl about what targets do we expect as arguments
and his opinion was that pathpattern and targetpath should point in a local directory
(or at least that's how I interpreted his words).

Additionally, I run all of our tests and checked what do we pass to fnmatch.fnmatch() inside
_visit_child_roleand _target_matches_path_pattern() and none of them had test using URLs.
The old code provides an API call get_valid_targetinfo() for the MultiRepoUpdater
which seems to me that doesn't handle URLs in any specific way.

@jku
Copy link
Member

jku commented Aug 5, 2021

I remember we discussed before we don't want to artificially support corner cases.

If you mean this code expects the "paths" to be canonical representation (so things like "foo//bar" or "foo/../foo/bar" are not supported) that seems correct. Or in other words: I'm fine with requiring well-formed input, I'm cautious about defining things as corner cases :)

We had a discussion with @joshuagl about what targets do we expect as arguments
and his opinion was that pathpattern and targetpath should point in a local directory
(or at least that's how I interpreted his words).

The spec is broken about this IMO but I don't think we can or should do that in the implementation -- how could the client know which separator the server is using? The API as it's currently designed should maybe accept filesystem paths as input but I believe the input handling should immediately make the transmogrification to a URL-path so rest of the code can assume everything is URL/posix path...

Alternatively someone needs to explain to me how we can support random directory separators in all of the code. As an example: if pathpatterns and targetpaths were filesystem paths then you are missing all the windows path tests, and all the tests with half windows paths and half posix paths

Copy link
Member

@jku jku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe makes sense to finish this after #1463 is in but a few more specific comments:

  • the matching looks about right now
  • we'll maybe still want to remove the possible starting separators from pathpattern somewhere in this code (I left a longer comment in ngclient: avoid lstrip(os.sep) on target paths #1506) -- the separator just shouldn't be clients os.sep because that makes no sense for pathpattern that comes from the server

tuf/ngclient/updater.py Outdated Show resolved Hide resolved
Comment on lines 168 to 176
invalid_use_cases = [
("targets/foo.tgz", "*.tgz"),
("/foo.tgz", "*.tgz",),
("targets/foo.tgz", "*"),
("foo/bar/k.tgz", "foo/bar/zoo/*"),
("foo/bar/zoo/k.tgz", "foo/bar/*"),
("foo-version-alpha.tgz", "foo-version-?.tgz"),
("foo//bar", "*o/bar"),
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

six cases out of seven actually hit if len(target_parts) != len(pathpattern_parts) branch so fnmatch() only gets called single time in all of this. Please make sure there are failing tests for fnmatch() in the filename part and also in a directory part

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed two of the test cases that seemed more redundant and added an additional one for a wrong directory regex.
Any other suggestions for tests?

tests/test_updater_ng.py Outdated Show resolved Hide resolved
@joshuagl
Copy link
Member

We had a discussion with @joshuagl about what targets do we expect as arguments
and his opinion was that pathpattern and targetpath should point in a local directory
(or at least that's how I interpreted his words).

I think I just reiterated what the spec suggests – that we should expect a relative path using UNIX path separators.

It's possible that we should add an explicit section to the spec, or the oft discussed and frequently desired secondary literature theupdateframework/specification#91, on path handling?

Copy link
Member

@jku jku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(marking "changes requested" so it's visible in PR list)

@MVrachev MVrachev force-pushed the glob-pattern-matching branch 3 times, most recently from 470f753 to 79c41f8 Compare August 17, 2021 21:25
@MVrachev
Copy link
Collaborator Author

MVrachev commented Aug 17, 2021

I had to rebase on top of develop and move _is_target_in_pathpattern inside tuf/api/metadata.py.
Additionally, I changed some of the tests and added a commit removing the lstrip on os.sep

In order to fix #1506 do I need to document that we don't use lsstrip, because in the issue description it's said:
Define a valid target file path and pathpattern strings and reject everything else.?

@MVrachev MVrachev requested a review from jku August 25, 2021 11:40
@MVrachev MVrachev force-pushed the glob-pattern-matching branch 3 times, most recently from 9d7c490 to 16ca84d Compare August 25, 2021 18:36
@MVrachev
Copy link
Collaborator Author

I updated the comments and commit message to use glob pattern instead of regex.

Copy link
Member

@jku jku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, functionality looks fine I think and moving the code to Metadata API looks good to me. Two questions/comments:

  • should the function be a stand-alone one? I'm not against this sort of design in general but it looks quite out of place in Metadata API: I think this would be the first top-level function in tuf.api.metadata so far. As an alternative it could be a private staticmethod on DelegatedRole, expressing that this is not some generic pattern matching function but an implementation detail of DelegatedRole
  • commenting feels heavy (are all comments useful, will they stay up-to-date?): the complete path pattern handling in DelegatedRole is now ~8 lines of fairly straight forward code covered by 13 lines of comments. At least the large comment in is_delegated_path() needs an update as it's repetitive and already outdated (as no stripping is done).

Please fix the outdated comment. I'll defer to your decision on the other raised issues.

@MVrachev
Copy link
Collaborator Author

Thanks, functionality looks fine I think and moving the code to Metadata API looks good to me. Two questions/comments:

  • should the function be a stand-alone one? I'm not against this sort of design in general but it looks quite out of place in Metadata API: I think this would be the first top-level function in tuf.api.metadata so far. As an alternative it could be a private staticmethod on DelegatedRole, expressing that this is not some generic pattern matching function but an implementation detail of DelegatedRole

I agree to make the function a private static method in DelegatedRole sounds like the most logical solution.

  • commenting feels heavy (are all comments useful, will they stay up-to-date?): the complete path pattern handling in DelegatedRole is now ~8 lines of fairly straight forward code covered by 13 lines of comments. At least the large comment in is_delegated_path() needs an update as it's repetitive and already outdated (as no stripping is done).

Please fix the outdated comment. I'll defer to your decision on the other raised issues.

I removed the strip comments from is_delegated_path as well as another comment from _is_target_in_pathpattern that was non essential I think.

@MVrachev MVrachev force-pushed the glob-pattern-matching branch 2 times, most recently from aa0bf0f to 6d7ac41 Compare August 26, 2021 11:32
According to the recently updated version of the specification the shell
style wildcard matching is glob-like (see theupdateframework/specification#174),
and therefore a path separator in a path should not be matched by a
wildcard in the PATHPATTERN.

That's not what happens with `fnmatch.fnmatch()` which doesn't
see "/" separator as a special symbol.
For example: fnmatch.fnmatch("targets/foo.tgz", "*.tgz") will return
True which is not what glob-like implementation will do.

We should make sure that target_path and the pathpattern contain the
same number of directories and because each part of the pathpattern
could include a glob pattern we should check that fnmatch.fnmatch() is
true on each target and pathpattern directory fragment separated by "/".

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
@MVrachev
Copy link
Collaborator Author

@jku do you think the changes I made in 6ee2dac are enough to consider #1506 solved?

Copy link
Member

@jku jku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left one comment about the docstring but I'll leave decision to you: current one seems ok too.

tuf/api/metadata.py Outdated Show resolved Hide resolved
For targetpath: we don't want to support corner cases such as
file paths starting with separator.
Why this case should be threated specially than any other case where
you have multiple "/" for example "foo//bar/tar.gz"?

For pathpattern: it's recommended that the separator in the pathpattern
should be "/":
see https://theupdateframework.github.io/specification/latest/#targetpath
I believe it could lead to issues for a client implementation if it
supports arbitrary separators - every implementation needs to choose one
and stick with it.
Then, if we decide that "/" is our separator using lstrip on "os.sep" is
wrong, because the os separator from the server could be different that
the one used in the client.

Because of the above arguments, it makes sense to just remove
lstrip on os separators.

Additionally, document that the target_filepath and the DelegatedRole
paths are expected to be in their canonical forms and only "/" is
supported as target path separator.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>

in the public API that we only support "/" as a
separator and don't handle corner cases such as leading separators
in either pathpattern or target_filepath.
@jku jku merged commit 7d77eee into theupdateframework:develop Aug 31, 2021
@MVrachev MVrachev deleted the glob-pattern-matching branch September 1, 2021 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ngclient: implement glob-like pattern matching
4 participants