Implement glob-like pattern matching #1512
Comparing the dirnames is a clever trick that matches the most common examples in TUF. But if we want to implement proper "globbing" it should be applied on each of the components of a pathname separately.
Should this be a valid test case too (I don't know if someone would use it 🤷): ("foo/bar/zoo/k.tgz", "foo/*/zoo/*")?
Also IIUC, '*' should not match path separator, I've added comments about it.
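A minimal sketch of what applying the glob to each path component separately could look like, assuming canonical, "/"-separated paths. The helper name `matches_pathpattern` is hypothetical and is not taken from this PR; only `fnmatch.fnmatch()` is a real standard-library call.

```python
import fnmatch


def matches_pathpattern(targetpath: str, pathpattern: str) -> bool:
    """Glob-like match in which a wildcard never crosses a "/" separator."""
    target_parts = targetpath.split("/")
    pattern_parts = pathpattern.split("/")
    # A wildcard cannot absorb a separator, so both sides need the same depth.
    if len(target_parts) != len(pattern_parts):
        return False
    # Each component must match the pattern component at the same position.
    return all(
        fnmatch.fnmatch(target, pattern)
        for target, pattern in zip(target_parts, pattern_parts)
    )


print(matches_pathpattern("foo/bar/zoo/k.tgz", "foo/*/zoo/*"))  # True
print(matches_pathpattern("targets/foo.tgz", "*.tgz"))          # False
```

With this reading, the suggested case ("foo/bar/zoo/k.tgz", "foo/*/zoo/*") is a match, while '*' on its own no longer reaches into subdirectories.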
After the good review by @sechkova, I decided to change my approach. As Teodora suggested, I will focus on checking the list of directories returned by Please let me know if any of the reviewers have ideas for a combination of |
Added type annotations and better function description for |
This seems to be handling directories as a special case. I don't think you'll be able to make that work: as Teodora said globbing should probably be applied on each of the components of a pathname separately (unless of course the python base modules include a method to handle the whole path). Think of e.g. these cases to see new issues:
Other quick comments:
I added a new commit in which I change my approach.
We had a discussion with @joshuagl about what targets do we expect as arguments Additionally, I run all of our tests and checked what do we pass to |
If you mean this code expects the "paths" to be canonical representation (so things like "foo//bar" or "foo/../foo/bar" are not supported) that seems correct. Or in other words: I'm fine with requiring well-formed input, I'm cautious about defining things as corner cases :)
The spec is broken about this IMO but I don't think we can or should do that in the implementation: how could the client know which separator the server is using? The API as it's currently designed should maybe accept filesystem paths as input, but I believe the input handling should immediately make the transmogrification to a URL-path so the rest of the code can assume everything is a URL/posix path... Alternatively someone needs to explain to me how we can support random directory separators in all of the code. As an example: if pathpatterns and targetpaths were filesystem paths, then you are missing all the Windows path tests, and all the tests with half Windows paths and half posix paths.
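One possible reading of "convert at the input boundary" is sketched below. It assumes the caller knows which flavour of filesystem path it was handed; the helper name `to_url_path` and its `windows` flag are hypothetical and are not part of the TUF API.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_url_path(filesystem_path: str, windows: bool) -> str:
    """Convert a filesystem path to a "/"-separated path (hypothetical helper)."""
    flavour = PureWindowsPath if windows else PurePosixPath
    return flavour(filesystem_path).as_posix()


print(to_url_path("targets\\foo.tgz", windows=True))   # targets/foo.tgz
print(to_url_path("targets/foo.tgz", windows=False))   # targets/foo.tgz
print(to_url_path("foo/bar\\k.tgz", windows=True))     # foo/bar/k.tgz
```

Note that pathlib also collapses repeated separators, which may or may not be wanted if only canonical input is supposed to be accepted.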
Maybe makes sense to finish this after #1463 is in but a few more specific comments:
- the matching looks about right now
- we'll maybe still want to remove the possible starting separators from pathpattern somewhere in this code (I left a longer comment in #1506, "ngclient: avoid lstrip(os.sep) on target paths"). The separator just shouldn't be the client's os.sep because that makes no sense for a pathpattern that comes from the server
tests/test_updater_ng.py
invalid_use_cases = [
    ("targets/foo.tgz", "*.tgz"),
    ("/foo.tgz", "*.tgz",),
    ("targets/foo.tgz", "*"),
    ("foo/bar/k.tgz", "foo/bar/zoo/*"),
    ("foo/bar/zoo/k.tgz", "foo/bar/*"),
    ("foo-version-alpha.tgz", "foo-version-?.tgz"),
    ("foo//bar", "*o/bar"),
]
six cases out of seven actually hit the if len(target_parts) != len(pathpattern_parts) branch, so fnmatch() only gets called a single time in all of this. Please make sure there are failing tests for fnmatch() in the filename part and also in a directory part
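The count can be checked by comparing component depths for the seven cases above. The sketch below does that and then adds two made-up cases of equal depth, one failing in the filename part and one in a directory part. The extra cases and the variable names are assumptions for illustration, not the tests that were eventually added.

```python
import fnmatch

invalid_use_cases = [
    ("targets/foo.tgz", "*.tgz"),
    ("/foo.tgz", "*.tgz"),
    ("targets/foo.tgz", "*"),
    ("foo/bar/k.tgz", "foo/bar/zoo/*"),
    ("foo/bar/zoo/k.tgz", "foo/bar/*"),
    ("foo-version-alpha.tgz", "foo-version-?.tgz"),
    ("foo//bar", "*o/bar"),
]

# Only cases with equal depth get past the length check and reach fnmatch().
reaches_fnmatch = [
    (target, pattern)
    for target, pattern in invalid_use_cases
    if len(target.split("/")) == len(pattern.split("/"))
]
print(reaches_fnmatch)  # [('foo-version-alpha.tgz', 'foo-version-?.tgz')]

# Hypothetical extra cases: same depth, so fnmatch() decides the outcome.
extra_invalid_cases = [
    ("foo/bar/k.tgz", "foo/bar/*.zip"),  # mismatch in the filename part
    ("foo/bar/k.tgz", "foo/b?z/*.tgz"),  # mismatch in a directory part
]
for target, pattern in extra_invalid_cases:
    pairs = list(zip(target.split("/"), pattern.split("/")))
    assert not all(fnmatch.fnmatch(t, p) for t, p in pairs)
```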
I removed two of the test cases that seemed more redundant and added an additional one for a wrong directory regex.
Any other suggestions for tests?
I think I just reiterated what the spec suggests: that we should expect a relative path using UNIX path separators. It's possible that we should add an explicit section on path handling to the spec, or to the oft discussed and frequently desired secondary literature (theupdateframework/specification#91)?
(marking "changes requested" so it's visible in PR list)
I had to rebase on top of develop and move In order to fix #1506 do I need to document that we don't use |
I updated the comments and commit message to use |
Thanks, functionality looks fine I think and moving the code to Metadata API looks good to me. Two questions/comments:
- should the function be a stand-alone one? I'm not against this sort of design in general but it looks quite out of place in Metadata API: I think this would be the first top-level function in tuf.api.metadata so far. As an alternative it could be a private staticmethod on DelegatedRole, expressing that this is not some generic pattern matching function but an implementation detail of DelegatedRole
- commenting feels heavy (are all comments useful, will they stay up-to-date?): the complete path pattern handling in DelegatedRole is now ~8 lines of fairly straightforward code covered by 13 lines of comments. At least the large comment in is_delegated_path() needs an update as it's repetitive and already outdated (as no stripping is done).
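The private staticmethod alternative could look roughly like the sketch below. `DelegatedRoleSketch` is a simplified stand-in and not the real `tuf.api.metadata.DelegatedRole`; the helper name `_is_target_in_pathpattern` is an assumption chosen for illustration.

```python
import fnmatch
from typing import List


class DelegatedRoleSketch:
    """Simplified stand-in; not the real tuf.api.metadata.DelegatedRole."""

    def __init__(self, paths: List[str]) -> None:
        self.paths = paths

    @staticmethod
    def _is_target_in_pathpattern(targetpath: str, pathpattern: str) -> bool:
        """Hypothetical private helper: match component by component."""
        target_parts = targetpath.split("/")
        pattern_parts = pathpattern.split("/")
        if len(target_parts) != len(pattern_parts):
            return False
        return all(
            fnmatch.fnmatch(t, p) for t, p in zip(target_parts, pattern_parts)
        )

    def is_delegated_path(self, target_filepath: str) -> bool:
        """Return True if any of the role's path patterns matches."""
        return any(
            self._is_target_in_pathpattern(target_filepath, pattern)
            for pattern in self.paths
        )


role = DelegatedRoleSketch(["targets/*.tgz"])
print(role.is_delegated_path("targets/foo.tgz"))      # True
print(role.is_delegated_path("targets/sub/foo.tgz"))  # False
```

Keeping the helper private and attached to the class signals that it is an implementation detail of the role rather than a general-purpose matcher.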
Please fix the outdated comment. I'll defer to your decision on the other raised issues.
I agree to make the function a private static method in
I removed the strip comments from |
According to the recently updated version of the specification the shell style wildcard matching is glob-like (see theupdateframework/specification#174), and therefore a path separator in a path should not be matched by a wildcard in the PATHPATTERN.

That's not what happens with `fnmatch.fnmatch()` which doesn't see "/" separator as a special symbol. For example: fnmatch.fnmatch("targets/foo.tgz", "*.tgz") will return True which is not what glob-like implementation will do.

We should make sure that target_path and the pathpattern contain the same number of directories and because each part of the pathpattern could include a glob pattern we should check that fnmatch.fnmatch() is true on each target and pathpattern directory fragment separated by "/".

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
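The behaviour described in this commit message can be reproduced with the standard library alone. The sketch below only demonstrates the problem and the depth comparison; the `component_count` helper is hypothetical.

```python
import fnmatch

# fnmatch treats "/" like any other character, so "*" runs across directories.
print(fnmatch.fnmatch("targets/foo.tgz", "*.tgz"))        # True
print(fnmatch.fnmatch("foo/bar/zoo/k.tgz", "foo/bar/*"))  # True


def component_count(path: str) -> int:
    """Number of "/"-separated fragments in a path (hypothetical helper)."""
    return len(path.split("/"))


# A glob-like matcher can reject both pairs up front: the depths differ.
print(component_count("targets/foo.tgz"), component_count("*.tgz"))        # 2 1
print(component_count("foo/bar/zoo/k.tgz"), component_count("foo/bar/*"))  # 4 3
```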
LGTM, I left one comment about the docstring but I'll leave decision to you: current one seems ok too.
For targetpath: we don't want to support corner cases such as file paths starting with separator. Why should this case be treated differently from any other case where you have multiple "/", for example "foo//bar/tar.gz"?

For pathpattern: it's recommended that the separator in the pathpattern should be "/": see https://theupdateframework.github.io/specification/latest/#targetpath

I believe it could lead to issues for a client implementation if it supports arbitrary separators - every implementation needs to choose one and stick with it.

Then, if we decide that "/" is our separator, using lstrip on "os.sep" is wrong, because the os separator from the server could be different than the one used in the client.

Because of the above arguments, it makes sense to just remove lstrip on os separators.

Additionally, document that the target_filepath and the DelegatedRole paths are expected to be in their canonical forms and only "/" is supported as target path separator.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
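The client-dependence of `lstrip(os.sep)` can be illustrated without switching machines, because the standard library exposes both separators as `posixpath.sep` and `ntpath.sep`. The path below is made up for the example.

```python
import ntpath
import posixpath

server_path = "/foo.tgz"  # made-up path with a leading "/" sent by a server

# os.sep equals posixpath.sep on POSIX clients and ntpath.sep on Windows ones.
results = {
    "posix client": server_path.lstrip(posixpath.sep),
    "windows client": server_path.lstrip(ntpath.sep),
}
for client, stripped in results.items():
    print(client, repr(stripped))
# posix client 'foo.tgz'
# windows client '/foo.tgz'
```

The same metadata would therefore be interpreted differently depending on where the client runs, which is the argument for dropping the strip altogether.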
Fixes #1505, #1506
Description of the changes being introduced by the pull request:
Implement glob-like pattern matching - 1505

According to the recently updated version of the specification the shell style wildcard matching is glob-like (see theupdateframework/specification#174), and therefore a path separator in a path should not be matched by a wildcard in the PATHPATTERN.

That's not what happens with fnmatch.fnmatch(), which doesn't see the "/" separator as a special symbol. For example: fnmatch.fnmatch("targets/foo.tgz", "*.tgz") will return True, which is not what a glob-like implementation will do.

We should make sure that target_path and the pathpattern contain the same number of directories, and because each part of the pathpattern could include a glob pattern, we should check that fnmatch.fnmatch() is true on each target and pathpattern directory fragment separated by "/".

Avoid lstrip(os.sep) on target paths - 1506

For targetpath: we don't want to support corner cases such as file paths starting with separator. Why should this case be treated differently from any other case where you have multiple "/", for example "foo//bar/tar.gz"?

For pathpattern: it's recommended that the separator in the pathpattern should be "/": see https://theupdateframework.github.io/specification/latest/#targetpath

I believe it could lead to issues for a client implementation if it supports arbitrary separators - every implementation needs to choose one and stick with it.

Then, if we decide that "/" is our separator, using lstrip on "os.sep" is wrong, because the os separator from the server could be different than the one used in the client.

Because of the above arguments, it makes sense to just remove lstrip on os separators.

Additionally, document in the public API that we only support "/" as a separator and don't handle corner cases such as leading separators in either pathpattern or target_filepath.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Please verify and check that the pull request fulfills the following
requirements: