Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved URL Parsing #111

Merged
merged 20 commits into from
Apr 13, 2020
Merged

Improved URL Parsing #111

merged 20 commits into from
Apr 13, 2020

Conversation

brendan-matroid
Copy link
Contributor

@brendan-matroid brendan-matroid commented Mar 24, 2020

  • Use shellquote to split lines in commands.txt files so filenames with spaces can be used
  • Escape regex characters in prefixes when initializing ObjectURLs. These may be a part of valid S3 prefixes.

Copy link
Member

@igungor igungor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Brendan, thanks for the pull request.

Changes are looking good. I've a few requests though:

  • Could you add tests please? You can find sample tests at e2e/run_test.go.
  • We vendor dependencies. Could you run go mod tidy && go mod vendor and commit the changes?

Thanks

objurl/objurl.go Outdated Show resolved Hide resolved
@igungor igungor modified the milestone: v1.0.0 Mar 24, 2020
@brendan-matroid
Copy link
Contributor Author

Hey @igungor, thanks for the review.

Are there any steps I need to take before running the tests locally? I'm seeing a bunch of errors similar to this:

--- FAIL: TestCopyMultipleFlatS3ObjectsToLocal (0.05s)
    cp_test.go:138: assertion failed: directory /tmp/TestCopyMultipleFlatS3ObjectsToLocal-726849334/workdir does not match expected:
        /another_test_file.txt
          mode: expected -rw-r--r-- got -rw-rw-r--
        /filename-with-hypen.gz
          mode: expected -rw-r--r-- got -rw-rw-r--
        /readme.md
          mode: expected -rw-r--r-- got -rw-rw-r--
        /testfile1.txt
          mode: expected -rw-r--r-- got -rw-rw-r--

@igungor
Copy link
Member

igungor commented Mar 25, 2020

Are there any steps I need to take before running the tests locally? I'm seeing a bunch of errors similar to this:

--- FAIL: TestCopyMultipleFlatS3ObjectsToLocal (0.05s)
    cp_test.go:138: assertion failed: directory /tmp/TestCopyMultipleFlatS3ObjectsToLocal-726849334/workdir does not match expected:
        /another_test_file.txt
          mode: expected -rw-r--r-- got -rw-rw-r--
        /filename-with-hypen.gz
          mode: expected -rw-r--r-- got -rw-rw-r--
        /readme.md
          mode: expected -rw-r--r-- got -rw-rw-r--
        /testfile1.txt
          mode: expected -rw-r--r-- got -rw-rw-r--

Some of the tests are failing due to the new o.Prefix = regexp.QuoteMeta(o.Prefix) line. The above test has an assertion that file modbits are expected to be 0644 but got 0664.

New run test is failing because of the prefix change as well.

--- FAIL: TestRunSpecialCharactersInPrefix (0.20s)
    run_test.go:247: line 0: equals: (-want +got):
          string(
        - 	"cp s3://test-run-special-characters-in-prefix/special-chars_!@#$%!^(MISSING)&_()_+{[_%!C(MISSING)äè| __;'_,_._-中文 =/_!@#$%!^(MISSING)&_()_+{[_%!C(MISSING)äè| __;'_,_._-中文 =image.jpg",
        + 	"",
          )
    run_test.go:247: expected a comparison function for line "cp s3://test-run-special-characters-in-prefixspecial-chars_!@#$%^&_()_+{[_%5Cäè| __;'_,_._-中文 =/_!@#$%^&_()_+{[_%5Cäè| __;'_,_._-中文 =image.jpg" (lineno: 1)
    run_test.go:247: cp s3://test-run-special-characters-in-prefixspecial-chars_!@#$%^&_()_+{[_%5Cäè| __;'_,_._-中文 =/_!@#$%^&_()_+{[_%5Cäè| __;'_,_._-中文 =image.jpg

That'd be very helpful if you could provide a concrete use-case to demonstrate the problem. We can debug the issue as well.

@igungor
Copy link
Member

igungor commented Apr 6, 2020

That'd be very helpful if you could provide a concrete use-case to demonstrate the problem. We can debug the issue as well.

Ping @brendan-matroid

@brendan-matroid
Copy link
Contributor Author

brendan-matroid commented Apr 6, 2020

Hey @igungor, sorry for the delay.
There are two separate use cases addressed by this PR:

  1. Some AWS S3 prefixes may contain spaces. E.g. s3://some-bucket/some prefix/some file is a valid S3 path. Splitting commands by whitespace in s5cmd run does not permit copying to/from such files.

  2. Escaping regex metacharacters for the prefix portion of the wildcard is necessary in cases where the path contains regex metacharacters before the wildcard. This is also problematic in cases where there is no wildcard (since the entire path is treated as the prefix).

At objurl.go:220: filterRegex = o.Prefix + filterRegex which is then compiled in the next line.

If o.Prefix contains unescaped regex metacharacters then they will be compiled (incorrectly), which may either result in some unexpected, valid regex (and silently be used in addition to any regexes resulting from wildcard expressions), or (in the case of the s3 prefix used in the added testcase), it will result in an invalid regex and compilation of said regex will throw an error.

@igungor
Copy link
Member

igungor commented Apr 12, 2020

There are two separate use cases addressed by this PR:

1. Some AWS S3 prefixes may contain spaces. E.g. `s3://some-bucket/some prefix/some file` is a valid S3 path. Splitting commands by whitespace in `s5cmd run` does not permit copying to/from such files.

I'm OK with this change. I think it is necessary.

2. Escaping regex metacharacters for the `prefix` portion of the wildcard is necessary in cases where the path contains regex metacharacters before the wildcard. This is also problematic in cases where there is no wildcard (since the entire path is treated as the `prefix`).

At objurl.go:220: filterRegex = o.Prefix + filterRegex which is then compiled in the next line.

If o.Prefix contains unescaped regex metacharacters then they will be compiled (incorrectly), which may either result in some unexpected, valid regex (and silently be used in addition to any regexes resulting from wildcard expressions), or (in the case of the s3 prefix used in the added testcase), it will result in an invalid regex and compilation of said regex will throw an error.

I understand the problem now, thanks for the detailed answer.

The problem with this change is that u.Prefix needs to stay as is since it's the canonical prefix information. The change should be, as you said earlier, at objurl.go:220: filterRegex = o.Prefix + filterRegex. It should be filterRegex = regexp.QuoteMeta(o.Prefix) + filterRegex.

Could you try and check if it's working for your use case? If so, and if you're still interested, could you rebase your branch please, so that we can progress.

Copy link
Member

@sonmezonur sonmezonur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR again. Mostly LGTM with minor comments.

e2e/util_test.go Outdated Show resolved Hide resolved
e2e/run_test.go Outdated Show resolved Hide resolved
@ilkinulas
Copy link
Member

will re-open to trigger circle-ci

@ilkinulas ilkinulas closed this Apr 13, 2020
@ilkinulas ilkinulas reopened this Apr 13, 2020
@igungor
Copy link
Member

igungor commented Apr 13, 2020

will re-open to trigger circle-ci

let's try again now.

@igungor igungor closed this Apr 13, 2020
@igungor igungor reopened this Apr 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants