-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SourceId: Test and fix ambiguous serialization of Git references #11086
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @ehuss (or someone else) soon. Please see the contribution instructions for more information. |
Re compatibility, this changes the way some git urls are written to Cargo.lock if they contain certain characters (anything outside of Running cargo update before/after switching cargo versions will write different git urls when they contain characters that are now escaped. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the pull request! It is really unfortunate that Cargo parses it incorrectly 😞.
The first thing is Cargo should display a friendly readable URL. Looks like domain and path parts are already displayed in encoded format. I would argue that maybe EncodablePackageId
should use a dedicated function other than relying on Display
trait to perform encoding, so that we can decode URI to get a pretty display.
The other thing I feel uncertain is the change of lockfile serialization. As you mentioned, lockfile will be updated after this change. cargo build --locked
will start failing after this PR, since Cargo detects changes in lockfile. Outside Cargo world, there may exist other dependency analysis tools would get confused as well.
@@ -273,7 +273,7 @@ fn cargo_compile_git_dep_pull_request() { | |||
.cargo("build") | |||
.with_stderr(&format!( | |||
"[UPDATING] git repository `{}`\n\ | |||
[COMPILING] dep1 v0.5.0 ({}?rev=refs/pull/330/head#[..])\n\ | |||
[COMPILING] dep1 v0.5.0 ({}?rev=refs%2Fpull%2F330%2Fhead#[..])\n\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unlucky that we display an encoded format 😢
If a branch name contains non-ASCII character it will be hard to read.
This adds a failing test for rust-lang#11085. For some branch names (or tags or refs), generate_lockfile will serialize poorly. Parsing the Cargo.lock will refer to a branch that wasn't the intended one in Cargo.toml. Because a branch name can contain the '#' character, the precise reference could be corrupted through a branch/tag/ref as well. This is because serialisation is done with a custom Display implementation that concatenates strings as-is and is unaware that some characters may need escaping; on the other hand, deserialization uses url::Url::query_pairs which assumes <https://url.spec.whatwg.org/#application/x-www-form-urlencoded>.
Use x-www-form-urlencoded which is the standard for query strings, matching what the deserializer expects. https://url.spec.whatwg.org/#application/x-www-form-urlencoded Fixes rust-lang#11085.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, sorry for taking such a long time to revisit this.
The Cargo team discussed this a while back. This is definitely an overlook from our side. However, we tend not to change encoding for this that work today. Besides breaking the current lockfile foramt, I forgot the other specific reasons we concluded during the meeting 😅.
Looking into this today again. In my opinion, I feel like this issue
- affects a relatively small portion(?) using special characters in their branch/tag/rev.
- has a simple workaround by changing the name of the Git reference.
Your solution is not wrong. My own take is we wait for more lockfile format changes emerging, and then we do a version bump of lockfile format for a certain amount of changes all at once.
What if we want a fix now?
To fix it now before the next lockfile version bump, I'd recommend implementing an ad-hoc URL parser. It shouldn't be too hard to have a naive one here. It may look like:
rsplit_once('#')
to get the precise#<commit-sha>
.rsplit_once(&["branch=", "rev=", "tag="])
to get the Git reference.- If none of the above works, fallback to the old logic.
We may need to check in what scenario #<commit-sha>
is missing in source
field in lockfile, as well as any assumption we will make in this simple parser.
I appreciate the review feedback, and I would be happy to see where the cargo team discussed this, if there's a transcript or a good discussion channel. I want to stress once more that this is not a parsing problem and that the parsing path should not be changed. Serializing two different inputs to the same serial form is always a problem on the serialisation side. The information is lost after serialisation, and deserialisation will never be able to divine the intended branch name no matter how ad-hoc the parser. Let's keep the perfectly good deserialisation path, it follows the WHATWG URL standard, it should stay there. So, here is what I'm suggesting: Deserialisation will stay the same pre and post version bump, no ambiguities there. This way will minimise churn to VCS-tracked lockfiles. |
True that we should keep deserialization correct. I'll label this as https://github.com/rust-lang/cargo/labels/S-blocked at this moment. We'll revisit once we decide to have a lockfile version bump. Thank you for your efforts and suggestions so far! |
☔ The latest upstream changes (presumably #11501) made this pull request unmergeable. Please resolve the merge conflicts. |
#12120 is an issue tracking for all bugfixes needing a lockfile version bump. Given this PR isn't likely to merge in its current status, I'll go ahead closing it. We can track its progress in #11085 and #12120. If I got time I'll have this fixed within a lockfile version bump (or someone gets it faster than me). Thank you for calling out the ambiguity and the fix! |
fix: encode URL params correctly for SourceId in Cargo.lock We use [`form_urlencoded::byte_serialize`](https://docs.rs/form_urlencoded/1.2.0/form_urlencoded/fn.byte_serialize.html), which is re-exported by `url` crate. Tests are copied from <#11086>. Kudos to the original author!
For some branch names (or tags or refs), generate_lockfile will serialize poorly. Parsing the Cargo.lock will refer to a branch that wasn't the intended one in Cargo.toml.
Because a branch name can contain the '#' character, the precise reference could be corrupted through a branch/tag/ref as well.
First commit adds a test showing the problem, second commit fixes it by having the serializer match the deserializer by encoding a x-www-form-urlencoded query string.
Fixes #11085.