-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GB18030] Fix "Unable to translate Unicode character" #10063
Conversation
This change looks more like a workaround. Do we understand why \uD86E fails to be encoded to UTF-8? For what it's worth, this test program runs fine for me: var defaultEncoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
File.WriteAllText("C:\\temp\\encoded.txt", "\u3407\ud840\udc60\ud86a\ude30\ud86e\udc0a\ud86e\udda0\ud879\udeae\u2fd5\u0023", defaultEncoding);
|
True, actually smallest repro does not help here either:
works. Edit: |
If we moved
to C#:
it should fix the issue. MsBuild does not provide a way to detect surrogates and if the last char is a beginning of surrogate, it cuts it in half. |
Tagging @JanKrivanek as this is an unfortunate regression introduced with the recent stable hash changes. Jan, do you think we can get away without shortening? if not, it would probably make sense to add a string shortening property function which would be safe with respect to surrogates. |
Oh good find! Shortening was one of the requests that the change was addressing. Let's move it to prop function |
@ilonatommy, rather than a task, we would prefer the functionality to be exposed as an intrinsic property function. Here's a PR where one was recently added: #9665 |
@JanKrivanek could you please take over? I'm not sure how to test the changes and how to access |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thnak you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
The new prop function should be added to the doc page: https://learn.microsoft.com/en-us/visualstudio/msbuild/property-functions?view=vs-2022#msbuild-property-functions. Not urgent but we should make sure we don't forget. |
<!-- For a long MSBuildProjectFile let's shorten this to 17 chars - using the first 8 chars of the filename and a filename hash. --> | ||
<MSBuildCopyMarkerName Condition="'$(MSBuildCopyMarkerName.Length)' > '17'">$(MSBuildProjectFile.Substring(0,8)).$([MSBuild]::StableStringHash($(MSBuildProjectFile)).ToString("X8"))</MSBuildCopyMarkerName> | ||
<!-- For a long MSBuildProjectFile let's shorten this to 17 chars - using the first 8 codepoints of the filename and a filename hash. --> | ||
<MSBuildCopyMarkerName Condition="'$(MSBuildCopyMarkerName.Length)' > '17'">$([MSBuild]::SubstringByTextElements($(MSBuildProjectFile), 0, 8)).$([MSBuild]::StableStringHash($(MSBuildProjectFile)).ToString("X8"))</MSBuildCopyMarkerName> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are now text elements rather than codepoints. Which has consequences:
$(MSBuildCopyMarkerName.Length)
can still be greater than 17 after this shortening.- If
$(MSBuildProjectFile)
is👨👨👦👦👨👨👦👦.proj
, then StringInfo.SubstringByTextElements(0, 8) will throw ArgumentOutOfRangeException. That string has 27 UTF-16 code units but only 7 text elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're going to have an [MSBuild]::SubstringByTextElements intrinsic function, then either it should check for out-of-range arguments itself, or there should be an [MSBuild]::CountTextElements function as well. Otherwise it's too difficult to use reliably in projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think what this really needs is a property function that returns all the text elements that entirely fit within the specified substring, with arguments given in UTF-16 code units. So that the resulting string never has more UTF-16 code units than requested. Because those are what matters for file name lengths in NTFS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are excellent points. We may get away with this functionality to be just a best-effort "make the file name reasonably long" but we definitely must fix the possible ArgumentOutOfRangeException and it's worth thinking about a more useful property function too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider moving the entire operation to a new, dedicated property function so we could just do
<MSBuildCopyMarkerName>$([MSBuild]::ShortUniqueProjectName())</MSBuildCopyMarkerName>
or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still keep it general for other strings as well - there seems to be other usages in org on path args that might need revision: https://github.com/search?q=repo%3Adotnet%2Fsdk+path%3A**%2F*.targets+substring&type=code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After re-checking the list of places where we use substring
, none of them require a PR.
- code in net5 - we do not backport to it
- https://github.com/dotnet/sdk/blob/52acdcaad568cc56ca67b33b6b2ea7fa1e489d24/src/Tasks/Microsoft.NET.Build.Tasks/targets/Microsoft.NET.Publish.targets#L301 - the substring is done on a hash of a string, so it's always in a form e.g.
_NormalizedPublishDirHash = 13cbcd0c3203541f1c707c5ea0b45e9f20f4d3cf71f0e2b68e723cab7e9427c3
- no surrogates. - https://github.com/dotnet/sdk/blob/1f91c5339897b46f5119ed5d43d9b29be237aca5/src/BuiltInTools/dotnet-watch/DotNetWatch.targets#L87 - the substring is executed only if the
Identity
starts with a 7-char string:wwwroot/
, so theSubstring(8)
will never cut a surrogate in half.
Fixes #10066
[GB18030] Failed to run the WebAPI project that named as GB18030 provided level3 strings
Context
OS: Windows 11 Enterprise 22H2 ZH-CN
Affected Build: 9.0.0-preview.4.24209.5, Aso repro on 8.0.300-preview.24203.14(8.0.2)
Steps to reproduce:
Use CLI to create WebAPI project copying GB18030 characters, for example:
Without this PR:
With this PR:
Changes Made
Changed the way we create a substring to prevent cutting surrogates in half.
Testing
Added a test for new substring function.
Regerssion
Yes, after #9346.