Breaking change proposal: Encoding.UTF8 singleton should not have a BOM #51353
Comments
Tagging subscribers to this area: @tarekgh, @krwq, @eiriktsarpalis, @layomia
|
I am a supporter of this proposal. We need to provide a config switch to go back to old behavior if needed. |
IMO this is such a significant (and difficult to discover) breaking change that "provide a config switch to go back to old behavior" is not sufficient. I propose:
|
I wonder if default encoding can be changed from ANSI to UTF8, as a breaking change. |
I am confused. The documentation says that it does include a BOM. Which method does that? When I try
var b = System.Text.Encoding.UTF8.GetBytes("Hello world");
foreach (byte bb in b)
    Console.WriteLine(bb.ToString("X2"));
This outputs (on .NET 5):
|
Ah, I see. |
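For readers following the exchange above, here is a minimal sketch of where the BOM comes from. It assumes a current .NET release in which Encoding.UTF8 still reports a three-byte preamble; the variable names are illustrative only. GetBytes encodes just the characters, while GetPreamble exposes the BOM that writers such as StreamWriter emit at the start of a stream.

```csharp
using System;
using System.IO;
using System.Text;

// GetBytes encodes only the characters; it never prepends a BOM.
byte[] payload = Encoding.UTF8.GetBytes("Hello world");

// GetPreamble returns the BOM bytes that a writer may choose to emit first.
byte[] preamble = Encoding.UTF8.GetPreamble();

// StreamWriter queries the preamble and writes it at the start of an empty stream.
byte[] written;
using (var stream = new MemoryStream())
{
    using (var writer = new StreamWriter(stream, Encoding.UTF8))
    {
        writer.Write("Hello world");
    }
    // ToArray remains usable after the stream has been closed.
    written = stream.ToArray();
}

Console.WriteLine($"GetBytes length:     {payload.Length}");
Console.WriteLine($"Preamble:            {BitConverter.ToString(preamble)}");
Console.WriteLine($"StreamWriter length: {written.Length}");
```

If the proposal in this issue were adopted, the preamble reported by Encoding.UTF8 would presumably become empty and the StreamWriter output would shrink by three bytes.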
It seems confusing if |
UTF32 and UTF16 need a Byte Order Mark to indicate the endianness of the data apart from anything else; UTF8 doesn't have any endianness so doesn't require it for that purpose. Also while ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added. |
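The ASCII-compatibility point can be checked with a short sketch. The two UTF8Encoding instances are constructed explicitly so the result does not depend on what Encoding.UTF8 returns; the variable names are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

string text = "Hello";
byte[] ascii = Encoding.ASCII.GetBytes(text);

var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);

// Simulate what a writer does: preamble first, then the encoded characters.
byte[] withoutBom = utf8NoBom.GetPreamble().Concat(utf8NoBom.GetBytes(text)).ToArray();
byte[] withBom = utf8WithBom.GetPreamble().Concat(utf8WithBom.GetBytes(text)).ToArray();

// BOM-less UTF-8 output of ASCII-only text is byte-for-byte identical to ASCII.
bool sameWithoutBom = ascii.SequenceEqual(withoutBom);

// With the BOM, three non-ASCII bytes precede the text, so the outputs differ.
bool sameWithBom = ascii.SequenceEqual(withBom);

Console.WriteLine($"Identical without BOM: {sameWithoutBom}");
Console.WriteLine($"Identical with BOM:    {sameWithBom}");
```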
|
Obsolete |
This breaking change can break msbuild or powershell scripts. For example, the fix from dotnet/runtimelab#782 would break since it depends on |
@jkotas Yeah, the challenge is to weigh how many people will be broken against how many future .NET developers won't run into this trap. In your scenario, the ideal resolution would be to update the tooling to understand UTF-8 natively without a BOM. If this is impractical and the file format remains "you must emit a BOM before writing UTF-8", then this falls squarely under the scenario at https://www.unicode.org/L2/L2021/21038-bom-guidance.pdf, bottom of pg. 6. In that case, I'd suggest the following three changes:
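Encoding encoding = Encoding.GetEncoding(requestedEncoding);
if (encoding.CodePage == Encoding.UTF8.CodePage)
{
    // Honor the caller's BOM preference for UTF-8.
    encoding = new UTF8Encoding(writeBom);
}
else if (encoding.CodePage == Encoding.Unicode.CodePage)
{
    // UnicodeEncoding has no single-argument constructor; little-endian matches Encoding.Unicode.
    encoding = new UnicodeEncoding(bigEndian: false, byteOrderMark: writeBom);
}
else if (/* ... */) { } |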
In this case, it would be a breaking change and/or new feature for VC++ compiler/linker.
Agree. In fact, msbuild has an issue on this already: dotnet/msbuild#6168. It sounds like we would have some coordination to do for this one with msbuild and other similar projects. |
Mono mscorlib.dll used to have |
We discussed a little bit internally the idea of having I'm not sold on that as a good long-term solution. The spirit of this work item is that we want to reduce the number of developers who are exposed to the concept of a BOM. By having static factories for "with BOM" and "without BOM", we'd be foisting this concept upon every developer who starts typing |
I am strongly against this proposal. Not only is the current behavior documented such that applications can depend on it (object on compatibility grounds), but the BOM has proven widely beneficial in my experience at avoiding encoding errors in documents that change over time (object to the principle of the proposed direction). |
I completely agree that BOM-less UTF8 is definitely better and having only one instead of two brings better developer experience. But ABI-breaking changes should be the last resort and should not be made for "somewhat better" developer experience. Aged platforms have their appropriate reasons for technical choices. To not make ABI breaking changes, there are still ways to improve developer experience with updating the API:
|
@GrabYourPitchforks is there any news here? Did we reach any consensus? Is this conversation only about Encoding.UTF8 or also |
Good catch, @krwq: By contrast, (A systematic review of the docs with respect to recommending / discouraging a UTF-8 BOM is called for either way, as certain pages contradict each other.) I think consistency is called for, and my vote is to consistently default to BOM-less UTF-8 and only ever return a with-BOM instance if explicitly requested. While undoubtedly a breaking change, @GrabYourPitchforks has already made compelling (to me) arguments for it in the initial post; let me add a few points:
|
@krwq The idea is that all UTF-8 factories hanging off |
@GrabYourPitchforks fair to assume we're punting this for a post .NET 6 release? |
Yup, post-6.0. This is really the type of thing that needs to go into an early preview. |
@GrabYourPitchforks I think we should pull the trigger right after 7.0 snap and have this smoke out through entire 8.0 period |
I personally think it would be ideal to have both as statics just like PowerShell has for
Here's what I see with go to definition in VS Code:
//
// Summary:
//     Gets an encoding for the UTF-8 format.
//
// Returns:
//     An encoding for the UTF-8 format.
And in Visual Studio:
// Returns an encoding for the UTF-8 format. The returned encoding will be
// an instance of the UTF8Encoding class.
public static Encoding UTF8 => UTF8Encoding.s_default;
As you can see, the comments related to
Cheers |
Maybe it would be sufficient to change |
Hmm, I checked this earlier today and |
Certainly, |
More specifically, in .NET Framework The only case in which
The very purpose of this proposal is to remove the BOM from |
Just an additional note that is somewhat related. As of today, the very latest version of Visual Studio 2022 creates all files in projects with a BOM too. That includes *.sln, *csproj and *.cs files too. |
tl;dr
The Encoding.UTF8 singleton currently says "please emit a BOM when writing." This is an anachronism. Nowadays, it should say "please do not emit a BOM when writing."
The Encoding.UTF8 singleton should continue to perform U+FFFD substitution on invalid subsequences, just as it does today.
Discussion
More information: dotnet/standard#260, #7779, with further discussion at #28218
Historically, the Encoding.UTF8 singleton has been equivalent to new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false). This is largely for historical reasons, as these types were introduced during a period when multiple different encodings were commonplace, and the world hadn't yet settled on UTF-8 as the de facto standard. Now, 20 years later, UTF-8 has cemented its place as the true winner, and many tools across Unix and Windows natively operate on UTF-8. But as mentioned in the above linked issues, these tools can fail if they encounter a BOM at the start of the data.
The Unicode maintainers have also discussed recommending against the use of BOMs by default unless explicitly required by the protocol or file format.
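The equivalence stated above can be observed with a small sketch. It assumes the current behavior of Encoding.UTF8 (BOM preamble, replacement on invalid bytes); the names lenient, strict, and illFormed are illustrative. The strict instance is included only for contrast and is not part of the proposal.

```csharp
using System;
using System.Text;

// The explicit equivalent of today's Encoding.UTF8, per the issue text.
var lenient = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false);

// Same BOM setting, but invalid input raises an exception instead of being replaced.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: true);

// 'A', then 0xFF (never valid in UTF-8), then 'B'.
byte[] illFormed = { 0x41, 0xFF, 0x42 };

// Both decoders substitute U+FFFD for the invalid byte.
string viaSingleton = Encoding.UTF8.GetString(illFormed);
string viaLenient = lenient.GetString(illFormed);

bool strictThrew = false;
try
{
    strict.GetString(illFormed);
}
catch (DecoderFallbackException)
{
    strictThrew = true;
}

Console.WriteLine($"Singleton preamble length: {Encoding.UTF8.GetPreamble().Length}");
Console.WriteLine($"Same decoded text:         {viaSingleton == viaLenient}");
Console.WriteLine($"Contains U+FFFD:           {viaSingleton.Contains('\uFFFD')}");
Console.WriteLine($"Strict decoder threw:      {strictThrew}");
```

Under the proposal, only the preamble would change; the U+FFFD substitution shown here is the part the issue says should stay as it is.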
This would be a breaking change. However, this breaking change should be an overall net positive for the ecosystem because it would prevent our writers from emitting bytes which many tools do not properly discard upon read. We have a history of making breaking changes in this area for .NET Core to assist with interoperability. For example, we changed Encoding.Default to be UTF-8 w/o BOM across all OSes. We also changed UTF8Encoding to be more standards-compliant when it comes to replacing ill-formed input sequences with U+FFFD chars.
Parsers can still opt to honor BOMs at the beginning of files opened for read. Nothing in this proposal discourages readers from parsing the first few bytes and selecting an appropriate Encoding based on that data.
This proposal does not suggest changing the BOM behavior for Encoding.UTF32, Encoding.Unicode, or other built-in singletons. For writers which query the preamble before writing text, it is useful for these writers to continue to emit a "this data is not UTF-8!" marker before the bytestream. This should help preserve compatibility in the less-common scenarios where people want to continue writing XML files as UTF-16.
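As a rough illustration of the reader-side behavior described above, the sketch below inspects the first few bytes of a buffer and picks an Encoding accordingly, falling back to BOM-less UTF-8. DetectEncoding and StartsWith are hypothetical helpers written for this example, not framework APIs, and the fallback choice is an assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper: true if data begins with the given prefix bytes.
bool StartsWith(byte[] data, params byte[] prefix)
{
    if (data.Length < prefix.Length) return false;
    for (int i = 0; i < prefix.Length; i++)
    {
        if (data[i] != prefix[i]) return false;
    }
    return true;
}

// Hypothetical helper: choose an Encoding from a leading BOM, if any.
Encoding DetectEncoding(byte[] head)
{
    // Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    // (FF FE 00 00 could also be UTF-16 LE text starting with U+0000; this sketch ignores that case.)
    if (StartsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return new UTF32Encoding(bigEndian: false, byteOrderMark: true);
    if (StartsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(bigEndian: true, byteOrderMark: true);
    if (StartsWith(head, 0xEF, 0xBB, 0xBF)) return new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
    if (StartsWith(head, 0xFF, 0xFE)) return new UnicodeEncoding(bigEndian: false, byteOrderMark: true);
    if (StartsWith(head, 0xFE, 0xFF)) return new UnicodeEncoding(bigEndian: true, byteOrderMark: true);

    // No BOM found: assume BOM-less UTF-8.
    return new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
}

// Decode a buffer, skipping the BOM when one was detected.
string Decode(byte[] data)
{
    Encoding encoding = DetectEncoding(data);
    int skip = encoding.GetPreamble().Length;
    return encoding.GetString(data, skip, data.Length - skip);
}

byte[] utf8WithBom = { 0xEF, 0xBB, 0xBF, 0x68, 0x69 };  // BOM + "hi"
byte[] utf8Plain = { 0x68, 0x69 };                      // "hi"
byte[] utf16Le = { 0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00 }; // BOM + "hi" in UTF-16 LE

Console.WriteLine(Decode(utf8WithBom));
Console.WriteLine(Decode(utf8Plain));
Console.WriteLine(Decode(utf16Le));
```

For stream-based reading, StreamReader already offers a detectEncodingFromByteOrderMarks constructor parameter that covers the common cases, so a hand-written detector like this is mainly useful when working with raw buffers.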