Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking change proposal: Encoding.UTF8 singleton should not have a BOM #51353

Open
GrabYourPitchforks opened this issue Apr 15, 2021 · 29 comments
Labels
area-System.Text.Encoding breaking-change Issue or PR that represents a breaking API or functional change over a prerelease.
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

tl;dr

The Encoding.UTF8 singleton currently says "please emit a BOM when writing." This is an anachronism. Nowadays, it should say "please do not emit a BOM when writing."

The Encoding.UTF8 singleton should continue to perform U+FFFD substitution on invalid subsequences, just as it does today.

Discussion

More information: dotnet/standard#260, #7779, with further discussion at #28218

Historically, the Encoding.UTF8 singleton has been equivalent to new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false). This is largely for historical reasons, as these types were introduced during a period when multiple different encodings were commonplace, and the world hadn't yet settled on UTF-8 as the de facto standard. Now, 20 years later, UTF-8 has cemented its place as the true winner, and many tools across Unix and Windows natively operate on UTF-8. But as mentioned in the above linked issues, these tools can fail if they encounter a BOM at the start of the data.

The Unicode maintainers have also discussed recommending against the use of BOMs by default unless explicitly required by the protocol or file format.

This would be a breaking change. However, this breaking change should be an overall net positive for the ecosystem because it would prevent our writers from emitting bytes which many tools do not properly discard upon read. We have a history of making breaking changes in this area for .NET Core to assist with interoperability. For example, we changed Encoding.Default to be UTF-8 w/o BOM across all OSes. We also changed UTF8Encoding to be more standards-compliant when it comes to replacing ill-formed input sequences with U+FFFD chars.

Parsers can still opt to honor BOMs at the beginning of files opened for read. Nothing in this proposal discourages readers from parsing the first few bytes and selecting an appropriate Encoding based on that data.

This proposal does not suggest changing the BOM behavior for Encoding.UTF32, Encoding.Unicode, or other built-in singletons. For writers which query the preamble before writing text, it is useful for these writers to continue to emit a "this data is not UTF-8!" marker before the bytestream. This should help preserve compatibility in the less-common scenarios where people want to continue writing XML files as UTF-16.

@GrabYourPitchforks GrabYourPitchforks added area-System.Text.Encoding breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. labels Apr 15, 2021
@ghost
Copy link

ghost commented Apr 15, 2021

Tagging subscribers to this area: @tarekgh, @krwq, @eiriktsarpalis, @layomia
See info in area-owners.md if you want to be subscribed.

Issue Details

tl;dr

The Encoding.UTF8 singleton currently says "please emit a BOM when writing." This is an anachronism. Nowadays, it should say "please do not emit a BOM when writing."

The Encoding.UTF8 singleton should continue to perform U+FFFD substitution on invalid subsequences, just as it does today.

Discussion

More information: dotnet/standard#260, #7779, with further discussion at #28218

Historically, the Encoding.UTF8 singleton has been equivalent to new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false). This is largely for historical reasons, as these types were introduced during a period when multiple different encodings were commonplace, and the world hadn't yet settled on UTF-8 as the de facto standard. Now, 20 years later, UTF-8 has cemented its place as the true winner, and many tools across Unix and Windows natively operate on UTF-8. But as mentioned in the above linked issues, these tools can fail if they encounter a BOM at the start of the data.

The Unicode maintainers have also discussed recommending against the use of BOMs by default unless explicitly required by the protocol or file format.

This would be a breaking change. However, this breaking change should be an overall net positive for the ecosystem because it would prevent our writers from emitting bytes which many tools do not properly discard upon read. We have a history of making breaking changes in this area for .NET Core to assist with interoperability. For example, we changed Encoding.Default to be UTF-8 w/o BOM across all OSes. We also changed UTF8Encoding to be more standards-compliant when it comes to replacing ill-formed input sequences with U+FFFD chars.

Parsers can still opt to honor BOMs at the beginning of files opened for read. Nothing in this proposal discourages readers from parsing the first few bytes and selecting an appropriate Encoding based on that data.

This proposal does not suggest changing the BOM behavior for Encoding.UTF32, Encoding.Unicode, or other built-in singletons. For writers which query the preamble before writing text, it is useful for these writers to continue to emit a "this data is not UTF-8!" marker before the bytestream. This should help preserve compatibility in the less-common scenarios where people want to continue writing XML files as UTF-16.

Author: GrabYourPitchforks
Assignees: -
Labels:

area-System.Text.Encoding, breaking-change

Milestone: -

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 15, 2021
@tarekgh
Copy link
Member

tarekgh commented Apr 15, 2021

I am a supporter of this proposal. We need to provide a config switch to go back to old behavior if needed.

@SimonCropp
Copy link
Contributor

IMO this is such a significant (and difficult to discover) breaking change that "provide a config switch to go back to old behavior" is not sufficient. I propose:

  • obsolete Encoding.UTF8 with warning
  • add Encoding.UTF8IncludingBom and Encoding.UTF8ExcludingBom (or equivalent)

@huoyaoyuan
Copy link
Member

I wonder if default encoding can be changed from ANSI to UTF8, as a breaking change.
In practice, especially for East Asian users, ASNI codepages are much more annoying than BOM.

@heemskerkerik
Copy link

I am confused. The documentation says that it does include a BOM. Which method does that? When I try GetBytes, it doesn't appear to return a BOM:

var b = System.Text.Encoding.UTF8.GetBytes("Hello world");

foreach (byte bb in b)
    Console.WriteLine(bb.ToString("X2"));

This outputs (on .NET 5):

48
65
6C
6C
6F
20
77
6F
72
6C
64

@heemskerkerik
Copy link

Ah, I see. GetBytes never returns a BOM, but a StreamWriter will first write the result of GetPreamble and then the result of GetBytes.

@tfenise
Copy link
Contributor

tfenise commented Apr 16, 2021

It seems confusing if Encoding.UTF32 and Encoding.Unicode emit BOM, but Encoding.UTF8 does not.

@benaadams
Copy link
Member

It seems confusing if Encoding.UTF32 and Encoding.Unicode emit BOM, but Encoding.UTF8 does not.

UTF32 and UTF16 need a Byte Order Mark to indicate the endianness of the data apart from anything else; UTF8 doesn't have any endianness so doesn't require it for that purpose.

Also while ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added.

@tfenise
Copy link
Contributor

tfenise commented Apr 16, 2021

Encoding.UTF8 is explicitly documented to emit BOM. Changing Encoding.UTF8 is unlike changing Encoding.Default to be UTF-8 w/o BOM. Encoding.Default is platform-dependent anyway. If Encoding.UTF8 is easily misused, obsoletion or analyzer or warning seems better than a breaking change on Encoding.UTF8.

@benaadams
Copy link
Member

obsoletion or analyzer or warning seems better than a breaking change on Encoding.UTF8.

Obsolete Encoding.UTF8 introduce Encoding.Utf8 🤔 Though VB isn't case sensitive so that wouldn't work

@jkotas
Copy link
Member

jkotas commented Apr 16, 2021

This breaking change can break msbuild or powershell scripts. For example, the fix from dotnet/runtimelab#782 would break since it depends on Encoding.GetEncoding("utf-8") returning encoding with BOM. What would be the recommended action to resolve the break for this specific case?

@GrabYourPitchforks
Copy link
Member Author

@jkotas Yeah, the challenge is to weigh how many people will be broken against how many future .NET developers won't run into this trap. In your scenario, the ideal resolution would be to update the tooling to understand UTF-8 natively without a BOM. If this is impractical and the file format remains "you must emit a BOM before writing UTF-8", then this falls squarely under the scenario at https://www.unicode.org/L2/L2021/21038-bom-guidance.pdf, bottom of pg. 6. In that case, I'd suggest the following three changes:

  • The WriteLinesToFile task get an additional property WriteBOM, which defaults to false.
  • The WriteLinesToFile code check the WriteBOM property and special-case it for the known Unicode encodings (see sample code below).
  • The Native AOT compilation toolchain puts encoding="utf-8" writeBom="true" in their targets file.
Encoding encoding = Encoding.GetEncoding(requestedEncoding);
if (encoding.CodePage == Encoding.UTF8.CodePage)
{
    encoding = new UTF8Encoding(writeBom);
}
else if (encoding.CodePage = Encoding.Unicode.CodePage)
{
    encoding = new UnicodeEncoding(writeBom);
}
else if (/* ... */) { }

@jkotas
Copy link
Member

jkotas commented Apr 16, 2021

In your scenario, the ideal resolution would be to update the tooling to understand UTF-8 natively without a BOM

In this case, it would be a breaking change and/or new feature for VC++ compiler/linker.

WriteLinesToFile task get an additional property WriteBOM

Agree. In fact, msbuild has an issue on this already: dotnet/msbuild#6168. It sounds like we would have some coordination to do for this one with msbuild and other similar projects.

@atsushieno
Copy link
Contributor

Mono mscorlib.dll used to have Encoding.UTF8Unmarked (internal only) which I believe is still much better than mixing cases or introducing breaking changes. Having this alongside, Encoding.UTF8 would not look confusing anymore.

@GrabYourPitchforks
Copy link
Member Author

We discussed a little bit internally the idea of having Encoding.UTF8NoBOM as a first-class citizen alongside Encoding.UTF8. That suggestion has come up a few times in this thread as well.

I'm not sold on that as a good long-term solution. The spirit of this work item is that we want to reduce the number of developers who are exposed to the concept of a BOM. By having static factories for "with BOM" and "without BOM", we'd be foisting this concept upon every developer who starts typing Encoding.* in their code editor. A developer who is well-versed in these concepts can quickly and correctly answer the question of "do I want a BOM or not?", but for the majority of the developer audience these terms would be unfamiliar and they wouldn't know how to answer the question. Ultimately I think exposing these concepts on a primary API would result in a poorer user experience than exists today.

@sharwell
Copy link
Member

sharwell commented Apr 17, 2021

I am strongly against this proposal. Not only is the current behavior documented such that applications can depend on it (object on compatibility grounds), but the BOM has proven widely beneficial in my experience at avoiding encoding errors in documents that change over time (object to the principle of the proposed direction).

@atsushieno
Copy link
Contributor

I completely agree that BOM-less UTF8 is definitely better and having only one instead of two brings better developer experience.

But ABI-breaking changes should be the last resort and should not be made for "somewhat better" developer experience. Aged platforms have their appropriate reasons for technical choices.

To not make ABI breaking changes, there are still ways to improve developer experience with updating the API:

  • Add CLEARER coding hints (<summar> on Encoding.* members) to tell those casual developers using simple words like "you would not like to use it unless you understand what BOM is."
  • Deprecate Encoding.UTF8 with same or similar comment, indicating the new alternative like UTF8NoBOM. If I were to design "long term" solution, I would deprecate UTF8 first, then remove it from the API, then bring it back as BOM-less.

@eiriktsarpalis eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label Apr 22, 2021
@krwq
Copy link
Member

krwq commented May 10, 2021

@GrabYourPitchforks are there any news here? Did we get into any consensus? Is this conversation only about Encoding.UTF8 or also Encoding.GetEncoding("utf-8") which currently returns UTF8 instance which produces BOM? As a first step perhaps we could only make latter return BOM-less instance? Encoding.UTF8 will probably cause more issues than that

@mklement0
Copy link

mklement0 commented May 10, 2021

Good catch, @krwq: Encoding.GetEncoding("utf-8") does emit a BOM (though you wouldn't be able to tell from the the description of the .Preamble property, which doesn't mention .GetEncoding() and explicitly states that only using System.Text.Encoding.UTF8 and using the argument-less or no-BOM UTF8Encoding constructor results in a BOM).

By contrast, System.Text.Encoding.GetEncoding(0), documented to return UTF-8 in .NET Core, does not emit a BOM.

(A systematic review of the docs with respect to recommending / discouraging a UTF-8 BOM is called for either way, as certain pages contradict each other.)


I think consistency is called for, and my vote is to consistently default to BOM-less UTF-8 and only ever return a with-BOM instance if explicitly requested.

While undoubtedly a breaking change, @GrabYourPitchforks has already made compelling (to me) arguments for it in the initial post; let me add a few points:

  • The Unix world has moved to BOM-less UTF-8 a long time ago, and it is primarily tools with a Unix heritage that do not expect a BOM, and in the presence of one either choke or misinterpret the BOM as part of the data.

  • The Windows world is undoubtedly moving towards assuming UTF-8 in the absence of a BOM as well:

    • The major (cross-platform) text editors nowadays write and read BOM-less UTF-8 by default - see System.Xml.XmlDocument.Save() unexpectedly creates UTF-8 files *with BOM* #28218 (comment) for an overview.

    • PowerShell Core (the cross-platform edition built on .NET Core / 5+) too uses BOM-less UTF-8 as its consistent default, both when reading its source code and in its file-processing cmdlets.

    • Node.js (node.exe) - and others? - have chosen to "speak" (BOM-less) UTF-8 by default, irrespective of the active OEM code page as determined by the system locale (aka language for non-Unicode programs).

      • Python chose a different (also nonstandard) approach, defaulting to the active ANSI code page even when called from the console, rather than the OEM code page console applications are expected to use. However, it is is trivial to configure python to use (BOM-less) UTF-8 instead, via an environment variable (PYTHONUTF8) or, situationally, via a CLI parameter (-X utf8).
    • Windows 10 now offers a - still-in-beta as of this writing - feature to switch to (BOM-less) UTF-8 system-wide, by setting the system locale so that both the OEM and the ANSI code pages use code page 65001, i.e. UTF-8; see this Stack Overflow answer for details and a discussion of the ramifications.

      • With this configuration:
        • Even Windows PowerShell and Python, for instance, then default to BOM-less UTF-8 (since the ANSI code page is then effectively UTF-8).
        • So will all conventional console applications that use the OEM code page, with the caveat that legacy applications that aren't equipped to handle the variable-length aspect of UTF-8 encoding malfunction.
    • Last but not least: .NET's own default encoding for its System.IO APIs has - commendably - been BOM-less UTF-8 since v1.

@GrabYourPitchforks
Copy link
Member Author

@krwq The idea is that all UTF-8 factories hanging off Encoding will be no-BOM, unless the caller calls new UTF8Encoding(true).

@eiriktsarpalis
Copy link
Member

@GrabYourPitchforks fair to assume we're punting this for a post .NET 6 release?

@eiriktsarpalis eiriktsarpalis added this to the 7.0.0 milestone Jul 21, 2021
@GrabYourPitchforks
Copy link
Member Author

Yup, post-6.0. This is really the type of thing that needs to go into an early preview.

@jeffhandley jeffhandley modified the milestones: 7.0.0, 8.0.0 Jul 9, 2022
@krwq
Copy link
Member

krwq commented Aug 11, 2022

@GrabYourPitchforks I think we should pull the trigger right after 7.0 snap and have this smoke out through entire 8.0 period

@fgimian
Copy link

fgimian commented Apr 1, 2024

I personally think it would be ideal to have both as statics just like PowerShell has for Set-Content but I respect the decision made here. My only thoughts are that the XML documentation for Encoding.UTF8 could be improved.

Here's what I see with go to definition in VS Code:

        //
        // Summary:
        //     Gets an encoding for the UTF-8 format.
        //
        // Returns:
        //     An encoding for the UTF-8 format.

And in Visual Studio:

        // Returns an encoding for the UTF-8 format. The returned encoding will be
        // an instance of the UTF8Encoding class.

        public static Encoding UTF8 => UTF8Encoding.s_default;

As you can see, the comments related to Encoding.UTF8 don't clearly indicate that it will include the BOM. One's natural expectation would be for a BOM to not be included today, and thus I think it would be beneficial to clearly state it in the docs.

Cheers
Fotis

@vladd
Copy link

vladd commented Apr 1, 2024

Maybe it would be sufficient to change Encoding.Default to be UTF-8 without BOM? I assume the current common wisdom is that Encoding.Default should not be used at all.

@fgimian
Copy link

fgimian commented Apr 1, 2024

Maybe it would be sufficient to change Encoding.Default to be UTF-8 without BOM? I assume the current common wisdom is that Encoding.Default should not be used at all.

Hmm, I checked this earlier today and Encoding.Default was actually UTF-8 without a BOM. So I'm using it along with Encoding.UTF8 to get both variants without having to create my own instance of UTF8Encoding. Not sure if I'm missing something though. 😊

@tarekgh
Copy link
Member

tarekgh commented Apr 1, 2024

Certainly, Encoding.Default returns a UTF-8 encoding object without a Byte Order Mark (BOM). However, it's important to note that if your code is executed on the .NET Framework, Encoding.Default may not necessarily return a UTF-8 encoding object. Instead, it retrieves the default active codepage encoding on Windows.

@mklement0
Copy link

mklement0 commented Apr 1, 2024

More specifically, in .NET Framework Encoding.Default is the varying encoding that represents a given machine's legacy system locale's ANSI code page, such as Windows-1252 on a US-English system, or Windows-1250 on a Czech system.

The only case in which Encoding.Default in .NET Framework reports UTF-8 - without a BOM (the invariable default in .NET (Core)) - is if you've changed your system locale to the still-in-beta option to use UTF-8 system-wide, as a result of which both the ANSI and the OEM code page are set to 65001 (UTF-8); note that doing this has far-reaching consequences.

along with Encoding.UTF8 to get both variants without having to create my own instance of UTF8Encoding.

The very purpose of this proposal is to remove the BOM from Encoding.UTF8, so I wouldn't rely on that.

@fgimian
Copy link

fgimian commented Apr 2, 2024

Just an additional note that is somewhat related. As of today, the very latest version of Visual Studio 2022 creates all files in projects with a BOM too. That includes *.sln, *csproj and *.cs files too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Text.Encoding breaking-change Issue or PR that represents a breaking API or functional change over a prerelease.
Projects
None yet
Development

No branches or pull requests