Breaking change proposal: Encoding.UTF8 singleton should not have a BOM #51353

GrabYourPitchforks · 2021-04-15T23:36:59Z

tl;dr

The Encoding.UTF8 singleton currently says "please emit a BOM when writing." This is an anachronism. Nowadays, it should say "please do not emit a BOM when writing."

The Encoding.UTF8 singleton should continue to perform U+FFFD substitution on invalid subsequences, just as it does today.

Discussion

More information: dotnet/standard#260, #7779, with further discussion at #28218

Historically, the Encoding.UTF8 singleton has been equivalent to new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false). This is largely for historical reasons, as these types were introduced during a period when multiple different encodings were commonplace, and the world hadn't yet settled on UTF-8 as the de facto standard. Now, 20 years later, UTF-8 has cemented its place as the true winner, and many tools across Unix and Windows natively operate on UTF-8. But as mentioned in the above linked issues, these tools can fail if they encounter a BOM at the start of the data.

The Unicode maintainers have also discussed recommending against the use of BOMs by default unless explicitly required by the protocol or file format.

https://www.unicode.org/L2/L2021/21038-bom-guidance.pdf (guidance being drafted, but not yet adopted)
https://corp.unicode.org/pipermail/unicode/2020-October/009070.html (previous discussion on this issue which led to above draft guidance)
https://corp.unicode.org/pipermail/unicode/2020-June/008713.html (earlier discussion on this issue)

This would be a breaking change. However, this breaking change should be an overall net positive for the ecosystem because it would prevent our writers from emitting bytes which many tools do not properly discard upon read. We have a history of making breaking changes in this area for .NET Core to assist with interoperability. For example, we changed Encoding.Default to be UTF-8 w/o BOM across all OSes. We also changed UTF8Encoding to be more standards-compliant when it comes to replacing ill-formed input sequences with U+FFFD chars.

Parsers can still opt to honor BOMs at the beginning of files opened for read. Nothing in this proposal discourages readers from parsing the first few bytes and selecting an appropriate Encoding based on that data.

This proposal does not suggest changing the BOM behavior for Encoding.UTF32, Encoding.Unicode, or other built-in singletons. For writers which query the preamble before writing text, it is useful for these writers to continue to emit a "this data is not UTF-8!" marker before the bytestream. This should help preserve compatibility in the less-common scenarios where people want to continue writing XML files as UTF-16.

The text was updated successfully, but these errors were encountered:

ghost · 2021-04-15T23:37:02Z

Tagging subscribers to this area: @tarekgh, @krwq, @eiriktsarpalis, @layomia
See info in area-owners.md if you want to be subscribed.

Issue Details

tl;dr

The Encoding.UTF8 singleton currently says "please emit a BOM when writing." This is an anachronism. Nowadays, it should say "please do not emit a BOM when writing."

The Encoding.UTF8 singleton should continue to perform U+FFFD substitution on invalid subsequences, just as it does today.

Discussion

More information: dotnet/standard#260, #7779, with further discussion at #28218

Historically, the Encoding.UTF8 singleton has been equivalent to new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false). This is largely for historical reasons, as these types were introduced during a period when multiple different encodings were commonplace, and the world hadn't yet settled on UTF-8 as the de facto standard. Now, 20 years later, UTF-8 has cemented its place as the true winner, and many tools across Unix and Windows natively operate on UTF-8. But as mentioned in the above linked issues, these tools can fail if they encounter a BOM at the start of the data.

The Unicode maintainers have also discussed recommending against the use of BOMs by default unless explicitly required by the protocol or file format.

https://www.unicode.org/L2/L2021/21038-bom-guidance.pdf (guidance being drafted, but not yet adopted)
https://corp.unicode.org/pipermail/unicode/2020-October/009070.html (previous discussion on this issue which led to above draft guidance)
https://corp.unicode.org/pipermail/unicode/2020-June/008713.html (earlier discussion on this issue)

This would be a breaking change. However, this breaking change should be an overall net positive for the ecosystem because it would prevent our writers from emitting bytes which many tools do not properly discard upon read. We have a history of making breaking changes in this area for .NET Core to assist with interoperability. For example, we changed Encoding.Default to be UTF-8 w/o BOM across all OSes. We also changed UTF8Encoding to be more standards-compliant when it comes to replacing ill-formed input sequences with U+FFFD chars.

Parsers can still opt to honor BOMs at the beginning of files opened for read. Nothing in this proposal discourages readers from parsing the first few bytes and selecting an appropriate Encoding based on that data.

This proposal does not suggest changing the BOM behavior for Encoding.UTF32, Encoding.Unicode, or other built-in singletons. For writers which query the preamble before writing text, it is useful for these writers to continue to emit a "this data is not UTF-8!" marker before the bytestream. This should help preserve compatibility in the less-common scenarios where people want to continue writing XML files as UTF-16.

Author:	GrabYourPitchforks
Assignees:	-
Labels:	`area-System.Text.Encoding`, `breaking-change`
Milestone:	-

tarekgh · 2021-04-15T23:46:22Z

I am a supporter of this proposal. We need to provide a config switch to go back to old behavior if needed.

SimonCropp · 2021-04-16T01:51:47Z

IMO this is such a significant (and difficult to discover) breaking change that "provide a config switch to go back to old behavior" is not sufficient. I propose:

obsolete Encoding.UTF8 with warning
add Encoding.UTF8IncludingBom and Encoding.UTF8ExcludingBom (or equivalent)

huoyaoyuan · 2021-04-16T05:11:24Z

I wonder if default encoding can be changed from ANSI to UTF8, as a breaking change.
In practice, especially for East Asian users, ASNI codepages are much more annoying than BOM.

heemskerkerik · 2021-04-16T05:14:06Z

I am confused. The documentation says that it does include a BOM. Which method does that? When I try GetBytes, it doesn't appear to return a BOM:

var b = System.Text.Encoding.UTF8.GetBytes("Hello world");

foreach (byte bb in b)
    Console.WriteLine(bb.ToString("X2"));

This outputs (on .NET 5):

heemskerkerik · 2021-04-16T05:22:19Z

Ah, I see. GetBytes never returns a BOM, but a StreamWriter will first write the result of GetPreamble and then the result of GetBytes.

tfenise · 2021-04-16T09:51:20Z

It seems confusing if Encoding.UTF32 and Encoding.Unicode emit BOM, but Encoding.UTF8 does not.

benaadams · 2021-04-16T09:59:05Z

It seems confusing if Encoding.UTF32 and Encoding.Unicode emit BOM, but Encoding.UTF8 does not.

UTF32 and UTF16 need a Byte Order Mark to indicate the endianness of the data apart from anything else; UTF8 doesn't have any endianness so doesn't require it for that purpose.

Also while ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added.

tfenise · 2021-04-16T11:35:47Z

Encoding.UTF8 is explicitly documented to emit BOM. Changing Encoding.UTF8 is unlike changing Encoding.Default to be UTF-8 w/o BOM. Encoding.Default is platform-dependent anyway. If Encoding.UTF8 is easily misused, obsoletion or analyzer or warning seems better than a breaking change on Encoding.UTF8.

benaadams · 2021-04-16T11:43:29Z

obsoletion or analyzer or warning seems better than a breaking change on Encoding.UTF8.

Obsolete Encoding.UTF8 introduce Encoding.Utf8 🤔 Though VB isn't case sensitive so that wouldn't work

jkotas · 2021-04-16T15:57:40Z

This breaking change can break msbuild or powershell scripts. For example, the fix from dotnet/runtimelab#782 would break since it depends on Encoding.GetEncoding("utf-8") returning encoding with BOM. What would be the recommended action to resolve the break for this specific case?

GrabYourPitchforks · 2021-04-16T17:29:41Z

@jkotas Yeah, the challenge is to weigh how many people will be broken against how many future .NET developers won't run into this trap. In your scenario, the ideal resolution would be to update the tooling to understand UTF-8 natively without a BOM. If this is impractical and the file format remains "you must emit a BOM before writing UTF-8", then this falls squarely under the scenario at https://www.unicode.org/L2/L2021/21038-bom-guidance.pdf, bottom of pg. 6. In that case, I'd suggest the following three changes:

The WriteLinesToFile task get an additional property WriteBOM, which defaults to false.
The WriteLinesToFile code check the WriteBOM property and special-case it for the known Unicode encodings (see sample code below).
The Native AOT compilation toolchain puts encoding="utf-8" writeBom="true" in their targets file.

Encoding encoding = Encoding.GetEncoding(requestedEncoding);
if (encoding.CodePage == Encoding.UTF8.CodePage)
{
    encoding = new UTF8Encoding(writeBom);
}
else if (encoding.CodePage = Encoding.Unicode.CodePage)
{
    encoding = new UnicodeEncoding(writeBom);
}
else if (/* ... */) { }

jkotas · 2021-04-16T17:51:27Z

In your scenario, the ideal resolution would be to update the tooling to understand UTF-8 natively without a BOM

In this case, it would be a breaking change and/or new feature for VC++ compiler/linker.

WriteLinesToFile task get an additional property WriteBOM

Agree. In fact, msbuild has an issue on this already: dotnet/msbuild#6168. It sounds like we would have some coordination to do for this one with msbuild and other similar projects.

atsushieno · 2021-04-17T04:15:49Z

Mono mscorlib.dll used to have Encoding.UTF8Unmarked (internal only) which I believe is still much better than mixing cases or introducing breaking changes. Having this alongside, Encoding.UTF8 would not look confusing anymore.

GrabYourPitchforks · 2021-04-17T04:56:10Z

We discussed a little bit internally the idea of having Encoding.UTF8NoBOM as a first-class citizen alongside Encoding.UTF8. That suggestion has come up a few times in this thread as well.

I'm not sold on that as a good long-term solution. The spirit of this work item is that we want to reduce the number of developers who are exposed to the concept of a BOM. By having static factories for "with BOM" and "without BOM", we'd be foisting this concept upon every developer who starts typing Encoding.* in their code editor. A developer who is well-versed in these concepts can quickly and correctly answer the question of "do I want a BOM or not?", but for the majority of the developer audience these terms would be unfamiliar and they wouldn't know how to answer the question. Ultimately I think exposing these concepts on a primary API would result in a poorer user experience than exists today.

sharwell · 2021-04-17T05:07:33Z

I am strongly against this proposal. Not only is the current behavior documented such that applications can depend on it (object on compatibility grounds), but the BOM has proven widely beneficial in my experience at avoiding encoding errors in documents that change over time (object to the principle of the proposed direction).

atsushieno · 2021-04-17T05:22:46Z

I completely agree that BOM-less UTF8 is definitely better and having only one instead of two brings better developer experience.

But ABI-breaking changes should be the last resort and should not be made for "somewhat better" developer experience. Aged platforms have their appropriate reasons for technical choices.

To not make ABI breaking changes, there are still ways to improve developer experience with updating the API:

Add CLEARER coding hints (<summar> on Encoding.* members) to tell those casual developers using simple words like "you would not like to use it unless you understand what BOM is."
Deprecate Encoding.UTF8 with same or similar comment, indicating the new alternative like UTF8NoBOM. If I were to design "long term" solution, I would deprecate UTF8 first, then remove it from the API, then bring it back as BOM-less.

krwq · 2021-05-10T12:55:29Z

@GrabYourPitchforks are there any news here? Did we get into any consensus? Is this conversation only about Encoding.UTF8 or also Encoding.GetEncoding("utf-8") which currently returns UTF8 instance which produces BOM? As a first step perhaps we could only make latter return BOM-less instance? Encoding.UTF8 will probably cause more issues than that

mklement0 · 2021-05-10T14:27:26Z

Good catch, @krwq: Encoding.GetEncoding("utf-8") does emit a BOM (though you wouldn't be able to tell from the the description of the .Preamble property, which doesn't mention .GetEncoding() and explicitly states that only using System.Text.Encoding.UTF8 and using the argument-less or no-BOM UTF8Encoding constructor results in a BOM).

By contrast, System.Text.Encoding.GetEncoding(0), documented to return UTF-8 in .NET Core, does not emit a BOM.

(A systematic review of the docs with respect to recommending / discouraging a UTF-8 BOM is called for either way, as certain pages contradict each other.)

I think consistency is called for, and my vote is to consistently default to BOM-less UTF-8 and only ever return a with-BOM instance if explicitly requested.

While undoubtedly a breaking change, @GrabYourPitchforks has already made compelling (to me) arguments for it in the initial post; let me add a few points:

The Unix world has moved to BOM-less UTF-8 a long time ago, and it is primarily tools with a Unix heritage that do not expect a BOM, and in the presence of one either choke or misinterpret the BOM as part of the data.
The Windows world is undoubtedly moving towards assuming UTF-8 in the absence of a BOM as well:
- The major (cross-platform) text editors nowadays write and read BOM-less UTF-8 by default - see System.Xml.XmlDocument.Save() unexpectedly creates UTF-8 files *with BOM* #28218 (comment) for an overview.
- PowerShell Core (the cross-platform edition built on .NET Core / 5+) too uses BOM-less UTF-8 as its consistent default, both when reading its source code and in its file-processing cmdlets.
- Node.js (node.exe) - and others? - have chosen to "speak" (BOM-less) UTF-8 by default, irrespective of the active OEM code page as determined by the system locale (aka language for non-Unicode programs).
  - Python chose a different (also nonstandard) approach, defaulting to the active ANSI code page even when called from the console, rather than the OEM code page console applications are expected to use. However, it is is trivial to configure python to use (BOM-less) UTF-8 instead, via an environment variable (PYTHONUTF8) or, situationally, via a CLI parameter (-X utf8).
- Windows 10 now offers a - still-in-beta as of this writing - feature to switch to (BOM-less) UTF-8 system-wide, by setting the system locale so that both the OEM and the ANSI code pages use code page 65001, i.e. UTF-8; see this Stack Overflow answer for details and a discussion of the ramifications.
  - With this configuration:
    - Even Windows PowerShell and Python, for instance, then default to BOM-less UTF-8 (since the ANSI code page is then effectively UTF-8).
    - So will all conventional console applications that use the OEM code page, with the caveat that legacy applications that aren't equipped to handle the variable-length aspect of UTF-8 encoding malfunction.
- Last but not least: .NET's own default encoding for its System.IO APIs has - commendably - been BOM-less UTF-8 since v1.

GrabYourPitchforks · 2021-05-10T16:24:40Z

@krwq The idea is that all UTF-8 factories hanging off Encoding will be no-BOM, unless the caller calls new UTF8Encoding(true).

eiriktsarpalis · 2021-07-21T15:22:10Z

@GrabYourPitchforks fair to assume we're punting this for a post .NET 6 release?

GrabYourPitchforks · 2021-07-21T16:55:03Z

Yup, post-6.0. This is really the type of thing that needs to go into an early preview.

krwq · 2022-08-11T15:35:55Z

@GrabYourPitchforks I think we should pull the trigger right after 7.0 snap and have this smoke out through entire 8.0 period

fgimian · 2024-04-01T00:48:49Z

I personally think it would be ideal to have both as statics just like PowerShell has for Set-Content but I respect the decision made here. My only thoughts are that the XML documentation for Encoding.UTF8 could be improved.

Here's what I see with go to definition in VS Code:

        //
        // Summary:
        //     Gets an encoding for the UTF-8 format.
        //
        // Returns:
        //     An encoding for the UTF-8 format.

And in Visual Studio:

        // Returns an encoding for the UTF-8 format. The returned encoding will be
        // an instance of the UTF8Encoding class.

        public static Encoding UTF8 => UTF8Encoding.s_default;

As you can see, the comments related to Encoding.UTF8 don't clearly indicate that it will include the BOM. One's natural expectation would be for a BOM to not be included today, and thus I think it would be beneficial to clearly state it in the docs.

Cheers
Fotis

vladd · 2024-04-01T09:40:53Z

Maybe it would be sufficient to change Encoding.Default to be UTF-8 without BOM? I assume the current common wisdom is that Encoding.Default should not be used at all.

fgimian · 2024-04-01T09:46:05Z

Maybe it would be sufficient to change Encoding.Default to be UTF-8 without BOM? I assume the current common wisdom is that Encoding.Default should not be used at all.

Hmm, I checked this earlier today and Encoding.Default was actually UTF-8 without a BOM. So I'm using it along with Encoding.UTF8 to get both variants without having to create my own instance of UTF8Encoding. Not sure if I'm missing something though. 😊

tarekgh · 2024-04-01T15:53:27Z

Certainly, Encoding.Default returns a UTF-8 encoding object without a Byte Order Mark (BOM). However, it's important to note that if your code is executed on the .NET Framework, Encoding.Default may not necessarily return a UTF-8 encoding object. Instead, it retrieves the default active codepage encoding on Windows.

mklement0 · 2024-04-01T17:20:31Z

More specifically, in .NET Framework Encoding.Default is the varying encoding that represents a given machine's legacy system locale's ANSI code page, such as Windows-1252 on a US-English system, or Windows-1250 on a Czech system.

The only case in which Encoding.Default in .NET Framework reports UTF-8 - without a BOM (the invariable default in .NET (Core)) - is if you've changed your system locale to the still-in-beta option to use UTF-8 system-wide, as a result of which both the ANSI and the OEM code page are set to 65001 (UTF-8); note that doing this has far-reaching consequences.

along with Encoding.UTF8 to get both variants without having to create my own instance of UTF8Encoding.

The very purpose of this proposal is to remove the BOM from Encoding.UTF8, so I wouldn't rely on that.

fgimian · 2024-04-02T22:38:18Z

Just an additional note that is somewhat related. As of today, the very latest version of Visual Studio 2022 creates all files in projects with a BOM too. That includes *.sln, *csproj and *.cs files too.

GrabYourPitchforks added area-System.Text.Encoding breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. labels Apr 15, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 15, 2021

eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label Apr 22, 2021

krwq mentioned this issue May 10, 2021

XmlDocument: save BOM-less UTF-8 files by default #49322

Closed

eiriktsarpalis added this to the 7.0.0 milestone Jul 21, 2021

jeffhandley modified the milestones: 7.0.0, 8.0.0 Jul 9, 2022

This was referenced Sep 29, 2022

Remove encoding allocations for System.Security.Xml #76364

Merged

Remove unnecessary encoding allocations in System.Private.DataContractSerialization #76366

Merged

xtqqczze mentioned this issue Oct 23, 2022

Replace UTF8Encoding(false) with Encoding.Default part 2 PowerShell/PowerShell#18356

Merged

krwq modified the milestones: 8.0.0, Future Jan 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Breaking change proposal: Encoding.UTF8 singleton should not have a BOM #51353

Breaking change proposal: Encoding.UTF8 singleton should not have a BOM #51353

GrabYourPitchforks commented Apr 15, 2021

ghost commented Apr 15, 2021

tl;dr

Discussion

tarekgh commented Apr 15, 2021

SimonCropp commented Apr 16, 2021

huoyaoyuan commented Apr 16, 2021

heemskerkerik commented Apr 16, 2021

heemskerkerik commented Apr 16, 2021

tfenise commented Apr 16, 2021

benaadams commented Apr 16, 2021

tfenise commented Apr 16, 2021

benaadams commented Apr 16, 2021

jkotas commented Apr 16, 2021 •

edited

Loading

GrabYourPitchforks commented Apr 16, 2021

jkotas commented Apr 16, 2021 •

edited

Loading

atsushieno commented Apr 17, 2021

GrabYourPitchforks commented Apr 17, 2021

sharwell commented Apr 17, 2021 •

edited

Loading

atsushieno commented Apr 17, 2021

krwq commented May 10, 2021

mklement0 commented May 10, 2021 •

edited

Loading

GrabYourPitchforks commented May 10, 2021

eiriktsarpalis commented Jul 21, 2021

GrabYourPitchforks commented Jul 21, 2021

krwq commented Aug 11, 2022

fgimian commented Apr 1, 2024 •

edited

Loading

vladd commented Apr 1, 2024

fgimian commented Apr 1, 2024 •

edited

Loading

tarekgh commented Apr 1, 2024

mklement0 commented Apr 1, 2024 •

edited

Loading

fgimian commented Apr 2, 2024

Breaking change proposal: Encoding.UTF8 singleton should not have a BOM #51353

Breaking change proposal: Encoding.UTF8 singleton should not have a BOM #51353

Comments

GrabYourPitchforks commented Apr 15, 2021

tl;dr

Discussion

ghost commented Apr 15, 2021

tl;dr

Discussion

tarekgh commented Apr 15, 2021

SimonCropp commented Apr 16, 2021

huoyaoyuan commented Apr 16, 2021

heemskerkerik commented Apr 16, 2021

heemskerkerik commented Apr 16, 2021

tfenise commented Apr 16, 2021

benaadams commented Apr 16, 2021

tfenise commented Apr 16, 2021

benaadams commented Apr 16, 2021

jkotas commented Apr 16, 2021 • edited Loading

GrabYourPitchforks commented Apr 16, 2021

jkotas commented Apr 16, 2021 • edited Loading

atsushieno commented Apr 17, 2021

GrabYourPitchforks commented Apr 17, 2021

sharwell commented Apr 17, 2021 • edited Loading

atsushieno commented Apr 17, 2021

krwq commented May 10, 2021

mklement0 commented May 10, 2021 • edited Loading

GrabYourPitchforks commented May 10, 2021

eiriktsarpalis commented Jul 21, 2021

GrabYourPitchforks commented Jul 21, 2021

krwq commented Aug 11, 2022

fgimian commented Apr 1, 2024 • edited Loading

vladd commented Apr 1, 2024

fgimian commented Apr 1, 2024 • edited Loading

tarekgh commented Apr 1, 2024

mklement0 commented Apr 1, 2024 • edited Loading

fgimian commented Apr 2, 2024

jkotas commented Apr 16, 2021 •

edited

Loading

jkotas commented Apr 16, 2021 •

edited

Loading

sharwell commented Apr 17, 2021 •

edited

Loading

mklement0 commented May 10, 2021 •

edited

Loading

fgimian commented Apr 1, 2024 •

edited

Loading

fgimian commented Apr 1, 2024 •

edited

Loading

mklement0 commented Apr 1, 2024 •

edited

Loading