Require utf-8 when specifying character encoding #3091
Is @hsivonen now comfortable with this? When Encoding initially required this there was a little bit of fear it might be too soon. So maybe we should split it out for
I am in support of doing this everywhere. E.g. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg implies a good trend, and in general the "only UTF-8" meme has gotten pretty widespread. I haven't reviewed the commits yet, but will do so soon, under the assumption that we're gonna go all the way.
I think we should nudge authors towards making everything UTF-8. I'm still a bit worried about authors reacting to an error in a silly way: Making the I guess the exact message that the validator gives matters here. Assuming a message that is worded to complain more about the resource not being UTF-8 than about the value of the attribute per se, I'm OK with this. As for Reviewing the patch:
It seems to me that the patch should upgrade this to a MUST instead of removing it. I fail to locate a normative statement to that effect for the BOM and HTTP cases. (I see it only for the
source
<p class="note">A character encoding declaration is required (either in the <span
<div class="note">
<p>A character encoding declaration is required (either in the <span
indent by a space
source
data-x="Content-Type">Content-Type metadata</span> or explicitly in the file) even when all
characters are in the ASCII range, because a character encoding is needed to process non-ASCII
characters entered by the user in forms, in URLs generated by scripts, and so forth.</p>
<p>Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings,
Insert a blank line between paragraphs.
data-x="attr-meta-http-equiv-content-type">Encoding declaration state</span>, then the character
encoding used must be an <span>ASCII-compatible encoding</span>.</p>

<p>Authors should use <span>UTF-8</span>. Conformance checkers may advise authors against using
Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear (since the meta encoding declaration is itself optional and encoding could be specified in HTTP/BOM/XML decl).
> Is the idea here that the Encoding Standard already requires utf-8? Maybe put in a statement of fact here mentioning that requirement, so we're clear
OK, 94517b3 attempts to do that
Yeah, most of #3006 ends up withdrawn. I need to open a separate PR for the minor things I fixed on the side.
I know — but that was nearly 5 years ago (January 2013). So finally requiring UTF-8 in HTML almost 5 years after Encoding initially required it doesn’t seem like we’re exactly rushing things…
I’m OK with just merging the I assume we’d agree we don’t want to wait, say, another 5 years. But short of that it’s not clear to me how we can measure when it’s no longer too soon and we’re instead finally ready to go forward with it. So it seems like instead we just need to choose some point at which to do it, and then finally just do it.
Yeah, agreed that would be a counterproductive outcome
OK, that’s doable of course, but to be clear: in the validator architecture, that message would need to come from the parser code, right?
OK, I’ll make that change.
Not sure what you mean. I take it you don’t mean a normative statement about authoring tools in relation to the BOM, or a normative statement about authoring tools in relation to the HTTP-delivered charset.
@sideshowbarker it seems that everyone who commented here is okay with going ahead with it, so let's (finally) do it.
OK, made it so in 769d6fe
While the parser could make sense for
I meant the same thing as @zcorpan meant in the comment right after mine.
Aha yeah OK I’ll add a datatype checker for it that way to the validator sources
We should update those too.
I agree about text/html. But I think we should probably separate
“Update text/html registration” change LGTM
Per #3006 (comment), I was thinking we should make charset="utf-8" on script elements obsolete but conforming (i.e. validators display a warning), since in a UTF-8 document it is redundant, and we've recently been making redundant script attributes obsolete but conforming. This would mean the charset attribute on script gets a treatment similar to type on style.
13efba8 to dfef71a
Yes, will update the source on this branch to do that
I was going to do a review but then I thought it'd be easier to just tweak things myself so I got carried away and did a bit more. Let me know what you think :).
Looks beautiful 🎉
d891e56 to 7a64e46
This change adds a “must” requirement for UTF-8 in all but one of the places in the spec that define a means for specifying a character encoding.

Specifically, it makes UTF-8 required for any “character encoding declaration”, which includes the HTTP Content-Type header sent with any document, the `<meta charset>` element, and the `<meta http-equiv=content-type>` element. Along with those, this change also makes UTF-8 required for `<script charset>` but also moves `<script charset>` to being obsolete-but-conforming (because now that both documents and scripts are required to use UTF-8, it’s redundant to specify `charset` on the `script` element, since it inherits from the document).

To make the normative source of those requirements clear, this change also adds a specific citation to the relevant requirement from the Encoding standard, and updates the in-spec IANA registration for text/html media type to indicate that UTF-8 is required. Finally, it changes an existing requirement for authoring tools to use UTF-8 from a “should” to a “must”.

The one place where this change doesn’t yet add a requirement for UTF-8 is for the `form` element’s `accept-charset` attribute. For that, see issue #3097.
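For illustration only, the label requirement described in this change could be sketched as a small check like the one below. This is not the validator's actual code: the function names `ascii_lowercase`, `is_utf8_label`, and `check_declaration` are hypothetical, and the message text is an assumption, worded to complain about the resource not being UTF-8 rather than about the attribute value, as suggested earlier in this thread.

```python
from typing import Optional


def ascii_lowercase(value: str) -> str:
    """Lowercase only the ASCII letters A-Z; every other character is left untouched."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in value)


def is_utf8_label(label: str) -> bool:
    """Return True if `label` is an ASCII case-insensitive match for "utf-8"."""
    return ascii_lowercase(label) == "utf-8"


def check_declaration(label: str) -> Optional[str]:
    """Hypothetical checker: return None if the label conforms, else a message.

    The message wording is an assumption made for this sketch.
    """
    if is_utf8_label(label):
        return None
    return (
        "Documents must be encoded as UTF-8; "
        f'the declared encoding "{label}" is not "utf-8".'
    )
```

Under these assumptions, `check_declaration("UTF-8")` reports nothing, while `check_declaration("windows-1252")` and `check_declaration("utf8")` both produce a message, since only the exact label "utf-8" (in any ASCII case) is accepted.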
7a64e46 to 4089e5c
I found a couple more nits. Happy to fix these later today.
source
<p>The Encoding standard requires use of the <span>UTF-8</span> <span data-x="encoding">character
encoding</span> and requires use of the "<code data-x="">utf-8</code>" <span>encoding label</span>
to identify it. Those requirements necessitate that the document's <span>character encoding
declaration</span>, if it exists, specify an <span>encoding label</span> using an <span>ASCII
specifies?
source
case-insensitive</span> match for the string "<code data-x="">utf-8</code>". Regardless of whether
a <span>character encoding declaration</span> is present or not, the actual <span
data-x="document's character encoding">character encoding</span> used to store or transmit the
document must be <span>UTF-8</span>. <ref spec=ENCODING></p>
Maybe say "to encode the document". Storage and transmission have little to do with text encoding.
source
data-x="attr-script-async">async</code>, and <code data-x="attr-script-defer">defer</code>
attributes. Authors should omit the attribute instead of redundantly setting it.</p></li>
<code data-x="attr-script-async">async</code> and <code data-x="attr-script-defer">defer</code>
attributes (as well as the legacy <code data-x="attr-script-charset">charset</code> attribute).
Other legacy attributes can influence the processing model as well, but we don't mention them here. Is this really needed?
source
changes to the base URL also have no effect -->
<code data-x="attr-script-integrity">integrity</code> attributes dynamically has no direct effect;
these attributes are only used at specific times described below. (The same is true for the legacy
<code data-x="attr-script-charset">charset</code> attribute.</p> <!-- by implication, changes to
Again, I don't think we mention the other legacy attributes here.
source
<code>script</code> element must <span>reflect</span> the element's
<code data-x="attr-script-event">event</code> content attribute.</p>
<p>The <dfn><code data-x="dom-script-event">event</code></dfn> and <dfn><code
data-x="dom-script-charset">charset</code></dfn> IDL attributes of the <code>script</code> element
I think we do these in alphabetical order normally.
I think it looks good now, but someone should probably double check my edit.
Sure thing, done.
domenic commented on Oct 4:
It would be good to have some more recent data. The graph on Wikipedia is about 5 years old.
https://w3techs.com/technologies/history_overview/character_encoding/ms/y has up-to-date data:
So the 5-6 year trend is, UTF-8 usage has grown from 68% in January 2012 to over 90% now. And while it does show the rate of increase leveling off a bit, over the last 3 years it’s still been growing at over 2% per year.
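As a rough check on the figures quoted above, the average growth over the whole period can be worked out as below. This is only a sketch: it treats "over 90%" as exactly 90 (so the results are lower bounds), takes the period as either 5 or 6 years, and uses a hypothetical helper name, `average_annual_growth`.

```python
def average_annual_growth(start_share: float, end_share: float, years: float) -> float:
    """Average change in usage share, in percentage points per year."""
    return (end_share - start_share) / years


# Figures quoted in the comment above; 90.0 stands in for "over 90%".
START_SHARE = 68.0
END_SHARE_LOWER_BOUND = 90.0

for years in (5, 6):
    rate = average_annual_growth(START_SHARE, END_SHARE_LOWER_BOUND, years)
    print(f"{years} years: at least {rate:.1f} points per year on average")
```

Either way the long-run average comes out above the "over 2% per year" seen in the last 3 years, which is consistent with the rate of increase leveling off.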
This addresses #3006.