HTML character reference processing is inconsistent #488

gibson042 · 2022-10-05T21:07:56Z

#481 introduced processing of named character references, but is inconsistent with the HTML spec (cf. https://html.spec.whatwg.org/multipage/parsing.html#tokenization and specifically https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state )—it matches without a terminal semicolon but also fails to handle numeric references or the prefix matching that renders e.g. I'm &notit; I tell you as I'm ¬it; I tell you, and the inner logic lowercasing ≤ (which is its own bug and should be <) and & only with a terminal semicolon is inconsistent with it being optional in the initial match and also inconsistent with the lack of e.g. AMp; in https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references .

I think it should instead be either unconditionally strict¹ or else better aligned with HTML², and in either case should probably also be updated for numeric references and correct case handling.

e.g.,

--- src/formatter/text.ts
+++ src/formatter/text.ts
@@ -9,1 +9,1 @@
-  text = text.replace(/&[a-zA-Z0-9]+;?/g, m => {
+  text = text.replace(/&[a-zA-Z0-9]+;/g, m => {

↩

e.g.,

--- src/formatter/text.ts
+++ src/formatter/text.ts
@@ -11,2 +11,4 @@
-    if ({}.hasOwnProperty.call(entities, m) && entities[m] !== null) {
-      return entities[m];
+    const prefix = Array(m.length).fill().map((_, n) => n ? m.slice(0, -n) : m)
+      .find(prefix => ({}).hasOwnProperty.call(entities, prefix) && entities[prefix] !== null);
+    if (prefix) {
+      return entities[prefix] + m.slice(prefix.length);

↩

The text was updated successfully, but these errors were encountered:

bakkot · 2022-10-05T21:29:45Z

matches without a terminal semicolon

Only for legacy entities, which is consistent with HTML. The list of entities matched is taken directly from the HTML spec.

also fails to handle numeric references

I don't think we have any uses of that form to check if the decision to use that form was deliberate (e.g. the numeric value of the code point is relevant to the reader), so I didn't implement it. If there's some usages out there which you think are definitely not deliberate, I'd be happy to accept a PR adding that.

the prefix matching that renders e.g. I'm &notit; I tell you as I'm ¬it; I tell you

Omitting that case was a deliberate decision, which I stand by. Rather than formatting those, I would prefer to have a lint rule for unknown entity-like things which will warn on that case.

and the inner logic lowercasing ≤ (which is its own bug and should be <) and & only with a terminal semicolon

Fixed in #489, thanks.

and also inconsistent with the lack of e.g. AMp;

I don't think this is literally ever going to come up, so I don't want to spend time implementing it.

gibson042 · 2022-10-05T22:50:36Z

I don't think we have any uses of that form to check if the decision to use that form was deliberate (e.g. the numeric value of the code point is relevant to the reader), so I didn't implement it. If there's some usages out there which you think are definitely not deliberate, I'd be happy to accept a PR adding that.

I think it makes as much sense to convert numeric references as it does to convert named references, so I'll try to submit that.

the prefix matching that renders e.g. I'm &notit; I tell you as I'm ¬it; I tell you

Omitting that case was a deliberate decision, which I stand by. Rather than formatting those, I would prefer to have a lint rule for unknown entity-like things which will warn on that case.

and also inconsistent with the lack of e.g. AMp;

I don't think this is literally ever going to come up, so I don't want to spend time implementing it.

Perfect, that's the kind of context I was looking for. It would incline me towards requiring the semicolon for replacement and catching mistakes by linting (i.e., the first suggestion above). WDYT?

bakkot · 2022-10-05T23:03:19Z

It would incline me towards requiring the semicolon for replacement and catching mistakes by linting

I prefer errors to be automatically fixed when possible, which only the formatter is capable of at the moment. &amp is unambiguously supposed to be &, so it would be nice for the tooling to fix it rather than just pointing it out. This is not true of e.g. &notit;, where it's not clear what the user intended, so that needs to be a lint rule.

(It would be nice to make the linter support --fix, but that's a bunch more work.)

So I'm inclined to stick with the current strategy unless there's some improvement to the user experience from the strategy you propose.

gibson042 · 2022-10-05T23:21:09Z

Works for me.

Ref tc39#488

bakkot mentioned this issue Oct 5, 2022

handle a couple of entity edge cases #489

Merged

gibson042 closed this as completed Oct 5, 2022

gibson042 closed this as not planned Won't fix, can't repro, duplicate, stale Oct 6, 2022

gibson042 added a commit to gibson042/ecmarkup that referenced this issue Oct 6, 2022

formatter: render numeric character references like named ones

9567cbb

Ref tc39#488

gibson042 mentioned this issue Oct 6, 2022

formatter: render numeric character references like named ones #490

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTML character reference processing is inconsistent #488

HTML character reference processing is inconsistent #488

gibson042 commented Oct 5, 2022

bakkot commented Oct 5, 2022

gibson042 commented Oct 5, 2022

bakkot commented Oct 5, 2022

gibson042 commented Oct 5, 2022

HTML character reference processing is inconsistent #488

HTML character reference processing is inconsistent #488

Comments

gibson042 commented Oct 5, 2022

Footnotes

bakkot commented Oct 5, 2022

gibson042 commented Oct 5, 2022

bakkot commented Oct 5, 2022

gibson042 commented Oct 5, 2022