Parsing breaks after `<script>` or `<style>` block, followed by an entity (`&blah;`) #1426

KillyMXI · 2023-02-26T10:17:26Z

Input:

import { parseDocument } from 'htmlparser2';

const document = parseDocument(
  '<style>a{}</style>&apos;<br/>',
  { decodeEntities: true }
);

console.log(document);

Observed output:

<ref *1> Document {
  type: 'root',
  parent: null,
  prev: null,
  next: null,
  startIndex: null,
  endIndex: null,
  children: [
    Element {
      type: 'style',
      parent: [Circular *1],
      prev: null,
      next: [Text],
      startIndex: null,
      endIndex: null,
      children: [Array],
      name: 'style',
      attribs: {}
    },
    Text {
      type: 'text',
      parent: [Circular *1],
      prev: [Element],
      next: null,
      startIndex: null,
      endIndex: null,
      data: "'<br/>"
    }
  ]
}

Expected: the Text node contains only "'" and is followed by an Element of type "tag" named "br".

When the input is changed to `<style>a{}</style>\'<br/>` or `<style>a{}</style><br/>&apos;<br/>`, it works as expected.

When decodeEntities is set to false - it works as expected.

Version 6.1.0 is the last one that works as expected - it was broken in version 7.0.0.

First reported by @galenhuntington in html-to-text/node-html-to-text#285


KillyMXI · 2023-03-21T15:50:34Z

tokenize("<style>a{}</style>&apos;<br/>")


[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag",
    12,
    17,
  ],
  [
    "ontextentity",
    39,
  ],
  [
    "ontext", // just text
    24,
    29,
  ],
  [
    "onend",
  ],
]
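To make the log above easier to read, here is a small sketch that maps the logged numbers back onto the input string. It assumes that two-number events carry start and end offsets into the input and that `ontextentity` carries a decoded code point; that reading is inferred from the log, not taken from the Tokenizer source. `explainEvents` and `LoggedEvent` are hypothetical names used only for this illustration.

```typescript
// One logged event: a callback name plus up to two numeric arguments.
type LoggedEvent = [name: string, start?: number, end?: number];

// Turn each event into a readable line by slicing the input
// (assumption: the two numbers are start/end offsets into `input`).
function explainEvents(input: string, events: LoggedEvent[]): string[] {
    return events.map(([name, start, end]) => {
        if (name === "ontextentity" && start !== undefined) {
            // Assumption: the single argument is the decoded code point.
            return `${name}: ${JSON.stringify(String.fromCodePoint(start))}`;
        }
        if (start !== undefined && end !== undefined) {
            return `${name}: ${JSON.stringify(input.slice(start, end))}`;
        }
        return name;
    });
}

const brokenInput = "<style>a{}</style>&apos;<br/>";
const brokenLog: LoggedEvent[] = [
    ["onopentagname", 1, 6],
    ["onopentagend", 6],
    ["ontext", 7, 10],
    ["onclosetag", 12, 17],
    ["ontextentity", 39],
    ["ontext", 24, 29],
    ["onend"],
];

console.log(explainEvents(brokenInput, brokenLog).join("\n"));
```

Read this way, the final `ontext` event covers the literal `<br/>`, which matches the stray markup seen in the Text node of the original report.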

tokenize("<style>a{}</style><br/>&apos;<br/>")


[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag",
    12,
    17,
  ],
  [
    "onopentagname",
    19,
    21,
  ],
  [
    "onselfclosingtag",
    22,
  ],
  [
    "ontextentity",
    39,
  ],
  [
    "onopentagname", // tag, as expected
    30,
    32,
  ],
  [
    "onselfclosingtag",
    33,
  ],
  [
    "onend",
  ],
]

So the issue is in Tokenizer.

I tried to step through:

While the Tokenizer has state = State.InSpecialTag (24), it also has baseState = State.InSpecialTag (24).
When the special tag ends, state is reset to State.Text (1), but baseState is left unchanged.
The named entity processing that follows doesn't modify baseState, but when it finishes it resets state to this stale baseState.

Not sure if this is the cause but it looks suspicious.
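The suspected mechanism can be reproduced in a toy model. The sketch below is not htmlparser2's Tokenizer; it is a deliberately simplified scanner with only two states, and `toyTokenize`, `Token`, and the `resetBaseState` flag are hypothetical names introduced for illustration. It only mirrors the three steps described above: entering a special tag sets both `state` and `baseState`, leaving it resets only `state`, and finishing an entity restores `state` from `baseState`.

```typescript
enum State { Text, InSpecialTag }

type Token = [kind: string, value: string];

// Toy scanner. `resetBaseState` toggles whether leaving a special tag
// also resets baseState (the behaviour under suspicion).
function toyTokenize(input: string, resetBaseState: boolean): Token[] {
    const tokens: Token[] = [];
    let state: State = State.Text;
    let baseState: State = State.Text;
    let special = "";
    let i = 0;

    while (i < input.length) {
        if (state === State.InSpecialTag) {
            // Raw text mode: everything is text until the matching close tag.
            const close = `</${special}>`;
            const end = input.indexOf(close, i);
            if (end === -1) {
                tokens.push(["text", input.slice(i)]);
                break;
            }
            if (end > i) tokens.push(["text", input.slice(i, end)]);
            tokens.push(["closetag", special]);
            i = end + close.length;
            state = State.Text;
            if (resetBaseState) baseState = State.Text;
        } else if (input[i] === "<") {
            const end = input.indexOf(">", i);
            if (end === -1) {
                tokens.push(["text", input.slice(i)]);
                break;
            }
            // Drop a trailing "/" so "<br/>" yields the name "br".
            const name = input.slice(i + 1, end).replace(/\/$/, "");
            tokens.push(["opentag", name]);
            i = end + 1;
            if (name === "style" || name === "script") {
                special = name;
                state = baseState = State.InSpecialTag;
            }
        } else if (input[i] === "&") {
            const end = input.indexOf(";", i);
            if (end === -1) {
                tokens.push(["text", input.slice(i)]);
                break;
            }
            tokens.push(["entity", input.slice(i + 1, end)]);
            i = end + 1;
            // Entity handling returns to whatever baseState says.
            state = baseState;
        } else {
            // Plain text up to the next "<" or "&".
            let end = i;
            while (end < input.length && input[end] !== "<" && input[end] !== "&") end++;
            tokens.push(["text", input.slice(i, end)]);
            i = end;
        }
    }
    return tokens;
}

const sample = "<style>a{}</style>&apos;<br/>";
console.log(toyTokenize(sample, false)); // stale baseState: "<br/>" ends up as text
console.log(toyTokenize(sample, true));  // baseState reset: "<br/>" is an opening tag
```

In this toy, the only difference between the two runs is whether `baseState` is reset when the special tag closes, which is enough to reproduce the same symptom as in the logs above.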

KillyMXI · 2023-03-22T15:54:41Z

--- a/src/Tokenizer.ts
+++ b/src/Tokenizer.ts
@@ -454,7 +454,8 @@ export default class Tokenizer {
     private stateAfterClosingTagName(c: number): void {
         // Skip everything until ">"
         if (c === CharCodes.Gt || this.fastForwardTo(CharCodes.Gt)) {
             this.state = State.Text;
+            this.baseState = State.Text;
             this.sectionStart = this.index + 1;
         }
     }

[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag", // closed style tag
    12,
    17,
  ],
  [
    "ontextentity", // entity
    39,
  ],
  [
    "onopentagname", // following tag parsed properly
    25,
    27,
  ],
  [
    "onselfclosingtag",
    28,
  ],
  [
    "onend",
  ],
]

All existing tests still passing.

This fix seems to be similar to how baseState is reset for self-closing tags. But I'm not sure I understand the code well enough to be certain there are no more edge cases. I'm also not sure where to put the unit test for this.

fb55 · 2023-03-22T16:01:32Z

Thanks for the report, and awesome job figuring this one out!

Unit tests would go into https://github.com/fb55/htmlparser2/blob/master/src/Tokenizer.spec.ts, or the events test file. Run jest once, and you'll have the snapshots needed to avoid future issues.

KillyMXI · 2023-03-22T16:04:32Z

I mean, locating the spec file is easy, describing the test requires more effort :)

KillyMXI · 2023-03-22T16:06:40Z

Ok, decided on the description.

Prevents leaking baseState and breaking the Tokenizer if followed by an entity - fb55#1426


fb55 · 2023-03-22T23:34:00Z

Fixed in #1460.

KillyMXI mentioned this issue Feb 26, 2023

Entity after script tag results in HTML being copied. html-to-text/node-html-to-text#285

Closed

KillyMXI added a commit to KillyMXI/htmlparser2 that referenced this issue Mar 22, 2023

Reset baseState after closing tag name.

5b1e581

Prevents leaking baseState and breaking the Tokenizer if followed by an entity - fb55#1426

KillyMXI mentioned this issue Mar 22, 2023

Reset baseState after closing tag name #1460

Merged

fb55 pushed a commit that referenced this issue Mar 22, 2023

fix(tokenizer): Reset baseState after closing tag name (#1460)

f6dc2d3

Prevents leaking baseState and breaking the Tokenizer if followed by an entity - #1426

fb55 closed this as completed Mar 22, 2023

NewEraCracker mentioned this issue Jul 25, 2024

sanitizeHtml throws TypeError on '&' symbol apostrophecms/sanitize-html#606

Open
