Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing breaks after <script> or <style> block, followed by an entity (&blah;) #1426

Closed
KillyMXI opened this issue Feb 26, 2023 · 6 comments
Closed

Comments

@KillyMXI
Copy link
Contributor

KillyMXI commented Feb 26, 2023

Input:

import { parseDocument } from 'htmlparser2';

const document = parseDocument(
  '<style>a{}</style>&apos;<br/>',
  { decodeEntities: true }
);

console.log(document);

Observed output:

<ref *1> Document {
  type: 'root',
  parent: null,
  prev: null,
  next: null,
  startIndex: null,
  endIndex: null,
  children: [
    Element {
      type: 'style',
      parent: [Circular *1],
      prev: null,
      next: [Text],
      startIndex: null,
      endIndex: null,
      children: [Array],
      name: 'style',
      attribs: {}
    },
    Text {
      type: 'text',
      parent: [Circular *1],
      prev: [Element],
      next: null,
      startIndex: null,
      endIndex: null,
      data: "'<br/>"
    }
  ]
}

Expected: Text node contains "'", it is followed by an Element of type "tag" named "br".

When changed to <style>a{}</style>\'<br/> or <style>a{}</style><br/>&apos;<br/> - it works as expected.

When decodeEntities is set to false - it works as expected.

Version 6.1.0 is the last one that works as expected - it was broken in version 7.0.0.

First reported by @galenhuntington in html-to-text/node-html-to-text#285

@KillyMXI
Copy link
Contributor Author

tokenize("<style>a{}</style>&apos;<br/>")
Expand
[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag",
    12,
    17,
  ],
  [
    "ontextentity",
    39,
  ],
  [
    "ontext", // just text
    24,
    29,
  ],
  [
    "onend",
  ],
]
tokenize("<style>a{}</style><br/>&apos;<br/>")
Expand
[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag",
    12,
    17,
  ],
  [
    "onopentagname",
    19,
    21,
  ],
  [
    "onselfclosingtag",
    22,
  ],
  [
    "ontextentity",
    39,
  ],
  [
    "onopentagname", // tag, as expected
    30,
    32,
  ],
  [
    "onselfclosingtag",
    33,
  ],
  [
    "onend",
  ],
]

So the issue is in Tokenizer.

I tried to step through:

  • while Tokenizer has state = State.InSpecialTag (24), it also has baseState = State.InSpecialTag (24);
  • when the special tag ends, state is reset to State.Text (1), but baseState is left unchanged;
  • following named entity processing doesn't affect baseState but does reset the state to this erroneous baseState in the end;

Not sure if this is the cause but it looks suspicious.

@KillyMXI
Copy link
Contributor Author

KillyMXI commented Mar 22, 2023

--- a/src/Tokenizer.ts
+++ b/src/Tokenizer.ts
@@ -454,7 +454,8 @@ export default class Tokenizer {
     private stateAfterClosingTagName(c: number): void {
         // Skip everything until ">"
         if (c === CharCodes.Gt || this.fastForwardTo(CharCodes.Gt)) {
             this.state = State.Text;
+            this.baseState = State.Text;
             this.sectionStart = this.index + 1;
         }
     }
[
  [
    "onopentagname",
    1,
    6,
  ],
  [
    "onopentagend",
    6,
  ],
  [
    "ontext",
    7,
    10,
  ],
  [
    "onclosetag", // closed style tag
    12,
    17,
  ],
  [
    "ontextentity", // entity
    39,
  ],
  [
    "onopentagname", // following tag parsed properly
    25,
    27,
  ],
  [
    "onselfclosingtag",
    28,
  ],
  [
    "onend",
  ],
]

All existing tests still passing.

This fix seems to be similar how baseState is reset for self-closing tags. But I'm not sure I understand the code correctly to be sure there are no more edge cases. I'm also not sure where to put the unit test for this.

@fb55
Copy link
Owner

fb55 commented Mar 22, 2023

Thanks for the report, and awesome job figuring this one out!

Unit tests would go into https://github.com/fb55/htmlparser2/blob/master/src/Tokenizer.spec.ts, or the events test file. Run jest once, and you'll have the snapshots needed to avoid future issues.

@KillyMXI
Copy link
Contributor Author

I mean, locating the spec file is easy, describing the test requires more effort :)

@KillyMXI
Copy link
Contributor Author

Ok, decided on the description.

KillyMXI added a commit to KillyMXI/htmlparser2 that referenced this issue Mar 22, 2023
Prevents leaking baseState and breaking the Tokenizer
if followed by an entity - fb55#1426
fb55 pushed a commit that referenced this issue Mar 22, 2023
Prevents leaking baseState and breaking the Tokenizer
if followed by an entity - #1426
@fb55
Copy link
Owner

fb55 commented Mar 22, 2023

Fixed in #1460.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants