Parsing breaks after <script> or <style> block, followed by an entity (&blah;) #1426
tokenize("<style>a{}</style>&apos;<br/>")
[
[
"onopentagname",
1,
6,
],
[
"onopentagend",
6,
],
[
"ontext",
7,
10,
],
[
"onclosetag",
12,
17,
],
[
"ontextentity",
39,
],
[
"ontext", // just text
24,
29,
],
[
"onend",
],
]

tokenize("<style>a{}</style><br/>&apos;<br/>")
[
[
"onopentagname",
1,
6,
],
[
"onopentagend",
6,
],
[
"ontext",
7,
10,
],
[
"onclosetag",
12,
17,
],
[
"onopentagname",
19,
21,
],
[
"onselfclosingtag",
22,
],
[
"ontextentity",
39,
],
[
"onopentagname", // tag, as expected
30,
32,
],
[
"onselfclosingtag",
33,
],
[
"onend",
],
]

So the issue is in the Tokenizer. I tried to step through:
Not sure if this is the cause, but it looks suspicious.
--- a/src/Tokenizer.ts
+++ b/src/Tokenizer.ts
@@ -454,7 +454,8 @@ export default class Tokenizer {
     private stateAfterClosingTagName(c: number): void {
         // Skip everything until ">"
         if (c === CharCodes.Gt || this.fastForwardTo(CharCodes.Gt)) {
             this.state = State.Text;
+            this.baseState = State.Text;
             this.sectionStart = this.index + 1;
         }
     }

[
[
"onopentagname",
1,
6,
],
[
"onopentagend",
6,
],
[
"ontext",
7,
10,
],
[
"onclosetag", // closed style tag
12,
17,
],
[
"ontextentity", // entity
39,
],
[
"onopentagname", // following tag parsed properly
25,
27,
],
[
"onselfclosingtag",
28,
],
[
"onend",
],
]

All existing tests still pass. This fix seems similar to how baseState is reset for self-closing tags, but I'm not sure I understand the code well enough to rule out other edge cases. I'm also not sure where to put the unit test for this.
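To make the suspected mechanism concrete, here is a minimal sketch of a toy tokenizer. It is not htmlparser2's Tokenizer: the function `toyTokenize`, its `resetBaseStateOnClose` flag, and the event names are all hypothetical, and the entity spelling `&apos;` in the input is an assumption. The only thing it models is the idea discussed above: after an entity, the tokenizer returns to `baseState`, so a stale `baseState` left over from the `<style>` block sends it back into raw-text mode.

```typescript
// Toy model only; names and structure are illustrative assumptions.
enum State {
    Text,
    InSpecialTag,
}

type TokenEvent = [string, string];

function toyTokenize(input: string, resetBaseStateOnClose: boolean): TokenEvent[] {
    const events: TokenEvent[] = [];
    let state: State = State.Text;
    let baseState: State = State.Text; // state to return to after an entity
    let i = 0;

    while (i < input.length) {
        if (state === State.InSpecialTag) {
            // Raw text until "</style>"; markup in between is not interpreted.
            const close = input.indexOf("</style>", i);
            const end = close === -1 ? input.length : close;
            if (end > i) events.push(["text", input.slice(i, end)]);
            if (close === -1) break;
            events.push(["close", "style"]);
            i = close + "</style>".length;
            state = State.Text;
            // The proposed fix: also reset baseState when the special tag closes.
            if (resetBaseStateOnClose) baseState = State.Text;
        } else if (input.startsWith("<style>", i)) {
            events.push(["open", "style"]);
            i += "<style>".length;
            state = baseState = State.InSpecialTag;
        } else if (input[i] === "&") {
            const semi = input.indexOf(";", i);
            if (semi === -1) {
                events.push(["text", input.slice(i)]);
                break;
            }
            events.push(["entity", input.slice(i, semi + 1)]);
            i = semi + 1;
            state = baseState; // entity handling returns to baseState
        } else if (input[i] === "<") {
            const gt = input.indexOf(">", i);
            if (gt === -1) {
                events.push(["text", input.slice(i)]);
                break;
            }
            events.push(["tag", input.slice(i, gt + 1)]);
            i = gt + 1;
            baseState = State.Text; // ordinary tags reset baseState in this toy
        } else {
            // Plain text run up to the next "&" or "<".
            let j = i;
            while (j < input.length && input[j] !== "&" && input[j] !== "<") j++;
            events.push(["text", input.slice(i, j)]);
            i = j;
        }
    }
    return events;
}

// Without the reset, the final <br/> comes out as a "text" event;
// with the reset, it comes out as a "tag" event.
console.log(toyTokenize("<style>a{}</style>&apos;<br/>", false));
console.log(toyTokenize("<style>a{}</style>&apos;<br/>", true));
```

In this toy, an ordinary tag between `</style>` and the entity also resets `baseState`, which is why the `<style>a{}</style><br/>&apos;<br/>` variant is unaffected even without the fix.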
Thanks for the report, and awesome job figuring this one out! Unit tests would go into https://github.com/fb55/htmlparser2/blob/master/src/Tokenizer.spec.ts, or the events test file. Run jest once, and you'll have the snapshots needed to avoid future issues.
I mean, locating the spec file is easy; describing the test requires more effort :)
Ok, decided on the description.
Prevents leaking baseState and breaking the Tokenizer if followed by an entity - fb55#1426
Fixed in #1460.
Input:
Observed output:
Expected: a Text node containing "'", followed by an Element of type "tag" named "br".
When the input is changed to <style>a{}</style>\&apos;<br/> or <style>a{}</style><br/>&apos;<br/>, it works as expected.
When decodeEntities is set to false, it works as expected.
Version 6.1.0 is the last one that works as expected; it was broken in version 7.0.0.
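The event offsets quoted in the comments can be cross-checked against the input string. The sketch below assumes the entity was written as `&apos;`: the offsets imply a six-character entity that decodes to code point 39, but the exact spelling is an assumption. It shows that the range reported as plain text in the failing run (24 to 29) is exactly the trailing `<br/>` tag.

```typescript
// Assumed input; the "&apos;" spelling is inferred from the offsets, not confirmed.
const input = "<style>a{}</style>&apos;<br/>";

// Where the entity and the trailing tag start in the input.
const entityStart = input.indexOf("&apos;");
const brStart = input.indexOf("<br/>");

// The failing run reported ["ontext", 24, 29]; this is the slice it covers.
const reportedAsText = input.slice(24, 29);

// The value passed to "ontextentity" is the decoded code point of the apostrophe.
const apostropheCodePoint = "'".codePointAt(0);

console.log({ entityStart, brStart, reportedAsText, apostropheCodePoint });
```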
First reported by @galenhuntington in html-to-text/node-html-to-text#285