
[tokenizer] don't reset comment state in case of long endings #49

Merged 1 commit on Jun 6, 2013

Conversation

AndreasMadsen (Collaborator)

Found the bug when scraping http://www.guardian.co.uk/world/2013/jun/06/florida-tropical-storm-andrea

See the test case below:

var Parser = require('htmlparser2').Parser;

var stream = new Parser({
  onopentag: function (tagname, attr) {
    console.log('open: ' + tagname, attr);
  },

  oncomment: function (value) {
    console.log('open comment: ' + value);
  },

  ontext: function (text) {
    console.log('text: ' + text);
  },

  oncommentend: function () {
    console.log('close comment');
  },

  onclosetag: function (tagname) {
    console.log('close: ' + tagname);
  }
});

stream.write('<meta id="before">');
stream.write('<!-- text --->');
stream.write('<meta id="after">');
stream.end();

Outputs this on master:

open: meta { id: 'before' }
close: meta
open comment:  text ---><meta id="after">
close comment

With this fix:

open: meta { id: 'before' }
close: meta
open comment:  text -
close comment
open: meta { id: 'after' }
close: meta
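The difference comes down to how the tokenizer handles a run of more than two dashes before `>`. The following is a minimal, self-contained sketch (not htmlparser2's actual code; the function name and state names are hypothetical) of the comment-ending states. The key line is the one marked "the fix": once two dashes have been seen, an extra `-` must emit a dash and keep the "seen `--`" state, rather than falling back to the in-comment state and swallowing the rest of the input.

```javascript
// Minimal sketch of a comment-body tokenizer. Input is everything after
// "<!--"; returns the comment text and the unconsumed remainder.
function extractComment(html) {
  const IN_COMMENT = 0, DASH = 1, DASH_DASH = 2;
  let state = IN_COMMENT;
  let value = '';
  for (let i = 0; i < html.length; i++) {
    const c = html[i];
    if (state === IN_COMMENT) {
      if (c === '-') state = DASH;
      else value += c;
    } else if (state === DASH) {
      if (c === '-') state = DASH_DASH;
      else { value += '-' + c; state = IN_COMMENT; }
    } else { // DASH_DASH: we have seen "--"
      if (c === '>') return { value, rest: html.slice(i + 1) };
      else if (c === '-') value += '-';        // stay in DASH_DASH — the fix
      else { value += '--' + c; state = IN_COMMENT; }
    }
  }
  return { value, rest: '' };
}
```

For the input above, `extractComment(' text ---><meta id="after">')` yields the comment text `' text -'` and leaves `'<meta id="after">'` to be parsed normally, matching the fixed output.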

fb55 added a commit that referenced this pull request Jun 6, 2013
[tokenizer] don't reset comment state in case of long endings
@fb55 fb55 merged commit 623cd89 into fb55:master Jun 6, 2013
@AndreasMadsen (Collaborator, Author)

That was quick :)

@fb55 (Owner) commented Jun 6, 2013

Good catch, thanks!

There should be a similar issue with CDATA, where ]]]> is ignored. The AFTER_CDATA_2 state needs a similar check (for ]). It would be great if you could fix that too (or at least contribute a test case).
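The CDATA case is structurally identical: a run of more than two `]` before `>` must not reset the end-of-section state. A hedged sketch along the same lines as the comment fix (again hypothetical names, not htmlparser2's real tokenizer), where the marked line is the missing check fb55 describes:

```javascript
// Minimal sketch of a CDATA-body tokenizer. Input is everything after
// "<![CDATA["; returns the section text and the unconsumed remainder.
function extractCdata(text) {
  const IN_CDATA = 0, BRACKET = 1, BRACKET_BRACKET = 2;
  let state = IN_CDATA;
  let value = '';
  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    if (state === IN_CDATA) {
      if (c === ']') state = BRACKET;
      else value += c;
    } else if (state === BRACKET) {
      if (c === ']') state = BRACKET_BRACKET;
      else { value += ']' + c; state = IN_CDATA; }
    } else { // BRACKET_BRACKET: we have seen "]]"
      if (c === '>') return { value, rest: text.slice(i + 1) };
      else if (c === ']') value += ']';        // stay — the check for "]"
      else { value += ']]' + c; state = IN_CDATA; }
    }
  }
  return { value, rest: '' };
}
```

With this check, `extractCdata('foo]]]><p>')` yields the section text `'foo]'` and leaves `'<p>'` for the parser, instead of the `]]]>` terminator being ignored.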

fb55 added a commit that referenced this pull request Oct 21, 2018
[tokenizer] don't reset comment state in case of long endings