-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use parse5 as a default parser (closes #863) #985
Conversation
1 similar comment
Please, do not merge yet. I'll go through open issues and reference the ones that will be closed by this PR along with regression tests. Update: done (see OP post). \cc @jugglinmike @fb55 |
4 similar comments
This is awesome, thank you! I will give it a look as soon as I am able. |
Hey guys, I know everyone on is very busy, but wondering if there any timescale on when this will be merged in as would really like to start using parse5 with cheerio in a new project. |
Not currently, no. Please remember that this is a non-trivial change and calls for careful review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far!
@@ -164,14 +164,15 @@ $ = cheerio.load('<ul id="fruits">...</ul>', { | |||
}); | |||
``` | |||
|
|||
These parsing options are taken directly from [htmlparser2](https://github.com/fb55/htmlparser2/wiki/Parser-options), therefore any options that can be used in `htmlparser2` are valid in cheerio as well. The default options are: | |||
These parsing options are taken directly from [htmlparser2](https://github.com/fb55/htmlparser2/wiki/Parser-options), therefore any options that can be used in `htmlparser2` are valid in cheerio as well. If any of these options is set to non-default value cheerio will implicitly use `htmlparser2` as an underlying parser. In addition, you can use `useHtmlParser2` option to force cheerio use `htmlparser2` instead of `parse5`. The default options are: | |||
|
|||
```js | |||
{ | |||
withDomLvl1: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is withDomLvl1
implemented in the parse5 tree adapter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the same question, so I experimented a bit and found that Dom Level 1 support is analogous. Here's the relevant source code. Great!
lib/parse.js
Outdated
|
||
// Update the dom using the root | ||
exports.update(dom, root); | ||
|
||
return root; | ||
}; | ||
|
||
function parseWithParse5 (content) { | ||
var parseAsDocument = /^(\s|<!--.*?-->)*?<(!doctype|html|head|body)(.*?)>/i.test(content), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a smart hack, I'm not sure if it's expected though. IMHO we should only parse fragments right now and switch to full parsing in cheerio@1.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we'll always use fragment parsing parser will omit <html>
, <body>
and <head>
tags and doctypes, so it will basically will return empty AST for full documents. On the other hand, enabling full parsing will always add these tags. Even in 1.0 you will still need this hack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking clearly, it just enables the behavior that we currently have in cheerio. Unless we'll not add explicit API for document and fragment parsing, but as far as I can tell, it will lead to breaking change.
Okay, I didn't realize that's the case. My preferred final behavior would
be to have .load load documents and $() parse fragments. That suggestion
was only to keep backwards compatibility for now.
Given that that won't be possible, we might consider a semver major bump
for this. At the same time, we could also adopt the parse5 DOM structure
and add hooks for the old DOM structure (getters for type, name etc.) for
backwards compatibility, with eventually an option to turn it off for
better performance.
|
@fb OK, let's stick with this plan:
|
I've just realized that it will not be possible to switch to parse5 AST, since we use htmlparser2 for XML. |
Let me start be re-iterating: this is really great work! You've clearly done your homework when it comes to Cheerio's known bugs and backwards compatibility. This is a significant change to the core behavior of the library. While the value is clear, it's also a risk. Our test suite is strong, but I'm not convinced it is thorough enough to identify all possible regressions. It's also kind of surprising that any change to the parsing options triggers the use of a completely different parser, especially when one of the options is $ = cheerio.load('<ul id="fruits">...</ul>', {
xmlMode: true,
useHtmlParser2: false
}); It's not at all clear that example would actually use I agree with @fb55's thoughts about releasing this as part of a new major version for Cheerio. That would:
Does that sound good to you two? |
OK, I've just implemented changes we've discussed with @fb55 before: Regarding @jugglinmike's suggestion about parser options: I agree that implicit switching of options doesn't look good. I like the idea of |
Good point. I don't want to let difficulty choosing names determine what Do we really want to commit to simultaneously supporting both parsers for HTML This would certainly simplify configuration (basically, @fb55 I'd love your feedback here, too. Does it seem wise to support a single |
Any feedback on this? |
Sorry for the delay :/ As far as I remember the main reason to keep htmlparser2 was to have a parser that does not modify the DOM structure, or supports weird DOM structures (self-closing tags!). I would agree that that's in the realm of feature-creep. For XML, isaacs/sax-js is the better option, although it is stricter with what it considers valid XML. (It's also what jsdom is using for xml.) I'm not sure how we handle the DOM transition best. One option would be to continue to support the old names (ideally with the option to disable it), another to provide a codemod that tries to rename all property accesses. For the latter option we could just stick with parse5's DOM tree and continue to use |
To conclude, let's figure out features which should get into this PR and thus in |
Actually, I'm not in favor of this modification to my recommendation. This is
or
Although we may end up removing the second form prior to releasing this.
This sounds good to me. You've done a great job here, but I don't want to keep ...and on that note:
This may be callous of me, but I'm not particularly concerned with the end-user So to sum up, here's what I'm proposing for our road map:
Does that sound reasonable to you two? |
Works for me. @fb55, do you have any objections? |
Sounds reasonable, let's do it :)
|
@jugglinmike It's my opinion that |
Now that we're disallowing
I'm hoping this will avoid the kind of confusion voice in this
Internally, we'll still merge the parsing options to the top-level object and I've pushed a commit to #1033 that |
Is there a maintained fork with this PR? |
I'm merging this into master to resolve all of the linked issues. Afterwards, I will remove the master branch, rename the current These changes have already been merged into |
@inikulin can you please explain how this removed the recursive node decoration calls ? |
@dor-benatia This is an independent change; see #1559. |
This PR makes
parse5
a default parser for cheerio. If any parsing options are set to non-default values cheerio fallbacks tohtmlparser2
. Additionally,useHtmlParser2
option was introduced to force usage ofhtmlparser2
.