Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it possible to use parse5 as optional(?) parser #863

Closed
inikulin opened this issue May 11, 2016 · 28 comments · Fixed by #985
Closed

Make it possible to use parse5 as optional(?) parser #863

inikulin opened this issue May 11, 2016 · 28 comments · Fixed by #985

Comments

@inikulin
Copy link
Contributor

Today I'm deprecating whacko, so, if cheerio maintainers don't mind, I'll send those who is interested in this feature here.

@fb55
Copy link
Member

fb55 commented May 12, 2016

@inikulin Would you mind sending a PR? As I haven't worked on htmlparser2 for a long while, it is probably better to drop it entirely.

@inikulin
Copy link
Contributor Author

Would you mind sending a PR?

Sure. Unfortunately, I'm quite busy at the moment, however I believe I'll be able to get my hands on it somewhere at the end of the month.

@dypsilon
Copy link

Hi, what is the status on this? It would be huge improvement.

@inikulin
Copy link
Contributor Author

@dypsilon Currently I'm extremely busy. As soon as I'm done with my job duties (I hope at the mid-August) I will start working on this one. Meanwhile, PRs are welcome if someone else would like to start work on this.

@fb55 fb55 mentioned this issue Jul 20, 2016
11 tasks
@stevenvachon
Copy link
Contributor

If htmlparser2 is removed, what will be used to parse XML? sax?

@fb55
Copy link
Member

fb55 commented Aug 4, 2016

You will still be able to modify htmlparser2 dom stuctures, so you might have to parse xml in a separate step (one additional line).

@stevenvachon
Copy link
Contributor

@fb55 but which tool will/should be used in that separate step?

@fb55
Copy link
Member

fb55 commented Aug 4, 2016 via email

@stevenvachon
Copy link
Contributor

Ok, so then cheerio will lose the ability to parse XML out-of-the-box.

Perhaps an XML-only fork of htmlparser2 is on the distant horizon?

@luanmuniz
Copy link
Contributor

parse5 will parse xml as well. Its what whacko is using

@stevenvachon
Copy link
Contributor

@luanmuniz XML is case-sensitive, HTML is not.

@luanmuniz
Copy link
Contributor

@stevenvachon https://github.com/inikulin/parse5
It parse xml as well

@fb55 i would like todo this. I'm sing whacko for a long time and since its not supported anymore. I would like to make cheerio use parse5. Can you give me some instructions and points that i need to pay attention when i'm working on this?

@stevenvachon
Copy link
Contributor

stevenvachon commented Aug 22, 2016

@luanmuniz Just because it parses it, does not mean that it does so accurately nor reliably. It will perform HTML operations on XML, which definitely could result in errors.

@luanmuniz
Copy link
Contributor

@stevenvachon I use whacko, that is a fork of cheerio using parse5. I manipulate a few XML and it works fine. But anyway, cheerio is for HTML. XML is secondary. The project intend to imitate the jQuery api for html, not be a xml parser.

I will try to make it work with xml as well. I will keep that in mind when doing this.

@fb55
Copy link
Member

fb55 commented Aug 23, 2016

@luanmuniz That's awesome! I guess using parse5 in document mode for .load and in fragment mode for $() would be a great first step (and should be portable from whacko).

For now, I'd like to keep using htmlparser2 if xmlMode is enabled (and follow jsdom afterwards with using sax.js for xml).

A second step would be to also adopt parse5's serializer. I would like to preserve the option to output ASCII-only strings (which is the current behavior), not sure if that's supported by the serializer or needs another step.

Finally, and way out of scope for this, cheerio should adopt a better DOM representation with names from the DOM spec. parse5's own DOM is substantially better here, but we would need to add some layer for backwards compatibility (at least until a 1.0.0 release).

You can create a PR early on and ping me for help and feedback.

@inikulin
Copy link
Contributor Author

inikulin commented Aug 26, 2016

Hi guys.

I was thinking a lot about how this should be implemented and I would like to share my concerns.
I see two usage scenarios for cheerio:

  • Scrapping and content analysis
  • HTML reprocessing

The problem with full-featured HTML parsing is that it's not round-tripable: once document parsed you will get perfectly valid DOM (like in browsers), but if you will try to serialize it back you may end up with broken markup (e.g. whatwg/html#1280). The only way to avoid such behavior is to have perfectly valid HTML as an input. That's why I believe full featured algorithm will not fit well for HTML reprocessing scenario. The best thing that can be done here is to build LexicalTreeParser built on top of parse5 SAXParser: you will get correctly tokenized HTML with real parser feedback simulation, but with lexical node nesting order. @jescalan worked on such parser for https://github.com/posthtml/posthtml project (see discussion: inikulin/parse5#144, inikulin/parse5#132). Maybe he could share his experience?

For scraping full-featured parsing will fit better, since more likely you expect to deal with DOM that produced by the browser when it opens a page.

The question is how to introduce this two parsing modes in cheerio API.

Regarding XML: it can't be parsed with parse5. HTML5 and XML are completely different languages. As @fb55 mentioned guys from JSDOM now use XML parser built on top of sax package and AFAIK it solved some XML parsing issues for them. Maybe @Sebmaster could give more insights on the topic.

@jescalan
Copy link

Hi there!

So I actually ended up needing to branch off from posthtml due to fundamental differences in visions for the future, and posthtml is not using the work I did on the parser at all. I think they are planning on building their own parser from scratch now for some reason.

You can find the work I did with parse5 here - https://github.com/reshape/parser. It's working quite well, thank you for such a great project and for your assistance, @inikulin!

I used the SAXParser instead of parse5 directly because I needed a bit more forgiving of a parser in order to be able to accommodate some common use-cases such as layouts and partials that are implemented on top of reshape. It was pretty straightforward working with the SAXParser, took less than a day to build it out, and it has been working reliably since then.

I also did some work with htmlparser2, so I'll speak quickly to differences I noticed between them. Parse5's SAXParser has a remarkably similar API and could easily produce an output tree that is the same as what htmlparser2 produces. The main difference I have run into is that it's a bit more granular in the way it breaks down doctype and attributes containing :s, so this needs to be handled explicitly. In addition, it is more strict in the manner of coercing attribute names to all lower case in accordance with the html spec. While this is a good practice without question, it can cause issues with some frameworks like angular that use strange invalid html attributes at the core. I'm sure there are a couple other subtle differences I will run into in time, but they are all easily patched.

In addition, I have been able to parse large trees of custom tags, and SVG structures without issue using the SAXParser. I am not super knowledgeable about XML and the finer details of its spec, but I would estimate that it would be able to parse most basic XML structures without a problem.

Hope this helps, and let me know if there's any other way I can help out!

@fb55
Copy link
Member

fb55 commented Aug 27, 2016

@inikulin Very interesting problem. I feel like we could get away with ignoring that issue for the common case and offering a way to fall back to htmlparser2 otherwise. I don't see a clear advantage of using parse5 without the tree builder stuff, but that's just personal preference.

At the same time, I would like to adopt parse5's serializer asap. Would it perhaps be possible to split it into a separate module that cheerio can switch to right away?

@inikulin
Copy link
Contributor Author

The advantage is to have proper tokenization. Even if you switch to SAXParser-based parser without full featured html parsing it should leverage existing parsing experience (+ you will have parse5 location info, including attributes location) Regarding serializer: I'm not sure if it will make much sense if you going to switch to parse5 anyway. There is some shared code beetwen parser and serializer, so I'd like to keep them in one package

@fb55
Copy link
Member

fb55 commented Aug 31, 2016

@inikulin That makes a lot of sense. How do you feel about adding an option to parse5's default DOM implementation that allows adding a custom __proto__ to nodes? This would be incredibly useful for backwards compatibility and would allow cheerio to switch without any code changes.

@inikulin
Copy link
Contributor Author

inikulin commented Aug 31, 2016

@fb55 I believe you can accomplish this task by extending one of existing tree adapters: you can create new adapter via _.assign using htmlparser2 adapter as a proto and then override node construction methods and then pass it to parse5. Please, let me know if it works for you.

@sentanos
Copy link

Any updates on this? It seems like the blatant memory leaks from the existing parser are being given very little priority even though they can be extremely inconvenient.

@inikulin
Copy link
Contributor Author

We need some bike shedding on #863 (comment) I would like to have some feedback on this topic before I start implementing anything.

@fb55
Copy link
Member

fb55 commented Feb 19, 2017 via email

inikulin added a commit to inikulin/whacko that referenced this issue Feb 21, 2017
inikulin added a commit to inikulin/whacko that referenced this issue Feb 21, 2017
inikulin added a commit to inikulin/whacko that referenced this issue Feb 25, 2017
@stevenvachon
Copy link
Contributor

Is anything happening here? It is pretty tiring explaining to people that cheerio is problematic because of htmlparser2 and for most of my use cases, I cannot use cheerio.

@inikulin
Copy link
Contributor Author

@stevenvachon Yeah, PR is issued and approved. It will be landed as RC at first and then in cheerio 1.0.0

@mirague
Copy link

mirague commented Mar 13, 2018

How far along until the RC? :)

@fb55 fb55 closed this as completed in ca6963c Oct 3, 2018
@pencilcheck
Copy link

Getting a lot of heap out of memory with parse5

fb55 pushed a commit that referenced this issue Dec 20, 2020
* Use parse5 as a default parser (closes #863)

* Use documents via $.load

* Add test for #997

* Change options format
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

9 participants