Make it possible to use parse5 as optional(?) parser #863

inikulin · 2016-05-11T23:27:01Z

Today I'm deprecating whacko, so, if cheerio maintainers don't mind, I'll send those who is interested in this feature here.

fb55 · 2016-05-12T06:01:01Z

@inikulin Would you mind sending a PR? As I haven't worked on htmlparser2 for a long while, it is probably better to drop it entirely.

inikulin · 2016-05-12T10:22:17Z

Would you mind sending a PR?

Sure. Unfortunately, I'm quite busy at the moment, however I believe I'll be able to get my hands on it somewhere at the end of the month.

dypsilon · 2016-07-20T11:49:54Z

Hi, what is the status on this? It would be huge improvement.

inikulin · 2016-07-20T12:19:00Z

@dypsilon Currently I'm extremely busy. As soon as I'm done with my job duties (I hope at the mid-August) I will start working on this one. Meanwhile, PRs are welcome if someone else would like to start work on this.

stevenvachon · 2016-08-03T23:39:50Z

If htmlparser2 is removed, what will be used to parse XML? sax?

fb55 · 2016-08-04T01:13:24Z

You will still be able to modify htmlparser2 dom stuctures, so you might have to parse xml in a separate step (one additional line).

stevenvachon · 2016-08-04T02:03:11Z

@fb55 but which tool will/should be used in that separate step?

fb55 · 2016-08-04T06:20:03Z

htmlparser2 directly :)

stevenvachon · 2016-08-04T12:50:30Z

Ok, so then cheerio will lose the ability to parse XML out-of-the-box.

Perhaps an XML-only fork of htmlparser2 is on the distant horizon?

luanmuniz · 2016-08-21T01:28:36Z

parse5 will parse xml as well. Its what whacko is using

stevenvachon · 2016-08-21T13:01:37Z

@luanmuniz XML is case-sensitive, HTML is not.

luanmuniz · 2016-08-22T15:14:02Z

@stevenvachon https://github.com/inikulin/parse5
It parse xml as well

@fb55 i would like todo this. I'm sing whacko for a long time and since its not supported anymore. I would like to make cheerio use parse5. Can you give me some instructions and points that i need to pay attention when i'm working on this?

stevenvachon · 2016-08-22T15:55:40Z

@luanmuniz Just because it parses it, does not mean that it does so accurately nor reliably. It will perform HTML operations on XML, which definitely could result in errors.

luanmuniz · 2016-08-22T16:28:32Z

@stevenvachon I use whacko, that is a fork of cheerio using parse5. I manipulate a few XML and it works fine. But anyway, cheerio is for HTML. XML is secondary. The project intend to imitate the jQuery api for html, not be a xml parser.

I will try to make it work with xml as well. I will keep that in mind when doing this.

fb55 · 2016-08-23T02:05:14Z

@luanmuniz That's awesome! I guess using parse5 in document mode for .load and in fragment mode for $() would be a great first step (and should be portable from whacko).

For now, I'd like to keep using htmlparser2 if xmlMode is enabled (and follow jsdom afterwards with using sax.js for xml).

A second step would be to also adopt parse5's serializer. I would like to preserve the option to output ASCII-only strings (which is the current behavior), not sure if that's supported by the serializer or needs another step.

Finally, and way out of scope for this, cheerio should adopt a better DOM representation with names from the DOM spec. parse5's own DOM is substantially better here, but we would need to add some layer for backwards compatibility (at least until a 1.0.0 release).

You can create a PR early on and ping me for help and feedback.

inikulin · 2016-08-26T13:25:20Z

Hi guys.

I was thinking a lot about how this should be implemented and I would like to share my concerns.
I see two usage scenarios for cheerio:

Scrapping and content analysis
HTML reprocessing

The problem with full-featured HTML parsing is that it's not round-tripable: once document parsed you will get perfectly valid DOM (like in browsers), but if you will try to serialize it back you may end up with broken markup (e.g. whatwg/html#1280). The only way to avoid such behavior is to have perfectly valid HTML as an input. That's why I believe full featured algorithm will not fit well for HTML reprocessing scenario. The best thing that can be done here is to build LexicalTreeParser built on top of parse5 SAXParser: you will get correctly tokenized HTML with real parser feedback simulation, but with lexical node nesting order. @jescalan worked on such parser for https://github.com/posthtml/posthtml project (see discussion: inikulin/parse5#144, inikulin/parse5#132). Maybe he could share his experience?

For scraping full-featured parsing will fit better, since more likely you expect to deal with DOM that produced by the browser when it opens a page.

The question is how to introduce this two parsing modes in cheerio API.

Regarding XML: it can't be parsed with parse5. HTML5 and XML are completely different languages. As @fb55 mentioned guys from JSDOM now use XML parser built on top of sax package and AFAIK it solved some XML parsing issues for them. Maybe @Sebmaster could give more insights on the topic.

jescalan · 2016-08-26T14:05:35Z

Hi there!

So I actually ended up needing to branch off from posthtml due to fundamental differences in visions for the future, and posthtml is not using the work I did on the parser at all. I think they are planning on building their own parser from scratch now for some reason.

You can find the work I did with parse5 here - https://github.com/reshape/parser. It's working quite well, thank you for such a great project and for your assistance, @inikulin!

I used the SAXParser instead of parse5 directly because I needed a bit more forgiving of a parser in order to be able to accommodate some common use-cases such as layouts and partials that are implemented on top of reshape. It was pretty straightforward working with the SAXParser, took less than a day to build it out, and it has been working reliably since then.

I also did some work with htmlparser2, so I'll speak quickly to differences I noticed between them. Parse5's SAXParser has a remarkably similar API and could easily produce an output tree that is the same as what htmlparser2 produces. The main difference I have run into is that it's a bit more granular in the way it breaks down doctype and attributes containing :s, so this needs to be handled explicitly. In addition, it is more strict in the manner of coercing attribute names to all lower case in accordance with the html spec. While this is a good practice without question, it can cause issues with some frameworks like angular that use strange invalid html attributes at the core. I'm sure there are a couple other subtle differences I will run into in time, but they are all easily patched.

In addition, I have been able to parse large trees of custom tags, and SVG structures without issue using the SAXParser. I am not super knowledgeable about XML and the finer details of its spec, but I would estimate that it would be able to parse most basic XML structures without a problem.

Hope this helps, and let me know if there's any other way I can help out!

fb55 · 2016-08-27T18:31:55Z

@inikulin Very interesting problem. I feel like we could get away with ignoring that issue for the common case and offering a way to fall back to htmlparser2 otherwise. I don't see a clear advantage of using parse5 without the tree builder stuff, but that's just personal preference.

At the same time, I would like to adopt parse5's serializer asap. Would it perhaps be possible to split it into a separate module that cheerio can switch to right away?

inikulin · 2016-08-27T19:54:30Z

The advantage is to have proper tokenization. Even if you switch to SAXParser-based parser without full featured html parsing it should leverage existing parsing experience (+ you will have parse5 location info, including attributes location) Regarding serializer: I'm not sure if it will make much sense if you going to switch to parse5 anyway. There is some shared code beetwen parser and serializer, so I'd like to keep them in one package

fb55 · 2016-08-31T07:44:23Z

@inikulin That makes a lot of sense. How do you feel about adding an option to parse5's default DOM implementation that allows adding a custom __proto__ to nodes? This would be incredibly useful for backwards compatibility and would allow cheerio to switch without any code changes.

inikulin · 2016-08-31T08:17:32Z

@fb55 I believe you can accomplish this task by extending one of existing tree adapters: you can create new adapter via _.assign using htmlparser2 adapter as a proto and then override node construction methods and then pass it to parse5. Please, let me know if it works for you.

sentanos · 2017-02-19T01:38:20Z

Any updates on this? It seems like the blatant memory leaks from the existing parser are being given very little priority even though they can be extremely inconvenient.

inikulin · 2017-02-19T09:14:07Z

We need some bike shedding on #863 (comment) I would like to have some feedback on this topic before I start implementing anything.

fb55 · 2017-02-19T10:22:08Z

IMHO full parsing is the way to go, with a fallback in case that does not work for someone (whether that's another mode, htmlparser2 or something else should not matter – especially for XML this should be useful). Perhaps we should always default to fragment parsing for now and switch on a major version change, but otherwise this should work the same way as Whacko did.

stevenvachon · 2017-05-12T21:30:57Z

Is anything happening here? It is pretty tiring explaining to people that cheerio is problematic because of htmlparser2 and for most of my use cases, I cannot use cheerio.

inikulin · 2017-05-12T22:07:08Z

@stevenvachon Yeah, PR is issued and approved. It will be landed as RC at first and then in cheerio 1.0.0

mirague · 2018-03-13T13:08:14Z

How far along until the RC? :)

pencilcheck · 2019-06-18T17:49:48Z

Getting a lot of heap out of memory with parse5

* Use parse5 as a default parser (closes #863) * Use documents via $.load * Add test for #997 * Change options format

inikulin mentioned this issue May 19, 2016

LexicalTreeParser inikulin/parse5#132

Closed

deniak mentioned this issue Jul 11, 2016

whacko is no longer maintained. w3c/specberus#392

Closed

fb55 mentioned this issue Jul 20, 2016

Road to 1.0 #442

Closed

11 tasks

stevenvachon mentioned this issue Sep 8, 2016

Use whacko instead of cheerio Automattic/juice#191

Closed

fb55 mentioned this issue Oct 15, 2016

Apostrophe in HTML must be encoded with ''' instead of ''' fb55/entities#20

Closed

luanmuniz mentioned this issue Nov 16, 2016

out of memory when using cheerio in crawler #830

Closed

inikulin added a commit to inikulin/whacko that referenced this issue Feb 21, 2017

Use parse5 as a default parser (closes cheeriojs#863)

561038a

inikulin added a commit to inikulin/whacko that referenced this issue Feb 21, 2017

Use parse5 as a default parser (closes cheeriojs#863)

0ac3996

inikulin mentioned this issue Feb 25, 2017

Use parse5 as a default parser (closes #863) #985

Merged

inikulin added a commit to inikulin/whacko that referenced this issue Feb 25, 2017

Use parse5 as a default parser (closes cheeriojs#863)

4fa8547

fb55 pushed a commit that referenced this issue Feb 27, 2018

Use parse5 as a default parser (closes #863)

edcb24a

fb55 pushed a commit that referenced this issue Mar 9, 2018

Use parse5 as a default parser (closes #863)

c1b944e

fb55 closed this as completed in ca6963c Oct 3, 2018

luanmuniz mentioned this issue Jan 17, 2020

parse5 is about half the performance of htmlparser2 #1259

Closed

hansottowirtz mentioned this issue Jul 1, 2020

Support htmlparser2 Automattic/juice#352

Merged

fb55 pushed a commit that referenced this issue Sep 10, 2020

Use parse5 as a default parser (closes #863)

2d09404

fb55 pushed a commit that referenced this issue Dec 20, 2020

Use parse5 as a default parser (closes #863) (#985)

6e115ee

* Use parse5 as a default parser (closes #863) * Use documents via $.load * Add test for #997 * Change options format

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make it possible to use parse5 as optional(?) parser #863

Make it possible to use parse5 as optional(?) parser #863

inikulin commented May 11, 2016

fb55 commented May 12, 2016

inikulin commented May 12, 2016

dypsilon commented Jul 20, 2016

inikulin commented Jul 20, 2016

stevenvachon commented Aug 3, 2016

fb55 commented Aug 4, 2016 •

edited

Loading

stevenvachon commented Aug 4, 2016

fb55 commented Aug 4, 2016 via email

stevenvachon commented Aug 4, 2016

luanmuniz commented Aug 21, 2016

stevenvachon commented Aug 21, 2016

luanmuniz commented Aug 22, 2016

stevenvachon commented Aug 22, 2016 •

edited

Loading

luanmuniz commented Aug 22, 2016

fb55 commented Aug 23, 2016 •

edited

Loading

inikulin commented Aug 26, 2016 •

edited

Loading

jescalan commented Aug 26, 2016

fb55 commented Aug 27, 2016

inikulin commented Aug 27, 2016

fb55 commented Aug 31, 2016

inikulin commented Aug 31, 2016 •

edited

Loading

sentanos commented Feb 19, 2017

inikulin commented Feb 19, 2017

fb55 commented Feb 19, 2017 via email

stevenvachon commented May 12, 2017

inikulin commented May 12, 2017

mirague commented Mar 13, 2018

pencilcheck commented Jun 18, 2019

Make it possible to use parse5 as optional(?) parser #863

Make it possible to use parse5 as optional(?) parser #863

Comments

inikulin commented May 11, 2016

fb55 commented May 12, 2016

inikulin commented May 12, 2016

dypsilon commented Jul 20, 2016

inikulin commented Jul 20, 2016

stevenvachon commented Aug 3, 2016

fb55 commented Aug 4, 2016 • edited Loading

stevenvachon commented Aug 4, 2016

fb55 commented Aug 4, 2016 via email

stevenvachon commented Aug 4, 2016

luanmuniz commented Aug 21, 2016

stevenvachon commented Aug 21, 2016

luanmuniz commented Aug 22, 2016

stevenvachon commented Aug 22, 2016 • edited Loading

luanmuniz commented Aug 22, 2016

fb55 commented Aug 23, 2016 • edited Loading

inikulin commented Aug 26, 2016 • edited Loading

jescalan commented Aug 26, 2016

fb55 commented Aug 27, 2016

inikulin commented Aug 27, 2016

fb55 commented Aug 31, 2016

inikulin commented Aug 31, 2016 • edited Loading

sentanos commented Feb 19, 2017

inikulin commented Feb 19, 2017

fb55 commented Feb 19, 2017 via email

stevenvachon commented May 12, 2017

inikulin commented May 12, 2017

mirague commented Mar 13, 2018

pencilcheck commented Jun 18, 2019

fb55 commented Aug 4, 2016 •

edited

Loading

stevenvachon commented Aug 22, 2016 •

edited

Loading

fb55 commented Aug 23, 2016 •

edited

Loading

inikulin commented Aug 26, 2016 •

edited

Loading

inikulin commented Aug 31, 2016 •

edited

Loading