Nokogiri Pain Points #14

yorickpeterse · 2014-04-03T20:12:17Z

When I started spreading the word about my work on Oga, various developers remarked that they were very happy to see a pure Ruby XML/HTML parser. I found this a bit surprising, as I had always assumed people were generally happy enough with Nokogiri (at least before they started shipping libxml). To be more specific, I've not come across a lot of negative articles/resources about Nokogiri.

As a result, I'll be using this issue to keep track of requests/suggestions/problems people currently have with Nokogiri and XML/HTML parsing in general. In particular, I'd like to know what people dislike about Nokogiri to see if I can whip together something for that.

In other words, if there's something about Nokogiri that absolutely pisses you off, please say so in a comment below.

yorickpeterse · 2014-04-03T20:16:06Z

My personal issues:

Nokogiri is very unstable on Rubinius; see "Bogus data being marked for GC sweeps under Rubinius" (sparklemotion/nokogiri#1047)
Nokogiri ships libxml and, unless you set an environment variable, compiles it upon gem installation. On EC2 this takes around 10 minutes or so.
Nokogiri is written in C and is a total pain to debug
Nokogiri caches a bunch of things (in particular CSS selectors) on class level and uses locks for this to make it "thread safe"
Nokogiri doesn't offer any sane APIs for parsing large HTML/XML documents. The pull parser only supports XML and the SAX API is a total train wreck
Nokogiri's documentation is limited and not very beginner friendly
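
To illustrate the class-level caching point above, here is a minimal sketch of the general pattern: one cache shared by the whole process, guarded by a single lock. `SelectorCache` and its methods are hypothetical names invented for this example; this is not Nokogiri's actual code, only an assumption about what such a design tends to look like.

```ruby
# Hypothetical illustration; SelectorCache is NOT part of Nokogiri.
class SelectorCache
  @cache = {}
  @lock  = Mutex.new

  class << self
    # Returns the cached value for +selector+, computing it with the given
    # block on a miss. Every call, hit or miss, goes through the same lock.
    def fetch(selector)
      @lock.synchronize do
        @cache[selector] ||= yield(selector)
      end
    end

    # Number of distinct selectors currently cached.
    def size
      @lock.synchronize { @cache.size }
    end
  end
end

# Several threads asking for the same selector all contend on one Mutex.
threads = 4.times.map do
  Thread.new do
    100.times { SelectorCache.fetch("div.item") { |s| s.upcase } }
  end
end
threads.each(&:join)
```

The cache stays consistent, but because the state lives on the class, every thread in the process serializes on that one lock, which is the cost the list item above is complaining about.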

slaught · 2014-04-03T20:58:44Z

+1 to this being a big problem: "Bogus data being marked for GC sweeps under Rubinius" (sparklemotion/nokogiri#1047)
Not fully thread safe
Bad abstractions over libxml.

postmodern · 2014-04-03T23:37:26Z

JRuby support; see "nokogiri-1.5.0-java inner_text is not respecting inner nodes" (sparklemotion/nokogiri#521)
The weak detection of CSS-path vs. XPath. I prefer how jQuery differentiates between CSS-path and XPath.
Cannot use XPath/CSS-path with SAX. It should be possible to convert XPath/CSS-path into state machines that match each node incrementally.
When parsing large and complex XML documents (5 MB and up), Nokogiri will create Ruby objects for every libxml2 node. I would prefer an opaque, lazy interface to the libxml2 document.
Nokogiri::HTML and libxml2's HTML support is not as forgiving of malformed HTML as Gecko or WebKit.
Nokogiri::XSLT, which requires libxslt1, should be a separate library.
Poor documentation of XPath selectors and supported functions. I usually end up searching Nokogiri's Google Groups or StackOverflow.
The security risks of vendoring libxml2.
The name; oga isn't much better. Please stop with the cryptic names.
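
The idea of matching paths incrementally against SAX-style events, raised in the list above, can be sketched as follows. This is a deliberately small illustration under stated assumptions: `PathMatcher` is a hypothetical class, it handles only absolute child-only paths such as `catalog/book/title`, and the events are fed in by hand rather than by a real parser. Full XPath (descendant axes, predicates, functions) would need a more general automaton.

```ruby
# Hypothetical sketch: matches an absolute, child-only path against a
# stream of start/end element events without building a document tree.
class PathMatcher
  attr_reader :count

  def initialize(path)
    @steps   = path.split("/")
    @depth   = 0 # current nesting depth in the document
    @matched = 0 # how many leading steps the open elements satisfy
    @count   = 0 # number of elements that matched the full path
  end

  def start_element(name)
    @depth += 1
    # Advance only if every ancestor matched and this name is the next step.
    if @matched == @depth - 1 && @steps[@matched] == name
      @matched += 1
      @count += 1 if @matched == @steps.length
    end
  end

  def end_element(_name)
    # Step back if the element being closed was part of the matched prefix.
    @matched -= 1 if @matched == @depth
    @depth -= 1
  end
end

# Hand-written event stream standing in for a SAX parser's callbacks.
events = [
  [:start, "catalog"],
  [:start, "book"], [:start, "title"], [:end, "title"],
  [:start, "price"], [:end, "price"], [:end, "book"],
  [:start, "magazine"], [:start, "title"], [:end, "title"], [:end, "magazine"],
  [:end, "catalog"]
]

matcher = PathMatcher.new("catalog/book/title")
events.each do |type, name|
  type == :start ? matcher.start_element(name) : matcher.end_element(name)
end
```

The matcher keeps only two integers of state per path, so memory use does not grow with document size, which is the appeal of this approach for large inputs.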

ryanstout · 2014-04-03T23:55:23Z

Seconded: the JRuby version behaves completely differently. Namespaces also don't work in JRuby.

denisdefreyne · 2014-04-20T12:45:57Z

Nokogiri on JRuby is unusably buggy. Take a look at these issues I reported:

All of these issues were uncovered by the nanoc test cases.

jrochkind · 2014-07-16T21:10:27Z

Difficulty/slowness of compilation is my biggest pain point for sure. In current versions it can take minutes to compile. And on some servers I still have to install various supporting libraries at the right versions to get a successful compile.

The next pain point would be Nokogiri's inconsistency in how it handles XML namespaces. The API works in different ways in different places.

It looks like Oga is still under development and not yet mature. I'm very interested and plan on keeping an eye on it. Nokogiri's support for CSS selectors (an idea it took from Hpricot) is super useful; not sure if you're planning on doing those too.

yorickpeterse · 2014-07-16T21:55:39Z

@jrochkind Oga will support both XPath expressions and CSS selectors (these would be compiled into their XPath equivalents and evaluated). Most of the current work is happening on the xpath branch. There indeed is still a lot of work to be done.
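
As a rough illustration of what compiling CSS selectors into XPath equivalents can look like, here is a minimal sketch. It is not Oga's compiler: `css_to_xpath` is a hypothetical helper that assumes only type selectors, `#id`, `.class`, and the descendant combinator (whitespace) need to be handled.

```ruby
# Hypothetical sketch, NOT Oga's implementation. Supports only type
# selectors, "#id", ".class", and the descendant combinator (whitespace).
def css_to_xpath(css)
  steps = css.strip.split(/\s+/).map do |compound|
    # A leading name is the element type; without one, match any element.
    tag = compound[/\A[a-zA-Z][\w-]*/] || "*"

    predicates = compound.scan(/([#.])([\w-]+)/).map do |kind, value|
      if kind == "#"
        "[@id=\"#{value}\"]"
      else
        # Whole-word match inside the space-separated class attribute.
        "[contains(concat(\" \", normalize-space(@class), \" \"), \" #{value} \")]"
      end
    end

    tag + predicates.join
  end

  # Whitespace in CSS means "any descendant", which is "//" in XPath.
  "//" + steps.join("//")
end

puts css_to_xpath("div#main p.note")
```

Once selectors are reduced to XPath strings like this, a single XPath evaluator can serve both query languages, which is presumably the motivation for the design described above.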

minad · 2014-10-01T11:14:04Z

@yorickpeterse I don't like that Nokogiri adds some stuff everywhere (doctype, CDATA in scripts, xmlns:lang attributes, ...). This is a pain point, especially in combination with HTML5. However, I like the Nokogiri API: for example, element[:attribute] is not supported by Oga, so there I have to write element.attribute(:attribute), which I don't like very much. I would suggest making the API more or less compatible. I am also missing CSS selectors.

yorickpeterse · 2014-10-01T11:48:34Z

@minad CSS selector support is a work in progress which I can hopefully release in 1-2 months. Progress is tracked at #11 and https://github.com/YorickPeterse/oga/tree/css

yorickpeterse · 2015-02-22T23:27:40Z

Closing this one as I've migrated most of these points to the Wiki page https://github.com/YorickPeterse/oga/wiki/Problems-with-Nokogiri.

yorickpeterse closed this as completed Feb 22, 2015
