Nokogiri Pain Points #14

Closed
yorickpeterse opened this issue Apr 3, 2014 · 10 comments

Comments

@yorickpeterse
Owner

When I started spreading the word about my work on Oga, various developers remarked that they were very happy to see a pure Ruby XML/HTML parser. I found this a bit surprising, as I had always assumed people were generally happy enough with Nokogiri (at least before it started shipping libxml). To be more specific, I've not come across many negative articles or resources about Nokogiri.

As a result, I'll be using this issue to keep track of the requests, suggestions, and problems people currently have with Nokogiri and with XML/HTML parsing in general. In particular, I'd like to know what people dislike about Nokogiri so I can see whether Oga can address it.

In other words, if there's something about Nokogiri that absolutely pisses you off, please say so in a comment below.

@yorickpeterse
Owner Author

My personal issues:

  • Nokogiri is very unstable on Rubinius; see sparklemotion/nokogiri#1047 ("Bogus data being marked for GC sweeps under Rubinius")
  • Nokogiri ships libxml and, unless you set an environment variable, compiles it upon gem installation. On EC2 this takes around 10 minutes.
  • Nokogiri is written in C and is a total pain to debug
  • Nokogiri caches a bunch of things (in particular CSS selectors) at the class level and uses locks around them to make this "thread safe"
  • Nokogiri doesn't offer any sane APIs for parsing large HTML/XML documents. The pull parser only supports XML, and the SAX API is a total train wreck
  • Nokogiri's documentation is limited and not very beginner friendly
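
To make the class-level caching point concrete, here is a minimal sketch of the general pattern: one cache shared by the whole class, with a single lock around it. This is not Nokogiri's actual code; `SelectorCache` and the string it "compiles" are hypothetical names used only for illustration.

```ruby
# Hypothetical illustration of a class-level cache guarded by one lock.
# SelectorCache is a made-up name; this is not taken from Nokogiri.
class SelectorCache
  @cache = {}
  @lock  = Mutex.new

  class << self
    # Returns the cached value for +selector+, computing it at most once
    # by yielding the selector to the block.
    def fetch(selector)
      @lock.synchronize do
        @cache[selector] ||= yield(selector)
      end
    end
  end
end

# Several threads asking for the same selector all queue on the same lock.
threads = 4.times.map do
  Thread.new do
    SelectorCache.fetch('div.item') { |sel| "compiled(#{sel})" }
  end
end
threads.each(&:join)
```

The cache stays consistent, but every lookup from every thread goes through the same mutex, which is the trade-off the bullet above complains about.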

@slaught

slaught commented Apr 3, 2014

@postmodern

  • JRuby support; see sparklemotion/nokogiri#521 ("nokogiri-1.5.0-java inner_text is not respecting inner nodes")
  • The weak detection of CSS-path vs. XPath. I prefer how jQuery differentiates between CSS-path and XPath.
  • Cannot use XPath/CSS-path with SAX. It should be possible to convert XPath/CSS-path into state machines that match each node incrementally.
  • When parsing large and complex XML documents (5 MB and up), Nokogiri creates Ruby objects for every libxml2 node. I would prefer an opaque, lazy interface to the libxml2 document.
  • Nokogiri::HTML and libxml2's HTML support is not as forgiving of malformed HTML as Gecko or WebKit.
  • Nokogiri::XSLT, which requires libxslt1, should be a separate library.
  • Poor documentation of XPath selectors and supported functions. I usually end up searching Nokogiri's Google Group or Stack Overflow.
  • The security risks of vendoring libxml2.
  • The name; oga isn't much better. Please stop with the cryptic names.
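
The idea of matching a path incrementally against SAX events can be sketched roughly as follows. This is only an illustration under strong assumptions: it handles absolute, child-axis-only paths such as `/catalog/book/title` (no `//`, no predicates, no namespaces), and `PathMatcher` is a hypothetical class, not part of Nokogiri or Oga. The state is simply the number of leading path steps matched so far.

```ruby
# Sketch: match a simple absolute path against start/end element events
# without building a document tree. PathMatcher is a hypothetical name.
class PathMatcher
  attr_reader :matches

  def initialize(path)
    @steps   = path.split('/').reject(&:empty?)
    @depth   = 0  # how many elements are currently open
    @matched = 0  # state: how many leading path steps are matched
    @matches = 0  # how many elements matched the full path
  end

  # Feed this from the parser when an element opens.
  def start_element(name)
    @depth += 1
    # Advance only if every ancestor matched and this name is the next step.
    if @matched == @depth - 1 && @steps[@matched] == name
      @matched += 1
      @matches += 1 if @matched == @steps.size
    end
  end

  # Feed this from the parser when an element closes.
  def end_element(_name)
    # Step back if the element being closed was part of the matched prefix.
    @matched -= 1 if @matched == @depth
    @depth -= 1
  end
end

# Events for <catalog><book><title/></book><note><title/></note></catalog>
matcher = PathMatcher.new('/catalog/book/title')
events = [
  [:start, 'catalog'], [:start, 'book'], [:start, 'title'],
  [:end, 'title'], [:end, 'book'], [:start, 'note'],
  [:start, 'title'], [:end, 'title'], [:end, 'note'], [:end, 'catalog']
]
events.each do |type, name|
  type == :start ? matcher.start_element(name) : matcher.end_element(name)
end
```

With Nokogiri, the two methods could be called from the `start_element` and `end_element` callbacks of a `Nokogiri::XML::SAX::Document` subclass; supporting the full XPath or CSS grammar would need a considerably richer automaton than this counter.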

@ryanstout

Seconded: the JRuby version behaves completely differently. Namespaces also don't work in JRuby.

@jrochkind

Difficulty/slowness of compilation is my biggest pain point for sure. In current variations it can take minutes to compile. And on some servers I still have to install various supporting libraries at the right versions to get a successful compile.

The next pain point would be Nokogiri's inconsistent handling of XML namespaces: the API works in different ways in different places.

It looks like Oga is still under development and not yet mature. I'm very interested and plan on keeping an eye on it. Nokogiri's support for CSS selectors (an idea it took from Hpricot) is super useful; not sure if you're planning on doing those too.

@yorickpeterse
Owner Author

@jrochkind Oga will support both XPath expressions and CSS selectors (these would be compiled into their XPath equivalents and evaluated). Most of the current work is happening on the xpath branch. There indeed is still a lot of work to be done.
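
As a rough illustration of what compiling CSS selectors into XPath can look like, here is a minimal sketch covering only type selectors, `#id`, `.class`, and the descendant combinator. It is not Oga's implementation; `css_to_xpath` is a hypothetical helper, and a real compiler needs a proper parser for the full selector grammar.

```ruby
# Sketch: translate a tiny subset of CSS into an XPath expression.
# Supported: type selectors, "#id", ".class", descendant combinator (space).
# css_to_xpath is a hypothetical helper, not part of Oga or Nokogiri.
def css_to_xpath(css)
  steps = css.strip.split(/\s+/).map do |compound|
    # Leading tag name, or "*" when the compound starts with "#" or ".".
    tag = compound[/\A[A-Za-z][\w-]*/] || '*'

    predicates = compound.scan(/([#.])([\w-]+)/).map do |kind, value|
      if kind == '#'
        "[@id='#{value}']"
      else
        # Match one token inside a space-separated class attribute.
        "[contains(concat(' ', normalize-space(@class), ' '), ' #{value} ')]"
      end
    end

    tag + predicates.join
  end

  '//' + steps.join('//')
end

css_to_xpath('ul li')   # descendant combinator becomes "//"
css_to_xpath('#main')   # id selector becomes an @id predicate
```

The class predicate uses the usual `concat`/`normalize-space` idiom so that `.item` does not accidentally match a class such as `items`.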

@minad

minad commented Oct 1, 2014

@yorickpeterse I don't like that Nokogiri adds stuff everywhere (doctype, CDATA in scripts, xmlns:lang attributes, ...). This is a pain point especially in combination with HTML5. However, I like the Nokogiri API: for example element[:attribute], which is not supported by Oga; there I have to write element.attribute(:attribute), which I don't like very much. I would suggest making the API more or less compatible. I am also missing CSS selectors.
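
One way to get the bracket style on top of an `attribute(name)` style API is a small mixin, sketched below. Everything here is hypothetical: `BracketAccess`, `FakeElement`, and `FakeAttribute` are stand-ins written for this example, and the sketch assumes that `attribute(name)` returns either nil or an object responding to `value`, which the comment above does not confirm for Oga.

```ruby
# Hypothetical mixin: adds element[:name] to any object that responds to
# attribute(name) and returns nil or an object with a #value method.
module BracketAccess
  def [](name)
    found = attribute(name)
    found && found.value
  end
end

# Stand-in classes so the sketch runs on its own; not Oga's classes.
class FakeAttribute
  attr_reader :name, :value

  def initialize(name, value)
    @name  = name
    @value = value
  end
end

class FakeElement
  include BracketAccess

  def initialize(attributes)
    @attributes = attributes.map { |name, value| FakeAttribute.new(name.to_s, value) }
  end

  # Mimics the element.attribute(:attribute) style of lookup.
  def attribute(name)
    @attributes.find { |attribute| attribute.name == name.to_s }
  end
end

link = FakeElement.new(href: '/about', class: 'nav')
link[:href]              # bracket style
link.attribute(:href)    # object style, returns the attribute itself
```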

@yorickpeterse
Owner Author

@minad CSS selector support is a work in progress which I can hopefully release in 1-2 months. Progress is tracked at #11 and https://github.com/YorickPeterse/oga/tree/css

@yorickpeterse
Owner Author

Closing this one as I've migrated most of these points to the Wiki page https://github.com/YorickPeterse/oga/wiki/Problems-with-Nokogiri.
