JRuby XML::Reader memory performance is poor #2224

akimd · 2021-04-23T09:53:22Z

Hi,

In the context of a Rails application, I have to process huge XML documents that are "flat". I mean, they could just have been CSV documents instead of XML, but the source provides only XML.

While it appears to work well in MRI, with JRuby the memory consumption is very high, and at some point the process gets stuck (out of memory).

The following stupid script mimics the problem I face:

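require 'nokogiri'
require 'pathname'

p = Pathname.new('big.xml')
n = 10_000_000
ping = -> (msg) { puts "#{Time.now}: #{msg}" }

# Generate a flat document with n identical entries.
p.open('w') { |f|
    f.puts "<foos>"
    n.times{ f.puts "  <foo>Hello World</foo>" }
    f.puts "</foos>"
}

ping['before']
c = 0
# Walk the document with the Reader, logging every millionth node.
Nokogiri::XML.Reader(p.open).each do |node|
    ping[c] if c % 1_000_000 == 0
    c += 1
end
ping['after']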

The documentation is somewhat ambiguous on how XML::Reader works. It is easy to understand "The Reader parser is good for when you need the speed of a SAX parser, but do not want to write a Document handler." as meaning "this is a SAX parser with a thin interface on top to make it easier than dealing with SAX yourself".

However, the first node returned by XML::Reader has the whole document as inner_xml, so I am wondering if XML::Reader is really SAX.

What we need in a document that looks like

<foos>
  <foo>...</foo>
  <foo>...</foo>
  <foo>...</foo>
  ...
  <foo>...</foo>
</foos>

is to iterate just on the entries. What is the recommendation in such a case?

Thanks a lot for Nokogiri


flavorjones · 2021-04-23T13:24:00Z

Hi @akimd, thanks for asking this question.

I want to spend a little bit of time understanding the memory performance of this example in JRuby -- based on your description, it sounds like perhaps there's a memory leak in the JRuby implementation that we might be able to fix.

The Reader class is based on libxml2's xmlreader module. Although libxml2 uses a SAX-ish parser at the heart of its implementation, the API is specialized, and it is optimized for the memory pattern of exposing only a "cursor" as it encounters each node.

The JRuby implementation does not have a low-level parser abstraction like libxml2's xmlreader, and so it's emulating that API ... I'm not very familiar with this particular corner of the JRuby implementation (it's had many hands in it over the years, none of them mine) but it looks like it's using the standard SAX parser provided by Xerces, plus some wrapper logic to present a Reader cursor.

I have some ideas on where the issue might be, and it's probably in the JRuby Reader wrapper. I will dig in and see if I can figure it out.

In the meantime, if you are willing to take on the additional complexity of writing SAX parser handlers, you should find the memory performance of the SAX parser acceptable.

akimd · 2021-04-23T15:18:56Z

Hi Mike,
Thanks a lot for the quick response. OK, so you confirm that with respect to resource consumption, XML::Reader is definitely expected to behave more like a nice and comfy SAX reader than a DOM one. That's reassuring. So that probably means that using something like inner_xml is asking for trouble, since it triggers parsing of the whole remainder of the file (we don't do that in the real case, it's just something I encountered when toying with the artificial example above).
Other team members are currently trying to address this issue using other parsers, but that causes other problems. I have no idea what the final choice will be, but we will be watching for changes in Nokogiri in this regard.

Thanks again!

flavorjones · 2021-04-23T17:23:13Z

> with respect to resource consumption, XML::Reader is definitely expected to behave more like a nice and comfy SAX reader than a DOM one

That's correct, to the best of my knowledge! If it's not doing that then we should fix it; or else I need to understand the low-level implementation of libxml2 better.

flavorjones · 2021-04-24T21:35:55Z

I would love some help with this from any of the folks who are familiar with the JRuby implementation.

akimd · 2021-04-30T07:23:06Z

Hi guys,
FWIW, we have fully converted our tool to using the SAX parser only.
Cheers!

flavorjones · 2021-05-07T10:57:06Z

For posterity: this isn't the first issue filed about the memory utilization of Reader in JRuby -- see also #1066.

flavorjones · 2021-05-07T11:12:05Z

See #831 for another instance when we did work to try to improve memory usage.

akimd added the meta/user-help label Apr 23, 2021

flavorjones added the platform/jruby label Apr 23, 2021

flavorjones changed the title ~~[help] Is XML::Reader closer to DOM or SAX parsing?~~ JRuby XML::Reader memory performance is poor Apr 23, 2021

flavorjones removed the meta/user-help label Apr 23, 2021

flavorjones added the help wanted label Apr 24, 2021

flavorjones mentioned this issue Jul 21, 2022

bug: v1.13.7 XML::Reader may segfault #2598

Closed

flavorjones mentioned this issue Jul 2, 2024

Nokogiri::XML::Reader for java keeps whole document in memory #1066

Closed
