JRuby XML::Reader memory performance is poor #2224

Open
akimd opened this issue Apr 23, 2021 · 7 comments

@akimd

akimd commented Apr 23, 2021

Hi,

In the context of a Rails application, I have to process huge XML documents that are "flat". I mean, they could just have been CSV documents instead of XML, but the source provides only XML.

While it appears to work well in MRI, with JRuby the memory consumption is very high, and at some point the process gets stuck (out of memory).

The following stupid script mimics the problem I face:

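require 'nokogiri'
require 'pathname'

path = Pathname.new('big.xml')  # renamed from `p` to avoid shadowing Kernel#p
n = 10_000_000
ping = ->(msg) { puts "#{Time.now}: #{msg}" }

# Generate a large, flat XML document.
path.open('w') do |f|
    f.puts "<foos>"
    n.times { f.puts "  <foo>Hello World</foo>" }
    f.puts "</foos>"
end

# Walk every node with the Reader, logging progress every million nodes.
ping['before']
c = 0
path.open do |io|
    Nokogiri::XML.Reader(io).each do |node|
        ping[c] if c % 1_000_000 == 0
        c += 1
    end
end
ping['after']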

The documentation is somewhat ambiguous on how XML::Reader works. It is easy to understand "The Reader parser is good for when you need the speed of a SAX parser, but do not want to write a Document handler." as meaning "this is a SAX parser with a thin interface on top to make it easier than dealing with SAX yourself".

However, the first node returned by XML::Reader has the whole document as inner_xml, so I am wondering if XML::Reader is really SAX.

What we need in a document that looks like

<foos>
  <foo>...</foo>
  <foo>...</foo>
  <foo>...</foo>
  ...
  <foo>...</foo>
</foos>

is to iterate just on the entries. What is the recommendation in such a case?

Thanks a lot for Nokogiri

@flavorjones
Member

Hi @akimd, thanks for asking this question.

I want to spend a little bit of time understanding the memory performance of this example in JRuby -- based on your description, it sounds like perhaps there's a memory leak in the JRuby implementation that we might be able to fix.

The Reader class is based on libxml2's xmlreader module. Although libxml2 uses a SAX-ish parser at the heart of its implementation, the API is specialized, and it is optimized for the memory pattern of exposing only a "cursor" as it encounters each node.

The JRuby implementation does not have a low-level parser abstraction like libxml2's xmlreader, and so it's emulating that API ... I'm not very familiar with this particular corner of the JRuby implementation (it's had many hands in it over the years, none of them mine) but it looks like it's using the standard SAX parser provided by Xerces, plus some wrapper logic to present a Reader cursor.

I have some ideas on where the issue might be, and it's probably in the JRuby Reader wrapper. I will dig in and see if I can figure it out.

In the meantime, if you are willing to take on the additional complexity of writing SAX parser handlers, you should find the memory performance of the SAX parser acceptable.

@flavorjones flavorjones changed the title from "[help] Is XML::Reader closer to DOM or SAX parsing?" to "JRuby XML::Reader memory performance is poor" on Apr 23, 2021
@akimd
Author

akimd commented Apr 23, 2021

Hi Mike,
Thanks a lot for the quick response. OK, so you confirm that with respect to resource consumption, XML::Reader is definitely expected to behave more like a nice and comfy SAX reader than a DOM one. That's reassuring. So that probably means that using something like inner_xml is asking for trouble, since it triggers parsing of the whole remainder of the file (we don't do that in the real case; it's just something I encountered when toying with the artificial example above).
Other team members are currently trying to address this issue using other parsers, but that causes other problems. I have no idea what the final choice will be, but we will be watching for changes in Nokogiri in this regard.

Thanks again!

@flavorjones
Member

flavorjones commented Apr 23, 2021

with respect to resource consumption, XML::Reader is definitely expected to behave more like a nice and comfy SAX reader than a DOM one

That's correct, to the best of my knowledge! If it's not doing that then we should fix it; or else I need to understand the low-level implementation of libxml2 better.

@flavorjones
Member

I would love some help with this from any of the folks who are familiar with the JRuby implementation.

@akimd
Author

akimd commented Apr 30, 2021

Hi guys,
FWIW, we have fully converted our tool to using the SAX parser only.
Cheers!

@flavorjones
Member

For posterity: this isn't the first issue filed about the memory utilization of Reader in JRuby -- see also #1066.

@flavorjones
Member

See #831 for another instance when we did work to try to improve memory usage.
