Jruby nokogiri 1.9.0+, rehomed elements don't keep namespaces, plus regression from 1.8 #1875

jrochkind · 2019-02-25T18:57:03Z

This is a followup to #1774. I'm sorry it took me a while to get back around to it.

While a fix to #1774 was provided in #1778, the fix only works for a default namespace, not one with a prefix. (It hadn't occured to me in my repro for #1774 that I should provide both, as they may be have different codepaths in java nokogiri I guess? )

What is worse, there is a regression in 1.9.0+, possibly introduced by #1778, but I haven't verified this.

I used to be able to work around #1774, by explicitly calling add_namespace to add the namespaces I wanted on the rehomed element.

However, my workaround that worked in JRuby 1.8.5 stopped working in JRuby 1.9.0+, so I'm kind of stymied and maybe even worse off now. :(

Reproductions are all in MRI 2.5 and JRuby 9.2.0.0, but I do not believe specific MRI/JRuby versions matter.

xml_str = <<~EOS
<?xml version="1.0" encoding="UTF-8"?>
<top xmlns="http://example.org/top" xmlns:a="http://example.org/a">
<record>
<a:something>a:something</a:something>
</record>
</top>
EOS

noko_doc = Nokogiri::XML.parse(xml_str)

record_element = noko_doc.xpath("//top:record", {top: "http://example.org/top"}).first

new_doc = Nokogiri::XML::Document.new
new_doc.root = record_element

puts new_doc.namespaces

# JRuby nokogiri 1.8.5:
#    {}

# JRuby nokogiri 1.9.0+: default ns is there, but the `a` namespace did not transfer
#    {"xmlns"=>"http://example.org/top"}

# MRI 2.5, both nokogiri versions
# (and I consider this the correct and most useful result):
#    {"xmlns"=>"http://example.org/top", "xmlns:a"=>"http://example.org/a", "xmlns:b"=>"http://example.org/b"}


# Workaround I used to use in Jruby nokogiri 1.8.0, which now fails in Jruby nokogiri 1.9.0
new_doc.root.add_namespace("a", "http://example.org/a")

puts "\nafter add_namespace:\n"
puts new_doc.namespaces.inspect
puts new_doc.to_xml

# JRuby nokogiri 1.8.5:
# the one we added was there:
# {"xmlns:a"=>"http://example.org/a"}
#
# And is included in the output:
# <?xml version="1.0"?>
# <record xmlns:a="http://example.org/a">
#   <a:something>a:something</a:something>
# </record>

# JRuby nokogiri 1.9.0+:
# The one we added isn't there at all, and is not included in the output either:
# {"xmlns"=>"http://example.org/top"}  # new namespace we added had no effect on root element
#
# And the "a" namespace is magically on the <a:something> tag, but we wanted to add it on
# the _root_
# <?xml version="1.0"?>
# <record xmlns="http://example.org/top">
#   <a:something xmlns:a="http://example.org/a">a:something</a:something>
# </record>
# 
# 
# MRI does not require the "add_namespace" workaround, and still outputs what
# I consider correct and desirable XML:
# <?xml version="1.0"?>
# <record xmlns="http://example.org/top" xmlns:a="http://example.org/a">
#    <a:something>a:something</a:something>
# </record>

The text was updated successfully, but these errors were encountered:

jvshahid · 2019-02-26T05:14:44Z

This is a followup to #1774. I'm sorry it took me a while to get back around to it.

While a fix to #1774 was provided in #1778, the fix only works for a default namespace, not one with a prefix. (It hadn't occured to me in my repro for #1774 that I should provide both, as they may be have different codepaths in java nokogiri I guess? )

AFAIU the JRuby implementation is conforming to the standard. See my explanation inlined below.

What is worse, there is a regression in 1.9.0+, possibly introduced by #1778, but I haven't verified this.

I used to be able to work around #1774, by explicitly calling add_namespace to add the namespaces I wanted on the rehomed element.

I am not sure what is going on here. In particular, I don't remember what add_namespace does. I will have to look into this later.

Reproductions are all in MRI 2.5 and JRuby 9.2.0.0, but I do not believe specific MRI/JRuby versions matter.

xml_str = <<~EOS
<?xml version="1.0" encoding="UTF-8"?>
<top xmlns="http://example.org/top" xmlns:a="http://example.org/a">
<record>
<a:something>a:something</a:something>
</record>
</top>
EOS

noko_doc = Nokogiri::XML.parse(xml_str)

record_element = noko_doc.xpath("//top:record", {top: "http://example.org/top"}).first

new_doc = Nokogiri::XML::Document.new
new_doc.root = record_element

puts new_doc.namespaces

# JRuby nokogiri 1.8.5:
#    {}

# JRuby nokogiri 1.9.0+: default ns is there, but the `a` namespace did not transfer
#    {"xmlns"=>"http://example.org/top"}

This is expected. XML::Node::namespaces returns the namespaces that are in scope of the current node. As pointed out later, JRuby puts the namespace declaration on the something element. AFAIU both MRI and JRuby implementation are correct. I am also curious about what @flavorjones think regarding this difference.

# MRI 2.5, both nokogiri versions
# (and I consider this the correct and most useful result):
#    {"xmlns"=>"http://example.org/top", "xmlns:a"=>"http://example.org/a", "xmlns:b"=>"http://example.org/b"}

I don't understand where http://example.org/b is coming from. Is that a copy paste error ?

# And the "a" namespace is magically on the <a:something> tag, but we wanted to add it on
# the _root_
# <?xml version="1.0"?>
# <record xmlns="http://example.org/top">
#   <a:something xmlns:a="http://example.org/a">a:something</a:something>
# </record>
```

It isn't magical. The namespaces that are in scope of something are the default namespace and the xmlns:a namespace. For example puts new_doc.root.children[1].namespaces.inspect (children[0] is a text node) prints both namespaces:

{"xmlns:a"=>"http://example.org/a", "xmlns"=>"http://example.org/top"}

jrochkind · 2019-02-26T12:38:15Z

Thanks for response!

Are you suggesting that MRI and Jruby are both correct, even though they don't match? I don't understand how this can be, I thought MRI and JRuby nokogiri implementations could be expected to match. To me, the MRI implementation is the most useful, as well as seeming correct. The JRuby implementation has changed between nokogiri 1.8 and 1.9, and neither of them match MRI. (Are you suggesting that MRI ought to match... JRuby 1.9, even though it never has?)

What I need to do is remove a sub-tree (a node and it's children) from a document, and make it into it's own document -- with all namespaces semantically intact. In MRI nokogiri, I can do this. In Java nokogiri 1.8, I could do it via a workaround (I considered it a workaround to a bug). In Java nokogiri 1.9+, I can find no way to do this at all. (Short of serializing to a string and re-parsing). That doesn't seem right to me.

I don't understand where http://example.org/b is coming from. Is that a copy paste error ?

It was, copy-pasted from the wrong example, as I was messing with different examples to try and isolate the error. (Not sure if it's more or less confusing for me to go edit my original text at this point?)

jvshahid · 2019-02-27T12:41:23Z

Are you suggesting that MRI and Jruby are both correct, even though they don't match ?

Yes. They are equivalent, AFAIK.

I don't understand how this can be, I thought MRI and JRuby nokogiri implementations could be expected to match.

Not necessarily, specially if the output XML is equivalent, such as this case.

To me, the MRI implementation is the most useful, as well as seeming
correct.

I don't think I understand how the MRI output is more useful. I think this is worth clarifying.

What I need to do is remove a sub-tree (a node and it's children) from a
document, and make it into it's own document -- with all namespaces semantically intact.

Isn't that exactly what the current version of Nokogiri does ? JRuby is adding the namespace declaration to the node itself, instead of adding the namesapce declaration on the parent node. Both implementations are correct 1.

[...]

In Java nokogiri 1.9+, I can find no way to do this at all. (Short of serializing to a string and re-parsing). That doesn't seem right to me.

Can you provide an example code to demonstrate how the JRuby's output is problematic. Is the consumer of the output assuming that doc.namespaces will return all namespaces in the document ?

jrochkind · 2019-02-27T14:24:03Z

Basically I want to write code that uses namespaces that will work under MRI or JRuby, and write tests that will pass under either. I want equivalent behavior. Which I thought was the contract of nokogiri.

This includes add_namespace having the same effect in either, namespaces returning the same thing in either, and produced serialization from the same operations on the same input being equivalent in both.

But I see your point about the XML trees being semantically equivalent in both. (Note: semantically equivalent to a namespace-aware processor; probably not to a non-namespace-aware-processor that thinks "a:something" is just a tag name and "xmlns:a" just an attribute name.) I will spend some more with it and satisfy myself that this is true; it still wouldn't totally satisfy me, but I see your point.

It seems likely to me that there are cases where the JRuby noko 1.9 version would produce a serialization that is a lot more bytes than the MRI version. If there are a lot of a:something elements, which in the JRuby version each get an xmlns declaration serialized on that element.

Which suggests another point: Let's say I have an in-memory representation that matches:

<record>
    <a:something xmlns:a="http://example.org/a">a:something</a:something>
    <a:something xmlns:a="http://example.org/a">a:something</a:something>
</record>

If I want to "hoist" that xmlns:a= decleration to be on the record element instead, so that when it is serialized it's on the record element, and not present on the a:something elements -- should nokogiri give me API to do that?

I don't think nokogiri gives me API to do that. The nokogiri API doesn't let you alter namespace declarations much, if at all. At least in Jruby nokogiri, not totally sure about MRI nokogiri (if they differ, that goes back to the point about MRI/nokogiri divergence).

…ent elements than MRI, when we re-home elements into new documents for `each_record_xpath`. Test accordingly so tests pass. Documents, whether in-memory or once serialized, should still be semantically identical to a namespace-aware processor, and that's what we try to test for. See: sparklemotion/nokogiri#1875

…ent elements than MRI, when we re-home elements into new documents for `each_record_xpath`. Test accordingly so tests pass. Documents, whether in-memory or once serialized, should still be semantically identical to a namespace-aware processor, and that's what we try to test for. See: sparklemotion/nokogiri#1875 This commits passes tests in MRI nokogiri, as well as JRuby nokogiri 1.8.6 through 1.10.x

…ent elements than MRI, when we re-home elements into new documents for `each_record_xpath`. Test accordingly so tests pass. Documents, whether in-memory or once serialized, should still be semantically identical to a namespace-aware processor, and that's what we try to test for. See: sparklemotion/nokogiri#1875 This commits passes tests in MRI nokogiri, as well as JRuby nokogiri 1.8.5 through 1.10.x

jrochkind · 2019-02-27T19:20:09Z

Thanks for your analysis @jvshahid !

I see how the documents are semantically identical. It could be (and was!) a lot worse.

I still think ideally:

Jruby behavior should match MRI. So test or other code written under MRI can pass in JRuby without changes, and JRuby nokogiri can be considered entirely compat with MRI.
Failing that (or for other reasons), there ought to be a way to use Nokogiri API on an in-memory representation to change what elements xlmns declarations appear on. I don't believe there is any such API (def on Jruby, I tried lots of things, but I believe in MRI too). xmlns namespace declerations (and on what element they will be serialized when writing out) are essentially immutable.

However, in the current implementation, I've changed my code to test for "namespace-aware semantic equivalence", where it was failing because serializations were not identical.

(In particular, if you take an existing document, and do node.to_xml, vs re-homing that node to new_document.root = node and doing a to_xml, you may see xmlns attributes on different nodes in those two different serializations, in Jruby).

Here is the pretty hacky method I wrote to test "namespace-aware semantic equivalence" of two different nokogiri elements, in case it's useful to anyone.

  def ns_semantic_equivalent_xml?(noko_a, noko_b)
    noko_a = noko_a.root if noko_a.kind_of?(Nokogiri::XML::Document)
    noko_b = noko_b.root if noko_b.kind_of?(Nokogiri::XML::Document)

    noko_a.name == noko_b.name &&
      noko_a.namespace&.prefix == noko_b.namespace&.prefix &&
      noko_a.namespace&.href   == noko_b.namespace&.href &&
      noko_a.attributes        == noko_b.attributes &&
      noko_a.children.length   == noko_b.children.length &&
      noko_a.children.each_with_index.all? do |a_child, index|
        ns_semantic_equivalent_xml?(a_child, noko_b.children[index])
      end
  end

flavorjones · 2021-11-15T15:39:49Z

Hi, sorry for not responding before now on this thread.

Jruby behavior should match MRI

Maybe in an ideal world; but the JRuby implementation is a completely separate implementation from the CRuby backend and the benefit in this case is low relative to the cost. The "correct" implementation could be argued either way, and I'm unaware of any XML spec that covers reparenting nodes into new documents.

I would welcome a pull request that cleaned this up, but otherwise I will not personally prioritize it. I'm sorry your experience isn't seamless across these implementations.

flavorjones · 2021-11-15T15:40:57Z

For posterity, the CRuby behavior is to create a doc that looks like:

<?xml version="1.0"?>
<record xmlns="http://example.org/top" xmlns:a="http://example.org/a">
<a:something>a:something</a:something>
</record>

and the JRuby behavior is:

<?xml version="1.0"?>
<record xmlns="http://example.org/top">
<a:something xmlns:a="http://example.org/a">a:something</a:something>
</record>

jrochkind · 2021-11-15T19:11:55Z

Thanks. If anyone can figure out any invocation to get the JRuby serialized doc to look like the CRuby one, like any methods I can call to make that happen, please do share! I would love to be able to create a doc by mutating existing documents, in Jruby, that would serialize the way the Cruby one does -- I actually used to have a workaround to do it prior to 1.9.0, but now can't figure out any way to do it at all, which is a problem. :(

jrochkind mentioned this issue Feb 25, 2019

Jruby builds are failing due to nokogiri 1.9.0+ namespace regression traject/traject#203

Closed

jvshahid added platform/jruby topic/namespaces labels Feb 25, 2019

jrochkind mentioned this issue Feb 27, 2019

Fix Nokogiri reader for Jruby nokogiri >= 1.9.0 traject/traject#209

Merged

flavorjones closed this as completed Nov 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jruby nokogiri 1.9.0+, rehomed elements don't keep namespaces, plus regression from 1.8 #1875

Jruby nokogiri 1.9.0+, rehomed elements don't keep namespaces, plus regression from 1.8 #1875

jrochkind commented Feb 25, 2019 •

edited

Loading

jvshahid commented Feb 26, 2019

jrochkind commented Feb 26, 2019

jvshahid commented Feb 27, 2019

jrochkind commented Feb 27, 2019 •

edited

Loading

jrochkind commented Feb 27, 2019

flavorjones commented Nov 15, 2021

flavorjones commented Nov 15, 2021

jrochkind commented Nov 15, 2021

Jruby nokogiri 1.9.0+, rehomed elements don't keep namespaces, plus regression from 1.8 #1875

Jruby nokogiri 1.9.0+, rehomed elements don't keep namespaces, plus regression from 1.8 #1875

Comments

jrochkind commented Feb 25, 2019 • edited Loading

jvshahid commented Feb 26, 2019

jrochkind commented Feb 26, 2019

jvshahid commented Feb 27, 2019

jrochkind commented Feb 27, 2019 • edited Loading

jrochkind commented Feb 27, 2019

flavorjones commented Nov 15, 2021

flavorjones commented Nov 15, 2021

jrochkind commented Nov 15, 2021

jrochkind commented Feb 25, 2019 •

edited

Loading

jrochkind commented Feb 27, 2019 •

edited

Loading