[help] Nokogiri xpath without root element #2091

Closed
sean-yeoh opened this issue Oct 6, 2020 · 4 comments
Comments

@sean-yeoh

sean-yeoh commented Oct 6, 2020

Hi guys, I'm not too sure where to ask for help so I'm trying my luck here. Please feel free to close this and direct me to a proper channel if there's one. Thanks!

To Reproduce
I'm trying to query the XML partial string below, stored in the variable text. It's in Open XML format. When I query it without adding a root element, I get the Undefined namespace prefix error.

text = "<w:p>"\
          "<w:r>"\
            "<w:t xml:space=\"preserve\"></w:t>"\
          "</w:r>"\
          "<w:hyperlink r:id=\"https://facebook.com\" w:history=\"1\">"\
            "<w:r>"\
              "<w:rPr>"\
                "<w:rStyle w:val=\"Hyperlink\"/>"\
              "</w:rPr>"\
              "<w:t>Facebook</w:t>"\
            "</w:r>"\
          "</w:hyperlink>"\
          "<w:r>"\
            "<w:t xml:space=\"preserve\"></w:t>"\
          "</w:r>"\
        "</w:p>"

xml = Nokogiri::XML(text)
xml.xpath('//w:hyperlink') # => Nokogiri::XML::XPath::SyntaxError: ERROR: Undefined namespace prefix: //w:hyperlink

I then tried adding a root element and adding the namespaces manually. It no longer raises an error, but it returns an empty array.

require 'nokogiri'

text = "<w:p>"\
          "<w:r>"\
            "<w:t xml:space=\"preserve\"></w:t>"\
          "</w:r>"\
          "<w:hyperlink r:id=\"https://facebook.com\" w:history=\"1\">"\
            "<w:r>"\
              "<w:rPr>"\
                "<w:rStyle w:val=\"Hyperlink\"/>"\
              "</w:rPr>"\
              "<w:t>Facebook</w:t>"\
            "</w:r>"\
          "</w:hyperlink>"\
          "<w:r>"\
            "<w:t xml:space=\"preserve\"></w:t>"\
          "</w:r>"\
        "</w:p>"

xml_namespaces = [
  { aink: 'http://schemas.microsoft.com/office/drawing/2016/ink' },
  { am3d: 'http://schemas.microsoft.com/office/drawing/2017/model3d' },
  { cx: 'http://schemas.microsoft.com/office/drawing/2014/chartex' },
  { cx1: 'http://schemas.microsoft.com/office/drawing/2015/9/8/chartex' },
  { cx2: 'http://schemas.microsoft.com/office/drawing/2015/10/21/chartex' },
  { cx3: 'http://schemas.microsoft.com/office/drawing/2016/5/9/chartex' },
  { cx4: 'http://schemas.microsoft.com/office/drawing/2016/5/10/chartex' },
  { cx5: 'http://schemas.microsoft.com/office/drawing/2016/5/11/chartex' },
  { cx6: 'http://schemas.microsoft.com/office/drawing/2016/5/12/chartex' },
  { cx7: 'http://schemas.microsoft.com/office/drawing/2016/5/13/chartex' },
  { cx8: 'http://schemas.microsoft.com/office/drawing/2016/5/14/chartex' },
  { m: 'http://schemas.openxmlformats.org/officeDocument/2006/math' },
  { mc: 'http://schemas.openxmlformats.org/markup-compatibility/2006' },
  { o: 'urn:schemas-microsoft-com:office:office' },
  { r: 'http://schemas.openxmlformats.org/officeDocument/2006/relationships' },
  { v: 'urn:schemas-microsoft-com:vml' },
  { w: 'http://schemas.openxmlformats.org/wordprocessingml/2006/main' },
  { w10: 'urn:schemas-microsoft-com:office:word' },
  { w14: 'http://schemas.microsoft.com/office/word/2010/wordml' },
  { w15: 'http://schemas.microsoft.com/office/word/2012/wordml' },
  { w16: 'http://schemas.microsoft.com/office/word/2018/wordml' },
  { w16cex: 'http://schemas.microsoft.com/office/word/2018/wordml/cex' },
  { w16cid: 'http://schemas.microsoft.com/office/word/2016/wordml/cid' },
  { w16se: 'http://schemas.microsoft.com/office/word/2015/wordml/symex' },
  { wne: 'http://schemas.microsoft.com/office/word/2006/wordml' },
  { wp: 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' },
  { wp14: 'http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing' },
  { wpc: 'http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas' },
  { wpg: 'http://schemas.microsoft.com/office/word/2010/wordprocessingGroup' },
  { wpi: 'http://schemas.microsoft.com/office/word/2010/wordprocessingInk' },
  { wps: 'http://schemas.microsoft.com/office/word/2010/wordprocessingShape' }
]

xml = Nokogiri::XML("<w:root>#{text}</w:root>")
xml_namespaces.each do |namespace|
  xml.root.add_namespace_definition(namespace.keys[0].to_s, namespace.values[0].to_s) 
end

xml.xpath('//w:hyperlink') # => []

I would like to find all <w:hyperlink> nodes within an XML partial.

Expected behavior
It should return the <w:hyperlink> node.

Environment

Please paste the output from nokogiri -v here, escaped by triple-backtick.

# Nokogiri (1.10.10)
    ---
    warnings: []
    nokogiri: 1.10.10
    ruby:
      version: 2.4.1
      platform: x86_64-linux
      description: ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/sean/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
      libxslt_path: "/home/sean/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      libxslt_patches: []
      compiled: 2.9.10
      loaded: 2.9.10

This output will tell us what version of Ruby you're using, how you installed nokogiri, what versions of the underlying libraries you're using, and what operating system you're using.

Additional context

@sean-yeoh sean-yeoh added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Oct 6, 2020
@flavorjones
Member

👋 Hi @sean-yeoh, sorry you're having trouble. I'll try to help!

OK, so I'm not completely sure about the context in which you're working, so I'm going to explain what's going wrong, and then suggest a couple of ways to work around it.

First, the query

You're correct in that an XPath search in a namespaced document needs to provide the namespaces as a hash in the query call. This is covered pretty well in https://nokogiri.org/tutorials/searching_a_xml_html_document.html so I won't go into it here.

What's going wrong

Let's take a simplified example:

xml = "<w:p></w:p>"

doc = Nokogiri::XML(xml)
doc.xpath('//w:p', "w" => "https://foo.com/bar").inspect # => "[]"
doc.root.name # => "w:p"
doc.root.namespace # => nil
doc.root.namespaces # => {}

When this XML string is parsed, there are no namespaces defined on the root node. As a result, libxml2 (the underlying parsing engine that Nokogiri uses) treats this node as if its name is w:p and it has no namespace.

If we include the namespace definition on the root node, everything works:

xml = "<w:p xmlns:w='https://foo.com/bar'></w:p>"
doc = Nokogiri::XML(xml)
doc.root.name # => "p"
doc.root.namespace # => #<Nokogiri::XML::Namespace:0x3c prefix="w" href="https://foo.com/bar">
doc.root.namespaces # => {"xmlns:w"=>"https://foo.com/bar"}
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => "<w:p xmlns:w=\"https://foo.com/bar\"/>"

Note here that the name of the node is p and the namespace points at a ns with the prefix of w.

Attempt 1 (incomplete)

We can attempt to re-add the namespace to the node:

xml = "<w:p></w:p>"

doc = Nokogiri::XML(xml)

ns = doc.root.add_namespace_definition("w", "https://foo.com/bar")
doc.root.namespace = ns
doc.root.name # => "w:p"
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => ""

# need to update the name of the node to make it really work
doc.root.name = "p"
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => "<w:p xmlns:w=\"https://foo.com/bar\"/>"

Note that even after:

  1. creating a namespace definition
  2. setting the namespace of the node to that namespace definition

the name of the node is still w:p, and there's no easy way to query for that.

Solution 1

At this point, we can serialize a valid document (via #to_xml) even though the data structures in memory aren't valid. So we could re-parse the serialized document!

xml = "<w:p></w:p>"

doc = Nokogiri::XML(xml)
doc.root.add_namespace_definition("w", "https://foo.com/bar")
doc.to_xml
# => "<?xml version=\"1.0\"?>\n" + "<w:p xmlns:w=\"https://foo.com/bar\"/>\n"

doc = Nokogiri::XML(doc.to_xml) # reparse
doc.to_xml
# => "<?xml version=\"1.0\"?>\n" + "<w:p xmlns:w=\"https://foo.com/bar\"/>\n"
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => "<w:p xmlns:w=\"https://foo.com/bar\"/>"

This has the advantage of being simple, but is a little bit slow because we're parsing the document a second time.

Solution 2

We could also hack the name of the node along with hacking the namespace definition and the namespace:

xml = "<w:p></w:p>"

doc = Nokogiri::XML(xml)
ns = doc.root.add_namespace_definition("w", "https://foo.com/bar")
doc.root.namespace = ns
doc.root.name = "p"
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => "<w:p xmlns:w=\"https://foo.com/bar\"/>"

This has the advantage of modifying the existing data structure without re-parsing, but requires a bit of code to rename all the nodes in a non-trivial tree.

Solution 3

A more subtle solution, but one that may be more repeatable, is to parse your xml fragment within the context of a document that has all the namespaces defined. "Within the context of" means that the parsing is done as if the fragment is a child of a particular node, but the parsed nodes aren't children of the context node.

# first create the "context node" in a new document
root_xml = "<root xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'/>"

root_doc = Nokogiri::XML(root_xml)
root_doc.to_xml
# => "<?xml version=\"1.0\"?>\n" +
#    "<root xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"/>\n"
root_doc.root.namespaces # => {"xmlns:w"=>"http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# then parse the xml fragment in that node's context
xml = "<w:p><w:hyperlink/></w:p>"

fragment = root_doc.root.parse(xml)
fragment.to_xml # => "<w:p>\n  <w:hyperlink/>\n</w:p>"

# note that the variable `fragment` is a NodeSet ...
fragment.first.name # => "p"
fragment.first.namespace # => #<Nokogiri::XML::Namespace:0x3c prefix="w" href="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

# note the ".//" which needs to prefix the xpath query because this is a nodeset of unparented nodes
# see https://github.com/sparklemotion/nokogiri/issues/572 for more on this
fragment.xpath('.//w:hyperlink', "w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main").to_xml # => "<w:hyperlink/>"

Solution 4

Finally, here is a "hammer" for the case where you don't really care about namespaces and would rather not deal with them at all.

xml = "<w:p><w:hyperlink/></w:p>"

doc = Nokogiri::XML xml
# remove the leading "w:" from every node's name
doc.traverse { |node| node.name = node.name.sub(/\Aw:/, "") if node.element? }
doc.xpath('//hyperlink') # => [#<Nokogiri::XML::Element:0x3c name="hyperlink">]

Hopefully something in there helps?

@flavorjones flavorjones added meta/user-help and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Oct 6, 2020
@sean-yeoh
Author

Hi @flavorjones, thanks a lot for taking the time to answer this!
I think I'll go with solution 3 since I feel it's the easiest to understand at a glance.

Thanks again, have a great week ahead!

@flavorjones
Member

Good choice! Thanks again for asking this question.

@flavorjones
Member

Related, I re-discovered some undesirable behavior of in-context node parsing while I was researching my response above. @sean-yeoh you may want to be aware of the potential unfortunate edge case I documented in #2092
