[help] Nokogiri xpath without root element #2091
👋 Hi @sean-yeoh, sorry you're having trouble. I'll try to help! OK, so I'm not completely sure about the context in which you're working, so I'm going to explain what's going wrong, and then suggest a couple of ways to work around it.
First, the query
You're correct in that an XPath search in a namespaced document needs to provide the namespaces as a hash in the query call. This is covered pretty well in https://nokogiri.org/tutorials/searching_a_xml_html_document.html so I won't go into it here.
What's going wrong
Let's take a simplified example:
xml = "<w:p></w:p>"
doc = Nokogiri::XML(xml)
doc.xpath('//w:p', "w" => "https://foo.com/bar").inspect # => "[]"
doc.root.name # => "w:p"
doc.root.namespace # => nil
doc.root.namespaces # => {}
When this XML string is parsed, there are no namespaces defined on the root node. As a result, libxml2 (the underlying parsing engine that Nokogiri uses) treats this node as if its name is "w:p", with no namespace attached.
If we include the namespace definition on the root node, everything works:
xml = "<w:p xmlns:w='https://foo.com/bar'></w:p>"
doc = Nokogiri::XML(xml)
doc.root.name # => "p"
doc.root.namespace # => #<Nokogiri::XML::Namespace:0x3c prefix="w" href="https://foo.com/bar">
doc.root.namespaces # => {"xmlns:w"=>"https://foo.com/bar"}
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => "<w:p xmlns:w=\"https://foo.com/bar\"/>"
Note here that the name of the node is "p", and the "w" prefix is carried by its namespace.
Attempt 1 (incomplete)
We can attempt to re-add the namespace to the node:
xml = "<w:p></w:p>"
doc = Nokogiri::XML(xml)
ns = doc.root.add_namespace_definition("w", "https://foo.com/bar")
doc.root.namespace = ns
doc.root.name # => "w:p"
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => ""
# need to update the name of the node to make it really work
doc.root.name = "p"
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => "<w:p xmlns:w=\"https://foo.com/bar\"/>"
Note that even after adding the namespace definition and assigning it to the root node, the name of the node is still "w:p" until we rename it.
Solution 1
At this point, we can serialize a valid document (via to_xml) and reparse it:
xml = "<w:p></w:p>"
doc = Nokogiri::XML(xml)
doc.root.add_namespace_definition("w", "https://foo.com/bar")
doc.to_xml
# => "<?xml version=\"1.0\"?>\n" + "<w:p xmlns:w=\"https://foo.com/bar\"/>\n"
doc = Nokogiri::XML(doc.to_xml) # reparse
doc.to_xml
# => "<?xml version=\"1.0\"?>\n" + "<w:p xmlns:w=\"https://foo.com/bar\"/>\n"
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => "<w:p xmlns:w=\"https://foo.com/bar\"/>"
This has the advantage of being simple, but is a little bit slow because we're parsing the document a second time.
Solution 2
We could also hack the name of the node along with hacking the namespace definition and the namespace:
xml = "<w:p></w:p>"
doc = Nokogiri::XML(xml)
ns = doc.root.add_namespace_definition("w", "https://foo.com/bar")
doc.root.namespace = ns
doc.root.name = "p"
doc.xpath('//w:p', "w" => "https://foo.com/bar").to_xml # => "<w:p xmlns:w=\"https://foo.com/bar\"/>"
This has the advantage of modifying the existing data structure without re-parsing, but requires a bit of code to rename all the nodes in a non-trivial tree.
Solution 3
A more subtle solution, but one that may be more repeatable, is to parse your xml fragment within the context of a document that has all the namespaces defined. "Within the context of" means that the parsing is done as if the fragment is a child of a particular node, but the parsed nodes aren't children of the context node.
# first create the "context node" in a new document
root_xml = "<root xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'/>"
root_doc = Nokogiri::XML(root_xml)
root_doc.to_xml
# => "<?xml version=\"1.0\"?>\n" +
# "<root xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"/>\n"
root_doc.root.namespaces # => {"xmlns:w"=>"http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
# then parse the xml fragment in that node's context
xml = "<w:p><w:hyperlink/></w:p>"
fragment = root_doc.root.parse(xml)
fragment.to_xml # => "<w:p>\n <w:hyperlink/>\n</w:p>"
# note that the variable `fragment` is a NodeSet ...
fragment.first.name # => "p"
fragment.first.namespace # => #<Nokogiri::XML::Namespace:0x3c prefix="w" href="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
# note the ".//" which needs to prefix the xpath query because this is a nodeset of unparented nodes
# see https://github.com/sparklemotion/nokogiri/issues/572 for more on this
fragment.xpath('.//w:hyperlink', "w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main").to_xml # => "<w:hyperlink/>"
Solution 4
Finally, I present an option that is a "hammer" if you don't really care about namespaces, in which case maybe you don't want to bother having to deal with this.
xml = "<w:p><w:hyperlink/></w:p>"
doc = Nokogiri::XML xml
# remove the leading "w:" from every node's name
doc.traverse { |node| node.name = node.name.split("w:").join }
doc.xpath('//hyperlink') # => [#<Nokogiri::XML::Element:0x3c name="hyperlink">]
Hopefully something in there helps? |
Hi @flavorjones, thanks a lot for taking the time to answer this! Thanks again, have a great week ahead! |
Good choice! Thanks again for asking this question. |
Related, I re-discovered some undesirable behavior of in-context node parsing while I was researching my response above. @sean-yeoh you may want to be aware of the potential unfortunate edge case I documented in #2092 |
Hi guys, I'm not too sure where to ask for help so I'm trying my luck here. Please feel free to close this and direct me to a proper channel if there's one. Thanks!
To Reproduce
I'm trying to query the XML partial string below in the variable text. It's in Open XML format. When I tried to query it without adding a root element, I get the Undefined namespace prefix error.
I then tried adding a root element and adding namespaces manually. It doesn't raise an error now, but it's returning an empty array.
I would like to find all <w:hyperlink> nodes within an XML partial.
Expected behavior
It should return the <w:hyperlink> node.
Environment
Please paste the output from nokogiri -v here, escaped by triple-backtick.
This output will tell us what version of Ruby you're using, how you installed nokogiri, what versions of the underlying libraries you're using, and what operating system you're using.
Additional context