Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] "in context" fragment parsing silently degrades functionality when XML errors are encountered #2092

Closed
flavorjones opened this issue Oct 8, 2020 · 4 comments · Fixed by #3256

Comments

@flavorjones
Copy link
Member

flavorjones commented Oct 8, 2020

Please describe the bug

A fragment that is parsed "in context" and contains recoverable errors is silently parsed "out of context".

The code at

error_count = document.errors.length
node_set = in_context(contents, options.to_i)
if node_set.empty? and document.errors.length > error_count and options.recover?
fragment = Nokogiri::HTML::DocumentFragment.parse contents
node_set = fragment.children
end
is responsible for this behavior, dating back to a 2010 fix for #313.

The root cause is that libxml2 does not pay attention to the "recover" option when parsing fragments in context via xmlParseInNodeContext.

Help us reproduce what you're seeing

#! /usr/bin/env ruby

require 'nokogiri'

context_xml = "<root xmlns:n='https://example.com/foo'></root>"
context_doc = Nokogiri::XML::Document.parse(context_xml)

valid_xml_fragment = "<n:a><b/></n:a>"
invalid_xml_fragment = "<n:a><b></n:a>" # note missing closing tag for `b`

# valid fragment parses fine
context_doc.root.parse(valid_xml_fragment).tap do |fragment|
  fragment.to_xml # => "<n:a>\n  <b/>\n</n:a>"
  fragment.first.name # => "a"
  fragment.first.namespace # => #<Nokogiri::XML::Namespace:0x3c prefix="n" href="https://example.com/foo">
end

# invalid fragment parses with errors, cannot recover, and is silently parsed out of context leading
# to namespaces not being properly referenced
context_doc.root.parse(invalid_xml_fragment).tap do |fragment|
  fragment.to_xml # => "<a>\n  <b/>\n</a>"
  fragment.first.name # => "a"
  fragment.first.namespace # => nil
end

run with:

# Nokogiri (1.10.10)
    ---
    warnings: []
    nokogiri: 1.10.10
    ruby:
      version: 2.7.0
      platform: x86_64-linux
      description: ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
      libxslt_path: "/home/flavorjones/.rvm/gems/ruby-2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      libxslt_patches: []
      compiled: 2.9.10
      loaded: 2.9.10

Expected behavior

I think preferable behavior would be to choose one of:

  1. add an error noting that we "fell back" and pointing the user to turning off the recover option
  2. don't recover, but raise a sensible exception
  3. fix libxml2

Additional context

The behavior described here was introduced in #313.

@flavorjones flavorjones added state/needs-triage Inbox for non-installation-related bug reports or help requests topic/fragment topic/namespaces and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Oct 8, 2020
flavorjones added a commit that referenced this issue Oct 8, 2020
and link to the github issue I just opened, #2092.

[skip ci]
@flavorjones flavorjones added this to the 2.0.0 milestone Oct 8, 2020
@flavorjones
Copy link
Member Author

Side note: on JRuby in-context parsing seems to not work:

#! /usr/bin/env ruby
require 'nokogiri'

context_xml = "<root xmlns:n='https://example.com/foo'></root>"
context_doc = Nokogiri::XML::Document.parse(context_xml)

valid_xml_fragment = "<n:a><b/></n:a>"
invalid_xml_fragment = "<n:a><b></n:a>" # note missing closing tag for `b`

# valid fragment parses fine
context_doc.root.parse(valid_xml_fragment).tap do |fragment|
  fragment.to_xml # => "<n:a><b/></n:a>"
  fragment.first.name # => "a"
  fragment.first.namespace # => nil
end

# invalid fragment parses with errors, cannot recover, and is silently parsed out of context leading
# to namespaces not being properly referenced
context_doc.root.parse(invalid_xml_fragment).tap do |fragment|
  fragment.to_xml # => "<n:a><b/></n:a>"
  fragment.first.name # => "a"
  fragment.first.namespace # => nil
end

run with

# Nokogiri (1.10.10)
    ---
    warnings: []
    nokogiri: 1.10.10
    ruby:
      version: 2.5.7
      platform: java
      description: jruby 9.2.9.0 (2.5.7) 2019-10-30 458ad3e OpenJDK 64-Bit Server VM 11.0.8+10-post-Ubuntu-0ubuntu120.04
        on 11.0.8+10-post-Ubuntu-0ubuntu120.04 [linux-x86_64]
      engine: jruby
      jruby: 9.2.9.0
    xerces: Xerces-J 2.12.0
    nekohtml: NekoHTML 1.9.21

@flavorjones
Copy link
Member Author

This recovery behavior seems to be fixed for HTML4 documents on libxml2 master, but isn't fixed for XML documents. I've inquired about it at (on master) in-context parsing recovery: XML vs HTML (#645) · Issues · GNOME / libxml2 · GitLab

flavorjones added a commit that referenced this issue Dec 14, 2023
**What problem is this PR intended to solve?**

Upstream libxml2 test failures.

- work around a bug/merge-request filed upstream [(on master) fix:
support ASCII encoding on input buffers (!231) · Merge requests · GNOME
/ libxml2 ·
GitLab](https://gitlab.gnome.org/GNOME/libxml2/-/merge_requests/231)
- looks like in-context parser recovery now works? updating
`XML::Node#parse` behavior to accommodate it in a backwards-compatible
fashion. See #2092 which might be fixed in future versions of libxml2.


**Have you included adequate test coverage?**

Making existing tests pass (or not segfault while we wait for an
upstream fix).


**Does this change affect the behavior of either the C or the Java
implementations?**

No.
@flavorjones
Copy link
Member Author

The underlying behavior being worked around will be fixed in libxml 2.13 for both HTML4 and XML, so I'm not going to fix this in Nokogiri.

@flavorjones flavorjones modified the milestones: v2.0.0, v1.17.0 Jun 24, 2024
@flavorjones
Copy link
Member Author

Tagged for v1.17.0 since that seems to be when we'll ship libxml 2.13.x.

flavorjones added a commit that referenced this issue Jun 24, 2024
This hack is no longer necessary since upstream recovery behavior has
improved in v2.13.

Closes #2092
flavorjones added a commit that referenced this issue Jun 24, 2024
This hack is no longer necessary since upstream recovery behavior has
improved in v2.13.

Closes #2092
flavorjones added a commit that referenced this issue Jun 24, 2024
This hack is not necessary with libxml 2.13 which improves fragment
recovery behavior.

Closes #2092
flavorjones added a commit that referenced this issue Jun 24, 2024
This hack is not necessary with libxml 2.13 which improves fragment
recovery behavior.

Closes #2092
flavorjones added a commit that referenced this issue Jun 24, 2024
…2.13 (#3256)

**What problem is this PR intended to solve?**

This hack is not necessary with libxml 2.13 which improves fragment
recovery behavior.

- add a TODO to remind me to remove the hack once we no longer support
libxml 2.13 (system libs)
- add a test that asserts the correct behavior when using libxml >= 2.13

Closes #2092


**Have you included adequate test coverage?**

Yes. 


**Does this change affect the behavior of either the C or the Java
implementations?**

Sadly, the Java implementation still does not handle in-context fragment
parsing correctly, but that's out of scope for this improvement.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant