Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAX::Parser errors when it encounters non-predefined entities. #1926

Closed
searls opened this issue Sep 22, 2019 · 16 comments · Fixed by #3265
Closed

SAX::Parser errors when it encounters non-predefined entities. #1926

searls opened this issue Sep 22, 2019 · 16 comments · Fixed by #3265

Comments

@searls
Copy link

searls commented Sep 22, 2019

Describe the bug

When an XML document contains non-predefined entities—even if the document defines those entities up-front—it will error when parsing with nokogiri's SAX parser.

Note that this warning from libxml2's docs seem to hint that getting this right is hard:

WARNING: handling entities on top of the libxml2 SAX interface is difficult!!!* If you plan to use non-predefined entities in your documents, then the learning curve to handle then using the SAX API may be long. If you plan to use complex documents, I strongly suggest you consider using the DOM interface instead and let libxml deal with the complexity rather than trying to do it yourself.

To Reproduce

#! /usr/bin/env ruby

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE Stuff [
  <!ELEMENT stuff (#PCDATA)>
  <!ENTITY THING "a thing">
  ]>
  <stuff>&THING;</stuff>
XML

require "nokogiri"
require "pp"

puts "----> parsing with DOM parser"
doc = Nokogiri::XML.parse(xml)
pp doc

puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
  def error(s)
    raise s
  end
end

Nokogiri::XML::SAX::Parser.new(StuffDoc.new).parse(xml)

When run, this will output:

----> parsing with DOM parser
#(Document:0x3fd9cdca51ac {
  name = "document",
  children = [
    #(DTD:0x3fd9cdca96a8 {
      name = "Stuff",
      children = [
        #(ElementDecl:0x3fd9cdca862c { name = "stuff" }),
        #(EntityDecl:0x3fd9cdcad8ac {
          name = "THING",
          children = [ #(Text "a thing")]
          })]
      }),
    #(Element:0x3fd9cdcac86c {
      name = "stuff",
      children = [ #(EntityReference:0x3fd9cdcb1f9c { name = "THING" })]
      })]
  })
----> parsing with SAX parser
Traceback (most recent call last):
	4: from demo.rb:24:in `<main>'
	3: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
	2: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_memory'
	1: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_with'
demo.rb:20:in `error': Entity 'THING' not defined (RuntimeError)

Expected behavior

I honestly just don't want this to explode. I'd prefer to get a literal string of the entity (e.g. "&THING;" in this case.

Environment

# Nokogiri (1.10.4)
    ---
    warnings: []
    nokogiri: 1.10.4
    ruby:
      version: 2.6.3
      platform: x86_64-darwin18
      description: ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxml2/2.9.9"
      libxslt_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxslt/1.1.33"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      libxslt_patches:
      - 0001-Fix-security-framework-bypass.patch
      compiled: 2.9.9
      loaded: 2.9.9

Additional context

This is a real problem for one important document, the JMDict XML file, which is a daily export of the most prominent community-maintained Japanese-English dictionary on the Internet. JMDict uses dozens of custom entities for tagging entries with various metadata. However, because the file is over 100MB, it's more appropriate for SAX parsing, which is how folks might run into this problem. (One example)

@flavorjones
Copy link
Member

Hey @searls, thanks for the clear bug report. I wanted to acknowledge that I saw this and let you know I will likely not be able to dig into it immediately, but I agree with your take that the described behavior is problematic.

@tenderlove
Copy link
Member

I did a quick look in to this. I knew I had looked in to this at one point, and now the code is jogging my memory. The SAX parser will actually build a DOM object in order to support entity substitution. I'm not sure if it actually builds the full DOM, but it definitely keeps the entity references on the document structs. In order to support this we need to make sure the myDoc part of the struct is filled out.

I thought the easiest approach would be to initialize the SAX parser with all the default callbacks, but that doesn't seem to work because the context pointer we're using is a nokogiriSAXTuple * where I think all the default callbacks are expecting a xmlParserCtxtPtr.

I think there are two approaches we could take to fix this:

  1. Embed xmlParserCtxt at the head of our nokogiriSAXTuple struct
  2. Implement all the default SAX callbacks and unwrap the nokogiriSAXTuple

I'm not sure if the first approach is possible, and the second approach sounds like a bigger patch.

@tenderlove
Copy link
Member

I forgot to mention, there is a callback for when an entity is encountered. But the huge bummer is that the callback is expected to return a xmlEntityPtr. We can't simply return NULL, otherwise you'll get the same error (but with a callback executed!).

With this patch:

diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 1a5f6c5f..7cc25524 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -7,7 +7,7 @@ static ID id_start_document, id_end_document, id_start_element, id_end_element;
 static ID id_start_element_namespace, id_end_element_namespace;
 static ID id_comment, id_characters, id_xmldecl, id_error, id_warning;
 static ID id_cdata_block, id_cAttribute;
-static ID id_processing_instruction;
+static ID id_processing_instruction, id_get_entity;
 
 static void start_document(void * ctx)
 {
@@ -251,6 +251,21 @@ static void processing_instruction(void * ctx, const xmlChar * name, const xmlCh
   );
 }
 
+static xmlEntityPtr get_entity(void * ctx, const xmlChar *name)
+{
+  VALUE rb_content;
+  VALUE self = NOKOGIRI_SAX_SELF(ctx);
+  VALUE doc = rb_iv_get(self, "@document");
+
+  rb_funcall( doc,
+              id_get_entity,
+              1,
+              NOKOGIRI_STR_NEW2(name)
+  );
+
+  return NULL;
+}
+
 static void deallocate(xmlSAXHandlerPtr handler)
 {
   NOKOGIRI_DEBUG_START(handler);
@@ -276,6 +291,7 @@ static VALUE allocate(VALUE klass)
   handler->error = error_func;
   handler->cdataBlock = cdata_block;
   handler->processingInstruction = processing_instruction;
+  handler->getEntity = get_entity;
   handler->initialized = XML_SAX2_MAGIC;
 
   return Data_Wrap_Struct(klass, NULL, deallocate, handler);
@@ -303,6 +319,7 @@ void init_xml_sax_parser()
   id_error          = rb_intern("error");
   id_warning        = rb_intern("warning");
   id_cdata_block    = rb_intern("cdata_block");
+  id_get_entity     = rb_intern("get_entity");
   id_cAttribute     = rb_intern("Attribute");
   id_start_element_namespace = rb_intern("start_element_namespace");
   id_end_element_namespace = rb_intern("end_element_namespace");

And this script:

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE Stuff [
  <!ELEMENT stuff (#PCDATA)>
  <!ENTITY THING "a thing">
  ]>
  <stuff>&THING;</stuff>
XML

require "nokogiri"
require "pp"

puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
  def get_entity name
    p [__method__, name]
  end

  def error(s)
    p [__method__, s]
  end
end

parser = Nokogiri::XML::SAX::Parser.new(StuffDoc.new)
parser.parse(xml)

Output is:

----> parsing with SAX parser
[:get_entity, "THING"]
[:get_entity, "THING"]
[:error, "Entity 'THING' not defined\n"]

@flavorjones
Copy link
Member

@tenderlove Thanks for looking into this. I've got performance concerns about option 2 you outlined above. If you recall the Fairy Wing Throwdown from 2011, I suspected that the poor performance of the SAX parser was due to callbacks. Maybe we should actually benchmark what happens if we implement all the defaults so we know for sure if this is reasonable to do?

@nwellnhof
Copy link

You have to implement the getEntity callback in libxml2's SAX interface. This should return a pointer to an xmlEntity struct which can be created with xmlNewEntity(). Make sure to set the doc argument to NULL, so the document's entity table won't be polluted. etype should be XML_INTERNAL_PREDEFINED_ENTITY for literal string or XML_INTERNAL_GENERAL_ENTITY for XML data to be parsed. To simplify memory management, it should be safe to free the old struct if getEntity is invoked again.

@flavorjones
Copy link
Member

OK, so I have a branch where I've implemented a default getEntity callback that invokes libxml2's xmlSAX2GetEntity to return the correct entity from the document:

diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 989ad9eb..622e8159 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -265,6 +265,12 @@ processing_instruction(void *ctx, const xmlChar *name, const xmlChar *content)
             );
 }
 
+static xmlEntityPtr
+get_entity(void *ctx, const xmlChar *name)
+{
+  return xmlSAX2GetEntity(NOKOGIRI_SAX_CTXT(ctx), name);
+}
+
 static size_t
 memsize(const void *data)
 {
@@ -300,6 +306,7 @@ allocate(VALUE klass)
   handler->cdataBlock = cdata_block;
   handler->processingInstruction = processing_instruction;
   handler->initialized = XML_SAX2_MAGIC;
+  handler->getEntity = get_entity;
 
   return self;
 }

The good news is that with this change, errors are no longer reported for entities that are properly declared in the DTD.

The less-good news is that the expansion of the entity is passed to the characters callback, meaning that detecting it as an entity as such is not easy. We could add a callback Nokogiri::XML::SAX::Document#get_entity which could be passed the name and the expansion, but there doesn't seem to be a way to prevent libxml2 from invoking the characters callback.

This change would have some implications for the design of eiwa which uses the Eiwa::Tag::Entity class.

@searls what are your thoughts here? Using Nokogiri with above patch, I can make all eiwa tests pass by applying this patch:

diff --git a/lib/eiwa/jmdict/doc.rb b/lib/eiwa/jmdict/doc.rb
index 8d2d4155..eeb91880 100644
--- a/lib/eiwa/jmdict/doc.rb
+++ b/lib/eiwa/jmdict/doc.rb
@@ -62,15 +62,7 @@ def characters(s)
       # end
 
       def error(msg)
-        if (matches = msg.match(/Entity '(\S+)' not defined/))
-          # See: http://github.com/sparklemotion/nokogiri/issues/1926
-          code = matches[1]
-          @current.set_entity(code, ENTITIES[code])
-        elsif msg == "Detected an entity reference loop\n"
-          # Do nothing and hope this does not matter.
-        else
-          raise Eiwa::Error.new("Parsing error: #{msg}")
-        end
+        raise Eiwa::Error.new("Parsing error: #{msg}")
       end
 
       # def cdata_block string
diff --git a/lib/eiwa/jmdict/entities.rb b/lib/eiwa/jmdict/entities.rb
index cf218d60..553e952c 100644
--- a/lib/eiwa/jmdict/entities.rb
+++ b/lib/eiwa/jmdict/entities.rb
@@ -13,7 +13,7 @@ module Jmdict
       "adj-ku" => "`ku' adjective (archaic)",
       "adj-na" => "adjectival nouns or quasi-adjectives (keiyodoshi)",
       "adj-nari" => "archaic/formal form of na-adjective",
-      "adj-no" => "nouns which may take the genitive case particle `no'",
+      "adj-no" => "nouns which may take the genitive case particle 'no'",
       "adj-pn" => "pre-noun adjectival (rentaishi)",
       "adj-shiku" => "`shiku' adjective (archaic)",
       "adj-t" => "`taru' adjective",
@@ -34,7 +34,7 @@ module Jmdict
       "chem" => "chemistry term",
       "chn" => "children's language",
       "col" => "colloquialism",
-      "comp" => "computer terminology",
+      "comp" => "computing",
       "conj" => "conjunction",
       "cop" => "copula",
       "cop-da" => "copula",
@@ -101,6 +101,7 @@ module Jmdict
       "quote" => "quotation",
       "rare" => "rare",
       "rkb" => "Ryuukyuu-ben",
+      "rK" => "rarely used kanji form",
       "sens" => "sensitive",
       "shogi" => "shogi term",
       "sl" => "slang",
diff --git a/lib/eiwa/tag/entity.rb b/lib/eiwa/tag/entity.rb
index 12f4f8c7..a75263a9 100644
--- a/lib/eiwa/tag/entity.rb
+++ b/lib/eiwa/tag/entity.rb
@@ -4,10 +4,13 @@ class Entity < Any
       attr_reader :code, :text
 
       def initialize(code: nil, text: nil)
-        @code = code
         @text = text
       end
 
+      def add_characters(s)
+        @text = s
+      end
+
       def set_entity(code, text)
         @code = code
         @text = text

@nwellnhof
Copy link

IIRC, custom SAX parsers can only work in entity replacement mode (XML_PARSE_NOENT). Without this option, the callback sequence is a bit nonsensical.

@flavorjones
Copy link
Member

@nwellnhof Just to make sure I understand your meaning -- are you saying that there's no way to avoid the characters callback being invoked for entities?

@nwellnhof
Copy link

are you saying that there's no way to avoid the characters callback being invoked for entities?

Yes, but when substituting entities, this should be what you want.

@searls
Copy link
Author

searls commented Mar 19, 2024

@flavorjones this seems about right? As long as I'm able to retrieve the code (i.e. "uk"), I'm happy.

@flavorjones
Copy link
Member

@searls Sorry for my delayed response. To be clear, the above proposal does NOT give you access to the entity name (code).

I think we have a few options that might be able to solve this problem fully:

  1. implement the entityDecl callback in libxml2's SAX model, which will get invoked when the entity declaration is processed. Eiwa can then build a hash table, and when Entity#characters is called, look it up in the hash table to find the code.
  2. implement a new callback in nokogiri for when an entity is encountered and resolved in a document. Unfortunately we can't prevent libxml2 from also invoking the characters callback, which isn't a problem for Eiwa, but also makes this callback not very useful in the general case.
  3. not sure if this is a reasonable suggestion, but would you consider dropping @code from the Eiwa::Tag::Entity class? Do you know if this is used by Eiwa users?

WDYT? I'm leaning towards (1) which seems aligned with libxml2's design and I think is more generally useful.

@searls
Copy link
Author

searls commented Jun 23, 2024

I think the first course of action would be best, as it would improve the utility of SAX for all libxml2 users. Good idea!

@nwellnhof
Copy link

I'd also suggest a default implementation of the entityDecl and getEntity callbacks similar to the following code in libxml2's test suite: https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/runtest.c?ref_type=heads#L721

@flavorjones
Copy link
Member

deep breath this has been a real journey!

I've got a branch (which just needs some final polish) that

  • removes our crufty SAX-tuple struct and replaces it with code that is compatible with libxml2's default SAX handlers
  • uses some of the default SAX handlers to ensure DTDs, entity declarations, and entity references are handled properly (fixing this issue)
  • implements the SAX callback reference (for non-predefined entities when #replace_entities is false (the default))

I've also fixed some quirky behavior in the JRuby impl along the way.

As a result, the fix to eiwa is something like:

diff --git a/lib/eiwa/jmdict/doc.rb b/lib/eiwa/jmdict/doc.rb
index 8d2d4155..6d554119 100644
--- a/lib/eiwa/jmdict/doc.rb
+++ b/lib/eiwa/jmdict/doc.rb
@@ -53,6 +53,10 @@ def characters(s)
         @current.add_characters(s)
       end

+      def reference(name, content)
+        @current.set_entity(name, content)
+      end
+
       # def comment string
       #   puts "comment #{string}"
       # end

which, since it does nothing on older versions of Nokogiri, is backwards-compatible. (I will submit a PR to eiwa shortly.)

I hope to have that branch up in a PR tomorrow! 😅 Thanks for your patience.

@searls
Copy link
Author

searls commented Jul 1, 2024

Nice! Sounds like a big win from a tech debt perspective

flavorjones added a commit that referenced this issue Jul 1, 2024
On CRuby, this fixes the fact that the parser was registering errors
when encountering general (non-predefined) entities. Now these
entities are resolved properly and converted into `#characters`
callbacks. Fixes #1926.

On JRuby, the SAX parser now respects the `#replace_entities`
attribute, which was previously ignored AND defaulted incorrectly to
`true`. The default now matches CRuby -- `false` -- and the parser
behavior matches CRuby with respect to entities. Fixes #614.

This commit also includes some granular tests of how the sax parser
handles different entities under different circumstances, which should
be clarifying for user reports like #1284 and #1500 that expect
predefined entities and character references to be treated like parsed
entities (which they aren't).
flavorjones added a commit that referenced this issue Jul 1, 2024
The behavior here is relatively complex, being a function of entity
type and `#replace_entities` value, but there are sufficient tests and
both Java and C impls behave identically.

Related to the problem described at #1926.
@flavorjones
Copy link
Member

PR is at #3265

flavorjones added a commit that referenced this issue Jul 2, 2024
The behavior here is relatively complex, being a function of entity
type and `#replace_entities` value, but there are sufficient tests and
both Java and C impls behave identically.

Related to the problem described at #1926.
flavorjones added a commit that referenced this issue Jul 2, 2024
**What problem is this PR intended to solve?**

#1926 described an issue wherein the SAX parser was not correctly
resolving and replacing internal entities, and was instead reporting an
error for each entity reference. This PR includes a fix for that
problem.

I've removed the unnecessary "SAX tuple" from the SAX implementation,
replacing it with the `_private` struct member that libxml2 makes
available. Then I set up the parser context structs so that we can use
libxml2's standard SAX callbacks where they're useful (which is how I
addressed the above issue).

This PR also introduces a new feature, a SAX handler callback
`Document#reference` which allows callers to get entity-specific name
and replacement text information (rather than relying on the
`Document#characters` callback). This can be used to solve the original
issue in #1926 with this code: searls/eiwa#11

The behavior of the SAX parser with respect to entities is complex
enough that I wrote up a short doc in the `XML::SAX::Document` docstring
with a table and explanation. I've also added warnings to remind users
that `#replace_entities` is not safe to set when parsing untrusted
documents.

In the Java implementation, I've fixed the `#replace_entities` option in
the SAX parser context and set it to the proper default (`false`),
fixing #614. I've also corrected the value of the URI argument to
`Document#start_element_namespace` which was a blank string when it
should have been `nil`.

I've added quite a bit of testing around the SAX parser's handling of
entities.

I added and clarified quite a bit of documentation around SAX parsing
generally. Exception messages have been clarified in a couple of places,
and made consistent between the C and Java implementations. This should
address questions asked in issues #1500 and #1284.

Finally, I cleaned up some of the C code that implements SAX parsing,
naming functions more explicitly (and moving towards some kind of
standard naming convention).

Closes #1926.
Closes #614.


**Have you included adequate test coverage?**

Yes!


**Does this change affect the behavior of either the C or the Java
implementations?**

Yes, but the implementations are much more consistent with each other
now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants