bug: v1.13.7 XML::Reader may segfault #2598

Closed
flavorjones opened this issue Jul 20, 2022 · 10 comments
Labels
topic/memory Segfaults, memory leaks, valgrind testing, etc.

Comments

@flavorjones
Member

Originally reported downstream in pythonicrubyist/creek#105

Special thanks to @bf4 who helped reliably reproduce this.

More details coming shortly.

@flavorjones
Member Author

Summary and Diagnosis

This comment is an explanation of the problem. I'll follow up with a few suggestions for approaches to fix it.

The Symptoms

Users of Nokogiri::XML::Reader are experiencing segfaults in Nokogiri v1.13.7 while traversing a document.

A reproduction (provided by @bf4) captured in valgrind showed the following memory error:

==721789== Invalid read of size 8
==721789==    at 0xBAF3D44: _xml_node_mark (xml_node.c:24)
==721789==    by 0x49474AE: gc_mark_children (gc.c:6344)
==721789==    by 0x494871C: gc_mark_stacked_objects (gc.c:6448)
==721789==    by 0x494871C: gc_mark_stacked_objects_all (gc.c:6488)
==721789==    by 0x494871C: gc_marks_rest (gc.c:7429)
==721789==    by 0x4949007: gc_marks (gc.c:7485)
==721789==    by 0x4949007: gc_start (gc.c:8334)
...

Nokogiri v1.13.7 introduced some changes in how Nodes behave during garbage collection, so this stack isn't terribly surprising. But the changes were relatively straightforward and this is rather unexpected.

It also happens only rarely, so it's challenging to reproduce and didn't show up in any of the valgrind jobs in our CI pipeline.

How Reader's memory lifecycle works

Nodes associated with an XML::Reader don't have a fully-realized document. They have an associated xmlDoc struct (pointed to in the xmlNode->doc struct member), but the libxml Reader API doesn't expose it, and Nokogiri doesn't expose it, and so it never gets wrapped in a Nokogiri::XML::Document ruby object.

To demonstrate, if we add some debugging output to xml_node.c:noko_xml_node_wrap() as follows:

--- a/ext/nokogiri/xml_node.c
+++ b/ext/nokogiri/xml_node.c
@@ -2069,6 +2069,18 @@ noko_xml_node_wrap(VALUE rb_class, xmlNodePtr c_node)
   rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type, c_node) ;
   c_node->_private = (void *)rb_node;
 
+  {
+    VALUE message = rb_sprintf(
+      "MIKE: %"PRIsVALUE" name=%s, c=%p | document: c=%p, ruby=%s\n",
+      rb_funcall(rb_class, rb_intern("to_s"), 0),
+      c_node->name,
+      c_node,
+      c_node->doc,
+      node_has_a_document ? "yes" : "no"
+      );
+    fprintf(stderr, "%s", StringValueCStr(message));
+  }
+
   if (node_has_a_document) {
     rb_document = DOC_RUBY_OBJECT(c_doc);
     rb_node_cache = DOC_NODE_CACHE(c_doc);

Then traversing a document with Reader (specifically, inspecting node attributes) will emit output like:

MIKE: Nokogiri::XML::Attr name=r, c=0x000055c2e5e6e400, ruby=0x000055c2e61f39d0 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=customFormat, c=0x000055c2e5f47730, ruby=0x000055c2e61f38b8 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=ht, c=0x000055c2e64ed970, ruby=0x000055c2e61f37f0 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=hidden, c=0x000055c2e61dca00, ruby=0x000055c2e61f3728 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=customHeight, c=0x000055c2e5f85900, ruby=0x000055c2e61f3610 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=outlineLevel, c=0x000055c2e5f1b900, ruby=0x000055c2e61f3548 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=collapsed, c=0x000055c2e5b92090, ruby=0x000055c2e61f33e0 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=r, c=0x000055c2e5849310, ruby=0x000055c2e61f2a58 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=customFormat, c=0x000055c2e58492a0, ruby=0x000055c2e61f2990 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=ht, c=0x000055c2e5f1abc0, ruby=0x000055c2e61f27d8 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=hidden, c=0x000055c2e5f1ab50, ruby=0x000055c2e61f2710 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=customHeight, c=0x000055c2e5f41b40, ruby=0x000055c2e61f2620 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=outlineLevel, c=0x000055c2e5f41bb0, ruby=0x000055c2e61f2558 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=collapsed, c=0x000055c2e5f41a00, ruby=0x000055c2e61f2468 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=r, c=0x000055c2e5b92090, ruby=0x000055c2e61f19a0 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=customFormat, c=0x000055c2e5f1b900, ruby=0x000055c2e61f18d8 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=ht, c=0x000055c2e5f85900, ruby=0x000055c2e61f17e8 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=hidden, c=0x000055c2e61dca00, ruby=0x000055c2e61f1720 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=customHeight, c=0x000055c2e64ed970, ruby=0x000055c2e61f1568 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=outlineLevel, c=0x000055c2e5f47730, ruby=0x000055c2e61f1400 | document: c=0x000055c2e5b95070, ruby=no
MIKE: Nokogiri::XML::Attr name=collapsed, c=0x000055c2e5e6e400, ruby=0x000055c2e61f1338 | document: c=0x000055c2e5b95070, ruby=no

Note that these are all attribute nodes returned by Reader#attribute_nodes, and these are the only kind of nodes associated with Reader that get wrapped by Ruby objects.

(We can verify this statement by adding these lines to noko_xml_node_wrap():

    if (!node_has_a_document && rb_class != cNokogiriXmlAttr) {
      abort();
    }

and seeing that the Nokogiri test suite runs without hitting this abort statement.)

There has always been special handling in noko_xml_node_wrap to handle these nodes, because libxml2's memory lifecycle for them is completely different from ordinary nodes in that:

  • the C structs returned from xmlTextReaderExpand() are short-lived and reused
  • and therefore are not garbage-collected like all the other Ruby-wrapped nodes in Nokogiri

We can see that the C structs are getting reused: search for 0x000055c2e5e6e400 in the above output and you'll see two matches showing the same C struct address being associated with two different Ruby objects:

MIKE: Nokogiri::XML::Attr name=r, c=0x000055c2e5e6e400, ruby=0x000055c2e61f39d0 | document: c=0x000055c2e5b95070, ruby=no
...
MIKE: Nokogiri::XML::Attr name=collapsed, c=0x000055c2e5e6e400, ruby=0x000055c2e61f1338 | document: c=0x000055c2e5b95070, ruby=no

This behavior is dangerous, and it makes the Reader API unsafe (as mentioned in ROADMAP.md) because callers may keep a reference to the Ruby object while libxml2 has freed or re-used the underlying C struct. In a future release I may re-implement or remove the Reader#attribute_nodes method for this reason.
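To make the hazard concrete, here is a minimal sketch in plain C. It is not Nokogiri or libxml2 code: `toy_attr`, `pool_take`, `pool_release`, and `wrapper` are hypothetical stand-ins, and a one-slot static pool stands in for the reader recycling its attribute structs (so the sketch itself never touches freed memory).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for an attribute struct; only the fields we need. */
typedef struct toy_attr {
    const char *name;
    int in_use;
} toy_attr;

/* A one-slot pool models the reader handing out the same struct repeatedly. */
static toy_attr pool_slot;

static toy_attr *pool_take(const char *name) {
    assert(!pool_slot.in_use);
    pool_slot.name = name;
    pool_slot.in_use = 1;
    return &pool_slot;
}

static void pool_release(toy_attr *a) {
    a->in_use = 0;
}

/* Hypothetical stand-in for a Ruby object wrapping a C struct. */
typedef struct wrapper {
    toy_attr *c_node;
} wrapper;

/* Returns the name that a stale wrapper observes after the struct is reused. */
static const char *stale_name_after_reuse(void) {
    wrapper first = { pool_take("r") };           /* wrapped on the first node */
    pool_release(first.c_node);                   /* reader moves to the next node */
    wrapper second = { pool_take("collapsed") };  /* same address, new contents */
    const char *seen = first.c_node->name;        /* old wrapper, new data */
    pool_release(second.c_node);
    return seen;
}
```

The first wrapper was created for the attribute named `r`, but once the slot is recycled it reports `collapsed`, which mirrors the two log lines above that share one C address.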

But really, this is fine! It's a wonky memory lifecycle, but the code in v1.13.6 and earlier handled it (since v1.4.5 in 2011! see e95a3441).

The changes in Nokogiri v1.13.7

In v1.13.7 we introduced some code to make sure that nodes would behave well when compacted by Ruby's new compacting garbage collector. As part of this code, the Ruby object wrapping went from this:

static void
_xml_node_mark(xmlNodePtr node)
{
  xmlDocPtr doc = node->doc;
  if (doc->type == XML_DOCUMENT_NODE || doc->type == XML_HTML_DOCUMENT_NODE) {
    if (DOC_RUBY_OBJECT_TEST(doc)) {
      rb_gc_mark(DOC_RUBY_OBJECT(doc));
    }
  } else if (node->doc->_private) {
    rb_gc_mark((VALUE)doc->_private);
  }
}

VALUE
noko_xml_node_wrap(VALUE rb_class, xmlNodePtr c_node)
{
  VALUE rb_document, rb_node_cache, rb_node;
  nokogiriTuplePtr node_has_a_document;
  xmlDocPtr c_doc;
  void (*f_mark)(xmlNodePtr) = NULL ;

  /* ... */

  node_has_a_document = DOC_RUBY_OBJECT_TEST(c_doc);

  /* ... */

  f_mark = node_has_a_document ? _xml_node_mark : NULL ;
  rb_node = Data_Wrap_Struct(rb_class, f_mark, _xml_node_dealloc, c_node) ;

  /* ... */
}

to (simplified slightly):

static void
_xml_node_mark(xmlNodePtr node)
{
  if (!DOC_RUBY_OBJECT_TEST(node->doc)) {
    return;
  }

  xmlDocPtr doc = node->doc;
  if (doc->type == XML_DOCUMENT_NODE || doc->type == XML_HTML_DOCUMENT_NODE) {
    if (DOC_RUBY_OBJECT_TEST(doc)) {
      rb_gc_mark(DOC_RUBY_OBJECT(doc));
    }
  } else if (node->doc->_private) {
    rb_gc_mark((VALUE)doc->_private);
  }
}

static const rb_data_type_t nokogiri_node_type = {
  "Nokogiri/XMLNode",
  {
    (gc_callback_t)_xml_node_mark, (gc_callback_t)_xml_node_dealloc, 0, (gc_callback_t)_xml_node_update_references
  },
  0, 0, RUBY_TYPED_FREE_IMMEDIATELY,
};

VALUE
noko_xml_node_wrap(VALUE rb_class, xmlNodePtr c_node)
{
  VALUE rb_document, rb_node_cache, rb_node;
  nokogiriTuplePtr node_has_a_document;
  xmlDocPtr c_doc;

  /* ... */
  
  rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type, c_node) ;

  /* ... */
}

In summary: the check for the ruby-wrapped-ness of the node's associated document was moved from "wrap time" to "mark time". From this:

  f_mark = DOC_RUBY_OBJECT_TEST(c_doc) ? _xml_node_mark : NULL ;
  rb_node = Data_Wrap_Struct(rb_class, f_mark, _xml_node_dealloc, c_node) ;

to this:

static void
_xml_node_mark(xmlNodePtr node)
{
  if (!DOC_RUBY_OBJECT_TEST(node->doc)) {
    return;
  }
  /* ... */
}

99.9999% of the time, this is fine.
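The difference between the two approaches can be modeled outside of Ruby. The following is a rough sketch under stated assumptions: `toy_doc`, `toy_node`, `wrapped`, `gc_mark`, and the counter `node_reads` are all hypothetical names, and the "document is wrapped" test is simplified to a single pointer check. It only counts how often the collector dereferences the node pointer.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_doc { void *ruby_object; } toy_doc;  /* hypothetical xmlDoc stand-in */
typedef struct toy_node { toy_doc *doc; } toy_node;     /* hypothetical xmlNode stand-in */

typedef void (*mark_fn)(toy_node *);

typedef struct wrapped {
    toy_node *c_node;
    mark_fn mark;  /* NULL means the collector never touches c_node */
} wrapped;

static int node_reads;  /* number of times a mark function dereferenced a node */

/* Old style: only installed when the document is wrapped. */
static void mark_old(toy_node *n) {
    node_reads++;
    (void)n->doc;  /* would mark the document here */
}

/* New style: always installed, checks wrapped-ness at mark time. */
static void mark_new(toy_node *n) {
    node_reads++;
    if (n->doc == NULL || n->doc->ruby_object == NULL) {
        return;
    }
    /* would mark the document here */
}

static wrapped wrap_old(toy_node *n) {
    int has_doc = n->doc != NULL && n->doc->ruby_object != NULL;
    wrapped w = { n, has_doc ? mark_old : NULL };
    return w;
}

static wrapped wrap_new(toy_node *n) {
    wrapped w = { n, mark_new };
    return w;
}

static void gc_mark(const wrapped *w) {
    if (w->mark != NULL) {
        w->mark(w->c_node);
    }
}

/* Counts node dereferences during one mark pass for a Reader-style node. */
static int reads_during_gc(int use_new_style) {
    toy_doc unwrapped_doc = { NULL };
    toy_node reader_attr = { &unwrapped_doc };
    wrapped w = use_new_style ? wrap_new(&reader_attr) : wrap_old(&reader_attr);
    node_reads = 0;
    gc_mark(&w);
    assert(node_reads >= 0);
    return node_reads;
}
```

In this model the old style never reads the node during marking (the check happened while the struct was known to be valid), while the new style reads it on every mark pass, whether or not the struct is still alive.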

Except when it isn't

Between the time the node is wrapped and the time the mark function gets called, libxml2 may have decided to free (and possibly re-use) the C struct (or the memory). We can see this by applying this patch:

--- a/ext/nokogiri/xml_node.c
+++ b/ext/nokogiri/xml_node.c
@@ -21,6 +21,8 @@ _xml_node_dealloc(xmlNodePtr x)
 static void
 _xml_node_mark(xmlNodePtr node)
 {
+  fprintf(stderr, "MIKE: xml_node.c:mark: node=%p, document=%p\n", node, node->doc);
+
   if (!DOC_RUBY_OBJECT_TEST(node->doc)) {
     return;
   }
@@ -2069,6 +2071,10 @@ noko_xml_node_wrap(VALUE rb_class, xmlNodePtr c_node)
   rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type, c_node) ;
   c_node->_private = (void *)rb_node;
 
+  if (!node_has_a_document) {
+    fprintf(stderr, "MIKE: xml_node.c:wrap: node=%p, document=%p\n", c_node, c_doc);
+  }
+
   if (node_has_a_document) {
     rb_document = DOC_RUBY_OBJECT(c_doc);
     rb_node_cache = DOC_NODE_CACHE(c_doc);

and letting the repro script run. In one particular case, the last lines emitted were:

MIKE: xml_node.c:mark: node=0x55a0d5bdda60, document=0x7f2800000001
/home/flavorjones/code/oss/nokogiri/lib/nokogiri/xml/reader.rb:91: [BUG] Segmentation fault at 0x00007f2800000001

Look backwards in the debug log for the node pointer 0x55a0d5bdda60 and you see:

<snip>
MIKE: xml_node.c:wrap: node=0x55a0d5bdda60, document=0x55a0d56c3280
MIKE: xml_node.c:wrap: node=0x55a0d5bdda60, document=0x55a0d56c3280
MIKE: xml_node.c:mark: node=0x55a0d5bdda60, document=0x55a0d56c3280
MIKE: xml_node.c:mark: node=0x55a0d5bdda60, document=0x7f2800000001

Whoa! The node was wrapped with a valid document pointer. But between the two mark calls, the value of that pointer changes. What is going on?

Let's apply this patch to libxml2's xmlreader.c:xmlTextReaderFreeNode() to see the full picture:

diff --git a/xmlreader.c b/xmlreader.c
index ba95813..8759f14 100644
--- a/xmlreader.c
+++ b/xmlreader.c
@@ -413,7 +413,16 @@ xmlTextReaderFreeNode(xmlTextReaderPtr reader, xmlNodePtr cur) {
     (cur->type == XML_XINCLUDE_START) ||
     (cur->type == XML_XINCLUDE_END)) &&
    (cur->properties != NULL))
+    {
+        fprintf(stderr, "MIKE: freeing properties:");
+        xmlAttrPtr x = cur->properties;
+        while (x != NULL) {
+            fprintf(stderr, " %p", x);
+            x = x->next;
+        }
+        fprintf(stderr, "\n");
    xmlTextReaderFreePropList(reader, cur->properties);
+    }
     if ((cur->content != (xmlChar *) &(cur->properties)) &&
         (cur->type != XML_ELEMENT_NODE) &&
    (cur->type != XML_XINCLUDE_START) &&

This patch dumps the addresses of all the node attribute structs that are getting freed, like this:

MIKE: freeing properties: 0x5619831491b0 0x5619831492a0
MIKE: freeing properties: 0x56198314b290 0x56198314b380 0x56198314b470 0x56198314b560 0x56198314b650 0x56198314b740 0x56198314b830 0x56198314b920

Now, when we repro and segfault during the mark phase, we can see the node being marked, and we can also see if it's been previously freed.

So in this case when we see:

MIKE: xml_node.c:mark: node=0x561982ce9f70, document=0x7efc00000001
/home/flavorjones/code/oss/nokogiri/lib/nokogiri/xml/reader.rb:91: [BUG] Segmentation fault at 0x00007efc00000001

we can grep for 0x561982ce9f70 in the log and we see:

MIKE: xml_node.c:wrap: node=0x561982ce9f70, document=0x561982c22e40
MIKE: freeing properties: 0x561982ce9f70 0x561982623740 0x561982d8fea0 0x561982cfd290 0x561982ee3bc0 0x561982d9f3c0 0x561982e3d240
MIKE: xml_node.c:wrap: node=0x561982ce9f70, document=0x561982c22e40
MIKE: freeing properties: 0x561982e3d240 0x561982d9f3c0 0x561982ee3bc0 0x561982cfd290 0x561982d8fea0 0x561982623740 0x561982ce9f70
MIKE: xml_node.c:wrap: node=0x561982ce9f70, document=0x561982c22e40
MIKE: freeing properties: 0x561982ce9f70 0x561982623740 0x561982d8fea0 0x561982cfd290 0x561982ee3bc0 0x561982d9f3c0 0x561982e3d240
MIKE: xml_node.c:wrap: node=0x561982ce9f70, document=0x561982c22e40
MIKE: freeing properties: 0x561982e3d240 0x561982d9f3c0 0x561982ee3bc0 0x561982cfd290 0x561982d8fea0 0x561982623740 0x561982ce9f70
MIKE: xml_node.c:wrap: node=0x561982ce9f70, document=0x561982c22e40
MIKE: freeing properties: 0x561982ce9f70 0x561982623740 0x561982d8fea0 0x561982cfd290 0x561982ee3bc0 0x561982d9f3c0 0x561982e3d240
MIKE: xml_node.c:wrap: node=0x561982ce9f70, document=0x561982c22e40
MIKE: freeing properties: 0x561982e3d240 0x561982d9f3c0 0x561982ee3bc0 0x561982cfd290 0x561982d8fea0 0x561982623740 0x561982ce9f70
MIKE: freeing properties: 0x561982ce9f70 0x561982623740
MIKE: xml_node.c:mark: node=0x561982ce9f70, document=0x561982c22e40
MIKE: xml_node.c:mark: node=0x561982ce9f70, document=0x561982c22e40
MIKE: xml_node.c:mark: node=0x561982ce9f70, document=0x7efc00000001

We can totally see that this struct is being freed -- twice! -- between being wrapped and being marked. Being freed indicates that the memory could have been re-used, which would explain why the value of the document pointer has changed.

To prove our use-after-free theory, we can ask valgrind to fill memory when it's freed:

VALGRIND="valgrind --free-fill=66"

bundle exec $VALGRIND $(rbenv which ruby) ./repro.rb

and we see

==788214== Invalid read of size 8
==788214==    at 0x106BA8D2: fprintf (stdio2.h:100)
==788214==    by 0x106BA8D2: _xml_node_mark (xml_node.c:24)
==788214==    by 0x49474AE: gc_mark_children (gc.c:6344)
==788214==    by 0x494871C: gc_mark_stacked_objects (gc.c:6448)
==788214==    by 0x494871C: gc_mark_stacked_objects_all (gc.c:6488)
==788214==    by 0x494871C: gc_marks_rest (gc.c:7429)
==788214==    by 0x4949007: gc_marks (gc.c:7485)
==788214==    by 0x4949007: gc_start (gc.c:8334)
...
==788214==  Address 0xe2f43f0 is 64 bytes inside a block of size 96 free'd
==788214==    at 0x483F0C3: free (vg_replace_malloc.c:872)
==788214==    by 0x4942AC4: objspace_xfree (gc.c:10842)
==788214==    by 0x4942AC4: objspace_xfree (gc.c:10774)
==788214==    by 0x4942AC4: ruby_sized_xfree (gc.c:10935)
==788214==    by 0x4942AC4: ruby_sized_xfree (gc.c:10932)
==788214==    by 0x10822136: xmlTextReaderFreePropList (xmlreader.c:273)
...
==788214==  Block was alloc'd at
==788214==    at 0x483C855: malloc (vg_replace_malloc.c:381)
==788214==    by 0x494D3AF: objspace_xmalloc0 (gc.c:10630)
==788214==    by 0x494D3AF: ruby_xmalloc0 (gc.c:10851)
==788214==    by 0x494D3AF: ruby_xmalloc_body (gc.c:10860)
==788214==    by 0x494D3AF: ruby_xmalloc (gc.c:12799)
==788214==    by 0x1079A1AD: xmlNewPropInternal (tree.c:1875)
==788214==    by 0x1083A9CF: xmlSAX2AttributeNs (SAX2.c:2022)
MIKE: xml_node.c:mark: node=0xe2f43b0, document=0x6666666666666666

🎉

OK, now I fully understand the problem, and hopefully you do, too.
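For anyone who wants to see the free-fill effect without running valgrind, here is a toy imitation. It is a sketch only: `toy_alloc` and `toy_free_fill` are hypothetical helpers over a static arena (so reading after the "free" stays well-defined C), and the fill byte 0x66 is chosen to match the valgrind option above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a node struct holding a document pointer. */
typedef struct toy_attr {
    void *doc;
} toy_attr;

/* Static storage, so inspecting it after the simulated free is still defined. */
static toy_attr arena[1];

static toy_attr *toy_alloc(void *doc) {
    arena[0].doc = doc;
    return &arena[0];
}

/* Imitates a free that overwrites the block with 0x66 bytes. */
static void toy_free_fill(toy_attr *a) {
    memset(a, 0x66, sizeof *a);
}

/* Returns 1 if every byte of the stale doc pointer is the fill byte. */
static int doc_pointer_is_all_0x66(void) {
    static int some_doc;
    toy_attr *a = toy_alloc(&some_doc);
    assert(a->doc == (void *)&some_doc);  /* valid right after allocation */

    toy_free_fill(a);                     /* struct is released and scribbled over */

    unsigned char bytes[sizeof(void *)];
    memcpy(bytes, &a->doc, sizeof bytes); /* inspect the raw bytes only */
    for (size_t i = 0; i < sizeof bytes; i++) {
        if (bytes[i] != 0x66) {
            return 0;
        }
    }
    return 1;
}
```

On a 64-bit platform those bytes print as `0x6666666666666666`, the same pattern as the document pointer in the last log line above.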

What can we do about this?

See my next comment for some ideas.

@tenderlove
Member

How is xmlTextReaderFreeNode getting called? Is the attribute object outliving the reader object?

@flavorjones
Member Author

xmlTextReaderRead calls this in the process of cursoring through the document. When the user calls Reader#read to move on to the next node, there are no guarantees that the previous node's memory won't be freed.

@flavorjones
Member Author

How to move forward

Some thoughts on what we might do.

The naive solution

A reasonable idea might be to move the ruby-wrapped-ness test out of the mark function and back into the wrap function. This would avoid this particular code path, and would certainly segfault less.

We could do that by making a second rb_data_type_t struct:

static const rb_data_type_t nokogiri_node_type_nowrap = {
  "Nokogiri/XMLNode",
  {
    0, (gc_callback_t)_xml_node_dealloc, 0,
#ifdef HAVE_RB_GC_LOCATION
    (gc_callback_t)_xml_node_update_references
#endif
  },
  0, 0,
#ifdef RUBY_TYPED_FREE_IMMEDIATELY
  RUBY_TYPED_FREE_IMMEDIATELY,
#endif
};

and use this at wrap time:

  if (node_has_a_document) {
    rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type, c_node) ;
  } else {
    rb_node = TypedData_Wrap_Struct(rb_class, &nokogiri_node_type_nowrap, c_node) ;
  }

"It's complicated"

But if the node's C pointer is unsafe to dereference at "mark time", then it's also going to be unsafe to dereference at "update references time". So we will have made this crash less often, but the path to segfaulting still exists, and it's:

  • create the XML::Attr Ruby object, wrapped around a C struct
  • the C struct's memory is freed and re-used
  • the Ruby object is compacted
  • and the _xml_node_update_references function is called, dereferencing the C struct pointer and segfaulting
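The remaining path can be sketched the same way. This is an illustrative model, not the real Ruby GC: `toy_type`, `run_mark`, `run_compact`, and `node_reads` are hypothetical names, and the two callbacks only count how often the node pointer is dereferenced.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_node { void *private_ref; } toy_node;  /* hypothetical node stand-in */
typedef void (*gc_callback)(toy_node *);

/* Hypothetical, much-reduced version of a typed-data descriptor. */
typedef struct toy_type {
    gc_callback mark;
    gc_callback compact;
} toy_type;

typedef struct toy_wrapped {
    toy_node *c_node;
    const toy_type *type;
} toy_wrapped;

static int node_reads;  /* number of times a callback dereferenced the node */

static void toy_mark(toy_node *n) {
    node_reads++;
    (void)n->private_ref;
}

static void toy_update_references(toy_node *n) {
    node_reads++;
    (void)n->private_ref;
}

static const toy_type type_full = { toy_mark, toy_update_references };
static const toy_type type_nowrap = { NULL, toy_update_references };

static void run_mark(const toy_wrapped *w) {
    if (w->type->mark != NULL) {
        w->type->mark(w->c_node);
    }
}

static void run_compact(const toy_wrapped *w) {
    if (w->type->compact != NULL) {
        w->type->compact(w->c_node);
    }
}

/* Counts node dereferences for one mark pass, optionally followed by compaction. */
static int reads_during(int use_nowrap, int compact_too) {
    toy_node n = { NULL };
    toy_wrapped w = { &n, use_nowrap ? &type_nowrap : &type_full };
    node_reads = 0;
    run_mark(&w);
    if (compact_too) {
        run_compact(&w);
    }
    assert(node_reads <= 2);
    return node_reads;
}
```

With the "nowrap" descriptor the mark pass no longer reads the node, but a compaction pass still does, which is the residual path described in the list above.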

So I think the real culprit here is the fact that Nokogiri is wrapping Ruby objects around C structs that are ephemeral and can't be relied upon to last the entire lifetime of the Ruby object.

So: I think that Reader#attribute_nodes as currently implemented is incompatible with Ruby's garbage collector, and is unsafe to exist in its current state.

Option 1: Copy the C struct

We could, in Reader#attribute_nodes, make copies of all the C structs, and then clean up that memory when the Ruby object gets garbage collected.

Although this option seems simple, there's a lot not to like about it.

First, we still have to maintain a separate object lifecycle for this particular kind of node, and I don't like that complication in Nokogiri. Over the years I've had to spend quite a bit of time debugging Reader memory problems, and I'd prefer to use this as an opportunity to simplify the code.

Second, making a copy of a single C struct doesn't solve the deeper problem, which is that libxml2 structs represent a graph of nodes. The xmlAttribute struct looks like this:

struct _xmlAttribute {
    void           *_private;	        /* application data */
    xmlElementType          type;       /* XML_ATTRIBUTE_DECL, must be second ! */
    const xmlChar          *name;	/* Attribute name */
    struct _xmlNode    *children;	/* NULL */
    struct _xmlNode        *last;	/* NULL */
    struct _xmlDtd       *parent;	/* -> DTD */
    struct _xmlNode        *next;	/* next sibling link  */
    struct _xmlNode        *prev;	/* previous sibling link  */
    struct _xmlDoc          *doc;       /* the containing document */

    struct _xmlAttribute  *nexth;	/* next in hash table */
    xmlAttributeType       atype;	/* The attribute type */
    xmlAttributeDefault      def;	/* the default */
    const xmlChar  *defaultValue;	/* or the default value */
    xmlEnumerationPtr       tree;       /* or the enumeration tree if any */
    const xmlChar        *prefix;	/* the namespace prefix if any */
    const xmlChar          *elem;	/* Element holding the attribute */
};

Which of these members should we NULL out (possibly breaking current functionality), and which should we (gulp) also make copies of and garbage collect?

The idea of managing our own separate DOM graph makes me really nervous, and this feels like the risk/reward ratio is too high.
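A tiny sketch of why a single-struct copy is not enough. The types here are hypothetical and heavily reduced (`toy_attr` keeps only a name, a `children` link, and a `next` link); the point is just that struct assignment copies pointers, not the things they point to.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_node {
    const char *content;
} toy_node;

/* Hypothetical, reduced attribute struct with two outgoing links. */
typedef struct toy_attr {
    const char *name;
    toy_node *children;      /* link to the value node */
    struct toy_attr *next;   /* link to the next sibling */
} toy_attr;

/* Returns 1 if a shallow copy still points into the original graph. */
static int shallow_copy_still_aliases(void) {
    toy_node value = { "1" };
    toy_attr second = { "ht", NULL, NULL };
    toy_attr first = { "r", &value, &second };

    toy_attr copy = first;   /* copies the pointers, not the pointees */
    assert(copy.name == first.name);

    return copy.children == first.children && copy.next == first.next;
}
```

If the original value node or sibling is later freed, the copy's `children` and `next` members dangle just like the original's did, so every reachable struct would need the same treatment.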

Option 2: Deprecate Reader#attribute_nodes

Reader#attribute_nodes is the only method in Nokogiri that wraps xmlReader-returned C structs in Ruby objects; and yet, it's not clear to me that it is called directly very often.

Creek, for example, calls Reader#attributes which simply returns a hash of name-value pairs. The fact that Reader#attributes calls #attribute_nodes is an implementation detail that we can easily change without impacting Creek or its users.

And Reader#attributes is the only call site for Reader#attribute_nodes within Nokogiri.

So another idea is to deprecate Reader#attribute_nodes, emitting a warning message for anyone calling it directly; and then re-implement Reader#attributes to create only primitive Hash and String objects. Until the method is removed (in v1.15?) we'd keep the current v1.13.7 implementation around.

Another advantage: after the method is removed, we will get to eliminate the special GC handling code in xml_node.c, unifying and simplifying our code.
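The core of this option is "copy the primitive data out while the struct is known to be valid, and keep no pointer to the struct". A rough C sketch of that idea follows; it is not the proposed Nokogiri implementation. `toy_attr`, `pair`, `snapshot_attributes`, and `free_snapshot` are hypothetical names, and a real version would build Ruby String and Hash objects rather than malloc'd pairs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced attribute list owned by the reader. */
typedef struct toy_attr {
    const char *name;
    const char *value;
    struct toy_attr *next;
} toy_attr;

/* Caller-owned copy of one attribute. */
typedef struct pair {
    char *key;
    char *value;
} pair;

static char *dup_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}

static void free_snapshot(pair *pairs, int count) {
    for (int i = 0; i < count; i++) {
        free(pairs[i].key);
        free(pairs[i].value);
    }
    free(pairs);
}

/* Copies every name/value pair; returns the count, or -1 if allocation fails. */
static int snapshot_attributes(const toy_attr *head, pair **out) {
    int count = 0;
    for (const toy_attr *a = head; a != NULL; a = a->next) {
        count++;
    }
    pair *pairs = calloc(count > 0 ? (size_t)count : 1, sizeof *pairs);
    if (pairs == NULL) {
        return -1;
    }
    int i = 0;
    for (const toy_attr *a = head; a != NULL; a = a->next, i++) {
        pairs[i].key = dup_string(a->name);
        pairs[i].value = dup_string(a->value);
        if (pairs[i].key == NULL || pairs[i].value == NULL) {
            free_snapshot(pairs, count);
            return -1;
        }
    }
    *out = pairs;
    return count;
}

/* Returns 1 if the snapshot is unaffected when the reader's buffers are overwritten. */
static int snapshot_survives_reader_advance(void) {
    char name_buf[] = "ht";
    char value_buf[] = "15";
    toy_attr attr = { name_buf, value_buf, NULL };

    pair *pairs = NULL;
    int count = snapshot_attributes(&attr, &pairs);
    if (count != 1) {
        return 0;
    }

    /* The reader "recycles" its buffers; the terminators are left in place. */
    memset(name_buf, 0x66, sizeof name_buf - 1);
    memset(value_buf, 0x66, sizeof value_buf - 1);

    int ok = strcmp(pairs[0].key, "ht") == 0 && strcmp(pairs[0].value, "15") == 0;
    free_snapshot(pairs, count);
    assert(ok == 0 || ok == 1);
    return ok;
}
```

Because the snapshot owns its strings, nothing the reader does afterwards can change or invalidate it, and the garbage collector never needs to look at a reader-owned struct.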

This is the least-bad idea I can come up with. Anybody else have other ideas?

@stevecheckoway
Contributor

I haven't read this closely enough to have an opinion, but did you mean xmlAttribute or xmlAttr? xmlAttribute is for a DTD and xmlAttr is for property nodes in the DOM.

struct _xmlAttr {
    void           *_private;	/* application data */
    xmlElementType   type;      /* XML_ATTRIBUTE_NODE, must be second ! */
    const xmlChar   *name;      /* the name of the property */
    struct _xmlNode *children;	/* the value of the property */
    struct _xmlNode *last;	/* NULL */
    struct _xmlNode *parent;	/* child->parent link */
    struct _xmlAttr *next;	/* next sibling link  */
    struct _xmlAttr *prev;	/* previous sibling link  */
    struct _xmlDoc  *doc;	/* the containing document */
    xmlNs           *ns;        /* pointer to the associated namespace */
    xmlAttributeType atype;     /* the attribute type if validating */
    void            *psvi;	/* for type/PSVI information */
};

Either way, maintaining copies seems tricky at best.

@flavorjones
Member Author

@stevecheckoway Ah, thank you for catching that! It's been a long week. I absolutely meant xmlAttr and not xmlAttribute, but the essence of my point still holds -- there are lots of pointer members connecting a graph of structs.

@flavorjones
Member Author

There's another option that @tenderlove suggested in a chat today ...

Option 3: ask the Reader to preserve nodes

xmlReader provides a function xmlTextReaderPreserve which "tells the XML Reader to preserve the current node" which, based on an incomplete understanding of the mechanics of that preservation, seems to meet our needs.

I'm concerned, though, about automatically asking libxml2 to preserve any node on which we've called #attribute_nodes, because in the degenerate case we'll be holding the entire document in memory, which defeats the reason many people use Reader in the first place: it doesn't require as much memory.

Side note: the JRuby implementation of Nokogiri::XML::Reader suffers from this problem, and we have a long history of complaints about it:

But if we assume that we could call xmlTextReaderPreserve on any node we needed without impacting users, then we're still faced with the challenge of extending the already complex memory lifecycle of an xmlReader-generated xmlAttr to also track the lifecycle of an unwrapped xmlDoc as prescribed in the comments for xmlTextReaderPreserve.

Additionally we'd be trusting libxml2 to not be buggy on a little-used and untested code path, which ... well, we've been burned before.

I honestly think the better thing to do is to plan to avoid instantiating wrapped objects related to Reader. Let's move towards a simpler implementation rather than a more complex implementation. Avoiding this option embraces the spirit of Reader being a cursor through the document, allows us to be very conservative with memory usage, and the worst-case scenario is we add more methods to Reader to allow users to get only the information they need.

I'm open to hearing other opinions, but this doesn't seem like a better option to me than "Option 2" above.

@flavorjones
Member Author

Trying to get to a decision here ... I've written a PR that implements Option 2, and I like it: #2599

I'm not going to try to implement Option 3, though I'm open to considering it if someone else wants to try!

@flavorjones flavorjones added this to the v1.13.x patch releases milestone Jul 22, 2022
flavorjones added a commit that referenced this issue Jul 22, 2022
…de-gc

fix: XML::Reader XML::Attr garbage collection

---

**What problem is this PR intended to solve?**

This is a proposed fix for #2598, see that issue for an extended explanation of the problem.

This PR implements "option 2" from that issue's proposed solutions:

- introduce a new `Reader#attribute_hash` that will return a `Hash<String ⇒ String>` (instead of an `Array<XML::Attr>`)
- deprecate `Reader#attribute_nodes` with a plan to remove it entirely in a future release
- re-implement `Reader#attributes` to use `#attribute_hash` (instead of `#attribute_nodes`)

After this change, only applications calling `Reader#attribute_nodes` directly will be running the unsafe code. These users will see a deprecation warning and may use `#attribute_hash` as a replacement.

I think it's very possible that `Reader#attribute_hash` won't meet the needs of people who are working with namespaced attributes and are using `#attribute_nodes` for this purpose. However, I'm intentionally deferring any attempt to solve that DX problem until someone who needs this functionality asks for it.

**Have you included adequate test coverage?**

I tried and failed to add test coverage to the suite that would reproduce the underlying GC bug.

However, existing test coverage of `Reader#attributes` is sufficient for now.


**Does this change affect the behavior of either the C or the Java implementations?**

This PR modifies both the C and Java implementations to behave the same.

Notably, the Java implementation contains a small bugfix which is that `Reader#namespaces` now returns an empty hash when there are no namespaces (it previously returned `nil`).
flavorjones added a commit that referenced this issue Jul 22, 2022
…de-gc_backport-v1.13.x

fix: XML::Reader XML::Attr garbage collection (backport to v1.13.x)
@flavorjones
Member Author

Cutting v1.13.8 shortly to fix this.

@flavorjones
Member Author

flavorjones added a commit that referenced this issue Nov 28, 2023
**What problem is this PR intended to solve?**

Before a minor release, I generally review deprecations and look for
things we can remove.

* Removed `Nokogiri::HTML5.get` which was deprecated in v1.12.0. [#2278]
(@flavorjones)
* Removed the CSS-to-XPath utility modules
`XPathVisitorAlwaysUseBuiltins` and `XPathVisitorOptimallyUseBuiltins`,
which were deprecated in v1.13.0 in favor of `XPathVisitor` constructor
args. [#2403] (@flavorjones)
* Removed `XML::Reader#attribute_nodes` which was deprecated in v1.13.8
in favor of `#attribute_hash`. [#2598, #2599] (@flavorjones)

Also we're now specifying version numbers in remaining deprecation
warnings.

**Have you included adequate test coverage?**

Tests have been removed, otherwise no new coverage needed.

**Does this change affect the behavior of either the C or the Java
implementations?**

As documented above.