Skip to content

Commit

Permalink
fix: RelaxNG.read_memory is now reliable (#3259)
Browse files Browse the repository at this point in the history
**What problem is this PR intended to solve?**

On both JRuby and CRuby impls, replace the native `read_memory` method
from XML::Schema and XML::RelaxNG with a ruby method that uses
`from_document`. `read_memory` was buggy on both platforms, but
especially on the JRuby impl.

This is comparable in performance on CRuby. From a benchmark taken
before this change:

```
Warming up --------------------------------------
          Schema.new     1.219k i/100ms
Schema.from_document     1.258k i/100ms
  Schema.read_memory     1.118k i/100ms
Calculating -------------------------------------
          Schema.new     12.160k (± 8.3%) i/s -    121.900k in  10.093638s
Schema.from_document     12.216k (± 8.6%) i/s -    122.026k in  10.059696s
  Schema.read_memory     12.790k (±10.7%) i/s -    127.452k in  10.105931s

Comparison:
  Schema.read_memory:    12789.6 i/s
Schema.from_document:    12215.9 i/s - same-ish: difference falls within error
          Schema.new:    12160.1 i/s - same-ish: difference falls within error
```

IMHO the resulting code is less buggy and easier to maintain, and the
slight (if any) performance hit is worth it.

Also: improve documentation for Schema and RelaxNG classes.

Closes #2113
Closes #2115


**Have you included adequate test coverage?**

Yes.


**Does this change affect the behavior of either the C or the Java
implementations?**

Brings JRuby and CRuby into alignment.
  • Loading branch information
flavorjones authored Jun 24, 2024
2 parents 73d9f35 + 49c5b8b commit 925f3be
Show file tree
Hide file tree
Showing 7 changed files with 226 additions and 163 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

### Improved
* Documentation has been improved for `CSS.xpath_for`. [#3224] @flavorjones
* `XML::Schema#read_memory` and `XML::RelaxNG#read_memory` are now Ruby methods that call `#from_document`. Previously these were native functions, but they were buggy on both CRuby and JRuby (but worse on JRuby) and so this is now useful, comparable in performance, and simpler code that is easier to maintain. [#2113, #2115] @flavorjones
* [CRuby] When compiling packaged libraries from source, allow users' `AR` and `LD` environment variables to set the archiver and linker commands, respectively. This augments the existing `CC` environment variable to set the compiler command. [#3165] @ziggythehamster
* [CRuby] The HTML5 parse methods accept a `:parse_noscript_content_as_text` keyword argument which will emulate the parsing behavior of a browser which has scripting enabled. [#3178, #3231] @stevecheckoway
* [CRuby] `HTML5::DocumentFragment.parse` and `.new` accept a `:context` keyword argument that is the parse context node or element name. Previously this could only be passed in as a positional argument to `.new` and not at all to `.parse`. @flavorjones
Expand Down
13 changes: 0 additions & 13 deletions ext/java/nokogiri/XmlSchema.java
Original file line number Diff line number Diff line change
Expand Up @@ -175,19 +175,6 @@ public class XmlSchema extends RubyObject
return getSchema(context, (RubyClass)klazz, source, parseOptions);
}

@JRubyMethod(meta = true, required = 1, optional = 1)
public static IRubyObject
read_memory(ThreadContext context, IRubyObject klazz, IRubyObject[] args)
{
IRubyObject content = args[0];
IRubyObject parseOptions = null;
if (args.length > 1) {
parseOptions = args[1];
}
String data = content.convertToString().asJavaString();
return getSchema(context, (RubyClass) klazz, new StreamSource(new StringReader(data)), parseOptions);
}

private static IRubyObject
getSchema(ThreadContext context, RubyClass klazz, Source source, IRubyObject parseOptions)
{
Expand Down
48 changes: 14 additions & 34 deletions ext/nokogiri/xml_relax_ng.c
Original file line number Diff line number Diff line change
Expand Up @@ -17,12 +17,6 @@ static const rb_data_type_t xml_relax_ng_type = {
.flags = RUBY_TYPED_FREE_IMMEDIATELY | RUBY_TYPED_WB_PROTECTED,
};

/*
* call-seq:
* validate_document(document)
*
* Validate a Nokogiri::XML::Document against this RelaxNG schema.
*/
static VALUE
noko_xml_relax_ng__validate_document(VALUE self, VALUE document)
{
Expand Down Expand Up @@ -107,36 +101,23 @@ xml_relax_ng_parse_schema(
}

/*
* call-seq:
* read_memory(string)
* :call-seq:
* from_document(document) → Nokogiri::XML::RelaxNG
* from_document(document, parse_options) → Nokogiri::XML::RelaxNG
*
* Create a new RelaxNG from the contents of +string+
*/
static VALUE
read_memory(int argc, VALUE *argv, VALUE rb_class)
{
VALUE rb_content;
VALUE rb_parse_options;
xmlRelaxNGParserCtxtPtr c_parser_context;

rb_scan_args(argc, argv, "11", &rb_content, &rb_parse_options);

c_parser_context = xmlRelaxNGNewMemParserCtxt(
(const char *)StringValuePtr(rb_content),
(int)RSTRING_LEN(rb_content)
);

return xml_relax_ng_parse_schema(rb_class, c_parser_context, rb_parse_options);
}

/*
* call-seq:
* from_document(doc)
* Create a Schema from an already-parsed RELAX NG schema definition document.
*
* [Parameters]
* - +document+ (XML::Document) A XML::Document object representing the parsed RELAX NG
* - +parse_options+ (Nokogiri::XML::ParseOptions) ⚠ Unused
*
* [Returns] Nokogiri::XML::RelaxNG
*
* Create a new RelaxNG schema from the Nokogiri::XML::Document +doc+
* ⚠ +parse_options+ is currently unused by this method and is present only as a placeholder for
* future functionality.
*/
static VALUE
from_document(int argc, VALUE *argv, VALUE rb_class)
noko_xml_relax_ng_s_from_document(int argc, VALUE *argv, VALUE rb_class)
{
VALUE rb_document;
VALUE rb_parse_options;
Expand All @@ -159,8 +140,7 @@ noko_init_xml_relax_ng(void)
assert(cNokogiriXmlSchema);
cNokogiriXmlRelaxNG = rb_define_class_under(mNokogiriXml, "RelaxNG", cNokogiriXmlSchema);

rb_define_singleton_method(cNokogiriXmlRelaxNG, "read_memory", read_memory, -1);
rb_define_singleton_method(cNokogiriXmlRelaxNG, "from_document", from_document, -1);
rb_define_singleton_method(cNokogiriXmlRelaxNG, "from_document", noko_xml_relax_ng_s_from_document, -1);

rb_define_private_method(cNokogiriXmlRelaxNG, "validate_document", noko_xml_relax_ng__validate_document, 1);
}
64 changes: 16 additions & 48 deletions ext/nokogiri/xml_schema.c
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ static const rb_data_type_t xml_schema_type = {
};

static VALUE
validate_document(VALUE self, VALUE document)
noko_xml_schema__validate_document(VALUE self, VALUE document)
{
xmlDocPtr doc;
xmlSchemaPtr schema;
Expand Down Expand Up @@ -50,14 +50,8 @@ validate_document(VALUE self, VALUE document)
return errors;
}

/*
* call-seq:
* validate_file(filename)
*
* Validate a file against this Schema.
*/
static VALUE
validate_file(VALUE self, VALUE rb_filename)
noko_xml_schema__validate_file(VALUE self, VALUE rb_filename)
{
xmlSchemaPtr schema;
xmlSchemaValidCtxtPtr valid_ctxt;
Expand Down Expand Up @@ -96,7 +90,7 @@ xml_schema_parse_schema(
VALUE rb_parse_options
)
{
xmlExternalEntityLoader old_loader = 0;
xmlExternalEntityLoader saved_loader = 0;
libxmlStructuredErrorHandlerState handler_state;

if (NIL_P(rb_parse_options)) {
Expand All @@ -117,14 +111,14 @@ xml_schema_parse_schema(
);

if (c_parse_options & XML_PARSE_NONET) {
old_loader = xmlGetExternalEntityLoader();
saved_loader = xmlGetExternalEntityLoader();
xmlSetExternalEntityLoader(xmlNoNetExternalEntityLoader);
}

xmlSchemaPtr c_schema = xmlSchemaParse(c_parser_context);

if (old_loader) {
xmlSetExternalEntityLoader(old_loader);
if (saved_loader) {
xmlSetExternalEntityLoader(saved_loader);
}

xmlSchemaFreeParserCtxt(c_parser_context);
Expand All @@ -147,46 +141,21 @@ xml_schema_parse_schema(
}

/*
* call-seq:
* read_memory(string) → Nokogiri::XML::Schema
*
* Create a new schema parsed from the contents of +string+
*
* [Parameters]
* - +string+: String containing XML to be parsed as a schema
*
* [Returns] Nokogiri::XML::Schema
*/
static VALUE
read_memory(int argc, VALUE *argv, VALUE rb_class)
{
VALUE rb_content;
VALUE rb_parse_options;
xmlSchemaParserCtxtPtr c_parser_context;

rb_scan_args(argc, argv, "11", &rb_content, &rb_parse_options);

c_parser_context = xmlSchemaNewMemParserCtxt(
(const char *)StringValuePtr(rb_content),
(int)RSTRING_LEN(rb_content)
);

return xml_schema_parse_schema(rb_class, c_parser_context, rb_parse_options);
}

/*
* call-seq:
* :call-seq:
* from_document(document) → Nokogiri::XML::Schema
* from_document(document, parse_options) → Nokogiri::XML::Schema
*
* Create a new schema parsed from the +document+.
* Create a Schema from an already-parsed XSD schema definition document.
*
* [Parameters]
* - +document+: Nokogiri::XML::Document to be parsed
* - +document+ (XML::Document) A document object representing the parsed XSD
* - +parse_options+ (Nokogiri::XML::ParseOptions)
* Defaults to Nokogiri::XML::ParseOptions::DEFAULT_SCHEMA
*
* [Returns] Nokogiri::XML::Schema
*/
static VALUE
rb_xml_schema_s_from_document(int argc, VALUE *argv, VALUE rb_class)
noko_xml_schema_s_from_document(int argc, VALUE *argv, VALUE rb_class)
{
VALUE rb_document;
VALUE rb_parse_options;
Expand Down Expand Up @@ -236,9 +205,8 @@ noko_init_xml_schema(void)

rb_undef_alloc_func(cNokogiriXmlSchema);

rb_define_singleton_method(cNokogiriXmlSchema, "read_memory", read_memory, -1);
rb_define_singleton_method(cNokogiriXmlSchema, "from_document", rb_xml_schema_s_from_document, -1);
rb_define_singleton_method(cNokogiriXmlSchema, "from_document", noko_xml_schema_s_from_document, -1);

rb_define_private_method(cNokogiriXmlSchema, "validate_document", validate_document, 1);
rb_define_private_method(cNokogiriXmlSchema, "validate_file", validate_file, 1);
rb_define_private_method(cNokogiriXmlSchema, "validate_document", noko_xml_schema__validate_document, 1);
rb_define_private_method(cNokogiriXmlSchema, "validate_file", noko_xml_schema__validate_file, 1);
}
89 changes: 68 additions & 21 deletions lib/nokogiri/xml/relax_ng.rb
Original file line number Diff line number Diff line change
Expand Up @@ -3,36 +3,83 @@
module Nokogiri
module XML
class << self
###
# Create a new Nokogiri::XML::RelaxNG document from +string_or_io+.
# See Nokogiri::XML::RelaxNG for an example.
def RelaxNG(string_or_io, options = ParseOptions::DEFAULT_SCHEMA)
RelaxNG.new(string_or_io, options)
# :call-seq:
# RelaxNg(input) → Nokogiri::XML::RelaxNG
# RelaxNg(input, parse_options) → Nokogiri::XML::RelaxNG
#
# Parse a RELAX NG schema definition and create a new Schema object. This is a convenience
# method for Nokogiri::XML::RelaxNG.new
#
# See related: Nokogiri::XML::RelaxNG.new
#
# [Parameters]
# - +input+ (String, IO) RELAX NG schema definition
# - +parse_options+ (Nokogiri::XML::ParseOptions)
# Defaults to ParseOptions::DEFAULT_SCHEMA
#
# [Returns] Nokogiri::XML::RelaxNG
#
def RelaxNG(input, parse_options = ParseOptions::DEFAULT_SCHEMA)
RelaxNG.new(input, parse_options)
end
end

###
# Nokogiri::XML::RelaxNG is used for validating XML against a
# RelaxNG schema.
# Nokogiri::XML::RelaxNG is used for validating XML against a RELAX NG schema definition.
#
# == Synopsis
# *Example:* Determine whether an XML document is valid.
#
# Validate an XML document against a RelaxNG schema. Loop over the errors
# that are returned and print them out:
# schema = Nokogiri::XML::RelaxNG(File.read(RELAX_NG_FILE))
# doc = Nokogiri::XML(File.read(XML_FILE))
# schema.valid?(doc) # Boolean
#
# schema = Nokogiri::XML::RelaxNG(File.open(ADDRESS_SCHEMA_FILE))
# doc = Nokogiri::XML(File.open(ADDRESS_XML_FILE))
# *Example:* Validate an XML document against a RelaxNG schema, and capture any errors that are found.
#
# schema.validate(doc).each do |error|
# puts error.message
# end
# schema = Nokogiri::XML::RelaxNG(File.open(RELAX_NG_FILE))
# doc = Nokogiri::XML(File.open(XML_FILE))
# errors = schema.validate(doc) # Array<SyntaxError>
#
# The list of errors are Nokogiri::XML::SyntaxError objects.
#
# NOTE: RelaxNG input is always treated as TRUSTED documents, meaning that they will cause the
# underlying parsing libraries to access network resources. This is counter to Nokogiri's
# "untrusted by default" security policy, but is a limitation of the underlying libraries.
# ⚠ RELAX NG input is always treated as *trusted*, meaning that the underlying parsing libraries
# *will access network resources*. This is counter to Nokogiri's "untrusted by default" security
# policy, but is an unfortunate limitation of the underlying libraries. Please do not use this
# class for untrusted schema documents.
class RelaxNG < Nokogiri::XML::Schema
# :call-seq:
# new(input) → Nokogiri::XML::RelaxNG
# new(input, parse_options) → Nokogiri::XML::RelaxNG
#
# Parse a RELAX NG schema definition and create a new Schema object.
#
# [Parameters]
# - +input+ (String, IO) RELAX NG schema definition
# - +parse_options+ (Nokogiri::XML::ParseOptions)
# Defaults to ParseOptions::DEFAULT_SCHEMA ⚠ Unused
#
# [Returns] Nokogiri::XML::RelaxNG
#
# ⚠ +parse_options+ is currently unused by this method and is present only as a placeholder for
# future functionality.
def self.new(input, parse_options = ParseOptions::DEFAULT_SCHEMA)
read_memory(input, parse_options)
end

# :call-seq:
# read_memory(input) → Nokogiri::XML::RelaxNG
# read_memory(input, parse_options) → Nokogiri::XML::RelaxNG
#
# Parse a RELAX NG schema definition and create a new Schema object.
#
# [Parameters]
# - +input+ (String) RELAX NG schema definition
# - +parse_options+ (Nokogiri::XML::ParseOptions)
# Defaults to ParseOptions::DEFAULT_SCHEMA ⚠ Unused
#
# [Returns] Nokogiri::XML::RelaxNG
#
# ⚠ +parse_options+ is currently unused by this method and is present only as a placeholder for
# future functionality.
def self.read_memory(input, parse_options = ParseOptions::DEFAULT_SCHEMA)
from_document(Nokogiri::XML::Document.parse(input), parse_options)
end
end
end
end
Loading

0 comments on commit 925f3be

Please sign in to comment.