Exceeded recursion depth with multiple schema locations for namespaces not matching targetNamespace of referenced schemas #324

Closed
prettyv opened this issue Aug 25, 2022 · 7 comments
Labels
bug Something isn't working

prettyv commented Aug 25, 2022

When an XML document lists multiple schema locations in its xsi:schemaLocation attribute, and the namespaces paired with them don't match the targetNamespace of the referenced schemas, schema loading recurses infinitely.

Minimal example to reproduce:

import xmlschema

xmlschema.XMLSchema10(
    "http://www.loc.gov/standards/mets/mets.xsd",
    locations=[
        ("http://www.loc.gov/standards/mix/", "http://www.loc.gov/standards/mix/mix.xsd"),
        ("http://www.loc.gov/standards/premis/", "http://www.loc.gov/standards/premis/v2/premis-v2-0.xsd"),
    ],
)

When there is only one namespace mismatch I at least get an XMLSchemaParseError: cannot import namespace 'http://www.w3.org/1999/xlink': imported schema 'http://www.loc.gov/standards/mix/mix.xsd' has an unmatched namespace 'http://www.loc.gov/standards/mix/'.

I ran into this when trying to simply instantiate an XmlDocument from a METS file that was produced by Goobi Workflow with a root element like this:

<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:dv="http://dfg-viewer.de/" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" OBJID="001" xsi:schemaLocation="http://www.loc.gov/standards/premis/ http://www.loc.gov/standards/premis/v2/premis-v2-0.xsd http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-7.xsd http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd http://www.loc.gov/standards/mix/ http://www.loc.gov/standards/mix/mix.xsd">
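
For reference, the value of xsi:schemaLocation is a whitespace-separated list that alternates namespace URIs and location hints. Below is a minimal sketch of splitting it into pairs with the standard library only. The helper name `schema_location_pairs` is made up for illustration, and the root element is shortened from the one above.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def schema_location_pairs(root):
    """Return (namespace, location) pairs from the root's xsi:schemaLocation."""
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    # Tokens alternate: a namespace URI, then the location hint paired with it.
    return list(zip(tokens[0::2], tokens[1::2]))


# Shortened version of the METS root element (two of the four pairs).
root = ET.fromstring(
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.loc.gov/standards/mix/ '
    'http://www.loc.gov/standards/mix/mix.xsd '
    'http://www.loc.gov/METS/ '
    'http://www.loc.gov/standards/mets/mets.xsd"/>'
)
pairs = schema_location_pairs(root)
```

Each pair is a (namespace, location) tuple in the same shape as the entries of the locations argument in the minimal example.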

After an admittedly quick glance at the XML Schema specifications, I'm not sure whether a mismatch like this is even supposed to be an error. Even so, there definitely shouldn't be infinite recursion when there are multiple mismatches. Since I'm not familiar with the codebase, I haven't been able to find an obvious fix yet.

Tested in a venv with the current v2.0.2 under Python 3.10.6 on Arch Linux.

@brunato brunato added the bug Something isn't working label Aug 26, 2022
Member

brunato commented Aug 26, 2022

Hi,

This is strictly related to the locations argument: it happens when an imported schema has a different targetNamespace than the one declared for it in locations. To solve this I will filter the imported namespace out of self._locations in the import_schema() method.

Thank you

Author

prettyv commented Aug 27, 2022

Thanks, I made some local adjustments in import_schema() according to my interpretation of your hint to test this out, and that worked for now (without breaking tests). If you want I can make a PR with these changes as well, although I'm not too confident I understand the whole thing well enough.

I also spent a bit more time wading through the XML Schema specs but I still couldn't find anything that explicitly says that the namespace paired with a schema location in xsi:schemaLocation has to match the targetNamespace of the schema root. The way I'm currently thinking about it is akin to import name aliasing, i.e. when schema locations are specified with a (locally defined) namespace this effectively aliases the target namespace of the referenced schema. As such I have changed the if schema.target_namespace != namespace: check to log a warning instead of raising an exception for my local testing.

Do you know where this behaviour would be more explicitly specified or how other validators handle this namespace issue? From searching around, my impression is that this is not handled uniformly across available tools, but I might be wrong.

Member

brunato commented Aug 27, 2022

> Thanks, I made some local adjustments in import_schema() according to my interpretation of your hint to test this out, and that worked for now (without breaking tests). If you want I can make a PR with these changes as well, although I'm not too confident I understand the whole thing well enough.

Thank you, but I already made the changes in my develop branch. In import_schema() replace locations=self._locations with:

locations=[x for x in self._locations if x[0] != namespace]
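
To illustrate why this filter stops the recursion, here is a toy model. It is not xmlschema's actual code: the `load` function, the file names and the `urn:` namespaces are all hypothetical. The assumption is that a loaded schema is registered under its real targetNamespace, so a mismatched hint namespace never counts as "already loaded" and keeps being re-imported with the same locations.

```python
def load(url, locations, target_ns, loaded, filter_imported, depth=0, max_depth=25):
    """Toy schema loader: return the deepest nesting level reached."""
    if depth > max_depth:
        raise RecursionError(f"gave up at depth {depth}")
    # The schema is registered under its *real* target namespace.
    loaded.add(target_ns[url])
    deepest = depth
    for namespace, location in locations:
        if namespace in loaded:
            continue  # hinted namespace already known, nothing to import
        if filter_imported:
            # The fix: the imported schema no longer sees its own hint.
            child_locations = [x for x in locations if x[0] != namespace]
        else:
            child_locations = locations
        deepest = max(
            deepest,
            load(location, child_locations, target_ns, loaded,
                 filter_imported, depth + 1, max_depth),
        )
    return deepest


# Hypothetical schemas whose real namespaces differ from the hinted ones.
TARGET_NS = {"main.xsd": "urn:main", "a.xsd": "urn:real-a", "b.xsd": "urn:real-b"}
LOCATIONS = [("urn:hint-a", "a.xsd"), ("urn:hint-b", "b.xsd")]

try:
    load("main.xsd", LOCATIONS, TARGET_NS, set(), filter_imported=False)
    unfiltered_terminates = True
except RecursionError:
    unfiltered_terminates = False

filtered_depth = load("main.xsd", LOCATIONS, TARGET_NS, set(), filter_imported=True)
```

In this model the unfiltered run hits the depth guard, while the filtered run terminates because each nested import receives a strictly shorter list of locations.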

> I also spent a bit more time wading through the XML Schema specs but I still couldn't find anything that explicitly says that the namespace paired with a schema location in xsi:schemaLocation has to match the targetNamespace of the schema root. The way I'm currently thinking about it is akin to import name aliasing, i.e. when schema locations are specified with a (locally defined) namespace this effectively aliases the target namespace of the referenced schema. As such I have changed the if schema.target_namespace != namespace: check to log a warning instead of raising an exception for my local testing.

For xs:import, the namespace provided must match the targetNamespace of the referenced XSD (if the resource is fetched and downloaded). If no namespace is provided, the XSD root must not contain a targetNamespace attribute.

Using xsi:schemaLocation in an XML instance is very similar if the namespace is not already imported in another associated schema. Otherwise the incongruence of importing a schema (a namespace ...) and having another one cannot be resolved (a possible solution in this case could be discarding the schema, but this requires a complex fix).
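
As a rough illustration of that rule, the check can be sketched with the standard library alone. This is not how xmlschema implements it: the function names and the `urn:example:` namespaces are invented for the example, and `None` stands for "no namespace".

```python
import xml.etree.ElementTree as ET


def target_namespace(xsd_text):
    """Return the targetNamespace of an XSD root, or None if it has none."""
    return ET.fromstring(xsd_text).get("targetNamespace")


def import_matches(expected_namespace, xsd_text):
    """True if the namespace given on import equals the schema's targetNamespace.

    expected_namespace=None means "no namespace": the schema root must then
    have no targetNamespace attribute at all.
    """
    return expected_namespace == target_namespace(xsd_text)


# Hypothetical schema documents, reduced to their root elements.
XSD_WITH_NS = (
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
    'targetNamespace="urn:example:real"/>'
)
XSD_NO_NS = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'

matching = import_matches("urn:example:real", XSD_WITH_NS)
mismatching = import_matches("urn:example:hinted", XSD_WITH_NS)
no_namespace = import_matches(None, XSD_NO_NS)
```

The mismatching case corresponds to the situation in this issue, where the namespace paired with a location differs from what the schema at that location declares.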

> Do you know where this behaviour would be more explicitly specified or how other validators handle this namespace issue? From searching around, my impression is that this is not handled uniformly across available tools, but I might be wrong.

I will build a test case for checking this and also try it with other software.
I will also check this in a book about XML Schema (Definitive XML Schema, 2nd Edition) when I return to my office.

Author

prettyv commented Aug 28, 2022

Yeah, for xs:import I understood it that way, while xsi:schemaLocation could be argued to be the same by analogy, but as I said I couldn't find this stated explicitly. I see that it would also be more complex to handle if these were treated differently.

I tested locally with 2 validating environments I had available easily:

  • Visual Studio Code has an extension with schema validation support by Red Hat which uses the LemMinX language server which in turn makes use of the Java implementation of Xerces (all using current release versions). Using this an XML instance with mismatching namespaces for schema locations is evaluated as valid.
  • The C version of Xerces available via pacman on Arch Linux on the other hand does see these as an error (called via PParse -n -s -f METS.xml)

Would be interesting to see how Saxon or Oxygen validate this if you have access to them. Thanks for looking into it!

Member

brunato commented Aug 29, 2022

> I see that it would also be more complex to handle if these were treated differently.

The complexity lies in removing a schema after import if it has a different namespace than expected. The imported schema can have its own <xs:include ../> and <xs:import .../> statements, which are processed recursively during schema initialization. But I'm pretty sure that it is not possible to have a namespace mismatch in imports, even if they are processed using XSI schema location hints.

> I tested locally with 2 validating environments I had available easily:
>
> * Visual Studio Code has an [extension](https://github.com/redhat-developer/vscode-xml) with schema validation support by Red Hat which uses the [LemMinX](https://github.com/eclipse/lemminx/) language server which in turn makes use of the Java implementation of Xerces (all using current release versions). Using this an XML instance with mismatching namespaces for schema locations is evaluated as valid.

I tested with an online validator based on Xerces-J, and the instance was assessed as valid only if there is an explicit import in the origin schema, so XSI location hints in the XML are ignored. But looking at the Xerces-J source and some comments, this behavior is configurable and LemMinX may not use the default setting.

> * The `C` version of Xerces available via pacman on Arch Linux on the other hand does see these as an error (called via `PParse -n  -s -f METS.xml`)

Testing with xmllint also gives me errors, but looking at the libxml2 source code this should be possible when no schema is provided. In this case it is probably an interface limitation, since xmllint requires a schema instance to do validation.

> Would be interesting to see how Saxon or Oxygen validate this if you have access to them. Thanks for looking into it!

I will try Saxon-HE (which is open source) using the Python interface. In this case too, the usage of XSI location hints should be an option.

I will probably make this optional for xmlschema's document API as well (e.g. xmlschema.validate).

Member

brunato commented Aug 30, 2022

After a check in Definitive XML Schema I can confirm that if a schema is retrieved using an XSI schema location hint, its targetNamespace must match the namespace paired with the location.

The usage of XSI location hints is not mandatory and depends on the implementation.

Currently xmlschema uses XSI schema location hints when a schema instance is not already available. I am adding the optional argument use_location_hints=True to the document-level API to allow schema location hints to be ignored.
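
A minimal sketch of how this option might be used from the document API, assuming an installed xmlschema release that includes the argument (v2.0.4 or later according to this thread). The wrapper name `validate_ignoring_hints` and its parameters are made up for illustration; xmlschema is imported inside the function so the snippet can be loaded without the package.

```python
def validate_ignoring_hints(xml_source, xsd_source):
    """Validate xml_source against xsd_source, ignoring xsi:schemaLocation hints.

    Assumes xmlschema >= 2.0.4; raises a validation error if the document
    is invalid and returns None otherwise.
    """
    import xmlschema  # third-party package, imported lazily

    # With use_location_hints=False the schema locations listed in the XML
    # instance are not used to load additional schemas.
    xmlschema.validate(xml_source, schema=xsd_source, use_location_hints=False)
```

A call such as `validate_ignoring_hints("METS.xml", "mets.xsd")` would then rely only on the schema passed explicitly; both file names here are placeholders.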

Member

brunato commented Sep 8, 2022

Hi @prettyv,

a fix is available with release v2.0.4.

A namespace mismatch still generates an import error. This is the correct behavior. If you need to create a schema with a non-standard import, create the schema instance providing a list of sources or a list of locations and validation='skip' (but strange behavior may be expected from schemas that are not fully compliant).

In v2.0.4 I have also added the use_location_hints option, which allows schema location hints to be ignored.

Best regards

@brunato brunato closed this as completed Sep 15, 2022