XML Schema Download incorrectly modifies the schema #387

Closed
AmeyaVS opened this issue Feb 22, 2024 · 22 comments
Labels
bug (Something isn't working), enhancement (New feature or request)

Comments

AmeyaVS commented Feb 22, 2024

I am trying to download an XML schema from a remote URL, and the library seems to modify one of the schema documents incorrectly.

Here's a snippet of the code to reproduce the issue:

import xmlschema
import os.path
import urllib.parse

def main():
    # Schema Base URL resource
    xsd_base_uri = "http://www.accellera.org/XMLSchema/IPXACT/1685-2022/index.xsd"

    # Extract Path from URI
    path = urllib.parse.urlparse(xsd_base_uri).path
    print(path)

    # Split path into path + resource name
    schema_path = os.path.split(path)
    print(schema_path)

    target_path = f"schemas/{schema_path[0]}"

    # Create the directory if it doesn't exist:
    os.makedirs(target_path, exist_ok=True)

    local_target_path = f"{target_path}/{schema_path[1]}"

    if not os.path.isfile(local_target_path):
        schema = xmlschema.XMLSchema(xsd_base_uri)
        schema.export(target=target_path, save_remote=True)
    schema = xmlschema.XMLSchema(local_target_path)

if __name__ == '__main__':
    main()

The library seems to be modifying the xs:import line in the autoConfigure.xsd document:

[image: side-by-side comparison of autoConfigure.xsd showing the modified xs:import line]

The left side is the original file downloaded from the URL:
http://www.accellera.org/XMLSchema/IPXACT/1685-2022/autoConfigure.xsd

Because of this edit, the exported XML Schema is no longer valid, and loading it fails with the following error:

xmlschema.validators.exceptions.XMLSchemaParseError: the QName 'xml:id' is mapped to the namespace 'http://www.w3.org/XML/1998/namespace', but this namespace has not an xs:import statement in the schema:

Schema component:

  <xs:attribute xmlns:xs="http://www.w3.org/2001/XMLSchema" ref="xml:id" />

Path: /xs:schema/xs:attributeGroup[2]/xs:attribute

There are other changes as well, but they have no significant impact.
Is there an option to download the XML Schemas without editing them?

Member

brunato commented Feb 22, 2024

Hi, thank you for the detailed explanation.

An alternative is the uri_mapper option, available since release v3.0.0 (download the schemas manually and then provide a map for remote URLs to local paths).
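
As a rough illustration of the mapping mentioned above, the following sketch builds a dictionary from remote schema URLs to local copies using only the standard library. The helper name `build_uri_map`, the list of file names, and the local directory are hypothetical, and the sketch assumes that `uri_mapper` accepts such a dictionary with `file://` URLs as values; check the xmlschema documentation for the exact forms it supports.

```python
from pathlib import Path
from urllib.parse import urljoin


def build_uri_map(base_url, file_names, local_dir):
    """Map each remote schema URL to the file:// URL of a manually downloaded copy.

    Hypothetical helper for illustration; it is not part of xmlschema.
    """
    local_root = Path(local_dir).resolve()
    return {
        # urljoin resolves each file name against the directory of base_url
        urljoin(base_url, name): (local_root / name).as_uri()
        for name in file_names
    }


uri_map = build_uri_map(
    "http://www.accellera.org/XMLSchema/IPXACT/1685-2022/index.xsd",
    ["index.xsd", "autoConfigure.xsd"],  # assumed subset of the schema files
    "schemas/XMLSchema/IPXACT/1685-2022",  # assumed local download directory
)

# Assumed usage, once the files exist locally:
#   xmlschema.XMLSchema(
#       "http://www.accellera.org/XMLSchema/IPXACT/1685-2022/index.xsd",
#       uri_mapper=uri_map,
#   )
```

With this approach the local files are never rewritten by the library, since they are saved by hand and only looked up through the map.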

Author

AmeyaVS commented Feb 22, 2024

For now, I have manually reverted the changes in the XML Schema that affect me so that I can move ahead.
Would it be possible to report (in logs, etc.) which files are being modified from their original sources?
I spent nearly half a day before realizing the underlying issue.

Member

brunato commented Feb 22, 2024

I could add a logger for the export method (in fact, the export_schema function), with an optional loglevel argument like the one already available for schema initialization/building.

brunato added the bug and enhancement labels Feb 22, 2024
Member

brunato commented Mar 11, 2024

A fix in this helper may be sufficient to solve this:

def replace_location(text: str, location: str, repl_location: str) -> str:
    repl = 'schemaLocation="{}"'.format(repl_location)
    pattern = r'\bschemaLocation\s*=\s*[\'\"].*%s.*[\'"]' % re.escape(location)
    return re.sub(pattern, repl, text)

The replacement pattern also matches the namespace attribute, so in your case the xs:import element for the XML namespace loses that attribute and the namespace is left without an import.

Another improvement, which would reduce unnecessary changes, could be to stop erasing residual non-remote locations.
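
The over-matching in the helper quoted above can be reproduced with the standard library alone. In the sketch below, the input line is constructed for illustration (it assumes schemaLocation appears before namespace on the same line, which may differ from the real autoConfigure.xsd), and `replace_location_bounded` is a hypothetical alternative, not necessarily the fix that went into the library.

```python
import re


def replace_location_greedy(text, location, repl_location):
    # Same pattern as the helper quoted above: the trailing '.*' is greedy,
    # so it runs to the last quote on the line, not the attribute's own quote.
    repl = 'schemaLocation="{}"'.format(repl_location)
    pattern = r'\bschemaLocation\s*=\s*[\'\"].*%s.*[\'"]' % re.escape(location)
    return re.sub(pattern, repl, text)


def replace_location_bounded(text, location, repl_location):
    # Hypothetical alternative: the match may not cross a quote character,
    # so it stays inside the schemaLocation attribute value.
    repl = 'schemaLocation="{}"'.format(repl_location)
    pattern = r'\bschemaLocation\s*=\s*([\'"])[^\'"]*%s[^\'"]*\1' % re.escape(location)
    return re.sub(pattern, lambda _match: repl, text)


# Constructed input (assumption): schemaLocation precedes namespace.
line = '<xs:import schemaLocation="xml.xsd" namespace="http://www.w3.org/XML/1998/namespace"/>'

greedy = replace_location_greedy(line, "xml.xsd", "local/xml.xsd")
bounded = replace_location_bounded(line, "xml.xsd", "local/xml.xsd")

print(greedy)   # <xs:import schemaLocation="local/xml.xsd"/>
print(bounded)  # <xs:import schemaLocation="local/xml.xsd" namespace="http://www.w3.org/XML/1998/namespace"/>
```

The greedy version drops the namespace attribute, which matches the symptom reported at the top of this issue.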

brunato added a commit that referenced this issue Mar 12, 2024
  - Fix the replacement pattern in export_schema()
  - Add loglevel argument, apply with a decorator
  - Add logger.debug statements
  - Don't remove non-remote residuals schemaLocation entries
Member

brunato commented Mar 13, 2024

The new release v3.1.0 has a fix for schema exports.
The replacement pattern has been replaced with a safer one (assuming that the source is a valid XML document ...), and the residual imports are cleared only if schemaLocation contains a remote URL.

A logging facility has also been added to the export_schema() function (it can be enabled by passing loglevel='DEBUG' to XMLSchema.export()).

Author

AmeyaVS commented Mar 13, 2024

I tried out the latest release, and it no longer seems to modify the XML schema.
However, it does not seem to download the xml.xsd dependency referenced in the import statement.
Should I create another bug report for that?

Member

brunato commented Mar 13, 2024

The XML namespace is already loaded within the meta-schema. An xs:import element still has to be present in the schema if the namespace is used (e.g. xml:base), but the import has no effect (and in many cases I found that the location points to an HTML page instead of a regular XSD file).

So downloading the remote xml.xsd (if any) is not necessary for xmlschema; the only problem was the removal of the namespace attribute from the xs:import statement.
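
One way to check an exported file for this kind of damage is to list the namespaces declared by its xs:import elements. The sketch below uses only the standard library; the function name `imported_namespaces` and both sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def imported_namespaces(schema_text):
    """Return the namespace values of the top-level xs:import elements."""
    root = ET.fromstring(schema_text)
    return {
        elem.get("namespace")
        for elem in root.findall("{%s}import" % XSD_NS)
        if elem.get("namespace") is not None
    }


# Illustrative documents: the second one mimics an export that lost the attribute.
intact = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="xml.xsd"/>
  <xs:attributeGroup name="id.att"><xs:attribute ref="xml:id"/></xs:attributeGroup>
</xs:schema>"""

damaged = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import schemaLocation="xml.xsd"/>
  <xs:attributeGroup name="id.att"><xs:attribute ref="xml:id"/></xs:attributeGroup>
</xs:schema>"""

print(XML_NS in imported_namespaces(intact))   # True
print(XML_NS in imported_namespaces(damaged))  # False
```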

Member

brunato commented Mar 13, 2024

To clarify: the schema export doesn't download anything; it only uses the already downloaded XSD sources contained in the schema instance and saves them locally.

Member

brunato commented Mar 14, 2024

> (and in many cases I found that the location points to an HTML page instead of a regular XSD file).

I'm sorry, I misremembered: the referenced xml.xsd (e.g. "http://www.w3.org/2001/xml.xsd") is an XSD file with a stylesheet.

Schema classes use a meta-schema that already has loaded a minimal set of base namespaces:

I cannot remove XML from the base namespaces because xml:lang is used in the meta-schema of the XSD namespace (with a regular import). The meta-schema plays a fundamental part in making validation and decoding efficient, although it can be rebuilt if needed.

Anyway, I think the export procedure can be extended with another option that attempts to load and save the residual locations referenced by skipped xs:import elements. I will try this approach for an upcoming release.

Member

brunato commented Mar 16, 2024

FYI about the special status of the above four base namespaces: https://www.w3.org/TR/xmlschema11-1/#sec-nss-special

Member

brunato commented Mar 18, 2024

@AmeyaVS: I will not change schema export to download skipped schemas such as xml.xsd, but in the next minor release I will add a new API, download_schemas(), for downloading a set of schemas given a URL as the starting point.

Author

AmeyaVS commented Mar 18, 2024

> @AmeyaVS: I will not change schema export to download skipped schemas such as xml.xsd, but in the next minor release I will add a new API, download_schemas(), for downloading a set of schemas given a URL as the starting point.

Makes sense. Thank you for getting a fix in quickly.

Author

AmeyaVS commented Mar 22, 2024

Should I close this issue in the meantime, or should I keep it open until the download_schemas() API is ready? I could then update here with details or any other observations.

Member

brunato commented Mar 22, 2024

Keep it open, the next minor release should be ready soon.

Member

brunato commented Apr 2, 2024

The download_schemas() API is available with release v3.2.0.

Author

AmeyaVS commented Apr 2, 2024

Hello @brunato ,

I used the following code to try out the download_schemas API.

import xmlschema
import os.path
import urllib.parse

from xmlschema import download_schemas

def main():
    # Schema Base URL resource
    xsd_base_uri = "http://www.accellera.org/XMLSchema/IPXACT/1685-2022/index.xsd"

    # Extract Path from URI
    path = urllib.parse.urlparse(xsd_base_uri).path
    print(path)

    # Split path into path + resource name
    schema_path = os.path.split(path)
    print(schema_path)

    target_path = f"schemas/{schema_path[0]}"

    # Create the directory if it doesn't exist:
    os.makedirs(target_path, exist_ok=True)

    local_target_path = f"{target_path}/{schema_path[1]}"

    # Download schemas
    download_schemas(xsd_base_uri, target="schemas2")

    if not os.path.isfile(local_target_path):
        schema = xmlschema.XMLSchema(xsd_base_uri)
        schema.export(target=target_path, save_remote=True)
    schema = xmlschema.XMLSchema(local_target_path)

if __name__ == '__main__':
    main()

I observe the following error messages on the console for these XSD URLs:

Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/addressBlockDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/registerFileDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/registerDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/memoryMapDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/enumerationDefinition.xsd: not well-formed (invalid token): line 15, column 46
Error parsing XML resource at URL http://www.accellera.org/XMLSchema/IPXACT/1685-2022/fieldDefinition.xsd: not well-formed (invalid token): line 15, column 46

While looking at the index.xsd source file:

<xs:schema xmlns:ipxact="http://www.accellera.org/XMLSchema/IPXACT/1685-2022" xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.accellera.org/XMLSchema/IPXACT/1685-2022" elementFormDefault="qualified">
	<xs:include schemaLocation="busDefinition.xsd"/>
	<xs:include schemaLocation="component.xsd"/>
	<xs:include schemaLocation="design.xsd"/>
	<xs:include schemaLocation="designConfig.xsd"/>
	<xs:include schemaLocation="abstractionDefinition.xsd"/>
	<xs:include schemaLocation="catalog.xsd"/>
	<xs:include schemaLocation="abstractor.xsd"/>
	<xs:include schemaLocation="typeDefinitions.xsd"/>
	<!-- <xs:include schemaLocation="memoryMapDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="addressBlockDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="registerFileDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="registerDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="fieldDefinition.xsd"/> -->
	<!-- <xs:include schemaLocation="enumerationDefinition.xsd"/> -->
	<xs:group name="IPXACTDocumentTypes">

It seems xmlschema is also processing the commented-out includes, which refer to invalid schema definitions anyway.

Let me know if additional context is needed.

The two different ways of getting the schemas result in identical schemas being downloaded for my use case.

Member

brunato commented Apr 2, 2024

OK, it is probably better to abandon the regex for extracting the schemaLocation list from the text and to iterate over the ElementTree structure instead.
This will be in the next bugfix release.

thank you
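
The difference between the two extraction strategies discussed above can be sketched with the standard library. This is only an illustration of the idea, not the library's implementation: the function names and the shortened schema text are made up, and the regex is a simplified stand-in for the one used by xmlschema.

```python
import re
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
# Top-level XSD elements that can carry a schemaLocation attribute
LOCATION_TAGS = {
    "{%s}%s" % (XSD_NS, name)
    for name in ("include", "import", "redefine", "override")
}

SCHEMA_TEXT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="busDefinition.xsd"/>
  <!-- <xs:include schemaLocation="memoryMapDefinition.xsd"/> -->
  <xs:include schemaLocation="component.xsd"/>
</xs:schema>"""


def locations_by_regex(text):
    # Text-based scan: it cannot tell live elements from commented-out ones.
    return re.findall(r'\bschemaLocation\s*=\s*[\'"]([^\'"]*)[\'"]', text)


def locations_by_tree(text):
    # Tree-based scan: the default ElementTree parser discards comments,
    # so only real child elements of xs:schema are visited.
    root = ET.fromstring(text)
    return [
        elem.get("schemaLocation")
        for elem in root
        if elem.tag in LOCATION_TAGS and elem.get("schemaLocation") is not None
    ]


print(locations_by_regex(SCHEMA_TEXT))
# ['busDefinition.xsd', 'memoryMapDefinition.xsd', 'component.xsd']
print(locations_by_tree(SCHEMA_TEXT))
# ['busDefinition.xsd', 'component.xsd']
```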

Author

AmeyaVS commented Apr 3, 2024

Another difference between the two approaches, which I missed yesterday, is that xml.xsd is missing when using the save_remote parameter of the export API:

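[image: comparison of the files saved by the two approaches, with xml.xsd missing from the export output]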

Member

brunato commented Apr 4, 2024

> @AmeyaVS: I will not change schema export to download skipped schemas such as xml.xsd, but in the next minor release I will add a new API, download_schemas(), for downloading a set of schemas given a URL as the starting point.

Changing that for export is not advisable because xml.xsd is already included in the base schema set, so the xmlschema library doesn't need to save another copy of xml.xsd. If you want, you can try an export after creating the schema with use_meta=False.

Anyway, the download_schemas() API will download all the referenced XSD resources.

Author

AmeyaVS commented Apr 5, 2024

Sounds good, let me know if you want me to close this issue.

brunato added a commit that referenced this issue Apr 7, 2024
  - Modify dataclass XsdSource: now takes a path and an XMLResource,
    other attributes are set in __init__;
  - Schema locations now are extracted from XML tree.
Member

brunato commented Apr 17, 2024

The changes are now published. Try the updated code and report any further problems, or close the issue.
Thank you

Author

AmeyaVS commented Apr 30, 2024

Sorry for the delay.
Closing the issue.

AmeyaVS closed this as completed Apr 30, 2024