Codegen parser preserves filenames & line numbers in round trip #608

Open
tetron opened this issue Oct 12, 2022 · 0 comments

tetron commented Oct 12, 2022

Schema salad uses the ruamel.yaml "round trip" YAML parser.

This parser preserves comments and line numbers by loading documents into ruamel.yaml.comments.CommentedMap and ruamel.yaml.comments.CommentedSeq objects. These behave like Python dicts and lists, but have an additional field lc (which stands for "line column", I think). The lc field records where the map or sequence itself starts, as well as where each of its contained items starts. In addition, we set our own filename field to track which file an object came from.

This information is used to give better CWL errors, so it is possible to communicate what part of the file contains a warning or error. Specifically, look at the SourceLine class, which is used to wrap a code block such that any uncaught exceptions will be re-thrown with additional line number information added to the message.
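The wrapping idea can be sketched with a small context manager. This is not the real SourceLine class or its API; WithPosition, LocatedError and check_class are hypothetical names used only to show the pattern of re-raising an exception with position information added to the message.

```python
class LocatedError(Exception):
    """Hypothetical error type whose message carries a source position."""


class WithPosition:
    """Hypothetical stand-in for the pattern, not the SourceLine API."""

    def __init__(self, filename: str, line: int, col: int) -> None:
        self.filename = filename
        self.line = line  # zero-based, as stored in lc
        self.col = col    # zero-based, as stored in lc

    def __enter__(self) -> "WithPosition":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc is None or not isinstance(exc, Exception):
            return False  # nothing to annotate; let it propagate unchanged
        # Report one-based positions, the way editors display them.
        prefix = f"{self.filename}:{self.line + 1}:{self.col + 1}"
        raise LocatedError(f"{prefix}: {exc}") from exc


def check_class(doc: dict) -> None:
    """Toy validation step that knows nothing about source positions."""
    if doc.get("class") != "CommandLineTool":
        raise ValueError(f"unsupported class {doc.get('class')!r}")


try:
    with WithPosition("tool.cwl", 2, 0):
        check_class({"class": "Bogus"})
except LocatedError as err:
    message = str(err)

print(message)
```

The validation code stays free of position bookkeeping; only the caller, which holds the lc data, needs to know where the block came from.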

The purpose of Schema salad is to validate documents against a schema. The primary user is CWL, but Schema salad is intended to be general purpose.

Schema salad supports two ways of parsing and validating documents. The original way is to load the schema into a data structure and then call ref_resolver.Loader.resolve_all followed by validate.validate. The newer way is to generate Python code from the schema which implements the same logic. The benefit of the code generation approach is that the resulting parser is much, much faster.

However, if you want to "round trip" a CWL document by using the codegen parser (which is based on loading records into objects), then exporting it back to maps and sequences, you lose the line number information.

For this project, we want to preserve the line number and filename information so that if you re-export the document (using save()) it preserves, as best as possible, the original line/column and filename annotations for use by CWL. As a stretch goal, it would also be neat if it preserved the YAML comments (which are also recorded by the "CommentedMap" / "CommentedSeq" classes) so that using the ruamel round trip exporter included all the comments from the original document.

The code generator code can be found in python_codegen.py. The parsers are ultimately released in the cwl-utils project. Here's how the CWL parsers are generated:

https://github.com/common-workflow-language/cwl-utils#development

We're currently retaining the original CommentedMap in the _doc field but not doing anything with it, so one approach is to have the save() method use the annotations from _doc to annotate objects that are returned. Among other things, you'll need to return CommentedMap and CommentedSeq instead of Dict and List.
