Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store the model definition into files that are written #358

Merged
merged 19 commits into from
Mar 7, 2023

Conversation

tmadlener
Copy link
Collaborator

@tmadlener tmadlener commented Dec 9, 2022

BEGINRELEASENOTES

  • Embed the EDM definition in JSON format into the shared core datamodel libraries
    • Generate an additional DatamodelDefinition.h header file containing the string literal json encoded definition
    • Statically register this to the newly introduced DatamodelRegistry and make collections aware of which datamodel they belong to
  • Collect all EDM definitions from all collections that are written with a writer and write all these definitions to the resulting file
    • Currently only done for the FrameWriters
  • Give podio-dump the necessary functionality to retrieve the stored models and dump them in YAML format again
    • Add roundtrip tests that compare the generated code from the original model and the one that has been dumped from a data file to ensure that all components work as intended.
  • See the advanced topics documentation for more details.

ENDRELEASENOTES

This is a first, incomplete, draft of storing the model definition into the written files, such that they can be retrieved again later such that the datamodel(s) of the types stored in the file can be regenerated. This is mainly here to gather some early feedback before investing too much time into the whole thing. @hegner, @gaede, @andresailer please let me know if there are any serious concerns with the overall approach, details can then be hashed out later, I suppose.

The general idea is:

  • Store the model definition as a JSON string somewhere in the metadata of the file.
  • Provide some standalone tools (or make the class generator or podio-dump capable of) either generating the code directly from a file, or simply dumping the yaml definition again.

The JSON string is a raw string literal that we simply generate via the code generator. In order to get this to the files, we also introduce a DatamodelRegistry singleton, where we make each datamodel statically register itself. While writing we simply query collections about which definition we need to write in the end, and we collect a list of all defintions that need writing, and write them once everything is done.

The reason for JSON as storage "format" is because it yields relatively compact strings that need to be stored (especially compared to trying to store the raw YAML files). Additionally, converting between JSON and YAML in python is pretty trivial in the end (and only necessary if we want to dump a YAML file again).

TODO list:

  • Make it possible to dump (parsed) datamodel definitions to JSON
  • code generation of compile time raw string literal
  • DatamodelRegistry singleton
  • static registration of the JSON definition strings in the registry
  • Make collections query-able about which definition they need
  • Instrument writers to collect and write the necessary definitions
  • Figure out the best way to handle upstream EDMs in this scenario
  • Tooling to retrieve definitions again from written files
  • (roundtrip) tests
  • (developers) documentation
  • release notes
  • Needs Add basic I/O tests for datamodel extensions #361
  • Needs Use sio utility functionality for writing also in legacy writer #363

@tmadlener tmadlener force-pushed the store-model-def branch 5 times, most recently from 8ec13fa to 56d1149 Compare December 23, 2022 09:47
@tmadlener
Copy link
Collaborator Author

tmadlener commented Dec 23, 2022

Moved some of the general refactoring / code cleanup into #363 to make this diff slightly less cluttered (once it is merged).

@tmadlener tmadlener changed the title [WIP] Store the model definition into files that are written Store the model definition into files that are written Jan 16, 2023
doc/advanced_topics.md Outdated Show resolved Hide resolved
include/podio/CollectionBase.h Show resolved Hide resolved
include/podio/EDMDefinitionRegistry.h Outdated Show resolved Hide resolved
include/podio/SIOBlock.h Show resolved Hide resolved
python/templates/DatamodelDefinition.h.jinja2 Outdated Show resolved Hide resolved
src/EDMDefinitionRegistry.cc Outdated Show resolved Hide resolved
src/EDMDefinitionRegistry.cc Outdated Show resolved Hide resolved
src/EDMDefinitionRegistry.cc Outdated Show resolved Hide resolved
Copy link
Collaborator

@hegner hegner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmadlener - thanks for this great implementation.
There are a few things which need changing here. See my comments.

doc/advanced_topics.md Show resolved Hide resolved
include/podio/EDMDefinitionRegistry.h Outdated Show resolved Hide resolved
include/podio/ROOTFrameReader.h Outdated Show resolved Hide resolved
python/templates/DatamodelDefinition.h.jinja2 Outdated Show resolved Hide resolved
@tmadlener
Copy link
Collaborator Author

Brief intermediate update: I have renamed the registry to DatamodelRegistry since it holds information about datamodels in general. I have also tried to be more clear about what is a datamodel and what is an EDM and renamed methods and classes accordingly. Essentially now an EDM refers to the compiled library of a datamodel only, whereas the datamodel is everything related to one model, i.e. it's definition in YAML format as well as other information. In the API now everything refers to datamodels unless it is specifically about something more concrete. I still have to update the markdown documentation to include these changes and definitions, and I also still have to to tackle the issue of dependent datamodels.

@hegner
Copy link
Collaborator

hegner commented Mar 2, 2023

@tmadlener - thanks for the updates - in particular as some of them were quite tedious...
I would open a ticket for the dependent datamodels and just fix the documentation for now.

@tmadlener
Copy link
Collaborator Author

Alright, documentation is updated, and a final leftover header has also been renamed to match the new terminology.

@tmadlener
Copy link
Collaborator Author

I think now I really have all the things that needed to be renamed ;)

The two workflows that are currently failing, are due to failing upstream nightly builds (the setup.sh script is missing in both cases, causing the workflow to fail)

@tmadlener
Copy link
Collaborator Author

all workflows are working again and passing

@hegner
Copy link
Collaborator

hegner commented Mar 6, 2023

@tmadlener - GREAT! :-)

@tmadlener
Copy link
Collaborator Author

Is there anything still left to do on this one? I would open the issue for the missing correct handling of dependent datamodels once this is merged. I think I addressed the rest here.

@hegner hegner self-requested a review March 7, 2023 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants