save pipeline structure with parameters for reproduction #92
Comments
This was on our list of requirements when designing Sciline but it had a low priority lately. So thanks for the reminder! As you said, we will likely have to write the graph and parameter values. There are some open questions, though:
This is an interesting idea. But it would be an incomplete solution, because providers typically call additional functions, and we couldn't reasonably write their source code or hash, too. In our case, we expect to have a script or Jupyter notebook that defines the graph and possibly some specialised providers, as well as one or more packages that define most providers. My assumption was that we at least write the precise version of all relevant packages (or a full …
It's great that you have a similar interest here :) I'll be happy to discuss this further at any point. Best, Daniel
I'll keep it open as a reminder. I'd be happy to hear your insights into how you and your users want to handle provenance and what requirements you have!
My idea so far was to store the graph in a Sciline-independent manner. "Producers" and "Parameters" are, strictly speaking, an implementation detail of Sciline, so one would not want to rely on them for long-term archiving of data, FAIR data, etc. The computational graph is hopefully more meaningful (when combined with input parameters). So we should look into how this can be stored in a generic manner. I don't know if studying, e.g., how Snakemake handles this can provide some guidance.
Conclusion for now:
To get an overview of some formats used in practice, take a look at https://networkx.org/documentation/stable/reference/readwrite/index.html. From this list, I'd prefer JSON, or possibly adjacency list / multiline adjacency list. The former in particular, because it makes it easy to also store parameter values without inventing a new format.
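As a sketch of what the node-link JSON option could look like (using networkx directly; the node names, parameter keys, and document layout here are made up for illustration, not Sciline output):

```python
import json

import networkx as nx

# A tiny stand-in for a task graph: nodes are type/parameter names,
# edges point from an input to the result computed from it.
g = nx.DiGraph()
g.add_edge("Filename", "RawData")
g.add_edge("RawData", "CleanData")
g.add_edge("Threshold", "CleanData")

# networkx's node-link representation maps directly to JSON.
graph_data = nx.node_link_data(g)

# Parameter values can live alongside the graph in the same document,
# so no new file format is needed.
document = {
    "graph": graph_data,
    "parameters": {"Filename": "run_001.h5", "Threshold": 0.5},
}
print(json.dumps(document, indent=2))
```

One nice property of this layout is that the graph can be reloaded without Sciline at all, e.g. via `nx.node_link_graph(document["graph"])`.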
Can you explain why parameters might be large? I thought they would only be single/few numbers or strings. All large data would be read from a file.
A parameter can be anything. For example, you can process an intermediate result, set it as a parameter, and create a new task graph. Parameters can thus have arbitrary size, and they might not be serializable at all.
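This constraint suggests a best-effort approach when writing parameters: store values verbatim when they are JSON-friendly, and fall back to a description otherwise. A minimal sketch (the function name and record layout are hypothetical, not Sciline API):

```python
import json


def describe_parameter(value):
    """Best-effort record of a parameter value.

    Small JSON-serializable values are stored verbatim; anything else
    (arrays, intermediate results, arbitrary objects) is reduced to its
    type name and a truncated repr so the file remains writable.
    """
    try:
        json.dumps(value)
        return {"kind": "value", "value": value}
    except TypeError:
        return {"kind": "opaque", "type": type(value).__name__, "repr": repr(value)[:200]}


print(describe_parameter(0.5))       # a plain number is kept as-is
print(describe_parameter(object()))  # an arbitrary object becomes a description
```

The "opaque" fallback loses reproducibility for that parameter, of course, but it at least records that such a parameter existed and what it looked like.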
Are there any objections to using the json format described by networkx? If not, I'll implement that. |
JSON sounds good! |
First part done in #124. Now we need to figure out how to handle parameters. |
Hi Sciline team,
First of all, many thanks for this great package. We were currently thinking of designing a similar system for pipeline data processing based on xarray data containers, and luckily found your work before writing a line of code.
In some of our data analysis tasks, we have some rather expensive producers (e.g. phase retrieval methods for X-ray holography), and in addition to the actual result of a pipeline, we would also like to save how we got there.
Obviously, we could save the list of producers and parameters on our own, but how about a dedicated method of the Pipeline class, similar to visualize(), which could return not only the structure of the graph but also the actual parameter values?
In addition to the names of the producers, one could also think of saving their source code and/or a hash of it for full reproducibility.
This feature could be extremely helpful during beamtimes, when code is often changed during online analysis.
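Recording a source hash for a provider can be sketched with the standard library alone (this is not a Sciline feature; `clean` below is a hypothetical provider, and as noted in the discussion, the hash covers only the provider's own body, not functions it calls):

```python
import hashlib
import inspect


def provider_fingerprint(func):
    """Hash a provider's source code for a reproducibility record.

    inspect.getsource() only works for functions defined in files or
    notebook cells, so this is inherently best-effort.
    """
    src = inspect.getsource(func)
    return {
        "name": func.__qualname__,
        "module": func.__module__,
        "sha256": hashlib.sha256(src.encode()).hexdigest(),
    }


def clean(raw: list, threshold: float) -> list:
    """Hypothetical provider: drop values at or below a threshold."""
    return [x for x in raw if x > threshold]


print(provider_fingerprint(clean))
```

During online analysis at a beamtime, re-fingerprinting after each code edit would make it visible which provider version produced which result.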
Best
Daniel