Materialization improvements #264
Conversation
Force-pushed from ce5744b to d534e72
It now:
1. Works without `additional_vars` included
2. Returns the graphviz object for rendering in a notebook
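A minimal usage sketch, assuming the `visualize_materialization` behavior described above; `my_dag_module` and the materializer arguments are placeholders, not code from this PR.

```python
from hamilton import driver
from hamilton.io.materialization import to
import my_dag_module  # placeholder module containing the DAG functions

dr = driver.Driver({}, my_dag_module)

# additional_vars is optional now; the call returns a graphviz object, so leaving
# it as the last expression in a notebook cell renders the DAG inline.
dot = dr.visualize_materialization(
    to.csv(
        dependencies=["predicted_output_with_labels"],
        id="predictions_to_csv",
        path="./data/predictions.csv",
    ),
)
dot
```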
This was added when we added graph copying. It adds the ability for `update_dependencies` to wipe the dependencies before executing; otherwise it does an in-place modification -- that's what the `reset_dependencies` option controls. Note this is an internal API, so I don't mind it being geared towards performance.
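Purely illustrative sketch of the two behaviors described (not the actual internal API): with `reset_dependencies` the dependency sets are wiped and recomputed, otherwise the existing structures are mutated in place.

```python
def update_dependencies(nodes: dict, compute_deps, reset_dependencies: bool = True) -> dict:
    """Toy illustration: `nodes` maps node name -> set of dependency names."""
    for name, deps in nodes.items():
        if reset_dependencies:
            # wipe and recompute from scratch -- safe after copying a graph
            nodes[name] = set(compute_deps(name))
        else:
            # in-place modification of the existing sets -- faster, but mutates shared state
            deps.update(compute_deps(name))
    return nodes
```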
This was referring to data loaders instead of data savers in the registry.
Force-pushed from 8c0a81a to 7639297
Force-pushed from 7639297 to a2fc7fa
We do the ML model example, and add some custom ones. Hopefully this gets people started. We have an easy script to run + a notebook.
Force-pushed from a2fc7fa to ae6636e
Multiprocessing doesn't work in many cases due to the default pickling mechanism being garbage.
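For context on that pickling limitation, a generic illustration (unrelated to this repo's code): the stdlib pickler can't serialize lambdas or closures, so shipping them to worker processes fails.

```python
import multiprocessing


def main():
    square = lambda x: x * x  # lambdas can't be pickled by the default pickler
    with multiprocessing.Pool(2) as pool:
        # raises PicklingError: Can't pickle <function <lambda> ...>
        print(pool.map(square, [1, 2, 3]))


if __name__ == "__main__":
    main()
```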
Previously we would inject a node with a parameter name into a parameter consumed by a downstream set of nodes. This would cause name-clashes if, say, the parameter name was `data`:

```python
@load_from.json(...)
def foo(data: pd.DataFrame) -> ...:
    ...
```

To fix this, we did two things:
1. Changed the data loader nodes that were created to have namespaces so they're unique.
2. Allowed the `NodeInjector` to rename input nodes so it can communicate the new names.

Note this is probably slightly more abstraction than needed, but I have a sense that `NodeInjector` will be necessary moving forward (external API calls, etc.).
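A hedged illustration of the clash: two functions in the same DAG each declaring a loaded parameter named `data`. The decorator arguments (`path=source(...)`) reflect my understanding of the `load_from` API and are not copied from this PR.

```python
import pandas as pd
from hamilton.function_modifiers import load_from, source


@load_from.json(path=source("foo_path"))
def foo(data: pd.DataFrame) -> pd.DataFrame:
    return data.head(10)


@load_from.json(path=source("bar_path"))
def bar(data: pd.DataFrame) -> pd.DataFrame:
    return data.tail(10)

# Before this change, both injected loader nodes would have been named after the
# parameter ("data") and collided; namespacing makes them unique.
```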
Only question is: do we want to call it `additional_vars` vs. `additional_outputs` (more intuitive)? Otherwise, just comments that aren't blockers.
2. Alters the output to include the materializer nodes
3. Processes a list of "additional variables" (for debugging) to return intermediary data
4. Executes the DAG, including the materializers
5. Returns a tuple of (`materialization metadata`, `additional variables`)
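A hedged usage sketch matching the behavior listed above; `my_pipeline` is a placeholder module, and the return shape (materialization metadata plus the requested additional variables) is as described in this PR.

```python
from hamilton import driver
from hamilton.io.materialization import to
import my_pipeline  # placeholder module with the DAG functions

dr = driver.Driver({}, my_pipeline)

metadata, additional = dr.materialize(
    to.json(
        dependencies=["model_parameters"],
        id="model_params_to_json",
        path="./data/params.json",
    ),
    additional_vars=["model_parameters"],  # intermediary data returned for debugging
)
```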
why "additional variables"? Because that's what we use in execute? Should we rename it to "additional_outputs"?
Yep, to mirror `final_vars`, but I like "outputs" better than "vars".
I would just make it outputs personally. But 🤷 .
On second thought, we already released this, and I'm a stickler for semantic versioning...
your call.
    to.json(
        dependencies=["model_parameters"], id="model_params_to_json", path="./data/params.json"
    ),
    # classification report to .txt file
    to.file(
        dependencies=["classification_report"],
        id="classification_report_to_txt",
        path="./data/classification_report.txt",
    ),
    # materialize the model to a pickle file
    to.pickle(dependencies=["fit_clf"], id="clf_to_pickle", path="./data/clf.pkl"),
    # materialize the predictions we made to a csv file
    to.csv(
        dependencies=["predicted_output_with_labels"],
        id="predicted_output_with_labels_to_csv",
        path="./data/predicted_output_with_labels.csv",
    ),
]
Do we want `path` to be a source, and have it provided as inputs? Or at least one of these should show that pattern so that people can see what can still be dynamic.
Yeah, so I can add that as an example, or at least note it.
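A hedged sketch of the pattern discussed: declare `path` as a `source` so it resolves from `inputs` at execution time. The name `params_path` is a placeholder, and the imports reflect my understanding of the API rather than code from this PR.

```python
from hamilton.function_modifiers import source
from hamilton.io.materialization import to

materializers = [
    to.json(
        dependencies=["model_parameters"],
        id="model_params_to_json",
        path=source("params_path"),  # resolved from inputs instead of hardcoded
    ),
]

# at execution time (dr is the Driver built elsewhere):
# metadata, additional = dr.materialize(*materializers, inputs={"params_path": "./data/params.json"})
```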
Force-pushed from 572cb96 to 317dbe6
This has:
1. A reference table, automatically generated using a custom Sphinx directive
2. References for the base classes for extension

Re (1): we generate a bare-bones table, but it should be enough. For now we just link to the code, but we will, at some point, link to actual class docs.
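Not the directive from this PR, but a minimal sketch of what a custom Sphinx directive that emits a generated list can look like; the adapter names and the directive name are hardcoded placeholders.

```python
from docutils import nodes
from docutils.parsers.rst import Directive


class DataAdapterListDirective(Directive):
    """Renders a bullet list of adapter names -- hardcoded here for illustration."""

    has_content = False

    def run(self):
        adapter_names = ["to.json", "to.csv", "to.pickle", "to.file"]
        bullet_list = nodes.bullet_list()
        for name in adapter_names:
            item = nodes.list_item()
            item += nodes.paragraph(text=name)
            bullet_list += item
        return [bullet_list]


def setup(app):
    # registers the directive so docs can use `.. data-adapter-list::`
    app.add_directive("data-adapter-list", DataAdapterListDirective)
    return {"version": "0.1", "parallel_read_safe": True}
```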
Force-pushed from 317dbe6 to be724aa