Materialization improvements #264

elijahbenizzy · 2023-08-10T04:21:46Z

[Short description explaining the high-level reason for the pull request]

Changes

How I tested this

Notes

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

The materialization viz function now:
1. Works without additional_vars included
2. Returns the graphviz object for rendering in a notebook

This was added when we added graph copying. It lets update_dependencies wipe the dependencies before executing (the reset_dependencies option); if they are not wiped, it modifies them in place. This is an internal API, so I don't mind it being geared towards performance.
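To make the reset_dependencies trade-off concrete, here is a minimal sketch. `ToyNode` and this `update_dependencies` are hypothetical stand-ins written for illustration only; they are not Hamilton's classes or its actual signature. The sketch assumes that resetting means working on fresh copies with empty dependency lists, and that skipping the reset mutates the nodes that were passed in.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a graph node (not Hamilton's Node class)."""

    name: str
    input_names: List[str]
    dependencies: List["ToyNode"] = field(default_factory=list)

    def copy_without_dependencies(self) -> "ToyNode":
        # Fresh node with the same inputs but an empty dependency list.
        return ToyNode(self.name, list(self.input_names))


def update_dependencies(
    nodes: Dict[str, ToyNode], reset_dependencies: bool = True
) -> Dict[str, ToyNode]:
    """Wire each node to the nodes named by its inputs."""
    if reset_dependencies:
        # Wipe first: work on copies, so the caller's nodes stay untouched.
        nodes = {name: n.copy_without_dependencies() for name, n in nodes.items()}
    # Without the reset, the appends below mutate the caller's nodes in place.
    for n in nodes.values():
        for input_name in n.input_names:
            if input_name in nodes:
                n.dependencies.append(nodes[input_name])
    return nodes


nodes = {"a": ToyNode("a", []), "b": ToyNode("b", ["a"])}

# With the reset, repeated calls are safe: edges are rebuilt from scratch.
once = update_dependencies(nodes, reset_dependencies=True)
twice = update_dependencies(once, reset_dependencies=True)

# Without it, the same dict is mutated and edges accumulate across calls.
stale = update_dependencies(
    update_dependencies(nodes, reset_dependencies=False), reset_dependencies=False
)
```

In this toy version the in-place path skips the copy, which is the performance angle, at the cost of the caller having to know the nodes start clean.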

This was referring to data loaders instead of data savers in the registry.

We do the ML model example, and add some custom ones. Hopefully this gets people started. We have an easy script to run + a notebook.

Multiprocessing doesn't work in many cases due to the default pickling mechanism being garbage.

Previously we would inject a node with a parameter name into a parameter consumed by a downstream set of nodes. This would cause name clashes if, say, the parameter name was `data`:

    @load_from.json(...)
    def foo(data: pd.DataFrame) -> ...:
        ...

To fix this, we did two things:
1. Change the data loader nodes that were created to have namespaces so they're unique
2. Allow the NodeInjector to rename input nodes so it can communicate the new names.

Note this is probably slightly more abstraction than needed, but I have a sense that NodeInjector will be necessary moving forward (external API calls, etc.).
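A rough sketch of the two-part fix, in plain Python. Everything here is an assumption for illustration: `inject_loaders`, the `<function>.load_data.<param>` naming scheme, and the shape of the rename map are hypothetical and do not reflect the actual NodeInjector implementation.

```python
from typing import Callable, Dict, Tuple

Loader = Callable[[], object]


def inject_loaders(
    consumers: Dict[str, Tuple[str, Loader]],
) -> Tuple[Dict[str, Loader], Dict[str, Dict[str, str]]]:
    """Create one uniquely named loader node per consuming function.

    consumers maps function name -> (parameter name, loader callable).
    Returns the loader nodes plus, per function, a map from the original
    parameter name to the namespaced node that now feeds it.
    """
    loader_nodes: Dict[str, Loader] = {}
    renames: Dict[str, Dict[str, str]] = {}
    for fn_name, (param, loader) in consumers.items():
        # Hypothetical naming scheme: prefix with the consuming function.
        namespaced = f"{fn_name}.load_data.{param}"
        loader_nodes[namespaced] = loader
        renames[fn_name] = {param: namespaced}
    return loader_nodes, renames


# Two functions that both call their loaded parameter `data`.
consumers = {
    "foo": ("data", lambda: [1, 2, 3]),
    "bar": ("data", lambda: [4, 5]),
}

# Naive injection keys loader nodes by parameter name, so one overwrites the other.
naive_nodes = {param: loader for param, loader in consumers.values()}

# Namespaced injection keeps both and records how each input was renamed.
loader_nodes, renames = inject_loaders(consumers)
```

The rename map is what lets the consuming function keep its `data` parameter while the graph refers to the unique node name.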

skrawcz

The only question is whether we want to call it additional_vars or additional_outputs (more intuitive); otherwise just comments that aren't blockers.

skrawcz · 2023-08-13T21:03:36Z

examples/materialization/README.md

+2. Alters the output to include the materializer nodes
+3. Processes a list of "additional variables" (for debugging) to return intermediary data
+4. Executes the DAG, including the materializers
+5. Returns a tuple of (`materialization metadata`, `additional variables`)


Why "additional variables"? Because that's what we use in execute? Should we rename it to "additional_outputs"?

Yep to mirror "final_vars", but I like "outputs" better than vars

I would just make it outputs personally. But 🤷 .

On second thought we already released this and I'm a stickler for semantic versioning...

skrawcz · 2023-08-13T21:06:26Z

examples/materialization/run.py

+        to.json(
+            dependencies=["model_parameters"], id="model_params_to_json", path="./data/params.json"
+        ),
+        # classification report to .txt file
+        to.file(
+            dependencies=["classification_report"],
+            id="classification_report_to_txt",
+            path="./data/classification_report.txt",
+        ),
+        # materialize the model to a pickle file
+        to.pickle(dependencies=["fit_clf"], id="clf_to_pickle", path="./data/clf.pkl"),
+        # materialize the predictions we made to a csv file
+        to.csv(
+            dependencies=["predicted_output_with_labels"],
+            id="predicted_output_with_labels_to_csv",
+            path="./data/predicted_output_with_labels.csv",
+        ),
+    ]


Do we want path to be a source, and have it provided as inputs? At least one of these should show that pattern so that people can see what can still be dynamic.

Yeah, so I can add that as an example or at least note it.
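For readers unfamiliar with the pattern being discussed, here is a minimal sketch of a saver argument that is either fixed up front or resolved from runtime inputs. `Source` and `resolve_kwargs` are hypothetical stand-ins defined here so the example runs on its own; they are not the library's API, and the `params_path` input name is made up.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Source:
    """Hypothetical marker: take this argument from the runtime input with this name."""

    input_name: str


def resolve_kwargs(kwargs: Dict[str, Any], inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Replace Source markers with the matching runtime input; pass literals through."""
    resolved = {}
    for key, val in kwargs.items():
        if isinstance(val, Source):
            resolved[key] = inputs[val.input_name]
        else:
            resolved[key] = val
    return resolved


# Static: the path is baked in when the materializer is declared.
static_saver = {"path": "./data/params.json"}
# Dynamic: the path is looked up from the inputs supplied at execution time.
dynamic_saver = {"path": Source("params_path")}

resolved_static = resolve_kwargs(static_saver, inputs={})
resolved_dynamic = resolve_kwargs(
    dynamic_saver, inputs={"params_path": "/tmp/run_42/params.json"}
)
```

The dynamic form is what would let the same materializer list write to a different location on each run.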

This has:
1. A reference table, automatically generated using a custom Sphinx directive
2. References for the base classes for extension

Re (1), we generate a bare-bones table but it should be enough. For now, we just link to the code, but we will, at some point, link to actual class docs.

elijahbenizzy force-pushed the materialize-fixes branch from ce5744b to d534e72 Compare August 10, 2023 04:22

skrawcz changed the title ~~WIP~~ Materializer bug fix Aug 12, 2023

elijahbenizzy added 4 commits August 12, 2023 16:43

Fixes materialization viz function

c203dfc

It now:
1. Works without additional_vars included
2. Returns the graphviz object for rendering in a notebook

Fixes class hierarchy in dependencies.py

2be6fcd

Fixes materialize bug

af2e6ef

This was referring to data loaders instead of data savers in the registry.

elijahbenizzy force-pushed the materialize-fixes branch from 8c0a81a to 7639297 Compare August 13, 2023 00:00

elijahbenizzy changed the title ~~Materializer bug fix~~ Materialization improvements Aug 13, 2023

elijahbenizzy force-pushed the materialize-fixes branch from 7639297 to a2fc7fa Compare August 13, 2023 00:03

Adds examples for materializers

ae6636e

We do the ML model example, and add some custom ones. Hopefully this gets people started. We have an easy script to run + a notebook.

elijahbenizzy force-pushed the materialize-fixes branch from a2fc7fa to ae6636e Compare August 13, 2023 00:08

Sets the default executor to multithreading

846b3a6

Multiprocessing doesn't work in many cases due to the default pickling mechanism being garbage.
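A small standard-library illustration of the pickling limitation behind this change. It is not the executor code from this PR: it only shows that a lambda cannot be serialized by the default `pickle` module (which process-based pools rely on to ship work to workers), while a thread pool runs the same callable without serializing it. The `can_pickle` helper is introduced here for the demonstration.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def can_pickle(obj) -> bool:
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Lambdas are pickled by reference to a module-level name, which they lack.
square = lambda x: x * x

lambda_is_picklable = can_pickle(square)
builtin_is_picklable = can_pickle(len)

# Threads share memory, so nothing needs to be pickled to run the lambda.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(square, [1, 2, 3]))
```

The usual trade-off is that threads do not give CPU parallelism for pure-Python work, but they avoid this whole class of serialization failures.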

elijahbenizzy mentioned this pull request Aug 13, 2023

Driver-level materialization #235

Closed

elijahbenizzy marked this pull request as ready for review August 13, 2023 05:09

elijahbenizzy requested a review from skrawcz August 13, 2023 05:46

skrawcz approved these changes Aug 13, 2023

elijahbenizzy force-pushed the materialize-fixes branch 3 times, most recently from 572cb96 to 317dbe6 Compare August 15, 2023 03:50

elijahbenizzy force-pushed the materialize-fixes branch from 317dbe6 to be724aa Compare August 15, 2023 04:01

elijahbenizzy merged commit 2e65cef into main Aug 15, 2023

elijahbenizzy deleted the materialize-fixes branch August 15, 2023 04:07

This was referenced Aug 15, 2023

Two data loaders can't share the same parameter #232

Closed

Documentation for adapters #150

Closed
