Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Materialization improvements #264

Merged
merged 8 commits into from
Aug 15, 2023
Merged

Materialization improvements #264

merged 8 commits into from
Aug 15, 2023

Conversation

elijahbenizzy
Copy link
Collaborator

@elijahbenizzy elijahbenizzy commented Aug 10, 2023

[Short description explaining the high-level reason for the pull request]

Changes

How I tested this

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@skrawcz skrawcz changed the title WIP Materializer bug fix Aug 12, 2023
It now:
1. Works without additional_vars included
2. Returns the graphviz object for rendering in a notebook
This was added when we added graph copying. This adds the ability for
update_dependencies to wipe the dependencies before executing. Note that
it does an in-place modification if this doesn't happen -- that's the
reset_dependencies option. Note this is an internal API so I don't mind
it being geared towards performance.
This was referring to data loaders instead of data savers in the
registry.
@elijahbenizzy elijahbenizzy changed the title Materializer bug fix Materialization improvements Aug 13, 2023
We do the ML model example, and add some custom ones. Hopefully this
gets people started. We have an easy script to run + a notebook.
MUltiprocessing doesn't work in many cases due to the default pickling
mechanism being garbage.
Previously we would inject a node with a parameter name into a
parameter consumed by downstream set of nodes. This would cause
name-clashes if, say, the parmeter name was `data`:

@load_from.json(...)
def foo(data: pd.DataFrame) -> ...:
    ...

To fix this, we did two things:

1. Change the data loader nodes that were created to have namespaces so
   they're unique
2. Allow the NodeInjector to rename input nodes so it can communicate
   the new names.

Note this is probably slightly more abstraction than needed but I have a
sense that NodeInjector will be necessary moving forward (external API
calls, etc...).
@elijahbenizzy elijahbenizzy marked this pull request as ready for review August 13, 2023 05:09
@elijahbenizzy elijahbenizzy requested a review from skrawcz August 13, 2023 05:46
Copy link
Collaborator

@skrawcz skrawcz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only questions is do we want to call it additional_vars vs additional_outputs (more intuitive)... but otherwise just comments that aren't blockers.

2. Alters the output to include the materializer nodes
3. Processes a list of "additional variables" (for debugging) to return intermediary data
4. Executes the DAG, including the materializers
5. Returns a tuple of (`materialization metadata`, `additional variables`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why "additional variables"? Because that's what we use in execute? Should we rename it to "additional_outputs"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep to mirror "final_vars", but I like "outputs" better than vars

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just make it outputs personally. But 🤷 .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought we already released this and I'm a stickler for semantic versioning...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your call.

Comment on lines +57 to +74
to.json(
dependencies=["model_parameters"], id="model_params_to_json", path="./data/params.json"
),
# classification report to .txt file
to.file(
dependencies=["classification_report"],
id="classification_report_to_txt",
path="./data/classification_report.txt",
),
# materialize the model to a pickle file
to.pickle(dependencies=["fit_clf"], id="clf_to_pickle", path="./data/clf.pkl"),
# materialize the predictions we made to a csv file
to.csv(
dependencies=["predicted_output_with_labels"],
id="predicted_output_with_labels_to_csv",
path="./data/predicted_output_with_labels.csv",
),
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want path to be a source, and have it provided as inputs? Or at least one of these should show that pattern so that people can see what can still be dynamic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so i can add that as an example or at least note it

@elijahbenizzy elijahbenizzy force-pushed the materialize-fixes branch 3 times, most recently from 572cb96 to 317dbe6 Compare August 15, 2023 03:50
This has:
1. A reference table, automatically generated using a custom sphinx
   directive
2. References for the base classes for extension

Re (1) we generate a bare-bones table but it should be enough. For now,
we just link to the code, but we will, at some point, link to actual
class docs.
@elijahbenizzy elijahbenizzy merged commit 2e65cef into main Aug 15, 2023
@elijahbenizzy elijahbenizzy deleted the materialize-fixes branch August 15, 2023 04:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants