Driver-level materialization #235

Closed
elijahbenizzy opened this issue Jul 25, 2023 · 2 comments

@elijahbenizzy
Collaborator

Is your feature request related to a problem? Please describe.
Data savers (@save_to) are cool, but often materialization is more of an ad-hoc operation. This proposes making it easier to dynamically call materialization on a pre-existing DAG.

Describe the solution you'd like

See comment below for spec. Basically a materialize() function in the driver that modifies the DAG to include a saving node and executes it.

Describe alternatives you've considered
Do it:

  • At driver construction time
  • With save_to (already doable)
  • External to the driver -- (also doable)

Additional context
This is something we've been thinking about for a while and @save_to was the first piece of this. About time!

Will likely only work on the driverV2...

@elijahbenizzy elijahbenizzy self-assigned this Jul 25, 2023
@elijahbenizzy
Collaborator Author

elijahbenizzy commented Jul 26, 2023

Spec

Requirements

  • Actually modifies the DAG so we get nodes that are traceable/manageable
    • Traceability
    • Etc...
  • Reuses the data adapter framework
    • we can get sources/required nodes with this
  • Handles API keys for saving in a graceful way
  • Is simple/easy to call
  • Handles custom joining of results prior to saving
  • Handles more than just pandas

API

  1. Set of complementary classes to DataSavers -- for every DataSaver class we generate a materializer.
  2. These take in the same parameters as the DataSaver, with the following additional ones:
    • a joiner (ResultBuilder) that tells you how to group multiple results
    • a set of nodes to save -- could be a list of strings (*args) or a tag spec/query
    • The parameters to the DataSaver can be a value, a source, or a literal (which will be interpreted as a value)
  3. When .materialize is called, the following happens:
    • We create a new function graph from the existing one, with the default results builder + datasaver nodes applied
    • We execute the saver nodes, returning those results
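
To make step 3 concrete, here is a minimal, framework-free sketch of "copy the graph, add a saver node, execute it, return metadata". It is not Hamilton's implementation: `Node`, `execute`, and `materialize` are hypothetical names, a graph is just a dict of (function, dependency names), and the saver writes JSON instead of going through a DataSaver.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List, Tuple

# Assumption: a node is (function, names of upstream nodes passed as keyword arguments).
Node = Tuple[Callable[..., Any], List[str]]


def execute(graph: Dict[str, Node], target: str, cache: Dict[str, Any]) -> Any:
    """Compute one node, memoizing upstream results in `cache`."""
    if target not in cache:
        fn, deps = graph[target]
        cache[target] = fn(**{d: execute(graph, d, cache) for d in deps})
    return cache[target]


def materialize(
    graph: Dict[str, Node],
    to_save: List[str],
    path: str,
    join: Callable[[Dict[str, Any]], Any] = dict,
) -> Tuple[Dict[str, Node], Dict[str, Any]]:
    """Copy the graph, append a saver node, run it, and return (new_graph, metadata)."""

    def saver(**results: Any) -> Dict[str, Any]:
        with open(path, "w") as f:
            json.dump(join(results), f)  # join the upstream results, then persist
        return {"path": path, "nodes": sorted(results)}

    saver_name = "save_" + "_".join(to_save)
    new_graph = dict(graph)  # the original graph is left untouched
    new_graph[saver_name] = (saver, list(to_save))
    return new_graph, execute(new_graph, saver_name, {})


graph: Dict[str, Node] = {
    "foo": (lambda: [1, 2, 3], []),
    "bar": (lambda foo: [x * 2 for x in foo], ["foo"]),
}
out_path = os.path.join(tempfile.mkdtemp(), "out.json")
new_graph, metadata = materialize(graph, ["foo", "bar"], out_path)
```

Because the saver is an ordinary node in the new graph, it can be traced and inspected like any other node, which is the first requirement listed above.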

Do we need a name for the node/saver? If not, we can just return a list of metadata...

Consider the following DAG:

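import pandas as pd
from hamilton.function_modifiers import tag

@tag(final_data_product="features")
def foo() -> pd.Series:
    pass

@tag(final_data_product="features")
def bar() -> pd.Series:
    pass

def baz() -> pd.Series:
    pass

def foo_bar_baz() -> pd.DataFrame:
    pass

def model() -> "Model":  # Model: placeholder for whatever model type the DAG produces
    pass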

Then let's say we want to save this to parquet:

dr = driver.Driver({}, modules)

# materialize to Parquet
# can have multiple, each one is a saver
# tbd on the namespace for Parquet

dr.materialize(
    materializers.Parquet(
        "foo", "bar", "baz",  # nodes in the dataset
        path="./out.parquet",  # parameter to ParquetDataLoader
        join=DataFrameResult(),  # only needed if we have multiple and the results_builder of the DAG doesn't apply...
        # We can probably kill this
    ),
    materializers.Parquet(
        "foo", "bar", "baz",  # nodes in the dataset
        path="./out.parquet",  # parameter to ParquetDataLoader
        # no need for a results builder as it's just a dataframe
    ),
)

Or we want to save to MLFlow:

dr.materialize(
   materializers.MLFlowRegistry(
        "model",
        train=source("training_data"), # needed for signature -- we could probably hardcode it
        predictions=source("predictions"),
        # other parameters
   )
)

Say we want to save all items in a production dataset:

dr.materialize(
    materializers.Parquet(tag_query(final_data_product="features"), path="./out.parquet"),
)
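
One way a tag query could be resolved into node names is sketched below. This is an illustration only: the `tag` decorator here is a local stand-in that stores tags on the function (not Hamilton's `@tag`), and `resolve_tag_query` is a hypothetical helper.

```python
from typing import Callable, List


def tag(**tags: str) -> Callable[[Callable], Callable]:
    """Stand-in tagging decorator: stores the tags on the function object."""
    def decorator(fn: Callable) -> Callable:
        fn.tags = dict(tags)
        return fn
    return decorator


def resolve_tag_query(functions: List[Callable], **query: str) -> List[str]:
    """Return names of functions whose tags match every key/value in the query."""
    return [
        fn.__name__
        for fn in functions
        if all(getattr(fn, "tags", {}).get(k) == v for k, v in query.items())
    ]


@tag(final_data_product="features")
def foo() -> None:
    pass


@tag(final_data_product="features")
def bar() -> None:
    pass


def baz() -> None:  # untagged, so the query below skips it
    pass


selected = resolve_tag_query([foo, bar, baz], final_data_product="features")
print(selected)  # ['foo', 'bar']
```

The resolved names would then be handed to the materializer exactly as if they had been passed as `*args`.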

The trick here is attaching materialize to the DataAdapter framework: we have to wire it through while preserving the developer experience, both for the user and for the dev who customizes adapters. Ideally, adding a new DataAdapter would automatically make it usable as a materializer.
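
A rough sketch of the "new adapter becomes a materializer for free" idea, assuming a simple name-keyed registry. Everything here (`SAVER_REGISTRY`, the `DataSaver` base class, `InMemorySaver`, `Materializer`, `materializer_factory`) is hypothetical and only illustrates the wiring; it does not reflect the actual adapter registration code.

```python
from typing import Any, Callable, Dict, Tuple, Type

SAVER_REGISTRY: Dict[str, Type["DataSaver"]] = {}


class DataSaver:
    """Hypothetical base class; each subclass registers itself under `name`."""

    name: str = ""

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        SAVER_REGISTRY[cls.name] = cls

    def save(self, data: Any) -> Dict[str, Any]:
        raise NotImplementedError


class InMemorySaver(DataSaver):
    """Toy saver that writes into a dict instead of a file."""

    name = "memory"

    def __init__(self, store: Dict[str, Any], key: str) -> None:
        self.store, self.key = store, key

    def save(self, data: Any) -> Dict[str, Any]:
        self.store[self.key] = data
        return {"key": self.key}  # metadata about what was saved


class Materializer:
    """Pairs a saver class and its parameters with the nodes to save."""

    def __init__(self, saver_cls: Type[DataSaver], node_names: Tuple[str, ...],
                 saver_kwargs: Dict[str, Any]) -> None:
        self.saver_cls, self.node_names, self.saver_kwargs = saver_cls, node_names, saver_kwargs

    def run(self, results: Dict[str, Any]) -> Dict[str, Any]:
        saver = self.saver_cls(**self.saver_kwargs)
        return saver.save({n: results[n] for n in self.node_names})


def materializer_factory(name: str) -> Callable[..., Materializer]:
    """Build a materializer constructor for any registered saver."""
    saver_cls = SAVER_REGISTRY[name]

    def factory(*node_names: str, **saver_kwargs: Any) -> Materializer:
        return Materializer(saver_cls, node_names, saver_kwargs)

    return factory


store: Dict[str, Any] = {}
to_memory = materializer_factory("memory")
metadata = to_memory("foo", "bar", store=store, key="features").run({"foo": 1, "bar": 2, "baz": 3})
```

The adapter author only writes `InMemorySaver`; the materializer constructor is derived from the registry, so the saver's parameters stay the single source of truth.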

elijahbenizzy added a commit that referenced this issue Jul 31, 2023
See issue for more detailed notes. Overall design:

1. Add a .materialize(...) function
2. Materializers are dynamically registered with the same mechanism as
   data savers
3. This manipulates the DAG and calls the materialization node
4. The materialization node can also have a results builder associated
   with it
elijahbenizzy added a commit that referenced this issue Jul 31, 2023
See issue for more detailed notes. Overall design:

1. Add a .materialize(...) function
2. Materializers are dynamically registered with the same mechanism as
   data savers
3. This manipulates the DAG and calls the materialization node
4. The materialization node can also have a results builder associated
   with it

Left todo:
1. Documentation
2. More work on data savers/loaders
@elijahbenizzy
Collaborator Author

This is released; see #264 for additional improvements.
