
Schema management for dataset-types #607

Merged: 5 commits into main from schema-management, Dec 27, 2023
Conversation

elijahbenizzy (Collaborator)

Changes

How I tested this

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

sweep-ai bot (Contributor) commented Dec 24, 2023

Apply Sweep Rules to your PR?

  • Apply: All new business logic should have corresponding unit tests.
  • Apply: Refactor large functions to be more modular.
  • Apply: Add docstrings to all functions and file headers.

elijahbenizzy (Collaborator, Author) commented Dec 25, 2023

POC taking @roelbertens's implementation and making it a little more general.

Design decisions to enable this:

  1. This is going to take the form of a decorator
  2. The data, for now, is not going to be accessible
  3. We store the schema data under tags on the node, in a serialized format, under the key hamilton.internal.schema_output=... (rough sketch below)
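As a rough illustration (the key name is from above; the payload shape here is an assumption, not the final format), the tag might hold something like:

import json

# Hypothetical sketch only: serialize a column-name/type mapping into the
# internal tag. The actual payload format is internal and subject to change.
tag_key = "hamilton.internal.schema_output"
tag_value = json.dumps({"name": "str", "age": "int"})  # assumed columns/types
node_tags = {tag_key: tag_value}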

Further questions:

  1. How do we allow for API change?
    • Keep with this API
    • Have the parameter not be **kwargs, so we can switch it later
    • Change this to dataset_schema or something like that, use schema later
    • Keep in experimental/call it schema_beta
  2. Should schemas allow order?
  3. Should we be able to build schemas for specific libraries (spark, pydantic, etc...)?
    • Probably yes -- we should have a schema class
    • This schema class will have the ability to specify a "flavor", which will be generic
    • It will allow for column names/types
    • We should be able to convert between them/check compatibility with the right plugin
  4. How to decorate input types?
    • Type-annotation

Proposal for a more general schema

import dataclasses
import enum
from typing import Dict, OrderedDict, Type, Union


class SchemaFlavor(enum.Enum):
    common = "common"  # common set of basic types
    spark = "spark"
    pandas = "pandas"
    pydantic = "pydantic"


@dataclasses.dataclass
class Schema:
    # fields are limited to those supported by "flavor"
    # if OrderedDict then it has order, otherwise it doesn't
    fields: Union[Dict[str, Type], OrderedDict[str, Type]]
    flavor: SchemaFlavor = SchemaFlavor.common
    allow_extra: bool = True  # whether or not extra fields are allowed
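For example, a user might construct a schema like this (hypothetical column names, showing an ordered, closed pandas schema):

from collections import OrderedDict

# Hypothetical usage of the proposed Schema dataclass above.
user_schema = Schema(
    fields=OrderedDict([("name", str), ("age", int)]),
    flavor=SchemaFlavor.pandas,
    allow_extra=False,  # reject columns not declared in `fields`
)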

The user has three entry-points:

  1. Pass in a schema (specified above) to a @schema decorator
  2. Pass in a Dict[str, Type] to shortcut the schema decorator (both sketched after the example below)
  3. Add a type-annotation:
import pandas as pd
from hamilton.htypes import dataframe

SchemaIn = Schema(...)
SchemaOut = Schema(...)

# replace pd.DataFrame with spark, polars, etc...
def foo(input: dataframe[pd.DataFrame, SchemaIn]) -> dataframe[pd.DataFrame, SchemaOut]:
    ...
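A minimal sketch of entry points (1) and (2), assuming a hypothetical @schema decorator; the import path and call signature are illustrative, not the merged API:

import pandas as pd
from hamilton.function_modifiers import schema  # hypothetical import/name, for illustration only

@schema(Schema(fields={"name": str, "age": int}))  # (1) full Schema object
def users() -> pd.DataFrame:
    ...

@schema({"name": str, "age": int})  # (2) Dict[str, Type] shortcut
def users_shortcut() -> pd.DataFrame:
    ...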

Then for validation we can:

  1. Add a lifecycle method that does automated validation of inputs and outputs at runtime (a minimal sketch follows this list)
  2. Add a graph method that validates types between edges
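A minimal sketch of the runtime check such a lifecycle method could perform, assuming the Schema class above and a pandas output; the function name and error behavior are assumptions:

import pandas as pd

# Hypothetical runtime validation of a node's pandas output against a Schema.
def validate_output(df: pd.DataFrame, schema: Schema) -> None:
    missing = set(schema.fields) - set(df.columns)
    if missing:
        raise ValueError(f"Output is missing declared columns: {sorted(missing)}")
    if not schema.allow_extra:
        extra = set(df.columns) - set(schema.fields)
        if extra:
            raise ValueError(f"Output has undeclared extra columns: {sorted(extra)}")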

TODO remaining:

  • Add some basic unit tests for schema
  • Add/fix some tests for viz
  • Ensure that the node modifiers work together (e.g. we have some colors that we need to use with the different node modifiers)
  • Allow for disabling at the DAG level
  • Break type-registry on spark into separate commits

@elijahbenizzy force-pushed the schema-management branch 2 times, most recently from 89c1fef to 4371c6c on December 26, 2023 00:15
@elijahbenizzy force-pushed the schema-management branch 6 times, most recently from 9826486 to 3d2774c on December 27, 2023 00:47
roelbertens and others added 4 commits December 26, 2023 16:54
Previously, we were just using pandas/pyspark. We now add this for
pyspark, although the column type is left as None, as it does not carry
the same role.
@elijahbenizzy force-pushed the schema-management branch 4 times, most recently from d87fd65 to 609782d on December 27, 2023 04:26
@elijahbenizzy marked this pull request as ready for review on December 27, 2023 04:35
As discussed, we're going with a list of tuples, and not validating the types.
That said, we have a set of types we explicitly allow and can assume are supported;
other types are permitted but will not be fully supported.

Note we also made the following decisions:

1. We implement the types as a single tag internally. This will change,
   so it's not exposed to the user
2. We only allow this to decorate registered dataframe types (usage sketched below)
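Based on those decisions, usage presumably looks something like the sketch below; the decorator name (schema.output) and the string type names are assumptions from this thread, not verified against the merged code:

import pandas as pd
from hamilton.function_modifiers import schema  # assumed import path

# Columns declared as (name, type) tuples; the types are not validated,
# per the decision above.
@schema.output(("name", "str"), ("age", "int"))
def users() -> pd.DataFrame:  # must return a registered dataframe type (decision 2)
    ...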
@elijahbenizzy merged commit e8f83d3 into main on Dec 27, 2023
22 checks passed
@elijahbenizzy deleted the schema-management branch on December 27, 2023 05:11