
Schema management for dataset-types #607

Merged: 5 commits into main from schema-management, Dec 27, 2023
Conversation

elijahbenizzy (Collaborator)

Changes

How I tested this

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

sweep-ai bot (Contributor) commented Dec 24, 2023

Apply Sweep Rules to your PR?

  • Apply: All new business logic should have corresponding unit tests.
  • Apply: Refactor large functions to be more modular.
  • Apply: Add docstrings to all functions and file headers.

elijahbenizzy (Collaborator, Author) commented Dec 25, 2023

POC taking @roelbertens's implementation and making it a little more general.

Design decisions to enable this:

  1. This is going to take the form of a decorator
  2. The data, for now, is not going to be accessible
  3. We store the schema data under tags on the node, in a serialized format, under the key hamilton.internal.schema_output=... (rough sketch below)
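As a rough illustration (the key name is from above; the payload shape here is an assumption, not the final format), the tag might hold something like:

import json

# Hypothetical sketch only: serialize a column-name/type mapping into the
# internal tag. The actual payload format is internal and subject to change.
tag_key = "hamilton.internal.schema_output"
tag_value = json.dumps({"name": "str", "age": "int"})  # assumed columns/types
node_tags = {tag_key: tag_value}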

Further questions:

  1. How do we allow for API change?
    • Keep with this API
    • Have the parameter not be **kwargs, so we can switch it later
    • Change this to dataset_schema or something like that, use schema later
    • Keep in experimental/call it schema_beta
  2. Should schemas allow order?
  3. Should we be able to build schemas for specific libraries (spark, pydantic, etc...)?
    • Probably yes -- we should have a schema class
    • This schema class will have the ability to specify a "flavor", which will be generic
    • It will allow for column names/types
    • We should be able to convert between them/check compatibility with the right plugin
  4. How to decorate input types?
    • Type-annotation

Proposal for a more general schema

import dataclasses
import enum
from typing import Dict, OrderedDict, Type, Union


class SchemaFlavor(enum.Enum):
    common = "common"  # common set of basic types
    spark = "spark"
    pandas = "pandas"
    pydantic = "pydantic"


@dataclasses.dataclass
class Schema:
    # fields are limited to those supported by "flavor"
    # if OrderedDict then it has order, otherwise it doesn't
    fields: Union[Dict[str, Type], OrderedDict[str, Type]]
    flavor: SchemaFlavor = SchemaFlavor.common
    allow_extra: bool = True  # whether or not extra fields are allowed
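For example, a user might construct a schema like this (hypothetical column names, showing an ordered, closed pandas schema):

from collections import OrderedDict

# Hypothetical usage of the proposed Schema dataclass above.
user_schema = Schema(
    fields=OrderedDict([("name", str), ("age", int)]),
    flavor=SchemaFlavor.pandas,
    allow_extra=False,  # reject columns not declared in `fields`
)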

The user has three entry-points:

  1. Pass in a schema (specified above) to a @schema decorator
  2. Pass in a Dict[str, Type] to shortcut the schema decorator (both sketched after the example below)
  3. Add a type-annotation:
import pandas as pd
from hamilton.htypes import dataframe

SchemaIn = Schema(...)
SchemaOut = Schema(...)

# replace pd.DataFrame with spark, polars, etc...
def foo(input: dataframe[pd.DataFrame, SchemaIn]) -> dataframe[pd.DataFrame, SchemaOut]:
    ...
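A minimal sketch of entry points (1) and (2), assuming a hypothetical @schema decorator; the import path and call signature are illustrative, not the merged API:

import pandas as pd
from hamilton.function_modifiers import schema  # hypothetical import/name, for illustration only

@schema(Schema(fields={"name": str, "age": int}))  # (1) full Schema object
def users() -> pd.DataFrame:
    ...

@schema({"name": str, "age": int})  # (2) Dict[str, Type] shortcut
def users_shortcut() -> pd.DataFrame:
    ...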

Then for validation we can:

  1. Add a lifecycle method that does automated validation of inputs and outputs at runtime (a minimal sketch follows this list)
  2. Add a graph method that validates types between edges
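A minimal sketch of the runtime check such a lifecycle method could perform, assuming the Schema class above and a pandas output; the function name and error behavior are assumptions:

import pandas as pd

# Hypothetical runtime validation of a node's pandas output against a Schema.
def validate_output(df: pd.DataFrame, schema: Schema) -> None:
    missing = set(schema.fields) - set(df.columns)
    if missing:
        raise ValueError(f"Output is missing declared columns: {sorted(missing)}")
    if not schema.allow_extra:
        extra = set(df.columns) - set(schema.fields)
        if extra:
            raise ValueError(f"Output has undeclared extra columns: {sorted(extra)}")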

TODO remaining:

  • Add some basic unit tests for schema
  • Add/fix some tests for viz
  • Ensure that the node modifiers work together (e.g. we have some colors that we need to use with the different node modifiers)
  • Allow for disabling at the DAG level
  • Break type-registry on spark into separate commits

@elijahbenizzy force-pushed the schema-management branch 2 times, most recently from 89c1fef to 4371c6c on December 26, 2023 00:15
@elijahbenizzy force-pushed the schema-management branch 6 times, most recently from 9826486 to 3d2774c on December 27, 2023 00:47
roelbertens and others added 4 commits December 26, 2023 16:54
Previously, we were just using pandas/pyspark. We now add this for
pyspark, although the column type is left as None, as it does not carry
the same role.
@elijahbenizzy force-pushed the schema-management branch 4 times, most recently from d87fd65 to 609782d on December 27, 2023 04:26
@elijahbenizzy marked this pull request as ready for review on December 27, 2023 04:35
As discussed, we're going with a list of tuples, and not validating the types.
That said, we have a set of types we explicitly allow and can assume are supported;
other types are permitted but will not be fully supported.

Note we also made the following decisions:

1. We implement the types as a single tag internally. This will change,
   so it's not exposed to the user
2. We only allow this to decorate registered dataframe types (usage sketched below)
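Based on those decisions, usage presumably looks something like the sketch below; the decorator name (schema.output) and the string type names are assumptions from this thread, not verified against the merged code:

import pandas as pd
from hamilton.function_modifiers import schema  # assumed import path

# Columns declared as (name, type) tuples; the types are not validated,
# per the decision above.
@schema.output(("name", "str"), ("age", "int"))
def users() -> pd.DataFrame:  # must return a registered dataframe type (decision 2)
    ...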
@elijahbenizzy merged commit e8f83d3 into main on Dec 27, 2023
22 checks passed
@elijahbenizzy deleted the schema-management branch on December 27, 2023 05:11