Schema management for dataset-types #607
Merged
POC taking @roelbertens's implementation and making it a little more general. Design decisions to enable this:
Further questions:
Proposal for more general schema

    import dataclasses
    import enum
    from typing import Dict, OrderedDict, Type, Union

    class SchemaFlavor(enum.Enum):
        common = "common"  # common set of basic ones
        spark = "spark"
        pandas = "pandas"
        pydantic = "pydantic"

    @dataclasses.dataclass
    class Schema:
        # fields are limited to those supported by "flavor"
        # if ordereddict then it has order, otherwise it doesn't
        fields: Union[Dict[str, Type], OrderedDict[str, Type]]
        flavor: SchemaFlavor = SchemaFlavor.common
        allow_extra: bool = True  # whether or not extra are allowed

The user has three entry-points:

    from hamilton.htypes import dataframe

    SchemaIn = Schema(...)
    SchemaOut = Schema(...)

    # replace with spark, polars, etc...
    def foo(input: dataframe[pd.DataFrame, SchemaIn]) -> dataframe[pd.DataFrame, SchemaOut]:
        ...

Then for validation we can:
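The validation steps themselves are not spelled out here. As a rough sketch only, this is one way the `allow_extra` flag and the ordered/unordered distinction in the proposed `Schema` could be interpreted. `validate_columns` is a hypothetical helper, not part of this PR, and it checks a plain `{column: type}` mapping rather than a real dataframe so that it stays dependency-free.

```python
import dataclasses
from collections import OrderedDict
from typing import Dict, List, Type


@dataclasses.dataclass
class Schema:
    # Trimmed copy of the proposed dataclass (flavor omitted for brevity).
    fields: Dict[str, Type]
    allow_extra: bool = True


def validate_columns(actual: Dict[str, Type], schema: Schema) -> List[str]:
    """Hypothetical helper: returns a list of problems, empty if valid."""
    problems = []
    # Every declared field must be present with exactly the declared type.
    for name, expected in schema.fields.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] is not expected:
            problems.append(
                f"column {name}: expected {expected.__name__}, "
                f"got {actual[name].__name__}"
            )
    # Undeclared columns are only a problem when allow_extra is False.
    extras = [name for name in actual if name not in schema.fields]
    if extras and not schema.allow_extra:
        problems.append(f"unexpected columns: {extras}")
    # Assumption: only an OrderedDict schema constrains column order.
    if isinstance(schema.fields, OrderedDict):
        shared = [name for name in actual if name in schema.fields]
        expected_order = [name for name in schema.fields if name in actual]
        if shared != expected_order:
            problems.append(f"column order {shared} != {expected_order}")
    return problems
```

Returning a list of problems instead of raising on the first one is a design choice for the sketch; it lets a caller report every mismatch at once.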
TODO remaining:
skrawcz reviewed on Dec 25, 2023 (3 reviews)
elijahbenizzy force-pushed the schema-management branch 2 times, most recently from 89c1fef to 4371c6c (December 26, 2023 00:15)
skrawcz reviewed on Dec 26, 2023
elijahbenizzy force-pushed the schema-management branch 6 times, most recently from 9826486 to 3d2774c (December 27, 2023 00:47)
Beforehand we were just using pandas/pyspark. We now add this for pyspark, although the column-type is left as None, as it does not carry the same role.
elijahbenizzy force-pushed the schema-management branch from 3d2774c to 5c10cc5 (December 27, 2023 00:54)
skrawcz reviewed on Dec 27, 2023 (4 reviews)
elijahbenizzy force-pushed the schema-management branch 4 times, most recently from d87fd65 to 609782d (December 27, 2023 04:26)
As discussed, we're going with a list of tuples, and not validating the types. That said, we have a set of types we allow, and we can assume that any of those are supported, whereas other types are accepted but will not be fully supported. Note we also made the following decisions:
1. We implement the types as a single tag internally. This will change, so it's not exposed to the user.
2. We only allow this to decorate registered dataframe types.
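To make the decisions above concrete, here is a minimal, self-contained sketch of the idea: a list of `(column, type)` tuples serialized into a single internal tag, a set of types treated as fully supported, and a guard that only lets the decorator wrap functions returning a registered dataframe type. Every name in it (`schema_output`, `SCHEMA_TAG`, `SUPPORTED_TYPES`, `REGISTERED_DATAFRAME_TYPES`, `FakeDataFrame`) is hypothetical and chosen for illustration; it is not the code merged in this PR.

```python
from typing import Callable, List, Tuple


class FakeDataFrame:
    """Stand-in for a registered dataframe type (assumption for this sketch)."""


# All of these names and values are assumptions, not the PR's internals.
REGISTERED_DATAFRAME_TYPES = {FakeDataFrame}
SUPPORTED_TYPES = {"int", "float", "str", "bool"}
SCHEMA_TAG = "internal.schema_output"


def encode_schema(fields: List[Tuple[str, str]]) -> str:
    """Serializes (column, type) pairs into one tag value, preserving order."""
    return ",".join(f"{name}={type_}" for name, type_ in fields)


def decode_schema(tag_value: str) -> List[Tuple[str, str]]:
    """Inverse of encode_schema."""
    return [tuple(item.split("=", 1)) for item in tag_value.split(",") if item]


def unsupported(fields: List[Tuple[str, str]]) -> List[str]:
    """Columns whose type is accepted but outside the fully supported set."""
    return [name for name, type_ in fields if type_ not in SUPPORTED_TYPES]


def schema_output(*fields: Tuple[str, str]) -> Callable:
    """Attaches the schema to a function as a single tag; types are not validated."""

    def decorator(fn: Callable) -> Callable:
        return_type = fn.__annotations__.get("return")
        if return_type not in REGISTERED_DATAFRAME_TYPES:
            raise TypeError(
                f"{fn.__name__} must return a registered dataframe type, "
                f"got {return_type!r}"
            )
        existing = getattr(fn, "tags", {})
        fn.tags = {**existing, SCHEMA_TAG: encode_schema(list(fields))}
        return fn

    return decorator
```

Keeping the encoding behind `encode_schema`/`decode_schema` matches the note that the single-tag representation is an internal detail that may change: callers only ever see the list of tuples.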
elijahbenizzy force-pushed the schema-management branch from 609782d to dfa8e68 (December 27, 2023 05:01)
skrawcz approved these changes on Dec 27, 2023
Changes
How I tested this
Notes
Checklist