This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

explore extracting columns with validations #123

Closed
chrisaddy opened this issue Apr 29, 2022 · 3 comments

@chrisaddy

Is your feature request related to a problem? Please describe.

When we extract columns, it would be very handy to be able to run checks against those columns. pandera is a great, lightweight tool for validating dtypes, nullability, uniqueness, and any arbitrary Check callable.

Describe the solution you'd like

Ideally this would be a decorator that works similarly to extract_columns: it would ingest a DataFrame, return the same DataFrame, and expand the nodes to include a dataframe validation node. This could be specific to pandera or made more general, so something like

import pandas as pd
from pandera import DataFrameSchema, Column, Check

@validate_columns({
    "user_id": Column(str, unique=True),
    "age": Column(int, Check.in_range(18, 150)),
    "shirt_size": Column(float, Check.greater_than(10), description="arm length in inches"),
    "favorite_apparel": Column(str, Check.isin(["pants", "shirts", "hats"])),
})
def users(input_file: str) -> pd.DataFrame:
    return pd.read_csv(input_file)

or more generically

import abc
from typing import Any, Dict

import pandas as pd

class Schema(abc.ABC):
    @abc.abstractmethod
    def validate(self, df: pd.DataFrame) -> None:
        """Raise if df violates the schema."""

class SimpleColumnChecker(Schema):
    def __init__(self, columns: Dict[str, Dict[str, Any]]):
        self.columns = columns

    def validate(self, df: pd.DataFrame) -> None:
        for column, col_schema in self.columns.items():
            assert column in df.columns
            if col_schema.get("unique"):
                assert df[column].is_unique
            # test key membership so that a bound of 0 is still enforced
            if "min" in col_schema:
                assert df[column].min() > col_schema["min"]
            if "max" in col_schema:
                assert df[column].max() < col_schema["max"]
            if "isin" in col_schema:
                # every value must be one of the allowed values
                assert df[column].isin(col_schema["isin"]).all()
    
@validate_columns({
    "user_id": { "unique": True},
    "age": {"min": 18, "max": 150},
    "shirt_size": {"min": 10},
    "favorite_apparel": {"isin": ["pants", "shirts", "hats"]},
})
def users(input_file: str) -> pd.DataFrame:
    return pd.read_csv(input_file)
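
For illustration, here is a minimal sketch of how the proposed validate_columns decorator might behave with the dict-based rules above. It is not an existing Hamilton API: validate_columns and find_violations are hypothetical names, the bounds are exclusive as in the snippet above, and users builds a small in-memory frame instead of reading a CSV so the example runs on its own. It only shows the "return the same DataFrame or fail" behaviour, not the node expansion.

```python
from functools import wraps
from typing import Any, Callable, Dict, List

import pandas as pd

Rules = Dict[str, Dict[str, Any]]


def find_violations(df: pd.DataFrame, rules: Rules) -> List[str]:
    """Return one message for every rule the dataframe breaks."""
    problems: List[str] = []
    for column, rule in rules.items():
        if column not in df.columns:
            problems.append(f"{column}: missing column")
            continue
        series = df[column]
        if rule.get("unique") and not series.is_unique:
            problems.append(f"{column}: duplicate values")
        if "min" in rule and not (series > rule["min"]).all():
            problems.append(f"{column}: values must be > {rule['min']}")
        if "max" in rule and not (series < rule["max"]).all():
            problems.append(f"{column}: values must be < {rule['max']}")
        if "isin" in rule and not series.isin(rule["isin"]).all():
            problems.append(f"{column}: values must be in {rule['isin']}")
    return problems


def validate_columns(rules: Rules) -> Callable:
    """Hypothetical decorator: validate the returned frame, then pass it through."""
    def decorator(fn: Callable[..., pd.DataFrame]) -> Callable[..., pd.DataFrame]:
        @wraps(fn)
        def wrapper(*args, **kwargs) -> pd.DataFrame:
            df = fn(*args, **kwargs)
            problems = find_violations(df, rules)
            if problems:
                raise ValueError("; ".join(problems))
            return df  # the same dataframe, unchanged
        return wrapper
    return decorator


@validate_columns({"user_id": {"unique": True}, "age": {"min": 18, "max": 150}})
def users() -> pd.DataFrame:
    return pd.DataFrame({"user_id": ["a", "b"], "age": [30, 41]})
```

Collecting every violation before raising, rather than stopping at the first failed assert, is a design choice of this sketch: it reports all broken columns in one error.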

Describe alternatives you've considered

Certainly you can have a splitting node where you validate the data yourself. But I think this pattern is common enough (or really should be, as a first-class citizen of any dataframe manipulation) that it would benefit from being easy to plug directly into a node.


@skrawcz
Collaborator

skrawcz commented Apr 29, 2022

We should see how we can connect this with #115 (#41).

I agree that at the beginning of any Hamilton DAG, data needs to be loaded. It will usually be a dataframe that is then split into columns, so I see merit in enabling a data quality check before using extract_columns.
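
The ordering described here (load, check, then split) can be sketched in plain pandas. This is only a stand-in for the idea, not Hamilton's extract_columns: split_validated is a hypothetical helper, and the two checks it runs (required columns present, no nulls) are assumptions chosen for the example.

```python
from typing import Dict, List

import pandas as pd


def split_validated(df: pd.DataFrame, required: List[str]) -> Dict[str, pd.Series]:
    """Check the frame first, then split it into one Series per column."""
    missing = [name for name in required if name not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    null_columns = [name for name in required if df[name].isna().any()]
    if null_columns:
        raise ValueError(f"null values in: {null_columns}")
    # only reached when every check passed
    return {name: df[name] for name in required}


raw = pd.DataFrame({"user_id": ["a", "b"], "age": [30, 41]})
columns = split_validated(raw, ["user_id", "age"])
```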

@skrawcz
Collaborator

skrawcz commented Apr 30, 2022

BTW I added more notes in #41 on this.

@skrawcz
Collaborator

skrawcz commented Jul 15, 2022

@chrisaddy I'm closing this since we integrated pandera support, which allows dataframe validation. Alternatively, people can skip extract_columns and instead manually write functions that pull a column from a dataframe, then use the base validators or pandera validators to check the result.
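
As a rough sketch of the manual alternative: one function per column, each pulling its column out of the loaded dataframe and carrying its own check. The check_range decorator below is a hypothetical stand-in written for this example, not one of Hamilton's validators, and the column names are borrowed from the original request.

```python
from functools import wraps
from typing import Callable

import pandas as pd


def check_range(low: float, high: float) -> Callable:
    """Hypothetical per-column validator; bounds are inclusive."""
    def decorator(fn: Callable[..., pd.Series]) -> Callable[..., pd.Series]:
        @wraps(fn)
        def wrapper(*args, **kwargs) -> pd.Series:
            series = fn(*args, **kwargs)
            if not series.between(low, high).all():
                raise ValueError(f"{fn.__name__}: values outside [{low}, {high}]")
            return series
        return wrapper
    return decorator


@check_range(18, 150)
def age(users: pd.DataFrame) -> pd.Series:
    """Pull one column out of the loaded dataframe by hand."""
    return users["age"]


users_df = pd.DataFrame({"user_id": ["a", "b"], "age": [30, 41]})
ages = age(users_df)
```

The trade-off is one small function per column, in exchange for each check sitting right next to the column it guards.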

@skrawcz skrawcz closed this as completed Jul 15, 2022