explore extracting columns with validations #123

chrisaddy · 2022-04-29T17:18:10Z

Is your feature request related to a problem? Please describe.

When we extract columns, it would be very handy to be able to run checks against those columns. pandera is a great, lightweight tool for validating dtypes, nullability, uniqueness, and any arbitrary Check callable.

Describe the solution you'd like

Ideally this would be a decorator that works similarly to extract_columns: it would ingest a DataFrame, return the same DataFrame, and expand the nodes to include a DataFrame validation node. This could be specific to pandera or made more general, so something like

import pandas as pd
from pandera import DataFrameSchema, Column, Check

# validate_columns is the proposed decorator; it is not an existing API.
@validate_columns({
    "user_id": Column(str, unique=True),
    "age": Column(int, Check.in_range(18, 150)),
    "shirt_size": Column(float, Check.greater_than(10), description="arm length in inches"),
    "favorite_apparel": Column(str, Check.isin(["pants", "shirts", "hats"])),
})
def users(input_file: str) -> pd.DataFrame:
    return pd.read_csv(input_file)

or more generically

import abc
from typing import Any, Dict

import pandas as pd

class Schema(abc.ABC):
    @abc.abstractmethod
    def validate(self, df: pd.DataFrame) -> None:
        pass

class SimpleColumnChecker(Schema):
    def __init__(self, columns: Dict[str, Dict[str, Any]]):
        self.columns = columns

    def validate(self, df: pd.DataFrame) -> None:
        for column, col_schema in self.columns.items():
            assert column in df.columns
            if col_schema.get("unique"):
                assert df[column].is_unique
            # Compare against None so that a bound of 0 is still enforced.
            if col_schema.get("min") is not None:
                assert df[column].min() > col_schema["min"]
            if col_schema.get("max") is not None:
                assert df[column].max() < col_schema["max"]
            if col_schema.get("isin") is not None:
                # Every value must come from the allowed set.
                assert df[column].isin(col_schema["isin"]).all()

# validate_columns is the proposed decorator; it is not an existing API.
@validate_columns({
    "user_id": {"unique": True},
    "age": {"min": 18, "max": 150},
    "shirt_size": {"min": 10},
    "favorite_apparel": {"isin": ["pants", "shirts", "hats"]},
})
def users(input_file: str) -> pd.DataFrame:
    return pd.read_csv(input_file)
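
For illustration, here is a minimal sketch of what the decorator itself might look like as a plain Python wrapper. It assumes only that the schema object exposes a `validate(df)` method, which is also how pandera's `DataFrameSchema` is called. The names `validate_columns` and `MinimalChecker` are hypothetical, and the sketch does not expand any Hamilton nodes; it only validates the returned DataFrame and passes it through.

```python
import functools
from typing import Any, Callable, Dict

import pandas as pd


class MinimalChecker:
    """Hypothetical stand-in for a schema: anything exposing validate(df)."""

    def __init__(self, columns: Dict[str, Dict[str, Any]]):
        self.columns = columns

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        for name, rules in self.columns.items():
            if name not in df.columns:
                raise ValueError(f"missing column: {name}")
            if rules.get("unique") and df[name].duplicated().any():
                raise ValueError(f"duplicate values in column: {name}")
        return df


def validate_columns(schema: Any) -> Callable:
    """Hypothetical decorator: validate the DataFrame a function returns."""

    def decorator(fn: Callable[..., pd.DataFrame]) -> Callable[..., pd.DataFrame]:
        @functools.wraps(fn)  # keep the wrapped function's name and docstring
        def wrapper(*args: Any, **kwargs: Any) -> pd.DataFrame:
            df = fn(*args, **kwargs)
            schema.validate(df)  # raises if any check fails
            return df  # hand back the same DataFrame, unchanged

        return wrapper

    return decorator


@validate_columns(MinimalChecker({"user_id": {"unique": True}}))
def users() -> pd.DataFrame:
    # Invented in-memory data in place of pd.read_csv(input_file).
    return pd.DataFrame({"user_id": ["a", "b"], "age": [30, 41]})
```

Calling `users()` returns the DataFrame when the checks pass and raises `ValueError` otherwise. Raising an exception rather than using `assert` is a choice made for this sketch, since assertions are skipped when Python runs with `-O`.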

Describe alternatives you've considered

You can certainly add a splitting node where you validate the data yourself, but I think this is a common enough pattern (or it really should be common, and a first-class citizen of any dataframe manipulation) that it would benefit from being easy to plug directly into a node.

skrawcz · 2022-04-29T20:37:51Z

We should see how we can connect this with #115 (#41).

I agree that at the beginning of any Hamilton DAG, data needs to be loaded, and it will usually be a dataframe that is then split into columns. I see merit in enabling a data quality check before using extract_columns.

skrawcz · 2022-04-30T04:41:50Z

BTW I added more notes in #41 on this.

skrawcz · 2022-07-15T05:23:19Z

@chrisaddy I'm closing this since we integrated pandera support, which allows dataframe validation. Alternatively, people can skip extract_columns and instead manually write functions that each pull a column from a dataframe, then use the base validators or the pandera validators to check them.
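
As a rough sketch of the manual alternative described above, each column can get its own function that pulls it from the dataframe and checks it on the way out. This uses plain pandas only; the `require` helper, the function names, and the sample data are hypothetical, and Hamilton's own validators are deliberately not shown here.

```python
import pandas as pd


def require(condition: bool, message: str) -> None:
    """Hypothetical helper: fail loudly when a check does not hold."""
    if not condition:
        raise ValueError(message)


def users_df() -> pd.DataFrame:
    # Stand-in for pd.read_csv(input_file); the data is invented for illustration.
    return pd.DataFrame({"user_id": ["a", "b", "c"], "age": [25, 40, 67]})


def user_id(users_df: pd.DataFrame) -> pd.Series:
    """Pull one column out by hand and check uniqueness."""
    column = users_df["user_id"]
    require(column.is_unique, "user_id must be unique")
    return column


def age(users_df: pd.DataFrame) -> pd.Series:
    """Pull the age column and check an inclusive range."""
    column = users_df["age"]
    require(bool(column.between(18, 150).all()), "age must be within [18, 150]")
    return column
```

The trade-off is one hand-written function per column, in exchange for checks that sit right next to the column they guard.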

skrawcz mentioned this issue Apr 29, 2022

Prototype Data Quality Feature #41

Closed

skrawcz closed this as completed Jul 15, 2022
