See docstring. Will need to rebase/fix some stuff up. However, there are some questions that this raises:
Some tasks prior to this being ready:
Alright, not 100% done but functional. Remaining:
And then later tasks:
Almost there! Would have to play around with it and see how it impacts the GraphAdapters though. We will also want to show that it doesn't interfere with other decorators, or at least how it interacts with them, e.g. parameterize. This will then impact documentation, since I think this feature is large enough that we probably need to showcase a bunch of uses for it for people to cut and paste.
Discussion points to chat about:
- The double decorator issue seems like we could do without it? Make an issue to track it though?
- 😆 at the last commit message.
- We should standardize how we `import`; I think module imports are cleaner in general.
- The naming of the validators I think should be as specific as possible, thus if they only operate over pandas series, we should have `pandas` and `series` in the name.
- What else do you need help on? I didn't check the test coverage, but that would be something to double check, i.e. that we have added appropriate unit tests.
```python
import numbers
from typing import Any, Type, List, Optional, Tuple

from hamilton.data_quality.base import DataValidator, ValidationResult
```
I think we should use the style of importing the module, like Google does. That way when you're reading the class, you know if it's defined locally or not. Default to local. The exceptions here are typing classes.
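To make the suggestion concrete, a minimal sketch of that style (the helper function here is hypothetical, purely to show the `base.` prefix):

```python
from typing import Any, Optional  # typing classes stay as direct imports, per the exception above

from hamilton.data_quality import base  # import the module, not the classes


def run_check(validator: base.DataValidator, data: Any) -> Optional[base.ValidationResult]:
    # hypothetical helper: the `base.` prefix makes it obvious these classes are not defined locally
    ...
```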
Yep, happy to do that
```python
class ValidationResult:
    passes: bool  # Whether or not this passed the validation
    message: str  # Error message or success message
    diagnostics: Dict[str, Any] = dataclasses.field(default_factory=dict)  # Any extra diagnostics information needed, free-form
```
Are there any set keys emerging? Maybe a TypedDict would help here?
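To illustrate what a TypedDict could look like here (the keys below are entirely made up, just to show the idea):

```python
from typing import TypedDict


class NanDiagnostics(TypedDict, total=False):
    # hypothetical keys -- nothing in this PR establishes these
    num_nan: int
    fraction_nan: float
    num_rows: int
```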
Haven't noticed any common ones. In fact, those will all be part of the dataclass itself... This is purely unstructured, but stable stuff.
```python
class NansAllowedValidatorPandas(MaxFractionNansValidatorPandas):
    def __init__(self, allow_nans: bool, importance: str):
        if allow_nans:
            raise ValueError(
                f"Only allowed to block Nans with this validator. "
                f"Otherwise leave blank or specify the percentage of Nans using {MaxFractionNansValidatorPandas.name()}")
        super(NansAllowedValidatorPandas, self).__init__(max_fraction_nan=0 if not allow_nans else 1.0, importance=importance)
```
Language here is confusing. Should there not be any `if` statement here at all?
E.g. I could see people doing `allow_nans=True` and `allow_nans=False`, with both being fine uses.
People might do this just to be explicit with expectations.
Yeah, my original thought was that `allow_nans=False` was the only one you'd want. E.g. it's a no-op if you have `allow_nans=True`. Recall that this will be mixed with a bunch of other params...
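To make the two readings concrete (hypothetical functions; under the current code the `allow_nans=True` form raises a `ValueError`):

```python
@check_output(allow_nans=False, importance='fail')  # block NaNs -- the case the validator is built for
def strict_feature(input_col: pd.Series) -> pd.Series:
    ...

@check_output(allow_nans=True, importance='warn')  # explicit "NaNs are fine" -- a no-op, but states expectations
def lenient_feature(input_col: pd.Series) -> pd.Series:
    ...
```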
OK, this is pretty close for a first release (rc version).
A quick summary for the whylogs folks.

**What this does**

The decorator has two modes of working with it:

(1) looks something like this:

```python
# Note that this actually produces two as it uses two arguments
@check_output(range=(0, 1), allow_nans=False, importance='fail')
def data_between_0_and_1_with_no_nans(some_input: pd.Series) -> pd.Series:
    ...
```

Where (2) looks something like this:

```python
# Note that this actually produces two as it uses two arguments
@check_output_custom(
    MyCustomDataInRangeValidator(low=0, high=1),
    MyCustomNoNansAllowedValidator())
def data_between_0_and_1_with_no_nans(some_input: pd.Series) -> pd.Series:
    ...
```

**How this fits into hamilton/the hamilton plan**

**Your task**

We would love feedback! From you...

Happy to pair as needed. Also happy to make changes in case we need any more abstractions.
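For reference, a rough sketch of what one of those custom validators could look like. The base-class API (constructor args, `validate` method) is inferred from snippets elsewhere in this PR, so treat the details as illustrative rather than exact:

```python
import pandas as pd

from hamilton.data_quality import base


class MyCustomDataInRangeValidator(base.DataValidator):
    """Illustrative only: checks that every value of a pandas Series falls within [low, high]."""

    def __init__(self, low: float, high: float, importance: str = 'warn'):
        super().__init__(importance=importance)
        self.low = low
        self.high = high

    def validate(self, data: pd.Series) -> base.ValidationResult:
        passes = bool(data.between(self.low, self.high).all())
        return base.ValidationResult(
            passes=passes,
            message=f"values within [{self.low}, {self.high}]" if passes else f"values outside [{self.low}, {self.high}]",
            diagnostics={'low': self.low, 'high': self.high},  # free-form extra info
        )
```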
We now have a test-integrations section in config.yml. I've decided to group them together to avoid a proliferation. Should we arrive at conflicting requirements we can solve that later.
It was causing circular imports otherwise.
OK so pandera integration is here and it's pretty clean IMO. That said, this is going to be tricky due to decorators... E.g. when we do an `extract_columns` on a...
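Presumably the tricky case is something like the following (the decorator ordering and the `schema` kwarg here are my assumptions; the open question is whether validation sees the whole dataframe or each extracted column):

```python
@extract_columns('height_zero_mean', 'weight_zero_mean')  # splits the dataframe output into columns
@check_output(schema=my_pandera_dataframe_schema)         # hypothetical pandera-based check
def zero_mean_features(df: pd.DataFrame) -> pd.DataFrame:
    ...
```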
Yeah just documentation thoughts.
```markdown
## Default Validators

The available default validators are listed in the variable `AVAILABLE_DEFAULT_VALIDATORS`
in `default_validators.py`. To add more, please implement the class in that file then add to the list.
There is a test that ensures that everything is added to that list.
```
This seems out of place? Should be in a developer's section?
Yeah, it's a little grouped together now.
```markdown
# Data Quality
```
This file needs to be reordered a bit, I think.
- We want to optimize for someone getting started very easily, thus should list things like `importance` and accessing results up higher. E.g. how do I get something going, how do I configure that basic thing, how do I get the results.
- More complex use cases should be pushed further down.
So for me it's:
- introduction with code to cut & paste
- information on how to customize/tweak that code (list of kwargs, importance levels)
- how to access results
- Pandera integration
- Writing your own custom validators
````markdown
pass
```

The check_output validator takes in arguments that each correspond to one of the default validators.
````
Suggested change:
```diff
-The check_output validator takes in arguments that each correspond to one of the default validators.
+The check_output decorator takes in arguments that each correspond to one of the default validators. For a list of default available validators see ....
```
```python
from hamilton.data_quality import base
from hamilton.data_quality.base import BaseDefaultValidator
```
Can delete the class import, and instead prepend `base.` to where it's used.
```python
class PanderaDataFrameValidator(base.BaseDefaultValidator):
    """Pandera schema validator for dataframes"""
```
link to pandera docs for dataframe schema / link to our docs.
```python
class PanderaSeriesSchemaValidator(base.BaseDefaultValidator):
    """Pandera schema validator for series"""
```
link to pandera docs for series schema / link to our docs.
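For anyone skimming, usage presumably ends up looking roughly like this (the kwarg name and schema details are my assumptions, not confirmed by this PR):

```python
import pandas as pd
import pandera as pa


@check_output(schema=pa.SeriesSchema(float, pa.Check.in_range(0, 1)), importance='warn')
def probability_scores(raw_scores: pd.Series) -> pd.Series:
    ...
```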
It's two words and it should be separated by an `_` to be consistent with all the other validators.
They were inconsistent. This changes them to follow the following convention:
* Name of validation + data type, e.g. MaxFractionNansValidator + PandasSeries
PandasSeries and Primitives are added. This enables one to do:

```python
@check_output(values_in=[LIST, OF, VALUES], ...)
```

Which will check what the function outputs and validate that it is within one of the values provided. Currently the primitives version operates over numbers and strings. I punted on lists and dictionaries -- they should probably be different validator classes.
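A made-up example of what that looks like in practice:

```python
@check_output(values_in=['low', 'medium', 'high'], importance='warn')
def risk_bucket(score: pd.Series) -> pd.Series:
    ...
```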
There were multiple ways the same module was imported. Reduced it to a single way.
Using `inclusive=True` is deprecated. So changing to use `inclusive="both"` to not get a warning from pandas when using this.
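For reference, the change is along these lines (the real call site isn't shown here):

```python
import pandas as pd

series = pd.Series([0.1, 0.5, 0.9])
# series.between(0, 1, inclusive=True) now triggers a deprecation warning in pandas;
# the string form is the replacement
in_range = series.between(0, 1, inclusive="both")
```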
Sorry, merging two commits here. (1) The `name()` was pretty much just the `argument` + `_validator`. So I just encoded that and updated the names of variables and classes to match this format of doing things. Thus (2) is rolled into this, because we should be making sure that arguments, names, and class names follow some semblance of structure.
Prior behavior stopped at the first failure. We don't want that to happen. Instead we want to run through all the checks and log them appropriately; this change does that. So I had to change "act" to accommodate this. Since `act` itself was only used in a single place, I just moved the `if` statement into the BaseDataValidationDecorator. That said, the class structure here feels a little odd -- might be easy to introduce a circular dependency at some point accidentally. But yeah, we need a better mechanism for storing results for people to access.
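In other words, the flow becomes roughly the following (a paraphrase with made-up names, not the actual code):

```python
import logging
from typing import Any, List

from hamilton.data_quality import base

logger = logging.getLogger(__name__)


def run_all_validators(output: Any, validators: List[base.DataValidator], importance: str) -> None:
    # run every validator rather than stopping at the first failure
    results = [(v, v.validate(output)) for v in validators]
    failures = [(v, r) for v, r in results if not r.passes]
    for validator, result in failures:
        logger.warning("Validator %s failed: %s", type(validator).__name__, result.message)
    if failures and importance == 'fail':
        raise ValueError(f"{len(failures)} data quality check(s) failed")
```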
This only works over numbers and strings. If we want to do dicts and lists, we probably want a specific validator for them -- we don't need the if/else checks here.
Fixing a rogue function doc that was not like the other function docs we have set up. Namely, we use `:param NAME:` rather than `@param NAME:`.
This data quality example is based on the example we provided with the outerbounds (metaflow) folks. Its purpose is to show how one might apply data quality. It also shows how to use the same code and make it work for dask*, ray, and spark.

Ray - everything seems to work as intended.
Spark - had to change how the data & some features are computed.
Dask* - had to change how the data & some features are computed.

CAVEAT: right now the validators don't work properly for dask data types. That is because either (1) it's a future object, or (2) we use pandas series syntax when instead we should use the dask-specific syntax. In short - DEFAULT DQ DOES NOT WORK WITH DASK DATA TYPES. BUT it DOES WORK if you're just using dask for multi-processing and not using dask data types. So we need to think about how we might change/inject the validator implementation based on, say, a graph adapter or something... otherwise this forces one to really stick to one data type or another, i.e. pandas or dask dataframe.

Documentation should hopefully be enough to explain what is going on. The only TODO is to create an analogous example using Pandera -- my hope is that it will handle dask datatypes...
Before it did not check anything -- and instead assumed a dictionary of series and scalars. Now if there is only a single value, and it happens to be a dataframe, we will return that, instead of trying to build another dataframe. Adds unit tests for this function.
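Roughly the described behavior (function and parameter names are mine, not the actual ones):

```python
from typing import Any, Dict

import pandas as pd


def build_result(outputs: Dict[str, Any]) -> pd.DataFrame:
    # if there is exactly one output and it's already a dataframe, return it untouched
    if len(outputs) == 1:
        (only_value,) = outputs.values()
        if isinstance(only_value, pd.DataFrame):
            return only_value
    # otherwise keep the prior assumption: a dict of series and scalars
    return pd.DataFrame(outputs)
```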
It did not take in importance or call the super class. Updates unit tests.
Dask datatypes are lazily evaluated. So we need to check whether the "validate" result we get back from pandera is a dask-like object. If so, we then want to "materialize" it so that we can actually compute the validation result. Without this check, they are never evaluated, because nothing downstream asks for the result to be computed.
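i.e. something along these lines (a paraphrase, not the exact code):

```python
from typing import Any


def materialize_if_lazy(validation_result: Any) -> Any:
    # dask-like objects expose .compute(); calling it forces the lazy validation to actually run
    if hasattr(validation_result, "compute"):
        return validation_result.compute()
    return validation_result
```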
Using the same trick as we employed before, we can simply compute a result for the scalar primitive validators to be a valid value to compare against. Without this, things break because we're trying to compare a dask data type thing. Note: we could use a similar strategy for the Pandas Series validators; however, we'd need to do something akin to what pandera does under the hood with `map_partitions` over the dask-like object. I vote to push people to use pandera if they're using dask data types.
Adds one logger statement to ensure things are logged nicely, one by one in the case of a failure -- they were otherwise hard to interpret. Also fixes install instructions for the pandera validators.
So that the output does not take over your whole screen.
This example is virtually identical to `examples/data_quality/simple`. It instead makes the following choices:
1. Separate feature logic file for Pandas on Spark. Just to show another way to cut things. Well that, and to correct the "data type" validation issue with the simple example.
2. Uses Pandera and shows how to validate Series and Dataframes using Pandera + `@check_output`.
In the case there is a failure, it's probably useful to print the valid values expected. Also changes the `applies_to` check to use `issubclass`; this is related to PR feedback, and I'm too lazy to make another commit just for it.
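The `applies_to` tweak presumably looks something like this (class name and signature are guesses for illustration):

```python
import pandas as pd

from hamilton.data_quality import base


class ValuesInValidatorPandasSeries(base.BaseDefaultValidator):  # hypothetical validator, for illustration
    @classmethod
    def applies_to(cls, datatype: type) -> bool:
        # issubclass rather than an equality check, so subclasses of pd.Series also match
        return issubclass(datatype, pd.Series)
```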
Co-authored-by: Stefan Krawczyk <skrawczyk@stitchfix.com>