This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Data quality #115

Merged
merged 43 commits into from
Jul 13, 2022


elijahbenizzy
Collaborator

@elijahbenizzy elijahbenizzy commented Apr 16, 2022

[Short description explaining the high-level reason for the pull request]

Changes

Testing

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python - local testing

  • python 3.6
  • python 3.7

@elijahbenizzy
Collaborator Author

See docstring for check_output for basic design.

Will need to rebase/fix some stuff up. However, there are some questions that this raises:

  1. How to handle the notion of "anonymous nodes", e.g. nodes that are used as intermediaries in a subdag. Our example of this is the node that computes the results before feeding them to a data validator. I propose a central name_node(prefix, suffix, **kwargs) function that does a stable hash of the inputs, allowing for readable prefixes/suffixes.

  2. The weird API that we have, in which you pass either a set of kwargs for the default resolver or a list of custom resolvers. I propose two separate APIs: check_outputs and check_outputs.custom(*validators). Ideally they should both go to the same API, but the problem is that we can't resolve arguments without knowing the dependent node types...

  3. Say we decorate @extract_columns with @check_output. This brings up some questions. (1) Should this create one DQ check for each column? IMO yes. (2) What if you just want to decorate specific columns? I think this is too much complexity to handle for now, but we could add provisions later on. (3) What

  4. How to get a report of DQ results? IMO node tags should be decorated with DQ metadata -- we should be able to pretty easily query it.
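
To make point 1 concrete, here is a minimal sketch of what a name_node helper could look like. The signature comes from the proposal above; the hashing scheme (JSON serialization plus a truncated SHA-256 digest) is purely an assumption for illustration, not code from this PR. Python's built-in hash() is avoided because string hashing is randomized per process, so it would not be stable across runs.

```python
import hashlib
import json


def name_node(prefix: str, suffix: str, **kwargs) -> str:
    """Builds a readable, deterministic name for an anonymous node.

    Hypothetical helper: the naming layout and hash choice are assumptions.
    """
    # Sort keys so the same kwargs serialize identically regardless of the
    # order in which they were passed. Values that JSON cannot encode fall
    # back to repr(), which is only stable if their repr is stable.
    payload = json.dumps(kwargs, sort_keys=True, default=repr)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{digest}_{suffix}"


raw_name = name_node("my_function", "raw", range=(0, 1), allow_nans=False)
same_name = name_node("my_function", "raw", allow_nans=False, range=(0, 1))
other_name = name_node("my_function", "raw", range=(0, 2), allow_nans=False)
```

Here raw_name and same_name are identical because only the argument order differs, while other_name differs because one of the inputs changed.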

@elijahbenizzy elijahbenizzy changed the base branch from main to tag-nodes April 16, 2022 23:39
@elijahbenizzy
Collaborator Author

elijahbenizzy commented Apr 16, 2022

Some tasks prior to this being ready:

  • Add a set of new default decorators
  • Add documentation
  • Add unit tests
  • Rebase from main
  • Separate into a few small commits
  • Merge tagging branch, set this to merge against main

@elijahbenizzy elijahbenizzy force-pushed the data-quality branch 2 times, most recently from 5523169 to 3b9ab68 Compare April 17, 2022 22:19
Base automatically changed from tag-nodes to main April 30, 2022 04:04
@skrawcz skrawcz linked an issue Apr 30, 2022 that may be closed by this pull request
@elijahbenizzy elijahbenizzy force-pushed the data-quality branch 4 times, most recently from a48f8cd to 5eb2e98 Compare May 9, 2022 00:13
@elijahbenizzy elijahbenizzy force-pushed the data-quality branch 5 times, most recently from 05db9c7 to da4e064 Compare June 1, 2022 19:00
@elijahbenizzy elijahbenizzy force-pushed the data-quality branch 2 times, most recently from afaf503 to 94d478a Compare June 7, 2022 23:35
@elijahbenizzy
Collaborator Author

elijahbenizzy commented Jun 7, 2022

Alright, not 100% done but functional. Remaining:

  • End to end examples in documentation
  • Solve a bug in which layering decorators will cause two nodes of the same name
  • Fish around for feedback
  • Version bump
  • Fix tests

And then later tasks:

  • Publish
  • Provide some integrations (looking at whylabs)
  • Market

@elijahbenizzy elijahbenizzy force-pushed the data-quality branch 2 times, most recently from 2b57408 to 8e6d7da Compare June 9, 2022 16:06
Collaborator

@skrawcz skrawcz left a comment

Almost there! Would have to play around with it and see how it impacts the GraphAdapters though. We will also want to show that it doesn't interfere with other decorators, or at least how it interacts with them, e.g. parameterize. This will then impact documentation, since I think this feature is large enough that we probably need to showcase a bunch of uses for it for people to cut and paste.

Discussion points to chat about:

  1. The double decorator issue seems like something we could do without? Make an issue to track it though?
  2. 😆 at last commit message.
  3. We should standardize how we import, I think module imports are cleaner in general.
  4. The naming of the validators I think should be as specific as possible, thus if they only operate over pandas series, we should have pandas and series in the name.
  5. What else do you need help on? I didn't check the test coverage, but that would be something to double check that we have added appropriate unit tests.

import numbers
from typing import Any, Type, List, Optional, Tuple

from hamilton.data_quality.base import DataValidator, ValidationResult
Collaborator

I think we should use the style of importing the module, like Google does. That way, when you're reading the class, you know whether it's defined locally or not. Default to local. The exceptions here are typing classes.

Collaborator Author

Yep, happy to do that

class ValidationResult:
    passes: bool  # Whether or not this passed the validation
    message: str  # Error message or success message
    diagnostics: Dict[str, Any] = dataclasses.field(default_factory=dict)  # Any extra diagnostics information needed, free-form
Collaborator

Are there any set keys emerging? Maybe a TypedDict would help here?

Collaborator Author

Haven't noticed any common ones. In fact, those will all be part of the dataclass itself... This is purely unstructured, but stable stuff.

Comment on lines 182 to 187
class NansAllowedValidatorPandas(MaxFractionNansValidatorPandas):
    def __init__(self, allow_nans: bool, importance: str):
        if allow_nans:
            raise ValueError(f"Only allowed to block Nans with this validator."
                             f"Otherwise leave blank or specify the percentage of Nans using {MaxFractionNansValidatorPandas.name()}")
        super(NansAllowedValidatorPandas, self).__init__(max_fraction_nan=0 if not allow_nans else 1.0, importance=importance)
Collaborator

Language here is confusing. Should there not be any if statement here at all?

Collaborator

E.g. I could see both allow_nans=True and allow_nans=False being fine uses.

Collaborator

People might do this just to be explicit with expectations.

Collaborator Author

Yeah, my original thought was that allow_nans=False was the only one you'd want, e.g. it's a no-op if you have allow_nans=True. Recall that this will be mixed with a bunch of other params...
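
For reference, the alternative raised in this thread (accepting both values, with allow_nans=True acting as a no-op) can be sketched as follows. This is a simplified stand-in that works on plain Python sequences rather than pandas Series, and the function names are hypothetical; it is not the code in this PR.

```python
import math
from typing import Sequence


def fraction_nans(values: Sequence[float]) -> float:
    """Fraction of entries that are NaN (0.0 for an empty sequence)."""
    if len(values) == 0:
        return 0.0
    return sum(1 for v in values if math.isnan(v)) / len(values)


def passes_max_fraction_nans(values: Sequence[float], max_fraction_nans: float) -> bool:
    """Passes when the NaN fraction does not exceed the given threshold."""
    return fraction_nans(values) <= max_fraction_nans


def passes_allow_nans(values: Sequence[float], allow_nans: bool) -> bool:
    """allow_nans=True tolerates any fraction (a no-op); False tolerates none."""
    return passes_max_fraction_nans(values, 1.0 if allow_nans else 0.0)


data = [0.1, float("nan"), 0.5, 0.9]
```

With this shape, allow_nans is just a special case of the max-fraction check, so being explicit with allow_nans=True costs nothing and raises no error.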

@elijahbenizzy elijahbenizzy force-pushed the data-quality branch 2 times, most recently from 4444cb2 to 3b1d1b0 Compare June 15, 2022 00:34
@elijahbenizzy
Collaborator Author

OK, this is pretty close for a first release (rc version).

@elijahbenizzy
Collaborator Author

elijahbenizzy commented Jun 17, 2022

A quick summary for the whylogs folks.

what this does
This PR adds a new decorator for hamilton functions. This decorator does data quality checks. It has two importance levels: warn and fail -- should be easy to figure out what these do :)

The decorator can be used in two ways:

  1. Passing in default arguments to use a set of included validators (check_output)
  2. Passing in custom validators (check_output_custom)

(1) looks something like this:

# Note that this actually produces two validators, as it uses two arguments
@check_output(range=(0, 1), allow_nans=False, importance='fail')
def data_between_0_and_1_with_no_nans(some_input: pd.Series) -> pd.Series:
    ...

Where (2) looks something like this:

# Note that this also produces two validators, as two are passed in
@check_output_custom(
    MyCustomDataInRangeValidator(low=0, high=1),
    MyCustomNoNansAllowedValidator())
def data_between_0_and_1_with_no_nans(some_input: pd.Series) -> pd.Series:
    ...

how this fits into hamilton/the hamilton plan
This is another step towards making an all-encompassing dataflow tool. The small set of included validators should cover some base cases (and are extensible). We hope to encourage

your task

We would love feedback! From you...

  1. Check out the branch/mess around with it -- write a basic dataflow to use for testing
  2. Look at the class DataValidator - can you fit your client into it somehow?
  3. Write one!

Happy to pair as needed. Also happy to make changes in case we need any more abstractions.
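
As a starting point for steps 2 and 3, here is a rough, self-contained sketch of what a custom validator could look like. The ValidationResult fields match the dataclass reviewed earlier in this PR; the DataValidator stand-in below (its constructor and the applies_to/validate method names) is an assumption made so the example runs on its own, so check the real base class in the branch before relying on it. NumberInRangeValidator is a made-up example.

```python
import abc
import dataclasses
import numbers
from typing import Any, Dict, Tuple, Type


@dataclasses.dataclass
class ValidationResult:
    passes: bool  # Whether or not this passed the validation
    message: str  # Error message or success message
    diagnostics: Dict[str, Any] = dataclasses.field(default_factory=dict)


class DataValidator(abc.ABC):
    """Stand-in base class; the method names here are assumptions."""

    def __init__(self, importance: str):
        if importance not in ("warn", "fail"):
            raise ValueError(f"importance must be 'warn' or 'fail', got {importance!r}")
        self.importance = importance

    @abc.abstractmethod
    def applies_to(self, datatype: Type) -> bool:
        """Whether this validator can handle outputs of the given type."""

    @abc.abstractmethod
    def validate(self, data: Any) -> ValidationResult:
        """Runs the check and reports the outcome."""


class NumberInRangeValidator(DataValidator):
    """Hypothetical custom validator: checks a scalar lies within [low, high]."""

    def __init__(self, low: float, high: float, importance: str = "warn"):
        super().__init__(importance=importance)
        self.range: Tuple[float, float] = (low, high)

    def applies_to(self, datatype: Type) -> bool:
        return issubclass(datatype, numbers.Real)

    def validate(self, data: Any) -> ValidationResult:
        low, high = self.range
        passes = low <= data <= high
        return ValidationResult(
            passes=passes,
            message=f"{data} is {'inside' if passes else 'outside'} [{low}, {high}]",
            diagnostics={"range": self.range, "value": data},
        )


validator = NumberInRangeValidator(low=0, high=1, importance="fail")
result = validator.validate(1.5)
```

A client integration would wrap its own check inside validate and put anything client-specific into diagnostics.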

We now have a test-integrations section in config.yml. I've
decided to group them together to avoid a proliferation. Should
we arrive at conflicting requirements we can solve that later.
It was causing circular imports otherwise.
@elijahbenizzy
Collaborator Author

elijahbenizzy commented Jul 6, 2022

OK so pandera integration is here and it's pretty clean IMO. That said, this is going to be tricky due to decorators...

E.g. when we do an extract_columns on a DataFrame[df_schema], to do this right we'll need it to handle the typing correctly, so that each node produces the right Series[series_schema], where series_schema represents the subschema of df_schema. Doable, but not sure how worth it it is (as of now) from the implementation perspective.

Collaborator

@skrawcz skrawcz left a comment

Yeah just documentation thoughts.

Comment on lines +33 to +37
## Default Validators

The available default validators are listed in the variable `AVAILABLE_DEFAULT_VALIDATORS`
in `default_validators.py`. To add more, please implement the class in that file then add to the list.
There is a test that ensures that everything is added to that list.
Collaborator

This seems out of place? Should be in a developer's section?

Collaborator Author

Yeah it's a little grouped together now

@@ -0,0 +1,150 @@
# Data Quality
Collaborator

This file needs to be reordered a bit I think.

  1. We want to optimize for someone getting started very easily - thus should list things like importance and accessing results up higher. E.g. how do I get something going, how do I configure that basic thing, how do I get the results.
  2. More complex use cases should be pushed further down.

So for me it's:

  1. introduction with code to cut & paste
  2. information on how to customize/tweak that code (list of kwargs, importance levels)
  3. how to access results
  4. Pandera integration
  5. Writing your own custom validators

pass
```

The check_output validator takes in arguments that each correspond to one of the default validators.
Collaborator

Suggested change
- The check_output validator takes in arguments that each correspond to one of the default validators.
+ The check_output decorator takes in arguments that each correspond to one of the default validators. For a list of default available validators see ....

Comment on lines +8 to +9
from hamilton.data_quality import base
from hamilton.data_quality.base import BaseDefaultValidator
Collaborator

Can delete the class import, and instead prepend base. to where it's used.



class PanderaDataFrameValidator(base.BaseDefaultValidator):
    """Pandera schema validator for dataframes"""
Collaborator

link to pandera docs for dataframe schema / link to our docs.



class PanderaSeriesSchemaValidator(base.BaseDefaultValidator):
    """Pandera schema validator for series"""
Collaborator

link to pandera docs for series schema / link to our docs.

skrawcz added 20 commits July 12, 2022 17:06
It's two words and it should be separated by an `_` to be consistent
with all the other validators.

They were inconsistent. This changes them to follow this format:

* Name of validation + Data type

E.g. MaxFractionNansValidator + PandasSeries

PandasSeries and Primitives are added.

This enables one to do:

```python
@check_output(values_in=[ LIST,  OF, VALUES ], ...)
```
Which will check what the function outputs and validate that it
is within one of the values provided.
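
The idea behind `values_in` for primitive outputs can be sketched roughly as below. This is an illustration only: `validate_values_in` is a hypothetical standalone function, not the validator class added in this commit, and the messages are made up.

```python
from typing import Any, Iterable, Tuple


def validate_values_in(output: Any, values_in: Iterable[Any]) -> Tuple[bool, str]:
    """Checks a scalar output against a list of allowed values."""
    allowed = list(values_in)
    # Only numbers and strings are handled in this sketch;
    # lists and dictionaries are deliberately rejected.
    if not isinstance(output, (int, float, str)):
        raise TypeError(f"Unsupported output type: {type(output).__name__}")
    if output in allowed:
        return True, f"{output!r} is one of the allowed values."
    # Include the allowed values in the message so a failure is easy to debug.
    return False, f"{output!r} is not one of the allowed values {allowed}."


ok, ok_message = validate_values_in("a", ["a", "b"])
bad, bad_message = validate_values_in(3, [1, 2])
```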

Currently the primitives validator operates over numbers and strings.
I punted on lists and dictionaries -- they should probably be
different validator classes.

There were multiple ways the same module was imported.
Reduced it to a single way.

Using `inclusive=True` is deprecated. So changing
to use `both` to not get a warning from pandas when
using this.

Sorry, merging two commits here.

(1) The `name()` was pretty much just the `argument` + `_validator`.
So I just encoded that and updated the names of variables, and classes
to match this format of doing things.
Thus (2) is rolled into this. Because we should be making sure that
arguments, names, class names, are following some semblance of
structure.

Prior behavior stopped at the first failure. We don't want
that to happen. Instead we want to run through all the
checks and log them appropriately, this change does that.
So I had to change "act" to accommodate this.
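
A minimal sketch of the "run everything, then decide" behavior described above, assuming each check is a plain predicate tagged with a name and an importance level. The `run_all_checks` function and `DataValidationError` class are hypothetical names for illustration, not the code in this commit.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger(__name__)

# (name, importance, check) triples; the checks are plain predicates in this sketch.
Check = Tuple[str, str, Callable[[Any], bool]]


class DataValidationError(Exception):
    """Hypothetical error raised when any 'fail'-level check does not pass."""


def run_all_checks(output: Any, checks: List[Check]) -> List[str]:
    """Runs every check, logs each failure, and only then decides whether to raise."""
    warned: List[str] = []
    failed: List[str] = []
    for name, importance, check in checks:
        if check(output):
            continue
        logger.warning("[%s] validation failed (importance=%s)", name, importance)
        (failed if importance == "fail" else warned).append(name)
    if failed:
        raise DataValidationError(f"Failed validators: {failed}")
    return warned


checks = [
    ("range_validator", "warn", lambda x: 0 <= x <= 1),
    ("positive_validator", "warn", lambda x: x > 0),
]
warnings_raised = run_all_checks(-2, checks)
```

Both warn-level checks are reported for the same output, rather than only the first one encountered.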

Since `act` itself was only used in a single place, I just
moved the `if` statement into the BaseDataValidationDecorator.

That said, the class structure here feels a little odd -- might
be easy to introduce a circular dependency at some point accidentally.
But yeah we need a better mechanism for storing results
for people to access.

This only works over numbers and strings.
If we want to do dicts and lists, we probably
want a specific validator for them -- we don't
need the if/else checks here.

Fixing rogue function doc that was not like
the other function doc we have set up.
Namely, we use `:param NAME:` rather than
`@param NAME:`.

This data quality example is based on the example
we provided with the outerbounds(metaflow) folks.

Its purpose is to show how one might apply data quality.
It also shows how to use the same code and make it
work for dask*, ray, spark.

Ray - everything seems to work as intended.

Spark - had to change how the data & some features
are computed.

Dask* - had to change how the data & some features
are computed.  CAVEAT: right now the validators don't work properly
for dask data types.  That is because either (1) it's a future object, or (2)
we use pandas series syntax, when instead we should
use the dask specific syntax. In short - DEFAULT DQ DOES NOT WORK WITH
DASK DATA TYPES. BUT it DOES WORK if you're just using spark
for multi-processing, and not using dask data types.

So we need to think how we might change/inject the validator implementation
based on say a graph adapter or something... otherwise this forces one
to really stick to one data type or another, i.e. pandas or dask dataframe.

Documentation should hopefully be enough to document what is going on.
The only TODO is to create an analogous example using Pandera -- my hope
is that it will handle dask datatypes...

Before it did not check anything -- and instead assumed
a dictionary of series and scalars.

Now if there is only a single value, and it happens to be a dataframe,
we will return that, instead of trying to build another dataframe.

Adds unit tests for this function.

It did not take in importance or call the super class.

Updates unit tests.

Dask datatypes are lazily evaluated. So we need to check
whether the "validate" result we get back from pandera
is a dask like object. If so, we then want to "materialize" it
so that we can actually compute the validation result.
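
One way to picture the "materialize" step, as a sketch only: duck-type on a `compute()` method (which dask collections expose) and call it when present. The `LazyValue` class is a stand-in so the example runs without dask, and `materialize` is a hypothetical helper; the actual check in this commit may differ.

```python
from typing import Any, Callable


class LazyValue:
    """Stand-in for a lazily evaluated object (e.g. a dask collection)."""

    def __init__(self, thunk: Callable[[], Any]):
        self._thunk = thunk
        self.computed = False

    def compute(self) -> Any:
        self.computed = True
        return self._thunk()


def materialize(result: Any) -> Any:
    """Forces evaluation when the result looks lazy; otherwise returns it as-is."""
    compute = getattr(result, "compute", None)
    if callable(compute):
        return compute()
    return result


lazy = LazyValue(lambda: [1, 2, 3])
eager_result = materialize(5)
lazy_result = materialize(lazy)
```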

Without this check, they are never evaluated, because
nothing downstream asks for the result to be computed.

Using the same trick as we employed before, we can simply
compute a result for the scalar primitive validators to be a valid
value to compare against.

Without this, things break because we're trying to compare
a dask data type thing.

Note: we could do a similar strategy for the Pandas Series validators,
however we'd need to do something akin to what pandera does
under the hood with `map_partitions` over the dask like object.
I vote to push people to use pandera if they're using dask data types.

Adds one logger statement to ensure things are logged nicely, one by one in the
case of a failure -- they were otherwise hard to interpret.

Fixes install instructions otherwise in the case of pandera validators.

So that the output does not take over your whole screen.

This example is virtually identical to `examples/data_quality/simple`. It instead
makes the following choices:

1. Separate feature logic file for Pandas on Spark. Just to show another way to
cut things. Well that, and to correct the "data type" validation issue with the simple example.
2. Uses Pandera and shows how to validate Series and Dataframes using Pandera +
`@check_output`.

In the case there is a failure, it's probably useful to
print the valid values expected.

Also changes on applies_to check for `issubclass`,
this is related to PR feedback, and I'm too lazy
to make another commit just for it.
@skrawcz skrawcz self-requested a review July 13, 2022 16:19
Co-authored-by: Stefan Krawczyk <skrawczyk@stitchfix.com>
@elijahbenizzy elijahbenizzy merged commit 860c60a into main Jul 13, 2022
@elijahbenizzy elijahbenizzy deleted the data-quality branch July 13, 2022 17:07
Development

Successfully merging this pull request may close these issues.

Prototype Data Quality Feature
2 participants