Merge remote-tracking branch 'upstream/develop' into develop
Alex Sherstinsky committed Jul 3, 2021
2 parents d7c9d51 + b915866 commit 8738fb9
Showing 10 changed files with 96 additions and 81 deletions.
@@ -9,7 +9,7 @@ In this tutorial, you will develop hands-on experience with configuring a Rule-B

- Have a basic understanding of [Metrics in Great Expectations](https://docs.greatexpectations.io/en/latest/reference/core_concepts/metrics.html)
- Have a basic understanding of [Expectation Configurations in Great Expectations](https://docs.greatexpectations.io/en/latest/reference/core_concepts/expectations/expectations.html#expectation-concepts-domain-and-success-keys)
- Have read the sections in Core Concepts on [Profiling](../../../reference/core-concepts#profiling) and [Rule-Based Profilers](../../../reference/core-concepts#rule-based-profilers)
- Have read the sections in Core Concepts on [Profilers](../../../reference/profilers) and [Rule-Based Profilers](../../../reference/profilers#rule-based-profilers)

</Prerequisites>

Binary file modified docs/guides/images/rule_based_profiler_public_interface_diagram.png
68 changes: 1 addition & 67 deletions docs/reference/core-concepts.md
@@ -110,72 +110,6 @@ permissions.
In some cases **the thing that "makes a batch a batch" is the act of attending to it--for example by validating or
profiling the data**. It's all about **your** Expectations.

Great Expectations provides a mechanism to automatically generate expectations, using a feature called a **Profiler**. A Profiler builds an **Expectation Suite** from one or more **Data Assets**. It may also validate the data against the newly-generated Expectation Suite to return a **Validation Result**. There are several Profilers included with Great Expectations.
Great Expectations provides a mechanism to automatically generate expectations, using a feature called a [**Profiler**](./profilers). A Profiler builds an **Expectation Suite** from one or more **Data Assets**. It may also validate the data against the newly-generated Expectation Suite to return a **Validation Result**. There are several Profilers included with Great Expectations.

A Profiler makes it possible to quickly create a starting point for generating expectations about a Dataset. For example, during the `init` flow, Great Expectations currently uses the **UserConfigurableProfiler** to demonstrate important features of **Expectations** by creating and validating an Expectation Suite that has several different kinds of expectations built from a small sample of data. A Profiler is also critical to generating the Expectation Suites used during profiling.

## Rule-Based Profilers

**Rule-Based profilers** allow users to provide a highly configurable specification which is composed of **Rules** to use in order to build an **Expectation Suite** by profiling existing data.

Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data.

A **Rule** in a rule-based profiler could say something like "Look at every column in my Sales table, and if that column is numeric, add an `expect_column_values_to_be_between` Expectation to my Expectation Suite, where the `min_value` for the Expectation is the minimum value for the column, and the `max_value` for the Expectation is the maximum value for the column."

Each rule in a rule-based profiler has three types of components:

1. **DomainBuilders**: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build expectations
1. **ParameterBuilders**: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations
1. **ExpectationConfigurationBuilders**: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder

In the above example, imagine your table of Sales has twenty columns, of which five are numeric:
* Your **DomainBuilder** would inspect all twenty columns, and then yield a list of the five numeric columns
* You would specify two **ParameterBuilders**: one which gets the min of a column, and one which gets a max. Your Profiler would loop over the Domain (or column) list built by the **DomainBuilder** and use the two ParameterBuilders to get the min and max for each column.
* Then the Profiler loops over Domains built by the DomainBuilder and uses the **ExpectationConfigurationBuilders** to add an `expect_column_values_to_be_between` Expectation for each of these Domains, where the `min_value` and `max_value` are the values that we got in the ParameterBuilders.

In addition to Rules, a rule-based profiler enables you to specify **Variables**, which are global and can be used in any of the Rules. For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so.

#### Example Config:
```yaml
variables:
  my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile
    datasource_name: my_sales_datasource
    data_connector_name: monthly_sales
    data_asset_name: sales_data
    data_connector_query:
      index: -1
  mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below
rules:
  my_rule_for_numeric_columns: # This is the name of our Rule
    domain_builder:
      batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax
      class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below
      semantic_types:
        - numeric
    parameter_builders:
      - parameter_name: my_column_min
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.min # This is the metric we want to get with this ParameterBuilder
        metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead.
      - parameter_name: my_column_max
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
    expectation_configuration_builders:
      - expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite
        class_name: DefaultExpectationConfigurationBuilder
        column: $domain.domain_kwargs.column
        min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables
        max_value: $parameter.my_column_max.value
        mostly: $variables.mostly_default
```
You can see another example config containing multiple rules here: [alice_user_workflow_verbose_profiler_config.yml](https://github.com/great-expectations/great_expectations/blob/develop/tests/rule_based_profiler/alice_user_workflow_verbose_profiler_config.yml)
This config is used in the below diagram to provide a better sense of how the different parts of the Profiler config fit together. [You can see a larger version of this file here.](https://github.com/great-expectations/great_expectations/blob/develop/docs/guides/images/rule_based_profiler_public_interface_diagram.png)
![Rule-Based Profiler Public Interface Diagram](../guides/images/rule_based_profiler_public_interface_diagram.png)
### Next Steps
- You can try out a tutorial that walks you through the set-up of a Rule-Based Profiler here: [How to create a new Expectation Suite using Rule Based Profilers](../guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers)
66 changes: 66 additions & 0 deletions docs/reference/profilers.md
@@ -19,3 +19,69 @@ convention that all columns **named** "id" are primary keys, whereas all columns
**suffix** "_id" are foreign keys. In that case, when the team using Great Expectations first encounters a new dataset
that followed the convention, a Profiler could use that knowledge to add an expect_column_values_to_be_unique
Expectation to the "id" column (but not, for example an "address_id" column).

## Rule-Based Profilers

**Rule-Based profilers** allow users to provide a highly configurable specification which is composed of **Rules** to use in order to build an **Expectation Suite** by profiling existing data.

Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data.

A **Rule** in a rule-based profiler could say something like "Look at every column in my Sales table, and if that column is numeric, add an `expect_column_values_to_be_between` Expectation to my Expectation Suite, where the `min_value` for the Expectation is the minimum value for the column, and the `max_value` for the Expectation is the maximum value for the column."

Each rule in a rule-based profiler has three types of components:

1. **DomainBuilders**: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build expectations
1. **ParameterBuilders**: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations
1. **ExpectationConfigurationBuilders**: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder

In the above example, imagine your table of Sales has twenty columns, of which five are numeric:
* Your **DomainBuilder** would inspect all twenty columns, and then yield a list of the five numeric columns
* You would specify two **ParameterBuilders**: one which gets the min of a column, and one which gets a max. Your Profiler would loop over the Domain (or column) list built by the **DomainBuilder** and use the two ParameterBuilders to get the min and max for each column.
* Then the Profiler loops over Domains built by the DomainBuilder and uses the **ExpectationConfigurationBuilders** to add an `expect_column_values_to_be_between` Expectation for each of these Domains, where the `min_value` and `max_value` are the values that we got in the ParameterBuilders.
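
The three roles described above can be sketched in plain Python. This is only an illustration of the control flow, not Great Expectations code: the toy `sales_table`, the helper functions `build_domains`, `build_parameters`, `build_expectation_configuration`, and `run_rule`, and the dictionary layout of the result are all assumptions made for this example.

```python
from typing import Any, Dict, List

# Toy stand-in for last month's Sales table: column name -> list of values.
sales_table: Dict[str, List[Any]] = {
    "order_id": ["a1", "a2", "a3"],
    "quantity": [2, 5, 3],
    "unit_price": [9.99, 4.50, 12.00],
}


def build_domains(table: Dict[str, List[Any]]) -> List[str]:
    """DomainBuilder role: keep only the columns whose values are all numeric."""
    return [
        name
        for name, values in table.items()
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    ]


def build_parameters(table: Dict[str, List[Any]], column: str) -> Dict[str, Any]:
    """ParameterBuilder role: compute the min and max of one Domain (column)."""
    values = table[column]
    return {"my_column_min": min(values), "my_column_max": max(values)}


def build_expectation_configuration(column: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """ExpectationConfigurationBuilder role: assemble one configuration dict."""
    return {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {
            "column": column,
            "min_value": parameters["my_column_min"],
            "max_value": parameters["my_column_max"],
        },
    }


def run_rule(table: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Loop over the Domains, build Parameters, then build one configuration each."""
    suite = []
    for column in build_domains(table):
        parameters = build_parameters(table, column)
        suite.append(build_expectation_configuration(column, parameters))
    return suite


expectation_suite = run_rule(sales_table)
```

Here `order_id` is skipped because it is not numeric, so `expectation_suite` holds one configuration for `quantity` and one for `unit_price`.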

In addition to Rules, a rule-based profiler enables you to specify **Variables**, which are global and can be used in any of the Rules. For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so.

#### Example Config:
```yaml
variables:
  my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile
    datasource_name: my_sales_datasource
    data_connector_name: monthly_sales
    data_asset_name: sales_data
    data_connector_query:
      index: -1
  mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below
rules:
  my_rule_for_numeric_columns: # This is the name of our Rule
    domain_builder:
      batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax
      class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below
      semantic_types:
        - numeric
    parameter_builders:
      - parameter_name: my_column_min
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.min # This is the metric we want to get with this ParameterBuilder
        metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead.
      - parameter_name: my_column_max
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
    expectation_configuration_builders:
      - expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite
        class_name: DefaultExpectationConfigurationBuilder
        column: $domain.domain_kwargs.column
        min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables
        max_value: $parameter.my_column_max.value
        mostly: $variables.mostly_default
```
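
To make the `$` notation in the config above more concrete, here is a minimal sketch of how a `$namespace.dotted.path` string could be looked up in nested dictionaries. It is not the resolver Great Expectations uses; `resolve_reference`, the `context` layout, and the sample values are hypothetical and exist only to illustrate the idea.

```python
from typing import Any, Dict


def resolve_reference(value: Any, context: Dict[str, Any]) -> Any:
    """Resolve a '$namespace.dotted.path' string against nested dictionaries.

    Anything that is not a string starting with '$' is returned unchanged.
    """
    if not (isinstance(value, str) and value.startswith("$")):
        return value
    node: Any = context
    for key in value[1:].split("."):
        node = node[key]  # an unknown reference raises KeyError
    return node


# Hypothetical state after the DomainBuilder and ParameterBuilders have run
# for a single numeric column named "quantity".
context = {
    "variables": {"mostly_default": 0.95},
    "domain": {"domain_kwargs": {"column": "quantity"}},
    "parameter": {
        "my_column_min": {"value": 2},
        "my_column_max": {"value": 5},
    },
}

builder_config = {
    "expectation_type": "expect_column_values_to_be_between",
    "column": "$domain.domain_kwargs.column",
    "min_value": "$parameter.my_column_min.value",
    "max_value": "$parameter.my_column_max.value",
    "mostly": "$variables.mostly_default",
}

resolved = {key: resolve_reference(val, context) for key, val in builder_config.items()}
```

In this sketch, Variables, the current Domain, and the computed Parameters live in separate namespaces, which is why the same `$` prefix can address all three.
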
You can see another example config containing multiple rules here: [alice_user_workflow_verbose_profiler_config.yml](https://github.com/great-expectations/great_expectations/blob/develop/tests/rule_based_profiler/alice_user_workflow_verbose_profiler_config.yml)
This config is used in the below diagram to provide a better sense of how the different parts of the Profiler config fit together. [You can see a larger version of this file here.](https://github.com/great-expectations/great_expectations/blob/develop/docs/guides/images/rule_based_profiler_public_interface_diagram.png)
![Rule-Based Profiler Public Interface Diagram](../guides/images/rule_based_profiler_public_interface_diagram.png)
### Next Steps
- You can try out a tutorial that walks you through the set-up of a Rule-Based Profiler here: [How to create a new Expectation Suite using Rule Based Profilers](../guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers)
2 changes: 2 additions & 0 deletions docs_rtd/changelog.rst
@@ -11,6 +11,8 @@ Develop
- Support for both single-batch and multi-batch use-cases showcased
- Addition of the "bootstrap" mode of parameter estimation (default) to NumericMetricRangeMultiBatchParameterBuilder
- Initial documentation
* [BUGFIX] Modify read_excel() to handle new optional-dependency openpyxl for pandas >= 1.3.0 #2989


0.13.21
-----------------
8 changes: 7 additions & 1 deletion great_expectations/util.py
@@ -493,7 +493,13 @@ def read_excel(
     """
     import pandas as pd

-    df = pd.read_excel(filename, *args, **kwargs)
+    try:
+        df = pd.read_excel(filename, *args, **kwargs)
+    except ImportError:
+        raise ImportError(
+            "Pandas now requires 'openpyxl' as an optional-dependency to read Excel files. Please use pip or conda to install openpyxl and try again"
+        )
+
     if dataset_class is None:
         verify_dynamic_loading_support(module_name=module_name)
         dataset_class = load_class(class_name=class_name, module_name=module_name)
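
The change in this hunk follows a common pattern: catch the `ImportError` raised when an optional dependency is missing and re-raise it with installation advice. A standalone sketch of that pattern follows. The helper `call_with_dependency_hint` and the stand-in `fake_reader` are hypothetical, and the explicit `from err` chaining is a variation that the commit itself does not use.

```python
def call_with_dependency_hint(func, *args, **kwargs):
    """Call func; if it fails with ImportError, re-raise with install advice."""
    try:
        return func(*args, **kwargs)
    except ImportError as err:
        # "from err" keeps the original exception available as __cause__.
        raise ImportError(
            "Reading Excel files requires the optional dependency 'openpyxl'. "
            "Install it with pip or conda and try again."
        ) from err


def fake_reader(filename):
    """Hypothetical stand-in for a reader whose optional dependency is missing."""
    raise ImportError("Missing optional dependency 'openpyxl'.")


try:
    call_with_dependency_hint(fake_reader, "sales.xlsx")
except ImportError as err:
    message = str(err)
    original_cause = err.__cause__
```

Chaining keeps the underlying error visible in the traceback, which can help when the `ImportError` comes from something other than the expected missing package.
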
1 change: 1 addition & 0 deletions requirements-dev-base.txt
@@ -20,6 +20,7 @@ google-cloud-secret-manager>=1.0.0 # all_tests
google-cloud-storage>=1.28.0 # all_tests
isort==5.4.2 # lint
moto[ec2]>=1.3.7,<2.0.0 # all_tests
openpyxl>=3.0.7 # for read_excel test only
pre-commit>=2.6.0 # lint
pyarrow>=0.12.0 # all_tests
pypd==1.1.0 # all_tests
23 changes: 11 additions & 12 deletions sidebars.js
@@ -302,16 +302,16 @@ module.exports = {
{ type: 'doc', id: 'reference/expectations/distributional-expectations' },
{ type: 'doc', id: 'reference/expectations/expectations' },
{ type: 'doc', id: 'reference/expectations/implemented-expectations' },
{ type: 'doc', id: 'reference/expectation-suite-operations' },
]
{ type: 'doc', id: 'reference/expectation-suite-operations' }
]
},
{ type: 'doc', id: 'reference/metrics' },
{ type: 'doc', id: 'reference/profilers' },
{ type: 'doc', id: 'reference/expectations/result-format' },
{ type: 'doc', id: 'reference/expectations/standard-arguments' },
{ type: 'doc', id: 'reference/stores' },
{ type: 'doc', id: 'reference/dividing-data-assets-into-batches' },
{ type: 'doc', id: 'reference/validation' },
{ type: 'doc', id: 'reference/validation' }
]
},
{
@@ -327,42 +327,41 @@ module.exports = {
collapsed: true,
items: [
{ type: 'doc', id: 'reference/spare-parts' }
]
]
},
{
type: 'category',
label: 'API Reference',
collapsed: true,
items: [
{ type: 'doc', id: 'reference/api-reference' }
]
}
]
]
}
]
},
{
type: 'category',
label: 'Community Resources',
collapsed: true,
items: [
{ type: 'doc', id: 'community' }
{ type: 'doc', id: 'community' }
]
},
{
type: 'category',
label: 'Contributing',
collapsed: true,
items: [
{ type: 'doc', id: 'contributing/contributing' }
{ type: 'doc', id: 'contributing/contributing' }
]
},
{
type: 'category',
label: 'Changelog',
collapsed: true,
items: [
{ type: 'doc', id: 'changelog' }

{ type: 'doc', id: 'changelog' }
]
},
}
]
}
Original file line number Diff line number Diff line change
@@ -6,9 +6,11 @@
profiler_config = """
# This profiler is meant to be used on the NYC taxi data (yellow_trip_data_sample_<YEAR>-<MONTH>.csv)
# located in tests/test_sets/taxi_yellow_trip_data_samples/
variables:
  confidence_level: 9.75e-1
  mostly: 1.0
rules:
  row_count_rule:
    domain_builder:
5 changes: 5 additions & 0 deletions tests/test_great_expectations.py
@@ -25,6 +25,7 @@
from great_expectations.data_context.util import file_relative_path
from great_expectations.dataset import MetaPandasDataset, PandasDataset
from great_expectations.exceptions import InvalidCacheValueError
from great_expectations.util import is_library_loadable

try:
from unittest import mock
@@ -1014,6 +1015,10 @@ def test_read_json(self):
        assert isinstance(df, PandasDataset)
        assert sorted(list(df.keys())) == ["x", "y", "z"]

    @pytest.mark.skipif(
        not is_library_loadable(library_name="openpyxl"),
        reason="GE uses pandas to read excel files, which requires openpyxl",
    )
    def test_read_excel(self):
        script_path = os.path.dirname(os.path.realpath(__file__))
        df = ge.read_excel(
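
The `skipif` guard above depends on being able to check whether a library can be imported without importing it. The body of `is_library_loadable` is not part of this diff, so the following is only a sketch of how such a check might look using the standard library's `importlib.util.find_spec`. The name `library_is_loadable` is hypothetical and is not the Great Expectations function.

```python
import importlib.util


def library_is_loadable(library_name: str) -> bool:
    """Return True if an import of library_name would find a module."""
    try:
        return importlib.util.find_spec(library_name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError when the parent package of a
        # dotted name is missing, and ValueError when an already-imported
        # module has no __spec__ set.
        return False


json_available = library_is_loadable("json")
bogus_available = library_is_loadable("surely_not_an_installed_module_12345")
```

A check like this can be passed to `pytest.mark.skipif` so that the Excel test is skipped, rather than failed, on environments where `openpyxl` is not installed.
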