diff --git a/docs/guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers.md b/docs/guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers.md index 952dac5a9429..a0d33245c8ca 100644 --- a/docs/guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers.md +++ b/docs/guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers.md @@ -9,7 +9,7 @@ In this tutorial, you will develop hands-on experience with configuring a Rule-B - Have a basic understanding of [Metrics in Great Expectations](https://docs.greatexpectations.io/en/latest/reference/core_concepts/metrics.html) - Have a basic understanding of [Expectation Configurations in Great Expectations](https://docs.greatexpectations.io/en/latest/reference/core_concepts/expectations/expectations.html#expectation-concepts-domain-and-success-keys) -- Have read the sections in Core Concepts on [Profiling](../../../reference/core-concepts#profiling) and [Rule-Based Profilers](../../../reference/core-concepts#rule-based-profilers) +- Have read the sections in Core Concepts on [Profilers](../../../reference/profilers) and [Rule-Based Profilers](../../../reference/profilers#rule-based-profilers) diff --git a/docs/guides/images/rule_based_profiler_public_interface_diagram.png b/docs/guides/images/rule_based_profiler_public_interface_diagram.png index 79ff1a7a076a..9e7bdbe28387 100644 Binary files a/docs/guides/images/rule_based_profiler_public_interface_diagram.png and b/docs/guides/images/rule_based_profiler_public_interface_diagram.png differ diff --git a/docs/reference/core-concepts.md b/docs/reference/core-concepts.md index e5305a3ebca3..ee26de85ebd3 100755 --- a/docs/reference/core-concepts.md +++ b/docs/reference/core-concepts.md @@ -110,72 +110,6 @@ permissions. 
In some cases **the thing that "makes a batch a batch" is the act of attending to it--for example by validating or profiling the data**. It's all about **your** Expectations. -Great Expectations provides a mechanism to automatically generate expectations, using a feature called a **Profiler**. A Profiler builds an **Expectation Suite** from one or more **Data Assets**. It may also validates the data against the newly-generated Expectation Suite to return a **Validation Result**. There are several Profilers included with Great Expectations. +Great Expectations provides a mechanism to automatically generate expectations, using a feature called a [**Profiler**](./profilers). A Profiler builds an **Expectation Suite** from one or more **Data Assets**. It may also validate the data against the newly-generated Expectation Suite to return a **Validation Result**. There are several Profilers included with Great Expectations. A Profiler makes it possible to quickly create a starting point for generating expectations about a Dataset. For example, during the `init` flow, Great Expectations currently uses the **UserConfigurableProfiler** to demonstrate important features of **Expectations** by creating and validating an Expectation Suite that has several different kinds of expectations built from a small sample of data. A Profiler is also critical to generating the Expectation Suites used during profiling. - -## Rule-Based Profilers - -**Rule-Based profilers** allow users to provide a highly configurable specification which is composed of **Rules** to use in order to build an **Expectation Suite** by profiling existing data. - -Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data.
- -A **Rule** in a rule-based profiler could say something like "Look at every column in my Sales table, and if that column is numeric, add an `expect_column_values_to_be_between` Expectation to my Expectation Suite, where the `min_value` for the Expectation is the minimum value for the column, and the `max_value` for the Expectation is the maximum value for the column." - -Each rule in a rule-based profiler has three types of components: - -1. **DomainBuilders**: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build expectations -1. **ParameterBuilders**: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations -1. **ExpectationConfigurationBuilders**: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder - -In the above example, imagine your table of Sales has twenty columns, of which five are numeric: -* Your **DomainBuilder** would inspect all twenty columns, and then yield a list of the five numeric columns -* You would specify two **ParameterBuilders**: one which gets the min of a column, and one which gets a max. Your Profiler would loop over the Domain (or column) list built by the **DomainBuilder** and use the two ParameterBuilders to get the min and max for each column. -* Then the Profiler loops over Domains built by the DomainBuilder and uses the **ExpectationConfigurationBuilders** to add a `expect_column_values_to_between` column for each of these Domains, where the `min_value` and `max_value` are the values that we got in the ParameterBuilders. - -In addition to Rules, a rule-based profiler enables you to specify **Variables**, which are global and can be used in any of the Rules. 
For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so. - -#### Example Config: -```yaml -variables: - my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile - datasource_name: my_sales_datasource - data_connector_name: monthly_sales - data_asset_name: sales_data - data_connector_query: - index: -1 - mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below -rules: - my_rule_for_numeric_columns: # This is the name of our Rule - domain_builder: - batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax - class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below - semantic_types: - - numeric - parameter_builders: - - parameter_name: my_column_min - class_name: MetricParameterBuilder - batch_request: $variables.my_last_month_sales_batch_request - metric_name: column.min # This is the metric we want to get with this ParameterBuilder - metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead. 
- - parameter_name: my_column_max - class_name: MetricParameterBuilder - batch_request: $variables.my_last_month_sales_batch_request - metric_name: column.max - metric_domain_kwargs: $domain.domain_kwargs - expectation_configuration_builders: - - expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite - class_name: DefaultExpectationConfigurationBuilder - column: $domain.domain_kwargs.column - min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables - max_value: $parameter.my_column_max.value - mostly: $variables.mostly_default -``` - -You can see another example config containing multiple rules here: [alice_user_workflow_verbose_profiler_config.yml](https://github.com/great-expectations/great_expectations/blob/develop/tests/rule_based_profiler/alice_user_workflow_verbose_profiler_config.yml) - -This config is used in the below diagram to provide a better sense of how the different parts of the Profiler config fit together. 
[You can see a larger version of this file here.](https://github.com/great-expectations/great_expectations/blob/develop/docs/guides/images/rule_based_profiler_public_interface_diagram.png) -![Rule-Based Profiler Public Interface Diagram](../guides/images/rule_based_profiler_public_interface_diagram.png) - -### Next Steps -- You can try out a tutorial that walks you through the set-up of a Rule-Based Profiler here: [How to create a new Expectation Suite using Rule Based Profilers](../guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers) diff --git a/docs/reference/profilers.md b/docs/reference/profilers.md index 6eaaffbdc0df..cff54ac029ee 100644 --- a/docs/reference/profilers.md +++ b/docs/reference/profilers.md @@ -19,3 +19,69 @@ convention that all columns **named** "id" are primary keys, whereas all columns **suffix** "_id" are foreign keys. In that case, when the team using Great Expectations first encounters a new dataset that followed the convention, a Profiler could use that knowledge to add an expect_column_values_to_be_unique Expectation to the "id" column (but not, for example an "address_id" column). + +## Rule-Based Profilers + +**Rule-Based profilers** allow users to provide a highly configurable specification which is composed of **Rules** to use in order to build an **Expectation Suite** by profiling existing data. + +Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data. + +A **Rule** in a rule-based profiler could say something like "Look at every column in my Sales table, and if that column is numeric, add an `expect_column_values_to_be_between` Expectation to my Expectation Suite, where the `min_value` for the Expectation is the minimum value for the column, and the `max_value` for the Expectation is the maximum value for the column." 
+ +Each rule in a rule-based profiler has three types of components: + +1. **DomainBuilders**: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build expectations +1. **ParameterBuilders**: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations +1. **ExpectationConfigurationBuilders**: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder + +In the above example, imagine your table of Sales has twenty columns, of which five are numeric: +* Your **DomainBuilder** would inspect all twenty columns, and then yield a list of the five numeric columns +* You would specify two **ParameterBuilders**: one which gets the min of a column, and one which gets the max. Your Profiler would loop over the Domain (or column) list built by the **DomainBuilder** and use the two ParameterBuilders to get the min and max for each column. +* Then the Profiler loops over Domains built by the DomainBuilder and uses the **ExpectationConfigurationBuilders** to add an `expect_column_values_to_be_between` Expectation for each of these Domains, where the `min_value` and `max_value` are the values that we got in the ParameterBuilders. + +In addition to Rules, a rule-based profiler enables you to specify **Variables**, which are global and can be used in any of the Rules. For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so.
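The three components above can be sketched as plain functions over an in-memory table. This is a toy illustration of the described flow only, not the actual Great Expectations classes; every name in it is hypothetical:

```python
# Toy sketch of the Rule pipeline described above -- NOT the real
# Great Expectations classes; all names here are illustrative only.

def domain_builder(table):
    """Yield the numeric columns of the table as Domains."""
    return [
        col for col, values in table.items()
        if all(isinstance(v, (int, float)) for v in values)
    ]

def parameter_builders(table, column):
    """Compute the Parameters (min and max) for one Domain."""
    return {
        "my_column_min": min(table[column]),
        "my_column_max": max(table[column]),
    }

def expectation_configuration_builder(column, parameters):
    """Assemble one ExpectationConfiguration-like dict from a Domain and its Parameters."""
    return {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {
            "column": column,
            "min_value": parameters["my_column_min"],
            "max_value": parameters["my_column_max"],
        },
    }

# A tiny stand-in for last month's Sales data: two numeric columns, one not.
sales = {
    "total": [10.0, 25.5, 17.2],
    "quantity": [1, 3, 2],
    "region": ["north", "south", "north"],
}

# The Profiler loop: Domains from the DomainBuilder, Parameters per Domain,
# then one ExpectationConfiguration per Domain.
suite = [
    expectation_configuration_builder(col, parameter_builders(sales, col))
    for col in domain_builder(sales)
]
```

Here `suite` ends up with one `expect_column_values_to_be_between` configuration for each of the two numeric columns, bounded by that column's observed min and max.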
+ +#### Example Config: +```yaml +variables: + my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile + datasource_name: my_sales_datasource + data_connector_name: monthly_sales + data_asset_name: sales_data + data_connector_query: + index: -1 + mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below +rules: + my_rule_for_numeric_columns: # This is the name of our Rule + domain_builder: + batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax + class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below + semantic_types: + - numeric + parameter_builders: + - parameter_name: my_column_min + class_name: MetricParameterBuilder + batch_request: $variables.my_last_month_sales_batch_request + metric_name: column.min # This is the metric we want to get with this ParameterBuilder + metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead. 
+ - parameter_name: my_column_max + class_name: MetricParameterBuilder + batch_request: $variables.my_last_month_sales_batch_request + metric_name: column.max + metric_domain_kwargs: $domain.domain_kwargs + expectation_configuration_builders: + - expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite + class_name: DefaultExpectationConfigurationBuilder + column: $domain.domain_kwargs.column + min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables + max_value: $parameter.my_column_max.value + mostly: $variables.mostly_default +``` + +You can see another example config containing multiple rules here: [alice_user_workflow_verbose_profiler_config.yml](https://github.com/great-expectations/great_expectations/blob/develop/tests/rule_based_profiler/alice_user_workflow_verbose_profiler_config.yml) + +This config is used in the below diagram to provide a better sense of how the different parts of the Profiler config fit together. 
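The `$variables.…`, `$parameter.…`, and `$domain.…` references in the config above are essentially dotted lookups into shared dictionaries. A minimal resolver for that substitution syntax might look like the following sketch (my own illustration, not the Profiler's actual implementation):

```python
# Minimal sketch of resolving "$a.b.c" references like those in the config
# above -- illustrative only, not Great Expectations' real resolver.
from functools import reduce

def resolve(ref, context):
    """Resolve a '$a.b.c' reference by walking nested dicts in `context`."""
    if not (isinstance(ref, str) and ref.startswith("$")):
        return ref  # a literal value, nothing to resolve
    return reduce(lambda node, key: node[key], ref[1:].split("."), context)

# Hypothetical state mid-profiling: declared Variables, one built Parameter,
# and the current Domain's kwargs.
context = {
    "variables": {"mostly_default": 0.95},
    "parameter": {"my_column_min": {"value": 10.0}},
    "domain": {"domain_kwargs": {"column": "total"}},
}

mostly = resolve("$variables.mostly_default", context)          # -> 0.95
min_value = resolve("$parameter.my_column_min.value", context)  # -> 10.0
column = resolve("$domain.domain_kwargs.column", context)       # -> "total"
```

This is why the same `$` notation works for Variables and for Parameters: both are just paths into the Profiler's shared state.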
[You can see a larger version of this file here.](https://github.com/great-expectations/great_expectations/blob/develop/docs/guides/images/rule_based_profiler_public_interface_diagram.png) +![Rule-Based Profiler Public Interface Diagram](../guides/images/rule_based_profiler_public_interface_diagram.png) + +### Next Steps +- You can try out a tutorial that walks you through the set-up of a Rule-Based Profiler here: [How to create a new Expectation Suite using Rule Based Profilers](../guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers) diff --git a/docs_rtd/changelog.rst b/docs_rtd/changelog.rst index 7c8431d6d2ff..d22f92256cce 100644 --- a/docs_rtd/changelog.rst +++ b/docs_rtd/changelog.rst @@ -11,6 +11,8 @@ Develop - Support for both single-batch and multi-batch use-cases showcased - Addition of the "bootstrap" mode of parameter estimation (default) to NumericMetricRangeMultiBatchParameterBuilder - Initial documentation +* [BUGFIX] Modify read_excel() to handle new optional-dependency openpyxl for pandas >= 1.3.0 #2989 + 0.13.21 ----------------- diff --git a/great_expectations/util.py b/great_expectations/util.py index 138267f5e315..5e7f99398a58 100644 --- a/great_expectations/util.py +++ b/great_expectations/util.py @@ -493,7 +493,13 @@ def read_excel( """ import pandas as pd - df = pd.read_excel(filename, *args, **kwargs) + try: + df = pd.read_excel(filename, *args, **kwargs) + except ImportError: + raise ImportError( + "Pandas now requires 'openpyxl' as an optional-dependency to read Excel files. 
Please use pip or conda to install openpyxl and try again" + ) + if dataset_class is None: verify_dynamic_loading_support(module_name=module_name) dataset_class = load_class(class_name=class_name, module_name=module_name) diff --git a/requirements-dev-base.txt b/requirements-dev-base.txt index b9d4bcd252b8..f0917980c807 100644 --- a/requirements-dev-base.txt +++ b/requirements-dev-base.txt @@ -20,6 +20,7 @@ google-cloud-secret-manager>=1.0.0 # all_tests google-cloud-storage>=1.28.0 # all_tests isort==5.4.2 # lint moto[ec2]>=1.3.7,<2.0.0 # all_tests +openpyxl>=3.0.7 # for read_excel test only pre-commit>=2.6.0 # lint pyarrow>=0.12.0 # all_tests pypd==1.1.0 # all_tests diff --git a/sidebars.js b/sidebars.js index 1d472b435b3d..df86d60e43c4 100755 --- a/sidebars.js +++ b/sidebars.js @@ -302,8 +302,8 @@ module.exports = { { type: 'doc', id: 'reference/expectations/distributional-expectations' }, { type: 'doc', id: 'reference/expectations/expectations' }, { type: 'doc', id: 'reference/expectations/implemented-expectations' }, - { type: 'doc', id: 'reference/expectation-suite-operations' }, - ] + { type: 'doc', id: 'reference/expectation-suite-operations' } + ] }, { type: 'doc', id: 'reference/metrics' }, { type: 'doc', id: 'reference/profilers' }, @@ -311,7 +311,7 @@ module.exports = { { type: 'doc', id: 'reference/expectations/standard-arguments' }, { type: 'doc', id: 'reference/stores' }, { type: 'doc', id: 'reference/dividing-data-assets-into-batches' }, - { type: 'doc', id: 'reference/validation' }, + { type: 'doc', id: 'reference/validation' } ] }, { @@ -327,7 +327,7 @@ module.exports = { collapsed: true, items: [ { type: 'doc', id: 'reference/spare-parts' } - ] + ] }, { type: 'category', @@ -335,16 +335,16 @@ module.exports = { collapsed: true, items: [ { type: 'doc', id: 'reference/api-reference' } - ] - } - ] + ] + } + ] }, { type: 'category', label: 'Community Resources', collapsed: true, items: [ - { type: 'doc', id: 'community' } + { type: 'doc', id: 
'community' } ] }, { @@ -352,7 +352,7 @@ module.exports = { label: 'Contributing', collapsed: true, items: [ - { type: 'doc', id: 'contributing/contributing' } + { type: 'doc', id: 'contributing/contributing' } ] }, { @@ -360,9 +360,8 @@ module.exports = { label: 'Changelog', collapsed: true, items: [ - { type: 'doc', id: 'changelog' } - + { type: 'doc', id: 'changelog' } ] - }, + } ] } diff --git a/tests/integration/docusaurus/expectations/advanced/multi_batch_rule_based_profiler_example.py b/tests/integration/docusaurus/expectations/advanced/multi_batch_rule_based_profiler_example.py index ed3d0b1b28ec..1e75f35f951f 100644 --- a/tests/integration/docusaurus/expectations/advanced/multi_batch_rule_based_profiler_example.py +++ b/tests/integration/docusaurus/expectations/advanced/multi_batch_rule_based_profiler_example.py @@ -6,9 +6,11 @@ profiler_config = """ # This profiler is meant to be used on the NYC taxi data (yellow_trip_data_sample_-.csv) # located in tests/test_sets/taxi_yellow_trip_data_samples/ + variables: confidence_level: 9.75e-1 mostly: 1.0 + rules: row_count_rule: domain_builder: diff --git a/tests/test_great_expectations.py b/tests/test_great_expectations.py index 7f6f978550f0..34e9f4945fb6 100644 --- a/tests/test_great_expectations.py +++ b/tests/test_great_expectations.py @@ -25,6 +25,7 @@ from great_expectations.data_context.util import file_relative_path from great_expectations.dataset import MetaPandasDataset, PandasDataset from great_expectations.exceptions import InvalidCacheValueError +from great_expectations.util import is_library_loadable try: from unittest import mock @@ -1014,6 +1015,10 @@ def test_read_json(self): assert isinstance(df, PandasDataset) assert sorted(list(df.keys())) == ["x", "y", "z"] + @pytest.mark.skipif( + not is_library_loadable(library_name="openpyxl"), + reason="GE uses pandas to read excel files, which requires openpyxl", + ) def test_read_excel(self): script_path = os.path.dirname(os.path.realpath(__file__)) df = 
ge.read_excel(