Commit 886a716

Merge branch 'develop' into bugfix/GE-160/GE-314/alexsherstinsky/fix_minor_bug_in_param_getter-2021_07_02-42b

Alex Sherstinsky committed Jul 3, 2021
2 parents 2a1e39c + 8738fb9
Showing 10 changed files with 96 additions and 81 deletions.
Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@ In this tutorial, you will develop hands-on experience with configuring a Rule-B

- Have a basic understanding of [Metrics in Great Expectations](https://docs.greatexpectations.io/en/latest/reference/core_concepts/metrics.html)
- Have a basic understanding of [Expectation Configurations in Great Expectations](https://docs.greatexpectations.io/en/latest/reference/core_concepts/expectations/expectations.html#expectation-concepts-domain-and-success-keys)
- Have read the sections in Core Concepts on [Profiling](../../../reference/core-concepts#profiling) and [Rule-Based Profilers](../../../reference/core-concepts#rule-based-profilers)
- Have read the sections in Core Concepts on [Profilers](../../../reference/profilers) and [Rule-Based Profilers](../../../reference/profilers#rule-based-profilers)

</Prerequisites>

Binary file modified docs/guides/images/rule_based_profiler_public_interface_diagram.png
68 changes: 1 addition & 67 deletions docs/reference/core-concepts.md
@@ -110,72 +110,6 @@ permissions.
In some cases **the thing that "makes a batch a batch" is the act of attending to it--for example by validating or
profiling the data**. It's all about **your** Expectations.

Great Expectations provides a mechanism to automatically generate expectations, using a feature called a **Profiler**. A Profiler builds an **Expectation Suite** from one or more **Data Assets**. It may also validate the data against the newly generated Expectation Suite to return a **Validation Result**. There are several Profilers included with Great Expectations.
Great Expectations provides a mechanism to automatically generate expectations, using a feature called a [**Profiler**](./profilers). A Profiler builds an **Expectation Suite** from one or more **Data Assets**. It may also validate the data against the newly generated Expectation Suite to return a **Validation Result**. There are several Profilers included with Great Expectations.

A Profiler makes it possible to quickly create a starting point for generating expectations about a Dataset. For example, during the `init` flow, Great Expectations currently uses the **UserConfigurableProfiler** to demonstrate important features of **Expectations** by creating and validating an Expectation Suite that has several different kinds of expectations built from a small sample of data. A Profiler is also critical to generating the Expectation Suites used during profiling.

## Rule-Based Profilers

**Rule-Based Profilers** let users provide a highly configurable specification, composed of **Rules**, for building an **Expectation Suite** by profiling existing data.

Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data.

A **Rule** in a rule-based profiler could say something like "Look at every column in my Sales table, and if that column is numeric, add an `expect_column_values_to_be_between` Expectation to my Expectation Suite, where the `min_value` for the Expectation is the minimum value for the column, and the `max_value` for the Expectation is the maximum value for the column."

Each rule in a rule-based profiler has three types of components:

1. **DomainBuilders**: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build expectations
1. **ParameterBuilders**: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations
1. **ExpectationConfigurationBuilders**: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder

In the above example, imagine your table of Sales has twenty columns, of which five are numeric:
* Your **DomainBuilder** would inspect all twenty columns, and then yield a list of the five numeric columns
* You would specify two **ParameterBuilders**: one which gets the min of a column, and one which gets a max. Your Profiler would loop over the Domain (or column) list built by the **DomainBuilder** and use the two ParameterBuilders to get the min and max for each column.
* Then the Profiler loops over the Domains built by the DomainBuilder and uses the **ExpectationConfigurationBuilders** to add an `expect_column_values_to_be_between` Expectation for each of these Domains, where the `min_value` and `max_value` are the values obtained by the ParameterBuilders.

In addition to Rules, a rule-based profiler enables you to specify **Variables**, which are global and can be used in any of the Rules. For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so.

#### Example Config:
```yaml
variables:
my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile
datasource_name: my_sales_datasource
data_connector_name: monthly_sales
data_asset_name: sales_data
data_connector_query:
index: -1
mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below
rules:
my_rule_for_numeric_columns: # This is the name of our Rule
domain_builder:
batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax
class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below
semantic_types:
- numeric
parameter_builders:
- parameter_name: my_column_min
class_name: MetricParameterBuilder
batch_request: $variables.my_last_month_sales_batch_request
metric_name: column.min # This is the metric we want to get with this ParameterBuilder
metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead.
- parameter_name: my_column_max
class_name: MetricParameterBuilder
batch_request: $variables.my_last_month_sales_batch_request
metric_name: column.max
metric_domain_kwargs: $domain.domain_kwargs
expectation_configuration_builders:
- expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite
class_name: DefaultExpectationConfigurationBuilder
column: $domain.domain_kwargs.column
min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables
max_value: $parameter.my_column_max.value
mostly: $variables.mostly_default
```
You can see another example config containing multiple rules here: [alice_user_workflow_verbose_profiler_config.yml](https://github.com/great-expectations/great_expectations/blob/develop/tests/rule_based_profiler/alice_user_workflow_verbose_profiler_config.yml)
The diagram below uses this config to give a better sense of how the different parts of a Profiler config fit together. [You can see a larger version of this file here.](https://github.com/great-expectations/great_expectations/blob/develop/docs/guides/images/rule_based_profiler_public_interface_diagram.png)
![Rule-Based Profiler Public Interface Diagram](../guides/images/rule_based_profiler_public_interface_diagram.png)
### Next Steps
- You can try out a tutorial that walks you through the set-up of a Rule-Based Profiler here: [How to create a new Expectation Suite using Rule Based Profilers](../guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers)
66 changes: 66 additions & 0 deletions docs/reference/profilers.md
@@ -19,3 +19,69 @@ convention that all columns **named** "id" are primary keys, whereas all columns
**suffix** "_id" are foreign keys. In that case, when the team using Great Expectations first encounters a new dataset
that followed the convention, a Profiler could use that knowledge to add an expect_column_values_to_be_unique
Expectation to the "id" column (but not, for example an "address_id" column).
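
The naming-convention idea above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the actual Great Expectations Profiler API; the function name `propose_uniqueness_expectations` and the expectation dictionaries are hypothetical shapes for illustration only.

```python
# Hypothetical helper illustrating the convention: a column named exactly "id"
# is treated as a primary key and gets a uniqueness expectation, while
# "_id"-suffixed columns are treated as foreign keys and are skipped.
def propose_uniqueness_expectations(column_names):
    expectations = []
    for name in column_names:
        if name == "id":  # primary-key convention
            expectations.append(
                {
                    "expectation_type": "expect_column_values_to_be_unique",
                    "kwargs": {"column": name},
                }
            )
        # e.g. "address_id" falls through here: foreign keys get no expectation
    return expectations

print(propose_uniqueness_expectations(["id", "address_id", "amount"]))
# [{'expectation_type': 'expect_column_values_to_be_unique', 'kwargs': {'column': 'id'}}]
```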

## Rule-Based Profilers

**Rule-Based Profilers** let users provide a highly configurable specification, composed of **Rules**, for building an **Expectation Suite** by profiling existing data.

Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data.

A **Rule** in a rule-based profiler could say something like "Look at every column in my Sales table, and if that column is numeric, add an `expect_column_values_to_be_between` Expectation to my Expectation Suite, where the `min_value` for the Expectation is the minimum value for the column, and the `max_value` for the Expectation is the maximum value for the column."

Each rule in a rule-based profiler has three types of components:

1. **DomainBuilders**: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build expectations
1. **ParameterBuilders**: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations
1. **ExpectationConfigurationBuilders**: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder

In the above example, imagine your table of Sales has twenty columns, of which five are numeric:
* Your **DomainBuilder** would inspect all twenty columns, and then yield a list of the five numeric columns
* You would specify two **ParameterBuilders**: one which gets the min of a column, and one which gets a max. Your Profiler would loop over the Domain (or column) list built by the **DomainBuilder** and use the two ParameterBuilders to get the min and max for each column.
* Then the Profiler loops over the Domains built by the DomainBuilder and uses the **ExpectationConfigurationBuilders** to add an `expect_column_values_to_be_between` Expectation for each of these Domains, where the `min_value` and `max_value` are the values obtained by the ParameterBuilders.
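
The three-step walkthrough above can be sketched with plain Python dictionaries. This is a toy illustration of the mechanics only, not the real DomainBuilder, ParameterBuilder, or ExpectationConfigurationBuilder classes; the sample table and its values are invented for the example.

```python
# Toy stand-in for a Sales table: column name -> sample values
table = {
    "sale_id": ["a1", "a2"],   # non-numeric column
    "amount": [10.0, 250.0],   # numeric column
    "quantity": [1, 5],        # numeric column
}

# "DomainBuilder": keep only the numeric columns as Domains
domains = [
    column
    for column, values in table.items()
    if all(isinstance(v, (int, float)) for v in values)
]

# "ParameterBuilders": compute a min and a max Parameter per Domain
parameters = {
    column: {"my_column_min": min(table[column]), "my_column_max": max(table[column])}
    for column in domains
}

# "ExpectationConfigurationBuilder": one configuration per Domain,
# wiring the Parameters into min_value / max_value
suite = [
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {
            "column": column,
            "min_value": parameters[column]["my_column_min"],
            "max_value": parameters[column]["my_column_max"],
        },
    }
    for column in domains
]

print(suite)
```

Run against the toy table, this yields two expectation configurations, one each for `amount` and `quantity`, mirroring the twenty-columns/five-numeric example above at a smaller scale.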

In addition to Rules, a rule-based profiler enables you to specify **Variables**, which are global and can be used in any of the Rules. For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so.

#### Example Config:
```yaml
variables:
my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile
datasource_name: my_sales_datasource
data_connector_name: monthly_sales
data_asset_name: sales_data
data_connector_query:
index: -1
mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below
rules:
my_rule_for_numeric_columns: # This is the name of our Rule
domain_builder:
batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax
class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below
semantic_types:
- numeric
parameter_builders:
- parameter_name: my_column_min
class_name: MetricParameterBuilder
batch_request: $variables.my_last_month_sales_batch_request
metric_name: column.min # This is the metric we want to get with this ParameterBuilder
metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead.
- parameter_name: my_column_max
class_name: MetricParameterBuilder
batch_request: $variables.my_last_month_sales_batch_request
metric_name: column.max
metric_domain_kwargs: $domain.domain_kwargs
expectation_configuration_builders:
- expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite
class_name: DefaultExpectationConfigurationBuilder
column: $domain.domain_kwargs.column
min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables
max_value: $parameter.my_column_max.value
mostly: $variables.mostly_default
```
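
As a rough illustration of how the `$variables.<name>` references in a config like the one above might be resolved, here is a minimal recursive substitution sketch. The `resolve` function is a hypothetical helper for this example, not Great Expectations' actual substitution code (which also handles `$domain` and `$parameter` references).

```python
# Hypothetical resolver: replaces "$variables.<name>" strings anywhere in a
# nested config structure with the corresponding value from `variables`.
def resolve(value, variables):
    prefix = "$variables."
    if isinstance(value, str) and value.startswith(prefix):
        return variables[value[len(prefix):]]
    if isinstance(value, dict):
        return {key: resolve(item, variables) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, variables) for item in value]
    return value

variables = {"mostly_default": 0.95}
rule_fragment = {"mostly": "$variables.mostly_default"}
print(resolve(rule_fragment, variables))  # {'mostly': 0.95}
```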
You can see another example config containing multiple rules here: [alice_user_workflow_verbose_profiler_config.yml](https://github.com/great-expectations/great_expectations/blob/develop/tests/rule_based_profiler/alice_user_workflow_verbose_profiler_config.yml)
The diagram below uses this config to give a better sense of how the different parts of a Profiler config fit together. [You can see a larger version of this file here.](https://github.com/great-expectations/great_expectations/blob/develop/docs/guides/images/rule_based_profiler_public_interface_diagram.png)
![Rule-Based Profiler Public Interface Diagram](../guides/images/rule_based_profiler_public_interface_diagram.png)
### Next Steps
- You can try out a tutorial that walks you through the set-up of a Rule-Based Profiler here: [How to create a new Expectation Suite using Rule Based Profilers](../guides/expectations/advanced/how-to-create-a-new-expectation-suite-using-rule-based-profilers)
2 changes: 2 additions & 0 deletions docs_rtd/changelog.rst
@@ -11,6 +11,8 @@ Develop
- Support for both single-batch and multi-batch use-cases showcased
- Addition of the "bootstrap" mode of parameter estimation (default) to NumericMetricRangeMultiBatchParameterBuilder
- Initial documentation
* [BUGFIX] Modify read_excel() to handle new optional-dependency openpyxl for pandas >= 1.3.0 #2989


0.13.21
-----------------
8 changes: 7 additions & 1 deletion great_expectations/util.py
@@ -493,7 +493,13 @@ def read_excel(
"""
import pandas as pd

df = pd.read_excel(filename, *args, **kwargs)
try:
    df = pd.read_excel(filename, *args, **kwargs)
except ImportError as e:
    raise ImportError(
        "Pandas now requires 'openpyxl' as an optional dependency to read Excel files. Please use pip or conda to install openpyxl and try again."
    ) from e

if dataset_class is None:
verify_dynamic_loading_support(module_name=module_name)
dataset_class = load_class(class_name=class_name, module_name=module_name)
1 change: 1 addition & 0 deletions requirements-dev-base.txt
@@ -20,6 +20,7 @@ google-cloud-secret-manager>=1.0.0 # all_tests
google-cloud-storage>=1.28.0 # all_tests
isort==5.4.2 # lint
moto[ec2]>=1.3.7,<2.0.0 # all_tests
openpyxl>=3.0.7 # for read_excel test only
pre-commit>=2.6.0 # lint
pyarrow>=0.12.0 # all_tests
pypd==1.1.0 # all_tests
23 changes: 11 additions & 12 deletions sidebars.js
@@ -302,16 +302,16 @@ module.exports = {
{ type: 'doc', id: 'reference/expectations/distributional-expectations' },
{ type: 'doc', id: 'reference/expectations/expectations' },
{ type: 'doc', id: 'reference/expectations/implemented-expectations' },
{ type: 'doc', id: 'reference/expectation-suite-operations' },
]
{ type: 'doc', id: 'reference/expectation-suite-operations' }
]
},
{ type: 'doc', id: 'reference/metrics' },
{ type: 'doc', id: 'reference/profilers' },
{ type: 'doc', id: 'reference/expectations/result-format' },
{ type: 'doc', id: 'reference/expectations/standard-arguments' },
{ type: 'doc', id: 'reference/stores' },
{ type: 'doc', id: 'reference/dividing-data-assets-into-batches' },
{ type: 'doc', id: 'reference/validation' },
{ type: 'doc', id: 'reference/validation' }
]
},
{
@@ -327,42 +327,41 @@ module.exports = {
collapsed: true,
items: [
{ type: 'doc', id: 'reference/spare-parts' }
]
]
},
{
type: 'category',
label: 'API Reference',
collapsed: true,
items: [
{ type: 'doc', id: 'reference/api-reference' }
]
}
]
]
}
]
},
{
type: 'category',
label: 'Community Resources',
collapsed: true,
items: [
{ type: 'doc', id: 'community' }
{ type: 'doc', id: 'community' }
]
},
{
type: 'category',
label: 'Contributing',
collapsed: true,
items: [
{ type: 'doc', id: 'contributing/contributing' }
{ type: 'doc', id: 'contributing/contributing' }
]
},
{
type: 'category',
label: 'Changelog',
collapsed: true,
items: [
{ type: 'doc', id: 'changelog' }

{ type: 'doc', id: 'changelog' }
]
},
}
]
}
@@ -6,9 +6,11 @@
profiler_config = """
# This profiler is meant to be used on the NYC taxi data (yellow_trip_data_sample_<YEAR>-<MONTH>.csv)
# located in tests/test_sets/taxi_yellow_trip_data_samples/
variables:
confidence_level: 9.75e-1
mostly: 1.0
rules:
row_count_rule:
domain_builder:
5 changes: 5 additions & 0 deletions tests/test_great_expectations.py
@@ -25,6 +25,7 @@
from great_expectations.data_context.util import file_relative_path
from great_expectations.dataset import MetaPandasDataset, PandasDataset
from great_expectations.exceptions import InvalidCacheValueError
from great_expectations.util import is_library_loadable

try:
from unittest import mock
@@ -1014,6 +1015,10 @@ def test_read_json(self):
assert isinstance(df, PandasDataset)
assert sorted(list(df.keys())) == ["x", "y", "z"]

@pytest.mark.skipif(
not is_library_loadable(library_name="openpyxl"),
reason="GE uses pandas to read excel files, which requires openpyxl",
)
def test_read_excel(self):
script_path = os.path.dirname(os.path.realpath(__file__))
df = ge.read_excel(
