[DOCS] Choosing and configuring DataConnectors #3533

Merged
merged 71 commits into from
Nov 2, 2021
Changes from 57 commits
Commits
6f05c13
[DOCS] How to configure an InferredAssetDataConnector
Oct 12, 2021
9507500
[DOCS] Change log (#3533)
Oct 12, 2021
c0bd2d1
[DOCS] How to configure a ConfiguredAssetDataConnector (#3533)
Oct 12, 2021
e76cc75
[DOCS] How to choose which DataConnector to use (#3533)
Oct 12, 2021
9485973
[DOCS] How to configure a RuntimeDataConnector (#3533)
Oct 13, 2021
c3a4f34
[DOCS] Add option for RuntimeDataConnector to how-to-choose document …
Oct 13, 2021
39e8b39
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer Oct 13, 2021
dbf1024
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer Oct 14, 2021
c43f569
[DOCS] Clarify that the data we are connecting to is known as a DataA…
Oct 19, 2021
d31261c
Merge branch 'docs/DEVREL-213/everything-dataconnectors' of github.co…
Oct 19, 2021
979d581
[DOCS] Cleanup working and a typo (#3533)
Oct 19, 2021
c9ca799
[DOCS] Rearrange this list into the order of the examples below (#3533)
Oct 19, 2021
bc05b3c
[DOCS] Typo (#3533)
Oct 19, 2021
44985cb
Typo
Oct 19, 2021
5806b67
Clarify that the data is what we are referring to as a DataAsset
Oct 19, 2021
32d2769
Clean up
Oct 19, 2021
5953dd8
Basic working integration tests
Oct 19, 2021
98112be
Cleanup
Oct 19, 2021
b33d342
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer Oct 20, 2021
9d92240
Test datasets
Oct 20, 2021
c0051f6
Rearrange test set directory structure and get basic tests functioning
Oct 20, 2021
bdf1b05
YAML config example alongside Python
Oct 20, 2021
2dde0ec
Assert that yaml and python configs are equivalent
Oct 20, 2021
4408446
Steps 1 and 2 with tabs
Oct 20, 2021
53616a2
Remove test sets that are no longer needed
Oct 20, 2021
aaed2ef
WIP step 3
Oct 20, 2021
d05d440
Linting
Oct 20, 2021
c4ab4c6
WIP step 3
Oct 20, 2021
95d7f90
Add S3 examples
Oct 21, 2021
0e58964
Add S3 examples
Oct 21, 2021
82cde75
Linting
Oct 21, 2021
3fe6ac4
Misalignment of line numbers
Oct 21, 2021
9e5a52a
Example 1
Oct 21, 2021
174c028
Linting
Oct 22, 2021
a389060
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer Oct 22, 2021
0586292
WIP example 2
Oct 22, 2021
8d0046a
Working example 2
Oct 22, 2021
ea5eb83
WIP example 3
Oct 22, 2021
80d26b8
Working example 3
Oct 25, 2021
2ba5ded
WIP example 4
Oct 25, 2021
4faee0f
Working example 4
Oct 25, 2021
cf72978
Working example 5
Oct 25, 2021
cab32ec
RuntimeDataConnector under test
Oct 25, 2021
88dcc15
Clean up
Oct 25, 2021
4ca117f
How to choose under test
Oct 25, 2021
e997160
Clean up
Oct 25, 2021
87b4fc6
Link to capture group documentation
Oct 25, 2021
bf3ffd0
Enable final test
Oct 25, 2021
58c231d
Example RuntimeBatchRequest with batch_data df
Oct 25, 2021
edcec55
Reducde test set record count from 20 to 5
Oct 25, 2021
36c1ad8
Listing all available inferred and configured DataConnectors
Oct 26, 2021
9620eb6
Clean up
Oct 26, 2021
d505cd6
Escape underscore
Oct 26, 2021
3e62c56
Clean up
Oct 26, 2021
fed7126
Links to test scripts provided at the bottom of each document
Oct 26, 2021
6d4073d
Data Assets and data_references clarification
Oct 26, 2021
b1a3f36
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer Oct 26, 2021
5b74dae
Minor revisions
Oct 27, 2021
8cc7e7e
Example for data_references
Oct 27, 2021
b9accbc
Update glob_directive docstrings
Oct 27, 2021
4ea85cc
Explain what glob_directive does in this context
Oct 27, 2021
c977227
Better explanation for ConfiguredAssetDataConnector
Oct 27, 2021
3630de2
Also add clarifiaction for what we mean by Data Asset to how_to_confi…
Oct 27, 2021
d21bfe8
Example for loading a specific batch with batch_identifiers
Nov 2, 2021
2b1086c
Linting
Nov 2, 2021
0b79069
Re-align line numbers after lint
Nov 2, 2021
499fc83
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer Nov 2, 2021
648f64d
Batching core concepts link
Nov 2, 2021
8130542
Merge branch 'docs/DEVREL-213/everything-dataconnectors' of github.co…
Nov 2, 2021
b780691
Clean up
Nov 2, 2021
4033abc
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer Nov 2, 2021
2 changes: 1 addition & 1 deletion docs/changelog.md
@@ -3,6 +3,7 @@ title: Changelog
---

### Develop
* [DOCS] Choosing and configuring DataConnectors (#3533)

### 0.13.39
* [FEATURE] Migration of Expectations to Atomic Prescriptive Renderers (#3530, #3537)
@@ -45,7 +46,6 @@ title: Changelog
* [MAINTENANCE] Content and test script update (#3532)
* [MAINTENANCE] Provide Deprecation Notice for the "parse_strings_as_datetimes" Expectation Parameter in V3 (#3539)


### 0.13.37
* [FEATURE] Implement CompoundColumnsUnique metric for SqlAlchemyExecutionEngine (#3477)
* [FEATURE] add get_available_data_asset_names_and_types (#3476)
@@ -0,0 +1,144 @@
---
title: How to choose which DataConnector to use
---
import Prerequisites from '../connecting_to_your_data/components/prerequisites.jsx'
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This guide demonstrates how to choose which `DataConnector`s to configure within your `Datasource`s.

<Prerequisites>

- [Understand the basics of Datasources in the V3 (Batch Request) API](../../reference/datasources.md)
- [Know how to configure a Data Context using test_yaml_config](../setup/configuring_data_contexts/how_to_configure_datacontext_components_using_test_yaml_config.md)

</Prerequisites>

Great Expectations provides three types of `DataConnector` classes. Two of them connect to Data Assets stored as file-system-like data (files on disk, S3 object stores, etc.) or as relational database data:

- An InferredAssetDataConnector infers `data_asset_name` by using a regex that takes advantage of patterns that exist in the filename or folder structure.
- A ConfiguredAssetDataConnector gives you the most fine-grained control, and requires an explicit listing of each Data Asset you want to connect to.

| InferredAssetDataConnectors | ConfiguredAssetDataConnectors |
| --- | --- |
| InferredAssetFilesystemDataConnector | ConfiguredAssetFilesystemDataConnector |
| InferredAssetFilePathDataConnector | ConfiguredAssetFilePathDataConnector |
| InferredAssetAzureDataConnector | ConfiguredAssetAzureDataConnector |
| InferredAssetGCSDataConnector | ConfiguredAssetGCSDataConnector |
| InferredAssetS3DataConnector | ConfiguredAssetS3DataConnector |
| InferredAssetSqlDataConnector | ConfiguredAssetSqlDataConnector |
Comment on lines +22 to +29
Love this table.


InferredAssetDataConnectors and ConfiguredAssetDataConnectors are used to define Data Assets and their associated data_references. A Data Asset is an abstraction that can consist of one or more data_references to CSVs or relational database tables.

The third type of `DataConnector` class is for providing a batch's data directly at runtime:

- A `RuntimeDataConnector` enables you to use a `RuntimeBatchRequest` to wrap an in-memory dataframe, a filepath, or a SQL query. The request must include batch identifiers that uniquely identify the data (e.g. a `run_id` from an Airflow DAG run).

If you know, for example, that your Pipeline Runner will already have your batch data in memory at runtime, you can choose to configure a `RuntimeDataConnector` with unique batch identifiers. See [How to configure a RuntimeDataConnector](guides/connecting_to_your_data/how_to_configure_a_runtimedataconnector.md) and [How to create a Batch of data from an in-memory Spark or Pandas dataframe](guides/connecting_to_your_data/how_to_create_a_batch_of_data_from_an_in_memory_spark_or_pandas_dataframe.md) to get started with `RuntimeDataConnector`s.
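To make the shape of such a request concrete, here is a minimal plain-Python sketch. It does not import or call Great Expectations. The helper `check_runtime_request`, the datasource and connector names, and the rule that exactly one of `batch_data`, `path`, or `query` is supplied are illustrative assumptions based on the description above, not the library's own validation code.

```python
# Plain-Python sketch: no Great Expectations import; all names are illustrative.
RUNTIME_PARAMETER_KEYS = {"batch_data", "path", "query"}


def check_runtime_request(request, configured_identifiers):
    """Return a list of problems found in a runtime-style batch request dict."""
    problems = []

    # Assumption: the request wraps exactly one of a dataframe, a path, or a query.
    supplied = set(request.get("runtime_parameters", {})) & RUNTIME_PARAMETER_KEYS
    if len(supplied) != 1:
        problems.append("runtime_parameters needs exactly one of: batch_data, path, query")

    # Batch identifiers must be present and must be ones the connector declares.
    identifiers = request.get("batch_identifiers") or {}
    if not identifiers:
        problems.append("batch_identifiers must not be empty")
    unknown = sorted(set(identifiers) - set(configured_identifiers))
    if unknown:
        problems.append(f"unconfigured batch_identifiers: {unknown}")

    return problems


request = {
    "datasource_name": "taxi_datasource",        # hypothetical name
    "data_connector_name": "runtime_connector",  # hypothetical name
    "data_asset_name": "yellow_tripdata",
    "runtime_parameters": {"path": "yellow_tripdata/yellow_tripdata_2019-01.csv"},
    "batch_identifiers": {"run_id": "airflow_run_2019_01"},
}

print(check_runtime_request(request, configured_identifiers=["run_id"]))
```

The point of the sketch is that the data itself arrives with the request, so the only thing the connector has to know in advance is which batch identifier keys to expect.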

If you aren't sure which of the two remaining `DataConnector` types to use, the following examples demonstrate the difference between them using the classes designed to connect to files on disk: `InferredAssetFilesystemDataConnector` and `ConfiguredAssetFilesystemDataConnector`.

### When to use an InferredAssetDataConnector

Suppose you have the following `<MY DIRECTORY>/` directory in your filesystem, and you want to treat the `yellow_tripdata_*.csv` files as batches within the `yellow_tripdata` Data Asset, and do the same for the files in the `green_tripdata` directory:

```
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv
<MY DIRECTORY>/green_tripdata/2019-01.csv
<MY DIRECTORY>/green_tripdata/2019-02.csv
<MY DIRECTORY>/green_tripdata/2019-03.csv
```

This configuration:

<Tabs
groupId="yaml-or-python"
defaultValue='yaml'
values={[
{label: 'YAML', value:'yaml'},
{label: 'Python', value:'python'},
]}>
<TabItem value="yaml">

```python file=../../../tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py#L8-L26
```

</TabItem>
<TabItem value="python">

```python file=../../../tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py#L37-L60
```

</TabItem>
</Tabs>

...will make available the following Data Assets and data_references:

```bash
Available data_asset_names (2 of 2):
green_tripdata (3 of 3): ['green_tripdata/*2019-01.csv', 'green_tripdata/*2019-02.csv', 'green_tripdata/*2019-03.csv']
yellow_tripdata (3 of 3): ['yellow_tripdata/*2019-01.csv', 'yellow_tripdata/*2019-02.csv', 'yellow_tripdata/*2019-03.csv']

Unmatched data_references (0 of 0):[]
```

Note that the `InferredAssetFilesystemDataConnector` **infers** `data_asset_names` **from the regex you provide.** This is the key difference between an InferredAssetDataConnector and a ConfiguredAssetDataConnector. It also means that one of the `group_names` in the `default_regex` configuration must be `data_asset_name`.
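The inference step can be illustrated with the standard library alone. The sketch below does not use Great Expectations. The regex and the `year` and `month` group names are assumptions chosen to fit the directory listing above; the guide's actual configuration lives in the referenced test script.

```python
import re

# Assumed pattern and group names; only data_asset_name is required by the guide.
PATTERN = re.compile(r"(.*)/.*(\d{4})-(\d{2})\.csv")
GROUP_NAMES = ["data_asset_name", "year", "month"]

PATHS = [
    "yellow_tripdata/yellow_tripdata_2019-01.csv",
    "yellow_tripdata/yellow_tripdata_2019-02.csv",
    "yellow_tripdata/yellow_tripdata_2019-03.csv",
    "green_tripdata/2019-01.csv",
    "green_tripdata/2019-02.csv",
    "green_tripdata/2019-03.csv",
]


def infer_assets(paths):
    """Group paths by the data_asset_name captured from each path."""
    assets = {}
    for path in paths:
        match = PATTERN.match(path)
        if match is None:
            continue  # would be reported as an unmatched data_reference
        groups = dict(zip(GROUP_NAMES, match.groups()))
        name = groups.pop("data_asset_name")
        assets.setdefault(name, []).append(groups)
    return assets


for name, batches in sorted(infer_assets(PATHS).items()):
    print(name, len(batches))
```

Because the asset name comes from a capture group, adding a third directory with the same naming scheme would produce a third Data Asset without any change to the configuration.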

### When to use a ConfiguredAssetDataConnector

On the other hand, a `ConfiguredAssetFilesystemDataConnector` requires an explicit listing of each Data Asset you want to connect to. This tends to be helpful when the naming conventions for your Data Assets are less standardized.

If you have the same `<MY DIRECTORY>/` directory in your filesystem,

```
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv
<MY DIRECTORY>/green_tripdata/2019-01.csv
<MY DIRECTORY>/green_tripdata/2019-02.csv
<MY DIRECTORY>/green_tripdata/2019-03.csv
```

Then this configuration:

<Tabs
groupId="yaml-or-python"
defaultValue='yaml'
values={[
{label: 'YAML', value:'yaml'},
{label: 'Python', value:'python'},
]}>
<TabItem value="yaml">

```python file=../../../tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py#L90-L114
```

</TabItem>
<TabItem value="python">

```python file=../../../tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py#L125-L151
```

</TabItem>
</Tabs>

...will make available the following Data Assets and data_references:

```bash
Available data_asset_names (2 of 2):
green_tripdata (3 of 3): ['2019-01.csv', '2019-02.csv', '2019-03.csv']
yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv']

Unmatched data_references (0 of 0):[]
```
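For contrast with the inferred case, here is a standard-library sketch of what an explicit listing amounts to. It does not use Great Expectations. The `ASSETS` dictionary, its `base_directory` and `pattern` keys, and the helper `list_data_references` are illustrative assumptions, not the connector's real implementation.

```python
import re

# Hypothetical explicit listing: each Data Asset is named and given its own pattern.
ASSETS = {
    "yellow_tripdata": {
        "base_directory": "yellow_tripdata/",
        "pattern": r"yellow_tripdata_(\d{4})-(\d{2})\.csv",
    },
    "green_tripdata": {
        "base_directory": "green_tripdata/",
        "pattern": r"(\d{4})-(\d{2})\.csv",
    },
}

FILES = [
    "yellow_tripdata/yellow_tripdata_2019-01.csv",
    "yellow_tripdata/yellow_tripdata_2019-02.csv",
    "yellow_tripdata/yellow_tripdata_2019-03.csv",
    "green_tripdata/2019-01.csv",
    "green_tripdata/2019-02.csv",
    "green_tripdata/2019-03.csv",
]


def list_data_references(assets, files):
    """Assign each file to the first listed asset whose directory and pattern fit."""
    references = {name: [] for name in assets}
    unmatched = []
    for path in files:
        for name, spec in assets.items():
            base = spec["base_directory"]
            relative = path[len(base):]
            if path.startswith(base) and re.fullmatch(spec["pattern"], relative):
                references[name].append(relative)
                break
        else:
            unmatched.append(path)
    return references, unmatched


references, unmatched = list_data_references(ASSETS, FILES)
for name in sorted(references):
    print(name, references[name])
print("unmatched:", unmatched)
```

Here the asset names are fixed by the listing rather than captured from the path, so each asset can use a different pattern, which is what makes this approach suit irregular naming conventions.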

### Additional Notes

- Additional examples and configurations for `ConfiguredAssetFilesystemDataConnector`s can be found here: [How to configure a ConfiguredAssetDataConnector](./how_to_configure_a_configuredassetdataconnector.md)
- Additional examples and configurations for `InferredAssetFilesystemDataConnector`s can be found here: [How to configure an InferredAssetDataConnector](./how_to_configure_an_inferredassetdataconnector.md)
- Additional examples and configurations for `RuntimeDataConnector`s can be found here: [How to configure a RuntimeDataConnector](./how_to_configure_a_runtimedataconnector.md)

To view the full script used in this page, see it on GitHub:
- [how_to_choose_which_dataconnector_to_use.py](https://github.com/great-expectations/great_expectations/tree/develop/tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py)