[DOCS] Choosing and configuring DataConnectors #3533
Merged
NathanFarmer
merged 71 commits into
develop
from
docs/DEVREL-213/everything-dataconnectors
Nov 2, 2021
6f05c13
[DOCS] How to configure an InferredAssetDataConnector
9507500
[DOCS] Change log (#3533)
c0bd2d1
[DOCS] How to configure a ConfiguredAssetDataConnector (#3533)
e76cc75
[DOCS] How to choose which DataConnector to use (#3533)
9485973
[DOCS] How to configure a RuntimeDataConnector (#3533)
c3a4f34
[DOCS] Add option for RuntimeDataConnector to how-to-choose document …
39e8b39
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer dbf1024
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer c43f569
[DOCS] Clarify that the data we are connecting to is known as a DataA…
d31261c
Merge branch 'docs/DEVREL-213/everything-dataconnectors' of github.co…
979d581
[DOCS] Cleanup working and a typo (#3533)
c9ca799
[DOCS] Rearrange this list into the order of the examples below (#3533)
bc05b3c
[DOCS] Typo (#3533)
44985cb
Typo
5806b67
Clarify that the data is what we are referring to as a DataAsset
32d2769
Clean up
5953dd8
Basic working integration tests
98112be
Cleanup
b33d342
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer 9d92240
Test datasets
c0051f6
Rearrange test set directory structure and get basic tests functioning
bdf1b05
YAML config example alongside Python
2dde0ec
Assert that yaml and python configs are equivalent
4408446
Steps 1 and 2 with tabs
53616a2
Remove test sets that are no longer needed
aaed2ef
WIP step 3
d05d440
Linting
c4ab4c6
WIP step 3
95d7f90
Add S3 examples
0e58964
Add S3 examples
82cde75
Linting
3fe6ac4
Misalignment of line numbers
9e5a52a
Example 1
174c028
Linting
a389060
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer 0586292
WIP example 2
8d0046a
Working example 2
ea5eb83
WIP example 3
80d26b8
Working example 3
2ba5ded
WIP example 4
4faee0f
Working example 4
cf72978
Working example 5
cab32ec
RuntimeDataConnector under test
88dcc15
Clean up
4ca117f
How to choose under test
e997160
Clean up
87b4fc6
Link to capture group documentation
bf3ffd0
Enable final test
58c231d
Example RuntimeBatchRequest with batch_data df
edcec55
Reducde test set record count from 20 to 5
36c1ad8
Listing all available inferred and configured DataConnectors
9620eb6
Clean up
d505cd6
Escape underscore
3e62c56
Clean up
fed7126
Links to test scripts provided at the bottom of each document
6d4073d
Data Assets and data_references clarification
b1a3f36
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer 5b74dae
Minor revisions
8cc7e7e
Example for data_references
b9accbc
Update glob_directive docstrings
4ea85cc
Explain what glob_directive does in this context
c977227
Better explanation for ConfiguredAssetDataConnector
3630de2
Also add clarifiaction for what we mean by Data Asset to how_to_confi…
d21bfe8
Example for loading a specific batch with batch_identifiers
2b1086c
Linting
0b79069
Re-align line numbers after lint
499fc83
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer 648f64d
Batching core concepts link
8130542
Merge branch 'docs/DEVREL-213/everything-dataconnectors' of github.co…
b780691
Clean up
4033abc
Merge branch 'develop' into docs/DEVREL-213/everything-dataconnectors
NathanFarmer
146 changes: 146 additions & 0 deletions
docs/guides/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
--- | ||
title: How to choose which DataConnector to use | ||
--- | ||
import Prerequisites from '../connecting_to_your_data/components/prerequisites.jsx' | ||
import Tabs from '@theme/Tabs'; | ||
import TabItem from '@theme/TabItem'; | ||
|
||
This guide demonstrates how to choose which `DataConnector`s to configure within your `Datasource`s. | ||
|
||
<Prerequisites> | ||
|
||
- [Understand the basics of Datasources in the V3 (Batch Request) API](../../reference/datasources.md) | ||
- Learned how to configure a [Data Context using test_yaml_config](../setup/configuring_data_contexts/how_to_configure_datacontext_components_using_test_yaml_config.md) | ||
|
||
</Prerequisites> | ||
|
||
Great Expectations provides three types of `DataConnector` classes. Two are for connecting to Data Assets stored as file-system-like data (files on disk, S3 object stores, etc.) or as relational database data: | ||
|
||
- An InferredAssetDataConnector infers `data_asset_name` by using a regex that takes advantage of patterns that exist in the filename or folder structure. | ||
- A ConfiguredAssetDataConnector gives you the most fine-grained control, and requires an explicit listing of each Data Asset you want to connect to. | ||
|
||
| InferredAssetDataConnectors | ConfiguredAssetDataConnectors | | ||
| --- | --- | | ||
| InferredAssetFilesystemDataConnector | ConfiguredAssetFilesystemDataConnector | | ||
| InferredAssetFilePathDataConnector | ConfiguredAssetFilePathDataConnector | | ||
| InferredAssetAzureDataConnector | ConfiguredAssetAzureDataConnector | | ||
| InferredAssetGCSDataConnector | ConfiguredAssetGCSDataConnector | | ||
| InferredAssetS3DataConnector | ConfiguredAssetS3DataConnector | | ||
| InferredAssetSqlDataConnector | ConfiguredAssetSqlDataConnector | | ||
|
||
InferredAssetDataConnectors and ConfiguredAssetDataConnectors are used to define Data Assets and their associated data_references. A Data Asset is an abstraction that can consist of one or more data_references to CSVs or relational database tables. For instance, you might have a `yellow_tripdata` Data Asset containing information about taxi rides, which consists of twelve data_references to twelve CSVs, each consisting of one month of data. | ||
|
||
The third type of `DataConnector` class is for providing a batch's data directly at runtime: | ||
|
||
- A `RuntimeDataConnector` enables you to use a `RuntimeBatchRequest` to wrap an in-memory dataframe, a filepath, or a SQL query. The request must include batch identifiers that uniquely identify the data (e.g. a `run_id` from an Airflow DAG run). | ||
|
||
If you know, for example, that your Pipeline Runner will already have your batch data in memory at runtime, you can choose to configure a `RuntimeDataConnector` with unique batch identifiers. See [How to configure a RuntimeDataConnector](guides/connecting_to_your_data/how_to_configure_a_runtimedataconnector.md) and [How to create a Batch of data from an in-memory Spark or Pandas dataframe](guides/connecting_to_your_data/how_to_create_a_batch_of_data_from_an_in_memory_spark_or_pandas_dataframe.md) to get started with `RuntimeDataConnector`s. | ||
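As a rough sketch of the shape this takes, the plain dictionaries below mirror a `Datasource` configuration containing a `RuntimeDataConnector` and the arguments a matching `RuntimeBatchRequest` would carry. The names `taxi_datasource`, `default_runtime_data_connector_name`, `my_data_asset`, and the helper `check_batch_identifiers` are illustrative assumptions, not taken from this guide. The helper only imitates the idea that identifiers supplied at runtime should be among those declared in the configuration; it is not library code.

```python
# Hypothetical names throughout. Plain dictionaries stand in for the
# Great Expectations objects so the sketch runs without the library.
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "PandasExecutionEngine"},
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            # Keys that every runtime batch must supply.
            "batch_identifiers": ["run_id"],
        }
    },
}

# Arguments in the shape a RuntimeBatchRequest expects. The
# runtime_parameters entry could instead use "batch_data" or "query".
runtime_batch_request_kwargs = {
    "datasource_name": "taxi_datasource",
    "data_connector_name": "default_runtime_data_connector_name",
    "data_asset_name": "my_data_asset",
    "runtime_parameters": {
        "path": "<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv"
    },
    "batch_identifiers": {"run_id": "2019-01-run"},
}


def check_batch_identifiers(config, request_kwargs):
    """Return True if the request supplies only declared identifier keys."""
    connector = config["data_connectors"][request_kwargs["data_connector_name"]]
    declared = set(connector["batch_identifiers"])
    supplied = set(request_kwargs["batch_identifiers"])
    return bool(supplied) and supplied <= declared


print(check_batch_identifiers(datasource_config, runtime_batch_request_kwargs))
```

The point of the sketch is the pairing: the connector declares which identifier keys exist, and each request fills them in with values that single out one batch.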
|
||
If you aren't sure which of the two remaining `DataConnector` types to use, the following examples use the classes designed to connect to files on disk, namely `InferredAssetFilesystemDataConnector` and `ConfiguredAssetFilesystemDataConnector`, to demonstrate the difference between them. | ||
|
||
|
||
### When to use an InferredAssetDataConnector | ||
|
||
Suppose you have the following `<MY DIRECTORY>/` directory in your filesystem, and you want to treat the `yellow_tripdata_*.csv` files as batches within the `yellow_tripdata` Data Asset, and likewise treat the files in the `green_tripdata` directory as batches within a `green_tripdata` Data Asset: | ||
|
||
``` | ||
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv | ||
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv | ||
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv | ||
<MY DIRECTORY>/green_tripdata/2019-01.csv | ||
<MY DIRECTORY>/green_tripdata/2019-02.csv | ||
<MY DIRECTORY>/green_tripdata/2019-03.csv | ||
``` | ||
|
||
This configuration: | ||
|
||
<Tabs | ||
groupId="yaml-or-python" | ||
defaultValue='yaml' | ||
values={[ | ||
{label: 'YAML', value:'yaml'}, | ||
{label: 'Python', value:'python'}, | ||
]}> | ||
<TabItem value="yaml"> | ||
|
||
```python file=../../../tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py#L8-L26 | ||
``` | ||
|
||
</TabItem> | ||
<TabItem value="python"> | ||
|
||
```python file=../../../tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py#L37-L60 | ||
``` | ||
|
||
</TabItem> | ||
</Tabs> | ||
|
||
will make available the following Data Assets and data_references: | ||
|
||
```bash | ||
Available data_asset_names (2 of 2): | ||
green_tripdata (3 of 3): ['green_tripdata/*2019-01.csv', 'green_tripdata/*2019-02.csv', 'green_tripdata/*2019-03.csv'] | ||
yellow_tripdata (3 of 3): ['yellow_tripdata/*2019-01.csv', 'yellow_tripdata/*2019-02.csv', 'yellow_tripdata/*2019-03.csv'] | ||
|
||
Unmatched data_references (0 of 0):[] | ||
``` | ||
|
||
Note that the `InferredAssetFilesystemDataConnector` **infers** `data_asset_name`s **from the regex you provide.** This is the key difference between an InferredAssetDataConnector and a ConfiguredAssetDataConnector, and it requires that one of the `group_names` in the `default_regex` configuration be `data_asset_name`. | ||
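To make the inference step concrete, here is a minimal sketch using only Python's `re` module. The pattern and the `group_names` list are assumptions chosen to fit the example directory (the guide's actual configuration lives in the linked test script), and `infer_assets` is a hypothetical helper, not a Great Expectations function.

```python
import re
from collections import defaultdict

# Paths relative to <MY DIRECTORY>/, as in the listing above.
paths = [
    "yellow_tripdata/yellow_tripdata_2019-01.csv",
    "yellow_tripdata/yellow_tripdata_2019-02.csv",
    "yellow_tripdata/yellow_tripdata_2019-03.csv",
    "green_tripdata/2019-01.csv",
    "green_tripdata/2019-02.csv",
    "green_tripdata/2019-03.csv",
]

# Assumed regex: one capture group per entry in group_names.
pattern = re.compile(r"(.*)/.*(\d{4})-(\d{2})\.csv")
group_names = ["data_asset_name", "year", "month"]


def infer_assets(paths):
    """Group paths by the value captured for data_asset_name."""
    assets = defaultdict(list)
    unmatched = []
    for path in paths:
        match = pattern.fullmatch(path)
        if match is None:
            unmatched.append(path)
            continue
        identifiers = dict(zip(group_names, match.groups()))
        name = identifiers.pop("data_asset_name")
        # The remaining groups identify one batch within the asset.
        assets[name].append(identifiers)
    return dict(assets), unmatched


assets, unmatched = infer_assets(paths)
for name in sorted(assets):
    print(name, assets[name])
```

Because the asset name comes from a capture group, adding a new subdirectory that follows the same layout would surface a new Data Asset without any configuration change.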
|
||
|
||
The `glob_directive` tells the `DataConnector` what directory structure to expect for each Data Asset. The default `glob_directive` for the `InferredAssetFilesystemDataConnector` is `"*"`, so it must be overridden when your data_references exist in subdirectories. | ||
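The effect of the glob can be illustrated with `pathlib` as a stand-in; how Great Expectations discovers files internally may differ. The value `"*/*.csv"` is an assumed example of an override, and `list_matches` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def list_matches(base_directory, glob_directive):
    """Return files under base_directory matched by the glob, as relative paths."""
    base = Path(base_directory)
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.glob(glob_directive)
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Recreate a small slice of the example layout.
    for rel in (
        "yellow_tripdata/yellow_tripdata_2019-01.csv",
        "green_tripdata/2019-01.csv",
    ):
        target = base / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    # "*" only sees entries directly under the base directory,
    # which here are the two subdirectories and no files.
    top_level_only = list_matches(base, "*")
    # "*/*.csv" descends one level and finds the CSVs.
    one_level_down = list_matches(base, "*/*.csv")

print(top_level_only)
print(one_level_down)
```

With the example layout, a top-level glob finds no files at all, which is why the override matters when each Data Asset lives in its own subdirectory.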
|
||
### When to use a ConfiguredAssetDataConnector | ||
|
||
On the other hand, a `ConfiguredAssetFilesystemDataConnector` requires an explicit listing of each Data Asset you want to connect to. This tends to be helpful when the naming conventions for your Data Assets are less standardized, but you have a strong understanding of the semantics governing how the data (files, database tables) is segmented. | ||
|
||
|
||
If you have the same `<MY DIRECTORY>/` directory in your filesystem, | ||
|
||
``` | ||
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv | ||
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv | ||
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv | ||
<MY DIRECTORY>/green_tripdata/2019-01.csv | ||
<MY DIRECTORY>/green_tripdata/2019-02.csv | ||
<MY DIRECTORY>/green_tripdata/2019-03.csv | ||
``` | ||
|
||
Then this configuration: | ||
|
||
<Tabs | ||
groupId="yaml-or-python" | ||
defaultValue='yaml' | ||
values={[ | ||
{label: 'YAML', value:'yaml'}, | ||
{label: 'Python', value:'python'}, | ||
]}> | ||
<TabItem value="yaml"> | ||
|
||
```python file=../../../tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py#L90-L114 | ||
``` | ||
|
||
</TabItem> | ||
<TabItem value="python"> | ||
|
||
```python file=../../../tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py#L125-L151 | ||
``` | ||
|
||
</TabItem> | ||
</Tabs> | ||
|
||
will make available the following Data Assets and data_references: | ||
|
||
```bash | ||
Available data_asset_names (2 of 2): | ||
green_tripdata (3 of 3): ['2019-01.csv', '2019-02.csv', '2019-03.csv'] | ||
yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv'] | ||
|
||
Unmatched data_references (0 of 0):[] | ||
``` | ||
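For contrast with the inferred case, the sketch below lists each Data Asset explicitly and gives it its own subdirectory and filename pattern. The dictionary keys (`base_directory`, `pattern`, `group_names`) are modeled on the kind of per-asset settings this connector takes, but the exact values are assumptions, and `resolve_assets` is a hypothetical helper rather than library code.

```python
import re

# Filenames found in each subdirectory of <MY DIRECTORY>/.
files_by_directory = {
    "yellow_tripdata": [
        "yellow_tripdata_2019-01.csv",
        "yellow_tripdata_2019-02.csv",
        "yellow_tripdata_2019-03.csv",
    ],
    "green_tripdata": ["2019-01.csv", "2019-02.csv", "2019-03.csv"],
}

# Each Data Asset is named up front, with its own directory and pattern.
assets = {
    "yellow_tripdata": {
        "base_directory": "yellow_tripdata",
        "pattern": r"yellow_tripdata_(\d{4})-(\d{2})\.csv",
        "group_names": ["year", "month"],
    },
    "green_tripdata": {
        "base_directory": "green_tripdata",
        "pattern": r"(\d{4})-(\d{2})\.csv",
        "group_names": ["year", "month"],
    },
}


def resolve_assets(assets, files_by_directory):
    """Map each configured asset name to the filenames its pattern matches."""
    available = {}
    for name, spec in assets.items():
        regex = re.compile(spec["pattern"])
        candidates = files_by_directory.get(spec["base_directory"], [])
        available[name] = [f for f in candidates if regex.fullmatch(f)]
    return available


available = resolve_assets(assets, files_by_directory)
for name in sorted(available):
    print(name, available[name])
```

Nothing is inferred here: a directory that is not listed under `assets` is simply ignored, and each asset can use a pattern that would not fit the others.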
|
||
### Additional Notes | ||
|
||
- Additional examples and configurations for `ConfiguredAssetFilesystemDataConnector`s can be found here: [How to configure a ConfiguredAssetDataConnector](./how_to_configure_a_configuredassetdataconnector.md) | ||
- Additional examples and configurations for `InferredAssetFilesystemDataConnector`s can be found here: [How to configure an InferredAssetDataConnector](./how_to_configure_an_inferredassetdataconnector.md) | ||
- Additional examples and configurations for `RuntimeDataConnector`s can be found here: [How to configure a RuntimeDataConnector](./how_to_configure_a_runtimedataconnector.md) | ||
|
||
To view the full script used in this page, see it on GitHub: | ||
- [how_to_choose_which_dataconnector_to_use.py](https://github.com/great-expectations/great_expectations/tree/develop/tests/integration/docusaurus/connecting_to_your_data/how_to_choose_which_dataconnector_to_use.py) |
Love this table.