From e467327a430c3c9d565b344b8a71b73ce75528e0 Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Wed, 2 Aug 2023 15:58:46 +0100 Subject: [PATCH 01/19] First drop of newly organised data catalog docs Signed-off-by: Jo Stichbury --- docs/build-docs.sh | 4 +- .../data/advanced_data_catalog_usage.md | 240 ++++++ docs/source/data/data_catalog.md | 806 +----------------- docs/source/data/data_catalog_basic_how_to.md | 130 +++ .../source/data/data_catalog_yaml_examples.md | 574 +++++++++++++ .../how_to_create_a_custom_dataset.md} | 31 +- docs/source/data/index.md | 17 +- ...> partitioned_and_incremental_datasets.md} | 241 +----- 8 files changed, 1000 insertions(+), 1043 deletions(-) create mode 100644 docs/source/data/advanced_data_catalog_usage.md create mode 100644 docs/source/data/data_catalog_basic_how_to.md create mode 100644 docs/source/data/data_catalog_yaml_examples.md rename docs/source/{extend_kedro/custom_datasets.md => data/how_to_create_a_custom_dataset.md} (90%) rename docs/source/data/{kedro_io.md => partitioned_and_incremental_datasets.md} (67%) diff --git a/docs/build-docs.sh b/docs/build-docs.sh index d55076e118..eb64351b4f 100755 --- a/docs/build-docs.sh +++ b/docs/build-docs.sh @@ -8,7 +8,7 @@ set -o nounset action=$1 if [ "$action" == "linkcheck" ]; then - sphinx-build -WETan -j auto -D language=en -b linkcheck -d docs/build/doctrees docs/source docs/build/linkcheck + sphinx-build -ETan -j auto -D language=en -b linkcheck -d docs/build/doctrees docs/source docs/build/linkcheck elif [ "$action" == "docs" ]; then - sphinx-build -WETa -j auto -D language=en -b html -d docs/build/doctrees docs/source docs/build/html + sphinx-build -ETa -j auto -D language=en -b html -d docs/build/doctrees docs/source docs/build/html fi diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md new file mode 100644 index 0000000000..b252589544 --- /dev/null +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -0,0 +1,240 @@ +# Advanced: Access the Data Catalog in code + +The code API allows you to: + +* configure data sources in code +* operate the IO module within notebooks + +### Configure a Data Catalog + +In a file like `catalog.py`, you can construct a `DataCatalog` object programmatically. In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets). + +```python +from kedro.io import DataCatalog +from kedro_datasets.pandas import ( + CSVDataSet, + SQLTableDataSet, + SQLQueryDataSet, + ParquetDataSet, +) + +io = DataCatalog( + { + "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"), + "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")), + "cars_table": SQLTableDataSet( + table_name="cars", credentials=dict(con="sqlite:///kedro.db") + ), + "scooters_query": SQLQueryDataSet( + sql="select * from cars where gear=4", + credentials=dict(con="sqlite:///kedro.db"), + ), + "ranked": ParquetDataSet(filepath="ranked.parquet"), + } +) +``` + +When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above we pass it as part of `credentials` argument. Alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only). + +### Load datasets + +You can access each dataset by its name. 
+ +```python +cars = io.load("cars") # data is now loaded as a DataFrame in 'cars' +gear = cars["gear"].values +``` + +#### Behind the scenes + +The following steps happened behind the scenes when `load` was called: + +- The value `cars` was located in the Data Catalog +- The corresponding `AbstractDataSet` object was retrieved +- The `load` method of this dataset was called +- This `load` method delegated the loading to the underlying pandas `read_csv` function + +### View the available data sources + +If you forget what data was assigned, you can always review the `DataCatalog`. + +```python +io.list() +``` + +### Save data + +You can save data using an API similar to that used to load data. + +```{warning} +This use is not recommended unless you are prototyping in notebooks. +``` + +#### Save data to memory + +```python +from kedro.io import MemoryDataSet + +memory = MemoryDataSet(data=None) +io.add("cars_cache", memory) +io.save("cars_cache", "Memory can store anything.") +io.load("cars_cache") +``` + +#### Save data to a SQL database for querying + +We might now want to put the data in a SQLite database to run queries on it. Let's use that to rank scooters by their mpg. + +```python +import os + +# This cleans up the database in case it exists at this point +try: + os.remove("kedro.db") +except FileNotFoundError: + pass + +io.save("cars_table", cars) +ranked = io.load("scooters_query")[["brand", "mpg"]] +``` + +#### Save data in Parquet + +Finally, we can save the processed data in Parquet format. + +```python +io.save("ranked", ranked) +``` + +```{warning} +Saving `None` to a dataset is not allowed! +``` + +### Accessing a dataset that needs credentials +Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument. + +Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents: + +```yaml +dev_s3: + client_kwargs: + aws_access_key_id: key + aws_secret_access_key: secret + +scooters_credentials: + con: sqlite:///kedro.db + +my_gcp_credentials: + id_token: key +``` + +Your code will look as follows: + +```python +CSVDataSet( + filepath="s3://test_bucket/data/02_intermediate/company/motorbikes.csv", + load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]), + credentials=dict(key="token", secret="key"), +) +``` + +### Versioning using the Code API + +In order to do that, pass a dictionary with exact load versions to `DataCatalog.from_config`: + +```python +load_versions = {"cars": "2019-02-13T14.35.36.518Z"} +io = DataCatalog.from_config(catalog_config, credentials, load_versions=load_versions) +cars = io.load("cars") +``` + +The last row in the example above would attempt to load a CSV file from `data/01_raw/company/car_data.csv/2019-02-13T14.35.36.518Z/car_data.csv`: + +* `load_versions` configuration has an effect only if a dataset versioning has been enabled in the catalog config file - see the example above. + +* We recommend that you do not override `save_version` argument in `DataCatalog.from_config` unless strongly required to do so, since it may lead to inconsistencies between loaded and saved versions of the versioned datasets. + +```{warning} +The `DataCatalog` does not re-generate save versions between instantiations. 
Therefore, if you call `catalog.save('cars', some_data)` twice, then the second call will fail, since it tries to overwrite a versioned dataset using the same save version. To mitigate this, reload your data catalog by calling `%reload_kedro` line magic. This limitation does not apply to `load` operation. +``` + +**** HOW DOES THE BELOW FIT WITH THE ABOVE?? + +Should you require more control over load and save versions of a specific dataset, you can instantiate `Version` and pass it as a parameter to the dataset initialisation: + +```python +from kedro.io import DataCatalog, Version +from kedro_datasets.pandas import CSVDataSet +import pandas as pd + +data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]}) +data2 = pd.DataFrame({"col1": [7], "col2": [8], "col3": [9]}) +version = Version( + load=None, # load the latest available version + save=None, # generate save version automatically on each save operation +) + +test_data_set = CSVDataSet( + filepath="data/01_raw/test.csv", save_args={"index": False}, version=version +) +io = DataCatalog({"test_data_set": test_data_set}) + +# save the dataset to data/01_raw/test.csv//test.csv +io.save("test_data_set", data1) +# save the dataset into a new file data/01_raw/test.csv//test.csv +io.save("test_data_set", data2) + +# load the latest version from data/test.csv/*/test.csv +reloaded = io.load("test_data_set") +assert data2.equals(reloaded) +``` + +```{note} +In the example above, we did not fix any versions. If we do, then the behaviour of load and save operations becomes slightly different: +``` + +```python +version = Version( + load="my_exact_version", # load exact version + save="my_exact_version", # save to exact version +) + +test_data_set = CSVDataSet( + filepath="data/01_raw/test.csv", save_args={"index": False}, version=version +) +io = DataCatalog({"test_data_set": test_data_set}) + +# save the dataset to data/01_raw/test.csv/my_exact_version/test.csv +io.save("test_data_set", data1) +# load from data/01_raw/test.csv/my_exact_version/test.csv +reloaded = io.load("test_data_set") +assert data1.equals(reloaded) + +# raises DataSetError since the path +# data/01_raw/test.csv/my_exact_version/test.csv already exists +io.save("test_data_set", data2) +``` + +```{warning} +We do not recommend passing exact load and/or save versions, since it might lead to inconsistencies between operations. For example, if versions for load and save operations do not match, a save operation would result in a `UserWarning` indicating that save and load versions do not match. Load after save might also return an error if the corresponding load version is not found: +``` + +```python +version = Version( + load="exact_load_version", # load exact version + save="exact_save_version", # save to exact version +) + +test_data_set = CSVDataSet( + filepath="data/01_raw/test.csv", save_args={"index": False}, version=version +) +io = DataCatalog({"test_data_set": test_data_set}) + +io.save("test_data_set", data1) # emits a UserWarning due to version inconsistency + +# raises DataSetError since the data/01_raw/test.csv/exact_load_version/test.csv +# file does not exist +reloaded = io.load("test_data_set") +``` + + diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index 3cc2cbd90b..fe3daa7b71 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -1,4 +1,5 @@ -# The Data Catalog + +# Introduction to the Kedro Data Catalog This section introduces `catalog.yml`, the project-shareable Data Catalog. 
The file is located in `conf/base` and is a registry of all data sources available for use by a project; it manages loading and saving of data. @@ -41,804 +42,15 @@ The following prepends are available: `fsspec` also provides other file systems, such as SSH, FTP and WebHDFS. [See the fsspec documentation for more information](https://filesystem-spec.readthedocs.io/en/latest/api.html#implementations). -## Data Catalog `*_args` parameters - -Data Catalog accepts two different groups of `*_args` parameters that serve different purposes: -- `fs_args` -- `load_args` and `save_args` - -The `fs_args` is used to configure the interaction with a filesystem. -All the top-level parameters of `fs_args` (except `open_args_load` and `open_args_save`) will be passed in an underlying filesystem class. - -### Example 1: Provide the `project` value to the underlying filesystem class (`GCSFileSystem`) to interact with Google Cloud Storage (GCS) - -```yaml -test_dataset: - type: ... - fs_args: - project: test_project -``` - -The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively. - -### Example 2: Load data from a local binary file using `utf-8` encoding - -```yaml -test_dataset: - type: ... - fs_args: - open_args_load: - mode: "rb" - encoding: "utf-8" -``` - -`load_args` and `save_args` configure how a third-party library (e.g. `pandas` for `CSVDataSet`) loads/saves data from/to a file. - -### Example 3: Save data to a CSV file without row names (index) using `utf-8` encoding - -```yaml -test_dataset: - type: pandas.CSVDataSet - ... - save_args: - index: False - encoding: "utf-8" -``` - -## Use the Data Catalog with the YAML API - -The YAML API allows you to configure your datasets in a YAML configuration file, `conf/base/catalog.yml` or `conf/local/catalog.yml`. - -Here are some examples of data configuration in a `catalog.yml`: - -### Example 1: Loads / saves a CSV file from / to a local file system - -```yaml -bikes: - type: pandas.CSVDataSet - filepath: data/01_raw/bikes.csv -``` - -### Example 2: Loads and saves a CSV on a local file system, using specified load and save arguments - -```yaml -cars: - type: pandas.CSVDataSet - filepath: data/01_raw/company/cars.csv - load_args: - sep: ',' - save_args: - index: False - date_format: '%Y-%m-%d %H:%M' - decimal: . 
- -``` - -### Example 3: Loads and saves a compressed CSV on a local file system - -```yaml -boats: - type: pandas.CSVDataSet - filepath: data/01_raw/company/boats.csv.gz - load_args: - sep: ',' - compression: 'gzip' - fs_args: - open_args_load: - mode: 'rb' -``` - -### Example 4: Loads a CSV file from a specific S3 bucket, using credentials and load arguments - -```yaml -motorbikes: - type: pandas.CSVDataSet - filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv - credentials: dev_s3 - load_args: - sep: ',' - skiprows: 5 - skipfooter: 1 - na_values: ['#NA', NA] -``` - -### Example 5: Loads / saves a pickle file from / to a local file system - -```yaml -airplanes: - type: pickle.PickleDataSet - filepath: data/06_models/airplanes.pkl - backend: pickle -``` - -### Example 6: Loads an Excel file from Google Cloud Storage - -```yaml -rockets: - type: pandas.ExcelDataSet - filepath: gcs://your_bucket/data/02_intermediate/company/motorbikes.xlsx - fs_args: - project: my-project - credentials: my_gcp_credentials - save_args: - sheet_name: Sheet1 -``` - -### Example 7: Loads a multi-sheet Excel file from a local file system - -```yaml -trains: - type: pandas.ExcelDataSet - filepath: data/02_intermediate/company/trains.xlsx - load_args: - sheet_name: [Sheet1, Sheet2, Sheet3] -``` - -### Example 8: Saves an image created with Matplotlib on Google Cloud Storage - -```yaml -results_plot: - type: matplotlib.MatplotlibWriter - filepath: gcs://your_bucket/data/08_results/plots/output_1.jpeg - fs_args: - project: my-project - credentials: my_gcp_credentials -``` - - -### Example 9: Loads / saves an HDF file on local file system storage, using specified load and save arguments - -```yaml -skateboards: - type: pandas.HDFDataSet - filepath: data/02_intermediate/skateboards.hdf - key: name - load_args: - columns: [brand, length] - save_args: - mode: w # Overwrite even when the file already exists - dropna: True -``` - -### Example 10: Loads / saves a parquet file on local file system storage, using specified load and save arguments - -```yaml -trucks: - type: pandas.ParquetDataSet - filepath: data/02_intermediate/trucks.parquet - load_args: - columns: [name, gear, disp, wt] - categories: list - index: name - save_args: - compression: GZIP - file_scheme: hive - has_nulls: False - partition_on: [name] -``` - - -### Example 11: Loads / saves a Spark table on S3, using specified load and save arguments - -```yaml -weather: - type: spark.SparkDataSet - filepath: s3a://your_bucket/data/01_raw/weather* - credentials: dev_s3 - file_format: csv - load_args: - header: True - inferSchema: True - save_args: - sep: '|' - header: True -``` - - -### Example 12: Loads / saves a SQL table using credentials, a database connection, using specified load and save arguments - -```yaml -scooters: - type: pandas.SQLTableDataSet - credentials: scooters_credentials - table_name: scooters - load_args: - index_col: [name] - columns: [name, gear] - save_args: - if_exists: replace -``` - -### Example 13: Loads an SQL table with credentials, a database connection, and applies a SQL query to the table - - -```yaml -scooters_query: - type: pandas.SQLQueryDataSet - credentials: scooters_credentials - sql: select * from cars where gear=4 - load_args: - index_col: [name] -``` - -When you use [`pandas.SQLTableDataSet`](/kedro_datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro_datasets.pandas.SQLQueryDataSet), you must provide a database connection string. 
In the above example, we pass it using the `scooters_credentials` key from the credentials (see the details in the [Feeding in credentials](#feeding-in-credentials) section below). `scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only). - - -### Example 14: Loads data from an API endpoint, example US corn yield data from USDA - -```yaml -us_corn_yield_data: - type: api.APIDataSet - url: https://quickstats.nass.usda.gov - credentials: usda_credentials - params: - key: SOME_TOKEN - format: JSON - commodity_desc: CORN - statisticcat_des: YIELD - agg_level_desc: STATE - year: 2000 -``` - -Note that `usda_credientials` will be passed as the `auth` argument in the `requests` library. Specify the username and password as a list in your `credentials.yml` file as follows: - -```yaml -usda_credentials: - - username - - password -``` - - -### Example 15: Loads data from Minio (S3 API Compatible Storage) - - -```yaml -test: - type: pandas.CSVDataSet - filepath: s3://your_bucket/test.csv # assume `test.csv` is uploaded to the Minio server. - credentials: dev_minio -``` -In `credentials.yml`, define the `key`, `secret` and the `endpoint_url` as follows: - -```yaml -dev_minio: - key: token - secret: key - client_kwargs: - endpoint_url : 'http://localhost:9000' -``` - -```{note} -The easiest way to setup MinIO is to run a Docker image. After the following command, you can access the Minio server with `http://localhost:9000` and create a bucket and add files as if it is on S3. -``` - -`docker run -p 9000:9000 -e "MINIO_ACCESS_KEY=token" -e "MINIO_SECRET_KEY=key" minio/minio server /data` - - -### Example 16: Loads a model saved as a pickle from Azure Blob Storage - -```yaml -ml_model: - type: pickle.PickleDataSet - filepath: "abfs://models/ml_models.pickle" - versioned: True - credentials: dev_abs -``` -In the `credentials.yml` file, define the `account_name` and `account_key`: - -```yaml -dev_abs: - account_name: accountname - account_key: key -``` - - -### Example 17: Loads a CSV file stored in a remote location through SSH - -```{note} -This example requires [Paramiko](https://www.paramiko.org) to be installed (`pip install paramiko`). -``` -```yaml -cool_dataset: - type: pandas.CSVDataSet - filepath: "sftp:///path/to/remote_cluster/cool_data.csv" - credentials: cluster_credentials -``` -All parameters required to establish the SFTP connection can be defined through `fs_args` or in the `credentials.yml` file as follows: - -```yaml -cluster_credentials: - username: my_username - host: host_address - port: 22 - password: password -``` -The list of all available parameters is given in the [Paramiko documentation](https://docs.paramiko.org/en/2.4/api/client.html#paramiko.client.SSHClient.connect). - -## Create a Data Catalog YAML configuration file via CLI - -You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file). - -This creates a `//catalog/.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. 
- -```yaml -# //catalog/.yml -rockets: - type: MemoryDataSet -scooters: - type: MemoryDataSet -``` - -## Adding parameters - -You can [configure parameters](../configuration/parameters.md) for your project and [reference them](../configuration/parameters.md#how-to-use-parameters) in your nodes. To do this, use the `add_feed_dict()` method ([API documentation](/kedro.io.DataCatalog)). You can use this method to add any other entry or metadata you wish on the `DataCatalog`. - - -## Feeding in credentials - -Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument. - -Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents: - -```yaml -dev_s3: - client_kwargs: - aws_access_key_id: key - aws_secret_access_key: secret - -scooters_credentials: - con: sqlite:///kedro.db - -my_gcp_credentials: - id_token: key -``` - -In the example above, the `catalog.yml` file contains references to credentials keys `dev_s3` and `scooters_credentials`. This means that when it instantiates the `motorbikes` dataset, for example, the `DataCatalog` will attempt to read top-level key `dev_s3` from the received `credentials` dictionary, and then will pass its values into the dataset `__init__` as a `credentials` argument. This is essentially equivalent to calling this: - -```python -CSVDataSet( - filepath="s3://test_bucket/data/02_intermediate/company/motorbikes.csv", - load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]), - credentials=dict(key="token", secret="key"), -) -``` - - -## Load multiple datasets with similar configuration using YAML anchors - -Different datasets might use the same file format, load and save arguments, and be stored in the same folder. [YAML has a built-in syntax](https://yaml.org/spec/1.2.1/#Syntax) for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets, so that you need not spend time copying and pasting dataset configurations in the `catalog.yml` file. - -You can see this in the following example: - -```yaml -_csv: &csv - type: spark.SparkDataSet - file_format: csv - load_args: - sep: ',' - na_values: ['#NA', NA] - header: True - inferSchema: False - -cars: - <<: *csv - filepath: s3a://data/01_raw/cars.csv - -trucks: - <<: *csv - filepath: s3a://data/01_raw/trucks.csv - -bikes: - <<: *csv - filepath: s3a://data/01_raw/bikes.csv - load_args: - header: False -``` - -The syntax `&csv` names the following block `csv` and the syntax `<<: *csv` inserts the contents of the block named `csv`. Locally declared keys entirely override inserted ones as seen in `bikes`. - -```{note} -It's important that the name of the template entry starts with a `_` so Kedro knows not to try and instantiate it as a dataset. -``` - -You can also nest reuseable YAML syntax: - -```yaml -_csv: &csv - type: spark.SparkDataSet - file_format: csv - load_args: &csv_load_args - header: True - inferSchema: False - -airplanes: - <<: *csv - filepath: s3a://data/01_raw/airplanes.csv - load_args: - <<: *csv_load_args - sep: ; -``` - -In this example, the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally, that would replace the whole dictionary. In order to extend `load_args`, the defaults for that block are then re-inserted. 
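If you want to check what the merged entry looks like once the anchors are resolved, you can parse the snippet yourself: the resolution happens in the YAML parser before Kedro sees the configuration. The following sketch is illustrative only — it assumes PyYAML is available and simply reuses the `airplanes` example above:

```python
# Sketch: inspect how YAML anchors and merge keys resolve.
# Assumes PyYAML is installed; the snippet mirrors the `airplanes` example above.
import yaml

catalog_yaml = """
_csv: &csv
  type: spark.SparkDataSet
  file_format: csv
  load_args: &csv_load_args
    header: True
    inferSchema: False

airplanes:
  <<: *csv
  filepath: s3a://data/01_raw/airplanes.csv
  load_args:
    <<: *csv_load_args
    sep: ;
"""

resolved = yaml.safe_load(catalog_yaml)
print(resolved["airplanes"])
# Expected result (key order may vary):
# {'type': 'spark.SparkDataSet', 'file_format': 'csv',
#  'filepath': 's3a://data/01_raw/airplanes.csv',
#  'load_args': {'header': True, 'inferSchema': False, 'sep': ';'}}
```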
- -## Load multiple datasets with similar configuration using dataset factories -For catalog entries that share configuration details, you can also use the dataset factories introduced in Kedro 0.18.12. This syntax allows you to generalise the configuration and -reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns. - -### Example 1: Generalise datasets with similar names and types into one dataset factory -Consider the following catalog entries: -```yaml -factory_data: - type: pandas.CSVDataSet - filepath: data/01_raw/factory_data.csv - - -process_data: - type: pandas.CSVDataSet - filepath: data/01_raw/process_data.csv -``` -The datasets in this catalog can be generalised to the following dataset factory: -```yaml -"{name}_data": - type: pandas.CSVDataSet - filepath: data/01_raw/{name}_data.csv -``` -When `factory_data` or `process_data` is used in your pipeline, it is matched to the factory pattern `{name}_data`. The factory pattern must always be enclosed in -quotes to avoid YAML parsing errors. - - -### Example 2: Generalise datasets of the same type into one dataset factory -You can also combine all the datasets with the same type and configuration details. For example, consider the following -catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`: -```yaml -boats: - type: pandas.CSVDataSet - filepath: data/01_raw/shuttles.csv - -cars: - type: pandas.CSVDataSet - filepath: data/01_raw/reviews.csv - -planes: - type: pandas.CSVDataSet - filepath: data/01_raw/companies.csv -``` -These datasets can be combined into the following dataset factory: -```yaml -"{dataset_name}#csv": - type: pandas.CSVDataSet - filepath: data/01_raw/{dataset_name}.csv -``` -You will then have to update the pipelines in your project located at `src///pipeline.py` to refer to these datasets as `boats#csv`, -`cars#csv` and `planes#csv`. Adding a suffix or a prefix to the dataset names and the dataset factory patterns, like `#csv` here, ensures that the dataset -names are matched with the intended pattern. -```python -from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles - - -def create_pipeline(**kwargs) -> Pipeline: - return pipeline( - [ - node( - func=preprocess_boats, - inputs="boats#csv", - outputs="preprocessed_boats", - name="preprocess_boats_node", - ), - node( - func=preprocess_cars, - inputs="cars#csv", - outputs="preprocessed_cars", - name="preprocess_cars_node", - ), - node( - func=preprocess_planes, - inputs="planes#csv", - outputs="preprocessed_planes", - name="preprocess_planes_node", - ), - node( - func=create_model_input_table, - inputs=[ - "preprocessed_boats", - "preprocessed_planes", - "preprocessed_cars", - ], - outputs="model_input_table", - name="create_model_input_table_node", - ), - ] - ) -``` -### Example 3: Generalise datasets using namespaces into one dataset factory -You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. 
Consider the -following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the -`active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces: -```python -from kedro.pipeline import Pipeline, node -from kedro.pipeline.modular_pipeline import pipeline - -from .nodes import evaluate_model, split_data, train_model - - -def create_pipeline(**kwargs) -> Pipeline: - pipeline_instance = pipeline( - [ - node( - func=split_data, - inputs=["model_input_table", "params:model_options"], - outputs=["X_train", "y_train"], - name="split_data_node", - ), - node( - func=train_model, - inputs=["X_train", "y_train"], - outputs="regressor", - name="train_model_node", - ), - ] - ) - ds_pipeline_1 = pipeline( - pipe=pipeline_instance, - inputs="model_input_table", - namespace="active_modelling_pipeline", - ) - ds_pipeline_2 = pipeline( - pipe=pipeline_instance, - inputs="model_input_table", - namespace="candidate_modelling_pipeline", - ) - - return ds_pipeline_1 + ds_pipeline_2 -``` -You can now have one dataset factory pattern in your catalog instead of two separate entries for `active_modelling_pipeline.regressor` -and `candidate_modelling_pipeline.regressor` as below: -```yaml -{namespace}.regressor: - type: pickle.PickleDataSet - filepath: data/06_models/regressor_{namespace}.pkl - versioned: true -``` -### Example 4: Generalise datasets of the same type in different layers into one dataset factory with multiple placeholders - -You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset -entries share `type`, `file_format` and `save_args`: -```yaml -processing.factory_data: - type: spark.SparkDataSet - filepath: data/processing/factory_data.pq - file_format: parquet - save_args: - mode: overwrite - -processing.process_data: - type: spark.SparkDataSet - filepath: data/processing/process_data.pq - file_format: parquet - save_args: - mode: overwrite - -modelling.metrics: - type: spark.SparkDataSet - filepath: data/modelling/factory_data.pq - file_format: parquet - save_args: - mode: overwrite -``` -This could be generalised to the following pattern: -```yaml -"{layer}.{dataset_name}": - type: spark.SparkDataSet - filepath: data/{layer}/{dataset_name}.pq - file_format: parquet - save_args: - mode: overwrite -``` -All the placeholders used in the catalog entry body must exist in the factory pattern name. - -### Example 5: Generalise datasets using multiple dataset factories -You can have multiple dataset factories in your catalog. For example: -```yaml -"{namespace}.{dataset_name}@spark": - type: spark.SparkDataSet - filepath: data/{namespace}/{dataset_name}.pq - file_format: parquet - -"{dataset_name}@csv": - type: pandas.CSVDataSet - filepath: data/01_raw/{dataset_name}.csv -``` - -Having multiple dataset factories in your catalog can lead to a situation where a dataset name from your pipeline might -match multiple patterns. To overcome this, Kedro sorts all the potential matches for the dataset name in the pipeline and picks the best match. -The matches are ranked according to the following criteria : -1. Number of exact character matches between the dataset name and the factory pattern. For example, a dataset named `factory_data$csv` would match `{dataset}_data$csv` over `{dataset_name}$csv`. -2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`. -3. 
Alphabetical order - -### Example 6: Generalise all datasets with a catch-all dataset factory to overwrite the default `MemoryDataSet` -You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation. -```yaml -"{default_dataset}": - type: pandas.CSVDataSet - filepath: data/{default_dataset}.csv - -``` -Kedro will now treat all the datasets mentioned in your project's pipelines that do not appear as specific patterns or explicit entries in your catalog -as `pandas.CSVDataSet`. - -## Transcode datasets - -You might come across a situation where you would like to read the same file using two different dataset implementations. Use transcoding when you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations. - -### A typical example of transcoding - -For instance, parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. - -To enable transcoding, define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your `conf/base/catalog.yml`: -```yaml -my_dataframe@spark: - type: spark.SparkDataSet - filepath: data/02_intermediate/data.parquet - file_format: parquet -my_dataframe@pandas: - type: pandas.ParquetDataSet - filepath: data/02_intermediate/data.parquet -``` - -These entries are used in the pipeline like this: - -```python -pipeline( - [ - node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"), - node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"), - ] -) -``` - -### How does transcoding work? - -In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and helps resolve the node execution order. - -In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` -for loading, so the first node should output a `pyspark.sql.DataFrame`, while the second node would receive a `pandas.Dataframe`. - - -## Version datasets and ML models - -Making a simple addition to your Data Catalog allows you to perform versioning of datasets and machine learning models. - -Consider the following versioned dataset defined in the `catalog.yml`: - -```yaml -cars: - type: pandas.CSVDataSet - filepath: data/01_raw/company/cars.csv - versioned: True -``` - -The `DataCatalog` will create a versioned `CSVDataSet` called `cars`. The actual csv file location will look like `data/01_raw/company/cars.csv//cars.csv`, where `` corresponds to a global save version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`. - -You can run the pipeline with a particular versioned data set with `--load-version` flag as follows: - -```bash -kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ -``` -where `--load-version` is dataset name and version timestamp separated by `:`. - -This section shows just the very basics of versioning, which is described further in [the documentation about Kedro IO](../data/kedro_io.md#versioning). - -## Use the Data Catalog with the Code API - -The code API allows you to: - -* configure data sources in code -* operate the IO module within notebooks - -### Configure a Data Catalog - -In a file like `catalog.py`, you can construct a `DataCatalog` object programmatically. 
In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets). - -```python -from kedro.io import DataCatalog -from kedro_datasets.pandas import ( - CSVDataSet, - SQLTableDataSet, - SQLQueryDataSet, - ParquetDataSet, -) - -io = DataCatalog( - { - "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"), - "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")), - "cars_table": SQLTableDataSet( - table_name="cars", credentials=dict(con="sqlite:///kedro.db") - ), - "scooters_query": SQLQueryDataSet( - sql="select * from cars where gear=4", - credentials=dict(con="sqlite:///kedro.db"), - ), - "ranked": ParquetDataSet(filepath="ranked.parquet"), - } -) -``` - -When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above we pass it as part of `credentials` argument. Alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only). - -### Load datasets -You can access each dataset by its name. - -```python -cars = io.load("cars") # data is now loaded as a DataFrame in 'cars' -gear = cars["gear"].values -``` - -#### Behind the scenes - -The following steps happened behind the scenes when `load` was called: - -- The value `cars` was located in the Data Catalog -- The corresponding `AbstractDataSet` object was retrieved -- The `load` method of this dataset was called -- This `load` method delegated the loading to the underlying pandas `read_csv` function - -### View the available data sources - -If you forget what data was assigned, you can always review the `DataCatalog`. - -```python -io.list() -``` - -### Save data - -You can save data using an API similar to that used to load data. - -```{warning} -This use is not recommended unless you are prototyping in notebooks. -``` - -#### Save data to memory - -```python -from kedro.io import MemoryDataSet - -memory = MemoryDataSet(data=None) -io.add("cars_cache", memory) -io.save("cars_cache", "Memory can store anything.") -io.load("cars_cache") -``` - -#### Save data to a SQL database for querying - -We might now want to put the data in a SQLite database to run queries on it. Let's use that to rank scooters by their mpg. - -```python -import os - -# This cleans up the database in case it exists at this point -try: - os.remove("kedro.db") -except FileNotFoundError: - pass - -io.save("cars_table", cars) -ranked = io.load("scooters_query")[["brand", "mpg"]] -``` - -#### Save data in Parquet - -Finally, we can save the processed data in Parquet format. - -```python -io.save("ranked", ranked) -``` +```{toctree} +:maxdepth: 1 -```{warning} -Saving `None` to a dataset is not allowed! +data_catalog_yaml_examples +data_catalog_basic_how_to +partitioned_and_incremental_datasets +advanced_data_catalog_usage +how_to_create_a_custom_dataset ``` diff --git a/docs/source/data/data_catalog_basic_how_to.md b/docs/source/data/data_catalog_basic_how_to.md new file mode 100644 index 0000000000..4d5c8c4604 --- /dev/null +++ b/docs/source/data/data_catalog_basic_how_to.md @@ -0,0 +1,130 @@ +# Data Catalog how to guide + + +TO DO: Revise any explanations where possible to make it more hands on/task based + +## How to version datasets and ML models + +Making a simple addition to your Data Catalog allows you to perform versioning of datasets and machine learning models. 
+ +Consider the following versioned dataset defined in the `catalog.yml`: + +```yaml +cars: + type: pandas.CSVDataSet + filepath: data/01_raw/company/cars.csv + versioned: True +``` + +The `DataCatalog` will create a folder to store a version of the `CSVDataSet` called `cars`. The actual csv file location will look like `data/01_raw/company/cars.csv//cars.csv`, where `` corresponds to a global save version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`. + +By default, the `DataCatalog` will load the latest version of the dataset. However, you can also specify a particular versioned data set with `--load-version` flag as follows: + +```bash +kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ +``` +where `--load-version` is dataset name and version timestamp separated by `:`. + +### Supported datasets + +Currently, the following datasets support versioning: + +- `kedro_datasets.matplotlib.MatplotlibWriter` +- `kedro_datasets.holoviews.HoloviewsWriter` +- `kedro_datasets.networkx.NetworkXDataSet` +- `kedro_datasets.pandas.CSVDataSet` +- `kedro_datasets.pandas.ExcelDataSet` +- `kedro_datasets.pandas.FeatherDataSet` +- `kedro_datasets.pandas.HDFDataSet` +- `kedro_datasets.pandas.JSONDataSet` +- `kedro_datasets.pandas.ParquetDataSet` +- `kedro_datasets.pickle.PickleDataSet` +- `kedro_datasets.pillow.ImageDataSet` +- `kedro_datasets.text.TextDataSet` +- `kedro_datasets.spark.SparkDataSet` +- `kedro_datasets.yaml.YAMLDataSet` +- `kedro_datasets.api.APIDataSet` +- `kedro_datasets.tensorflow.TensorFlowModelDataSet` +- `kedro_datasets.json.JSONDataSet` + +```{note} +Although HTTP(S) is a supported file system in the dataset implementations, it does not support versioning. +``` + +## How to create a Data Catalog YAML configuration file via the CLI + +You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file). + +This creates a `//catalog/.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. + +```yaml +# //catalog/.yml +rockets: + type: MemoryDataSet +scooters: + type: MemoryDataSet +``` + + +## How to access a dataset that needs credentials + +Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument. + +Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents: + +```yaml +dev_s3: + client_kwargs: + aws_access_key_id: key + aws_secret_access_key: secret + +scooters_credentials: + con: sqlite:///kedro.db + +my_gcp_credentials: + id_token: key +``` + +In the example above, the `catalog.yml` file contains references to credentials keys `dev_s3` and `scooters_credentials`. This means that when it instantiates the `motorbikes` dataset, for example, the `DataCatalog` will attempt to read top-level key `dev_s3` from the received `credentials` dictionary, and then will pass its values into the dataset `__init__` as a `credentials` argument. + + +## How to read the same file using two different dataset implementations + +When you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations, you'll need to use transcoding. 
+ +### A typical example of transcoding + +For instance, parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. + +To enable transcoding, define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your `conf/base/catalog.yml`: + +```yaml +my_dataframe@spark: + type: spark.SparkDataSet + filepath: data/02_intermediate/data.parquet + file_format: parquet + +my_dataframe@pandas: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/data.parquet +``` + +These entries are used in the pipeline like this: + +```python +pipeline( + [ + node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"), + node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"), + ] +) +``` + +### How does transcoding work? + +In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and helps resolve the node execution order. + +In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` +for loading, so the first node should output a `pyspark.sql.DataFrame`, while the second node would receive a `pandas.Dataframe`. + + diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md new file mode 100644 index 0000000000..e90e5e0d68 --- /dev/null +++ b/docs/source/data/data_catalog_yaml_examples.md @@ -0,0 +1,574 @@ +# Data Catalog YAML examples + +You can configure your datasets in a YAML configuration file, `conf/base/catalog.yml` or `conf/local/catalog.yml`. + +Here are some examples of data configuration in a `catalog.yml`: + +Data Catalog accepts two different groups of `*_args` parameters that serve different purposes: +- `fs_args` +- `load_args` and `save_args` + +The `fs_args` is used to configure the interaction with a filesystem. +All the top-level parameters of `fs_args` (except `open_args_load` and `open_args_save`) will be passed in an underlying filesystem class. + +**Provide the `project` value to the underlying filesystem class (`GCSFileSystem`) to interact with Google Cloud Storage (GCS) +** +```yaml +test_dataset: + type: ... + fs_args: + project: test_project +``` + +The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively. + +**Load data from a local binary file using `utf-8` encoding +** +```yaml +test_dataset: + type: ... + fs_args: + open_args_load: + mode: "rb" + encoding: "utf-8" +``` + +`load_args` and `save_args` configure how a third-party library (e.g. `pandas` for `CSVDataSet`) loads/saves data from/to a file. + +**Save data to a CSV file without row names (index) using `utf-8` encoding +** +```yaml +test_dataset: + type: pandas.CSVDataSet + ... + save_args: + index: False + encoding: "utf-8" +``` + +## Load / save a CSV file from / to a local file system + +```yaml +bikes: + type: pandas.CSVDataSet + filepath: data/01_raw/bikes.csv +``` + +## Load / save a CSV on a local file system, using specified load / save arguments + +```yaml +cars: + type: pandas.CSVDataSet + filepath: data/01_raw/company/cars.csv + load_args: + sep: ',' + save_args: + index: False + date_format: '%Y-%m-%d %H:%M' + decimal: . 
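    # For pandas.CSVDataSet, load_args are forwarded to pandas.read_csv and
    # save_args to pandas.DataFrame.to_csv, so any keyword accepted by those
    # functions can be used here.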
+ +``` + +## Load / save a compressed CSV on a local file system + +```yaml +boats: + type: pandas.CSVDataSet + filepath: data/01_raw/company/boats.csv.gz + load_args: + sep: ',' + compression: 'gzip' + fs_args: + open_args_load: + mode: 'rb' +``` + +## Load a CSV file from a specific S3 bucket, using credentials and load arguments + +```yaml +motorbikes: + type: pandas.CSVDataSet + filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv + credentials: dev_s3 + load_args: + sep: ',' + skiprows: 5 + skipfooter: 1 + na_values: ['#NA', NA] +``` + +## Load / save a pickle file from / to a local file system + +```yaml +airplanes: + type: pickle.PickleDataSet + filepath: data/06_models/airplanes.pkl + backend: pickle +``` + +## Load an Excel file from Google Cloud Storage + +```yaml +rockets: + type: pandas.ExcelDataSet + filepath: gcs://your_bucket/data/02_intermediate/company/motorbikes.xlsx + fs_args: + project: my-project + credentials: my_gcp_credentials + save_args: + sheet_name: Sheet1 +``` + +## Load a multi-sheet Excel file from a local file system + +```yaml +trains: + type: pandas.ExcelDataSet + filepath: data/02_intermediate/company/trains.xlsx + load_args: + sheet_name: [Sheet1, Sheet2, Sheet3] +``` + +## Save an image created with Matplotlib on Google Cloud Storage + +```yaml +results_plot: + type: matplotlib.MatplotlibWriter + filepath: gcs://your_bucket/data/08_results/plots/output_1.jpeg + fs_args: + project: my-project + credentials: my_gcp_credentials +``` + + +## Load / save an HDF file on local file system storage, using specified load / save arguments + +```yaml +skateboards: + type: pandas.HDFDataSet + filepath: data/02_intermediate/skateboards.hdf + key: name + load_args: + columns: [brand, length] + save_args: + mode: w # Overwrite even when the file already exists + dropna: True +``` + +## Load / save a parquet file on local file system storage, using specified load / save arguments + +```yaml +trucks: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/trucks.parquet + load_args: + columns: [name, gear, disp, wt] + categories: list + index: name + save_args: + compression: GZIP + file_scheme: hive + has_nulls: False + partition_on: [name] +``` + + +## Load / save a Spark table on S3, using specified load / save arguments + +```yaml +weather: + type: spark.SparkDataSet + filepath: s3a://your_bucket/data/01_raw/weather* + credentials: dev_s3 + file_format: csv + load_args: + header: True + inferSchema: True + save_args: + sep: '|' + header: True +``` + + +## Load / save a SQL table using credentials, a database connection, using specified load / save arguments + +```yaml +scooters: + type: pandas.SQLTableDataSet + credentials: scooters_credentials + table_name: scooters + load_args: + index_col: [name] + columns: [name, gear] + save_args: + if_exists: replace +``` + +## Load an SQL table with credentials, a database connection, and applies a SQL query to the table + + +```yaml +scooters_query: + type: pandas.SQLQueryDataSet + credentials: scooters_credentials + sql: select * from cars where gear=4 + load_args: + index_col: [name] +``` + +When you use [`pandas.SQLTableDataSet`](/kedro_datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro_datasets.pandas.SQLQueryDataSet), you must provide a database connection string. In the above example, we pass it using the `scooters_credentials` key from the credentials (see the details in the [Feeding in credentials](#feeding-in-credentials) section below). 
`scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only). + + +## Load data from an API endpoint, example US corn yield data from USDA + +```yaml +us_corn_yield_data: + type: api.APIDataSet + url: https://quickstats.nass.usda.gov + credentials: usda_credentials + params: + key: SOME_TOKEN + format: JSON + commodity_desc: CORN + statisticcat_des: YIELD + agg_level_desc: STATE + year: 2000 +``` + +Note that `usda_credientials` will be passed as the `auth` argument in the `requests` library. Specify the username and password as a list in your `credentials.yml` file as follows: + +```yaml +usda_credentials: + - username + - password +``` + + +## Load data from Minio (S3 API Compatible Storage) + + +```yaml +test: + type: pandas.CSVDataSet + filepath: s3://your_bucket/test.csv # assume `test.csv` is uploaded to the Minio server. + credentials: dev_minio +``` +In `credentials.yml`, define the `key`, `secret` and the `endpoint_url` as follows: + +```yaml +dev_minio: + key: token + secret: key + client_kwargs: + endpoint_url : 'http://localhost:9000' +``` + +```{note} +The easiest way to setup MinIO is to run a Docker image. After the following command, you can access the Minio server with `http://localhost:9000` and create a bucket and add files as if it is on S3. +``` + +`docker run -p 9000:9000 -e "MINIO_ACCESS_KEY=token" -e "MINIO_SECRET_KEY=key" minio/minio server /data` + + +## Load a model saved as a pickle from Azure Blob Storage + +```yaml +ml_model: + type: pickle.PickleDataSet + filepath: "abfs://models/ml_models.pickle" + versioned: True + credentials: dev_abs +``` +In the `credentials.yml` file, define the `account_name` and `account_key`: + +```yaml +dev_abs: + account_name: accountname + account_key: key +``` + + +## Load a CSV file stored in a remote location through SSH + +```{note} +This example requires [Paramiko](https://www.paramiko.org) to be installed (`pip install paramiko`). +``` +```yaml +cool_dataset: + type: pandas.CSVDataSet + filepath: "sftp:///path/to/remote_cluster/cool_data.csv" + credentials: cluster_credentials +``` +All parameters required to establish the SFTP connection can be defined through `fs_args` or in the `credentials.yml` file as follows: + +```yaml +cluster_credentials: + username: my_username + host: host_address + port: 22 + password: password +``` +The list of all available parameters is given in the [Paramiko documentation](https://docs.paramiko.org/en/2.4/api/client.html#paramiko.client.SSHClient.connect). + +## Load multiple datasets with similar configuration using YAML anchors + +Different datasets might use the same file format, load and save arguments, and be stored in the same folder. [YAML has a built-in syntax](https://yaml.org/spec/1.2.1/#Syntax) for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets, so that you need not spend time copying and pasting dataset configurations in the `catalog.yml` file. 
+ +You can see this in the following example: + +```yaml +_csv: &csv + type: spark.SparkDataSet + file_format: csv + load_args: + sep: ',' + na_values: ['#NA', NA] + header: True + inferSchema: False + +cars: + <<: *csv + filepath: s3a://data/01_raw/cars.csv + +trucks: + <<: *csv + filepath: s3a://data/01_raw/trucks.csv + +bikes: + <<: *csv + filepath: s3a://data/01_raw/bikes.csv + load_args: + header: False +``` + +The syntax `&csv` names the following block `csv` and the syntax `<<: *csv` inserts the contents of the block named `csv`. Locally declared keys entirely override inserted ones as seen in `bikes`. + +```{note} +It's important that the name of the template entry starts with a `_` so Kedro knows not to try and instantiate it as a dataset. +``` + +You can also nest reuseable YAML syntax: + +```yaml +_csv: &csv + type: spark.SparkDataSet + file_format: csv + load_args: &csv_load_args + header: True + inferSchema: False + +airplanes: + <<: *csv + filepath: s3a://data/01_raw/airplanes.csv + load_args: + <<: *csv_load_args + sep: ; +``` + +In this example, the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally, that would replace the whole dictionary. In order to extend `load_args`, the defaults for that block are then re-inserted. + +## Load multiple datasets with similar configuration using dataset factories +For catalog entries that share configuration details, you can also use the dataset factories introduced in Kedro 0.18.12. This syntax allows you to generalise the configuration and +reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns. + +### Generalise datasets with similar names and types into one dataset factory +Consider the following catalog entries: +```yaml +factory_data: + type: pandas.CSVDataSet + filepath: data/01_raw/factory_data.csv + + +process_data: + type: pandas.CSVDataSet + filepath: data/01_raw/process_data.csv +``` +The datasets in this catalog can be generalised to the following dataset factory: +```yaml +"{name}_data": + type: pandas.CSVDataSet + filepath: data/01_raw/{name}_data.csv +``` +When `factory_data` or `process_data` is used in your pipeline, it is matched to the factory pattern `{name}_data`. The factory pattern must always be enclosed in +quotes to avoid YAML parsing errors. + + +### Generalise datasets of the same type into one dataset factory +You can also combine all the datasets with the same type and configuration details. For example, consider the following +catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`: +```yaml +boats: + type: pandas.CSVDataSet + filepath: data/01_raw/shuttles.csv + +cars: + type: pandas.CSVDataSet + filepath: data/01_raw/reviews.csv + +planes: + type: pandas.CSVDataSet + filepath: data/01_raw/companies.csv +``` +These datasets can be combined into the following dataset factory: +```yaml +"{dataset_name}#csv": + type: pandas.CSVDataSet + filepath: data/01_raw/{dataset_name}.csv +``` +You will then have to update the pipelines in your project located at `src///pipeline.py` to refer to these datasets as `boats#csv`, +`cars#csv` and `planes#csv`. Adding a suffix or a prefix to the dataset names and the dataset factory patterns, like `#csv` here, ensures that the dataset +names are matched with the intended pattern. 
+```python +from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles + + +def create_pipeline(**kwargs) -> Pipeline: + return pipeline( + [ + node( + func=preprocess_boats, + inputs="boats#csv", + outputs="preprocessed_boats", + name="preprocess_boats_node", + ), + node( + func=preprocess_cars, + inputs="cars#csv", + outputs="preprocessed_cars", + name="preprocess_cars_node", + ), + node( + func=preprocess_planes, + inputs="planes#csv", + outputs="preprocessed_planes", + name="preprocess_planes_node", + ), + node( + func=create_model_input_table, + inputs=[ + "preprocessed_boats", + "preprocessed_planes", + "preprocessed_cars", + ], + outputs="model_input_table", + name="create_model_input_table_node", + ), + ] + ) +``` +### Generalise datasets using namespaces into one dataset factory +You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. Consider the +following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the +`active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces: +```python +from kedro.pipeline import Pipeline, node +from kedro.pipeline.modular_pipeline import pipeline + +from .nodes import evaluate_model, split_data, train_model + + +def create_pipeline(**kwargs) -> Pipeline: + pipeline_instance = pipeline( + [ + node( + func=split_data, + inputs=["model_input_table", "params:model_options"], + outputs=["X_train", "y_train"], + name="split_data_node", + ), + node( + func=train_model, + inputs=["X_train", "y_train"], + outputs="regressor", + name="train_model_node", + ), + ] + ) + ds_pipeline_1 = pipeline( + pipe=pipeline_instance, + inputs="model_input_table", + namespace="active_modelling_pipeline", + ) + ds_pipeline_2 = pipeline( + pipe=pipeline_instance, + inputs="model_input_table", + namespace="candidate_modelling_pipeline", + ) + + return ds_pipeline_1 + ds_pipeline_2 +``` +You can now have one dataset factory pattern in your catalog instead of two separate entries for `active_modelling_pipeline.regressor` +and `candidate_modelling_pipeline.regressor` as below: +```yaml +{namespace}.regressor: + type: pickle.PickleDataSet + filepath: data/06_models/regressor_{namespace}.pkl + versioned: true +``` +### Generalise datasets of the same type in different layers into one dataset factory with multiple placeholders + +You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset +entries share `type`, `file_format` and `save_args`: +```yaml +processing.factory_data: + type: spark.SparkDataSet + filepath: data/processing/factory_data.pq + file_format: parquet + save_args: + mode: overwrite + +processing.process_data: + type: spark.SparkDataSet + filepath: data/processing/process_data.pq + file_format: parquet + save_args: + mode: overwrite + +modelling.metrics: + type: spark.SparkDataSet + filepath: data/modelling/factory_data.pq + file_format: parquet + save_args: + mode: overwrite +``` +This could be generalised to the following pattern: +```yaml +"{layer}.{dataset_name}": + type: spark.SparkDataSet + filepath: data/{layer}/{dataset_name}.pq + file_format: parquet + save_args: + mode: overwrite +``` +All the placeholders used in the catalog entry body must exist in the factory pattern name. + +### Generalise datasets using multiple dataset factories +You can have multiple dataset factories in your catalog. 
For example:
```yaml
"{namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{namespace}/{dataset_name}.pq
  file_format: parquet

"{dataset_name}@csv":
  type: pandas.CSVDataSet
  filepath: data/01_raw/{dataset_name}.csv
```

Having multiple dataset factories in your catalog can lead to a situation where a dataset name from your pipeline might
match multiple patterns. To overcome this, Kedro sorts all the potential matches for the dataset name in the pipeline and picks the best match.
The matches are ranked according to the following criteria:
1. Number of exact character matches between the dataset name and the factory pattern. For example, a dataset named `factory_data$csv` would match `{dataset}_data$csv` over `{dataset_name}$csv`.
2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`.
3. Alphabetical order

### Generalise all datasets with a catch-all dataset factory to overwrite the default `MemoryDataSet`
You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation.
```yaml
"{default_dataset}":
  type: pandas.CSVDataSet
  filepath: data/{default_dataset}.csv

```
Kedro will now treat all the datasets mentioned in your project's pipelines that do not appear as specific patterns or explicit entries in your catalog
as `pandas.CSVDataSet`.
diff --git a/docs/source/extend_kedro/custom_datasets.md b/docs/source/data/how_to_create_a_custom_dataset.md
similarity index 90%
rename from docs/source/extend_kedro/custom_datasets.md
rename to docs/source/data/how_to_create_a_custom_dataset.md
index 9e4b0713eb..c4c080ad93 100644
--- a/docs/source/extend_kedro/custom_datasets.md
+++ b/docs/source/data/how_to_create_a_custom_dataset.md
@@ -1,7 +1,12 @@
-# Custom datasets
+# Tutorial: How to create a custom dataset

 [Kedro supports many datasets](/kedro_datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data.

+## AbstractDataSet
+
+For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataSet` interface](/kedro.io.AbstractDataSet), which is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` methods, and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation.
+
+
 ## Scenario

 In this example, we use a [Kaggle dataset of Pokémon images and types](https://www.kaggle.com/vishalsubbiah/pokemon-images-and-types) to train a model to classify the type of a given [Pokémon](https://en.wikipedia.org/wiki/Pok%C3%A9mon), e.g. Water, Fire, Bug, etc., based on its appearance. To train the model, we read the Pokémon images from PNG files into `numpy` arrays before further manipulation in the Kedro pipeline. To work with PNG images out of the box, in this example we create an `ImageDataSet` to read and save image data.
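
To preview where the tutorial ends up, the condensed sketch below shows roughly what such an `ImageDataSet` can look like. It is illustrative rather than the finished implementation built step by step in the rest of the tutorial, and it assumes Pillow (`PIL`) for the underlying image reading and writing:

```python
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import numpy as np
from PIL import Image

from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class ImageDataSet(AbstractDataSet):
    """Loads and saves image files as numpy arrays on any fsspec-compatible filesystem."""

    def __init__(self, filepath: str):
        # Split e.g. "s3://bucket/key.png" into protocol ("s3") and path ("bucket/key.png")
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self) -> np.ndarray:
        load_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(load_path, mode="rb") as f:
            # Convert the PIL image into a numpy array while the file is still open
            return np.asarray(Image.open(f))

    def _save(self, data: np.ndarray) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(save_path, mode="wb") as f:
            # The format is passed explicitly because a file object has no extension to infer it from
            Image.fromarray(data).save(f, format="PNG")

    def _describe(self) -> Dict[str, Any]:
        # Used by Kedro when logging information about the dataset instance
        return dict(filepath=self._filepath, protocol=self._protocol)
```

Versioning and thread-safety refinements for this class are covered later in the tutorial.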
@@ -297,6 +302,29 @@ $ ls -la data/01_raw/pokemon-images-and-types/images/images/*.png | wc -l ## Versioning +### How to implement versioning in your dataset + + +***** TOOK THIS FROM A SEPARATE PAGE ON KEDRO-IO + +In order to enable versioning, you need to update the `catalog.yml` config file and set the `versioned` attribute to `true` for the given dataset. If this is a custom dataset, the implementation must also: + 1. extend `kedro.io.core.AbstractVersionedDataSet` AND + 2. add `version` namedtuple as an argument to its `__init__` method AND + 3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro_datasets.pandas.CSVDataSet) as an example) AND + 4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro_datasets.pandas.CSVDataSet) for an example implementation) + + +### `version` namedtuple + +Versioned dataset `__init__` method must have an optional argument called `version` with a default value of `None`. If provided, this argument must be an instance of [`kedro.io.core.Version`](/kedro.io.Version). Its `load` and `save` attributes must either be `None` or contain string values representing exact load and save versions: + +* If `version` is `None`, then the dataset is considered *not versioned*. +* If `version.load` is `None`, then the latest available version will be used to load the dataset, otherwise a string representing exact load version must be provided. +* If `version.save` is `None`, then a new save version string will be generated by calling `kedro.io.core.generate_timestamp()`, otherwise a string representing the exact save version must be provided. + + +*****THIS WAS THE ORIGINAL CONTENT + ```{note} Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time. ``` @@ -498,7 +526,6 @@ In [2]: context.catalog.save('pikachu', data=img) Inspect the content of the data directory to find a new version of the data, written by `save`. -You may also want to consult the [in-depth documentation about the Versioning API](../data/kedro_io.md#versioning). ## Thread-safety diff --git a/docs/source/data/index.md b/docs/source/data/index.md index 00c05353fc..23012f66e8 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -1,8 +1,21 @@ -# Data Catalog + +# The Kedro Data Catalog + +Kedro's Data Catalog is a registry for all the data sources that a project can use. The Data Catalog is used to manage loading and saving data and it maps the names of node inputs and outputs as keys in a `DataCatalog`, a Kedro class that can be specialised for different types of data storage. + +[Kedro provides different built-in datasets](/kedro_datasets) for numerous file types and file systems, so you don’t have to write any of the logic for reading/writing data. 
+ +This section is comprised of a set of pages that do the following: + +TO DO -- summarise ```{toctree} :maxdepth: 1 data_catalog -kedro_io +data_catalog_yaml_examples +data_catalog_basic_how_to +partitioned_and_incremental_datasets +advanced_data_catalog_usage +how_to_create_a_custom_dataset ``` diff --git a/docs/source/data/kedro_io.md b/docs/source/data/partitioned_and_incremental_datasets.md similarity index 67% rename from docs/source/data/kedro_io.md rename to docs/source/data/partitioned_and_incremental_datasets.md index 6fdfefdd66..bbadae2512 100644 --- a/docs/source/data/kedro_io.md +++ b/docs/source/data/partitioned_and_incremental_datasets.md @@ -1,243 +1,4 @@ -# Kedro IO - - -In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataSet](/kedro.io.AbstractDataSet) and [kedro.io.DataSetError](/kedro.io.DataSetError). - -## Error handling - -We have custom exceptions for the main classes of errors that you can handle to deal with failures. - -```python -from kedro.io import * -``` - -```python -io = DataCatalog(data_sets=dict()) # empty catalog - -try: - cars_df = io.load("cars") -except DataSetError: - print("Error raised.") -``` - - -## AbstractDataSet - -To understand what is going on behind the scenes, you should study the [AbstractDataSet interface](/kedro.io.AbstractDataSet). `AbstractDataSet` is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation. - -If you have a dataset called `parts`, you can make direct calls to it like so: - -```python -parts_df = parts.load() -``` - -We recommend using a `DataCatalog` instead (for more details, see [the `DataCatalog` documentation](../data/data_catalog.md)) as it has been designed to make all datasets available to project members. - -For contributors, if you would like to submit a new dataset, you must extend the `AbstractDataSet`. For a complete guide, please read [the section on custom datasets](../extend_kedro/custom_datasets.md). - - -## Versioning - -In order to enable versioning, you need to update the `catalog.yml` config file and set the `versioned` attribute to `true` for the given dataset. If this is a custom dataset, the implementation must also: - 1. extend `kedro.io.core.AbstractVersionedDataSet` AND - 2. add `version` namedtuple as an argument to its `__init__` method AND - 3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro_datasets.pandas.CSVDataSet) as an example) AND - 4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro_datasets.pandas.CSVDataSet) for an example implementation) - -```{note} -If a new version of a dataset is created mid-run, for instance by an external system adding new files, it will not interfere in the current run, i.e. the load version stays the same throughout subsequent loads. 
-``` - -An example dataset could look similar to the below: - -```python -from pathlib import Path, PurePosixPath - -import pandas as pd - -from kedro.io import AbstractVersionedDataSet - - -class MyOwnDataSet(AbstractVersionedDataSet): - def __init__(self, filepath, version, param1, param2=True): - super().__init__(PurePosixPath(filepath), version) - self._param1 = param1 - self._param2 = param2 - - def _load(self) -> pd.DataFrame: - load_path = self._get_load_path() - return pd.read_csv(load_path) - - def _save(self, df: pd.DataFrame) -> None: - save_path = self._get_save_path() - df.to_csv(save_path) - - def _exists(self) -> bool: - path = self._get_load_path() - return Path(path).exists() - - def _describe(self): - return dict(version=self._version, param1=self._param1, param2=self._param2) -``` - -With `catalog.yml` specifying: - -```yaml -my_dataset: - type: .MyOwnDataSet - filepath: data/01_raw/my_data.csv - versioned: true - param1: # param1 is a required argument - # param2 will be True by default -``` - -### `version` namedtuple - -Versioned dataset `__init__` method must have an optional argument called `version` with a default value of `None`. If provided, this argument must be an instance of [`kedro.io.core.Version`](/kedro.io.Version). Its `load` and `save` attributes must either be `None` or contain string values representing exact load and save versions: - -* If `version` is `None`, then the dataset is considered *not versioned*. -* If `version.load` is `None`, then the latest available version will be used to load the dataset, otherwise a string representing exact load version must be provided. -* If `version.save` is `None`, then a new save version string will be generated by calling `kedro.io.core.generate_timestamp()`, otherwise a string representing the exact save version must be provided. - -### Versioning using the YAML API - -The easiest way to version a specific dataset is to change the corresponding entry in the `catalog.yml` file. For example, if the following dataset was defined in the `catalog.yml` file: - -```yaml -cars: - type: pandas.CSVDataSet - filepath: data/01_raw/company/car_data.csv - versioned: true -``` - -The `DataCatalog` will create a versioned `CSVDataSet` called `cars`. The actual csv file location will look like `data/01_raw/company/car_data.csv//car_data.csv`, where `` corresponds to a global save version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`. Every time the `DataCatalog` is instantiated, it generates a new global save version, which is propagated to all versioned datasets it contains. - -The `catalog.yml` file only allows you to version your datasets, but does not allow you to choose which version to load or save. This is deliberate because we have chosen to separate the data catalog from any runtime configuration. If you need to pin a dataset version, you can either [specify the versions in a separate `yml` file and call it at runtime](../nodes_and_pipelines/run_a_pipeline.md#configure-kedro-run-arguments) or [instantiate your versioned datasets using Code API and define a version parameter explicitly](#versioning-using-the-code-api). - -By default, the `DataCatalog` will load the latest version of the dataset. However, you can also specify an exact load version. 
In order to do that, pass a dictionary with exact load versions to `DataCatalog.from_config`: - -```python -load_versions = {"cars": "2019-02-13T14.35.36.518Z"} -io = DataCatalog.from_config(catalog_config, credentials, load_versions=load_versions) -cars = io.load("cars") -``` - -The last row in the example above would attempt to load a CSV file from `data/01_raw/company/car_data.csv/2019-02-13T14.35.36.518Z/car_data.csv`: - -* `load_versions` configuration has an effect only if a dataset versioning has been enabled in the catalog config file - see the example above. - -* We recommend that you do not override `save_version` argument in `DataCatalog.from_config` unless strongly required to do so, since it may lead to inconsistencies between loaded and saved versions of the versioned datasets. - -```{warning} -The `DataCatalog` does not re-generate save versions between instantiations. Therefore, if you call `catalog.save('cars', some_data)` twice, then the second call will fail, since it tries to overwrite a versioned dataset using the same save version. To mitigate this, reload your data catalog by calling `%reload_kedro` line magic. This limitation does not apply to `load` operation. -``` - -### Versioning using the Code API - -Although we recommend enabling versioning using the `catalog.yml` config file as described in the section above, you might require more control over load and save versions of a specific dataset. To achieve this, you can instantiate `Version` and pass it as a parameter to the dataset initialisation: - -```python -from kedro.io import DataCatalog, Version -from kedro_datasets.pandas import CSVDataSet -import pandas as pd - -data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]}) -data2 = pd.DataFrame({"col1": [7], "col2": [8], "col3": [9]}) -version = Version( - load=None, # load the latest available version - save=None, # generate save version automatically on each save operation -) - -test_data_set = CSVDataSet( - filepath="data/01_raw/test.csv", save_args={"index": False}, version=version -) -io = DataCatalog({"test_data_set": test_data_set}) - -# save the dataset to data/01_raw/test.csv//test.csv -io.save("test_data_set", data1) -# save the dataset into a new file data/01_raw/test.csv//test.csv -io.save("test_data_set", data2) - -# load the latest version from data/test.csv/*/test.csv -reloaded = io.load("test_data_set") -assert data2.equals(reloaded) -``` - -```{note} -In the example above, we did not fix any versions. If we do, then the behaviour of load and save operations becomes slightly different: -``` - -```python -version = Version( - load="my_exact_version", # load exact version - save="my_exact_version", # save to exact version -) - -test_data_set = CSVDataSet( - filepath="data/01_raw/test.csv", save_args={"index": False}, version=version -) -io = DataCatalog({"test_data_set": test_data_set}) - -# save the dataset to data/01_raw/test.csv/my_exact_version/test.csv -io.save("test_data_set", data1) -# load from data/01_raw/test.csv/my_exact_version/test.csv -reloaded = io.load("test_data_set") -assert data1.equals(reloaded) - -# raises DataSetError since the path -# data/01_raw/test.csv/my_exact_version/test.csv already exists -io.save("test_data_set", data2) -``` - -```{warning} -We do not recommend passing exact load and/or save versions, since it might lead to inconsistencies between operations. 
For example, if versions for load and save operations do not match, a save operation would result in a `UserWarning` indicating that save and load versions do not match. Load after save might also return an error if the corresponding load version is not found: -``` - -```python -version = Version( - load="exact_load_version", # load exact version - save="exact_save_version", # save to exact version -) - -test_data_set = CSVDataSet( - filepath="data/01_raw/test.csv", save_args={"index": False}, version=version -) -io = DataCatalog({"test_data_set": test_data_set}) - -io.save("test_data_set", data1) # emits a UserWarning due to version inconsistency - -# raises DataSetError since the data/01_raw/test.csv/exact_load_version/test.csv -# file does not exist -reloaded = io.load("test_data_set") -``` - -### Supported datasets - -Currently, the following datasets support versioning: - -- `kedro_datasets.matplotlib.MatplotlibWriter` -- `kedro_datasets.holoviews.HoloviewsWriter` -- `kedro_datasets.networkx.NetworkXDataSet` -- `kedro_datasets.pandas.CSVDataSet` -- `kedro_datasets.pandas.ExcelDataSet` -- `kedro_datasets.pandas.FeatherDataSet` -- `kedro_datasets.pandas.HDFDataSet` -- `kedro_datasets.pandas.JSONDataSet` -- `kedro_datasets.pandas.ParquetDataSet` -- `kedro_datasets.pickle.PickleDataSet` -- `kedro_datasets.pillow.ImageDataSet` -- `kedro_datasets.text.TextDataSet` -- `kedro_datasets.spark.SparkDataSet` -- `kedro_datasets.yaml.YAMLDataSet` -- `kedro_datasets.api.APIDataSet` -- `kedro_datasets.tensorflow.TensorFlowModelDataSet` -- `kedro_datasets.json.JSONDataSet` - -```{note} -Although HTTP(S) is a supported file system in the dataset implementations, it does not support versioning. -``` - -## Partitioned dataset +## Advanced: Partitioned and incremental datasets These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible. 
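
As a preview of the approach described on this page, a folder of uniform files can be registered as a single `PartitionedDataSet` catalog entry, and the consuming node then receives one entry per file. The dataset name, folder path and node function below are illustrative:

```yaml
weather_measurements:
  type: PartitionedDataSet
  path: data/01_raw/weather/     # folder containing many uniform CSV files
  dataset: pandas.CSVDataSet     # dataset type used to load each individual partition
  filename_suffix: ".csv"
```

A node can then combine the partitions, for example:

```python
from typing import Any, Callable, Dict

import pandas as pd


def concatenate_partitions(partitions: Dict[str, Callable[[], Any]]) -> pd.DataFrame:
    # The node receives a dictionary keyed by partition id; each value is a
    # load function, so a file is only read when that function is called.
    loaded = [load_partition() for _, load_partition in sorted(partitions.items())]
    return pd.concat(loaded, ignore_index=True)
```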
From 6dcffc99edf4e0cbb8322323fd95cbe927b1f56b Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Wed, 2 Aug 2023 16:14:33 +0100 Subject: [PATCH 02/19] linter Signed-off-by: Jo Stichbury --- docs/source/data/advanced_data_catalog_usage.md | 2 -- docs/source/data/data_catalog_basic_how_to.md | 4 +--- 2 files changed, 1 insertion(+), 5 deletions(-) diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md index b252589544..b8c390f1dd 100644 --- a/docs/source/data/advanced_data_catalog_usage.md +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -236,5 +236,3 @@ io.save("test_data_set", data1) # emits a UserWarning due to version inconsiste # file does not exist reloaded = io.load("test_data_set") ``` - - diff --git a/docs/source/data/data_catalog_basic_how_to.md b/docs/source/data/data_catalog_basic_how_to.md index 4d5c8c4604..c527cf1517 100644 --- a/docs/source/data/data_catalog_basic_how_to.md +++ b/docs/source/data/data_catalog_basic_how_to.md @@ -85,7 +85,7 @@ my_gcp_credentials: id_token: key ``` -In the example above, the `catalog.yml` file contains references to credentials keys `dev_s3` and `scooters_credentials`. This means that when it instantiates the `motorbikes` dataset, for example, the `DataCatalog` will attempt to read top-level key `dev_s3` from the received `credentials` dictionary, and then will pass its values into the dataset `__init__` as a `credentials` argument. +In the example above, the `catalog.yml` file contains references to credentials keys `dev_s3` and `scooters_credentials`. This means that when it instantiates the `motorbikes` dataset, for example, the `DataCatalog` will attempt to read top-level key `dev_s3` from the received `credentials` dictionary, and then will pass its values into the dataset `__init__` as a `credentials` argument. ## How to read the same file using two different dataset implementations @@ -126,5 +126,3 @@ In this example, Kedro understands that `my_dataframe` is the same dataset in it In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` for loading, so the first node should output a `pyspark.sql.DataFrame`, while the second node would receive a `pandas.Dataframe`. 
- - From 91cfb5cd00abf9a31a516876868f6dd5ceb3f97e Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Wed, 2 Aug 2023 16:24:15 +0100 Subject: [PATCH 03/19] Added to-do notes Signed-off-by: Jo Stichbury --- docs/source/data/advanced_data_catalog_usage.md | 2 ++ docs/source/data/data_catalog.md | 2 ++ docs/source/data/data_catalog_basic_how_to.md | 4 +--- docs/source/data/data_catalog_yaml_examples.md | 4 ++++ docs/source/data/how_to_create_a_custom_dataset.md | 2 ++ docs/source/data/partitioned_and_incremental_datasets.md | 2 ++ 6 files changed, 13 insertions(+), 3 deletions(-) diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md index b8c390f1dd..16aa371c0d 100644 --- a/docs/source/data/advanced_data_catalog_usage.md +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -1,5 +1,7 @@ # Advanced: Access the Data Catalog in code +TO REMOVE -- Diataxis: How to + The code API allows you to: * configure data sources in code diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index fe3daa7b71..6aea45754a 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -1,6 +1,8 @@ # Introduction to the Kedro Data Catalog +TO REMOVE -- Diataxis: Explanation + This section introduces `catalog.yml`, the project-shareable Data Catalog. The file is located in `conf/base` and is a registry of all data sources available for use by a project; it manages loading and saving of data. All supported data connectors are available in [`kedro-datasets`](/kedro_datasets). diff --git a/docs/source/data/data_catalog_basic_how_to.md b/docs/source/data/data_catalog_basic_how_to.md index c527cf1517..51b2a5e44a 100644 --- a/docs/source/data/data_catalog_basic_how_to.md +++ b/docs/source/data/data_catalog_basic_how_to.md @@ -1,7 +1,5 @@ # Data Catalog how to guide - - -TO DO: Revise any explanations where possible to make it more hands on/task based +TO REMOVE -- Diataxis: How to but there is some explanation and it would be ideal to make this more task-based where possible. ## How to version datasets and ML models diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md index e90e5e0d68..09fc6b787d 100644 --- a/docs/source/data/data_catalog_yaml_examples.md +++ b/docs/source/data/data_catalog_yaml_examples.md @@ -1,5 +1,9 @@ # Data Catalog YAML examples +TO REMOVE -- Diataxis: How to guide (code example) + +TO DO: Add a set of anchor links + You can configure your datasets in a YAML configuration file, `conf/base/catalog.yml` or `conf/local/catalog.yml`. Here are some examples of data configuration in a `catalog.yml`: diff --git a/docs/source/data/how_to_create_a_custom_dataset.md b/docs/source/data/how_to_create_a_custom_dataset.md index c4c080ad93..d449925c8f 100644 --- a/docs/source/data/how_to_create_a_custom_dataset.md +++ b/docs/source/data/how_to_create_a_custom_dataset.md @@ -1,5 +1,7 @@ # Tutorial: How to create a custom dataset +TO REMOVE -- Diataxis: Tutorial + [Kedro supports many datasets](/kedro_datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data. 
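
Once the custom dataset class exists, registering it in `catalog.yml` looks like any other entry. The sketch below is illustrative only — the dataset name, module path and file location depend on where you place the class in your own project:

```yaml
pikachu:
  type: kedro_pokemon.extras.datasets.image_dataset.ImageDataSet
  filepath: data/01_raw/pokemon-images-and-types/images/images/pikachu.png
```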
## AbstractDataSet diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md index bbadae2512..6997f9db79 100644 --- a/docs/source/data/partitioned_and_incremental_datasets.md +++ b/docs/source/data/partitioned_and_incremental_datasets.md @@ -1,5 +1,7 @@ ## Advanced: Partitioned and incremental datasets +TO REMOVE -- Diataxis: Explanation + These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible. This is why Kedro provides a built-in [PartitionedDataSet](/kedro.io.PartitionedDataSet), with the following features: From 66b8384c9de11f59b718987a53dacfeb5b2bfb8c Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Thu, 10 Aug 2023 17:45:13 +0100 Subject: [PATCH 04/19] Afternoon's work in rewriting/reorganising content Signed-off-by: Jo Stichbury --- .../data/advanced_data_catalog_usage.md | 46 +++- docs/source/data/data_catalog.md | 179 +++++++++++--- docs/source/data/data_catalog_basic_how_to.md | 126 ---------- .../source/data/data_catalog_yaml_examples.md | 227 +----------------- .../data/how_to_create_a_custom_dataset.md | 2 +- docs/source/data/index.md | 31 ++- docs/source/data/kedro_dataset_factories.md | 222 +++++++++++++++++ .../partitioned_and_incremental_datasets.md | 31 ++- 8 files changed, 470 insertions(+), 394 deletions(-) delete mode 100644 docs/source/data/data_catalog_basic_how_to.md create mode 100644 docs/source/data/kedro_dataset_factories.md diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md index 16aa371c0d..bb148e7792 100644 --- a/docs/source/data/advanced_data_catalog_usage.md +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -1,7 +1,51 @@ -# Advanced: Access the Data Catalog in code +# Advanced Data Catalog usage + +## How to read the same file using two different dataset implementations + +When you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations, you'll need to use transcoding. + +### A typical example of transcoding + +For instance, parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. + +To enable transcoding, define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) 
in your `conf/base/catalog.yml`: + +```yaml +my_dataframe@spark: + type: spark.SparkDataSet + filepath: data/02_intermediate/data.parquet + file_format: parquet + +my_dataframe@pandas: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/data.parquet +``` + +These entries are used in the pipeline like this: + +```python +pipeline( + [ + node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"), + node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"), + ] +) +``` + +### How does transcoding work? + +In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and helps resolve the node execution order. + +In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` +for loading, so the first node should output a `pyspark.sql.DataFrame`, while the second node would receive a `pandas.Dataframe`. + + +## Access the Data Catalog in code TO REMOVE -- Diataxis: How to +You can define a Data Catalog in two ways - through YAML configuration, or programmatically using an API. + The code API allows you to: * configure data sources in code diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index 16e3aeed05..f6b545374d 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -1,34 +1,30 @@ +# Introduction to the Data Catalog -# Introduction to the Kedro Data Catalog +## The basics of `catalog.yml` +A separate page of [Data Catalog YAML examples](./data_catalog_yaml_examples.md) gives further examples of how to work with `catalog.yml`, but here we revisit the [basic `catalog.yml` introduced by the spaceflights tutorial](../tutorial/set_up_data.md). -TO REMOVE -- Diataxis: Explanation +The example below registers two `csv` datasets, and an `xlsx` dataset. The minimum details needed to load and save a file within a local file system are the key, which is name of the dataset, the type of data to indicate the dataset to use (`type`) and the file's location (`filepath`). -This section introduces `catalog.yml`, the project-shareable Data Catalog. The file is located in `conf/base` and is a registry of all data sources available for use by a project; it manages loading and saving of data. +```yaml +companies: + type: pandas.CSVDataSet + filepath: data/01_raw/companies.csv -All supported data connectors are available in [`kedro-datasets`](/kedro_datasets). +reviews: + type: pandas.CSVDataSet + filepath: data/01_raw/reviews.csv -## Use the Data Catalog within Kedro configuration - -Kedro uses configuration to make your code reproducible when it has to reference datasets in different locations and/or in different environments. - -You can copy this file and reference additional locations for the same datasets. For instance, you can use the `catalog.yml` file in `conf/base/` to register the locations of datasets that would run in production, while copying and updating a second version of `catalog.yml` in `conf/local/` to register the locations of sample datasets that you are using for prototyping your data pipeline(s). - -Built-in functionality for `conf/local/` to overwrite `conf/base/` is [described in the documentation about configuration](../configuration/configuration_basics.md). This means that a dataset called `cars` could exist in the `catalog.yml` files in `conf/base/` and `conf/local/`. 
In code, in `src`, you would only call a dataset named `cars` and Kedro would detect which definition of `cars` dataset to use to run your pipeline - `cars` definition from `conf/local/catalog.yml` would take precedence in this case. - -The Data Catalog also works with the `credentials.yml` file in `conf/local/`, allowing you to specify usernames and passwords required to load certain datasets. - -You can define a Data Catalog in two ways - through YAML configuration, or programmatically using an API. Both methods allow you to specify: - - - Dataset name - - Dataset type - - Location of the dataset using `fsspec`, detailed in the next section - - Credentials needed to access the dataset - - Load and saving arguments - - Whether you want a [dataset or ML model to be versioned](kedro_io.md#versioning) when you run your data pipeline +shuttles: + type: pandas.ExcelDataSet + filepath: data/01_raw/shuttles.xlsx + load_args: + engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0) +``` +### Dataset `type` -## Specify the location of the dataset +### Dataset `filepath` -Kedro relies on [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. When specifying a storage location in `filepath:`, you should provide a URL using the general form `protocol://path/to/data`. If no protocol is provided, the local file system is assumed (same as ``file://``). +Kedro relies on [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. When specifying a storage location in `filepath:`, you should provide a URL using the general form `protocol://path/to/data`. If no protocol is provided, the local file system is assumed (which is the same as ``file://``). The following prepends are available: @@ -44,12 +40,133 @@ The following prepends are available: `fsspec` also provides other file systems, such as SSH, FTP and WebHDFS. [See the fsspec documentation for more information](https://filesystem-spec.readthedocs.io/en/latest/api.html#implementations). -```{toctree} -:maxdepth: 1 -data_catalog_yaml_examples -data_catalog_basic_how_to -partitioned_and_incremental_datasets -advanced_data_catalog_usage -how_to_create_a_custom_dataset +## Additional settings in `catalog.yml` + +This section explains the additional settings available within `catalog.yml`. + +### Load and save arguments +The Kedro Data Catalog also accepts two different groups of `*_args` parameters that serve different purposes: + +* **`load_args` and `save_args`**: Configures how a third-party library loads/saves data from/to a file. In the spaceflights example above, `load_args`, is passed to the excel file read method (`pd.read_excel`) as a [keyword argument](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html). Although not specified here, the equivalent output is `save_args` and the value would be passed to [`pd.DataFrame.to_excel` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html). + +For example, to load or save a CSV on a local file system, using specified load/save arguments: + +```yaml +cars: + type: pandas.CSVDataSet + filepath: data/01_raw/company/cars.csv + load_args: + sep: ',' + save_args: + index: False + date_format: '%Y-%m-%d %H:%M' + decimal: . 
+``` + +* **`fs_args`**: Configures the interaction with a filesystem. +All the top-level parameters of `fs_args` (except `open_args_load` and `open_args_save`) will be passed to an underlying filesystem class. + +For example, to provide the `project` value to the underlying filesystem class (`GCSFileSystem`) to interact with Google Cloud Storage: + +```yaml +test_dataset: + type: ... + fs_args: + project: test_project +``` + +The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively. + +For example, to load data from a local binary file using `utf-8` encoding: + +```yaml +test_dataset: + type: ... + fs_args: + open_args_load: + mode: "rb" + encoding: "utf-8" +``` + +### Dataset access credentials +The Data Catalog also works with the `credentials.yml` file in `conf/local/`, allowing you to specify usernames and passwords required to load certain datasets. + +Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument. + +Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents: + +```yaml +dev_s3: + client_kwargs: + aws_access_key_id: key + aws_secret_access_key: secret +``` + +and the Data Catalog is specified in `catalog.yml` as follows: + +```yaml +motorbikes: + type: pandas.CSVDataSet + filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv + credentials: dev_s3 + load_args: + sep: ',' +``` +In the example above, the `catalog.yml` file contains references to credentials keys `dev_s3`. The Data Catalog first reads `dev_s3` from the received `credentials` dictionary, and then passes its values into the dataset as a `credentials` argument to `__init__`. + + +### Dataset versioning + +Kedro enables dataset and ML model versioning through the `versioned` definition. For example: + +```yaml +cars: + type: pandas.CSVDataSet + filepath: data/01_raw/company/cars.csv + versioned: True ``` + +In this example, `filepath` is used as the basis of a folder that stores versions of the `cars` dataset. Each time a new version is created by a pipeline run it is stored within `data/01_raw/company/cars.csv//cars.csv`, where `` corresponds to a version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`. + +By default, `kedro run` loads the latest version of the dataset. However, you can also specify a particular versioned data set with `--load-version` flag as follows: + +```bash +kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ +``` +where `--load-version` is dataset name and version timestamp separated by `:`. 
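
On disk, a versioned dataset is a folder that accumulates one timestamped subfolder per save. For the `cars` example above, the layout looks roughly like this (the timestamps are illustrative):

```text
data/01_raw/company/cars.csv/
├── 2023-08-02T15.00.00.000Z/
│   └── cars.csv
└── 2023-08-10T09.30.00.000Z/
    └── cars.csv
```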
+ +Currently, the following datasets support versioning: + +- `kedro_datasets.matplotlib.MatplotlibWriter` +- `kedro_datasets.holoviews.HoloviewsWriter` +- `kedro_datasets.networkx.NetworkXDataSet` +- `kedro_datasets.pandas.CSVDataSet` +- `kedro_datasets.pandas.ExcelDataSet` +- `kedro_datasets.pandas.FeatherDataSet` +- `kedro_datasets.pandas.HDFDataSet` +- `kedro_datasets.pandas.JSONDataSet` +- `kedro_datasets.pandas.ParquetDataSet` +- `kedro_datasets.pickle.PickleDataSet` +- `kedro_datasets.pillow.ImageDataSet` +- `kedro_datasets.text.TextDataSet` +- `kedro_datasets.spark.SparkDataSet` +- `kedro_datasets.yaml.YAMLDataSet` +- `kedro_datasets.api.APIDataSet` +- `kedro_datasets.tensorflow.TensorFlowModelDataSet` +- `kedro_datasets.json.JSONDataSet` + +```{note} +Although HTTP(S) is a supported file system in the dataset implementations, it does not support versioning. +``` + + +## Use the Data Catalog within Kedro configuration + +Kedro configuration enables you to organise your project for different stages of your data pipeline. For example, you might need different Data Catalog settings for development, testing, and production environments. + +By default, Kedro has a `base` and a `local` folder for configuration. The Data Catalog configuration is loaded using a configuration loader class which recursively scans for configuration files inside the `conf` folder, firstly in `conf/base` and then in `conf/local` (which is the designated overriding environment). Kedro merges the configuration information and returns a configuration dictionary according to rules set out in the [configuration documentation](../configuration/configuration_basics.md). + +In summary, if you need to configure your datasets for different environments, you can create both `conf/base/catalog.yml` and `conf/local/catalog.yml`. For instance, you can use the `catalog.yml` file in `conf/base/` to register the locations of datasets that would run in production, while adding a second version of `catalog.yml` in `conf/local/` to register the locations of sample datasets while you are using them for prototyping data pipeline(s). + +To illustrate this, if you include a dataset called `cars` in `catalog.yml` stored in both `conf/base` and `conf/local`, your pipeline code would use the `cars` dataset and rely on Kedro to detect which definition of `cars` dataset to use in your pipeline. diff --git a/docs/source/data/data_catalog_basic_how_to.md b/docs/source/data/data_catalog_basic_how_to.md deleted file mode 100644 index 51b2a5e44a..0000000000 --- a/docs/source/data/data_catalog_basic_how_to.md +++ /dev/null @@ -1,126 +0,0 @@ -# Data Catalog how to guide -TO REMOVE -- Diataxis: How to but there is some explanation and it would be ideal to make this more task-based where possible. - -## How to version datasets and ML models - -Making a simple addition to your Data Catalog allows you to perform versioning of datasets and machine learning models. - -Consider the following versioned dataset defined in the `catalog.yml`: - -```yaml -cars: - type: pandas.CSVDataSet - filepath: data/01_raw/company/cars.csv - versioned: True -``` - -The `DataCatalog` will create a folder to store a version of the `CSVDataSet` called `cars`. The actual csv file location will look like `data/01_raw/company/cars.csv//cars.csv`, where `` corresponds to a global save version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`. - -By default, the `DataCatalog` will load the latest version of the dataset. 
However, you can also specify a particular versioned data set with `--load-version` flag as follows: - -```bash -kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ -``` -where `--load-version` is dataset name and version timestamp separated by `:`. - -### Supported datasets - -Currently, the following datasets support versioning: - -- `kedro_datasets.matplotlib.MatplotlibWriter` -- `kedro_datasets.holoviews.HoloviewsWriter` -- `kedro_datasets.networkx.NetworkXDataSet` -- `kedro_datasets.pandas.CSVDataSet` -- `kedro_datasets.pandas.ExcelDataSet` -- `kedro_datasets.pandas.FeatherDataSet` -- `kedro_datasets.pandas.HDFDataSet` -- `kedro_datasets.pandas.JSONDataSet` -- `kedro_datasets.pandas.ParquetDataSet` -- `kedro_datasets.pickle.PickleDataSet` -- `kedro_datasets.pillow.ImageDataSet` -- `kedro_datasets.text.TextDataSet` -- `kedro_datasets.spark.SparkDataSet` -- `kedro_datasets.yaml.YAMLDataSet` -- `kedro_datasets.api.APIDataSet` -- `kedro_datasets.tensorflow.TensorFlowModelDataSet` -- `kedro_datasets.json.JSONDataSet` - -```{note} -Although HTTP(S) is a supported file system in the dataset implementations, it does not support versioning. -``` - -## How to create a Data Catalog YAML configuration file via the CLI - -You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file). - -This creates a `//catalog/.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. - -```yaml -# //catalog/.yml -rockets: - type: MemoryDataSet -scooters: - type: MemoryDataSet -``` - - -## How to access a dataset that needs credentials - -Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument. - -Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents: - -```yaml -dev_s3: - client_kwargs: - aws_access_key_id: key - aws_secret_access_key: secret - -scooters_credentials: - con: sqlite:///kedro.db - -my_gcp_credentials: - id_token: key -``` - -In the example above, the `catalog.yml` file contains references to credentials keys `dev_s3` and `scooters_credentials`. This means that when it instantiates the `motorbikes` dataset, for example, the `DataCatalog` will attempt to read top-level key `dev_s3` from the received `credentials` dictionary, and then will pass its values into the dataset `__init__` as a `credentials` argument. - - -## How to read the same file using two different dataset implementations - -When you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations, you'll need to use transcoding. - -### A typical example of transcoding - -For instance, parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. - -To enable transcoding, define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) 
in your `conf/base/catalog.yml`: - -```yaml -my_dataframe@spark: - type: spark.SparkDataSet - filepath: data/02_intermediate/data.parquet - file_format: parquet - -my_dataframe@pandas: - type: pandas.ParquetDataSet - filepath: data/02_intermediate/data.parquet -``` - -These entries are used in the pipeline like this: - -```python -pipeline( - [ - node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"), - node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"), - ] -) -``` - -### How does transcoding work? - -In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and helps resolve the node execution order. - -In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` -for loading, so the first node should output a `pyspark.sql.DataFrame`, while the second node would receive a `pandas.Dataframe`. diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md index 09fc6b787d..02530e3f62 100644 --- a/docs/source/data/data_catalog_yaml_examples.md +++ b/docs/source/data/data_catalog_yaml_examples.md @@ -6,17 +6,8 @@ TO DO: Add a set of anchor links You can configure your datasets in a YAML configuration file, `conf/base/catalog.yml` or `conf/local/catalog.yml`. -Here are some examples of data configuration in a `catalog.yml`: +## Provide the `project` value to the underlying filesystem class (`GCSFileSystem`) to interact with Google Cloud Storage (GCS) -Data Catalog accepts two different groups of `*_args` parameters that serve different purposes: -- `fs_args` -- `load_args` and `save_args` - -The `fs_args` is used to configure the interaction with a filesystem. -All the top-level parameters of `fs_args` (except `open_args_load` and `open_args_save`) will be passed in an underlying filesystem class. - -**Provide the `project` value to the underlying filesystem class (`GCSFileSystem`) to interact with Google Cloud Storage (GCS) -** ```yaml test_dataset: type: ... @@ -26,8 +17,8 @@ test_dataset: The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively. -**Load data from a local binary file using `utf-8` encoding -** +## Load data from a local binary file using `utf-8` encoding + ```yaml test_dataset: type: ... @@ -39,8 +30,8 @@ test_dataset: `load_args` and `save_args` configure how a third-party library (e.g. `pandas` for `CSVDataSet`) loads/saves data from/to a file. -**Save data to a CSV file without row names (index) using `utf-8` encoding -** +## Save data to a CSV file without row names (index) using `utf-8` encoding + ```yaml test_dataset: type: pandas.CSVDataSet @@ -371,208 +362,16 @@ airplanes: In this example, the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally, that would replace the whole dictionary. In order to extend `load_args`, the defaults for that block are then re-inserted. -## Load multiple datasets with similar configuration using dataset factories -For catalog entries that share configuration details, you can also use the dataset factories introduced in Kedro 0.18.12. 
This syntax allows you to generalise the configuration and -reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns. +## Create a Data Catalog YAML configuration file via the CLI -### Generalise datasets with similar names and types into one dataset factory -Consider the following catalog entries: -```yaml -factory_data: - type: pandas.CSVDataSet - filepath: data/01_raw/factory_data.csv +You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file). +This creates a `//catalog/.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. -process_data: - type: pandas.CSVDataSet - filepath: data/01_raw/process_data.csv -``` -The datasets in this catalog can be generalised to the following dataset factory: ```yaml -"{name}_data": - type: pandas.CSVDataSet - filepath: data/01_raw/{name}_data.csv -``` -When `factory_data` or `process_data` is used in your pipeline, it is matched to the factory pattern `{name}_data`. The factory pattern must always be enclosed in -quotes to avoid YAML parsing errors. - - -### Generalise datasets of the same type into one dataset factory -You can also combine all the datasets with the same type and configuration details. For example, consider the following -catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`: -```yaml -boats: - type: pandas.CSVDataSet - filepath: data/01_raw/shuttles.csv - -cars: - type: pandas.CSVDataSet - filepath: data/01_raw/reviews.csv - -planes: - type: pandas.CSVDataSet - filepath: data/01_raw/companies.csv -``` -These datasets can be combined into the following dataset factory: -```yaml -"{dataset_name}#csv": - type: pandas.CSVDataSet - filepath: data/01_raw/{dataset_name}.csv -``` -You will then have to update the pipelines in your project located at `src///pipeline.py` to refer to these datasets as `boats#csv`, -`cars#csv` and `planes#csv`. Adding a suffix or a prefix to the dataset names and the dataset factory patterns, like `#csv` here, ensures that the dataset -names are matched with the intended pattern. -```python -from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles - - -def create_pipeline(**kwargs) -> Pipeline: - return pipeline( - [ - node( - func=preprocess_boats, - inputs="boats#csv", - outputs="preprocessed_boats", - name="preprocess_boats_node", - ), - node( - func=preprocess_cars, - inputs="cars#csv", - outputs="preprocessed_cars", - name="preprocess_cars_node", - ), - node( - func=preprocess_planes, - inputs="planes#csv", - outputs="preprocessed_planes", - name="preprocess_planes_node", - ), - node( - func=create_model_input_table, - inputs=[ - "preprocessed_boats", - "preprocessed_planes", - "preprocessed_cars", - ], - outputs="model_input_table", - name="create_model_input_table_node", - ), - ] - ) -``` -### Generalise datasets using namespaces into one dataset factory -You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. 
Consider the -following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the -`active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces: -```python -from kedro.pipeline import Pipeline, node -from kedro.pipeline.modular_pipeline import pipeline - -from .nodes import evaluate_model, split_data, train_model - - -def create_pipeline(**kwargs) -> Pipeline: - pipeline_instance = pipeline( - [ - node( - func=split_data, - inputs=["model_input_table", "params:model_options"], - outputs=["X_train", "y_train"], - name="split_data_node", - ), - node( - func=train_model, - inputs=["X_train", "y_train"], - outputs="regressor", - name="train_model_node", - ), - ] - ) - ds_pipeline_1 = pipeline( - pipe=pipeline_instance, - inputs="model_input_table", - namespace="active_modelling_pipeline", - ) - ds_pipeline_2 = pipeline( - pipe=pipeline_instance, - inputs="model_input_table", - namespace="candidate_modelling_pipeline", - ) - - return ds_pipeline_1 + ds_pipeline_2 -``` -You can now have one dataset factory pattern in your catalog instead of two separate entries for `active_modelling_pipeline.regressor` -and `candidate_modelling_pipeline.regressor` as below: -```yaml -{namespace}.regressor: - type: pickle.PickleDataSet - filepath: data/06_models/regressor_{namespace}.pkl - versioned: true -``` -### Generalise datasets of the same type in different layers into one dataset factory with multiple placeholders - -You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset -entries share `type`, `file_format` and `save_args`: -```yaml -processing.factory_data: - type: spark.SparkDataSet - filepath: data/processing/factory_data.pq - file_format: parquet - save_args: - mode: overwrite - -processing.process_data: - type: spark.SparkDataSet - filepath: data/processing/process_data.pq - file_format: parquet - save_args: - mode: overwrite - -modelling.metrics: - type: spark.SparkDataSet - filepath: data/modelling/factory_data.pq - file_format: parquet - save_args: - mode: overwrite -``` -This could be generalised to the following pattern: -```yaml -"{layer}.{dataset_name}": - type: spark.SparkDataSet - filepath: data/{layer}/{dataset_name}.pq - file_format: parquet - save_args: - mode: overwrite -``` -All the placeholders used in the catalog entry body must exist in the factory pattern name. - -### Generalise datasets using multiple dataset factories -You can have multiple dataset factories in your catalog. For example: -```yaml -"{namespace}.{dataset_name}@spark": - type: spark.SparkDataSet - filepath: data/{namespace}/{dataset_name}.pq - file_format: parquet - -"{dataset_name}@csv": - type: pandas.CSVDataSet - filepath: data/01_raw/{dataset_name}.csv -``` - -Having multiple dataset factories in your catalog can lead to a situation where a dataset name from your pipeline might -match multiple patterns. To overcome this, Kedro sorts all the potential matches for the dataset name in the pipeline and picks the best match. -The matches are ranked according to the following criteria : -1. Number of exact character matches between the dataset name and the factory pattern. For example, a dataset named `factory_data$csv` would match `{dataset}_data$csv` over `{dataset_name}$csv`. -2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`. -3. 
Alphabetical order - -### Generalise all datasets with a catch-all dataset factory to overwrite the default `MemoryDataSet` -You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation. -```yaml -"{default_dataset}": - type: pandas.CSVDataSet - filepath: data/{default_dataset}.csv - +# //catalog/.yml +rockets: + type: MemoryDataSet +scooters: + type: MemoryDataSet ``` -Kedro will now treat all the datasets mentioned in your project's pipelines that do not appear as specific patterns or explicit entries in your catalog -as `pandas.CSVDataSet`. diff --git a/docs/source/data/how_to_create_a_custom_dataset.md b/docs/source/data/how_to_create_a_custom_dataset.md index d449925c8f..4e259408a9 100644 --- a/docs/source/data/how_to_create_a_custom_dataset.md +++ b/docs/source/data/how_to_create_a_custom_dataset.md @@ -1,4 +1,4 @@ -# Tutorial: How to create a custom dataset +# Advanced: Tutorial to create a custom dataset TO REMOVE -- Diataxis: Tutorial diff --git a/docs/source/data/index.md b/docs/source/data/index.md index 23012f66e8..b40df5b282 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -1,21 +1,42 @@ # The Kedro Data Catalog -Kedro's Data Catalog is a registry for all the data sources that a project can use. The Data Catalog is used to manage loading and saving data and it maps the names of node inputs and outputs as keys in a `DataCatalog`, a Kedro class that can be specialised for different types of data storage. +In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class. -[Kedro provides different built-in datasets](/kedro_datasets) for numerous file types and file systems, so you don’t have to write any of the logic for reading/writing data. +[Kedro provides different built-in datasets in the `kedro-datasets` package](/kedro_datasets) for numerous file types and file systems, so you don’t have to write any of the logic for reading/writing data. -This section is comprised of a set of pages that do the following: -TO DO -- summarise + +We first introduce the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project. ```{toctree} :maxdepth: 1 data_catalog +``` + +The following page offers a range of examples of YAML specification for various Data Catalog use cases: + +```{toctree} +:maxdepth: 1 + data_catalog_yaml_examples -data_catalog_basic_how_to +``` + +Further pages describe more advanced usage: + +```{toctree} +:maxdepth: 1 + +kedro_dataset_factories partitioned_and_incremental_datasets advanced_data_catalog_usage +``` + +The section concludes with an advanced use case tutorial to create your own custom dataset: + +```{toctree} +:maxdepth: 1 + how_to_create_a_custom_dataset ``` diff --git a/docs/source/data/kedro_dataset_factories.md b/docs/source/data/kedro_dataset_factories.md new file mode 100644 index 0000000000..48782950ec --- /dev/null +++ b/docs/source/data/kedro_dataset_factories.md @@ -0,0 +1,222 @@ +# Kedro dataset factories +You can load multiple datasets with similar configuration using dataset factories, introduced in Kedro 0.18.12. + +The syntax allows you to generalise the configuration and reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns. 
+ +## Generalise datasets with similar names and types into one dataset factory +Consider the following catalog entries: + +```yaml +factory_data: + type: pandas.CSVDataSet + filepath: data/01_raw/factory_data.csv + + +process_data: + type: pandas.CSVDataSet + filepath: data/01_raw/process_data.csv +``` + +The datasets in this catalog can be generalised to the following dataset factory: + +```yaml +"{name}_data": + type: pandas.CSVDataSet + filepath: data/01_raw/{name}_data.csv +``` + +When `factory_data` or `process_data` is used in your pipeline, it is matched to the factory pattern `{name}_data`. The factory pattern must always be enclosed in +quotes to avoid YAML parsing errors. + + +## Generalise datasets of the same type into one dataset factory +You can also combine all the datasets with the same type and configuration details. For example, consider the following +catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`: + +```yaml +boats: + type: pandas.CSVDataSet + filepath: data/01_raw/shuttles.csv + +cars: + type: pandas.CSVDataSet + filepath: data/01_raw/reviews.csv + +planes: + type: pandas.CSVDataSet + filepath: data/01_raw/companies.csv +``` + +These datasets can be combined into the following dataset factory: + +```yaml +"{dataset_name}#csv": + type: pandas.CSVDataSet + filepath: data/01_raw/{dataset_name}.csv +``` + +You will then have to update the pipelines in your project located at `src///pipeline.py` to refer to these datasets as `boats#csv`, +`cars#csv` and `planes#csv`. Adding a suffix or a prefix to the dataset names and the dataset factory patterns, like `#csv` here, ensures that the dataset +names are matched with the intended pattern. + +```python +from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles + +def create_pipeline(**kwargs) -> Pipeline: + return pipeline( + [ + node( + func=preprocess_boats, + inputs="boats#csv", + outputs="preprocessed_boats", + name="preprocess_boats_node", + ), + node( + func=preprocess_cars, + inputs="cars#csv", + outputs="preprocessed_cars", + name="preprocess_cars_node", + ), + node( + func=preprocess_planes, + inputs="planes#csv", + outputs="preprocessed_planes", + name="preprocess_planes_node", + ), + node( + func=create_model_input_table, + inputs=[ + "preprocessed_boats", + "preprocessed_planes", + "preprocessed_cars", + ], + outputs="model_input_table", + name="create_model_input_table_node", + ), + ] + ) +``` +## Generalise datasets using namespaces into one dataset factory +You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. 
Consider the +following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the +`active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces: + +```python +from kedro.pipeline import Pipeline, node +from kedro.pipeline.modular_pipeline import pipeline + +from .nodes import evaluate_model, split_data, train_model + + +def create_pipeline(**kwargs) -> Pipeline: + pipeline_instance = pipeline( + [ + node( + func=split_data, + inputs=["model_input_table", "params:model_options"], + outputs=["X_train", "y_train"], + name="split_data_node", + ), + node( + func=train_model, + inputs=["X_train", "y_train"], + outputs="regressor", + name="train_model_node", + ), + ] + ) + ds_pipeline_1 = pipeline( + pipe=pipeline_instance, + inputs="model_input_table", + namespace="active_modelling_pipeline", + ) + ds_pipeline_2 = pipeline( + pipe=pipeline_instance, + inputs="model_input_table", + namespace="candidate_modelling_pipeline", + ) + + return ds_pipeline_1 + ds_pipeline_2 +``` +You can now have one dataset factory pattern in your catalog instead of two separate entries for `active_modelling_pipeline.regressor` +and `candidate_modelling_pipeline.regressor` as below: + +```yaml +{namespace}.regressor: + type: pickle.PickleDataSet + filepath: data/06_models/regressor_{namespace}.pkl + versioned: true +``` +## Generalise datasets of the same type in different layers into one dataset factory with multiple placeholders + +You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset +entries share `type`, `file_format` and `save_args`: + +```yaml +processing.factory_data: + type: spark.SparkDataSet + filepath: data/processing/factory_data.pq + file_format: parquet + save_args: + mode: overwrite + +processing.process_data: + type: spark.SparkDataSet + filepath: data/processing/process_data.pq + file_format: parquet + save_args: + mode: overwrite + +modelling.metrics: + type: spark.SparkDataSet + filepath: data/modelling/factory_data.pq + file_format: parquet + save_args: + mode: overwrite +``` + +This could be generalised to the following pattern: + +```yaml +"{layer}.{dataset_name}": + type: spark.SparkDataSet + filepath: data/{layer}/{dataset_name}.pq + file_format: parquet + save_args: + mode: overwrite +``` +All the placeholders used in the catalog entry body must exist in the factory pattern name. + +### Generalise datasets using multiple dataset factories +You can have multiple dataset factories in your catalog. For example: + +```yaml +"{namespace}.{dataset_name}@spark": + type: spark.SparkDataSet + filepath: data/{namespace}/{dataset_name}.pq + file_format: parquet + +"{dataset_name}@csv": + type: pandas.CSVDataSet + filepath: data/01_raw/{dataset_name}.csv +``` + +Having multiple dataset factories in your catalog can lead to a situation where a dataset name from your pipeline might +match multiple patterns. To overcome this, Kedro sorts all the potential matches for the dataset name in the pipeline and picks the best match. +The matches are ranked according to the following criteria: + +1. Number of exact character matches between the dataset name and the factory pattern. For example, a dataset named `factory_data$csv` would match `{dataset}_data$csv` over `{dataset_name}$csv`. +2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`. +3. 
Alphabetical order + +### Generalise all datasets with a catch-all dataset factory to overwrite the default `MemoryDataSet` +You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation. + +```yaml +"{default_dataset}": + type: pandas.CSVDataSet + filepath: data/{default_dataset}.csv + +``` +Kedro will now treat all the datasets mentioned in your project's pipelines that do not appear as specific patterns or explicit entries in your catalog +as `pandas.CSVDataSet`. diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md index 6997f9db79..b3f0e2e6a4 100644 --- a/docs/source/data/partitioned_and_incremental_datasets.md +++ b/docs/source/data/partitioned_and_incremental_datasets.md @@ -1,6 +1,6 @@ -## Advanced: Partitioned and incremental datasets +# Advanced: Partitioned and incremental datasets -TO REMOVE -- Diataxis: Explanation +## Partitioned datasets These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible. @@ -15,9 +15,9 @@ This is why Kedro provides a built-in [PartitionedDataSet](/kedro.io.Partitioned In this section, each individual file inside a given location is called a partition. ``` -### Partitioned dataset definition +### How to use `PartitionedDataSet` -`PartitionedDataSet` definition can be put in your `catalog.yml` file like any other regular dataset definition. The definition represents the following structure: +You can use a `PartitionedDataSet` in `catalog.yml` file like any other regular dataset definition: ```yaml # conf/base/catalog.yml @@ -83,22 +83,22 @@ Here is an exhaustive list of the arguments supported by `PartitionedDataSet`: | `filepath_arg` | No | `str` (defaults to `filepath`) | Argument name of the underlying dataset initializer that will contain a path to an individual partition | | `filename_suffix` | No | `str` (defaults to an empty string) | If specified, partitions that don't end with this string will be ignored | -#### Dataset definition +### Dataset definition -Dataset definition should be passed into the `dataset` argument of the `PartitionedDataSet`. The dataset definition is used to instantiate a new dataset object for each individual partition, and use that dataset object for load and save operations. Dataset definition supports shorthand and full notations. +The dataset definition should be passed into the `dataset` argument of the `PartitionedDataSet`. The dataset definition is used to instantiate a new dataset object for each individual partition, and use that dataset object for load and save operations. Dataset definition supports shorthand and full notations. -##### Shorthand notation +#### Shorthand notation Requires you only to specify a class of the underlying dataset either as a string (e.g. 
`pandas.CSVDataSet` or a fully qualified class path like `kedro_datasets.pandas.CSVDataSet`) or as a class object that is a subclass of the [AbstractDataSet](/kedro.io.AbstractDataSet). -##### Full notation +#### Full notation Full notation allows you to specify a dictionary with the full underlying dataset definition _except_ the following arguments: * The argument that receives the partition path (`filepath` by default) - if specified, a `UserWarning` will be emitted stating that this value will be overridden by individual partition paths * `credentials` key - specifying it will result in a `DataSetError` being raised; dataset credentials should be passed into the `credentials` argument of the `PartitionedDataSet` rather than the underlying dataset definition - see the section below on [partitioned dataset credentials](#partitioned-dataset-credentials) for details * `versioned` flag - specifying it will result in a `DataSetError` being raised; versioning cannot be enabled for the underlying datasets -#### Partitioned dataset credentials +### Partitioned dataset credentials ```{note} Support for `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config. @@ -236,8 +236,7 @@ def create_partitions() -> Dict[str, Callable[[], Any]]: ```{note} When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction). ``` - -### Incremental loads with `IncrementalDataSet` +## Incremental datasets [IncrementalDataSet](/kedro.io.IncrementalDataSet) is a subclass of `PartitionedDataSet`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataSet` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs. @@ -245,17 +244,17 @@ This checkpoint, by default, is persisted to the location of the data partitions The checkpoint file is only created _after_ [the partitioned dataset is explicitly confirmed](#incremental-dataset-confirm). -#### Incremental dataset load +### Incremental dataset loads Loading `IncrementalDataSet` works similarly to [`PartitionedDataSet`](#partitioned-dataset-load) with several exceptions: 1. `IncrementalDataSet` loads the data _eagerly_, so the values in the returned dictionary represent the actual data stored in the corresponding partition, rather than a pointer to the load function. `IncrementalDataSet` considers a partition relevant for processing if its ID satisfies the comparison function, given the checkpoint value. 2. `IncrementalDataSet` _does not_ raise a `DataSetError` if load finds no partitions to return - an empty dictionary is returned instead. An empty list of available partitions is part of a normal workflow for `IncrementalDataSet`. -#### Incremental dataset save +### Incremental dataset save The `IncrementalDataSet` save operation is identical to the [save operation of the `PartitionedDataSet`](#partitioned-dataset-save). -#### Incremental dataset confirm +### Incremental dataset confirm ```{note} The checkpoint value *is not* automatically updated when a new set of partitions is successfully loaded or saved. @@ -312,7 +311,7 @@ Important notes about the confirmation operation: * A pipeline cannot contain more than one node confirming the same dataset. 
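To tie the confirmation notes above together, here is a minimal sketch of a node that both consumes the incremental dataset and confirms it, so that the checkpoint only advances after the node has run successfully. The node function and output names are illustrative; `my_partitioned_dataset` matches the catalog entry name used elsewhere on this page.

```python
from kedro.pipeline import node


def process_new_partitions(partitions: dict) -> dict:
    # for IncrementalDataSet the values are the eagerly loaded partition data,
    # keyed by partition ID
    ...


# the node that consumes the incremental dataset also confirms it, which
# updates the checkpoint after a successful run
node(
    func=process_new_partitions,
    inputs="my_partitioned_dataset",
    outputs="processed_partitions",
    confirms="my_partitioned_dataset",
)
```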
-#### Checkpoint configuration +### Checkpoint configuration `IncrementalDataSet` does not require explicit configuration of the checkpoint unless there is a need to deviate from the defaults. To update the checkpoint configuration, add a `checkpoint` key containing the valid dataset configuration. This may be required if, say, the pipeline has read-only permissions to the location of partitions (or write operations are undesirable for any other reason). In such cases, `IncrementalDataSet` can be configured to save the checkpoint elsewhere. The `checkpoint` key also supports partial config updates where only some checkpoint attributes are overwritten, while the defaults are kept for the rest: @@ -328,7 +327,7 @@ my_partitioned_dataset: k1: v1 ``` -#### Special checkpoint config keys +### Special checkpoint config keys Along with the standard dataset attributes, `checkpoint` config also accepts two special optional keys: * `comparison_func` (defaults to `operator.gt`) - a fully qualified import path to the function that will be used to compare a partition ID with the checkpoint value, to determine whether a partition should be processed. Such functions must accept two positional string arguments - partition ID and checkpoint value - and return `True` if such partition is considered to be past the checkpoint. It might be useful to specify your own `comparison_func` if you need to customise the checkpoint filtration mechanism - for example, you might want to implement windowed loading, where you always want to load the partitions representing the last calendar month. See the example config specifying a custom comparison function: From 88002d53eca4de83367d8b60855681fd47fb011f Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Fri, 11 Aug 2023 17:31:10 +0100 Subject: [PATCH 05/19] More changes Signed-off-by: Jo Stichbury --- .../data/advanced_data_catalog_usage.md | 2 +- .../source/data/data_catalog_yaml_examples.md | 40 +++++++++---------- docs/source/data/index.md | 13 ++++-- docs/source/data/kedro_dataset_factories.md | 15 ++++--- 4 files changed, 40 insertions(+), 30 deletions(-) diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md index bb148e7792..b7b3769041 100644 --- a/docs/source/data/advanced_data_catalog_usage.md +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -1,6 +1,6 @@ # Advanced Data Catalog usage -## How to read the same file using two different dataset implementations +## How to read the same file using two different datasets When you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations, you'll need to use transcoding. diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md index 02530e3f62..4bb44ffeee 100644 --- a/docs/source/data/data_catalog_yaml_examples.md +++ b/docs/source/data/data_catalog_yaml_examples.md @@ -6,19 +6,10 @@ TO DO: Add a set of anchor links You can configure your datasets in a YAML configuration file, `conf/base/catalog.yml` or `conf/local/catalog.yml`. -## Provide the `project` value to the underlying filesystem class (`GCSFileSystem`) to interact with Google Cloud Storage (GCS) - -```yaml -test_dataset: - type: ... 
- fs_args: - project: test_project -``` +## Load data from a local binary file using `utf-8` encoding The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively. -## Load data from a local binary file using `utf-8` encoding - ```yaml test_dataset: type: ... @@ -41,7 +32,7 @@ test_dataset: encoding: "utf-8" ``` -## Load / save a CSV file from / to a local file system +## Load/save a CSV file from/to a local file system ```yaml bikes: @@ -49,7 +40,7 @@ bikes: filepath: data/01_raw/bikes.csv ``` -## Load / save a CSV on a local file system, using specified load / save arguments +## Load/save a CSV on a local file system, using specified load/save arguments ```yaml cars: @@ -64,7 +55,7 @@ cars: ``` -## Load / save a compressed CSV on a local file system +## Load/save a compressed CSV on a local file system ```yaml boats: @@ -92,7 +83,7 @@ motorbikes: na_values: ['#NA', NA] ``` -## Load / save a pickle file from / to a local file system +## Load/save a pickle file from/to a local file system ```yaml airplanes: @@ -103,6 +94,8 @@ airplanes: ## Load an Excel file from Google Cloud Storage +The example includes the `project` value for the underlying filesystem class (`GCSFileSystem`) within Google Cloud Storage (GCS) + ```yaml rockets: type: pandas.ExcelDataSet @@ -114,6 +107,7 @@ rockets: sheet_name: Sheet1 ``` + ## Load a multi-sheet Excel file from a local file system ```yaml @@ -136,7 +130,7 @@ results_plot: ``` -## Load / save an HDF file on local file system storage, using specified load / save arguments +## Load/save an HDF file on local file system storage, using specified load/save arguments ```yaml skateboards: @@ -150,7 +144,7 @@ skateboards: dropna: True ``` -## Load / save a parquet file on local file system storage, using specified load / save arguments +## Load/save a parquet file on local file system storage, using specified load/save arguments ```yaml trucks: @@ -168,7 +162,7 @@ trucks: ``` -## Load / save a Spark table on S3, using specified load / save arguments +## Load/save a Spark table on S3, using specified load/save arguments ```yaml weather: @@ -185,7 +179,7 @@ weather: ``` -## Load / save a SQL table using credentials, a database connection, using specified load / save arguments +## Load/save a SQL table using credentials, a database connection, and specified load/save arguments ```yaml scooters: @@ -199,7 +193,7 @@ scooters: if_exists: replace ``` -## Load an SQL table with credentials, a database connection, and applies a SQL query to the table +## Load a SQL table with credentials and a database connection, and apply a SQL query to the table ```yaml @@ -211,10 +205,14 @@ scooters_query: index_col: [name] ``` -When you use [`pandas.SQLTableDataSet`](/kedro_datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro_datasets.pandas.SQLQueryDataSet), you must provide a database connection string. In the above example, we pass it using the `scooters_credentials` key from the credentials (see the details in the [Feeding in credentials](#feeding-in-credentials) section below). `scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only). 
+When you use [`pandas.SQLTableDataSet`](/kedro_datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro_datasets.pandas.SQLQueryDataSet), you must provide a database connection string. In the above example, we pass it using the `scooters_credentials` key from the credentials. + +Note that `scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only). + +## Load data from an API endpoint -## Load data from an API endpoint, example US corn yield data from USDA +This example uses US corn yield data from USDA. ```yaml us_corn_yield_data: diff --git a/docs/source/data/index.md b/docs/source/data/index.md index b40df5b282..16abff0e59 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -6,7 +6,6 @@ In a Kedro project, the Data Catalog is a registry of all data sources available [Kedro provides different built-in datasets in the `kedro-datasets` package](/kedro_datasets) for numerous file types and file systems, so you don’t have to write any of the logic for reading/writing data. - We first introduce the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project. ```{toctree} @@ -23,17 +22,25 @@ The following page offers a range of examples of YAML specification for various data_catalog_yaml_examples ``` -Further pages describe more advanced usage: +Once you are familiar with the format of `catalog.yml`, you may find your catalog gets repetitive if you need to load multiple datasets with similar configuration. From Kedro 0.18.12 you can use dataset factories to generalise the configuration and reduce the number of similar catalog entries. This works by by matching datasets used in your project’s pipelines to dataset factory patterns and is explained in a new page about Kedro dataset factories: + ```{toctree} :maxdepth: 1 kedro_dataset_factories +``` + +Further pages describe more advanced concepts: + +```{toctree} +:maxdepth: 1 + partitioned_and_incremental_datasets advanced_data_catalog_usage ``` -The section concludes with an advanced use case tutorial to create your own custom dataset: +This section on handing data with Kedro concludes with an advanced use case, illustrated with a tutorial that explains how to create your own custom dataset: ```{toctree} :maxdepth: 1 diff --git a/docs/source/data/kedro_dataset_factories.md b/docs/source/data/kedro_dataset_factories.md index 48782950ec..fd5454c3da 100644 --- a/docs/source/data/kedro_dataset_factories.md +++ b/docs/source/data/kedro_dataset_factories.md @@ -3,7 +3,8 @@ You can load multiple datasets with similar configuration using dataset factorie The syntax allows you to generalise the configuration and reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns. -## Generalise datasets with similar names and types into one dataset factory +## Generalise datasets with similar names and types + Consider the following catalog entries: ```yaml @@ -29,7 +30,8 @@ When `factory_data` or `process_data` is used in your pipeline, it is matched to quotes to avoid YAML parsing errors. 
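To illustrate why the quotes matter, the sketch below reuses the `{name}_data` pattern from the example above. A YAML key that starts with an unquoted `{` is interpreted as the opening of a flow mapping rather than as a string, so the catalog file would fail to parse without the quotes:

```yaml
# quoted: the key is read as the literal pattern string "{name}_data"
"{name}_data":
  type: pandas.CSVDataSet
  filepath: data/01_raw/{name}_data.csv

# unquoted, a key such as {name}_data would trigger a YAML parsing error,
# because "{" starts YAML flow-mapping syntax
```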
-## Generalise datasets of the same type into one dataset factory +## Generalise datasets of the same type + You can also combine all the datasets with the same type and configuration details. For example, consider the following catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`: @@ -62,6 +64,7 @@ names are matched with the intended pattern. ```python from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles + def create_pipeline(**kwargs) -> Pipeline: return pipeline( [ @@ -96,7 +99,8 @@ def create_pipeline(**kwargs) -> Pipeline: ] ) ``` -## Generalise datasets using namespaces into one dataset factory +## Generalise datasets using namespaces + You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. Consider the following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the `active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces: @@ -147,7 +151,7 @@ and `candidate_modelling_pipeline.regressor` as below: filepath: data/06_models/regressor_{namespace}.pkl versioned: true ``` -## Generalise datasets of the same type in different layers into one dataset factory with multiple placeholders +## Generalise datasets of the same type in different layers You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset entries share `type`, `file_format` and `save_args`: @@ -209,7 +213,8 @@ The matches are ranked according to the following criteria: 2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`. 3. Alphabetical order -### Generalise all datasets with a catch-all dataset factory to overwrite the default `MemoryDataSet` +### Generalise all datasets with a catch-all dataset factory + You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation. ```yaml From 157a3c38ae264a9f2db2b5eff17a579e7697d6c9 Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Mon, 14 Aug 2023 14:44:00 +0100 Subject: [PATCH 06/19] Further changes Signed-off-by: Jo Stichbury --- .../data/advanced_data_catalog_usage.md | 24 +++++++--------- docs/source/data/data_catalog.md | 2 ++ .../source/data/data_catalog_yaml_examples.md | 28 +++++++++++++++---- 3 files changed, 35 insertions(+), 19 deletions(-) diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md index b7b3769041..386cb8de5f 100644 --- a/docs/source/data/advanced_data_catalog_usage.md +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -2,11 +2,11 @@ ## How to read the same file using two different datasets -When you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations, you'll need to use transcoding. +Use transcoding to load and save a file via its specified `filepath` using more than one `DataSet` implementation. ### A typical example of transcoding -For instance, parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. +Parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. 
To enable transcoding, define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your `conf/base/catalog.yml`: @@ -37,16 +37,12 @@ pipeline( In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and helps resolve the node execution order. In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` -for loading, so the first node should output a `pyspark.sql.DataFrame`, while the second node would receive a `pandas.Dataframe`. +for loading, so the first node outputs a `pyspark.sql.DataFrame`, while the second node receives a `pandas.Dataframe`. ## Access the Data Catalog in code -TO REMOVE -- Diataxis: How to - -You can define a Data Catalog in two ways - through YAML configuration, or programmatically using an API. - -The code API allows you to: +You can define a Data Catalog in two ways. Most use cases can be through a YAML configuration file as [illustrated previously](./data_catalog.md#the-basics-of-catalog-yml), but it is possible to access the Data Catalog programmatically using an API that allows you to: * configure data sources in code * operate the IO module within notebooks @@ -84,15 +80,13 @@ When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key c ### Load datasets -You can access each dataset by its name. +Access each dataset by its name: ```python cars = io.load("cars") # data is now loaded as a DataFrame in 'cars' gear = cars["gear"].values ``` -#### Behind the scenes - The following steps happened behind the scenes when `load` was called: - The value `cars` was located in the Data Catalog @@ -110,7 +104,7 @@ io.list() ### Save data -You can save data using an API similar to that used to load data. +Save data using an API similar to that used to load data. ```{warning} This use is not recommended unless you are prototyping in notebooks. @@ -129,7 +123,7 @@ io.load("cars_cache") #### Save data to a SQL database for querying -We might now want to put the data in a SQLite database to run queries on it. Let's use that to rank scooters by their mpg. +To put the data in a SQLite database: ```python import os @@ -141,12 +135,14 @@ except FileNotFoundError: pass io.save("cars_table", cars) + +# rank scooters by their mpg ranked = io.load("scooters_query")[["brand", "mpg"]] ``` #### Save data in Parquet -Finally, we can save the processed data in Parquet format. +To save the processed data in Parquet format: ```python io.save("ranked", ranked) diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index f6b545374d..f9bde8fee8 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -22,6 +22,8 @@ shuttles: ``` ### Dataset `type` +TO DO + ### Dataset `filepath` Kedro relies on [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. When specifying a storage location in `filepath:`, you should provide a URL using the general form `protocol://path/to/data`. If no protocol is provided, the local file system is assumed (which is the same as ``file://``). 
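To make the `filepath` convention concrete, here is a short sketch that reuses dataset entries from the YAML examples page: the same dataset type can point at a local path (no protocol, which is treated as `file://`) or at an object store, simply by changing the protocol prefix. The bucket name and credentials key are illustrative.

```yaml
bikes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/bikes.csv   # no protocol: local file system

motorbikes:
  type: pandas.CSVDataSet
  # s3:// tells fsspec to use its S3 filesystem implementation
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
  credentials: dev_s3
```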
diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md index 4bb44ffeee..dfd8accc88 100644 --- a/docs/source/data/data_catalog_yaml_examples.md +++ b/docs/source/data/data_catalog_yaml_examples.md @@ -1,10 +1,28 @@ # Data Catalog YAML examples -TO REMOVE -- Diataxis: How to guide (code example) - -TO DO: Add a set of anchor links - -You can configure your datasets in a YAML configuration file, `conf/base/catalog.yml` or `conf/local/catalog.yml`. +This page contains a set of examples to help you structure your YAML configuration file in `conf/base/catalog.yml` or `conf/local/catalog.yml`. + +* [Load data from a local binary file using utf-8 encoding](#todo) +* [Save data to a CSV file without row names (index) using utf-8 encoding](#todo) +* [Load/save a CSV file from/to a local file system](#todo) +* [Load/save a CSV on a local file system, using specified load/save arguments](#todo) +* [Load/save a compressed CSV on a local file system](#todo) +* [Load a CSV file from a specific S3 bucket, using credentials and load arguments](#todo) +* [Load/save a pickle file from/to a local file system](#todo) +* [Load an Excel file from Google Cloud Storage](#todo) +* [Load a multi-sheet Excel file from a local file system](#todo) +* [Save an image created with Matplotlib on Google Cloud Storage](#todo) +* [Load/save an HDF file on local file system storage, using specified load/save arguments](#todo) +* [Load/save a parquet file on local file system storage, using specified load/save arguments](#todo) +* [Load/save a Spark table on S3, using specified load/save arguments](#todo) +* [Load/save a SQL table using credentials, a database connection, and specified load/save arguments](#todo) +* [Load a SQL table with credentials and a database connection, and apply a SQL query to the table](#todo) +* [Load data from an API endpoint](#todo) +* [Load data from Minio (S3 API Compatible Storage)](#todo) +* [Load a model saved as a pickle from Azure Blob Storage](#todo) +* [Load a CSV file stored in a remote location through SSH](#todo) +* [Load multiple datasets with similar configuration using YAML anchors](#todo) +* [Create a Data Catalog YAML configuration file via the CLI](#todo) ## Load data from a local binary file using `utf-8` encoding From f00abd50f608d59f0ac804184df63c36f28e6576 Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Mon, 14 Aug 2023 17:31:33 +0100 Subject: [PATCH 07/19] Another chunk of changes Signed-off-by: Jo Stichbury --- docs/source/configuration/credentials.md | 2 +- .../developer_contributor_guidelines.md | 2 +- .../data/advanced_data_catalog_usage.md | 72 +++++++------------ docs/source/data/data_catalog.md | 15 ++-- .../source/data/data_catalog_yaml_examples.md | 44 ++++++------ .../data/how_to_create_a_custom_dataset.md | 10 ++- .../partitioned_and_incremental_datasets.md | 7 +- docs/source/deployment/argo.md | 2 +- docs/source/deployment/aws_batch.md | 2 +- .../databricks_deployment_workflow.md | 2 +- .../databricks_ide_development_workflow.md | 2 +- docs/source/development/commands_reference.md | 2 +- docs/source/experiment_tracking/index.md | 2 +- docs/source/extend_kedro/common_use_cases.md | 2 +- docs/source/extend_kedro/index.md | 1 - docs/source/faq/faq.md | 2 +- docs/source/nodes_and_pipelines/nodes.md | 2 +- .../kedro_and_notebooks.md | 2 +- docs/source/tutorial/add_another_pipeline.md | 2 +- docs/source/tutorial/set_up_data.md | 2 +- 20 files changed, 78 insertions(+), 99 deletions(-) diff --git 
a/docs/source/configuration/credentials.md b/docs/source/configuration/credentials.md index 620fb569ac..0d91da9cbc 100644 --- a/docs/source/configuration/credentials.md +++ b/docs/source/configuration/credentials.md @@ -3,7 +3,7 @@ For security reasons, we strongly recommend that you *do not* commit any credentials or other secrets to version control. Kedro is set up so that, by default, if a file inside the `conf` folder (and its subfolders) contains `credentials` in its name, it will be ignored by git. -Credentials configuration can be used on its own directly in code or [fed into the `DataCatalog`](../data/data_catalog.md#feeding-in-credentials). +Credentials configuration can be used on its own directly in code or [fed into the `DataCatalog`](../data/data_catalog.md#dataset-access-credentials). If you would rather store your credentials in environment variables instead of a file, you can use the `OmegaConfigLoader` [to load credentials from environment variables](advanced_configuration.md#how-to-load-credentials-through-environment-variables) as described in the advanced configuration chapter. ## How to load credentials in code diff --git a/docs/source/contribution/developer_contributor_guidelines.md b/docs/source/contribution/developer_contributor_guidelines.md index 787a838d90..f2d6f5d71c 100644 --- a/docs/source/contribution/developer_contributor_guidelines.md +++ b/docs/source/contribution/developer_contributor_guidelines.md @@ -62,7 +62,7 @@ We focus on three areas for contribution: `core`, `extras` and `plugin`: - `core` refers to the primary Kedro library. Read the [`core` contribution process](#core-contribution-process) for details. - `extras` refers to features that could be added to `core` that do not introduce too many dependencies or require new Kedro CLI commands to be created. Read the [`extras` contribution process](#extras-contribution-process) for more information. -- [`plugin`](../extend_kedro/plugins.md) refers to new functionality that requires a Kedro CLI command e.g. adding in Airflow functionality and [adding a new dataset](../extend_kedro/custom_datasets.md) to the `kedro-datasets` package. The [`plugin` development documentation](../extend_kedro/plugins.md) contains guidance on how to design and develop a Kedro `plugin`. +- [`plugin`](../extend_kedro/plugins.md) refers to new functionality that requires a Kedro CLI command e.g. adding in Airflow functionality or adding a new dataset to the `kedro-datasets` package. The [`plugin` development documentation](../extend_kedro/plugins.md) contains guidance on how to design and develop a Kedro `plugin`. ### `core` contribution process diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md index 386cb8de5f..338d3aa608 100644 --- a/docs/source/data/advanced_data_catalog_usage.md +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -40,14 +40,11 @@ In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving a for loading, so the first node outputs a `pyspark.sql.DataFrame`, while the second node receives a `pandas.Dataframe`. -## Access the Data Catalog in code +## How to access the Data Catalog in code -You can define a Data Catalog in two ways. Most use cases can be through a YAML configuration file as [illustrated previously](./data_catalog.md#the-basics-of-catalog-yml), but it is possible to access the Data Catalog programmatically using an API that allows you to: +You can define a Data Catalog in two ways. 
Most use cases can be through a YAML configuration file as [illustrated previously](./data_catalog.md), but it is possible to access the Data Catalog programmatically through [`kedro.io.DataCatalog`](/kedro.io.DataCatalog) using an API that allows you to configure data sources in code and use the IO module within notebooks. -* configure data sources in code -* operate the IO module within notebooks - -### Configure a Data Catalog +### How to configure a Data Catalog using the `DataCatalog` API In a file like `catalog.py`, you can construct a `DataCatalog` object programmatically. In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets). @@ -78,9 +75,17 @@ io = DataCatalog( When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above we pass it as part of `credentials` argument. Alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only). -### Load datasets +### How to view the available data sources programmatically + +To review the `DataCatalog`: + +```python +io.list() +``` + +### How to load datasets programmatically -Access each dataset by its name: +To access each dataset by its name: ```python cars = io.load("cars") # data is now loaded as a DataFrame in 'cars' @@ -94,23 +99,15 @@ The following steps happened behind the scenes when `load` was called: - The `load` method of this dataset was called - This `load` method delegated the loading to the underlying pandas `read_csv` function -### View the available data sources - -If you forget what data was assigned, you can always review the `DataCatalog`. - -```python -io.list() -``` - -### Save data +### How to save data programmatically -Save data using an API similar to that used to load data. +To save data using an API similar to that used to load data: ```{warning} This use is not recommended unless you are prototyping in notebooks. ``` -#### Save data to memory +#### How to save data to memory ```python from kedro.io import MemoryDataSet @@ -121,9 +118,9 @@ io.save("cars_cache", "Memory can store anything.") io.load("cars_cache") ``` -#### Save data to a SQL database for querying +#### How to save data to a SQL database for querying -To put the data in a SQLite database: +To put the data in a SQLite database: ```python import os @@ -140,7 +137,7 @@ io.save("cars_table", cars) ranked = io.load("scooters_query")[["brand", "mpg"]] ``` -#### Save data in Parquet +#### How to save data in Parquet To save the processed data in Parquet format: @@ -152,7 +149,7 @@ io.save("ranked", ranked) Saving `None` to a dataset is not allowed! ``` -### Accessing a dataset that needs credentials +### How to access a dataset programmatically with credentials Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument. 
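As a minimal sketch of that flow, with placeholder dictionaries standing in for the configuration Kedro resolves from `conf/base/catalog.yml` and `conf/local/credentials.yml`, the resolved credentials are simply forwarded when the catalog is built, and the `credentials: dev_s3` reference in the dataset entry is looked up against the matching key:

```python
from kedro.io import DataCatalog

# placeholder dictionaries standing in for the resolved project configuration
catalog_config = {
    "motorbikes": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://test_bucket/data/02_intermediate/company/motorbikes.csv",
        "credentials": "dev_s3",
    }
}
credentials = {
    "dev_s3": {
        "client_kwargs": {
            "aws_access_key_id": "key",
            "aws_secret_access_key": "secret",
        }
    }
}

io = DataCatalog.from_config(catalog_config, credentials=credentials)
```

With the catalog built this way, `io.load("motorbikes")` would read the CSV from S3 using those credentials.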
Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents: @@ -180,29 +177,11 @@ CSVDataSet( ) ``` -### Versioning using the Code API +### How to version a dataset using the Code API -In order to do that, pass a dictionary with exact load versions to `DataCatalog.from_config`: +In an earlier section of the documentation we described how [Kedro enables dataset and ML model versioning](./data_catalog.md/#dataset-versioning). -```python -load_versions = {"cars": "2019-02-13T14.35.36.518Z"} -io = DataCatalog.from_config(catalog_config, credentials, load_versions=load_versions) -cars = io.load("cars") -``` - -The last row in the example above would attempt to load a CSV file from `data/01_raw/company/car_data.csv/2019-02-13T14.35.36.518Z/car_data.csv`: - -* `load_versions` configuration has an effect only if a dataset versioning has been enabled in the catalog config file - see the example above. - -* We recommend that you do not override `save_version` argument in `DataCatalog.from_config` unless strongly required to do so, since it may lead to inconsistencies between loaded and saved versions of the versioned datasets. - -```{warning} -The `DataCatalog` does not re-generate save versions between instantiations. Therefore, if you call `catalog.save('cars', some_data)` twice, then the second call will fail, since it tries to overwrite a versioned dataset using the same save version. To mitigate this, reload your data catalog by calling `%reload_kedro` line magic. This limitation does not apply to `load` operation. -``` - -**** HOW DOES THE BELOW FIT WITH THE ABOVE?? - -Should you require more control over load and save versions of a specific dataset, you can instantiate `Version` and pass it as a parameter to the dataset initialisation: +If you require programmatic control over load and save versions of a specific dataset, you can instantiate `Version` and pass it as a parameter to the dataset initialisation: ```python from kedro.io import DataCatalog, Version @@ -231,9 +210,8 @@ reloaded = io.load("test_data_set") assert data2.equals(reloaded) ``` -```{note} -In the example above, we did not fix any versions. If we do, then the behaviour of load and save operations becomes slightly different: -``` +In the example above, we do not fix any versions. The behaviour of load and save operations becomes slightly different when we set a version: + ```python version = Version( diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index f9bde8fee8..8605cb9d40 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -22,7 +22,9 @@ shuttles: ``` ### Dataset `type` -TO DO +Kedro offers a range of datasets, including CSV, Excel, Parquet, Feather, HDF5, JSON, Pickle, SQL Tables, SQL Queries, Spark DataFrames and more. They are supported with the APIs of pandas, spark, networkx, matplotlib, yaml and more. + +[The `kedro-datasets` package documentation](/kedro_datasets) contains a comprehensive list of all available file types. ### Dataset `filepath` @@ -138,10 +140,12 @@ kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ ``` where `--load-version` is dataset name and version timestamp separated by `:`. 
-Currently, the following datasets support versioning: +A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataSet) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. In Kedro version 0.18.2, the following datasets support versioning: -- `kedro_datasets.matplotlib.MatplotlibWriter` +- `kedro_datasets.api.APIDataSet` - `kedro_datasets.holoviews.HoloviewsWriter` +- `kedro_datasets.json.JSONDataSet` +- `kedro_datasets.matplotlib.MatplotlibWriter` - `kedro_datasets.networkx.NetworkXDataSet` - `kedro_datasets.pandas.CSVDataSet` - `kedro_datasets.pandas.ExcelDataSet` @@ -151,18 +155,15 @@ Currently, the following datasets support versioning: - `kedro_datasets.pandas.ParquetDataSet` - `kedro_datasets.pickle.PickleDataSet` - `kedro_datasets.pillow.ImageDataSet` +- `kedro_datasets.tensorflow.TensorFlowModelDataSet` - `kedro_datasets.text.TextDataSet` - `kedro_datasets.spark.SparkDataSet` - `kedro_datasets.yaml.YAMLDataSet` -- `kedro_datasets.api.APIDataSet` -- `kedro_datasets.tensorflow.TensorFlowModelDataSet` -- `kedro_datasets.json.JSONDataSet` ```{note} Although HTTP(S) is a supported file system in the dataset implementations, it does not support versioning. ``` - ## Use the Data Catalog within Kedro configuration Kedro configuration enables you to organise your project for different stages of your data pipeline. For example, you might need different Data Catalog settings for development, testing, and production environments. diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md index dfd8accc88..4c74162f9a 100644 --- a/docs/source/data/data_catalog_yaml_examples.md +++ b/docs/source/data/data_catalog_yaml_examples.md @@ -2,27 +2,29 @@ This page contains a set of examples to help you structure your YAML configuration file in `conf/base/catalog.yml` or `conf/local/catalog.yml`. 
-* [Load data from a local binary file using utf-8 encoding](#todo) -* [Save data to a CSV file without row names (index) using utf-8 encoding](#todo) -* [Load/save a CSV file from/to a local file system](#todo) -* [Load/save a CSV on a local file system, using specified load/save arguments](#todo) -* [Load/save a compressed CSV on a local file system](#todo) -* [Load a CSV file from a specific S3 bucket, using credentials and load arguments](#todo) -* [Load/save a pickle file from/to a local file system](#todo) -* [Load an Excel file from Google Cloud Storage](#todo) -* [Load a multi-sheet Excel file from a local file system](#todo) -* [Save an image created with Matplotlib on Google Cloud Storage](#todo) -* [Load/save an HDF file on local file system storage, using specified load/save arguments](#todo) -* [Load/save a parquet file on local file system storage, using specified load/save arguments](#todo) -* [Load/save a Spark table on S3, using specified load/save arguments](#todo) -* [Load/save a SQL table using credentials, a database connection, and specified load/save arguments](#todo) -* [Load a SQL table with credentials and a database connection, and apply a SQL query to the table](#todo) -* [Load data from an API endpoint](#todo) -* [Load data from Minio (S3 API Compatible Storage)](#todo) -* [Load a model saved as a pickle from Azure Blob Storage](#todo) -* [Load a CSV file stored in a remote location through SSH](#todo) -* [Load multiple datasets with similar configuration using YAML anchors](#todo) -* [Create a Data Catalog YAML configuration file via the CLI](#todo) + + +* [Load data from a local binary file using utf-8 encoding](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Save data to a CSV file without row names (index) using utf-8 encoding](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load/save a CSV file from/to a local file system](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load/save a CSV on a local file system, using specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load/save a compressed CSV on a local file system](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load a CSV file from a specific S3 bucket, using credentials and load arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load/save a pickle file from/to a local file system](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load an Excel file from Google Cloud Storage](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load a multi-sheet Excel file from a local file system](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Save an image created with Matplotlib on Google Cloud Storage](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load/save an HDF file on local file system storage, using specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load/save a parquet file on local file system storage, using specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load/save a Spark table on S3, using specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load/save a SQL table using credentials, a database connection, and specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load a SQL table with credentials and a database connection, and apply a SQL query to the 
table](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load data from an API endpoint](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load data from Minio (S3 API Compatible Storage)](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load a model saved as a pickle from Azure Blob Storage](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load a CSV file stored in a remote location through SSH](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Load multiple datasets with similar configuration using YAML anchors](#load-data-from-a-local-binary-file-using-utf-8-encoding) +* [Create a Data Catalog YAML configuration file via the CLI](#load-data-from-a-local-binary-file-using-utf-8-encoding) ## Load data from a local binary file using `utf-8` encoding diff --git a/docs/source/data/how_to_create_a_custom_dataset.md b/docs/source/data/how_to_create_a_custom_dataset.md index 4e259408a9..483ce90557 100644 --- a/docs/source/data/how_to_create_a_custom_dataset.md +++ b/docs/source/data/how_to_create_a_custom_dataset.md @@ -1,7 +1,5 @@ # Advanced: Tutorial to create a custom dataset -TO REMOVE -- Diataxis: Tutorial - [Kedro supports many datasets](/kedro_datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data. ## AbstractDataSet @@ -100,7 +98,7 @@ src/kedro_pokemon/extras ## Implement the `_load` method with `fsspec` -Many of the built-in Kedro datasets rely on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) as a consistent interface to different data sources, as described earlier in the section about the [Data Catalog](../data/data_catalog.md#specify-the-location-of-the-dataset). In this example, it's particularly convenient to use `fsspec` in conjunction with `Pillow` to read image data, since it allows the dataset to work flexibly with different image locations and formats. +Many of the built-in Kedro datasets rely on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) as a consistent interface to different data sources, as described earlier in the section about the [Data Catalog](../data/data_catalog.md#dataset-filepath). In this example, it's particularly convenient to use `fsspec` in conjunction with `Pillow` to read image data, since it allows the dataset to work flexibly with different image locations and formats. Here is the implementation of the `_load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array: @@ -273,7 +271,7 @@ class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]): Currently, the `ImageDataSet` only works with a single image, but this example needs to load all Pokemon images from the raw data directory for further processing. -Kedro's [`PartitionedDataSet`](../data/kedro_io.md#partitioned-dataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory. +Kedro's [`PartitionedDataSet`](./partitioned_and_incremental_datasets.md) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory. 
To use `PartitionedDataSet` with `ImageDataSet` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataSet` loads all PNG files from the data directory using `ImageDataSet`: @@ -330,7 +328,7 @@ Versioned dataset `__init__` method must have an optional argument called `versi ```{note} Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time. ``` -To add [Versioning](../data/kedro_io.md#versioning) support to the new dataset we need to extend the +To add versioning support to the new dataset we need to extend the [AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataSet) to: * Accept a `version` keyword argument as part of the constructor @@ -591,7 +589,7 @@ class ImageDataSet(AbstractVersionedDataSet): ... ``` -We provide additional examples of [how to use parameters through the data catalog's YAML API](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api). For an example of how to use these parameters in your dataset's constructor, please see the [SparkDataSet](/kedro_datasets.spark.SparkDataSet)'s implementation. +We provide additional examples of [how to use parameters through the data catalog's YAML API](./data_catalog_yaml_examples.md). For an example of how to use these parameters in your dataset's constructor, please see the [SparkDataSet](/kedro_datasets.spark.SparkDataSet)'s implementation. ## How to contribute a custom dataset implementation diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md index b3f0e2e6a4..f1345af097 100644 --- a/docs/source/data/partitioned_and_incremental_datasets.md +++ b/docs/source/data/partitioned_and_incremental_datasets.md @@ -2,7 +2,7 @@ ## Partitioned datasets -These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible. +Distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible. 
This is why Kedro provides a built-in [PartitionedDataSet](/kedro.io.PartitionedDataSet), with the following features: @@ -177,7 +177,7 @@ new_partitioned_dataset: filename_suffix: ".csv" ``` -node definition: +Here is the node definition: ```python from kedro.pipeline import node @@ -185,7 +185,7 @@ from kedro.pipeline import node node(create_partitions, inputs=None, outputs="new_partitioned_dataset") ``` -and underlying node function `create_partitions`: +The underlying node function is as follows in `create_partitions`: ```python from typing import Any, Dict @@ -212,6 +212,7 @@ Writing to an existing partition may result in its data being overwritten, if th ### Partitioned dataset lazy saving `PartitionedDataSet` also supports lazy saving, where the partition's data is not materialised until it is time to write. + To use this, simply return `Callable` types in the dictionary: ```python diff --git a/docs/source/deployment/argo.md b/docs/source/deployment/argo.md index f66b809b0e..9207debe3d 100644 --- a/docs/source/deployment/argo.md +++ b/docs/source/deployment/argo.md @@ -24,7 +24,7 @@ To use Argo Workflows, ensure you have the following prerequisites in place: - [Argo Workflows is installed](https://github.com/argoproj/argo/blob/master/README.md#quickstart) on your Kubernetes cluster - [Argo CLI is installed](https://github.com/argoproj/argo/releases) on your machine - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node) since it is used to build a DAG -- [All node input/output DataSets must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3); you cannot use the `MemoryDataSet` in your workflow +- [All node input/output DataSets must be configured in `catalog.yml`](../data/data_catalog_yaml_examples.md) and refer to an external location (e.g. AWS S3); you cannot use the `MemoryDataSet` in your workflow ```{note} Each node will run in its own container. diff --git a/docs/source/deployment/aws_batch.md b/docs/source/deployment/aws_batch.md index 976d5e9e5a..c83b58f8ea 100644 --- a/docs/source/deployment/aws_batch.md +++ b/docs/source/deployment/aws_batch.md @@ -18,7 +18,7 @@ To use AWS Batch, ensure you have the following prerequisites in place: - An [AWS account set up](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/). - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node). Each node will run in its own Batch job, so having sensible node names will make it easier to `kedro run --node=`. -- [All node input/output `DataSets` must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3). A clean way to do this is to create a new configuration environment `conf/aws_batch` containing a `catalog.yml` file with the appropriate configuration, as illustrated below. +- [All node input/output `DataSets` must be configured in `catalog.yml`](../data/data_catalog_yaml_examples.md) and refer to an external location (e.g. AWS S3). A clean way to do this is to create a new configuration environment `conf/aws_batch` containing a `catalog.yml` file with the appropriate configuration, as illustrated below.
Click to expand diff --git a/docs/source/deployment/databricks/databricks_deployment_workflow.md b/docs/source/deployment/databricks/databricks_deployment_workflow.md index 799a5044c1..245708e6bf 100644 --- a/docs/source/deployment/databricks/databricks_deployment_workflow.md +++ b/docs/source/deployment/databricks/databricks_deployment_workflow.md @@ -170,7 +170,7 @@ A Kedro project's configuration and data do not get included when it is packaged Your packaged Kedro project needs access to data and configuration in order to run. Therefore, you will need to upload your project's data and configuration to a location accessible to Databricks. In this guide, we will store the data on the Databricks File System (DBFS). -The `databricks-iris` starter contains a [catalog](../../data/data_catalog.md#the-data-catalog) that is set up to access data stored in DBFS (`/conf/`). You will point your project to use configuration stored on DBFS using the `--conf-source` option when you create your job on Databricks. +The `databricks-iris` starter contains a [catalog](../../data/data_catalog.md) that is set up to access data stored in DBFS (`/conf/`). You will point your project to use configuration stored on DBFS using the `--conf-source` option when you create your job on Databricks. There are several ways to upload data to DBFS: you can use the [DBFS API](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/dbfs), the [`dbutils` module](https://docs.databricks.com/dev-tools/databricks-utils.html) in a Databricks notebook or the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/dbfs-cli.html). In this guide, it is recommended to use the Databricks CLI because of the convenience it offers. diff --git a/docs/source/deployment/databricks/databricks_ide_development_workflow.md b/docs/source/deployment/databricks/databricks_ide_development_workflow.md index dc723189c9..2cf8f40ca2 100644 --- a/docs/source/deployment/databricks/databricks_ide_development_workflow.md +++ b/docs/source/deployment/databricks/databricks_ide_development_workflow.md @@ -142,7 +142,7 @@ Name the new folder `local`. In this guide, we have no local credentials to stor When run on Databricks, Kedro cannot access data stored in your project's directory. Therefore, you will need to upload your project's data to an accessible location. In this guide, we will store the data on the Databricks File System (DBFS). -The `databricks-iris` starter contains a [catalog](../../data/data_catalog.md#the-data-catalog) that is set up to access data stored in DBFS (`/conf/`). You will point your project to use configuration stored on DBFS using the `--conf-source` option when you create your job on Databricks. +The `databricks-iris` starter contains a [catalog](../../data/data_catalog.md) that is set up to access data stored in DBFS (`/conf/`). You will point your project to use configuration stored on DBFS using the `--conf-source` option when you create your job on Databricks. There are several ways to upload data to DBFS. In this guide, it is recommended to use [Databricks CLI](https://docs.databricks.com/dev-tools/cli/dbfs-cli.html) because of the convenience it offers. 
At the command line in your local environment, use the following Databricks CLI command to upload your locally stored data to DBFS: diff --git a/docs/source/development/commands_reference.md b/docs/source/development/commands_reference.md index 45801ea112..815bae91f8 100644 --- a/docs/source/development/commands_reference.md +++ b/docs/source/development/commands_reference.md @@ -498,7 +498,7 @@ kedro catalog list --pipeline=ds,de kedro catalog rank ``` -The output includes a list of any [dataset factories](../data/data_catalog.md#load-multiple-datasets-with-similar-configuration-using-dataset-factories) in the catalog, ranked by the priority on which they are matched against. +The output includes a list of any [dataset factories](../data/kedro_dataset_factories.md) in the catalog, ranked by the priority on which they are matched against. #### Data Catalog diff --git a/docs/source/experiment_tracking/index.md b/docs/source/experiment_tracking/index.md index a8e94dd05b..31bff89ee2 100644 --- a/docs/source/experiment_tracking/index.md +++ b/docs/source/experiment_tracking/index.md @@ -19,7 +19,7 @@ Kedro's [experiment tracking demo](https://demo.kedro.org/experiment-tracking) e ![](../meta/images/experiment-tracking_demo.gif) ## Kedro versions supporting experiment tracking -Kedro has always supported parameter versioning (as part of your codebase with a version control system like `git`) and Kedro’s dataset versioning capabilities enabled you to [snapshot models, datasets and plots](../data/data_catalog.md#version-datasets-and-ml-models). +Kedro has always supported parameter versioning (as part of your codebase with a version control system like `git`) and Kedro’s dataset versioning capabilities enabled you to [snapshot models, datasets and plots](../data/data_catalog.md#dataset-versioning). Kedro-Viz version 4.1.1 introduced metadata capture, visualisation, discovery and comparison, enabling you to access, edit and [compare your experiments](#access-run-data-and-compare-runs) and additionally [track how your metrics change over time](#view-and-compare-metrics-data). diff --git a/docs/source/extend_kedro/common_use_cases.md b/docs/source/extend_kedro/common_use_cases.md index 04b36d6ca5..9f8d32dc9f 100644 --- a/docs/source/extend_kedro/common_use_cases.md +++ b/docs/source/extend_kedro/common_use_cases.md @@ -12,7 +12,7 @@ This can now achieved by using [Hooks](../hooks/introduction.md), to define the ## Use Case 2: How to integrate Kedro with additional data sources -You can use [DataSets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md). +You can use [DataSets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](../data/how_to_create_a_custom_dataset.md). ## Use Case 3: How to add or modify CLI commands diff --git a/docs/source/extend_kedro/index.md b/docs/source/extend_kedro/index.md index f368ac9a73..fefa8e21f9 100644 --- a/docs/source/extend_kedro/index.md +++ b/docs/source/extend_kedro/index.md @@ -4,6 +4,5 @@ :maxdepth: 1 common_use_cases -custom_datasets plugins ``` diff --git a/docs/source/faq/faq.md b/docs/source/faq/faq.md index 75790690a9..754fb34e19 100644 --- a/docs/source/faq/faq.md +++ b/docs/source/faq/faq.md @@ -41,7 +41,7 @@ This is a growing set of technical FAQs. 
The [product FAQs on the Kedro website] ## Datasets and the Data Catalog -* [Can I read the same data file using two different dataset implementations](../data/data_catalog.md#transcode-datasets)? +* [Can I read the same data file using two different dataset implementations](../data/advanced_data_catalog_usage.md#how-to-read-the-same-file-using-two-different-datasets)? ## Nodes and pipelines diff --git a/docs/source/nodes_and_pipelines/nodes.md b/docs/source/nodes_and_pipelines/nodes.md index 7a22b8765e..d740e0a8ca 100644 --- a/docs/source/nodes_and_pipelines/nodes.md +++ b/docs/source/nodes_and_pipelines/nodes.md @@ -213,7 +213,7 @@ With `pandas` built-in support, you can use the `chunksize` argument to read dat ### Saving data with Generators To use generators to save data lazily, you need do three things: - Update the `make_prediction` function definition to use `return` instead of `yield`. -- Create a [custom dataset](../extend_kedro/custom_datasets.md) called `ChunkWiseCSVDataset` +- Create a [custom dataset](../data/how_to_create_a_custom_dataset.md) called `ChunkWiseCSVDataset` - Update `catalog.yml` to use a newly created `ChunkWiseCSVDataset`. Copy the following code to `nodes.py`. The main change is to use a new model `DecisionTreeClassifier` to make prediction by chunks in `make_predictions`. diff --git a/docs/source/notebooks_and_ipython/kedro_and_notebooks.md b/docs/source/notebooks_and_ipython/kedro_and_notebooks.md index d32139b2f8..8344b1346f 100644 --- a/docs/source/notebooks_and_ipython/kedro_and_notebooks.md +++ b/docs/source/notebooks_and_ipython/kedro_and_notebooks.md @@ -101,7 +101,7 @@ INFO Loading data from 'parameters' (MemoryDataSet)... ``` ```{note} -If you enable [versioning](../data/data_catalog.md#version-datasets-and-ml-models) you can load a particular version of a dataset, e.g. `catalog.load("example_train_x", version="2021-12-13T15.08.09.255Z")`. +If you enable [versioning](../data/data_catalog.md#dataset-versioning) you can load a particular version of a dataset, e.g. `catalog.load("example_train_x", version="2021-12-13T15.08.09.255Z")`. ``` ### `context` diff --git a/docs/source/tutorial/add_another_pipeline.md b/docs/source/tutorial/add_another_pipeline.md index 95093b5d0b..1ceba96edc 100644 --- a/docs/source/tutorial/add_another_pipeline.md +++ b/docs/source/tutorial/add_another_pipeline.md @@ -125,7 +125,7 @@ regressor: versioned: true ``` -By setting `versioned` to `true`, versioning is enabled for `regressor`. This means that the pickled output of the `regressor` is saved every time the pipeline runs, which stores the history of the models built using this pipeline. You can learn more in the [Versioning section](../data/kedro_io.md#versioning). +By setting `versioned` to `true`, versioning is enabled for `regressor`. This means that the pickled output of the `regressor` is saved every time the pipeline runs, which stores the history of the models built using this pipeline. You can learn more in the [later section about dataset and ML model versioning](../data/data_catalog.md#dataset-versioning). ## Data science pipeline diff --git a/docs/source/tutorial/set_up_data.md b/docs/source/tutorial/set_up_data.md index 364818b3a1..2315f04068 100644 --- a/docs/source/tutorial/set_up_data.md +++ b/docs/source/tutorial/set_up_data.md @@ -120,7 +120,7 @@ When you have finished, close `ipython` session with `exit()`. [Kedro supports numerous datasets](/kedro_datasets) out of the box, but you can also add support for any proprietary data format or filesystem. 
-You can find further information about [how to add support for custom datasets](../extend_kedro/custom_datasets.md) in specific documentation covering advanced usage. +You can find further information about [how to add support for custom datasets](../data/how_to_create_a_custom_dataset.md) in specific documentation covering advanced usage. ### Supported data locations From 634ea0d65d88eb9663f808015fa64909694d7a08 Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Mon, 14 Aug 2023 17:34:03 +0100 Subject: [PATCH 08/19] Final changes Signed-off-by: Jo Stichbury --- .../data/how_to_create_a_custom_dataset.md | 21 ------------------- 1 file changed, 21 deletions(-) diff --git a/docs/source/data/how_to_create_a_custom_dataset.md b/docs/source/data/how_to_create_a_custom_dataset.md index 483ce90557..3ded1637a5 100644 --- a/docs/source/data/how_to_create_a_custom_dataset.md +++ b/docs/source/data/how_to_create_a_custom_dataset.md @@ -304,27 +304,6 @@ $ ls -la data/01_raw/pokemon-images-and-types/images/images/*.png | wc -l ### How to implement versioning in your dataset - -***** TOOK THIS FROM A SEPARATE PAGE ON KEDRO-IO - -In order to enable versioning, you need to update the `catalog.yml` config file and set the `versioned` attribute to `true` for the given dataset. If this is a custom dataset, the implementation must also: - 1. extend `kedro.io.core.AbstractVersionedDataSet` AND - 2. add `version` namedtuple as an argument to its `__init__` method AND - 3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro_datasets.pandas.CSVDataSet) as an example) AND - 4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro_datasets.pandas.CSVDataSet) for an example implementation) - - -### `version` namedtuple - -Versioned dataset `__init__` method must have an optional argument called `version` with a default value of `None`. If provided, this argument must be an instance of [`kedro.io.core.Version`](/kedro.io.Version). Its `load` and `save` attributes must either be `None` or contain string values representing exact load and save versions: - -* If `version` is `None`, then the dataset is considered *not versioned*. -* If `version.load` is `None`, then the latest available version will be used to load the dataset, otherwise a string representing exact load version must be provided. -* If `version.save` is `None`, then a new save version string will be generated by calling `kedro.io.core.generate_timestamp()`, otherwise a string representing the exact save version must be provided. - - -*****THIS WAS THE ORIGINAL CONTENT - ```{note} Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time. 
``` From 47d6ac63aa8bf3928cfb8774cbe74b117b80253b Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Mon, 14 Aug 2023 17:40:12 +0100 Subject: [PATCH 09/19] Revise ordering of pages Signed-off-by: Jo Stichbury --- docs/source/data/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/index.md b/docs/source/data/index.md index 16abff0e59..b90a3d9961 100644 --- a/docs/source/data/index.md +++ b/docs/source/data/index.md @@ -36,8 +36,8 @@ Further pages describe more advanced concepts: ```{toctree} :maxdepth: 1 -partitioned_and_incremental_datasets advanced_data_catalog_usage +partitioned_and_incremental_datasets ``` This section on handing data with Kedro concludes with an advanced use case, illustrated with a tutorial that explains how to create your own custom dataset: From 24cd7196ae74e906fa9601e5c72d44d39dad109e Mon Sep 17 00:00:00 2001 From: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com> Date: Thu, 17 Aug 2023 10:43:54 +0100 Subject: [PATCH 10/19] Add new CLI commands to dataset factory docs (#2935) * Add changes from #2930 Signed-off-by: Ahdra Merali * Lint Signed-off-by: Ahdra Merali * Apply suggestions from code review Co-authored-by: Jo Stichbury * Make code snippets collapsable Signed-off-by: Ahdra Merali --------- Signed-off-by: Ahdra Merali Co-authored-by: Jo Stichbury --- docs/source/data/kedro_dataset_factories.md | 158 ++++++++++++++++++++ 1 file changed, 158 insertions(+) diff --git a/docs/source/data/kedro_dataset_factories.md b/docs/source/data/kedro_dataset_factories.md index fd5454c3da..fe9687d825 100644 --- a/docs/source/data/kedro_dataset_factories.md +++ b/docs/source/data/kedro_dataset_factories.md @@ -225,3 +225,161 @@ You can use dataset factories to define a catch-all pattern which will overwrite ``` Kedro will now treat all the datasets mentioned in your project's pipelines that do not appear as specific patterns or explicit entries in your catalog as `pandas.CSVDataSet`. + +## CLI commands for dataset factories + +To manage your dataset factories, two new commands have been added to the Kedro CLI: `kedro catalog rank` (0.18.12) and `kedro catalog resolve` (0.18.13). + +#### How to use `kedro catalog rank` + +This command outputs a list of all dataset factories in the catalog, ranked in the order by which pipeline datasets are matched against them. The ordering is determined by the following criteria: + +1. The number of non-placeholder characters in the pattern +2. The number of placeholders in the pattern +3. Alphabetic ordering + +Consider a catalog file with the following patterns: + +
+Click to expand + +```yaml +"{layer}.{dataset_name}": + type: pandas.CSVDataSet + filepath: data/{layer}/{dataset_name}.csv + +preprocessed_{dataset_name}: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/preprocessed_{dataset_name}.pq + +processed_{dataset_name}: + type: pandas.ParquetDataSet + filepath: data/03_primary/processed_{dataset_name}.pq + +"{dataset_name}_csv": + type: pandas.CSVDataSet + filepath: data/03_primary/{dataset_name}.csv + +"{namespace}.{dataset_name}_pq": + type: pandas.ParquetDataSet + filepath: data/03_primary/{dataset_name}_{namespace}.pq + +"{default_dataset}": + type: pickle.PickleDataSet + filepath: data/01_raw/{default_dataset}.pickle +``` +
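To try the ranking against your own project, you would run the command from the project root; this is a sketch that assumes the patterns above live in `conf/base/catalog.yml` of a standard Kedro project:

```bash
kedro catalog rank
```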
+ +Running `kedro catalog rank` will result in the following output: + +``` +- preprocessed_{dataset_name} +- processed_{dataset_name} +- '{namespace}.{dataset_name}_pq' +- '{dataset_name}_csv' +- '{layer}.{dataset_name}' +- '{default_dataset}' +``` + +As we can see, the entries are ranked firstly by how many non-placeholders are in the pattern, in descending order. Where two entries have the same number of non-placeholder characters, `{namespace}.{dataset_name}_pq` and `{dataset_name}_csv` with four each, they are then ranked by the number of placeholders, also in decreasing order. `{default_dataset}` is the least specific pattern possible, and will always be matched against last. + +#### How to use `kedro catalog resolve` + +This command resolves dataset patterns in the catalog against any explicit dataset entries in the project pipeline. The resulting output contains all explicit dataset entries in the catalog and any dataset in the default pipeline that resolves some dataset pattern. + +To illustrate this, consider the following catalog file: + +
+Click to expand + +```yaml +companies: + type: pandas.CSVDataSet + filepath: data/01_raw/companies.csv + +reviews: + type: pandas.CSVDataSet + filepath: data/01_raw/reviews.csv + +shuttles: + type: pandas.ExcelDataSet + filepath: data/01_raw/shuttles.xlsx + load_args: + engine: openpyxl # Use modern Excel engine, it is the default since Kedro 0.18.0 + +preprocessed_{name}: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/preprocessed_{name}.pq + +"{default}": + type: pandas.ParquetDataSet + filepath: data/03_primary/{default}.pq +``` +
+ +and the following pipeline in `pipeline.py`: + +
+Click to expand + +```python +def create_pipeline(**kwargs) -> Pipeline: + return pipeline( + [ + node( + func=preprocess_companies, + inputs="companies", + outputs="preprocessed_companies", + name="preprocess_companies_node", + ), + node( + func=preprocess_shuttles, + inputs="shuttles", + outputs="preprocessed_shuttles", + name="preprocess_shuttles_node", + ), + node( + func=create_model_input_table, + inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"], + outputs="model_input_table", + name="create_model_input_table_node", + ), + ] + ) +``` +
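To produce the resolved view for this setup, run `kedro catalog resolve` from the project root; a sketch assuming the catalog and pipeline shown above:

```bash
kedro catalog resolve
```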
+ +The resolved catalog output by the command will be as follows: + +
+Click to expand + +```yaml +companies: + filepath: data/01_raw/companies.csv + type: pandas.CSVDataSet +model_input_table: + filepath: data/03_primary/model_input_table.pq + type: pandas.ParquetDataSet +preprocessed_companies: + filepath: data/02_intermediate/preprocessed_companies.pq + type: pandas.ParquetDataSet +preprocessed_shuttles: + filepath: data/02_intermediate/preprocessed_shuttles.pq + type: pandas.ParquetDataSet +reviews: + filepath: data/01_raw/reviews.csv + type: pandas.CSVDataSet +shuttles: + filepath: data/01_raw/shuttles.xlsx + load_args: + engine: openpyxl + type: pandas.ExcelDataSet +``` +
+ +By default this is output to the terminal. However, if you wish to output the resolved catalog to a specific file, you can use the redirection operator `>`: + +```bash +kedro catalog resolve > output_file.yaml +``` From 43a55ec2b9e568720bfa6aceca7f670689d71ffa Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Thu, 17 Aug 2023 13:12:26 +0100 Subject: [PATCH 11/19] Bunch of changes from feedback Signed-off-by: Jo Stichbury --- .../data/advanced_data_catalog_usage.md | 72 +++++-------------- docs/source/data/data_catalog.md | 8 ++- .../source/data/data_catalog_yaml_examples.md | 59 +++++++++------ .../data/how_to_create_a_custom_dataset.md | 2 +- docs/source/data/kedro_dataset_factories.md | 16 ++--- docs/source/faq/faq.md | 3 - 6 files changed, 67 insertions(+), 93 deletions(-) diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md index 338d3aa608..70465022bd 100644 --- a/docs/source/data/advanced_data_catalog_usage.md +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -1,52 +1,12 @@ -# Advanced Data Catalog usage - -## How to read the same file using two different datasets - -Use transcoding to load and save a file via its specified `filepath` using more than one `DataSet` implementation. - -### A typical example of transcoding - -Parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. - -To enable transcoding, define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your `conf/base/catalog.yml`: - -```yaml -my_dataframe@spark: - type: spark.SparkDataSet - filepath: data/02_intermediate/data.parquet - file_format: parquet - -my_dataframe@pandas: - type: pandas.ParquetDataSet - filepath: data/02_intermediate/data.parquet -``` - -These entries are used in the pipeline like this: - -```python -pipeline( - [ - node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"), - node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"), - ] -) -``` - -### How does transcoding work? - -In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and helps resolve the node execution order. - -In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` -for loading, so the first node outputs a `pyspark.sql.DataFrame`, while the second node receives a `pandas.Dataframe`. - - -## How to access the Data Catalog in code +# Advanced: Access the Data Catalog in code You can define a Data Catalog in two ways. Most use cases can be through a YAML configuration file as [illustrated previously](./data_catalog.md), but it is possible to access the Data Catalog programmatically through [`kedro.io.DataCatalog`](/kedro.io.DataCatalog) using an API that allows you to configure data sources in code and use the IO module within notebooks. -### How to configure a Data Catalog using the `DataCatalog` API +## How to configure the Data Catalog -In a file like `catalog.py`, you can construct a `DataCatalog` object programmatically. In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets). +To use the `DataCatalog` API, construct a `DataCatalog` object programmatically in a file like `catalog.py`. 
+ +In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets). ```python from kedro.io import DataCatalog @@ -75,7 +35,7 @@ io = DataCatalog( When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above we pass it as part of `credentials` argument. Alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only). -### How to view the available data sources programmatically +## How to view the available data sources To review the `DataCatalog`: @@ -83,7 +43,7 @@ To review the `DataCatalog`: io.list() ``` -### How to load datasets programmatically +## How to load datasets programmatically To access each dataset by its name: @@ -99,15 +59,15 @@ The following steps happened behind the scenes when `load` was called: - The `load` method of this dataset was called - This `load` method delegated the loading to the underlying pandas `read_csv` function -### How to save data programmatically - -To save data using an API similar to that used to load data: +## How to save data programmatically ```{warning} -This use is not recommended unless you are prototyping in notebooks. +This pattern is not recommended unless you are using platform notebook environments (Sagemaker, Databricks etc) or writing unit/integration tests for your Kedro pipeline. Use the YAML approach in preference. ``` -#### How to save data to memory +### How to save data to memory + +To save data using an API similar to that used to load data: ```python from kedro.io import MemoryDataSet @@ -118,7 +78,7 @@ io.save("cars_cache", "Memory can store anything.") io.load("cars_cache") ``` -#### How to save data to a SQL database for querying +### How to save data to a SQL database for querying To put the data in a SQLite database: @@ -137,7 +97,7 @@ io.save("cars_table", cars) ranked = io.load("scooters_query")[["brand", "mpg"]] ``` -#### How to save data in Parquet +### How to save data in Parquet To save the processed data in Parquet format: @@ -149,7 +109,7 @@ io.save("ranked", ranked) Saving `None` to a dataset is not allowed! ``` -### How to access a dataset programmatically with credentials +## How to access a dataset with credentials Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument. Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents: @@ -177,7 +137,7 @@ CSVDataSet( ) ``` -### How to version a dataset using the Code API +## How to version a dataset using the Code API In an earlier section of the documentation we described how [Kedro enables dataset and ML model versioning](./data_catalog.md/#dataset-versioning). diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index 8605cb9d40..1380d878f3 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -1,5 +1,9 @@ # Introduction to the Data Catalog +In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. It is specified with a YAML catalog file that maps the names of node inputs and outputs as keys in the `DataCatalog` class. 
+ +This page introduces the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project. + ## The basics of `catalog.yml` A separate page of [Data Catalog YAML examples](./data_catalog_yaml_examples.md) gives further examples of how to work with `catalog.yml`, but here we revisit the [basic `catalog.yml` introduced by the spaceflights tutorial](../tutorial/set_up_data.md). @@ -140,7 +144,7 @@ kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ ``` where `--load-version` is dataset name and version timestamp separated by `:`. -A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataSet) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. In Kedro version 0.18.2, the following datasets support versioning: +A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataSet) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. In versions of Kedro from 0.18.2 onwards, the following datasets support versioning: - `kedro_datasets.api.APIDataSet` - `kedro_datasets.holoviews.HoloviewsWriter` @@ -161,7 +165,7 @@ A dataset offers versioning support if it extends the [`AbstractVersionedDataSet - `kedro_datasets.yaml.YAMLDataSet` ```{note} -Although HTTP(S) is a supported file system in the dataset implementations, it does not support versioning. +Note that HTTP(S) is a supported file system in the dataset implementations, but if you it, you can't also use versioning. ``` ## Use the Data Catalog within Kedro configuration diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md index 4c74162f9a..0570aa0f2c 100644 --- a/docs/source/data/data_catalog_yaml_examples.md +++ b/docs/source/data/data_catalog_yaml_examples.md @@ -2,29 +2,9 @@ This page contains a set of examples to help you structure your YAML configuration file in `conf/base/catalog.yml` or `conf/local/catalog.yml`. 
- - -* [Load data from a local binary file using utf-8 encoding](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Save data to a CSV file without row names (index) using utf-8 encoding](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load/save a CSV file from/to a local file system](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load/save a CSV on a local file system, using specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load/save a compressed CSV on a local file system](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load a CSV file from a specific S3 bucket, using credentials and load arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load/save a pickle file from/to a local file system](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load an Excel file from Google Cloud Storage](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load a multi-sheet Excel file from a local file system](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Save an image created with Matplotlib on Google Cloud Storage](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load/save an HDF file on local file system storage, using specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load/save a parquet file on local file system storage, using specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load/save a Spark table on S3, using specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load/save a SQL table using credentials, a database connection, and specified load/save arguments](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load a SQL table with credentials and a database connection, and apply a SQL query to the table](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load data from an API endpoint](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load data from Minio (S3 API Compatible Storage)](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load a model saved as a pickle from Azure Blob Storage](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load a CSV file stored in a remote location through SSH](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Load multiple datasets with similar configuration using YAML anchors](#load-data-from-a-local-binary-file-using-utf-8-encoding) -* [Create a Data Catalog YAML configuration file via the CLI](#load-data-from-a-local-binary-file-using-utf-8-encoding) +```{contents} Table of Contents +:depth: 3 +``` ## Load data from a local binary file using `utf-8` encoding @@ -380,6 +360,39 @@ airplanes: In this example, the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally, that would replace the whole dictionary. In order to extend `load_args`, the defaults for that block are then re-inserted. +## Read the same file using two different datasets + +You might come across a situation where you would like to read the same file using two different dataset implementations (known as transcoding). For example, Parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. 
+ +Define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your `conf/base/catalog.yml`: + +```yaml +my_dataframe@spark: + type: spark.SparkDataSet + filepath: data/02_intermediate/data.parquet + file_format: parquet + +my_dataframe@pandas: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/data.parquet +``` + +These entries are used in the pipeline like this: + +```python +pipeline( + [ + node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"), + node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"), + ] +) +``` + +In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and resolves the node execution order. + +In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` +for loading, so the first node outputs a `pyspark.sql.DataFrame`, while the second node receives a `pandas.Dataframe`. + ## Create a Data Catalog YAML configuration file via the CLI You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file). diff --git a/docs/source/data/how_to_create_a_custom_dataset.md b/docs/source/data/how_to_create_a_custom_dataset.md index fe7e6cecf8..34168cd72c 100644 --- a/docs/source/data/how_to_create_a_custom_dataset.md +++ b/docs/source/data/how_to_create_a_custom_dataset.md @@ -4,7 +4,7 @@ ## AbstractDataSet -For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataSet` interface](/kedro.io.AbstractDataSet), which is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation. +For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataSet` interface](/kedro.io.AbstractDataSet) or [`AbstractVersionedDataSet` interface](/kedro.io.AbstractVersionedDataSet) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation. ## Scenario diff --git a/docs/source/data/kedro_dataset_factories.md b/docs/source/data/kedro_dataset_factories.md index fe9687d825..93cf787648 100644 --- a/docs/source/data/kedro_dataset_factories.md +++ b/docs/source/data/kedro_dataset_factories.md @@ -3,7 +3,7 @@ You can load multiple datasets with similar configuration using dataset factorie The syntax allows you to generalise the configuration and reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns. -## Generalise datasets with similar names and types +## How to generalise datasets with similar names and types Consider the following catalog entries: @@ -30,7 +30,7 @@ When `factory_data` or `process_data` is used in your pipeline, it is matched to quotes to avoid YAML parsing errors. 
-## Generalise datasets of the same type +## How to generalise datasets of the same type You can also combine all the datasets with the same type and configuration details. For example, consider the following catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`: @@ -99,7 +99,7 @@ def create_pipeline(**kwargs) -> Pipeline: ] ) ``` -## Generalise datasets using namespaces +## How to generalise datasets using namespaces You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. Consider the following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the @@ -151,7 +151,7 @@ and `candidate_modelling_pipeline.regressor` as below: filepath: data/06_models/regressor_{namespace}.pkl versioned: true ``` -## Generalise datasets of the same type in different layers +## How to generalise datasets of the same type in different layers You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset entries share `type`, `file_format` and `save_args`: @@ -191,7 +191,7 @@ This could be generalised to the following pattern: ``` All the placeholders used in the catalog entry body must exist in the factory pattern name. -### Generalise datasets using multiple dataset factories +## How to generalise datasets using multiple dataset factories You can have multiple dataset factories in your catalog. For example: ```yaml @@ -213,7 +213,7 @@ The matches are ranked according to the following criteria: 2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`. 3. Alphabetical order -### Generalise all datasets with a catch-all dataset factory +## How to generalise all datasets with a catch-all dataset factory You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation. @@ -230,7 +230,7 @@ as `pandas.CSVDataSet`. To manage your dataset factories, two new commands have been added to the Kedro CLI: `kedro catalog rank` (0.18.12) and `kedro catalog resolve` (0.18.13). -#### How to use `kedro catalog rank` +### How to use `kedro catalog rank` This command outputs a list of all dataset factories in the catalog, ranked in the order by which pipeline datasets are matched against them. The ordering is determined by the following criteria: @@ -283,7 +283,7 @@ Running `kedro catalog rank` will result in the following output: As we can see, the entries are ranked firstly by how many non-placeholders are in the pattern, in descending order. Where two entries have the same number of non-placeholder characters, `{namespace}.{dataset_name}_pq` and `{dataset_name}_csv` with four each, they are then ranked by the number of placeholders, also in decreasing order. `{default_dataset}` is the least specific pattern possible, and will always be matched against last. -#### How to use `kedro catalog resolve` +### How to use `kedro catalog resolve` This command resolves dataset patterns in the catalog against any explicit dataset entries in the project pipeline. The resulting output contains all explicit dataset entries in the catalog and any dataset in the default pipeline that resolves some dataset pattern. diff --git a/docs/source/faq/faq.md b/docs/source/faq/faq.md index 754fb34e19..23cfa6b094 100644 --- a/docs/source/faq/faq.md +++ b/docs/source/faq/faq.md @@ -39,9 +39,6 @@ This is a growing set of technical FAQs. 
The [product FAQs on the Kedro website] * [How do I use resolvers in the `OmegaConfigLoader`](../configuration/advanced_configuration.md#how-to-use-resolvers-in-the-omegaconfigloader)? * [How do I load credentials through environment variables](../configuration/advanced_configuration.md#how-to-load-credentials-through-environment-variables)? -## Datasets and the Data Catalog - -* [Can I read the same data file using two different dataset implementations](../data/advanced_data_catalog_usage.md#how-to-read-the-same-file-using-two-different-datasets)? ## Nodes and pipelines From 352a7c757668f5ce7a77196d5bde8fa5a02d206c Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Thu, 17 Aug 2023 15:24:08 +0100 Subject: [PATCH 12/19] A few more tweaks Signed-off-by: Jo Stichbury --- RELEASE.md | 2 ++ docs/source/data/data_catalog.md | 22 +++------------------- 2 files changed, 5 insertions(+), 19 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index ea0fce323a..603cb61f46 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -18,6 +18,8 @@ * Updated `kedro pipeline create` and `kedro catalog create` to use new `/conf` file structure. ## Documentation changes +* Revised the `data` section to restructure beginner and advanced pages about the Data Catalog and datasets. +* Moved contributor documentation to the [GitHub wiki](https://github.com/kedro-org/kedro/wiki/Contribute-to-Kedro). * Update example of using generator functions in nodes. * Added migration guide from the `ConfigLoader` to the `OmegaConfigLoader`. The `ConfigLoader` is deprecated and will be removed in the `0.19.0` release. diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index 1380d878f3..d1544b9a54 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -144,25 +144,9 @@ kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ ``` where `--load-version` is dataset name and version timestamp separated by `:`. -A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataSet) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. In versions of Kedro from 0.18.2 onwards, the following datasets support versioning: - -- `kedro_datasets.api.APIDataSet` -- `kedro_datasets.holoviews.HoloviewsWriter` -- `kedro_datasets.json.JSONDataSet` -- `kedro_datasets.matplotlib.MatplotlibWriter` -- `kedro_datasets.networkx.NetworkXDataSet` -- `kedro_datasets.pandas.CSVDataSet` -- `kedro_datasets.pandas.ExcelDataSet` -- `kedro_datasets.pandas.FeatherDataSet` -- `kedro_datasets.pandas.HDFDataSet` -- `kedro_datasets.pandas.JSONDataSet` -- `kedro_datasets.pandas.ParquetDataSet` -- `kedro_datasets.pickle.PickleDataSet` -- `kedro_datasets.pillow.ImageDataSet` -- `kedro_datasets.tensorflow.TensorFlowModelDataSet` -- `kedro_datasets.text.TextDataSet` -- `kedro_datasets.spark.SparkDataSet` -- `kedro_datasets.yaml.YAMLDataSet` +A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataSet) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. 
+ +To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataSet`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning. ```{note} Note that HTTP(S) is a supported file system in the dataset implementations, but if you it, you can't also use versioning. From 2d5840e55e1a9d3747505ddcd72899e5a1aa1c88 Mon Sep 17 00:00:00 2001 From: Tynan DeBold Date: Thu, 17 Aug 2023 11:48:15 +0100 Subject: [PATCH 13/19] Update h1,h2,h3 font sizes Signed-off-by: Tynan DeBold --- docs/source/_static/css/qb1-sphinx-rtd.css | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/_static/css/qb1-sphinx-rtd.css b/docs/source/_static/css/qb1-sphinx-rtd.css index 3f11d0ceee..fa58317d22 100644 --- a/docs/source/_static/css/qb1-sphinx-rtd.css +++ b/docs/source/_static/css/qb1-sphinx-rtd.css @@ -321,16 +321,16 @@ h1, h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend { } .wy-body-for-nav h1 { - font-size: 2.6rem; + font-size: 2.6rem !important; letter-spacing: -0.3px; } .wy-body-for-nav h2 { - font-size: 2.3rem; + font-size: 2rem; } .wy-body-for-nav h3 { - font-size: 2.1rem; + font-size: 2rem; } .wy-body-for-nav h4 { From d266f938bf67f4e33167b5e1e715a38ab96bc714 Mon Sep 17 00:00:00 2001 From: Ankita Katiyar Date: Thu, 17 Aug 2023 17:33:31 +0100 Subject: [PATCH 14/19] Add code snippet for using DataCatalog with Kedro config Signed-off-by: Ankita Katiyar --- docs/source/data/data_catalog.md | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index d1544b9a54..eb1ec9561d 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -160,4 +160,16 @@ By default, Kedro has a `base` and a `local` folder for configuration. The Data In summary, if you need to configure your datasets for different environments, you can create both `conf/base/catalog.yml` and `conf/local/catalog.yml`. For instance, you can use the `catalog.yml` file in `conf/base/` to register the locations of datasets that would run in production, while adding a second version of `catalog.yml` in `conf/local/` to register the locations of sample datasets while you are using them for prototyping data pipeline(s). -To illustrate this, if you include a dataset called `cars` in `catalog.yml` stored in both `conf/base` and `conf/local`, your pipeline code would use the `cars` dataset and rely on Kedro to detect which definition of `cars` dataset to use in your pipeline. 
+To illustrate this, consider the following catalog entry for a dataset named `cars` in your `conf/base/catalog.yml` which points to a csv file stored in your bucket on AWS S3 : +```yaml +cars: + filepath: s3://my_bucket/cars.csv + type: pandas.CSVDataSet + ``` +You can overwrite this catalog entry in `conf/local/catalog.yml` to point to a locally stored file instead: +```yaml +cars: + filepath: data/01_raw/cars.csv + type: pandas.CSVDataSet +``` +In your pipeline code, when the `cars` dataset is used, it will use the overwritten catalog entry from `conf/local/catalog.yml` which points to the local file.` dataset and rely on Kedro to detect which definition of `cars` dataset to use in your pipeline. From 2a7cacd98411d40b9154c13038762c357fd34e55 Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Thu, 17 Aug 2023 17:43:57 +0100 Subject: [PATCH 15/19] Few more tweaks Signed-off-by: Jo Stichbury --- docs/build-docs.sh | 4 ++-- docs/source/data/advanced_data_catalog_usage.md | 15 +++++++++++---- docs/source/data/kedro_dataset_factories.md | 4 ++-- 3 files changed, 15 insertions(+), 8 deletions(-) diff --git a/docs/build-docs.sh b/docs/build-docs.sh index eb64351b4f..d55076e118 100755 --- a/docs/build-docs.sh +++ b/docs/build-docs.sh @@ -8,7 +8,7 @@ set -o nounset action=$1 if [ "$action" == "linkcheck" ]; then - sphinx-build -ETan -j auto -D language=en -b linkcheck -d docs/build/doctrees docs/source docs/build/linkcheck + sphinx-build -WETan -j auto -D language=en -b linkcheck -d docs/build/doctrees docs/source docs/build/linkcheck elif [ "$action" == "docs" ]; then - sphinx-build -ETa -j auto -D language=en -b html -d docs/build/doctrees docs/source docs/build/html + sphinx-build -WETa -j auto -D language=en -b html -d docs/build/doctrees docs/source docs/build/html fi diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md index 70465022bd..03670eaac7 100644 --- a/docs/source/data/advanced_data_catalog_usage.md +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -195,14 +195,21 @@ assert data1.equals(reloaded) io.save("test_data_set", data2) ``` -```{warning} -We do not recommend passing exact load and/or save versions, since it might lead to inconsistencies between operations. For example, if versions for load and save operations do not match, a save operation would result in a `UserWarning` indicating that save and load versions do not match. Load after save might also return an error if the corresponding load version is not found: +We do not recommend passing exact load and/or save versions, since it might lead to inconsistencies between operations. For example, if versions for load and save operations do not match, a save operation would result in a `UserWarning`. + +Imagine a simple pipeline with two nodes, where B takes the output from A. If you specify the load-version of the data for B to be `my_data_2023_08_16.csv`, the data that A produces (`my_data_20230818.csv`) is not used. 
+ +```text +Node_A -> my_data_20230818.csv +my_data_2023_08_16.csv -> Node B ``` +In code: + ```python version = Version( - load="exact_load_version", # load exact version - save="exact_save_version", # save to exact version + load="my_data_2023_08_16.csv", # load exact version + save="my_data_20230818.csv", # save to exact version ) test_data_set = CSVDataSet( diff --git a/docs/source/data/kedro_dataset_factories.md b/docs/source/data/kedro_dataset_factories.md index 93cf787648..693272c013 100644 --- a/docs/source/data/kedro_dataset_factories.md +++ b/docs/source/data/kedro_dataset_factories.md @@ -213,9 +213,9 @@ The matches are ranked according to the following criteria: 2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`. 3. Alphabetical order -## How to generalise all datasets with a catch-all dataset factory +## How to override the default dataset creation with dataset factories -You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation. +You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataSet`](/kedro.io.MemoryDataset) creation. ```yaml "{default_dataset}": From 4358d44799bed422711294b0bb2b3b4cf7f7a10a Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Thu, 17 Aug 2023 18:00:31 +0100 Subject: [PATCH 16/19] Update docs/source/data/data_catalog.md --- docs/source/data/data_catalog.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index eb1ec9561d..deac0b2779 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -160,7 +160,7 @@ By default, Kedro has a `base` and a `local` folder for configuration. The Data In summary, if you need to configure your datasets for different environments, you can create both `conf/base/catalog.yml` and `conf/local/catalog.yml`. For instance, you can use the `catalog.yml` file in `conf/base/` to register the locations of datasets that would run in production, while adding a second version of `catalog.yml` in `conf/local/` to register the locations of sample datasets while you are using them for prototyping data pipeline(s). 
-To illustrate this, consider the following catalog entry for a dataset named `cars` in your `conf/base/catalog.yml` which points to a csv file stored in your bucket on AWS S3 : +To illustrate this, consider the following catalog entry for a dataset named `cars` in `conf/base/catalog.yml`, which points to a csv file stored in a bucket on AWS S3: ```yaml cars: filepath: s3://my_bucket/cars.csv From 28606a00131dfab178e4c6c90d256cf1dd0f3e23 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Juan=20Luis=20Cano=20Rodr=C3=ADguez?= Date: Fri, 18 Aug 2023 13:08:47 +0200 Subject: [PATCH 17/19] Upgrade kedro-datasets for docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Juan Luis Cano Rodríguez --- setup.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/setup.py b/setup.py index e78ea817a7..8d94b9c965 100644 --- a/setup.py +++ b/setup.py @@ -97,7 +97,7 @@ def _collect_requirements(requires): "sphinxcontrib-mermaid~=0.7.1", "myst-parser~=1.0.0", "Jinja2<3.1.0", - "kedro-datasets[all,pandas-deltatabledataset]~=1.5.1", + "kedro-datasets[all]~=1.5.3", ], "geopandas": _collect_requirements(geopandas_require), "matplotlib": _collect_requirements(matplotlib_require), From 954b82a2e6fa76b6d6bf7442ee7625c223367c43 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Juan=20Luis=20Cano=20Rodr=C3=ADguez?= Date: Fri, 18 Aug 2023 13:12:12 +0200 Subject: [PATCH 18/19] Improve prose MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Juan Luis Cano Rodríguez Co-authored-by: Jo Stichbury --- docs/source/data/data_catalog.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index deac0b2779..be79f25dfd 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -172,4 +172,4 @@ cars: filepath: data/01_raw/cars.csv type: pandas.CSVDataSet ``` -In your pipeline code, when the `cars` dataset is used, it will use the overwritten catalog entry from `conf/local/catalog.yml` which points to the local file.` dataset and rely on Kedro to detect which definition of `cars` dataset to use in your pipeline. +In your pipeline code, when the `cars` dataset is used, it will use the overwritten catalog entry from `conf/local/catalog.yml` and rely on Kedro to detect which definition of `cars` dataset to use in your pipeline. From 11f6760d38dac8455083f21daaadc7be1f93d0b0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Juan=20Luis=20Cano=20Rodr=C3=ADguez?= Date: Fri, 18 Aug 2023 13:22:24 +0200 Subject: [PATCH 19/19] Typos MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Juan Luis Cano Rodríguez --- docs/source/data/data_catalog.md | 2 +- docs/source/data/how_to_create_a_custom_dataset.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index be79f25dfd..680db626f7 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -144,7 +144,7 @@ kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ ``` where `--load-version` is dataset name and version timestamp separated by `:`. 
-A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataSet) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. +A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataset) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataSet`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning. diff --git a/docs/source/data/how_to_create_a_custom_dataset.md b/docs/source/data/how_to_create_a_custom_dataset.md index 34168cd72c..86010b4f18 100644 --- a/docs/source/data/how_to_create_a_custom_dataset.md +++ b/docs/source/data/how_to_create_a_custom_dataset.md @@ -4,7 +4,7 @@ ## AbstractDataSet -For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataSet` interface](/kedro.io.AbstractDataSet) or [`AbstractVersionedDataSet` interface](/kedro.io.AbstractVersionedDataSet) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation. +For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataSet` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataSet` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation. ## Scenario @@ -309,7 +309,7 @@ Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at ``` To add versioning support to the new dataset we need to extend the - [AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataSet) to: + [AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataset) to: * Accept a `version` keyword argument as part of the constructor * Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively
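For orientation, the two bullet points above can be sketched for the tutorial's `ImageDataSet` roughly as follows. This is a minimal illustration rather than the full implementation from the guide: it assumes the fsspec/Pillow-based `ImageDataSet` developed earlier on that page, and it simplifies the image handling.

```python
from pathlib import PurePosixPath

import fsspec
import numpy as np
from PIL import Image

from kedro.io import AbstractVersionedDataSet, Version
from kedro.io.core import get_filepath_str, get_protocol_and_path


class ImageDataSet(AbstractVersionedDataSet[np.ndarray, np.ndarray]):
    """Loads and saves image data as numpy arrays, with versioning support."""

    def __init__(self, filepath: str, version: Version = None):
        # Work out the protocol (e.g. s3, file) and set up an fsspec filesystem
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._fs = fsspec.filesystem(self._protocol)
        # Accept the `version` keyword argument and hand it to the parent class,
        # together with the functions it needs to resolve versioned paths
        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )

    def _load(self) -> np.ndarray:
        # `_get_load_path` returns the versioned path to read from
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        with self._fs.open(load_path, mode="rb") as f:
            return np.asarray(Image.open(f))

    def _save(self, data: np.ndarray) -> None:
        # `_get_save_path` returns the versioned path to write to;
        # PIL infers the image format from the file extension
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        with self._fs.open(save_path, mode="wb") as f:
            Image.fromarray(data).save(f)

    def _describe(self) -> dict:
        return dict(
            filepath=self._filepath, protocol=self._protocol, version=self._version
        )
```

Passing `exists_function` and `glob_function` to `super().__init__()` is what lets the base class discover existing save versions and generate versioned paths on non-local filesystems.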