From c45e629ff308f4ac76cb57cf35291d3855eba4b1 Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Fri, 18 Aug 2023 13:12:01 +0100 Subject: [PATCH] Reorganise and improve the data catalog documentation (#2888) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * First drop of newly organised data catalog docs Signed-off-by: Jo Stichbury * linter Signed-off-by: Jo Stichbury * Added to-do notes Signed-off-by: Jo Stichbury * Afternoon's work in rewriting/reorganising content Signed-off-by: Jo Stichbury * More changes Signed-off-by: Jo Stichbury * Further changes Signed-off-by: Jo Stichbury * Another chunk of changes Signed-off-by: Jo Stichbury * Final changes Signed-off-by: Jo Stichbury * Revise ordering of pages Signed-off-by: Jo Stichbury * Add new CLI commands to dataset factory docs (#2935) * Add changes from #2930 Signed-off-by: Ahdra Merali * Lint Signed-off-by: Ahdra Merali * Apply suggestions from code review Co-authored-by: Jo Stichbury * Make code snippets collapsable Signed-off-by: Ahdra Merali --------- Signed-off-by: Ahdra Merali Co-authored-by: Jo Stichbury * Bunch of changes from feedback Signed-off-by: Jo Stichbury * A few more tweaks Signed-off-by: Jo Stichbury * Update h1,h2,h3 font sizes Signed-off-by: Tynan DeBold * Add code snippet for using DataCatalog with Kedro config Signed-off-by: Ankita Katiyar * Few more tweaks Signed-off-by: Jo Stichbury * Update docs/source/data/data_catalog.md * Upgrade kedro-datasets for docs Signed-off-by: Juan Luis Cano Rodríguez * Improve prose Signed-off-by: Juan Luis Cano Rodríguez Co-authored-by: Jo Stichbury --------- Signed-off-by: Jo Stichbury Signed-off-by: Ahdra Merali Signed-off-by: Tynan DeBold Signed-off-by: Ankita Katiyar Signed-off-by: Juan Luis Cano Rodríguez Co-authored-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com> Co-authored-by: Tynan DeBold Co-authored-by: Ankita Katiyar Co-authored-by: Juan Luis Cano Rodríguez --- RELEASE.md | 2 + docs/source/_static/css/qb1-sphinx-rtd.css | 6 +- docs/source/configuration/credentials.md | 2 +- .../data/advanced_data_catalog_usage.md | 225 +++++ docs/source/data/data_catalog.md | 819 ++---------------- .../source/data/data_catalog_yaml_examples.md | 408 +++++++++ .../how_to_create_a_custom_dataset.md} | 21 +- docs/source/data/index.md | 45 +- docs/source/data/kedro_dataset_factories.md | 385 ++++++++ ...> partitioned_and_incremental_datasets.md} | 274 +----- docs/source/deployment/argo.md | 2 +- docs/source/deployment/aws_batch.md | 2 +- .../databricks_deployment_workflow.md | 2 +- .../databricks_ide_development_workflow.md | 2 +- docs/source/development/commands_reference.md | 2 +- docs/source/experiment_tracking/index.md | 2 +- docs/source/extend_kedro/common_use_cases.md | 2 +- docs/source/extend_kedro/index.md | 1 - docs/source/faq/faq.md | 3 - docs/source/nodes_and_pipelines/nodes.md | 2 +- .../kedro_and_notebooks.md | 2 +- docs/source/tutorial/add_another_pipeline.md | 2 +- docs/source/tutorial/set_up_data.md | 2 +- setup.py | 2 +- 24 files changed, 1187 insertions(+), 1028 deletions(-) create mode 100644 docs/source/data/advanced_data_catalog_usage.md create mode 100644 docs/source/data/data_catalog_yaml_examples.md rename docs/source/{extend_kedro/custom_datasets.md => data/how_to_create_a_custom_dataset.md} (93%) create mode 100644 docs/source/data/kedro_dataset_factories.md rename docs/source/data/{kedro_io.md => partitioned_and_incremental_datasets.md} (62%) diff --git a/RELEASE.md b/RELEASE.md index ea0fce323a..603cb61f46 
100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -18,6 +18,8 @@ * Updated `kedro pipeline create` and `kedro catalog create` to use new `/conf` file structure. ## Documentation changes +* Revised the `data` section to restructure beginner and advanced pages about the Data Catalog and datasets. +* Moved contributor documentation to the [GitHub wiki](https://github.com/kedro-org/kedro/wiki/Contribute-to-Kedro). * Update example of using generator functions in nodes. * Added migration guide from the `ConfigLoader` to the `OmegaConfigLoader`. The `ConfigLoader` is deprecated and will be removed in the `0.19.0` release. diff --git a/docs/source/_static/css/qb1-sphinx-rtd.css b/docs/source/_static/css/qb1-sphinx-rtd.css index 3f11d0ceee..fa58317d22 100644 --- a/docs/source/_static/css/qb1-sphinx-rtd.css +++ b/docs/source/_static/css/qb1-sphinx-rtd.css @@ -321,16 +321,16 @@ h1, h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend { } .wy-body-for-nav h1 { - font-size: 2.6rem; + font-size: 2.6rem !important; letter-spacing: -0.3px; } .wy-body-for-nav h2 { - font-size: 2.3rem; + font-size: 2rem; } .wy-body-for-nav h3 { - font-size: 2.1rem; + font-size: 2rem; } .wy-body-for-nav h4 { diff --git a/docs/source/configuration/credentials.md b/docs/source/configuration/credentials.md index 620fb569ac..0d91da9cbc 100644 --- a/docs/source/configuration/credentials.md +++ b/docs/source/configuration/credentials.md @@ -3,7 +3,7 @@ For security reasons, we strongly recommend that you *do not* commit any credentials or other secrets to version control. Kedro is set up so that, by default, if a file inside the `conf` folder (and its subfolders) contains `credentials` in its name, it will be ignored by git. -Credentials configuration can be used on its own directly in code or [fed into the `DataCatalog`](../data/data_catalog.md#feeding-in-credentials). +Credentials configuration can be used on its own directly in code or [fed into the `DataCatalog`](../data/data_catalog.md#dataset-access-credentials). If you would rather store your credentials in environment variables instead of a file, you can use the `OmegaConfigLoader` [to load credentials from environment variables](advanced_configuration.md#how-to-load-credentials-through-environment-variables) as described in the advanced configuration chapter. ## How to load credentials in code diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md new file mode 100644 index 0000000000..03670eaac7 --- /dev/null +++ b/docs/source/data/advanced_data_catalog_usage.md @@ -0,0 +1,225 @@ +# Advanced: Access the Data Catalog in code + +You can define a Data Catalog in two ways. Most use cases can be through a YAML configuration file as [illustrated previously](./data_catalog.md), but it is possible to access the Data Catalog programmatically through [`kedro.io.DataCatalog`](/kedro.io.DataCatalog) using an API that allows you to configure data sources in code and use the IO module within notebooks. + +## How to configure the Data Catalog + +To use the `DataCatalog` API, construct a `DataCatalog` object programmatically in a file like `catalog.py`. + +In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets). 
+ +```python +from kedro.io import DataCatalog +from kedro_datasets.pandas import ( + CSVDataSet, + SQLTableDataSet, + SQLQueryDataSet, + ParquetDataSet, +) + +io = DataCatalog( + { + "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"), + "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")), + "cars_table": SQLTableDataSet( + table_name="cars", credentials=dict(con="sqlite:///kedro.db") + ), + "scooters_query": SQLQueryDataSet( + sql="select * from cars where gear=4", + credentials=dict(con="sqlite:///kedro.db"), + ), + "ranked": ParquetDataSet(filepath="ranked.parquet"), + } +) +``` + +When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above we pass it as part of `credentials` argument. Alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only). + +## How to view the available data sources + +To review the `DataCatalog`: + +```python +io.list() +``` + +## How to load datasets programmatically + +To access each dataset by its name: + +```python +cars = io.load("cars") # data is now loaded as a DataFrame in 'cars' +gear = cars["gear"].values +``` + +The following steps happened behind the scenes when `load` was called: + +- The value `cars` was located in the Data Catalog +- The corresponding `AbstractDataSet` object was retrieved +- The `load` method of this dataset was called +- This `load` method delegated the loading to the underlying pandas `read_csv` function + +## How to save data programmatically + +```{warning} +This pattern is not recommended unless you are using platform notebook environments (Sagemaker, Databricks etc) or writing unit/integration tests for your Kedro pipeline. Use the YAML approach in preference. +``` + +### How to save data to memory + +To save data using an API similar to that used to load data: + +```python +from kedro.io import MemoryDataSet + +memory = MemoryDataSet(data=None) +io.add("cars_cache", memory) +io.save("cars_cache", "Memory can store anything.") +io.load("cars_cache") +``` + +### How to save data to a SQL database for querying + +To put the data in a SQLite database: + +```python +import os + +# This cleans up the database in case it exists at this point +try: + os.remove("kedro.db") +except FileNotFoundError: + pass + +io.save("cars_table", cars) + +# rank scooters by their mpg +ranked = io.load("scooters_query")[["brand", "mpg"]] +``` + +### How to save data in Parquet + +To save the processed data in Parquet format: + +```python +io.save("ranked", ranked) +``` + +```{warning} +Saving `None` to a dataset is not allowed! +``` + +## How to access a dataset with credentials +Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument. 
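+
+For illustration, here is a minimal sketch of that call, assuming the working directory is the root of a Kedro project and using `ConfigLoader` (substitute the configuration loader your project is set up with):
+
+```python
+from pathlib import Path
+
+from kedro.config import ConfigLoader
+from kedro.framework.project import settings
+from kedro.io import DataCatalog
+
+# Assumption: the current working directory is the Kedro project root
+conf_path = str(Path.cwd() / settings.CONF_SOURCE)
+conf_loader = ConfigLoader(conf_source=conf_path)
+
+# Read the catalog and credentials configuration (conf/base first, then conf/local)
+conf_catalog = conf_loader["catalog"]
+conf_credentials = conf_loader["credentials"]
+
+# Pass the credentials dictionary to the datasets that reference it
+io = DataCatalog.from_config(conf_catalog, credentials=conf_credentials)
+```
+
+You can then call `io.load(...)` and `io.save(...)` exactly as shown earlier on this page.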
+ +Let's assume that the project contains the file `conf/local/credentials.yml` with the following contents: + +```yaml +dev_s3: + client_kwargs: + aws_access_key_id: key + aws_secret_access_key: secret + +scooters_credentials: + con: sqlite:///kedro.db + +my_gcp_credentials: + id_token: key +``` + +Your code will look as follows: + +```python +CSVDataSet( + filepath="s3://test_bucket/data/02_intermediate/company/motorbikes.csv", + load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]), + credentials=dict(key="token", secret="key"), +) +``` + +## How to version a dataset using the Code API + +In an earlier section of the documentation we described how [Kedro enables dataset and ML model versioning](./data_catalog.md/#dataset-versioning). + +If you require programmatic control over load and save versions of a specific dataset, you can instantiate `Version` and pass it as a parameter to the dataset initialisation: + +```python +from kedro.io import DataCatalog, Version +from kedro_datasets.pandas import CSVDataSet +import pandas as pd + +data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]}) +data2 = pd.DataFrame({"col1": [7], "col2": [8], "col3": [9]}) +version = Version( + load=None, # load the latest available version + save=None, # generate save version automatically on each save operation +) + +test_data_set = CSVDataSet( + filepath="data/01_raw/test.csv", save_args={"index": False}, version=version +) +io = DataCatalog({"test_data_set": test_data_set}) + +# save the dataset to data/01_raw/test.csv//test.csv +io.save("test_data_set", data1) +# save the dataset into a new file data/01_raw/test.csv//test.csv +io.save("test_data_set", data2) + +# load the latest version from data/test.csv/*/test.csv +reloaded = io.load("test_data_set") +assert data2.equals(reloaded) +``` + +In the example above, we do not fix any versions. The behaviour of load and save operations becomes slightly different when we set a version: + + +```python +version = Version( + load="my_exact_version", # load exact version + save="my_exact_version", # save to exact version +) + +test_data_set = CSVDataSet( + filepath="data/01_raw/test.csv", save_args={"index": False}, version=version +) +io = DataCatalog({"test_data_set": test_data_set}) + +# save the dataset to data/01_raw/test.csv/my_exact_version/test.csv +io.save("test_data_set", data1) +# load from data/01_raw/test.csv/my_exact_version/test.csv +reloaded = io.load("test_data_set") +assert data1.equals(reloaded) + +# raises DataSetError since the path +# data/01_raw/test.csv/my_exact_version/test.csv already exists +io.save("test_data_set", data2) +``` + +We do not recommend passing exact load and/or save versions, since it might lead to inconsistencies between operations. For example, if versions for load and save operations do not match, a save operation would result in a `UserWarning`. + +Imagine a simple pipeline with two nodes, where B takes the output from A. If you specify the load-version of the data for B to be `my_data_2023_08_16.csv`, the data that A produces (`my_data_20230818.csv`) is not used. 
+ +```text +Node_A -> my_data_20230818.csv +my_data_2023_08_16.csv -> Node B +``` + +In code: + +```python +version = Version( + load="my_data_2023_08_16.csv", # load exact version + save="my_data_20230818.csv", # save to exact version +) + +test_data_set = CSVDataSet( + filepath="data/01_raw/test.csv", save_args={"index": False}, version=version +) +io = DataCatalog({"test_data_set": test_data_set}) + +io.save("test_data_set", data1) # emits a UserWarning due to version inconsistency + +# raises DataSetError since the data/01_raw/test.csv/exact_load_version/test.csv +# file does not exist +reloaded = io.load("test_data_set") +``` diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md index fb1f7ac3dc..680db626f7 100644 --- a/docs/source/data/data_catalog.md +++ b/docs/source/data/data_catalog.md @@ -1,31 +1,38 @@ -# The Data Catalog +# Introduction to the Data Catalog -This section introduces `catalog.yml`, the project-shareable Data Catalog. The file is located in `conf/base` and is a registry of all data sources available for use by a project; it manages loading and saving of data. +In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. It is specified with a YAML catalog file that maps the names of node inputs and outputs as keys in the `DataCatalog` class. -All supported data connectors are available in [`kedro-datasets`](/kedro_datasets). +This page introduces the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project. -## Use the Data Catalog within Kedro configuration +## The basics of `catalog.yml` +A separate page of [Data Catalog YAML examples](./data_catalog_yaml_examples.md) gives further examples of how to work with `catalog.yml`, but here we revisit the [basic `catalog.yml` introduced by the spaceflights tutorial](../tutorial/set_up_data.md). -Kedro uses configuration to make your code reproducible when it has to reference datasets in different locations and/or in different environments. +The example below registers two `csv` datasets, and an `xlsx` dataset. The minimum details needed to load and save a file within a local file system are the key, which is name of the dataset, the type of data to indicate the dataset to use (`type`) and the file's location (`filepath`). -You can copy this file and reference additional locations for the same datasets. For instance, you can use the `catalog.yml` file in `conf/base/` to register the locations of datasets that would run in production, while copying and updating a second version of `catalog.yml` in `conf/local/` to register the locations of sample datasets that you are using for prototyping your data pipeline(s). +```yaml +companies: + type: pandas.CSVDataSet + filepath: data/01_raw/companies.csv -Built-in functionality for `conf/local/` to overwrite `conf/base/` is [described in the documentation about configuration](../configuration/configuration_basics.md). This means that a dataset called `cars` could exist in the `catalog.yml` files in `conf/base/` and `conf/local/`. In code, in `src`, you would only call a dataset named `cars` and Kedro would detect which definition of `cars` dataset to use to run your pipeline - `cars` definition from `conf/local/catalog.yml` would take precedence in this case. 
+reviews: + type: pandas.CSVDataSet + filepath: data/01_raw/reviews.csv -The Data Catalog also works with the `credentials.yml` file in `conf/local/`, allowing you to specify usernames and passwords required to load certain datasets. +shuttles: + type: pandas.ExcelDataSet + filepath: data/01_raw/shuttles.xlsx + load_args: + engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0) +``` +### Dataset `type` -You can define a Data Catalog in two ways - through YAML configuration, or programmatically using an API. Both methods allow you to specify: +Kedro offers a range of datasets, including CSV, Excel, Parquet, Feather, HDF5, JSON, Pickle, SQL Tables, SQL Queries, Spark DataFrames and more. They are supported with the APIs of pandas, spark, networkx, matplotlib, yaml and more. - - Dataset name - - Dataset type - - Location of the dataset using `fsspec`, detailed in the next section - - Credentials needed to access the dataset - - Load and saving arguments - - Whether you want a [dataset or ML model to be versioned](kedro_io.md#versioning) when you run your data pipeline +[The `kedro-datasets` package documentation](/kedro_datasets) contains a comprehensive list of all available file types. -## Specify the location of the dataset +### Dataset `filepath` -Kedro relies on [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. When specifying a storage location in `filepath:`, you should provide a URL using the general form `protocol://path/to/data`. If no protocol is provided, the local file system is assumed (same as ``file://``). +Kedro relies on [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. When specifying a storage location in `filepath:`, you should provide a URL using the general form `protocol://path/to/data`. If no protocol is provided, the local file system is assumed (which is the same as ``file://``). The following prepends are available: @@ -41,65 +48,17 @@ The following prepends are available: `fsspec` also provides other file systems, such as SSH, FTP and WebHDFS. [See the fsspec documentation for more information](https://filesystem-spec.readthedocs.io/en/latest/api.html#implementations). -## Data Catalog `*_args` parameters - -Data Catalog accepts two different groups of `*_args` parameters that serve different purposes: -- `fs_args` -- `load_args` and `save_args` - -The `fs_args` is used to configure the interaction with a filesystem. -All the top-level parameters of `fs_args` (except `open_args_load` and `open_args_save`) will be passed in an underlying filesystem class. - -### Example 1: Provide the `project` value to the underlying filesystem class (`GCSFileSystem`) to interact with Google Cloud Storage (GCS) - -```yaml -test_dataset: - type: ... - fs_args: - project: test_project -``` -The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively. +## Additional settings in `catalog.yml` -### Example 2: Load data from a local binary file using `utf-8` encoding +This section explains the additional settings available within `catalog.yml`. -```yaml -test_dataset: - type: ... 
- fs_args: - open_args_load: - mode: "rb" - encoding: "utf-8" -``` +### Load and save arguments +The Kedro Data Catalog also accepts two different groups of `*_args` parameters that serve different purposes: -`load_args` and `save_args` configure how a third-party library (e.g. `pandas` for `CSVDataSet`) loads/saves data from/to a file. +* **`load_args` and `save_args`**: Configures how a third-party library loads/saves data from/to a file. In the spaceflights example above, `load_args`, is passed to the excel file read method (`pd.read_excel`) as a [keyword argument](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html). Although not specified here, the equivalent output is `save_args` and the value would be passed to [`pd.DataFrame.to_excel` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html). -### Example 3: Save data to a CSV file without row names (index) using `utf-8` encoding - -```yaml -test_dataset: - type: pandas.CSVDataSet - ... - save_args: - index: False - encoding: "utf-8" -``` - -## Use the Data Catalog with the YAML API - -The YAML API allows you to configure your datasets in a YAML configuration file, `conf/base/catalog.yml` or `conf/local/catalog.yml`. - -Here are some examples of data configuration in a `catalog.yml`: - -### Example 1: Loads / saves a CSV file from / to a local file system - -```yaml -bikes: - type: pandas.CSVDataSet - filepath: data/01_raw/bikes.csv -``` - -### Example 2: Loads and saves a CSV on a local file system, using specified load and save arguments +For example, to load or save a CSV on a local file system, using specified load/save arguments: ```yaml cars: @@ -111,270 +70,35 @@ cars: index: False date_format: '%Y-%m-%d %H:%M' decimal: . - ``` -### Example 3: Loads and saves a compressed CSV on a local file system +* **`fs_args`**: Configures the interaction with a filesystem. +All the top-level parameters of `fs_args` (except `open_args_load` and `open_args_save`) will be passed to an underlying filesystem class. -```yaml -boats: - type: pandas.CSVDataSet - filepath: data/01_raw/company/boats.csv.gz - load_args: - sep: ',' - compression: 'gzip' - fs_args: - open_args_load: - mode: 'rb' -``` - -### Example 4: Loads a CSV file from a specific S3 bucket, using credentials and load arguments +For example, to provide the `project` value to the underlying filesystem class (`GCSFileSystem`) to interact with Google Cloud Storage: ```yaml -motorbikes: - type: pandas.CSVDataSet - filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv - credentials: dev_s3 - load_args: - sep: ',' - skiprows: 5 - skipfooter: 1 - na_values: ['#NA', NA] -``` - -### Example 5: Loads / saves a pickle file from / to a local file system - -```yaml -airplanes: - type: pickle.PickleDataSet - filepath: data/06_models/airplanes.pkl - backend: pickle -``` - -### Example 6: Loads an Excel file from Google Cloud Storage - -```yaml -rockets: - type: pandas.ExcelDataSet - filepath: gcs://your_bucket/data/02_intermediate/company/motorbikes.xlsx +test_dataset: + type: ... 
fs_args: - project: my-project - credentials: my_gcp_credentials - save_args: - sheet_name: Sheet1 + project: test_project ``` -### Example 7: Loads a multi-sheet Excel file from a local file system - -```yaml -trains: - type: pandas.ExcelDataSet - filepath: data/02_intermediate/company/trains.xlsx - load_args: - sheet_name: [Sheet1, Sheet2, Sheet3] -``` +The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively. -### Example 8: Saves an image created with Matplotlib on Google Cloud Storage +For example, to load data from a local binary file using `utf-8` encoding: ```yaml -results_plot: - type: matplotlib.MatplotlibWriter - filepath: gcs://your_bucket/data/08_results/plots/output_1.jpeg +test_dataset: + type: ... fs_args: - project: my-project - credentials: my_gcp_credentials -``` - - -### Example 9: Loads / saves an HDF file on local file system storage, using specified load and save arguments - -```yaml -skateboards: - type: pandas.HDFDataSet - filepath: data/02_intermediate/skateboards.hdf - key: name - load_args: - columns: [brand, length] - save_args: - mode: w # Overwrite even when the file already exists - dropna: True -``` - -### Example 10: Loads / saves a parquet file on local file system storage, using specified load and save arguments - -```yaml -trucks: - type: pandas.ParquetDataSet - filepath: data/02_intermediate/trucks.parquet - load_args: - columns: [name, gear, disp, wt] - categories: list - index: name - save_args: - compression: GZIP - file_scheme: hive - has_nulls: False - partition_on: [name] -``` - - -### Example 11: Loads / saves a Spark table on S3, using specified load and save arguments - -```yaml -weather: - type: spark.SparkDataSet - filepath: s3a://your_bucket/data/01_raw/weather* - credentials: dev_s3 - file_format: csv - load_args: - header: True - inferSchema: True - save_args: - sep: '|' - header: True -``` - - -### Example 12: Loads / saves a SQL table using credentials, a database connection, using specified load and save arguments - -```yaml -scooters: - type: pandas.SQLTableDataSet - credentials: scooters_credentials - table_name: scooters - load_args: - index_col: [name] - columns: [name, gear] - save_args: - if_exists: replace -``` - -### Example 13: Loads an SQL table with credentials, a database connection, and applies a SQL query to the table - - -```yaml -scooters_query: - type: pandas.SQLQueryDataSet - credentials: scooters_credentials - sql: select * from cars where gear=4 - load_args: - index_col: [name] -``` - -When you use [`pandas.SQLTableDataSet`](/kedro_datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro_datasets.pandas.SQLQueryDataSet), you must provide a database connection string. In the above example, we pass it using the `scooters_credentials` key from the credentials (see the details in the [Feeding in credentials](#feeding-in-credentials) section below). `scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only). 
- - -### Example 14: Loads data from an API endpoint, example US corn yield data from USDA - -```yaml -us_corn_yield_data: - type: api.APIDataSet - url: https://quickstats.nass.usda.gov - credentials: usda_credentials - params: - key: SOME_TOKEN - format: JSON - commodity_desc: CORN - statisticcat_des: YIELD - agg_level_desc: STATE - year: 2000 -``` - -Note that `usda_credientials` will be passed as the `auth` argument in the `requests` library. Specify the username and password as a list in your `credentials.yml` file as follows: - -```yaml -usda_credentials: - - username - - password -``` - - -### Example 15: Loads data from Minio (S3 API Compatible Storage) - - -```yaml -test: - type: pandas.CSVDataSet - filepath: s3://your_bucket/test.csv # assume `test.csv` is uploaded to the Minio server. - credentials: dev_minio -``` -In `credentials.yml`, define the `key`, `secret` and the `endpoint_url` as follows: - -```yaml -dev_minio: - key: token - secret: key - client_kwargs: - endpoint_url : 'http://localhost:9000' -``` - -```{note} -The easiest way to setup MinIO is to run a Docker image. After the following command, you can access the Minio server with `http://localhost:9000` and create a bucket and add files as if it is on S3. -``` - -`docker run -p 9000:9000 -e "MINIO_ACCESS_KEY=token" -e "MINIO_SECRET_KEY=key" minio/minio server /data` - - -### Example 16: Loads a model saved as a pickle from Azure Blob Storage - -```yaml -ml_model: - type: pickle.PickleDataSet - filepath: "abfs://models/ml_models.pickle" - versioned: True - credentials: dev_abs -``` -In the `credentials.yml` file, define the `account_name` and `account_key`: - -```yaml -dev_abs: - account_name: accountname - account_key: key -``` - - -### Example 17: Loads a CSV file stored in a remote location through SSH - -```{note} -This example requires [Paramiko](https://www.paramiko.org) to be installed (`pip install paramiko`). -``` -```yaml -cool_dataset: - type: pandas.CSVDataSet - filepath: "sftp:///path/to/remote_cluster/cool_data.csv" - credentials: cluster_credentials -``` -All parameters required to establish the SFTP connection can be defined through `fs_args` or in the `credentials.yml` file as follows: - -```yaml -cluster_credentials: - username: my_username - host: host_address - port: 22 - password: password -``` -The list of all available parameters is given in the [Paramiko documentation](https://docs.paramiko.org/en/2.4/api/client.html#paramiko.client.SSHClient.connect). - -## Create a Data Catalog YAML configuration file via CLI - -You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file). - -This creates a `//catalog_.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. - -```yaml -# //catalog_.yml -rockets: - type: MemoryDataSet -scooters: - type: MemoryDataSet + open_args_load: + mode: "rb" + encoding: "utf-8" ``` -## Adding parameters - -You can [configure parameters](../configuration/parameters.md) for your project and [reference them](../configuration/parameters.md#how-to-use-parameters) in your nodes. To do this, use the `add_feed_dict()` method ([API documentation](/kedro.io.DataCatalog)). You can use this method to add any other entry or metadata you wish on the `DataCatalog`. 
- - -## Feeding in credentials +### Dataset access credentials +The Data Catalog also works with the `credentials.yml` file in `conf/local/`, allowing you to specify usernames and passwords required to load certain datasets. Before instantiating the `DataCatalog`, Kedro will first attempt to read [the credentials from the project configuration](../configuration/credentials.md). The resulting dictionary is then passed into `DataCatalog.from_config()` as the `credentials` argument. @@ -385,333 +109,24 @@ dev_s3: client_kwargs: aws_access_key_id: key aws_secret_access_key: secret - -scooters_credentials: - con: sqlite:///kedro.db - -my_gcp_credentials: - id_token: key ``` -In the example above, the `catalog.yml` file contains references to credentials keys `dev_s3` and `scooters_credentials`. This means that when it instantiates the `motorbikes` dataset, for example, the `DataCatalog` will attempt to read top-level key `dev_s3` from the received `credentials` dictionary, and then will pass its values into the dataset `__init__` as a `credentials` argument. This is essentially equivalent to calling this: - -```python -CSVDataSet( - filepath="s3://test_bucket/data/02_intermediate/company/motorbikes.csv", - load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]), - credentials=dict(key="token", secret="key"), -) -``` - - -## Load multiple datasets with similar configuration using YAML anchors - -Different datasets might use the same file format, load and save arguments, and be stored in the same folder. [YAML has a built-in syntax](https://yaml.org/spec/1.2.1/#Syntax) for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets, so that you need not spend time copying and pasting dataset configurations in the `catalog.yml` file. - -You can see this in the following example: +and the Data Catalog is specified in `catalog.yml` as follows: ```yaml -_csv: &csv - type: spark.SparkDataSet - file_format: csv +motorbikes: + type: pandas.CSVDataSet + filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv + credentials: dev_s3 load_args: sep: ',' - na_values: ['#NA', NA] - header: True - inferSchema: False - -cars: - <<: *csv - filepath: s3a://data/01_raw/cars.csv - -trucks: - <<: *csv - filepath: s3a://data/01_raw/trucks.csv - -bikes: - <<: *csv - filepath: s3a://data/01_raw/bikes.csv - load_args: - header: False -``` - -The syntax `&csv` names the following block `csv` and the syntax `<<: *csv` inserts the contents of the block named `csv`. Locally declared keys entirely override inserted ones as seen in `bikes`. - -```{note} -It's important that the name of the template entry starts with a `_` so Kedro knows not to try and instantiate it as a dataset. ``` +In the example above, the `catalog.yml` file contains references to credentials keys `dev_s3`. The Data Catalog first reads `dev_s3` from the received `credentials` dictionary, and then passes its values into the dataset as a `credentials` argument to `__init__`. -You can also nest reuseable YAML syntax: - -```yaml -_csv: &csv - type: spark.SparkDataSet - file_format: csv - load_args: &csv_load_args - header: True - inferSchema: False - -airplanes: - <<: *csv - filepath: s3a://data/01_raw/airplanes.csv - load_args: - <<: *csv_load_args - sep: ; -``` -In this example, the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally, that would replace the whole dictionary. 
In order to extend `load_args`, the defaults for that block are then re-inserted. +### Dataset versioning -## Load multiple datasets with similar configuration using dataset factories -For catalog entries that share configuration details, you can also use the dataset factories introduced in Kedro 0.18.12. This syntax allows you to generalise the configuration and -reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns. - -### Example 1: Generalise datasets with similar names and types into one dataset factory -Consider the following catalog entries: -```yaml -factory_data: - type: pandas.CSVDataSet - filepath: data/01_raw/factory_data.csv - - -process_data: - type: pandas.CSVDataSet - filepath: data/01_raw/process_data.csv -``` -The datasets in this catalog can be generalised to the following dataset factory: -```yaml -"{name}_data": - type: pandas.CSVDataSet - filepath: data/01_raw/{name}_data.csv -``` -When `factory_data` or `process_data` is used in your pipeline, it is matched to the factory pattern `{name}_data`. The factory pattern must always be enclosed in -quotes to avoid YAML parsing errors. - - -### Example 2: Generalise datasets of the same type into one dataset factory -You can also combine all the datasets with the same type and configuration details. For example, consider the following -catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`: -```yaml -boats: - type: pandas.CSVDataSet - filepath: data/01_raw/shuttles.csv - -cars: - type: pandas.CSVDataSet - filepath: data/01_raw/reviews.csv - -planes: - type: pandas.CSVDataSet - filepath: data/01_raw/companies.csv -``` -These datasets can be combined into the following dataset factory: -```yaml -"{dataset_name}#csv": - type: pandas.CSVDataSet - filepath: data/01_raw/{dataset_name}.csv -``` -You will then have to update the pipelines in your project located at `src///pipeline.py` to refer to these datasets as `boats#csv`, -`cars#csv` and `planes#csv`. Adding a suffix or a prefix to the dataset names and the dataset factory patterns, like `#csv` here, ensures that the dataset -names are matched with the intended pattern. -```python -from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles - - -def create_pipeline(**kwargs) -> Pipeline: - return pipeline( - [ - node( - func=preprocess_boats, - inputs="boats#csv", - outputs="preprocessed_boats", - name="preprocess_boats_node", - ), - node( - func=preprocess_cars, - inputs="cars#csv", - outputs="preprocessed_cars", - name="preprocess_cars_node", - ), - node( - func=preprocess_planes, - inputs="planes#csv", - outputs="preprocessed_planes", - name="preprocess_planes_node", - ), - node( - func=create_model_input_table, - inputs=[ - "preprocessed_boats", - "preprocessed_planes", - "preprocessed_cars", - ], - outputs="model_input_table", - name="create_model_input_table_node", - ), - ] - ) -``` -### Example 3: Generalise datasets using namespaces into one dataset factory -You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. 
Consider the -following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the -`active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces: -```python -from kedro.pipeline import Pipeline, node -from kedro.pipeline.modular_pipeline import pipeline - -from .nodes import evaluate_model, split_data, train_model - - -def create_pipeline(**kwargs) -> Pipeline: - pipeline_instance = pipeline( - [ - node( - func=split_data, - inputs=["model_input_table", "params:model_options"], - outputs=["X_train", "y_train"], - name="split_data_node", - ), - node( - func=train_model, - inputs=["X_train", "y_train"], - outputs="regressor", - name="train_model_node", - ), - ] - ) - ds_pipeline_1 = pipeline( - pipe=pipeline_instance, - inputs="model_input_table", - namespace="active_modelling_pipeline", - ) - ds_pipeline_2 = pipeline( - pipe=pipeline_instance, - inputs="model_input_table", - namespace="candidate_modelling_pipeline", - ) - - return ds_pipeline_1 + ds_pipeline_2 -``` -You can now have one dataset factory pattern in your catalog instead of two separate entries for `active_modelling_pipeline.regressor` -and `candidate_modelling_pipeline.regressor` as below: -```yaml -{namespace}.regressor: - type: pickle.PickleDataSet - filepath: data/06_models/regressor_{namespace}.pkl - versioned: true -``` -### Example 4: Generalise datasets of the same type in different layers into one dataset factory with multiple placeholders - -You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset -entries share `type`, `file_format` and `save_args`: -```yaml -processing.factory_data: - type: spark.SparkDataSet - filepath: data/processing/factory_data.pq - file_format: parquet - save_args: - mode: overwrite - -processing.process_data: - type: spark.SparkDataSet - filepath: data/processing/process_data.pq - file_format: parquet - save_args: - mode: overwrite - -modelling.metrics: - type: spark.SparkDataSet - filepath: data/modelling/factory_data.pq - file_format: parquet - save_args: - mode: overwrite -``` -This could be generalised to the following pattern: -```yaml -"{layer}.{dataset_name}": - type: spark.SparkDataSet - filepath: data/{layer}/{dataset_name}.pq - file_format: parquet - save_args: - mode: overwrite -``` -All the placeholders used in the catalog entry body must exist in the factory pattern name. - -### Example 5: Generalise datasets using multiple dataset factories -You can have multiple dataset factories in your catalog. For example: -```yaml -"{namespace}.{dataset_name}@spark": - type: spark.SparkDataSet - filepath: data/{namespace}/{dataset_name}.pq - file_format: parquet - -"{dataset_name}@csv": - type: pandas.CSVDataSet - filepath: data/01_raw/{dataset_name}.csv -``` - -Having multiple dataset factories in your catalog can lead to a situation where a dataset name from your pipeline might -match multiple patterns. To overcome this, Kedro sorts all the potential matches for the dataset name in the pipeline and picks the best match. -The matches are ranked according to the following criteria : -1. Number of exact character matches between the dataset name and the factory pattern. For example, a dataset named `factory_data$csv` would match `{dataset}_data$csv` over `{dataset_name}$csv`. -2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`. -3. 
Alphabetical order - -### Example 6: Generalise all datasets with a catch-all dataset factory to overwrite the default `MemoryDataSet` -You can use dataset factories to define a catch-all pattern which will overwrite the default `MemoryDataSet` creation. -```yaml -"{default_dataset}": - type: pandas.CSVDataSet - filepath: data/{default_dataset}.csv - -``` -Kedro will now treat all the datasets mentioned in your project's pipelines that do not appear as specific patterns or explicit entries in your catalog -as `pandas.CSVDataSet`. - -## Transcode datasets - -You might come across a situation where you would like to read the same file using two different dataset implementations. Use transcoding when you want to load and save the same file, via its specified `filepath`, using different `DataSet` implementations. - -### A typical example of transcoding - -For instance, parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow. - -To enable transcoding, define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your `conf/base/catalog.yml`: - -```yaml -my_dataframe@spark: - type: spark.SparkDataSet - filepath: data/02_intermediate/data.parquet - file_format: parquet - -my_dataframe@pandas: - type: pandas.ParquetDataSet - filepath: data/02_intermediate/data.parquet -``` - -These entries are used in the pipeline like this: - -```python -pipeline( - [ - node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"), - node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"), - ] -) -``` - -### How does transcoding work? - -In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and helps resolve the node execution order. - -In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` -for loading, so the first node should output a `pyspark.sql.DataFrame`, while the second node would receive a `pandas.Dataframe`. - - -## Version datasets and ML models - -Making a simple addition to your Data Catalog allows you to perform versioning of datasets and machine learning models. - -Consider the following versioned dataset defined in the `catalog.yml`: +Kedro enables dataset and ML model versioning through the `versioned` definition. For example: ```yaml cars: @@ -720,125 +135,41 @@ cars: versioned: True ``` -The `DataCatalog` will create a versioned `CSVDataSet` called `cars`. The actual csv file location will look like `data/01_raw/company/cars.csv//cars.csv`, where `` corresponds to a global save version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`. +In this example, `filepath` is used as the basis of a folder that stores versions of the `cars` dataset. Each time a new version is created by a pipeline run it is stored within `data/01_raw/company/cars.csv//cars.csv`, where `` corresponds to a version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`. -You can run the pipeline with a particular versioned data set with `--load-version` flag as follows: +By default, `kedro run` loads the latest version of the dataset. However, you can also specify a particular versioned data set with `--load-version` flag as follows: ```bash kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ ``` where `--load-version` is dataset name and version timestamp separated by `:`. 
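+
+For illustration, after two hypothetical runs the versioned `cars` dataset might be laid out on disk as follows (the timestamped folder names are examples only):
+
+```text
+data/01_raw/company/cars.csv/
+├── 2023-08-16T10.15.30.123Z/
+│   └── cars.csv
+└── 2023-08-18T13.42.11.456Z/
+    └── cars.csv
+```
+
+With this layout, `kedro run` loads the most recent snapshot by default, while `kedro run --load-version=cars:2023-08-16T10.15.30.123Z` would load the earlier one.
+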
-This section shows just the very basics of versioning, which is described further in [the documentation about Kedro IO](../data/kedro_io.md#versioning). - -## Use the Data Catalog with the Code API - -The code API allows you to: - -* configure data sources in code -* operate the IO module within notebooks - -### Configure a Data Catalog - -In a file like `catalog.py`, you can construct a `DataCatalog` object programmatically. In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets). - -```python -from kedro.io import DataCatalog -from kedro_datasets.pandas import ( - CSVDataSet, - SQLTableDataSet, - SQLQueryDataSet, - ParquetDataSet, -) - -io = DataCatalog( - { - "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"), - "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")), - "cars_table": SQLTableDataSet( - table_name="cars", credentials=dict(con="sqlite:///kedro.db") - ), - "scooters_query": SQLQueryDataSet( - sql="select * from cars where gear=4", - credentials=dict(con="sqlite:///kedro.db"), - ), - "ranked": ParquetDataSet(filepath="ranked.parquet"), - } -) -``` - -When using `SQLTableDataSet` or `SQLQueryDataSet` you must provide a `con` key containing [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) database connection string. In the example above we pass it as part of `credentials` argument. Alternative to `credentials` is to put `con` into `load_args` and `save_args` (`SQLTableDataSet` only). - -### Load datasets - -You can access each dataset by its name. - -```python -cars = io.load("cars") # data is now loaded as a DataFrame in 'cars' -gear = cars["gear"].values -``` - -#### Behind the scenes - -The following steps happened behind the scenes when `load` was called: - -- The value `cars` was located in the Data Catalog -- The corresponding `AbstractDataset` object was retrieved -- The `load` method of this dataset was called -- This `load` method delegated the loading to the underlying pandas `read_csv` function - -### View the available data sources - -If you forget what data was assigned, you can always review the `DataCatalog`. - -```python -io.list() -``` - -### Save data +A dataset offers versioning support if it extends the [`AbstractVersionedDataSet`](/kedro.io.AbstractVersionedDataset) class to accept a version keyword argument as part of the constructor and adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively. -You can save data using an API similar to that used to load data. +To verify whether a dataset can undergo versioning, you should examine the dataset class code to inspect its inheritance [(you can find contributed datasets within the `kedro-datasets` repository)](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets). Check if the dataset class inherits from the `AbstractVersionedDataSet`. For instance, if you encounter a class like `CSVDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame])`, this indicates that the dataset is set up to support versioning. -```{warning} -This use is not recommended unless you are prototyping in notebooks. 
-``` - -#### Save data to memory - -```python -from kedro.io import MemoryDataSet - -memory = MemoryDataSet(data=None) -io.add("cars_cache", memory) -io.save("cars_cache", "Memory can store anything.") -io.load("cars_cache") +```{note} +Note that HTTP(S) is a supported file system in the dataset implementations, but if you it, you can't also use versioning. ``` -#### Save data to a SQL database for querying - -We might now want to put the data in a SQLite database to run queries on it. Let's use that to rank scooters by their mpg. - -```python -import os - -# This cleans up the database in case it exists at this point -try: - os.remove("kedro.db") -except FileNotFoundError: - pass - -io.save("cars_table", cars) -ranked = io.load("scooters_query")[["brand", "mpg"]] -``` +## Use the Data Catalog within Kedro configuration -#### Save data in Parquet +Kedro configuration enables you to organise your project for different stages of your data pipeline. For example, you might need different Data Catalog settings for development, testing, and production environments. -Finally, we can save the processed data in Parquet format. +By default, Kedro has a `base` and a `local` folder for configuration. The Data Catalog configuration is loaded using a configuration loader class which recursively scans for configuration files inside the `conf` folder, firstly in `conf/base` and then in `conf/local` (which is the designated overriding environment). Kedro merges the configuration information and returns a configuration dictionary according to rules set out in the [configuration documentation](../configuration/configuration_basics.md). -```python -io.save("ranked", ranked) -``` +In summary, if you need to configure your datasets for different environments, you can create both `conf/base/catalog.yml` and `conf/local/catalog.yml`. For instance, you can use the `catalog.yml` file in `conf/base/` to register the locations of datasets that would run in production, while adding a second version of `catalog.yml` in `conf/local/` to register the locations of sample datasets while you are using them for prototyping data pipeline(s). -```{warning} -Saving `None` to a dataset is not allowed! +To illustrate this, consider the following catalog entry for a dataset named `cars` in `conf/base/catalog.yml`, which points to a csv file stored in a bucket on AWS S3: +```yaml +cars: + filepath: s3://my_bucket/cars.csv + type: pandas.CSVDataSet + ``` +You can overwrite this catalog entry in `conf/local/catalog.yml` to point to a locally stored file instead: +```yaml +cars: + filepath: data/01_raw/cars.csv + type: pandas.CSVDataSet ``` +In your pipeline code, when the `cars` dataset is used, it will use the overwritten catalog entry from `conf/local/catalog.yml` and rely on Kedro to detect which definition of `cars` dataset to use in your pipeline. diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md new file mode 100644 index 0000000000..0570aa0f2c --- /dev/null +++ b/docs/source/data/data_catalog_yaml_examples.md @@ -0,0 +1,408 @@ +# Data Catalog YAML examples + +This page contains a set of examples to help you structure your YAML configuration file in `conf/base/catalog.yml` or `conf/local/catalog.yml`. 
+ +```{contents} Table of Contents +:depth: 3 +``` + +## Load data from a local binary file using `utf-8` encoding + +The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively. + +```yaml +test_dataset: + type: ... + fs_args: + open_args_load: + mode: "rb" + encoding: "utf-8" +``` + +`load_args` and `save_args` configure how a third-party library (e.g. `pandas` for `CSVDataSet`) loads/saves data from/to a file. + +## Save data to a CSV file without row names (index) using `utf-8` encoding + +```yaml +test_dataset: + type: pandas.CSVDataSet + ... + save_args: + index: False + encoding: "utf-8" +``` + +## Load/save a CSV file from/to a local file system + +```yaml +bikes: + type: pandas.CSVDataSet + filepath: data/01_raw/bikes.csv +``` + +## Load/save a CSV on a local file system, using specified load/save arguments + +```yaml +cars: + type: pandas.CSVDataSet + filepath: data/01_raw/company/cars.csv + load_args: + sep: ',' + save_args: + index: False + date_format: '%Y-%m-%d %H:%M' + decimal: . + +``` + +## Load/save a compressed CSV on a local file system + +```yaml +boats: + type: pandas.CSVDataSet + filepath: data/01_raw/company/boats.csv.gz + load_args: + sep: ',' + compression: 'gzip' + fs_args: + open_args_load: + mode: 'rb' +``` + +## Load a CSV file from a specific S3 bucket, using credentials and load arguments + +```yaml +motorbikes: + type: pandas.CSVDataSet + filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv + credentials: dev_s3 + load_args: + sep: ',' + skiprows: 5 + skipfooter: 1 + na_values: ['#NA', NA] +``` + +## Load/save a pickle file from/to a local file system + +```yaml +airplanes: + type: pickle.PickleDataSet + filepath: data/06_models/airplanes.pkl + backend: pickle +``` + +## Load an Excel file from Google Cloud Storage + +The example includes the `project` value for the underlying filesystem class (`GCSFileSystem`) within Google Cloud Storage (GCS) + +```yaml +rockets: + type: pandas.ExcelDataSet + filepath: gcs://your_bucket/data/02_intermediate/company/motorbikes.xlsx + fs_args: + project: my-project + credentials: my_gcp_credentials + save_args: + sheet_name: Sheet1 +``` + + +## Load a multi-sheet Excel file from a local file system + +```yaml +trains: + type: pandas.ExcelDataSet + filepath: data/02_intermediate/company/trains.xlsx + load_args: + sheet_name: [Sheet1, Sheet2, Sheet3] +``` + +## Save an image created with Matplotlib on Google Cloud Storage + +```yaml +results_plot: + type: matplotlib.MatplotlibWriter + filepath: gcs://your_bucket/data/08_results/plots/output_1.jpeg + fs_args: + project: my-project + credentials: my_gcp_credentials +``` + + +## Load/save an HDF file on local file system storage, using specified load/save arguments + +```yaml +skateboards: + type: pandas.HDFDataSet + filepath: data/02_intermediate/skateboards.hdf + key: name + load_args: + columns: [brand, length] + save_args: + mode: w # Overwrite even when the file already exists + dropna: True +``` + +## Load/save a parquet file on local file system storage, using specified load/save arguments + +```yaml +trucks: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/trucks.parquet + load_args: + columns: [name, gear, disp, wt] + categories: list + index: name + save_args: + compression: GZIP + file_scheme: hive + has_nulls: False + partition_on: [name] +``` + + +## Load/save a Spark table on S3, 
using specified load/save arguments + +```yaml +weather: + type: spark.SparkDataSet + filepath: s3a://your_bucket/data/01_raw/weather* + credentials: dev_s3 + file_format: csv + load_args: + header: True + inferSchema: True + save_args: + sep: '|' + header: True +``` + + +## Load/save a SQL table using credentials, a database connection, and specified load/save arguments + +```yaml +scooters: + type: pandas.SQLTableDataSet + credentials: scooters_credentials + table_name: scooters + load_args: + index_col: [name] + columns: [name, gear] + save_args: + if_exists: replace +``` + +## Load a SQL table with credentials and a database connection, and apply a SQL query to the table + + +```yaml +scooters_query: + type: pandas.SQLQueryDataSet + credentials: scooters_credentials + sql: select * from cars where gear=4 + load_args: + index_col: [name] +``` + +When you use [`pandas.SQLTableDataSet`](/kedro_datasets.pandas.SQLTableDataSet) or [`pandas.SQLQueryDataSet`](/kedro_datasets.pandas.SQLQueryDataSet), you must provide a database connection string. In the above example, we pass it using the `scooters_credentials` key from the credentials. + +Note that `scooters_credentials` must have a top-level key `con` containing a [SQLAlchemy compatible](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) connection string. As an alternative to credentials, you could explicitly put `con` into `load_args` and `save_args` (`pandas.SQLTableDataSet` only). + + +## Load data from an API endpoint + +This example uses US corn yield data from USDA. + +```yaml +us_corn_yield_data: + type: api.APIDataSet + url: https://quickstats.nass.usda.gov + credentials: usda_credentials + params: + key: SOME_TOKEN + format: JSON + commodity_desc: CORN + statisticcat_des: YIELD + agg_level_desc: STATE + year: 2000 +``` + +Note that `usda_credientials` will be passed as the `auth` argument in the `requests` library. Specify the username and password as a list in your `credentials.yml` file as follows: + +```yaml +usda_credentials: + - username + - password +``` + + +## Load data from Minio (S3 API Compatible Storage) + + +```yaml +test: + type: pandas.CSVDataSet + filepath: s3://your_bucket/test.csv # assume `test.csv` is uploaded to the Minio server. + credentials: dev_minio +``` +In `credentials.yml`, define the `key`, `secret` and the `endpoint_url` as follows: + +```yaml +dev_minio: + key: token + secret: key + client_kwargs: + endpoint_url : 'http://localhost:9000' +``` + +```{note} +The easiest way to setup MinIO is to run a Docker image. After the following command, you can access the Minio server with `http://localhost:9000` and create a bucket and add files as if it is on S3. +``` + +`docker run -p 9000:9000 -e "MINIO_ACCESS_KEY=token" -e "MINIO_SECRET_KEY=key" minio/minio server /data` + + +## Load a model saved as a pickle from Azure Blob Storage + +```yaml +ml_model: + type: pickle.PickleDataSet + filepath: "abfs://models/ml_models.pickle" + versioned: True + credentials: dev_abs +``` +In the `credentials.yml` file, define the `account_name` and `account_key`: + +```yaml +dev_abs: + account_name: accountname + account_key: key +``` + + +## Load a CSV file stored in a remote location through SSH + +```{note} +This example requires [Paramiko](https://www.paramiko.org) to be installed (`pip install paramiko`). 
+```
+```yaml
+cool_dataset:
+  type: pandas.CSVDataSet
+  filepath: "sftp:///path/to/remote_cluster/cool_data.csv"
+  credentials: cluster_credentials
+```
+All parameters required to establish the SFTP connection can be defined through `fs_args` or in the `credentials.yml` file as follows:
+
+```yaml
+cluster_credentials:
+  username: my_username
+  host: host_address
+  port: 22
+  password: password
+```
+The list of all available parameters is given in the [Paramiko documentation](https://docs.paramiko.org/en/2.4/api/client.html#paramiko.client.SSHClient.connect).
+
+## Load multiple datasets with similar configuration using YAML anchors
+
+Different datasets might use the same file format, load and save arguments, and be stored in the same folder. [YAML has a built-in syntax](https://yaml.org/spec/1.2.1/#Syntax) for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets, so that you need not spend time copying and pasting dataset configurations in the `catalog.yml` file.
+
+You can see this in the following example:
+
+```yaml
+_csv: &csv
+  type: spark.SparkDataSet
+  file_format: csv
+  load_args:
+    sep: ','
+    na_values: ['#NA', NA]
+    header: True
+    inferSchema: False
+
+cars:
+  <<: *csv
+  filepath: s3a://data/01_raw/cars.csv
+
+trucks:
+  <<: *csv
+  filepath: s3a://data/01_raw/trucks.csv
+
+bikes:
+  <<: *csv
+  filepath: s3a://data/01_raw/bikes.csv
+  load_args:
+    header: False
+```
+
+The syntax `&csv` names the following block `csv` and the syntax `<<: *csv` inserts the contents of the block named `csv`. Locally declared keys entirely override inserted ones as seen in `bikes`.
+
+```{note}
+It's important that the name of the template entry starts with a `_` so Kedro knows not to try to instantiate it as a dataset.
+```
+
+You can also nest reusable YAML syntax:
+
+```yaml
+_csv: &csv
+  type: spark.SparkDataSet
+  file_format: csv
+  load_args: &csv_load_args
+    header: True
+    inferSchema: False
+
+airplanes:
+  <<: *csv
+  filepath: s3a://data/01_raw/airplanes.csv
+  load_args:
+    <<: *csv_load_args
+    sep: ;
+```
+
+In this example, the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally, that would replace the whole dictionary. In order to extend `load_args`, the defaults for that block are then re-inserted.
+
+## Read the same file using two different datasets
+
+You might come across a situation where you would like to read the same file using two different dataset implementations (known as transcoding). For example, Parquet files can not only be loaded via the `ParquetDataSet` using `pandas`, but also directly by `SparkDataSet`. This conversion is typical when coordinating a `Spark` to `pandas` workflow.
+
+Define two `DataCatalog` entries for the same dataset in a common format (Parquet, JSON, CSV, etc.)
in your `conf/base/catalog.yml`: + +```yaml +my_dataframe@spark: + type: spark.SparkDataSet + filepath: data/02_intermediate/data.parquet + file_format: parquet + +my_dataframe@pandas: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/data.parquet +``` + +These entries are used in the pipeline like this: + +```python +pipeline( + [ + node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"), + node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"), + ] +) +``` + +In this example, Kedro understands that `my_dataframe` is the same dataset in its `spark.SparkDataSet` and `pandas.ParquetDataSet` formats and resolves the node execution order. + +In the pipeline, Kedro uses the `spark.SparkDataSet` implementation for saving and `pandas.ParquetDataSet` +for loading, so the first node outputs a `pyspark.sql.DataFrame`, while the second node receives a `pandas.Dataframe`. + +## Create a Data Catalog YAML configuration file via the CLI + +You can use the [`kedro catalog create` command to create a Data Catalog YAML configuration](../development/commands_reference.md#create-a-data-catalog-yaml-configuration-file). + +This creates a `//catalog/.yml` configuration file with `MemoryDataSet` datasets for each dataset in a registered pipeline if it is missing from the `DataCatalog`. + +```yaml +# //catalog/.yml +rockets: + type: MemoryDataSet +scooters: + type: MemoryDataSet +``` diff --git a/docs/source/extend_kedro/custom_datasets.md b/docs/source/data/how_to_create_a_custom_dataset.md similarity index 93% rename from docs/source/extend_kedro/custom_datasets.md rename to docs/source/data/how_to_create_a_custom_dataset.md index c0aad914da..86010b4f18 100644 --- a/docs/source/extend_kedro/custom_datasets.md +++ b/docs/source/data/how_to_create_a_custom_dataset.md @@ -1,7 +1,12 @@ -# Custom datasets +# Advanced: Tutorial to create a custom dataset [Kedro supports many datasets](/kedro_datasets) out of the box, but you may find that you need to create a custom dataset. For example, you may need to handle a proprietary data format or filesystem in your pipeline, or perhaps you have found a particular use case for a dataset that Kedro does not support. This tutorial explains how to create a custom dataset to read and save image data. +## AbstractDataSet + +For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataSet` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataSet` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataSet` implementation. + + ## Scenario In this example, we use a [Kaggle dataset of Pokémon images and types](https://www.kaggle.com/vishalsubbiah/pokemon-images-and-types) to train a model to classify the type of a given [Pokémon](https://en.wikipedia.org/wiki/Pok%C3%A9mon), e.g. Water, Fire, Bug, etc., based on its appearance. To train the model, we read the Pokémon images from PNG files into `numpy` arrays before further manipulation in the Kedro pipeline. To work with PNG images out of the box, in this example we create an `ImageDataSet` to read and save image data. 
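+
+Before the tutorial walks through each method in detail, it may help to see the overall shape of the class. The following is a minimal, hypothetical skeleton of the `ImageDataSet` described above; the `fsspec` and `Pillow` logic is filled in over the following sections, so the method bodies below are placeholders only:
+
+```python
+from pathlib import PurePosixPath
+from typing import Any, Dict
+
+import numpy as np
+
+from kedro.io import AbstractDataset
+
+
+class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
+    """Loads and saves image data as numpy arrays (skeleton only)."""
+
+    def __init__(self, filepath: str):
+        # Store the location of the image; the tutorial later parses this with fsspec.
+        self._filepath = PurePosixPath(filepath)
+
+    def _load(self) -> np.ndarray:
+        # Implemented in the next section using fsspec and Pillow.
+        raise NotImplementedError
+
+    def _save(self, data: np.ndarray) -> None:
+        # Implemented later in the tutorial using fsspec and Pillow.
+        raise NotImplementedError
+
+    def _describe(self) -> Dict[str, Any]:
+        # Used by Kedro when logging information about the dataset instance.
+        return dict(filepath=self._filepath)
+```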
@@ -93,7 +98,7 @@ src/kedro_pokemon/extras ## Implement the `_load` method with `fsspec` -Many of the built-in Kedro datasets rely on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) as a consistent interface to different data sources, as described earlier in the section about the [Data Catalog](../data/data_catalog.md#specify-the-location-of-the-dataset). In this example, it's particularly convenient to use `fsspec` in conjunction with `Pillow` to read image data, since it allows the dataset to work flexibly with different image locations and formats. +Many of the built-in Kedro datasets rely on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) as a consistent interface to different data sources, as described earlier in the section about the [Data Catalog](../data/data_catalog.md#dataset-filepath). In this example, it's particularly convenient to use `fsspec` in conjunction with `Pillow` to read image data, since it allows the dataset to work flexibly with different image locations and formats. Here is the implementation of the `_load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array: @@ -266,7 +271,7 @@ class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]): Currently, the `ImageDataSet` only works with a single image, but this example needs to load all Pokemon images from the raw data directory for further processing. -Kedro's [`PartitionedDataSet`](../data/kedro_io.md#partitioned-dataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory. +Kedro's [`PartitionedDataSet`](./partitioned_and_incremental_datasets.md) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory. To use `PartitionedDataSet` with `ImageDataSet` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataSet` loads all PNG files from the data directory using `ImageDataSet`: @@ -297,11 +302,14 @@ $ ls -la data/01_raw/pokemon-images-and-types/images/images/*.png | wc -l ## Versioning +### How to implement versioning in your dataset + ```{note} Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time. ``` -To add [Versioning](../data/kedro_io.md#versioning) support to the new dataset we need to extend the - [AbstractVersionedDataset](/kedro.io.AbstractVersionedDataset) to: + +To add versioning support to the new dataset we need to extend the + [AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataset) to: * Accept a `version` keyword argument as part of the constructor * Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively @@ -498,7 +506,6 @@ In [2]: context.catalog.save('pikachu', data=img) Inspect the content of the data directory to find a new version of the data, written by `save`. -You may also want to consult the [in-depth documentation about the Versioning API](../data/kedro_io.md#versioning). ## Thread-safety @@ -562,7 +569,7 @@ class ImageDataSet(AbstractVersionedDataset): ... ``` -We provide additional examples of [how to use parameters through the data catalog's YAML API](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api). For an example of how to use these parameters in your dataset's constructor, please see the [SparkDataSet](/kedro_datasets.spark.SparkDataSet)'s implementation. 
+We provide additional examples of [how to use parameters through the data catalog's YAML API](./data_catalog_yaml_examples.md). For an example of how to use these parameters in your dataset's constructor, please see the [SparkDataSet](/kedro_datasets.spark.SparkDataSet)'s implementation.
 
 ## How to contribute a custom dataset implementation
 
diff --git a/docs/source/data/index.md b/docs/source/data/index.md
index 00c05353fc..b90a3d9961 100644
--- a/docs/source/data/index.md
+++ b/docs/source/data/index.md
@@ -1,8 +1,49 @@
-# Data Catalog
+
+# The Kedro Data Catalog
+
+In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class.
+
+[Kedro provides different built-in datasets in the `kedro-datasets` package](/kedro_datasets) for numerous file types and file systems, so you don’t have to write any of the logic for reading/writing data.
+
+
+We first introduce the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project.
 
 ```{toctree}
 :maxdepth: 1
 
 data_catalog
-kedro_io
+```
+
+The following page offers a range of examples of YAML specification for various Data Catalog use cases:
+
+```{toctree}
+:maxdepth: 1
+
+data_catalog_yaml_examples
+```
+
+Once you are familiar with the format of `catalog.yml`, you may find your catalog gets repetitive if you need to load multiple datasets with similar configuration. From Kedro 0.18.12, you can use dataset factories to generalise the configuration and reduce the number of similar catalog entries. This works by matching datasets used in your project’s pipelines to dataset factory patterns and is explained in a new page about Kedro dataset factories:
+
+
+```{toctree}
+:maxdepth: 1
+
+kedro_dataset_factories
+```
+
+Further pages describe more advanced concepts:
+
+```{toctree}
+:maxdepth: 1
+
+advanced_data_catalog_usage
+partitioned_and_incremental_datasets
+```
+
+This section on handling data with Kedro concludes with an advanced use case, illustrated with a tutorial that explains how to create your own custom dataset:
+
+```{toctree}
+:maxdepth: 1
+
+how_to_create_a_custom_dataset
 ```
diff --git a/docs/source/data/kedro_dataset_factories.md b/docs/source/data/kedro_dataset_factories.md
new file mode 100644
index 0000000000..693272c013
--- /dev/null
+++ b/docs/source/data/kedro_dataset_factories.md
@@ -0,0 +1,385 @@
+# Kedro dataset factories
+You can load multiple datasets with similar configuration using dataset factories, introduced in Kedro 0.18.12.
+
+The syntax allows you to generalise the configuration and reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns.
+
+## How to generalise datasets with similar names and types
+
+Consider the following catalog entries:
+
+```yaml
+factory_data:
+  type: pandas.CSVDataSet
+  filepath: data/01_raw/factory_data.csv
+
+
+process_data:
+  type: pandas.CSVDataSet
+  filepath: data/01_raw/process_data.csv
+```
+
+The datasets in this catalog can be generalised to the following dataset factory:
+
+```yaml
+"{name}_data":
+  type: pandas.CSVDataSet
+  filepath: data/01_raw/{name}_data.csv
+```
+
+When `factory_data` or `process_data` is used in your pipeline, it is matched to the factory pattern `{name}_data`. The factory pattern must always be enclosed in
+ + +## How to generalise datasets of the same type + +You can also combine all the datasets with the same type and configuration details. For example, consider the following +catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`: + +```yaml +boats: + type: pandas.CSVDataSet + filepath: data/01_raw/shuttles.csv + +cars: + type: pandas.CSVDataSet + filepath: data/01_raw/reviews.csv + +planes: + type: pandas.CSVDataSet + filepath: data/01_raw/companies.csv +``` + +These datasets can be combined into the following dataset factory: + +```yaml +"{dataset_name}#csv": + type: pandas.CSVDataSet + filepath: data/01_raw/{dataset_name}.csv +``` + +You will then have to update the pipelines in your project located at `src///pipeline.py` to refer to these datasets as `boats#csv`, +`cars#csv` and `planes#csv`. Adding a suffix or a prefix to the dataset names and the dataset factory patterns, like `#csv` here, ensures that the dataset +names are matched with the intended pattern. + +```python +from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles + + +def create_pipeline(**kwargs) -> Pipeline: + return pipeline( + [ + node( + func=preprocess_boats, + inputs="boats#csv", + outputs="preprocessed_boats", + name="preprocess_boats_node", + ), + node( + func=preprocess_cars, + inputs="cars#csv", + outputs="preprocessed_cars", + name="preprocess_cars_node", + ), + node( + func=preprocess_planes, + inputs="planes#csv", + outputs="preprocessed_planes", + name="preprocess_planes_node", + ), + node( + func=create_model_input_table, + inputs=[ + "preprocessed_boats", + "preprocessed_planes", + "preprocessed_cars", + ], + outputs="model_input_table", + name="create_model_input_table_node", + ), + ] + ) +``` +## How to generalise datasets using namespaces + +You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. Consider the +following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the +`active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces: + +```python +from kedro.pipeline import Pipeline, node +from kedro.pipeline.modular_pipeline import pipeline + +from .nodes import evaluate_model, split_data, train_model + + +def create_pipeline(**kwargs) -> Pipeline: + pipeline_instance = pipeline( + [ + node( + func=split_data, + inputs=["model_input_table", "params:model_options"], + outputs=["X_train", "y_train"], + name="split_data_node", + ), + node( + func=train_model, + inputs=["X_train", "y_train"], + outputs="regressor", + name="train_model_node", + ), + ] + ) + ds_pipeline_1 = pipeline( + pipe=pipeline_instance, + inputs="model_input_table", + namespace="active_modelling_pipeline", + ) + ds_pipeline_2 = pipeline( + pipe=pipeline_instance, + inputs="model_input_table", + namespace="candidate_modelling_pipeline", + ) + + return ds_pipeline_1 + ds_pipeline_2 +``` +You can now have one dataset factory pattern in your catalog instead of two separate entries for `active_modelling_pipeline.regressor` +and `candidate_modelling_pipeline.regressor` as below: + +```yaml +{namespace}.regressor: + type: pickle.PickleDataSet + filepath: data/06_models/regressor_{namespace}.pkl + versioned: true +``` +## How to generalise datasets of the same type in different layers + +You can use multiple placeholders in the same pattern. 
For example, consider the following catalog where the dataset +entries share `type`, `file_format` and `save_args`: + +```yaml +processing.factory_data: + type: spark.SparkDataSet + filepath: data/processing/factory_data.pq + file_format: parquet + save_args: + mode: overwrite + +processing.process_data: + type: spark.SparkDataSet + filepath: data/processing/process_data.pq + file_format: parquet + save_args: + mode: overwrite + +modelling.metrics: + type: spark.SparkDataSet + filepath: data/modelling/factory_data.pq + file_format: parquet + save_args: + mode: overwrite +``` + +This could be generalised to the following pattern: + +```yaml +"{layer}.{dataset_name}": + type: spark.SparkDataSet + filepath: data/{layer}/{dataset_name}.pq + file_format: parquet + save_args: + mode: overwrite +``` +All the placeholders used in the catalog entry body must exist in the factory pattern name. + +## How to generalise datasets using multiple dataset factories +You can have multiple dataset factories in your catalog. For example: + +```yaml +"{namespace}.{dataset_name}@spark": + type: spark.SparkDataSet + filepath: data/{namespace}/{dataset_name}.pq + file_format: parquet + +"{dataset_name}@csv": + type: pandas.CSVDataSet + filepath: data/01_raw/{dataset_name}.csv +``` + +Having multiple dataset factories in your catalog can lead to a situation where a dataset name from your pipeline might +match multiple patterns. To overcome this, Kedro sorts all the potential matches for the dataset name in the pipeline and picks the best match. +The matches are ranked according to the following criteria: + +1. Number of exact character matches between the dataset name and the factory pattern. For example, a dataset named `factory_data$csv` would match `{dataset}_data$csv` over `{dataset_name}$csv`. +2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`. +3. Alphabetical order + +## How to override the default dataset creation with dataset factories + +You can use dataset factories to define a catch-all pattern which will overwrite the default [`MemoryDataSet`](/kedro.io.MemoryDataset) creation. + +```yaml +"{default_dataset}": + type: pandas.CSVDataSet + filepath: data/{default_dataset}.csv + +``` +Kedro will now treat all the datasets mentioned in your project's pipelines that do not appear as specific patterns or explicit entries in your catalog +as `pandas.CSVDataSet`. + +## CLI commands for dataset factories + +To manage your dataset factories, two new commands have been added to the Kedro CLI: `kedro catalog rank` (0.18.12) and `kedro catalog resolve` (0.18.13). + +### How to use `kedro catalog rank` + +This command outputs a list of all dataset factories in the catalog, ranked in the order by which pipeline datasets are matched against them. The ordering is determined by the following criteria: + +1. The number of non-placeholder characters in the pattern +2. The number of placeholders in the pattern +3. Alphabetic ordering + +Consider a catalog file with the following patterns: + +
+Click to expand + +```yaml +"{layer}.{dataset_name}": + type: pandas.CSVDataSet + filepath: data/{layer}/{dataset_name}.csv + +preprocessed_{dataset_name}: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/preprocessed_{dataset_name}.pq + +processed_{dataset_name}: + type: pandas.ParquetDataSet + filepath: data/03_primary/processed_{dataset_name}.pq + +"{dataset_name}_csv": + type: pandas.CSVDataSet + filepath: data/03_primary/{dataset_name}.csv + +"{namespace}.{dataset_name}_pq": + type: pandas.ParquetDataSet + filepath: data/03_primary/{dataset_name}_{namespace}.pq + +"{default_dataset}": + type: pickle.PickleDataSet + filepath: data/01_raw/{default_dataset}.pickle +``` +
+ +Running `kedro catalog rank` will result in the following output: + +``` +- preprocessed_{dataset_name} +- processed_{dataset_name} +- '{namespace}.{dataset_name}_pq' +- '{dataset_name}_csv' +- '{layer}.{dataset_name}' +- '{default_dataset}' +``` + +As we can see, the entries are ranked firstly by how many non-placeholders are in the pattern, in descending order. Where two entries have the same number of non-placeholder characters, `{namespace}.{dataset_name}_pq` and `{dataset_name}_csv` with four each, they are then ranked by the number of placeholders, also in decreasing order. `{default_dataset}` is the least specific pattern possible, and will always be matched against last. + +### How to use `kedro catalog resolve` + +This command resolves dataset patterns in the catalog against any explicit dataset entries in the project pipeline. The resulting output contains all explicit dataset entries in the catalog and any dataset in the default pipeline that resolves some dataset pattern. + +To illustrate this, consider the following catalog file: + +
+Click to expand + +```yaml +companies: + type: pandas.CSVDataSet + filepath: data/01_raw/companies.csv + +reviews: + type: pandas.CSVDataSet + filepath: data/01_raw/reviews.csv + +shuttles: + type: pandas.ExcelDataSet + filepath: data/01_raw/shuttles.xlsx + load_args: + engine: openpyxl # Use modern Excel engine, it is the default since Kedro 0.18.0 + +preprocessed_{name}: + type: pandas.ParquetDataSet + filepath: data/02_intermediate/preprocessed_{name}.pq + +"{default}": + type: pandas.ParquetDataSet + filepath: data/03_primary/{default}.pq +``` +
+ +and the following pipeline in `pipeline.py`: + +
+Click to expand + +```python +def create_pipeline(**kwargs) -> Pipeline: + return pipeline( + [ + node( + func=preprocess_companies, + inputs="companies", + outputs="preprocessed_companies", + name="preprocess_companies_node", + ), + node( + func=preprocess_shuttles, + inputs="shuttles", + outputs="preprocessed_shuttles", + name="preprocess_shuttles_node", + ), + node( + func=create_model_input_table, + inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"], + outputs="model_input_table", + name="create_model_input_table_node", + ), + ] + ) +``` +
+ +The resolved catalog output by the command will be as follows: + +
+Click to expand + +```yaml +companies: + filepath: data/01_raw/companies.csv + type: pandas.CSVDataSet +model_input_table: + filepath: data/03_primary/model_input_table.pq + type: pandas.ParquetDataSet +preprocessed_companies: + filepath: data/02_intermediate/preprocessed_companies.pq + type: pandas.ParquetDataSet +preprocessed_shuttles: + filepath: data/02_intermediate/preprocessed_shuttles.pq + type: pandas.ParquetDataSet +reviews: + filepath: data/01_raw/reviews.csv + type: pandas.CSVDataSet +shuttles: + filepath: data/01_raw/shuttles.xlsx + load_args: + engine: openpyxl + type: pandas.ExcelDataSet +``` +
+ +By default this is output to the terminal. However, if you wish to output the resolved catalog to a specific file, you can use the redirection operator `>`: + +```bash +kedro catalog resolve > output_file.yaml +``` diff --git a/docs/source/data/kedro_io.md b/docs/source/data/partitioned_and_incremental_datasets.md similarity index 62% rename from docs/source/data/kedro_io.md rename to docs/source/data/partitioned_and_incremental_datasets.md index a38ea97fcb..7e48c23137 100644 --- a/docs/source/data/kedro_io.md +++ b/docs/source/data/partitioned_and_incremental_datasets.md @@ -1,245 +1,9 @@ -# Kedro IO +# Advanced: Partitioned and incremental datasets +## Partitioned datasets -In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataset](/kedro.io.AbstractDataset) and [kedro.io.DataSetError](/kedro.io.DataSetError). +Distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible. -## Error handling - -We have custom exceptions for the main classes of errors that you can handle to deal with failures. - -```python -from kedro.io import * -``` - -```python -io = DataCatalog(data_sets=dict()) # empty catalog - -try: - cars_df = io.load("cars") -except DataSetError: - print("Error raised.") -``` - - -## AbstractDataset - -To understand what is going on behind the scenes, you should study the [AbstractDataset interface](/kedro.io.AbstractDataset). `AbstractDataset` is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation. - -If you have a dataset called `parts`, you can make direct calls to it like so: - -```python -parts_df = parts.load() -``` - -We recommend using a `DataCatalog` instead (for more details, see [the `DataCatalog` documentation](../data/data_catalog.md)) as it has been designed to make all datasets available to project members. - -For contributors, if you would like to submit a new dataset, you must extend the `AbstractDataset`. For a complete guide, please read [the section on custom datasets](../extend_kedro/custom_datasets.md). - - -## Versioning - -In order to enable versioning, you need to update the `catalog.yml` config file and set the `versioned` attribute to `true` for the given dataset. If this is a custom dataset, the implementation must also: - 1. extend `kedro.io.core.AbstractVersionedDataset` AND - 2. add `version` namedtuple as an argument to its `__init__` method AND - 3. 
call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro_datasets.pandas.CSVDataSet) as an example) AND - 4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro_datasets.pandas.CSVDataSet) for an example implementation) - -```{note} -If a new version of a dataset is created mid-run, for instance by an external system adding new files, it will not interfere in the current run, i.e. the load version stays the same throughout subsequent loads. -``` - -An example dataset could look similar to the below: - -```python -from pathlib import Path, PurePosixPath - -import pandas as pd - -from kedro.io import AbstractVersionedDataset - - -class MyOwnDataSet(AbstractVersionedDataset): - def __init__(self, filepath, version, param1, param2=True): - super().__init__(PurePosixPath(filepath), version) - self._param1 = param1 - self._param2 = param2 - - def _load(self) -> pd.DataFrame: - load_path = self._get_load_path() - return pd.read_csv(load_path) - - def _save(self, df: pd.DataFrame) -> None: - save_path = self._get_save_path() - df.to_csv(save_path) - - def _exists(self) -> bool: - path = self._get_load_path() - return Path(path).exists() - - def _describe(self): - return dict(version=self._version, param1=self._param1, param2=self._param2) -``` - -With `catalog.yml` specifying: - -```yaml -my_dataset: - type: .MyOwnDataSet - filepath: data/01_raw/my_data.csv - versioned: true - param1: # param1 is a required argument - # param2 will be True by default -``` - -### `version` namedtuple - -Versioned dataset `__init__` method must have an optional argument called `version` with a default value of `None`. If provided, this argument must be an instance of [`kedro.io.core.Version`](/kedro.io.Version). Its `load` and `save` attributes must either be `None` or contain string values representing exact load and save versions: - -* If `version` is `None`, then the dataset is considered *not versioned*. -* If `version.load` is `None`, then the latest available version will be used to load the dataset, otherwise a string representing exact load version must be provided. -* If `version.save` is `None`, then a new save version string will be generated by calling `kedro.io.core.generate_timestamp()`, otherwise a string representing the exact save version must be provided. - -### Versioning using the YAML API - -The easiest way to version a specific dataset is to change the corresponding entry in the `catalog.yml` file. For example, if the following dataset was defined in the `catalog.yml` file: - -```yaml -cars: - type: pandas.CSVDataSet - filepath: data/01_raw/company/car_data.csv - versioned: true -``` - -The `DataCatalog` will create a versioned `CSVDataSet` called `cars`. The actual csv file location will look like `data/01_raw/company/car_data.csv//car_data.csv`, where `` corresponds to a global save version string formatted as `YYYY-MM-DDThh.mm.ss.sssZ`. Every time the `DataCatalog` is instantiated, it generates a new global save version, which is propagated to all versioned datasets it contains. - -The `catalog.yml` file only allows you to version your datasets, but does not allow you to choose which version to load or save. This is deliberate because we have chosen to separate the data catalog from any runtime configuration. 
If you need to pin a dataset version, you can either [specify the versions in a separate `yml` file and call it at runtime](../nodes_and_pipelines/run_a_pipeline.md#configure-kedro-run-arguments) or [instantiate your versioned datasets using Code API and define a version parameter explicitly](#versioning-using-the-code-api). - -By default, the `DataCatalog` will load the latest version of the dataset. However, you can also specify an exact load version. In order to do that, pass a dictionary with exact load versions to `DataCatalog.from_config`: - -```python -load_versions = {"cars": "2019-02-13T14.35.36.518Z"} -io = DataCatalog.from_config(catalog_config, credentials, load_versions=load_versions) -cars = io.load("cars") -``` - -The last row in the example above would attempt to load a CSV file from `data/01_raw/company/car_data.csv/2019-02-13T14.35.36.518Z/car_data.csv`: - -* `load_versions` configuration has an effect only if a dataset versioning has been enabled in the catalog config file - see the example above. - -* We recommend that you do not override `save_version` argument in `DataCatalog.from_config` unless strongly required to do so, since it may lead to inconsistencies between loaded and saved versions of the versioned datasets. - -```{warning} -The `DataCatalog` does not re-generate save versions between instantiations. Therefore, if you call `catalog.save('cars', some_data)` twice, then the second call will fail, since it tries to overwrite a versioned dataset using the same save version. To mitigate this, reload your data catalog by calling `%reload_kedro` line magic. This limitation does not apply to `load` operation. -``` - -### Versioning using the Code API - -Although we recommend enabling versioning using the `catalog.yml` config file as described in the section above, you might require more control over load and save versions of a specific dataset. To achieve this, you can instantiate `Version` and pass it as a parameter to the dataset initialisation: - -```python -from kedro.io import DataCatalog, Version -from kedro_datasets.pandas import CSVDataSet -import pandas as pd - -data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]}) -data2 = pd.DataFrame({"col1": [7], "col2": [8], "col3": [9]}) -version = Version( - load=None, # load the latest available version - save=None, # generate save version automatically on each save operation -) - -test_data_set = CSVDataSet( - filepath="data/01_raw/test.csv", save_args={"index": False}, version=version -) -io = DataCatalog({"test_data_set": test_data_set}) - -# save the dataset to data/01_raw/test.csv//test.csv -io.save("test_data_set", data1) -# save the dataset into a new file data/01_raw/test.csv//test.csv -io.save("test_data_set", data2) - -# load the latest version from data/test.csv/*/test.csv -reloaded = io.load("test_data_set") -assert data2.equals(reloaded) -``` - -```{note} -In the example above, we did not fix any versions. 
If we do, then the behaviour of load and save operations becomes slightly different: -``` - -```python -version = Version( - load="my_exact_version", # load exact version - save="my_exact_version", # save to exact version -) - -test_data_set = CSVDataSet( - filepath="data/01_raw/test.csv", save_args={"index": False}, version=version -) -io = DataCatalog({"test_data_set": test_data_set}) - -# save the dataset to data/01_raw/test.csv/my_exact_version/test.csv -io.save("test_data_set", data1) -# load from data/01_raw/test.csv/my_exact_version/test.csv -reloaded = io.load("test_data_set") -assert data1.equals(reloaded) - -# raises DataSetError since the path -# data/01_raw/test.csv/my_exact_version/test.csv already exists -io.save("test_data_set", data2) -``` - -```{warning} -We do not recommend passing exact load and/or save versions, since it might lead to inconsistencies between operations. For example, if versions for load and save operations do not match, a save operation would result in a `UserWarning` indicating that save and load versions do not match. Load after save might also return an error if the corresponding load version is not found: -``` - -```python -version = Version( - load="exact_load_version", # load exact version - save="exact_save_version", # save to exact version -) - -test_data_set = CSVDataSet( - filepath="data/01_raw/test.csv", save_args={"index": False}, version=version -) -io = DataCatalog({"test_data_set": test_data_set}) - -io.save("test_data_set", data1) # emits a UserWarning due to version inconsistency - -# raises DataSetError since the data/01_raw/test.csv/exact_load_version/test.csv -# file does not exist -reloaded = io.load("test_data_set") -``` - -### Supported datasets - -Currently, the following datasets support versioning: - -- `kedro_datasets.matplotlib.MatplotlibWriter` -- `kedro_datasets.holoviews.HoloviewsWriter` -- `kedro_datasets.networkx.NetworkXDataSet` -- `kedro_datasets.pandas.CSVDataSet` -- `kedro_datasets.pandas.ExcelDataSet` -- `kedro_datasets.pandas.FeatherDataSet` -- `kedro_datasets.pandas.HDFDataSet` -- `kedro_datasets.pandas.JSONDataSet` -- `kedro_datasets.pandas.ParquetDataSet` -- `kedro_datasets.pickle.PickleDataSet` -- `kedro_datasets.pillow.ImageDataSet` -- `kedro_datasets.text.TextDataSet` -- `kedro_datasets.spark.SparkDataSet` -- `kedro_datasets.yaml.YAMLDataSet` -- `kedro_datasets.api.APIDataSet` -- `kedro_datasets.tensorflow.TensorFlowModelDataSet` -- `kedro_datasets.json.JSONDataSet` - -```{note} -Although HTTP(S) is a supported file system in the dataset implementations, it does not support versioning. -``` - -## Partitioned dataset - -These days, distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible. 
This is why Kedro provides a built-in [PartitionedDataSet](/kedro.io.PartitionedDataSet), with the following features: @@ -252,9 +16,9 @@ This is why Kedro provides a built-in [PartitionedDataSet](/kedro.io.Partitioned In this section, each individual file inside a given location is called a partition. ``` -### Partitioned dataset definition +### How to use `PartitionedDataSet` -`PartitionedDataSet` definition can be put in your `catalog.yml` file like any other regular dataset definition. The definition represents the following structure: +You can use a `PartitionedDataSet` in `catalog.yml` file like any other regular dataset definition: ```yaml # conf/base/catalog.yml @@ -320,22 +84,22 @@ Here is an exhaustive list of the arguments supported by `PartitionedDataSet`: | `filepath_arg` | No | `str` (defaults to `filepath`) | Argument name of the underlying dataset initializer that will contain a path to an individual partition | | `filename_suffix` | No | `str` (defaults to an empty string) | If specified, partitions that don't end with this string will be ignored | -#### Dataset definition +### Dataset definition -Dataset definition should be passed into the `dataset` argument of the `PartitionedDataSet`. The dataset definition is used to instantiate a new dataset object for each individual partition, and use that dataset object for load and save operations. Dataset definition supports shorthand and full notations. +The dataset definition should be passed into the `dataset` argument of the `PartitionedDataSet`. The dataset definition is used to instantiate a new dataset object for each individual partition, and use that dataset object for load and save operations. Dataset definition supports shorthand and full notations. -##### Shorthand notation +#### Shorthand notation Requires you only to specify a class of the underlying dataset either as a string (e.g. `pandas.CSVDataSet` or a fully qualified class path like `kedro_datasets.pandas.CSVDataSet`) or as a class object that is a subclass of the [AbstractDataset](/kedro.io.AbstractDataset). -##### Full notation +#### Full notation Full notation allows you to specify a dictionary with the full underlying dataset definition _except_ the following arguments: * The argument that receives the partition path (`filepath` by default) - if specified, a `UserWarning` will be emitted stating that this value will be overridden by individual partition paths * `credentials` key - specifying it will result in a `DataSetError` being raised; dataset credentials should be passed into the `credentials` argument of the `PartitionedDataSet` rather than the underlying dataset definition - see the section below on [partitioned dataset credentials](#partitioned-dataset-credentials) for details * `versioned` flag - specifying it will result in a `DataSetError` being raised; versioning cannot be enabled for the underlying datasets -#### Partitioned dataset credentials +### Partitioned dataset credentials ```{note} Support for `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config. 
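+
+As an illustration of where the top-level `credentials` argument goes, here is a minimal, hypothetical sketch that instantiates a `PartitionedDataSet` directly through the Python API; the bucket path and credential values are placeholders, and in a real project these would normally come from `catalog.yml` and `credentials.yml`:
+
+```python
+from kedro.io import PartitionedDataSet
+
+# Placeholder credentials for an S3-compatible filesystem; by default the same
+# credentials are used for both the filesystem and the underlying dataset.
+credentials = {"key": "YOUR_KEY", "secret": "YOUR_SECRET"}
+
+partitioned = PartitionedDataSet(
+    path="s3://your_bucket/data/01_raw/partitions",
+    dataset="pandas.CSVDataSet",
+    credentials=credentials,
+)
+
+# load() returns a mapping of partition id to a function that loads that partition lazily
+partitions = partitioned.load()
+for partition_id, load_partition in partitions.items():
+    partition_data = load_partition()
+```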
@@ -414,7 +178,7 @@ new_partitioned_dataset: filename_suffix: ".csv" ``` -node definition: +Here is the node definition: ```python from kedro.pipeline import node @@ -422,7 +186,7 @@ from kedro.pipeline import node node(create_partitions, inputs=None, outputs="new_partitioned_dataset") ``` -and underlying node function `create_partitions`: +The underlying node function is as follows in `create_partitions`: ```python from typing import Any, Dict @@ -449,6 +213,7 @@ Writing to an existing partition may result in its data being overwritten, if th ### Partitioned dataset lazy saving `PartitionedDataSet` also supports lazy saving, where the partition's data is not materialised until it is time to write. + To use this, simply return `Callable` types in the dictionary: ```python @@ -473,8 +238,7 @@ def create_partitions() -> Dict[str, Callable[[], Any]]: ```{note} When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction). ``` - -### Incremental loads with `IncrementalDataSet` +## Incremental datasets [IncrementalDataSet](/kedro.io.IncrementalDataSet) is a subclass of `PartitionedDataSet`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataSet` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs. @@ -482,17 +246,17 @@ This checkpoint, by default, is persisted to the location of the data partitions The checkpoint file is only created _after_ [the partitioned dataset is explicitly confirmed](#incremental-dataset-confirm). -#### Incremental dataset load +### Incremental dataset loads Loading `IncrementalDataSet` works similarly to [`PartitionedDataSet`](#partitioned-dataset-load) with several exceptions: 1. `IncrementalDataSet` loads the data _eagerly_, so the values in the returned dictionary represent the actual data stored in the corresponding partition, rather than a pointer to the load function. `IncrementalDataSet` considers a partition relevant for processing if its ID satisfies the comparison function, given the checkpoint value. 2. `IncrementalDataSet` _does not_ raise a `DataSetError` if load finds no partitions to return - an empty dictionary is returned instead. An empty list of available partitions is part of a normal workflow for `IncrementalDataSet`. -#### Incremental dataset save +### Incremental dataset save The `IncrementalDataSet` save operation is identical to the [save operation of the `PartitionedDataSet`](#partitioned-dataset-save). -#### Incremental dataset confirm +### Incremental dataset confirm ```{note} The checkpoint value *is not* automatically updated when a new set of partitions is successfully loaded or saved. @@ -549,7 +313,7 @@ Important notes about the confirmation operation: * A pipeline cannot contain more than one node confirming the same dataset. -#### Checkpoint configuration +### Checkpoint configuration `IncrementalDataSet` does not require explicit configuration of the checkpoint unless there is a need to deviate from the defaults. To update the checkpoint configuration, add a `checkpoint` key containing the valid dataset configuration. This may be required if, say, the pipeline has read-only permissions to the location of partitions (or write operations are undesirable for any other reason). In such cases, `IncrementalDataSet` can be configured to save the checkpoint elsewhere. 
The `checkpoint` key also supports partial config updates where only some checkpoint attributes are overwritten, while the defaults are kept for the rest: @@ -565,7 +329,7 @@ my_partitioned_dataset: k1: v1 ``` -#### Special checkpoint config keys +### Special checkpoint config keys Along with the standard dataset attributes, `checkpoint` config also accepts two special optional keys: * `comparison_func` (defaults to `operator.gt`) - a fully qualified import path to the function that will be used to compare a partition ID with the checkpoint value, to determine whether a partition should be processed. Such functions must accept two positional string arguments - partition ID and checkpoint value - and return `True` if such partition is considered to be past the checkpoint. It might be useful to specify your own `comparison_func` if you need to customise the checkpoint filtration mechanism - for example, you might want to implement windowed loading, where you always want to load the partitions representing the last calendar month. See the example config specifying a custom comparison function: diff --git a/docs/source/deployment/argo.md b/docs/source/deployment/argo.md index f66b809b0e..9207debe3d 100644 --- a/docs/source/deployment/argo.md +++ b/docs/source/deployment/argo.md @@ -24,7 +24,7 @@ To use Argo Workflows, ensure you have the following prerequisites in place: - [Argo Workflows is installed](https://github.com/argoproj/argo/blob/master/README.md#quickstart) on your Kubernetes cluster - [Argo CLI is installed](https://github.com/argoproj/argo/releases) on your machine - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node) since it is used to build a DAG -- [All node input/output DataSets must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3); you cannot use the `MemoryDataSet` in your workflow +- [All node input/output DataSets must be configured in `catalog.yml`](../data/data_catalog_yaml_examples.md) and refer to an external location (e.g. AWS S3); you cannot use the `MemoryDataSet` in your workflow ```{note} Each node will run in its own container. diff --git a/docs/source/deployment/aws_batch.md b/docs/source/deployment/aws_batch.md index 976d5e9e5a..c83b58f8ea 100644 --- a/docs/source/deployment/aws_batch.md +++ b/docs/source/deployment/aws_batch.md @@ -18,7 +18,7 @@ To use AWS Batch, ensure you have the following prerequisites in place: - An [AWS account set up](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/). - A `name` attribute is set for each [Kedro node](/kedro.pipeline.node). Each node will run in its own Batch job, so having sensible node names will make it easier to `kedro run --node=`. -- [All node input/output `DataSets` must be configured in `catalog.yml`](../data/data_catalog.md#use-the-data-catalog-with-the-yaml-api) and refer to an external location (e.g. AWS S3). A clean way to do this is to create a new configuration environment `conf/aws_batch` containing a `catalog.yml` file with the appropriate configuration, as illustrated below. +- [All node input/output `DataSets` must be configured in `catalog.yml`](../data/data_catalog_yaml_examples.md) and refer to an external location (e.g. AWS S3). A clean way to do this is to create a new configuration environment `conf/aws_batch` containing a `catalog.yml` file with the appropriate configuration, as illustrated below.
Click to expand diff --git a/docs/source/deployment/databricks/databricks_deployment_workflow.md b/docs/source/deployment/databricks/databricks_deployment_workflow.md index 799a5044c1..245708e6bf 100644 --- a/docs/source/deployment/databricks/databricks_deployment_workflow.md +++ b/docs/source/deployment/databricks/databricks_deployment_workflow.md @@ -170,7 +170,7 @@ A Kedro project's configuration and data do not get included when it is packaged Your packaged Kedro project needs access to data and configuration in order to run. Therefore, you will need to upload your project's data and configuration to a location accessible to Databricks. In this guide, we will store the data on the Databricks File System (DBFS). -The `databricks-iris` starter contains a [catalog](../../data/data_catalog.md#the-data-catalog) that is set up to access data stored in DBFS (`/conf/`). You will point your project to use configuration stored on DBFS using the `--conf-source` option when you create your job on Databricks. +The `databricks-iris` starter contains a [catalog](../../data/data_catalog.md) that is set up to access data stored in DBFS (`/conf/`). You will point your project to use configuration stored on DBFS using the `--conf-source` option when you create your job on Databricks. There are several ways to upload data to DBFS: you can use the [DBFS API](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/dbfs), the [`dbutils` module](https://docs.databricks.com/dev-tools/databricks-utils.html) in a Databricks notebook or the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/dbfs-cli.html). In this guide, it is recommended to use the Databricks CLI because of the convenience it offers. diff --git a/docs/source/deployment/databricks/databricks_ide_development_workflow.md b/docs/source/deployment/databricks/databricks_ide_development_workflow.md index dc723189c9..2cf8f40ca2 100644 --- a/docs/source/deployment/databricks/databricks_ide_development_workflow.md +++ b/docs/source/deployment/databricks/databricks_ide_development_workflow.md @@ -142,7 +142,7 @@ Name the new folder `local`. In this guide, we have no local credentials to stor When run on Databricks, Kedro cannot access data stored in your project's directory. Therefore, you will need to upload your project's data to an accessible location. In this guide, we will store the data on the Databricks File System (DBFS). -The `databricks-iris` starter contains a [catalog](../../data/data_catalog.md#the-data-catalog) that is set up to access data stored in DBFS (`/conf/`). You will point your project to use configuration stored on DBFS using the `--conf-source` option when you create your job on Databricks. +The `databricks-iris` starter contains a [catalog](../../data/data_catalog.md) that is set up to access data stored in DBFS (`/conf/`). You will point your project to use configuration stored on DBFS using the `--conf-source` option when you create your job on Databricks. There are several ways to upload data to DBFS. In this guide, it is recommended to use [Databricks CLI](https://docs.databricks.com/dev-tools/cli/dbfs-cli.html) because of the convenience it offers. 
At the command line in your local environment, use the following Databricks CLI command to upload your locally stored data to DBFS: diff --git a/docs/source/development/commands_reference.md b/docs/source/development/commands_reference.md index 45801ea112..815bae91f8 100644 --- a/docs/source/development/commands_reference.md +++ b/docs/source/development/commands_reference.md @@ -498,7 +498,7 @@ kedro catalog list --pipeline=ds,de kedro catalog rank ``` -The output includes a list of any [dataset factories](../data/data_catalog.md#load-multiple-datasets-with-similar-configuration-using-dataset-factories) in the catalog, ranked by the priority on which they are matched against. +The output includes a list of any [dataset factories](../data/kedro_dataset_factories.md) in the catalog, ranked by the priority on which they are matched against. #### Data Catalog diff --git a/docs/source/experiment_tracking/index.md b/docs/source/experiment_tracking/index.md index a8e94dd05b..31bff89ee2 100644 --- a/docs/source/experiment_tracking/index.md +++ b/docs/source/experiment_tracking/index.md @@ -19,7 +19,7 @@ Kedro's [experiment tracking demo](https://demo.kedro.org/experiment-tracking) e ![](../meta/images/experiment-tracking_demo.gif) ## Kedro versions supporting experiment tracking -Kedro has always supported parameter versioning (as part of your codebase with a version control system like `git`) and Kedro’s dataset versioning capabilities enabled you to [snapshot models, datasets and plots](../data/data_catalog.md#version-datasets-and-ml-models). +Kedro has always supported parameter versioning (as part of your codebase with a version control system like `git`) and Kedro’s dataset versioning capabilities enabled you to [snapshot models, datasets and plots](../data/data_catalog.md#dataset-versioning). Kedro-Viz version 4.1.1 introduced metadata capture, visualisation, discovery and comparison, enabling you to access, edit and [compare your experiments](#access-run-data-and-compare-runs) and additionally [track how your metrics change over time](#view-and-compare-metrics-data). diff --git a/docs/source/extend_kedro/common_use_cases.md b/docs/source/extend_kedro/common_use_cases.md index 04b36d6ca5..9f8d32dc9f 100644 --- a/docs/source/extend_kedro/common_use_cases.md +++ b/docs/source/extend_kedro/common_use_cases.md @@ -12,7 +12,7 @@ This can now achieved by using [Hooks](../hooks/introduction.md), to define the ## Use Case 2: How to integrate Kedro with additional data sources -You can use [DataSets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](custom_datasets.md). +You can use [DataSets](/kedro_datasets) to interface with various different data sources. If the data source you plan to use is not supported out of the box by Kedro, you can [create a custom dataset](../data/how_to_create_a_custom_dataset.md). ## Use Case 3: How to add or modify CLI commands diff --git a/docs/source/extend_kedro/index.md b/docs/source/extend_kedro/index.md index f368ac9a73..fefa8e21f9 100644 --- a/docs/source/extend_kedro/index.md +++ b/docs/source/extend_kedro/index.md @@ -4,6 +4,5 @@ :maxdepth: 1 common_use_cases -custom_datasets plugins ``` diff --git a/docs/source/faq/faq.md b/docs/source/faq/faq.md index 75790690a9..23cfa6b094 100644 --- a/docs/source/faq/faq.md +++ b/docs/source/faq/faq.md @@ -39,9 +39,6 @@ This is a growing set of technical FAQs. 
The [product FAQs on the Kedro website] * [How do I use resolvers in the `OmegaConfigLoader`](../configuration/advanced_configuration.md#how-to-use-resolvers-in-the-omegaconfigloader)? * [How do I load credentials through environment variables](../configuration/advanced_configuration.md#how-to-load-credentials-through-environment-variables)? -## Datasets and the Data Catalog - -* [Can I read the same data file using two different dataset implementations](../data/data_catalog.md#transcode-datasets)? ## Nodes and pipelines diff --git a/docs/source/nodes_and_pipelines/nodes.md b/docs/source/nodes_and_pipelines/nodes.md index a41f147244..70835183dc 100644 --- a/docs/source/nodes_and_pipelines/nodes.md +++ b/docs/source/nodes_and_pipelines/nodes.md @@ -213,7 +213,7 @@ With `pandas` built-in support, you can use the `chunksize` argument to read dat ### Saving data with Generators To use generators to save data lazily, you need do three things: - Update the `make_prediction` function definition to use `return` instead of `yield`. -- Create a [custom dataset](../extend_kedro/custom_datasets.md) called `ChunkWiseCSVDataset` +- Create a [custom dataset](../data/how_to_create_a_custom_dataset.md) called `ChunkWiseCSVDataset` - Update `catalog.yml` to use a newly created `ChunkWiseCSVDataset`. Copy the following code to `nodes.py`. The main change is to use a new model `DecisionTreeClassifier` to make prediction by chunks in `make_predictions`. diff --git a/docs/source/notebooks_and_ipython/kedro_and_notebooks.md b/docs/source/notebooks_and_ipython/kedro_and_notebooks.md index d32139b2f8..8344b1346f 100644 --- a/docs/source/notebooks_and_ipython/kedro_and_notebooks.md +++ b/docs/source/notebooks_and_ipython/kedro_and_notebooks.md @@ -101,7 +101,7 @@ INFO Loading data from 'parameters' (MemoryDataSet)... ``` ```{note} -If you enable [versioning](../data/data_catalog.md#version-datasets-and-ml-models) you can load a particular version of a dataset, e.g. `catalog.load("example_train_x", version="2021-12-13T15.08.09.255Z")`. +If you enable [versioning](../data/data_catalog.md#dataset-versioning) you can load a particular version of a dataset, e.g. `catalog.load("example_train_x", version="2021-12-13T15.08.09.255Z")`. ``` ### `context` diff --git a/docs/source/tutorial/add_another_pipeline.md b/docs/source/tutorial/add_another_pipeline.md index 95093b5d0b..1ceba96edc 100644 --- a/docs/source/tutorial/add_another_pipeline.md +++ b/docs/source/tutorial/add_another_pipeline.md @@ -125,7 +125,7 @@ regressor: versioned: true ``` -By setting `versioned` to `true`, versioning is enabled for `regressor`. This means that the pickled output of the `regressor` is saved every time the pipeline runs, which stores the history of the models built using this pipeline. You can learn more in the [Versioning section](../data/kedro_io.md#versioning). +By setting `versioned` to `true`, versioning is enabled for `regressor`. This means that the pickled output of the `regressor` is saved every time the pipeline runs, which stores the history of the models built using this pipeline. You can learn more in the [later section about dataset and ML model versioning](../data/data_catalog.md#dataset-versioning). ## Data science pipeline diff --git a/docs/source/tutorial/set_up_data.md b/docs/source/tutorial/set_up_data.md index 364818b3a1..2315f04068 100644 --- a/docs/source/tutorial/set_up_data.md +++ b/docs/source/tutorial/set_up_data.md @@ -120,7 +120,7 @@ When you have finished, close `ipython` session with `exit()`. 
[Kedro supports numerous datasets](/kedro_datasets) out of the box, but you can also add support for any proprietary data format or filesystem. -You can find further information about [how to add support for custom datasets](../extend_kedro/custom_datasets.md) in specific documentation covering advanced usage. +You can find further information about [how to add support for custom datasets](../data/how_to_create_a_custom_dataset.md) in specific documentation covering advanced usage. ### Supported data locations diff --git a/setup.py b/setup.py index e78ea817a7..8d94b9c965 100644 --- a/setup.py +++ b/setup.py @@ -97,7 +97,7 @@ def _collect_requirements(requires): "sphinxcontrib-mermaid~=0.7.1", "myst-parser~=1.0.0", "Jinja2<3.1.0", - "kedro-datasets[all,pandas-deltatabledataset]~=1.5.1", + "kedro-datasets[all]~=1.5.3", ], "geopandas": _collect_requirements(geopandas_require), "matplotlib": _collect_requirements(matplotlib_require),