Improve documentation on how to configure the dataset #3919

Closed
ElenaKhaustova opened this issue Jun 3, 2024 · 4 comments
Assignees
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@ElenaKhaustova
Contributor

ElenaKhaustova commented Jun 3, 2024

Description

Users struggle to understand how to configure datasets properly, which leads to frustration. They often miss the existence of the Kedro-Datasets component, and from the Kedro documentation alone they struggle to work out how to set up dataset parameters.

We propose adding a configuration example that links to the Kedro-Datasets documentation (https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1), showing specifically how to set up Kedro-related and dataset-related parameters.

Documentation page (if applicable)

https://docs.kedro.org/en/stable/data/data_catalog.html

Context

"They tend not to know the underlying library connected to the datasets. They need to be redirected to the right place in the documentation (e.g. pandas.CSVDataset API doc)" (C)

@astrojuanlu
Member

astrojuanlu commented Jun 6, 2024

Is better documentation enough to address this though?

For example, this was the first comment a user made when joining our Slack:

Hey folks! Just started using kedro. Is there any kedro command to import datasets from a path into my data directory in the project?

(https://linen-slack.kedro.org/t/9703502/hey-folks-just-started-using-kedro-is-there-any-kedro-comman#296704bb-7be1-419c-94b2-2429086acbea, cc @juanmarin00)

In the same way we have kedro pipeline create, we could have kedro dataset import /tmp/my_data.csv or something like that, populating the catalog for you.
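
To make the idea concrete, here is a rough sketch of what such a command might do under the hood. Everything in it is an assumption for illustration: `kedro dataset import` does not exist, `suggest_catalog_entry` and `EXTENSION_TO_TYPE` are hypothetical names, and `data/01_raw` is only assumed as the target directory. The dataset type strings are the ones used by Kedro-Datasets for pandas-backed files.

```python
from pathlib import PurePath

# Assumed mapping from file extension to a Kedro-Datasets type string.
EXTENSION_TO_TYPE = {
    ".csv": "pandas.CSVDataset",
    ".parquet": "pandas.ParquetDataset",
    ".json": "pandas.JSONDataset",
}


def suggest_catalog_entry(source_path, data_dir="data/01_raw"):
    """Hypothetical helper: build a catalog entry for a file on disk."""
    path = PurePath(source_path)
    dataset_type = EXTENSION_TO_TYPE.get(path.suffix.lower())
    if dataset_type is None:
        raise ValueError(f"No dataset type known for extension {path.suffix!r}")
    # The dataset name is taken from the file name without its extension.
    return {
        path.stem: {
            "type": dataset_type,
            "filepath": f"{data_dir}/{path.name}",
        }
    }


entry = suggest_catalog_entry("/tmp/my_data.csv")
print(entry)
```

The returned mapping has the same shape as a catalog.yml entry, so a real command could dump it to YAML and copy the file into the data directory.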

@astrojuanlu
Member

It's also unclear whether this is related to the DataCatalog API itself; it seems more of a Kedro DX issue in general.

@merelcht
Member

I'd be curious to know what's really meant by "configuring" a dataset. We have a huge amount of docs with YAML examples: https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html, but if that's not what users are looking for, then what is it they'd like to see?

@ElenaKhaustova ElenaKhaustova self-assigned this Jun 25, 2024
@ElenaKhaustova ElenaKhaustova moved this from To Do to In Progress in Kedro Framework Jun 25, 2024
@ElenaKhaustova ElenaKhaustova changed the title from "[DataCatalog]: Improve documentation on how to configure the dataset" to "Improve documentation on how to configure the dataset" Jun 25, 2024
@ElenaKhaustova
Contributor Author

ElenaKhaustova commented Jun 25, 2024

@astrojuanlu, @merelcht, what we got from the interviews is that less experienced users are missing the connection between the DataCatalog, the Dataset, and the actual Python package wrapped by a specific dataset implementation, for example pandas. When users want to add a dataset configuration to catalog.yml, it's not obvious to some of them that the top-level configuration parameters (filepath, load_args, etc.) are defined by the dataset implementation, while the contents of, for example, load_args are defined by the underlying library, such as pandas.

We can add a small example to the docs to clarify the dependency DataCatalog -> Dataset -> underlying library.
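
As a starting point, here is a minimal sketch of that dependency chain. It is not Kedro's implementation: `ToyCSVDataset` and `load_from_catalog` are hypothetical stand-ins, and the standard-library `csv` module plays the role of the underlying library so the snippet runs without extra dependencies. The point it tries to show is that `filepath` and `load_args` are defined by the dataset class, while the keys inside `load_args` are defined by the library the dataset wraps.

```python
import csv
import tempfile
from pathlib import Path


class ToyCSVDataset:
    """Hypothetical stand-in for a dataset class; not Kedro's implementation."""

    def __init__(self, filepath, load_args=None):
        # `filepath` and `load_args` are parameters defined by the dataset class.
        self._filepath = Path(filepath)
        # The keys inside `load_args` are defined by the underlying library,
        # here csv.reader (for example `delimiter`).
        self._load_args = dict(load_args or {})

    def load(self):
        with self._filepath.open(newline="") as handle:
            # load_args are forwarded unchanged to the underlying library.
            return list(csv.reader(handle, **self._load_args))


def load_from_catalog(catalog, name):
    """Toy catalog lookup: build the dataset from its config entry and load it."""
    config = dict(catalog[name])
    dataset_class = config.pop("type")
    return dataset_class(**config).load()


# Write a small semicolon-separated file to load.
csv_path = Path(tempfile.mkdtemp()) / "cars.csv"
csv_path.write_text("make;mpg\naudi;30\n")

# Mirrors the shape of a catalog.yml entry: name -> type, filepath, load_args.
catalog = {
    "cars": {
        "type": ToyCSVDataset,
        "filepath": str(csv_path),
        "load_args": {"delimiter": ";"},
    }
}

rows = load_from_catalog(catalog, "cars")
print(rows)
```

In a real project the same entry would live in catalog.yml with `type: pandas.CSVDataset`, and the keys under `load_args` would be keyword arguments of `pandas.read_csv` (for example `sep`), which is why users need to be pointed at the pandas documentation for those options.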

3 participants