From 7c262e65f82b929c83cb656d0f3bb7fec94cffe5 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Thu, 15 Aug 2024 12:13:10 +0200
Subject: [PATCH 1/4] remove more script docs

---
 docs/source/create_dataset.mdx | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/docs/source/create_dataset.mdx b/docs/source/create_dataset.mdx
index 3b855481448..c5e72daa742 100644
--- a/docs/source/create_dataset.mdx
+++ b/docs/source/create_dataset.mdx
@@ -7,6 +7,19 @@ In this tutorial, you'll learn how to use 🤗 Datasets low-code methods for cre
 * Folder-based builders for quickly creating an image or audio dataset
 * `from_` methods for creating datasets from local files
 
+## File-based builders
+
+🤗 Datasets supports many common formats such as `csv`, `json/jsonl`, `parquet`, `txt`.
+
+For example it can read a dataset made up of one or several CSV files (in this case, pass your CSV files as a list):
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("csv", data_files="my_file.csv")
+```
+
+To get the list of supported formats and code examples, follow this guide [here](https://huggingface.co/docs/datasets/loading#local-and-remote-files).
+
 ## Folder-based builders
 
 There are two folder-based builders, [`ImageFolder`] and [`AudioFolder`]. These are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders takes your data and automatically generates the dataset's features, splits, and labels. Under the hood:
@@ -61,8 +74,6 @@ squirtle.png, When it retracts its long neck into its shell, it squirts out wate
 
 To learn more about each of these folder-based builders, check out the and ImageFolder or AudioFolder guides.
 
-For similiar builders to load data from common formats such as `csv`, `json/jsonl`, `parquet`, and `txt` follow this guide [here](https://huggingface.co/docs/datasets/loading#local-and-remote-files)
-
 ## From local files
 
 You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the `from_` methods:
@@ -104,11 +115,3 @@ You can also create a dataset from local files by specifying the path to the dat
 ```py
 >>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
 ```
-
-## Next steps
-
-We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, and are not well supported on Hugging Face. Though in some rare cases it can still be helpful.
-
-To learn more about how to write loading scripts, take a look at the image loading script, audio loading script, and text loading script guides.
-
-Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
From da53207d83d911e2884dbf2504a3fd6ad6f5220d Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Thu, 15 Aug 2024 12:14:05 +0200
Subject: [PATCH 2/4] minor

---
 docs/source/create_dataset.mdx | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/create_dataset.mdx b/docs/source/create_dataset.mdx
index c5e72daa742..0884ed5d577 100644
--- a/docs/source/create_dataset.mdx
+++ b/docs/source/create_dataset.mdx
@@ -115,3 +115,5 @@ You can also create a dataset from local files by specifying the path to the dat
 ```py
 >>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
 ```
+
+Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.

From e66ff347dbc29cf0523ea179c980e7d638922cef Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Thu, 15 Aug 2024 12:15:31 +0200
Subject: [PATCH 3/4] minor

---
 docs/source/create_dataset.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/create_dataset.mdx b/docs/source/create_dataset.mdx
index 0884ed5d577..3cf04a52d5e 100644
--- a/docs/source/create_dataset.mdx
+++ b/docs/source/create_dataset.mdx
@@ -74,7 +74,7 @@ squirtle.png, When it retracts its long neck into its shell, it squirts out wate
 
 To learn more about each of these folder-based builders, check out the and ImageFolder or AudioFolder guides.
 
-## From local files
+## From local files paths
 
 You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the `from_` methods:
 

From 79ab3e965a436b9819cc8e8e8e67bdf54e235276 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Thu, 15 Aug 2024 12:17:58 +0200
Subject: [PATCH 4/4] minor

---
 docs/source/create_dataset.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/create_dataset.mdx b/docs/source/create_dataset.mdx
index 3cf04a52d5e..7f12b2575c6 100644
--- a/docs/source/create_dataset.mdx
+++ b/docs/source/create_dataset.mdx
@@ -74,9 +74,9 @@ squirtle.png, When it retracts its long neck into its shell, it squirts out wate
 
 To learn more about each of these folder-based builders, check out the and ImageFolder or AudioFolder guides.
 
-## From local files paths
+## From Python dictionaries
 
-You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the `from_` methods:
+You can also create a dataset from data in Python dictionaries. There are two ways you can create a dataset using the `from_` methods:
 
 * The [`~Dataset.from_generator`] method is the most memory-efficient way to create a dataset from a [generator](https://wiki.python.org/moin/Generators) due to a generators iterative behavior. This is especially useful when you're working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped.