Text completion dataset docs #1696

Merged · 3 commits · Sep 26, 2024
Changes from 2 commits
2 changes: 1 addition & 1 deletion docs/source/basics/datasets_overview.rst
@@ -18,7 +18,7 @@ The following tasks are supported:
 - RLHF
   - :ref:`preference_dataset_usage_label`
 - Continued pre-training
-  - Text Completion Datasets
+  - :ref:`text_completion_dataset_usage_label`

 Data pipeline
 -------------
2 changes: 1 addition & 1 deletion docs/source/basics/preference_datasets.rst
@@ -57,7 +57,7 @@ Example local preference dataset
     "chosen": "chosen_conversations",
     "rejected": "rejected_conversations"
 }
-dataset = preference_dataset(
+ds = preference_dataset(
     tokenizer=tokenizer,
     source="json",
     column_map=column_map,
155 changes: 155 additions & 0 deletions docs/source/basics/text_completion_datasets.rst
@@ -0,0 +1,155 @@
.. _text_completion_dataset_usage_label:

========================
Text-completion Datasets
========================


Text-completion datasets are typically used for continued pre-training paradigms which involve
fine-tuning a base model on an unstructured, unlabelled dataset in a self-supervised manner.

The primary entry point for fine-tuning with text completion datasets in torchtune is :func:`~torchtune.datasets.text_completion_dataset`.
Text completion datasets are simply expected to contain a single column with the raw text for each sample. By default this column is named ``"text"``, and a different column name can be specified via the ``column`` argument.
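
For example, if the data already stores each sample under the default ``"text"`` key, the ``column`` argument can be omitted. Below is a minimal sketch along those lines, assuming a hypothetical local file ``my_data.json`` whose samples look like ``{"text": "..."}``:

.. code-block:: python

    from torchtune.models.llama3 import llama3_tokenizer
    from torchtune.datasets import text_completion_dataset

    m_tokenizer = llama3_tokenizer(
        path="/tmp/Meta-Llama-3.1-8B/original/tokenizer.model",
        max_seq_len=8192
    )

    # ``column`` defaults to "text", so it is omitted here; my_data.json
    # is a hypothetical file with samples of the form {"text": "..."}.
    ds = text_completion_dataset(
        tokenizer=m_tokenizer,
        source="json",
        data_files="my_data.json",
        split="train",
    )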


Example local text completion datasets
--------------------------------------

``.json`` format
^^^^^^^^^^^^^^^^

.. code-block:: bash

    # odyssey.json
    [
        {
            "input": "After we were clear of the river Oceanus, and had got out into the open sea, we went on till we reached the Aeaean island where there is dawn and sunrise as in other places. We then drew our ship on to the sands and got out of her on to the shore, where we went to sleep and waited till day should break."
        },
        {
            "input": "Then, when the child of morning, rosy-fingered Dawn, appeared, I sent some men to Circe's house to fetch the body of Elpenor. We cut firewood from a wood where the headland jutted out into the sea, and after we had wept over him and lamented him we performed his funeral rites. When his body and armour had been burned to ashes, we raised a cairn, set a stone over it, and at the top of the cairn we fixed the oar that he had been used to row with."
        }
    ]

.. code-block:: python

    from torchtune.models.llama3 import llama3_tokenizer
    from torchtune.datasets import text_completion_dataset

    m_tokenizer = llama3_tokenizer(
        path="/tmp/Meta-Llama-3.1-8B/original/tokenizer.model",
        max_seq_len=8192
    )

    ds = text_completion_dataset(
        tokenizer=m_tokenizer,
        source="json",
        column="input",
        data_files="odyssey.json",
        split="train",
    )
    tokenized_dict = ds[0]
    print(m_tokenizer.decode(tokenized_dict["tokens"]))
    # After we were clear of the river Oceanus, and had got out into the open sea,\
    # we went on till we reached the Aeaean island where there is dawn and sunrise \
    # as in other places. We then drew our ship on to the sands and got out of her on \
    # to the shore, where we went to sleep and waited till day should break.
    print(tokenized_dict["labels"])
    # [128000, 6153, 584, 1051, 2867, 315, 279, 15140, 22302, 355, 11, 323, 1047, \
    # 2751, 704, 1139, 279, 1825, 9581, 11, 584, 4024, 389, 12222, 584, 8813, 279, \
    # 362, 12791, 5420, 13218, 1405, 1070, 374, 39493, 323, 64919, 439, 304, 1023, \
    # 7634, 13, 1226, 1243, 24465, 1057, 8448, 389, 311, 279, 70163, 323, 2751, 704, \
    # 315, 1077, 389, 311, 279, 31284, 11, 1405, 584, 4024, 311, 6212, 323, 30315, \
    # 12222, 1938, 1288, 1464, 13, 128001]


This can also be accomplished via the yaml config:

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.llama3.llama3_tokenizer
      path: /tmp/Meta-Llama-3.1-8B/original/tokenizer.model
      max_seq_len: 8192

    dataset:
      _component_: torchtune.datasets.text_completion_dataset
      source: json
      data_files: odyssey.json
      column: input
      split: train

``.txt`` format
^^^^^^^^^^^^^^^

.. code-block:: text

    # odyssey.txt

    After we were clear of the river Oceanus, and had got out into the open sea, we went on till we reached the Aeaean island where there is dawn and sunrise as in other places. We then drew our ship on to the sands and got out of her on to the shore, where we went to sleep and waited till day should break.
    Then, when the child of morning, rosy-fingered Dawn, appeared, I sent some men to Circe's house to fetch the body of Elpenor. We cut firewood from a wood where the headland jutted out into the sea, and after we had wept over him and lamented him we performed his funeral rites. When his body and armour had been burned to ashes, we raised a cairn, set a stone over it, and at the top of the cairn we fixed the oar that he had been used to row with.


.. code-block:: python

    from torchtune.models.llama3 import llama3_tokenizer
    from torchtune.datasets import text_completion_dataset

    m_tokenizer = llama3_tokenizer(
        path="/tmp/Meta-Llama-3.1-8B/original/tokenizer.model",
        max_seq_len=8192
    )

    ds = text_completion_dataset(
        tokenizer=m_tokenizer,
        source="text",
        data_files="odyssey.txt",
        split="train",
    )
    # the outputs here are identical to above

Similarly, this can also be accomplished via the yaml config:

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.llama3.llama3_tokenizer
      path: /tmp/Meta-Llama-3.1-8B/original/tokenizer.model
      max_seq_len: 8192

    dataset:
      _component_: torchtune.datasets.text_completion_dataset
      source: text
      data_files: odyssey.txt
      split: train

Loading text completion datasets from Hugging Face
--------------------------------------------------

To load a text completion dataset from Hugging Face, pass the dataset repo name to ``source``. For most HF datasets, you will also need to specify the ``split``.

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import text_completion_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = text_completion_dataset(
        tokenizer=g_tokenizer,
        source="wikimedia/wikipedia",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe so we don't need it here
    dataset:
      _component_: torchtune.datasets.text_completion_dataset
      source: wikimedia/wikipedia
      split: train
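
Note that ``data_files`` in the earlier examples, like any keyword argument that :func:`~torchtune.datasets.text_completion_dataset` does not consume itself, is forwarded to Hugging Face's ``load_dataset``. A dataset config name can be passed the same way. Below is a minimal sketch; ``20231101.en`` is assumed to be a valid ``wikimedia/wikipedia`` config name, so check the dataset card for the configs that actually exist:

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import text_completion_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")

    # "name" is not a parameter of text_completion_dataset itself; it is
    # forwarded to load_dataset to select a dataset config (assumed name).
    ds = text_completion_dataset(
        tokenizer=g_tokenizer,
        source="wikimedia/wikipedia",
        name="20231101.en",
        split="train",
    )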


Built-in text completion datasets
---------------------------------
- :func:`~torchtune.datasets.cnn_dailymail_articles_dataset`
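
These builders wrap :func:`~torchtune.datasets.text_completion_dataset` with the dataset source and text column pre-configured, so they can typically be called with just a tokenizer. A minimal sketch, assuming the default arguments:

.. code-block:: python

    from torchtune.models.llama3 import llama3_tokenizer
    from torchtune.datasets import cnn_dailymail_articles_dataset

    m_tokenizer = llama3_tokenizer(
        path="/tmp/Meta-Llama-3.1-8B/original/tokenizer.model",
        max_seq_len=8192
    )

    # Loads the CNN / DailyMail articles and tokenizes each one for
    # text completion, leaving all other arguments at their defaults.
    ds = cnn_dailymail_articles_dataset(tokenizer=m_tokenizer)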
11 changes: 6 additions & 5 deletions docs/source/index.rst
@@ -114,15 +114,16 @@ torchtune tutorials.
    :hidden:

    basics/datasets_overview
-   basics/chat_datasets
-   basics/instruct_datasets
-   basics/multimodal_datasets
-   basics/preference_datasets
+   basics/text_completion_datasets
+   basics/model_transforms
    basics/messages
    basics/message_transforms
+   basics/instruct_datasets
+   basics/chat_datasets
    basics/tokenizers
    basics/prompt_templates
+   basics/preference_datasets
+   basics/multimodal_datasets
-   basics/model_transforms

.. toctree::
:glob: