This repository aims to provide a diverse and accessible collection of datasets that can be used to train OpenAssistant models. Our goal is to cover a wide range of topics, languages, and tasks. To see the datasets people are currently working on, please refer to the spreadsheet.
- Each dataset is organized into its own folder, which may include notebooks, processing scripts, markdown files and other materials that explain the dataset creation process
- The dataset files themselves are stored on Hugging Face
- The root `__init__.py` lists the dataset names and their corresponding Hugging Face datasets
- The final version of each dataset is pushed to the OpenAssistant organization on Hugging Face
- All data must be `UTF-8` encoded to simplify training!
To simplify the training process, all datasets must be `UTF-8` encoded and stored in one of these two formats:

- Parquet with the options `row_group_size=100` and `index=False`
- `jsonl` or `jsonl.gz`
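
For the `jsonl.gz` option, a minimal sketch (assuming the examples are already available as a list of Python dicts named `records`, a hypothetical variable) could look like this; the Parquet option is shown in the Hugging Face upload steps further below:

```python
import gzip
import json

# Hypothetical records; replace with your own data
records = [
    {"INSTRUCTION": "Name the capital of France.", "RESPONSE": "Paris.", "SOURCE": "example"},
]

# Write one JSON object per line, gzip-compressed and UTF-8 encoded
with gzip.open("dataset.jsonl.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```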
There are 4 types of datasets that are currently accepted:
- Instruction
- Multi-turn Dialog
- Safety
- Text-only
Instruction datasets are designed to align language models with human interactions. These can take the form of question-answer, request-response, task-solution pairs, and so on. The instruction dataset must include the following columns:
- INSTRUCTION (string): Instruction text
- RESPONSE (string): Expected response to the instruction
- SOURCE (string): Original data source short name, e.g. "wikipedia"
- METADATA (JSON string, optional): Any other useful information stored as a JSON string. For example, NSFW content can be marked as `{"nsfw": true}`
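
As an illustration (not part of the official pipeline), a hypothetical instruction record with these columns could be assembled and saved like this:

```python
import json

import pandas as pd

# One hypothetical instruction/response pair in the required column layout
rows = [
    {
        "INSTRUCTION": "Summarize the plot of Romeo and Juliet in one sentence.",
        "RESPONSE": "Two young lovers from feuding families die, and their deaths reconcile their houses.",
        "SOURCE": "example",
        "METADATA": json.dumps({"lang": "en", "nsfw": False}),
    }
]

df = pd.DataFrame(rows)
df.to_parquet("instruction_dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
```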
Multi-turn Dialog datasets are designed for conversations with multiple continuations. In this format, each conversation is represented as a tree structure, where each node is a message from the user or the assistant. For instance, Open-Assistant collects its data in a similar format (example).

The dataset must be a `jsonl` file with the following schema:
```
{
  "thread": {
    "text": "",     # Message text
    "role": "",     # Message role: "prompter" or "assistant"
    "meta": {},     # Optional message metadata, e.g. message rank, safety score, and so on
    "replies": []   # A list of replies, each with the same structure as "thread"
  },
  "source": "",     # Source of the conversation
  "meta": {}        # Optional metadata of the conversation
}
```
For example:
```json
{
  "thread": {
    "text": "What is the best programming language in 2023?",
    "role": "prompter",
    "meta": { "lang": "en" },
    "replies": [
      {
        "text": "It depends on the task that you are aiming to solve.",
        "role": "assistant",
        "meta": { "rank": 0 },
        "replies": [
          {
            "text": "I want to start learning to code",
            "role": "prompter",
            "meta": { "rank": 0 },
            "replies": []
          },
          {
            "text": "I want to make money",
            "role": "prompter",
            "meta": { "rank": 1 },
            "replies": []
          }
        ]
      },
      {
        "text": "Python is the best.",
        "role": "assistant",
        "meta": { "rank": 1 },
        "replies": []
      }
    ]
  },
  "source": "twitter",
  "meta": { "post_id": "..." }
}
```
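
To make the tree structure concrete, here is a small hypothetical helper (not part of this repository) that walks a thread and prints each message indented by its depth:

```python
import json

def walk_thread(node: dict, depth: int = 0) -> None:
    """Recursively print every message in a conversation tree."""
    print("  " * depth + f'{node["role"]}: {node["text"]}')
    for reply in node.get("replies", []):
        walk_thread(reply, depth + 1)

# Read a jsonl file (hypothetical path) with one conversation tree per line
with open("dialogs.jsonl", encoding="utf-8") as f:
    for line in f:
        conversation = json.loads(line)
        walk_thread(conversation["thread"])
```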
For datasets that are intended to train safety models, the prosocial format is proposed. The format is given below:
- USER (string): the potentially unsafe utterance
- RESPONSE (string, optional): the guiding utterance grounded on rules-of-thumb (rots)
- ROTs (List): the relevant rules-of-thumb for text not labeled as casual
- SAFETY_LABEL (string): the final verdict of the context according to safety_annotations: {casual, possibly_needs_caution, probably_needs_caution, needs_caution, needs_intervention}
- EPISODE_DONE (bool): an indicator of whether it is the end of the dialogue
- SOURCE (string, optional): the source of the seed text that was used to craft the first utterance of the dialogue: {socialchemistry, sbic, ethics_amt, ethics_reddit}
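
A hypothetical record in this format (the field values below are invented purely for illustration) might look like:

```python
# Illustrative safety record; values are invented, not real data
safety_example = {
    "USER": "I'm thinking about ghosting my friend because they annoyed me.",
    "RESPONSE": "It might be better to talk it through; cutting someone off without explanation can hurt them.",
    "ROTs": ["It's hurtful to ghost a friend without explanation."],
    "SAFETY_LABEL": "needs_caution",
    "EPISODE_DONE": True,
    "SOURCE": "socialchemistry",
}
```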
The text-only format is for datasets that do not fit any of the previous types. The dataset must include the following columns:
- TEXT (string)
- SOURCE (string)
- METADATA (JSON string, optional)
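
Likewise, a hypothetical text-only row (illustrative values only) would look like:

```python
import json

# Illustrative text-only record matching the columns above
text_record = {
    "TEXT": "The quick brown fox jumps over the lazy dog.",
    "SOURCE": "example",
    "METADATA": json.dumps({"lang": "en"}),
}
```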
The dataset must adhere to the following requirements:
- Must have a permissive license
- Must not contain child sexual abuse materials
- Must not contain materials with a private individual's personal information (e.g. name, address, phone number, government ID, or medical information)
To add a new dataset to OpenAssistant, follow these steps:
- Create an issue: Create a new issue and describe your proposal for the new dataset.
- Create a dataset on Hugging Face: Upload your dataset to Hugging Face. See below for more details.
- Make a pull request: Add a new dataset loading script to this folder and link the issue in the pull request description. For more information, see below.
To create a new dataset on Hugging Face, follow these steps:
First, convert your dataset into a Parquet file:

```python
import pandas as pd

# Create a pandas dataframe from your dataset file(s)
df = pd.read_json(...)  # or any other way

# Save the file in the Parquet format
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
```
Make sure the text data in the dataframe is properly encoded as `UTF-8`!
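
One optional way to catch badly decoded text early (an extra sanity check, not a required step) is to try encoding every text column:

```python
# Attempt to UTF-8 encode each text column; malformed strings
# (e.g. lone surrogates from a bad decode) raise UnicodeEncodeError.
for column in df.select_dtypes(include="object").columns:
    df[column].astype(str).str.encode("utf-8")
```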
Next, install the `huggingface_hub` package:

```
pip install huggingface_hub
```
Use your access token to log in:

- Via the terminal:

  ```
  huggingface-cli login
  ```

- In a Jupyter notebook (currently does not work in Visual Studio Code):

  ```python
  from huggingface_hub import notebook_login

  notebook_login()
  ```
Then push the dataset to the Hugging Face Hub:

```python
from datasets import Dataset

ds = Dataset.from_parquet("dataset.parquet")
ds.push_to_hub("your_huggingface_name/dataset_name")
```
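
To verify the upload (an optional check, not a required step), the dataset can be loaded back from the Hub:

```python
from datasets import load_dataset

# Replace with the repo id you pushed to above
ds = load_dataset("your_huggingface_name/dataset_name")
print(ds)
```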
Finally, update the `README.md` file of your dataset by visiting this link (substitute your Hugging Face name and the dataset name): `https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md`
- Create a folder with the name of your dataset.
- Add files that describe your dataset and its creation, such as a README, notebooks, scrapers, etc.
- Add your dataset to the parent `__init__.py`:

  ```python
  INSTRUCTION_DATASETS = {
      ...,
      "dataset_name": "your_huggingface_name/dataset_name",
  }
  ```
- Run `pre-commit run` to check your changes against the repository's pre-commit hooks before committing
- Submit a pull request and include a link to the issue it resolves in the description, for example: `Resolves #123`