A flexible open-source framework to generate datasets with large language models.
- [10/23] We released the first version of this repository on PyPI. You can install it via `pip install fabricator-ai`.
- [10/23] Our paper got accepted at EMNLP 2023. You can find the preprint here. You can find the experimental scripts under release v0.1.0.
- [09/23] Support for `gpt-3.5-turbo-instruct` added in the new Haystack release!
- [08/23] Added several experimental scripts to investigate the generation and annotation abilities of `gpt-3.5-turbo` on various downstream tasks, plus the influence of few-shot examples on performance for different downstream tasks.
- [07/23] Refactored major classes - you can now simply use our `BasePrompt` class to create your own customized prompts for every downstream task!
- [07/23] Added dataset transformations for token classification to prompt LLMs with textual spans rather than with lists of tags.
- [06/23] Initial version of fabricator supporting text classification and question answering tasks.
This repository:
- is an easy-to-use open-source library to generate datasets with large language models. If you want to train a model on a specific domain / label distribution / downstream task, you can use this framework to generate a dataset for it.
- builds on top of deepset's haystack and huggingface's datasets libraries. Thus, we support a wide range of language models, and you can load and use the generated datasets for your model training just as you would with the Datasets library.
- is highly flexible and offers various adaptation options such as prompt customization, integration and sampling of fewshot examples, or annotation of unlabeled datasets.
Using conda:

```bash
git clone git@github.com:flairNLP/fabricator.git
cd fabricator
conda create -y -n fabricator python=3.10
conda activate fabricator
pip install fabricator-ai
```

If you want to install in editable mode, you can use the following command:

```bash
pip install -e .
```
This framework is based on the idea of using large language models to generate datasets for specific tasks. To do so, we need four basic modules: a dataset, a prompt, a language model and a generator:
- Dataset: We use huggingface's datasets library to load fewshot or unlabeled datasets and store the generated or annotated datasets with their `Dataset` class. Once created, you can share the dataset with others via the hub or use it for your model training.
- Prompt: A prompt is the instruction given to the language model. It can be a simple sentence or a more complex template with placeholders. We provide an easy interface for custom dataset generation prompts in which you can specify label options for the LLM to choose from, provide fewshot examples to support the prompt, or annotate an unlabeled dataset in a specific way.
- LLM: We use deepset's haystack library as our LLM interface. deepset supports a wide range of LLMs including OpenAI, all models from the HuggingFace model hub and many more.
- Generator: The generator is the core of this framework. It takes a dataset, a prompt, and an LLM, and generates a dataset based on your specifications.
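Conceptually, the generator iterates over the (optional) input dataset, renders the prompt for each example, queries the LLM, and collects the outputs into a new dataset. The loop can be sketched in plain Python (the names below are illustrative stand-ins, not fabricator's actual API, and a stub replaces a real LLM):

```python
# Illustrative sketch of the generate loop; a stub stands in for a real LLM.
from typing import Callable


def render_prompt(task_description: str, example: dict) -> str:
    """Fill the prompt template with fields from one unlabeled example."""
    return f"{task_description}\nText: {example['text']}\nLabel:"


def generate(dataset: list[dict], task_description: str,
             llm: Callable[[str], str], max_prompt_calls: int) -> list[dict]:
    """Annotate each example by prompting the LLM, up to a call budget."""
    generated = []
    for example in dataset[:max_prompt_calls]:
        prompt = render_prompt(task_description, example)
        generated.append({**example, "label": llm(prompt)})
    return generated


# Stub LLM: pretends every review is positive.
stub_llm = lambda prompt: "positive"

dataset = [{"text": "This movie is great!"}, {"text": "Boring plot."}]
result = generate(dataset, "Classify the sentiment.", stub_llm, max_prompt_calls=10)
```

In the real framework, the LLM call and the prompt rendering are handled by haystack's `PromptNode` and our prompt classes, as shown in the quickstart below.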
With our library, you can generate datasets for any task you want. You can start as simple as that:
```python
import os

from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

prompt = BasePrompt(
    task_description="Generate a short movie review.",
)

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ.get("OPENAI_API_KEY"),
    max_length=100,
)

generator = DatasetGenerator(prompt_node)

generated_dataset = generator.generate(
    prompt_template=prompt,
    max_prompt_calls=10,
)

generated_dataset.push_to_hub("your-first-generated-dataset")
```
In our tutorial, we show how to create classification datasets with label options to choose from, how to include fewshot examples, and how to annotate unlabeled data into predefined categories.
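To give a feel for what such prompts look like, here is a plain-Python sketch of how label options and fewshot examples might be rendered into the final prompt string (illustrative only; fabricator's actual template syntax may differ - see the tutorial for the real interface):

```python
def build_prompt(task_description: str, label_options: list[str],
                 fewshot_examples: list[dict], new_text: str) -> str:
    """Render a classification prompt with label options and fewshot examples."""
    # The task description carries a placeholder for the allowed labels.
    lines = [task_description.format(", ".join(label_options))]
    # Each fewshot example is shown as a solved text/label pair.
    for ex in fewshot_examples:
        lines.append(f"Text: {ex['text']}\nLabel: {ex['label']}")
    # The new example is left open for the LLM to complete.
    lines.append(f"Text: {new_text}\nLabel:")
    return "\n\n".join(lines)


prompt = build_prompt(
    task_description="Classify the review as one of: {}.",
    label_options=["positive", "negative"],
    fewshot_examples=[{"text": "Loved it!", "label": "positive"}],
    new_text="Waste of time.",
)
print(prompt)
```

Constraining the LLM with explicit label options and a handful of solved examples is what makes the generated annotations consistent enough to train on.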
If you find this repository useful, please cite our work.
@inproceedings{golde2023fabricator,
title = "Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher {LLM}s",
author = "Golde, Jonas and Haller, Patrick and Hamborg, Felix and Risch, Julian and Akbik, Alan",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-demo.1",
pages = "1--11",
}