Cosmopedia

Image generated by DALL-E, the prompt was generated by Mixtral-8x7B-Instruct-v0.1.

[🤗 Cosmopedia dataset] | [🤖 1B-LLM trained on Cosmopedia] | [📰 Blog post]


Description

Here you can find the code used for creating Cosmopedia, a dataset of synthetic textbooks, blog posts, stories, posts and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. It contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.

Cosmopedia covers a variety of topics: we tried to map the world knowledge present in web datasets like RefinedWeb and RedPajama, and to generate synthetic content that covers it. This is v0.1 of Cosmopedia, with ample room for improvement and for topics to be covered more comprehensively. We hope this dataset will help the community's research efforts in the increasingly intriguing domain of synthetic data.

The clusters of Cosmopedia.

You can also find a plot of file frequency per single-topic cluster in plots/topic_distpng.png.

Code structure

prompts: the code for building the prompts for each seed_data in Cosmopedia. In web_samples, you can also find pointers for the topic clustering we did.
generation: the code to run large-scale synthetic generations with llm-swarm using the prompts you built. Cosmopedia consists of 25B tokens and took more than 10k H100 GPU hours to generate.
deduplication: the script we used to run MinHash deduplication with datatrove.
decontamination: the code we used to run n-gram decontamination against evaluation benchmarks when training models on the dataset, such as cosmopedian-1b.
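
To illustrate the idea behind the decontamination step, here is a minimal sketch of n-gram decontamination in plain Python. It is not the code from the decontamination folder: the function names, the regex word tokenization, and the default n-gram size of 10 are all assumptions chosen for illustration. A sample is dropped if it shares at least one word n-gram with any benchmark text.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text` (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n):
    """Collect every n-gram that appears in any benchmark text."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(sample, index, n):
    """True if the sample shares at least one n-gram with the benchmark index."""
    return not ngrams(sample, n).isdisjoint(index)


def decontaminate(samples, benchmark_texts, n=10):
    """Keep only samples with no n-gram overlap (n=10 is an assumed default)."""
    index = build_benchmark_index(benchmark_texts, n)
    return [s for s in samples if not is_contaminated(s, index, n)]
```

Calling `decontaminate(samples, benchmark_texts, n=5)` returns the samples that have no 5-gram in common with the benchmark texts. A smaller `n` removes more aggressively, at the cost of more false positives on common phrases.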

