A Taxonomy of open-source and/or open-weights LLMs.
At the current pace of innovation, this table can quickly become outdated. Feel free to open PRs to improve it or keep it up to date.
Model Name | Institution | First release date | Source Code License | Weights License | Dataset License | Dataset Size | Dataset Language(s) | Model Size | Base Model | Training modality | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|
Flan-T5-{} | Google | October 2022 | Apache 2.0 | Apache 2.0 | N/A | N/A | Many languages | 80M-11B | T5 | Instruction fine-tuned | |
GPT4All | Nomic-ai | March 2023 | Apache License 2.0 | | | 800k examples | English | 11B Params | GPT-J | Instruction & dialog fine-tuned | |
LLaMA | Meta | February 2023 | GPL-3 | Non-commercial bespoke license | N/A | >1T tokens | 20 languages | 7B, 13B, 33B, and 65B | N/A | Causal LM | First highly performant "small" LLM |
Alpaca | Stanford | March 2023 | Apache License 2.0 | CC BY-NC 4.0 (LLaMA weight diff) | Claims CC BY-NC 4.0, but it is not clear that it is | 52k examples | English | 7B, 13B | LLaMA | Instruction fine-tuned | See note below |
Vicuña | UC Berkeley, UCSD, CMU, MBZUAI | March 2023 | Apache License 2.0 | Apache License 2.0 (LLaMA weight diff) | N/A | 70k examples | N/A | 13B | LLaMA | Instruction & dialog fine-tuned | See note and weight-diff sketch below |
Koala | BAIR (Berkeley) | April 2023 | Apache License 2.0 | Unclear [1] | N/A | >350k examples | N/A | 13B | LLaMA | Instruction & dialog fine-tuned | See note below |
FastChat-T5 | UC Berkeley, UCSD, CMU, MBZUAI | April 2023 | Apache License 2.0 | Unclear [2] | N/A | 70k examples | N/A | 3B | Flan-T5-XL | Dialog fine-tuned | |
Pythia | EleutherAI | April 2023 | Apache License 2.0 | Apache License 2.0 | "open source" [3] | 300B tokens | Mostly English (though multiple languages) | 70M-12B | GPT-NeoX | Causal LM | |
StableLM-Alpha | Stability AI | April 2023 | N/A | CC BY-SA 4.0 | N/A | 1.5T tokens | Mostly English (though multiple languages) | 3B-7B | GPT-NeoX | Causal LM | 4k context window; training code not available |
Dolly-v2 | Databricks | March 2023 | Apache License 2.0 | Apache License 2.0 | CC BY-SA 3.0 | 15k examples | English | 3B-12B | Pythia | Instruction fine-tuned | Not state-of-the-art, but one of the first commercially licensed models |
Cerebras-GPT | Cerebras | March 2023 | N/A | Apache License 2.0 | "open source" [3] | 300B tokens | Mostly English (though multiple languages) | 111M-13B | GPT-2 | Causal LM | Developed mostly as a demo of Cerebras's hardware capabilities; performance is sub-par |
WizardLM-{7/13/15/30} | WizardLM | April 2023 | Apache License 2.0 | CC BY-NC 4.0 (LLaMA weight diff) | CC BY-NC 4.0 | 70k instructions | Mostly English | 7B, 13B, 15B, 30B | LLaMA | Instruction fine-tuned | |
MPT-7B | Mosaic ML | May 2023 | Apache License 2.0 | Apache License 2.0 | Various datasets [4] | 1T tokens | Mostly English (though multiple languages) | 7B | MPT | Causal LM | Commercially usable, comparable performance to LLaMA, up to 65k context length (StoryWriter variant) |
Falcon-7B/40B | Technology Innovation Institute UAE | May 2023 | N/A | TII Falcon LLM License | Apache License 2.0 + other | 1T tokens | English, German, Spanish, French | 7B, 40B | N/A | Causal LM | Commercially usable up to capped revenues; top performer on the OpenLLM leaderboard as of its launch date |
Orca-13B | Microsoft Research | June 2023 | Unclear, but likely non-commercial | N/A | N/A | 6M examples | Mostly English | 13B | Presumably LLaMA | Instruction fine-tuned (explanation traces) | Pending publication of artifacts and their licenses; likely restrictive since it appears to be based on LLaMA |
[1] Claimed to be "subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Any other usage of the model weights, including but not limited to commercial usage, is strictly prohibited. "
[2] Uses ShareGPT data, which comes from users posting data from ChatGPT, whose terms and conditions are restrictive...
[3] The code to generate The Pile is MIT-licensed, and the data itself can be downloaded, no strings attached, from here. But nowhere does it say what the actual license of the dataset is, other than the claim that it is "open source".
[4] Includes a variety of data sources (mC4, C4, RedPajama, The Stack, Semantic Scholar): mostly public datasets built from public data, but each with a possibly different license.
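Several rows above (Alpaca, Vicuña, WizardLM) publish their weights only as a diff against the original LLaMA weights, so that the non-commercial LLaMA weights themselves are not redistributed. The snippet below is a minimal sketch of what applying such a diff means, assuming the delta is stored tensor-by-tensor as (fine-tuned minus base) in Hugging Face format; all paths are placeholders, and real releases ship their own scripts (FastChat's `apply_delta` for Vicuña, for example) that also handle tokenizer and vocabulary changes.

```python
# Minimal sketch (not an official script): reconstruct fine-tuned weights from
# the base LLaMA model plus a published "weight diff". All paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_PATH = "path/to/llama-13b-hf"            # original LLaMA weights, obtained separately
DELTA_PATH = "path/to/published-weight-diff"  # e.g. an Alpaca/Vicuña/WizardLM delta
OUT_PATH = "path/to/merged-model"

base = AutoModelForCausalLM.from_pretrained(BASE_PATH, torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained(DELTA_PATH, torch_dtype=torch.float16)

# Assumption: the delta stores (fine-tuned weight - base weight) for every tensor,
# so adding it back reconstructs the fine-tuned model without redistributing LLaMA.
delta_state = delta.state_dict()
with torch.no_grad():
    for name, param in base.named_parameters():
        param.add_(delta_state[name])

base.save_pretrained(OUT_PATH)
# The tokenizer usually ships with the delta (some releases add special tokens).
AutoTokenizer.from_pretrained(DELTA_PATH).save_pretrained(OUT_PATH)
```

If a release extended the vocabulary, the embedding shapes will not line up this way; in that case use the official conversion script for that model.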
- Alpaca: First evidence that a small amount of high-quality data can make a relatively small LLM competitive with much bigger models; LLaMA fine-tuned at a cost of ~600 USD (dataset generation + training). The instruction prompt format is sketched after this list.
- Vicuna: Another LLaMA-based model, which GPT-4 (used as a judge) grades above Alpaca and close to ChatGPT. Fine-tuned at a cost of ~300 USD.
- Koala: Another LLaMA-based model, fine-tuned on a large, partially proprietary dialog dataset, with performance comparable to Alpaca according to human evaluators.
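"Instruction fine-tuned" in the table above generally means training on records of the form {instruction, optional input, output} rendered into a fixed prompt template. The sketch below follows the templates published with Alpaca; the `format_example` helper and the sample record are illustrative only.

```python
# Sketch of Alpaca-style prompt formatting for instruction fine-tuning.
# The two templates follow the ones published with Alpaca; the helper and the
# sample record below are illustrative, not taken from any release.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_example(record: dict) -> str:
    """Render one {instruction, input, output} record into the training text."""
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    return template.format(**record) + record["output"]

if __name__ == "__main__":
    example = {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep enough.",
    }
    print(format_example(example))
```

The fine-tuning itself is then ordinary causal-LM training on these rendered strings, typically with the prompt tokens masked out of the loss so only the response is learned.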