A Taxonomy of open-source and/or open-weights LLMs.
At the current pace of innovation, this table can quickly become outdated. Feel free to open PRs to improve it or keep it up to date.
Model Name | Institution | First release date | Source Code License | Weights License | Dataset License | Dataset Size | Dataset Language(s) | Model Size | Base Model | Training modality | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|
Flan-T5-{} | Google | October 2022 | Apache 2.0 | Apache 2.0 | N/A | N/A | Many languages | 80M-11B | T5 | Instruction fine-tuned | |
GPT4All | Nomic-ai | March 2023 | Apache License 2.0 | | | 800k examples | English | 11B Params | GPT-J | Instruction & dialog fine-tuned | |
LLaMA | Meta | February 2023 | GPL-3 | Non-commercial bespoke license | N/A | >1T tokens | 20 languages | 7B, 13B, 33B, and 65B | N/A | Causal LM | First highly performant "small" LLM |
Alpaca | Stanford | March 2023 | Apache License 2.0 | CC BY-NC 4.0 (LLaMA weight diff) | Claims CC BY-NC 4.0, but it is not clear that it is | 52k examples | English | 7B, 13B | LLaMA | Instruction fine-tuned | See note below |
Vicuña | UC Berkeley, UCSD, CMU, MBZUAI | March 2023 | Apache License 2.0 | Apache License 2.0 (LLaMA weight diff) | N/A | 70k examples | N/A | 13B | LLaMA | Instruction & dialog fine-tuned | See note and weight-diff sketch below |
Koala | BAIR (Berkeley) | April 2023 | Apache License 2.0 | Unclear [1] | N/A | >350k examples | N/A | 13B | LLaMA | Instruction & dialog fine-tuned | See note below |
FastChat-T5 | UC Berkeley, UCSD, CMU, MBZUAI | April 2023 | Apache License 2.0 | Unclear [2] | N/A | 70k examples | N/A | 3B | Flan-T5-XL | Dialog fine-tuned | |
Pythia | EleutherAI | April 2023 | Apache License 2.0 | Apache License 2.0 | "open source" [3] | 300B tokens | Mostly English (though multiple languages) | 70M-12B | GPT-NeoX | Causal LM | |
StableLM-Alpha | Stability AI | April 2023 | N/A | CC BY-SA 4.0 | N/A | 1.5T tokens | Mostly English (though multiple languages) | 3B-7B | GPT-NeoX | Causal LM | 4k context window; training code not available |
Dolly-v2 | Databricks | March 2023 | Apache License 2.0 | Apache License 2.0 | CC BY-SA 3.0 | 15k examples | English | 3B-12B | Pythia | Instruction fine-tuned | Not state-of-the-art, but one of the first commercially licensed models |
Cerebras-GPT | Cerebras | March 2023 | N/A | Apache License 2.0 | "open source" [3] | 300B tokens | Mostly English (though multiple languages) | 111M-13B | GPT-2 | Causal LM | Developed mostly as a demo of Cerebras's hardware capabilities; performance is sub-par |
WizardLM-{7/13/15/30} | WizardLM | April 2023 | Apache License 2.0 | CC BY-NC 4.0 (LLaMA weight diff) | CC BY-NC 4.0 | 70k instructions | Mostly English | 7B, 13B, 15B, 30B | LLaMA | Instruction fine-tuned | |
MPT-7B | Mosaic ML | May 2023 | Apache License 2.0 | Apache License 2.0 | Various datasets [4] | 1T tokens | Mostly English (though multiple languages) | 7B | MPT | Causal LM | Commercially usable, comparable performance to LLaMA, up to 65k context length (StoryWriter variant) |
Falcon-7B/40B | Technology Innovation Institute UAE | May 2023 | N/A | TII Falcon LLM License | Apache License 2.0 + other | 1T tokens | English, German, Spanish, French | 7B, 40B | N/A | Causal LM | Commercially usable up to capped revenues; top performer on the OpenLLM leaderboard as of its launch date |
Orca-13B | Microsoft Research | June 2023 | Unclear, but likely non-commercial | N/A | N/A | 6M examples | Mostly English | 13B | Presumably LLaMA | Instruction fine-tuned (explanation traces) | Pending publication of artifacts and their licenses; likely restrictive since it appears to be based on LLaMA |
[1] Claimed to be "subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Any other usage of the model weights, including but not limited to commercial usage, is strictly prohibited. "
[2] Uses ShareGPT data, which comes from users posting data from ChatGPT, whose terms and conditions are restrictive...
[3] The code to generate The Pile is MIT-licensed, and the data itself can be downloaded, no strings attached, from here. But nowhere does it say what the actual license of the dataset is, other than the claim that it is "open source".
[4] Includes a variety of data sources (mC4, C4, RedPajama, The Stack, Semantic Scholar): mostly public datasets built from public data, but each with a possibly different license.
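Several rows above (Alpaca, Vicuña, WizardLM) publish their weights only as a diff against the original LLaMA weights, so that the non-commercial LLaMA weights themselves are not redistributed. The snippet below is a minimal sketch of what applying such a diff means, assuming the delta is stored tensor-by-tensor as (fine-tuned minus base) in Hugging Face format; all paths are placeholders, and real releases ship their own scripts (FastChat's `apply_delta` for Vicuña, for example) that also handle tokenizer and vocabulary changes.

```python
# Minimal sketch (not an official script): reconstruct fine-tuned weights from
# the base LLaMA model plus a published "weight diff". All paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_PATH = "path/to/llama-13b-hf"            # original LLaMA weights, obtained separately
DELTA_PATH = "path/to/published-weight-diff"  # e.g. an Alpaca/Vicuña/WizardLM delta
OUT_PATH = "path/to/merged-model"

base = AutoModelForCausalLM.from_pretrained(BASE_PATH, torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained(DELTA_PATH, torch_dtype=torch.float16)

# Assumption: the delta stores (fine-tuned weight - base weight) for every tensor,
# so adding it back reconstructs the fine-tuned model without redistributing LLaMA.
delta_state = delta.state_dict()
with torch.no_grad():
    for name, param in base.named_parameters():
        param.add_(delta_state[name])

base.save_pretrained(OUT_PATH)
# The tokenizer usually ships with the delta (some releases add special tokens).
AutoTokenizer.from_pretrained(DELTA_PATH).save_pretrained(OUT_PATH)
```

If a release extended the vocabulary, the embedding shapes will not line up this way; in that case use the official conversion script for that model.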
- Alpaca: First evidence that a small amount of high-quality data can make a relatively small LLM competitive with much bigger models; LLaMA fine-tuned at a cost of ~600 USD (dataset generation + training). The instruction prompt format is sketched after this list.
- Vicuna: Another LLaMA-based model, which GPT-4 (used as a judge) grades above Alpaca and close to ChatGPT. Fine-tuned at a cost of ~300 USD.
- Koala: Another LLaMA-based model, fine-tuned on a large, partially proprietary dialog dataset, with performance comparable to Alpaca according to human evaluators.
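"Instruction fine-tuned" in the table above generally means training on records of the form {instruction, optional input, output} rendered into a fixed prompt template. The sketch below follows the templates published with Alpaca; the `format_example` helper and the sample record are illustrative only.

```python
# Sketch of Alpaca-style prompt formatting for instruction fine-tuning.
# The two templates follow the ones published with Alpaca; the helper and the
# sample record below are illustrative, not taken from any release.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_example(record: dict) -> str:
    """Render one {instruction, input, output} record into the training text."""
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    return template.format(**record) + record["output"]

if __name__ == "__main__":
    example = {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep enough.",
    }
    print(format_example(example))
```

The fine-tuning itself is then ordinary causal-LM training on these rendered strings, typically with the prompt tokens masked out of the loss so only the response is learned.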