Awesome LLM Training

Pretrain

  • Efficient Continual Pre-training by Mitigating the Stability Gap, arXiv, 2406.14833, arxiv, pdf, cication: -1

    Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen

  • Instruction Pre-Training: Language Models are Supervised Multitask Learners, arXiv, 2406.14491, arxiv, pdf, cication: -1

    Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei · (LMOps - microsoft) Star

  • Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training, arXiv, 2405.15319, arxiv, pdf, cication: -1

    Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu · (llm-stacking.github)

  • Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining, arXiv, 2405.14908, arxiv, pdf, cication: -1

    Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

  • Pre-training Small Base LMs with Fewer Tokens, arXiv, 2404.08634, arxiv, pdf, cication: -1

    Sunny Sanyal, Sujay Sanghavi, Alexandros G. Dimakis · (LLM-Inheritune - sanyalsunny111) Star

  • Training LLMs over Neurally Compressed Text, arXiv, 2404.03626, arxiv, pdf, cication: -1

    Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant

  • The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis, arXiv, 2404.01204, arxiv, pdf, cication: -1

    Chen Yang, Junzhuo Li, Xinyao Niu, Xinrun Du, Songyang Gao, Haoran Zhang, Zhaoliang Chen, Xingwei Qu, Ruibin Yuan, Yizhi Li

  • Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training, arXiv, 2403.00758, arxiv, pdf, cication: -1

    Qingyan Guo, Rui Wang, Junliang Guo, Xu Tan, Jiang Bian, Yujiu Yang

  • Reverse Training to Nurse the Reversal Curse, arXiv, 2403.13799, arxiv, pdf, cication: -1

    Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, Sainbayar Sukhbaatar

  • Language models scale reliably with over-training and on downstream tasks, arXiv, 2403.08540, arxiv, pdf, cication: -1

    Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh

  • fm-cheatsheet - allenai Star

    Website for hosting the Open Foundation Models Cheat Sheet. · (fmcheatsheet)

  • SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection, arXiv, 2401.13160, arxiv, pdf, cication: -1

    Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar

  • MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications, arXiv, 2310.15777, arxiv, pdf, cication: -1

    Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao · (jiqizhixin)

  • In-Context Pretraining: Language Modeling Beyond Document Boundaries, arXiv, 2310.10638, arxiv, pdf, cication: -1

    Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis

  • When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale, arXiv, 2309.04564, arxiv, pdf, cication: 3

    Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

  • metaseq - facebookresearch Star

    · (mp.weixin.qq) · (bilibili)

Other

Scaling

  • Scaling Laws for Linear Complexity Language Models, arXiv, 2406.16690, arxiv, pdf, cication: -1

    Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong

  • D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models, arXiv, 2406.01375, arxiv, pdf, cication: -1

    Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang

  • Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?, arXiv, 2406.04391, arxiv, pdf, cication: -1

    Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

  • Observational Scaling Laws and the Predictability of Language Model Performance, arXiv, 2405.10938, arxiv, pdf, cication: -1

    Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto

  • Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory, arXiv, 2405.08707, arxiv, pdf, cication: -1

    Xueyan Niu, Bo Bai, Lei Deng, Wei Han

  • Chinchilla Scaling: A replication attempt, arXiv, 2404.10102, arxiv, pdf, cication: -1

    Tamay Besiroglu, Ege Erdil, Matthew Barnett, Josh You

  • Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws, arXiv, 2404.05405, arxiv, pdf, cication: -1

    Zeyuan Allen-Zhu, Yuanzhi Li

  • DiPaCo: Distributed Path Composition, arXiv, 2403.10616, arxiv, pdf, cication: -1

    Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam

  • Algorithmic progress in language models, arXiv, 2403.05812, arxiv, pdf, cication: -1

    Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla

    · (mp.weixin.qq)

    • since 2012, the computational efficiency for pretraining language models (including large language models) has doubled approximately every 8 months, a pace much faster than the hardware advancements predicted by Moore's Law.
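
As a rough illustration of what an 8-month doubling time implies, the sketch below computes the compute-equivalent multiplier after a given number of years; the helper name and the printed horizons are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope sketch: compute-equivalent gain implied by an
# ~8-month doubling time in algorithmic efficiency (illustrative only).

def algorithmic_gain(years: float, doubling_months: float = 8.0) -> float:
    """Multiplier by which required pretraining compute shrinks after `years`."""
    return 2.0 ** (years * 12.0 / doubling_months)

if __name__ == "__main__":
    for years in (1, 4, 10):
        print(f"{years:>2} years -> ~{algorithmic_gain(years):,.0f}x less compute "
              "for the same loss")
```
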
  • When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method, arXiv, 2402.17193, arxiv, pdf, cication: -1

    Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat

  • MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, arXiv, 2402.15627, arxiv, pdf, cication: -1

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong

  • OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning, arXiv, 2402.06954, arxiv, pdf, cication: -1

    Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, Siheng Chen · (openfedllm - rui-ye) Star

  • A Tale of Tails: Model Collapse as a Change of Scaling Laws, arXiv, 2402.07043, arxiv, pdf, cication: -1

    Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe

  • Scaling Laws for Downstream Task Performance of Large Language Models, arXiv, 2402.04177, arxiv, pdf, cication: -1

    Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo

  • T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives, arXiv, 2401.16677, arxiv, pdf, cication: -1

    Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair

  • Zero Bubble Pipeline Parallelism, arXiv, 2401.10241, arxiv, pdf, cication: -1

    Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin · (zero-bubble-pipeline-parallelism - sail-sg) Star

  • Asynchronous Local-SGD Training for Language Modeling, arXiv, 2401.09135, arxiv, pdf, cication: -1

    Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato

  • DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, arXiv, 2401.02954, arxiv, pdf, cication: -1

    DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong

  • Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws, arXiv, 2401.00448, arxiv, pdf, cication: -1

    Nikhil Sardana, Jonathan Frankle

  • Unicron: Economizing Self-Healing LLM Training at Scale, arXiv, 2401.00134, arxiv, pdf, cication: -1

    Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou

  • SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling, arXiv, 2312.15166, arxiv, pdf, cication: -1

    Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim

  • Distributed Inference and Fine-tuning of Large Language Models Over The Internet, arXiv, 2312.08361, arxiv, pdf, cication: -1

    Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel

    · (petals - bigscience-workshop) Star

  • Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models, arXiv, 2312.06109, arxiv, pdf, cication: -1

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang

  • EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, arXiv, 2312.04916, arxiv, pdf, cication: -1

    Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou · (EE-LLM - pan-x-c) Star

  • DiLoCo: Distributed Low-Communication Training of Language Models, arXiv, 2311.08105, arxiv, pdf, cication: -1

    Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen

  • Microscaling Data Formats for Deep Learning, arXiv, 2310.10537, arxiv, pdf, cication: -1

    Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf

  • A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale, arXiv, 2309.06497, arxiv, pdf, cication: -1

    Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, Michael Rabbat

  • Scaling Laws for Sparsely-Connected Foundation Models, arXiv, 2309.08520, arxiv, pdf, cication: 1

    Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci

  • Scaling TransNormer to 175 Billion Parameters, arXiv, 2307.14995, arxiv, pdf, cication: 1

    Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Fei Yuan, Xiao Luo · (jiqizhixin)

  • Go smol or go home | Harm de Vries

  • Inverse Scaling: When Bigger Isn't Better, arXiv, 2306.09479, arxiv, pdf, cication: 15

    Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu

  • Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training, arXiv, 2306.08055, arxiv, pdf, cication: 1

    Abraham J. Fetterman, Ellie Kitanidis, Joshua Albrecht, Zachary Polizzi, Bryden Fogelman, Maksis Knutins, Bartosz Wróblewski, James B. Simon, Kanjun Qiu

  • To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis, arXiv, 2305.13230, arxiv, pdf, cication: -1

    Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You

  • Training Compute-Optimal Large Language Models, arXiv, 2203.15556, arxiv, pdf, cication: 202

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark

  • Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, ICML, 2023, arxiv, pdf, cication: -1

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff

Other

Fine-tuning

  • LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models, arXiv, 2403.13372, arxiv, pdf, cication: -1

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo · (LLaMA-Factory - hiyouga) Star

  • Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training, arXiv, 2403.09613, arxiv, pdf, cication: -1

    Yanlai Yang, Matt Jones, Michael C. Mozer, Mengye Ren

  • Larimar: Large Language Models with Episodic Memory Control, arXiv, 2403.11901, arxiv, pdf, cication: -1

    Payel Das, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarath Swaminathan, Sihui Dai, Aurélie Lozano, Georgios Kollias, Vijil Chenthamarakshan, Jiří Navrátil

  • Simple and Scalable Strategies to Continually Pre-train Large Language Models, arXiv, 2403.08763, arxiv, pdf, cication: -1

    Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish

    · (twitter)

    • LLMs can be efficiently updated with new data through a combination of simple learning rate rewarming and adding a small fraction of previous training data to counteract catastrophic forgetting.
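
The sketch below shows the two ingredients from that summary in isolation: a learning-rate schedule that re-warms before decaying again, and batches that mix a small replay fraction of the previous corpus into the new data. All constants (warmup length, peak LR, 5% replay) are illustrative assumptions rather than the paper's exact recipe.

```python
import math
import random

# Hedged sketch of continual pretraining with LR re-warming plus replay.
# Schedule constants and the 5% replay ratio are illustrative assumptions.

def rewarmed_lr(step, warmup_steps=1_000, max_lr=3e-5, min_lr=3e-6, total_steps=100_000):
    """Linear re-warmup from min_lr to max_lr, then cosine decay back to min_lr."""
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def mixed_batch(new_corpus, old_corpus, batch_size=8, replay_ratio=0.05):
    """Each batch keeps a small share of old-corpus documents to curb forgetting."""
    n_replay = max(1, int(batch_size * replay_ratio))
    return random.sample(old_corpus, n_replay) + random.sample(new_corpus, batch_size - n_replay)

if __name__ == "__main__":
    new_corpus = [f"new_doc_{i}" for i in range(1_000)]
    old_corpus = [f"old_doc_{i}" for i in range(1_000)]
    print(f"lr at step 500:   {rewarmed_lr(500):.2e}")
    print(f"lr at step 50000: {rewarmed_lr(50_000):.2e}")
    print(mixed_batch(new_corpus, old_corpus))
```
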
  • Personalized Large Language Models, arXiv, 2402.09269, arxiv, pdf, cication: -1

    Stanisław Woźniak, Bartłomiej Koptyra, Arkadiusz Janz, Przemysław Kazienko, Jan Kocoń

  • BitDelta: Your Fine-Tune May Only Be Worth One Bit, arXiv, 2402.10193, arxiv, pdf, cication: -1

    James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai · (BitDelta - FasterDecoding) Star

  • EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models, arXiv, 2402.00518, arxiv, pdf, cication: -1

    Xuchen Pan, Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou · (EE-LLM - pan-x-c) Star

  • Scaling Sparse Fine-Tuning to Large Language Models, arXiv, 2401.16405, arxiv, pdf, cication: -1

    Alan Ansell, Ivan Vulić, Hannah Sterz, Anna Korhonen, Edoardo M. Ponti · (peft - AlanAnsell) Star · (sft-llm - ducdauge) Star

  • Tuning Language Models by Proxy, arXiv, 2401.08565, arxiv, pdf, cication: -1

    Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, Noah A. Smith

    · (lightning)

  • LLaMA Pro: Progressive LLaMA with Block Expansion, arXiv, 2401.02415, arxiv, pdf, cication: -1

    Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan · (LLaMA-Pro - TencentARC) Star · (huggingface)

  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models, arXiv, 2401.01335, arxiv, pdf, cication: -1

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu

    · (SPIN - uclaml) Star

  • Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes, arXiv, 2312.06353, arxiv, pdf, cication: -1

    Zhen Qin, Daoyuan Chen, Bingchen Qian, Bolin Ding, Yaliang Li, Shuiguang Deng

  • Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2, arXiv, 2311.10702, arxiv, pdf, cication: -1

    Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy

  • Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse, arXiv, 2311.07468, arxiv, pdf, cication: -1

    Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, Rui Yan · (jiqizhixin)

  • Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization, arXiv, 2311.06243, arxiv, pdf, cication: -1

    Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng

Other

Architectures

  • Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach, arXiv, 2406.04594, arxiv, pdf, cication: -1

    Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao

  • Linear Transformers with Learnable Kernel Functions are Better In-Context Models, arXiv, 2402.10644, arxiv, pdf, cication: -1

    Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski, Daniil Gavrilov

  • Rethinking Optimization and Architecture for Tiny Language Models, arXiv, 2402.02791, arxiv, pdf, cication: -1

    Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang · (RethinkTinyLM - YuchuanTian) Star

  • Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers, arXiv, 2311.10642, arxiv, pdf, cication: -1

    Vukasin Bozic, Danilo Dordervic, Daniele Coppola, Joseph Thommes

  • UT5: Pretraining Non autoregressive T5 with unrolled denoising, arXiv, 2311.08552, arxiv, pdf, cication: -1

    Mahmoud G. Salem, Jiayu Ye, Chu-Cheng Lin, Frederick Liu

  • How to Build Low-cost Networks for Large Language Models (without Sacrificing Performance)?, arXiv, 2307.12169, arxiv, pdf, cication: -1

    Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani

  • Stack More Layers Differently: High-Rank Training Through Low-Rank Updates, arXiv, 2307.05695, arxiv, pdf, cication: 2

    Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky

  • Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

    · (mp.weixin.qq)

Optimization

  • Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion, arXiv, 2407.01392, arxiv, pdf, cication: -1

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann

    · (diffusion-forcing - buoyancy99) Star

  • MiniCPM: Unveiling the Potential of End-side Large Language Models

  • Adam-mini: Use Fewer Learning Rates To Gain More, arXiv, 2406.16793, arxiv, pdf, cication: -1

    Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun

  • Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs, arXiv, 2406.10209, arxiv, pdf, cication: -1

    Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele · (goldfish-loss - ahans30) Star

  • 2BP: 2-Stage Backpropagation, arXiv, 2405.18047, arxiv, pdf, cication: -1

    Christopher Rae, Joseph K. L. Lee, James Richings

  • The Road Less Scheduled, arXiv, 2405.15682, arxiv, pdf, cication: -1

    Aaron Defazio, Xingyu Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky · (schedule_free - facebookresearch) Star

  • Thermodynamic Natural Gradient Descent, arXiv, 2405.13817, arxiv, pdf, cication: -1

    Kaelan Donatella, Samuel Duffield, Maxwell Aifer, Denis Melanson, Gavin Crooks, Patrick J. Coles

  • The Entropy Enigma: Success and Failure of Entropy Minimization, arXiv, 2405.05012, arxiv, pdf, cication: -1

    Ori Press, Ravid Shwartz-Ziv, Yann LeCun, Matthias Bethge · (EntropyEnigma - oripress) Star

  • psgd_torch - lixilinx Star

    Pytorch implementation of preconditioned stochastic gradient descent (affine group preconditioner, low-rank approximation preconditioner and more) · (Preconditioned-Stochastic-Gradient-Descent - opooladz) Star
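
For reference, here is a minimal, generic preconditioned-SGD step with a diagonal preconditioner estimated from squared gradients; it only illustrates the general update θ ← θ − η·P·g, and is not the affine-group or low-rank preconditioners that psgd_torch actually implements.

```python
import numpy as np

# Minimal sketch of a *generic* preconditioned SGD step (diagonal preconditioner
# from squared gradients). Illustrative only; not psgd_torch's preconditioners.

def preconditioned_sgd_step(theta, grad, second_moment, lr=1e-2, beta=0.99, eps=1e-8):
    """theta <- theta - lr * P @ grad, with P approximated by a diagonal matrix."""
    second_moment = beta * second_moment + (1 - beta) * grad**2
    precond = 1.0 / (np.sqrt(second_moment) + eps)   # diagonal of P
    return theta - lr * precond * grad, second_moment

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=4)
    v = np.zeros_like(theta)
    for _ in range(100):
        grad = 2 * theta                              # gradient of ||theta||^2
        theta, v = preconditioned_sgd_step(theta, grad, v)
    print("theta after 100 steps:", theta)
```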

  • A Large-Scale Exploration of μ-Transfer, arXiv, 2404.05728, arxiv, pdf, cication: -1

    Lucas Lingle

  • schedule_free - facebookresearch Star

    Schedule-Free Optimization in PyTorch

  • Reverse Training to Nurse the Reversal Curse, arXiv, 2403.13799, arxiv, pdf, cication: -1

    Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, Sainbayar Sukhbaatar

  • Towards Optimal Learning of Language Models, arXiv, 2402.17759, arxiv, pdf, cication: -1

    Yuxian Gu, Li Dong, Yaru Hao, Qingxiu Dong, Minlie Huang, Furu Wei · (aka)

  • Stabilizing Transformer Training by Preventing Attention Entropy Collapse, ICML, 2023, arxiv, pdf, cication: -1

    Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind · (ml-sigma-reparam - apple) Star

  • CAME: Confidence-guided Adaptive Memory Efficient Optimization, arXiv, 2307.02047, arxiv, pdf, cication: -1

    Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You · (qbitai)


Ensemble

  • Octopus v4: Graph of language models, arXiv, 2404.19296, arxiv, pdf, cication: -1

    Wei Chen, Zhiyuan Li

  • Training-Free Pretrained Model Merging, arXiv, 2403.01753, arxiv, pdf, cication: -1

    Zhengqi Xu, Ke Yuan, Huiqiong Wang, Yong Wang, Mingli Song, Jie Song

    • The proposed model merging framework addresses the challenge of balancing unit similarity inconsistencies between weight and activation spaces during model merging by linearly combining similarity matrices of both, resulting in better multi-task model performance.
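
A toy sketch of that idea: compute one unit-similarity matrix in weight space and one in activation space, then take a convex combination before matching units across the two models. The cosine-similarity choice and the λ = 0.5 weighting are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Toy sketch of combining weight-space and activation-space unit similarity
# before matching units from two models. Cosine similarity and lambda = 0.5
# are illustrative assumptions, not the paper's exact method.

def cosine_sim_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def combined_similarity(w_a, w_b, act_a, act_b, lam=0.5):
    """Linearly combine weight-space and activation-space similarity matrices."""
    s_weight = cosine_sim_matrix(w_a, w_b)       # (units_a, units_b)
    s_act = cosine_sim_matrix(act_a.T, act_b.T)  # activations: (samples, units)
    return lam * s_weight + (1.0 - lam) * s_act

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_a, w_b = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))  # unit weight rows
    x = rng.normal(size=(128, 64))                                   # shared probe inputs
    act_a, act_b = x @ w_a.T, x @ w_b.T                              # (samples, units)
    match = combined_similarity(w_a, w_b, act_a, act_b).argmax(axis=1)
    print("greedy unit matching:", match[:8])
```
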
  • Evolutionary Optimization of Model Merging Recipes, arXiv, 2403.13187, arxiv, pdf, cication: -1

    Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha

    · (evolutionary-model-merge - sakanaai) Star

  • Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM, arXiv, 2403.07816, arxiv, pdf, cication: -1

    Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston

  • AutoMerger - mlabonne 🤗

    · (huggingface)

  • Learning to Decode Collaboratively with Multiple Language Models, arXiv, 2403.03870, arxiv, pdf, cication: -1

    Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, David Sontag · (co-llm - clinicalml) Star

  • FuseChat: Knowledge Fusion of Chat Models, arXiv, 2402.16107, arxiv, pdf, cication: -1

    Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, Wei Bi · (FuseLLM - fanqiwan) Star

  • Knowledge Fusion of Large Language Models, arXiv, 2401.10491, arxiv, pdf, cication: -1

    Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, Shuming Shi · (FuseLLM - fanqiwan) Star

  • Beagle14-7B - mlabonne 🤗

  • Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM, arXiv, 2401.02994, arxiv, pdf, cication: -1

    Xiaoding Lu, Adian Liusie, Vyas Raina, Yuwen Zhang, William Beauchamp

  • LLM Augmented LLMs: Expanding Capabilities through Composition, arXiv, 2401.02412, arxiv, pdf, cication: 8

    Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Shikhar Vashishth, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, Partha Talukdar

  • Model Merging - a osanseviero Collection

  • Papers about model merging - a julien-c Collection

  • mergekit - cg123 Star

    Tools for merging pretrained large language models.

  • LM-Cocktail: Resilient Tuning of Language Models via Model Merging, arXiv, 2311.13534, arxiv, pdf, cication: -1

    Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing · (FlagEmbedding - FlagOpen) Star

  • Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, arXiv, 2311.08692, arxiv, pdf, cication: -1

    Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou

  • Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch, arXiv, 2311.03099, arxiv, pdf, cication: -1

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li · (mergelm - yule-buaa) Star

  • LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion, arXiv, 2306.02561, arxiv, pdf, cication: 16

    Dongfu Jiang, Xiang Ren, Bill Yuchen Lin · (LLM-Blender - yuchenlin) Star · (mp.weixin.qq)

Other

MoE

  • A Closer Look into Mixture-of-Experts in Large Language Models, arXiv, 2406.18219, arxiv, pdf, cication: -1

    Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu · (Look-into-MoEs - kamanphoebe) Star

  • Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts, arXiv, 2406.12034, arxiv, pdf, cication: -1

    Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter

  • Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models, arXiv, 2406.06563, arxiv, pdf, cication: -1

    Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng

  • MoEUT: Mixture-of-Experts Universal Transformers, arXiv, 2405.16039, arxiv, pdf, cication: -1

    Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning · (moeut - robertcsordas) Star

  • Multi-Head Mixture-of-Experts, arXiv, 2404.15045, arxiv, pdf, cication: -1

    Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei

  • mergoo - Leeroo-AI Star

    A library for easily merging multiple LLM experts and efficiently training the merged LLM.

  • Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, arXiv, 2404.02258, arxiv, pdf, cication: -1

    David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro · (qbitai) · (OLMo - thepowerfuldeez) Star · (twitter)

  • JetMoE - myshell-ai Star

    Reaching LLaMA2 Performance with 0.1M Dollars

    · (research.myshell)

    • JetMoE-8B has 24 blocks where each block has two MoE layers: Mixture of Attention heads (MoA) and Mixture of MLP Experts (MoE); each MoA and MoE layer has 8 experts, and 2 experts are activated for each input token with 2.2B active parameters.
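
A quick sanity check on those numbers, covering the per-token active share under top-2-of-8 routing and the roughly 2.2B-of-8B active-parameter share implied by the model name; the helper below is purely illustrative arithmetic.

```python
# Illustrative arithmetic for top-k expert routing, using the JetMoE-8B figures
# quoted above (8 experts per MoA/MoE layer, 2 activated per token).

def active_expert_fraction(num_experts: int = 8, top_k: int = 2) -> float:
    """Fraction of each MoE layer's expert parameters touched per token."""
    return top_k / num_experts

if __name__ == "__main__":
    print(f"top-2 of 8 experts -> {active_expert_fraction():.0%} of expert params per token")
    # The repo note quotes ~2.2B active parameters; against the ~8B total implied
    # by the model name, that is roughly a 2.2 / 8 ≈ 28% active share overall.
    print(f"overall active share: {2.2 / 8:.0%}")
```
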
  • megablocks - databricks Star

  • Scattered Mixture-of-Experts Implementation, arXiv, 2403.08245, arxiv, pdf, cication: -1

    Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville

  • Scaling Laws for Fine-Grained Mixture of Experts, arXiv, 2402.07871, arxiv, pdf, cication: -1

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski

  • BlackMamba: Mixture of Experts for State-Space Models, arXiv, 2402.01771, arxiv, pdf, cication: -1

    Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge · (BlackMamba - Zyphra) Star

  • LocMoE: A Low-overhead MoE for Large Language Model Training, arXiv, 2401.13920, arxiv, pdf, cication: -1

    Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen · (jiqizhixin)

  • MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts, arXiv, 2401.04081, arxiv, pdf, cication: -1

    Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Sebastian Jaszczur

  • DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, arXiv, 2401.06066, arxiv, pdf, cication: -1

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu · (DeepSeek-MoE - deepseek-ai) Star · (huggingface)

  • Mixtral of Experts, arXiv, 2401.04088, arxiv, pdf, cication: -1

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand

  • Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning, arXiv, 2312.12379, arxiv, pdf, cication: -1

    Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang · (qbitai)

  • SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, arXiv, 2312.07987, arxiv, pdf, cication: -1

    Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber

    · (moe_attention - robertcsordas) Star

  • Mixture of Experts Explained

  • megablocks-public - mistralai Star

    · (qbitai)

  • llama-mistral - dzhulgakov Star

    Inference code for Mistral and Mixtral hacked up into original Llama implementation

  • SmartMoE - zms1999 Star

    A MoE implementation for PyTorch, [ATC'23] SmartMoE · (jiqizhixin)

  • Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models, arXiv, 2305.14705, arxiv, pdf, cication: 5

    Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen · (jiqizhixin)

  • OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models, arXiv, 2402.01739, arxiv, pdf, cication: -1

    Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You

  • OpenMoE - XueFuzhao Star

    A family of open-sourced Mixture-of-Experts (MoE) Large Language Models

    · (OpenMoE - XueFuzhao) Star

  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts, proceedings of machine learning and systems, 2023, arxiv, pdf, cication: -1

    Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, the journal of machine learning research, 2022, arxiv, pdf, cication: -1

    William Fedus, Barret Zoph, Noam Shazeer

  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, arXiv, 2006.16668, arxiv, pdf, cication: -1

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, arXiv, 1701.06538, arxiv, pdf, cication: -1

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

Other

Online Learning

  • Unlocking Continual Learning Abilities in Language Models, arXiv, 2406.17245, arxiv, pdf, cication: -1

    Wenyu Du, Shuang Cheng, Tongxu Luo, Zihan Qiu, Zeyu Huang, Ka Chun Cheung, Reynold Cheng, Jie Fu · (MIGU - wenyudu) Star

  • Online Training of Large Language Models: Learn while chatting, arXiv, 2403.04790, arxiv, pdf, cication: -1

    Juhao Liang, Ziwei Wang, Zhuoheng Ma, Jianquan Li, Zhiyi Zhang, Xiangbo Wu, Benyou Wang

Toolkits

  • mistral-finetune - mistralai Star

  • ZeRO++ - DeepSpeed

  • torchtitan - pytorch Star

    A native PyTorch Library for large model training

  • ColossalAI - hpcaitech Star

    Making large AI models cheaper, faster and more accessible

  • xtuner - InternLM Star

    An efficient, flexible and full-featured toolkit for fine-tuning large models (InternLM, Llama, Baichuan, Qwen, ChatGLM)

  • corenet - apple Star

    CoreNet: A library for training deep neural networks

  • maxtext - google Star

    A simple, performant and scalable Jax LLM!

  • lightning-thunder - Lightning-AI Star

    Source-to-source compiler for PyTorch. It makes PyTorch programs faster both on single accelerators and in distributed settings.

  • zero-bubble-pipeline-parallelism - sail-sg Star

    Zero Bubble Pipeline Parallelism · (mp.weixin.qq)

  • levanter - stanford-crfm Star

    Legible, Scalable, Reproducible Foundation Models with Named Tensors and Jax

  • axolotl - OpenAccess-AI-Collective Star

    Go ahead and axolotl questions

  • LLMtuner - promptslab Star

    Tune LLM in few lines of code

  • LLM-FineTuning-Large-Language-Models - rohan-paul Star

    LLM (Large Language Model) FineTuning

  • Megatron-LM - NVIDIA Star

    Ongoing research training transformer models at scale

  • saturn - knagrecha Star

    Saturn accelerates the training of large-scale deep learning models with a novel joint optimization approach.

  • SynapseML - microsoft Star

    Simple and Distributed Machine Learning

  • gpt-llm-trainer - mshumer Star

    · (qbitai)

  • LLaMA-Factory - hiyouga Star

    Easy-to-use LLM fine-tuning framework (LLaMA, BLOOM, Mistral, Baichuan, Qwen, ChatGLM)

  • Megatron-LLaMA - alibaba Star

    Best practice for training LLaMA models in Megatron-LM · (jiqizhixin)

  • Efficient-PyTorch - Lyken17 Star

    My best practices for training on large datasets with PyTorch.

Misc

  • Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs, arXiv, 2311.02262, arxiv, pdf, cication: -1

    Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, Tuo Zhao

  • Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks, arXiv, 2310.02244, arxiv, pdf, cication: -1

    Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou · (qbitai)

  • Think before you speak: Training Language Models With Pause Tokens, arXiv, 2310.02226, arxiv, pdf, cication: -1

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan · (qbitai)

  • Textbooks Are All You Need II: phi-1.5 technical report, arXiv, 2309.05463, arxiv, pdf, cication: 9

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee · (jiqizhixin)

  • Towards Robust and Efficient Continual Language Learning, arXiv, 2307.05741, arxiv, pdf, cication: 1

    Adam Fisch, Amal Rannen-Triki, Razvan Pascanu, Jörg Bornschein, Angeliki Lazaridou, Elena Gribovskaya, Marc'Aurelio Ranzato

Courses

Other

Extra reference

  • llm-alignment-survey - Magnetic2014 Star

    A curated reading list for large language model (LLM) alignment. Take a look at our new survey "Large Language Model Alignment: A Survey" for more details!