Awesome LLM Training

Pretrain

  • Efficient Continual Pre-training by Mitigating the Stability Gap, arXiv, 2406.14833, arxiv, pdf, cication: -1

    Yiduo Guo, Jie Fu, Huishuai Zhang, Dongyan Zhao, Yikang Shen

  • Instruction Pre-Training: Language Models are Supervised Multitask Learners, arXiv, 2406.14491, arxiv, pdf, cication: -1

    Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei · (LMOps - microsoft) Star

  • Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training, arXiv, 2405.15319, arxiv, pdf, cication: -1

    Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu · (llm-stacking.github)

  • Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining, arXiv, 2405.14908, arxiv, pdf, cication: -1

    Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

  • Pre-training Small Base LMs with Fewer Tokens, arXiv, 2404.08634, arxiv, pdf, cication: -1

    Sunny Sanyal, Sujay Sanghavi, Alexandros G. Dimakis · (LLM-Inheritune - sanyalsunny111) Star

  • Training LLMs over Neurally Compressed Text, arXiv, 2404.03626, arxiv, pdf, cication: -1

    Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant

  • The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis, arXiv, 2404.01204, arxiv, pdf, cication: -1

    Chen Yang, Junzhuo Li, Xinyao Niu, Xinrun Du, Songyang Gao, Haoran Zhang, Zhaoliang Chen, Xingwei Qu, Ruibin Yuan, Yizhi Li

  • Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training, arXiv, 2403.00758, arxiv, pdf, cication: -1

    Qingyan Guo, Rui Wang, Junliang Guo, Xu Tan, Jiang Bian, Yujiu Yang

  • Reverse Training to Nurse the Reversal Curse, arXiv, 2403.13799, arxiv, pdf, cication: -1

    Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, Sainbayar Sukhbaatar

  • Language models scale reliably with over-training and on downstream tasks, arXiv, 2403.08540, arxiv, pdf, cication: -1

    Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh

  • fm-cheatsheet - allenai Star

    Website for hosting the Open Foundation Models Cheat Sheet. · (fmcheatsheet)

  • SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection, arXiv, 2401.13160, arxiv, pdf, cication: -1

    Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar

  • MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications, arXiv, 2310.15777, arxiv, pdf, cication: -1

    Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao · (jiqizhixin)

  • In-Context Pretraining: Language Modeling Beyond Document Boundaries, arXiv, 2310.10638, arxiv, pdf, cication: -1

    Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis

  • When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale, arXiv, 2309.04564, arxiv, pdf, cication: 3

    Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

  • metaseq - facebookresearch Star

    · (mp.weixin.qq) · (bilibili)

Other

Scaling

  • Scaling Laws for Linear Complexity Language Models, arXiv, 2406.16690, arxiv, pdf, cication: -1

    Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong

  • D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models, arXiv, 2406.01375, arxiv, pdf, cication: -1

    Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang

  • Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?, arXiv, 2406.04391, arxiv, pdf, cication: -1

    Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

  • Observational Scaling Laws and the Predictability of Language Model Performance, arXiv, 2405.10938, arxiv, pdf, cication: -1

    Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto

  • Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory, arXiv, 2405.08707, arxiv, pdf, cication: -1

    Xueyan Niu, Bo Bai, Lei Deng, Wei Han

  • Chinchilla Scaling: A replication attempt, arXiv, 2404.10102, arxiv, pdf, cication: -1

    Tamay Besiroglu, Ege Erdil, Matthew Barnett, Josh You

  • Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws, arXiv, 2404.05405, arxiv, pdf, cication: -1

    Zeyuan Allen-Zhu, Yuanzhi Li

  • DiPaCo: Distributed Path Composition, arXiv, 2403.10616, arxiv, pdf, cication: -1

    Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam

  • Algorithmic progress in language models, arXiv, 2403.05812, arxiv, pdf, cication: -1

    Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla

    · (mp.weixin.qq)

    • since 2012, the computational efficiency for pretraining language models (including large language models) has doubled approximately every 8 months, a pace much faster than the hardware advancements predicted by Moore's Law.
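
As a rough illustration of what an 8-month doubling time implies, the sketch below computes the compute-equivalent multiplier after a given number of years; the helper name and the printed horizons are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope sketch: compute-equivalent gain implied by an
# ~8-month doubling time in algorithmic efficiency (illustrative only).

def algorithmic_gain(years: float, doubling_months: float = 8.0) -> float:
    """Multiplier by which required pretraining compute shrinks after `years`."""
    return 2.0 ** (years * 12.0 / doubling_months)

if __name__ == "__main__":
    for years in (1, 4, 10):
        print(f"{years:>2} years -> ~{algorithmic_gain(years):,.0f}x less compute "
              "for the same loss")
```
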
  • When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method, arXiv, 2402.17193, arxiv, pdf, cication: -1

    Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat

  • MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, arXiv, 2402.15627, arxiv, pdf, cication: -1

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong

  • OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning, arXiv, 2402.06954, arxiv, pdf, cication: -1

    Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, Siheng Chen · (openfedllm - rui-ye) Star

  • A Tale of Tails: Model Collapse as a Change of Scaling Laws, arXiv, 2402.07043, arxiv, pdf, cication: -1

    Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe

  • Scaling Laws for Downstream Task Performance of Large Language Models, arXiv, 2402.04177, arxiv, pdf, cication: -1

    Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo

  • T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives, arXiv, 2401.16677, arxiv, pdf, cication: -1

    Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair

  • Zero Bubble Pipeline Parallelism, arXiv, 2401.10241, arxiv, pdf, cication: -1

    Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin · (zero-bubble-pipeline-parallelism - sail-sg) Star

  • Asynchronous Local-SGD Training for Language Modeling, arXiv, 2401.09135, arxiv, pdf, cication: -1

    Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato

  • DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, arXiv, 2401.02954, arxiv, pdf, cication: -1

    DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong

  • Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws, arXiv, 2401.00448, arxiv, pdf, cication: -1

    Nikhil Sardana, Jonathan Frankle

  • Unicron: Economizing Self-Healing LLM Training at Scale, arXiv, 2401.00134, arxiv, pdf, cication: -1

    Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou

  • SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling, arXiv, 2312.15166, arxiv, pdf, cication: -1

    Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim

  • Distributed Inference and Fine-tuning of Large Language Models Over The Internet, arXiv, 2312.08361, arxiv, pdf, cication: -1

    Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel

    · (petals - bigscience-workshop) Star

  • Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models, arXiv, 2312.06109, arxiv, pdf, cication: -1

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang

  • EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, arXiv, 2312.04916, arxiv, pdf, cication: -1

    Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou · (EE-LLM - pan-x-c) Star

  • DiLoCo: Distributed Low-Communication Training of Language Models, arXiv, 2311.08105, arxiv, pdf, cication: -1

    Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen

  • Microscaling Data Formats for Deep Learning, arXiv, 2310.10537, arxiv, pdf, cication: -1

    Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf

  • A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale, arXiv, 2309.06497, arxiv, pdf, cication: -1

    Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, Michael Rabbat

  • Scaling Laws for Sparsely-Connected Foundation Models, arXiv, 2309.08520, arxiv, pdf, cication: 1

    Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci

  • Scaling TransNormer to 175 Billion Parameters, arXiv, 2307.14995, arxiv, pdf, cication: 1

    Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Fei Yuan, Xiao Luo · (jiqizhixin)

  • Go smol or go home | Harm de Vries

  • Inverse Scaling: When Bigger Isn't Better, arXiv, 2306.09479, arxiv, pdf, cication: 15

    Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu

  • Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training, arXiv, 2306.08055, arxiv, pdf, cication: 1

    Abraham J. Fetterman, Ellie Kitanidis, Joshua Albrecht, Zachary Polizzi, Bryden Fogelman, Maksis Knutins, Bartosz Wróblewski, James B. Simon, Kanjun Qiu

  • To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis, arXiv, 2305.13230, arxiv, pdf, cication: -1

    Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You

  • Training Compute-Optimal Large Language Models, arXiv, 2203.15556, arxiv, pdf, cication: 202

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark

  • Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, ICML, 2023, arxiv, pdf, cication: -1

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff

Other

Fine-tuning

  • LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models, arXiv, 2403.13372, arxiv, pdf, cication: -1

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo · (LLaMA-Factory - hiyouga) Star

  • Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training, arXiv, 2403.09613, arxiv, pdf, cication: -1

    Yanlai Yang, Matt Jones, Michael C. Mozer, Mengye Ren

  • Larimar: Large Language Models with Episodic Memory Control, arXiv, 2403.11901, arxiv, pdf, cication: -1

    Payel Das, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarath Swaminathan, Sihui Dai, Aurélie Lozano, Georgios Kollias, Vijil Chenthamarakshan, Jiří Navrátil

  • Simple and Scalable Strategies to Continually Pre-train Large Language Models, arXiv, 2403.08763, arxiv, pdf, cication: -1

    Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish

    · (twitter)

    • LLMs can be efficiently updated with new data through a combination of simple learning rate rewarming and adding a small fraction of previous training data to counteract catastrophic forgetting.
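
The sketch below shows the two ingredients from that summary in isolation: a learning-rate schedule that re-warms before decaying again, and batches that mix a small replay fraction of the previous corpus into the new data. All constants (warmup length, peak LR, 5% replay) are illustrative assumptions rather than the paper's exact recipe.

```python
import math
import random

# Hedged sketch of continual pretraining with LR re-warming plus replay.
# Schedule constants and the 5% replay ratio are illustrative assumptions.

def rewarmed_lr(step, warmup_steps=1_000, max_lr=3e-5, min_lr=3e-6, total_steps=100_000):
    """Linear re-warmup from min_lr to max_lr, then cosine decay back to min_lr."""
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def mixed_batch(new_corpus, old_corpus, batch_size=8, replay_ratio=0.05):
    """Each batch keeps a small share of old-corpus documents to curb forgetting."""
    n_replay = max(1, int(batch_size * replay_ratio))
    return random.sample(old_corpus, n_replay) + random.sample(new_corpus, batch_size - n_replay)

if __name__ == "__main__":
    new_corpus = [f"new_doc_{i}" for i in range(1_000)]
    old_corpus = [f"old_doc_{i}" for i in range(1_000)]
    print(f"lr at step 500:   {rewarmed_lr(500):.2e}")
    print(f"lr at step 50000: {rewarmed_lr(50_000):.2e}")
    print(mixed_batch(new_corpus, old_corpus))
```
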
  • Personalized Large Language Models, arXiv, 2402.09269, arxiv, pdf, cication: -1

    Stanisław Woźniak, Bartłomiej Koptyra, Arkadiusz Janz, Przemysław Kazienko, Jan Kocoń

  • BitDelta: Your Fine-Tune May Only Be Worth One Bit, arXiv, 2402.10193, arxiv, pdf, cication: -1

    James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai · (BitDelta - FasterDecoding) Star

  • EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models, arXiv, 2402.00518, arxiv, pdf, cication: -1

    Xuchen Pan, Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou · (EE-LLM - pan-x-c) Star

  • Scaling Sparse Fine-Tuning to Large Language Models, arXiv, 2401.16405, arxiv, pdf, cication: -1

    Alan Ansell, Ivan Vulić, Hannah Sterz, Anna Korhonen, Edoardo M. Ponti · (peft - AlanAnsell) Star · (sft-llm - ducdauge) Star

  • Tuning Language Models by Proxy, arXiv, 2401.08565, arxiv, pdf, cication: -1

    Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, Noah A. Smith

    · (lightning)

  • LLaMA Pro: Progressive LLaMA with Block Expansion, arXiv, 2401.02415, arxiv, pdf, cication: -1

    Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan · (LLaMA-Pro - TencentARC) Star · (huggingface)

  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models, arXiv, 2401.01335, arxiv, pdf, cication: -1

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu

    · (SPIN - uclaml) Star

  • Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes, arXiv, 2312.06353, arxiv, pdf, cication: -1

    Zhen Qin, Daoyuan Chen, Bingchen Qian, Bolin Ding, Yaliang Li, Shuiguang Deng

  • Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2, arXiv, 2311.10702, arxiv, pdf, cication: -1

    Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy

  • Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse, arXiv, 2311.07468, arxiv, pdf, cication: -1

    Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, Rui Yan · (jiqizhixin)

  • Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization, arXiv, 2311.06243, arxiv, pdf, cication: -1

    Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng

Other

Architectures

  • Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach, arXiv, 2406.04594, arxiv, pdf, cication: -1

    Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao

  • Linear Transformers with Learnable Kernel Functions are Better In-Context Models, arXiv, 2402.10644, arxiv, pdf, cication: -1

    Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski, Daniil Gavrilov

  • Rethinking Optimization and Architecture for Tiny Language Models, arXiv, 2402.02791, arxiv, pdf, cication: -1

    Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang · (RethinkTinyLM - YuchuanTian) Star

  • Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers, arXiv, 2311.10642, arxiv, pdf, cication: -1

    Vukasin Bozic, Danilo Dordervic, Daniele Coppola, Joseph Thommes

  • UT5: Pretraining Non autoregressive T5 with unrolled denoising, arXiv, 2311.08552, arxiv, pdf, cication: -1

    Mahmoud G. Salem, Jiayu Ye, Chu-Cheng Lin, Frederick Liu

  • How to Build Low-cost Networks for Large Language Models (without Sacrificing Performance)?, arXiv, 2307.12169, arxiv, pdf, cication: -1

    Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani

  • Stack More Layers Differently: High-Rank Training Through Low-Rank Updates, arXiv, 2307.05695, arxiv, pdf, cication: 2

    Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky

  • Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

    · (mp.weixin.qq)

Optimization

  • Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion, arXiv, 2407.01392, arxiv, pdf, cication: -1

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann

    · (diffusion-forcing - buoyancy99) Star

  • MiniCPM: Unveiling the Potential of End-side Large Language Models

  • Adam-mini: Use Fewer Learning Rates To Gain More, arXiv, 2406.16793, arxiv, pdf, cication: -1

    Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun

  • Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs, arXiv, 2406.10209, arxiv, pdf, cication: -1

    Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele · (goldfish-loss - ahans30) Star

  • 2BP: 2-Stage Backpropagation, arXiv, 2405.18047, arxiv, pdf, cication: -1

    Christopher Rae, Joseph K. L. Lee, James Richings

  • The Road Less Scheduled, arXiv, 2405.15682, arxiv, pdf, cication: -1

    Aaron Defazio, Xingyu Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky · (schedule_free - facebookresearch) Star

  • Thermodynamic Natural Gradient Descent, arXiv, 2405.13817, arxiv, pdf, cication: -1

    Kaelan Donatella, Samuel Duffield, Maxwell Aifer, Denis Melanson, Gavin Crooks, Patrick J. Coles

  • The Entropy Enigma: Success and Failure of Entropy Minimization, arXiv, 2405.05012, arxiv, pdf, cication: -1

    Ori Press, Ravid Shwartz-Ziv, Yann LeCun, Matthias Bethge · (EntropyEnigma - oripress) Star

  • psgd_torch - lixilinx Star

    Pytorch implementation of preconditioned stochastic gradient descent (affine group preconditioner, low-rank approximation preconditioner and more) · (Preconditioned-Stochastic-Gradient-Descent - opooladz) Star
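
For reference, here is a minimal, generic preconditioned-SGD step with a diagonal preconditioner estimated from squared gradients; it only illustrates the general update θ ← θ − η·P·g, and is not the affine-group or low-rank preconditioners that psgd_torch actually implements.

```python
import numpy as np

# Minimal sketch of a *generic* preconditioned SGD step (diagonal preconditioner
# from squared gradients). Illustrative only; not psgd_torch's preconditioners.

def preconditioned_sgd_step(theta, grad, second_moment, lr=1e-2, beta=0.99, eps=1e-8):
    """theta <- theta - lr * P @ grad, with P approximated by a diagonal matrix."""
    second_moment = beta * second_moment + (1 - beta) * grad**2
    precond = 1.0 / (np.sqrt(second_moment) + eps)   # diagonal of P
    return theta - lr * precond * grad, second_moment

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=4)
    v = np.zeros_like(theta)
    for _ in range(100):
        grad = 2 * theta                              # gradient of ||theta||^2
        theta, v = preconditioned_sgd_step(theta, grad, v)
    print("theta after 100 steps:", theta)
```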

  • A Large-Scale Exploration of μ-Transfer, arXiv, 2404.05728, arxiv, pdf, cication: -1

    Lucas Lingle

  • schedule_free - facebookresearch Star

    Schedule-Free Optimization in PyTorch

  • Reverse Training to Nurse the Reversal Curse, arXiv, 2403.13799, arxiv, pdf, cication: -1

    Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, Sainbayar Sukhbaatar

  • Towards Optimal Learning of Language Models, arXiv, 2402.17759, arxiv, pdf, cication: -1

    Yuxian Gu, Li Dong, Yaru Hao, Qingxiu Dong, Minlie Huang, Furu Wei · (aka)

  • Stabilizing Transformer Training by Preventing Attention Entropy Collapse, ICML, 2023, arxiv, pdf, cication: -1

    Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind · (ml-sigma-reparam - apple) Star

  • CAME: Confidence-guided Adaptive Memory Efficient Optimization, arXiv, 2307.02047, arxiv, pdf, cication: -1

    Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You · (qbitai)


Ensemble

  • Octopus v4: Graph of language models, arXiv, 2404.19296, arxiv, pdf, cication: -1

    Wei Chen, Zhiyuan Li

  • Training-Free Pretrained Model Merging, arXiv, 2403.01753, arxiv, pdf, cication: -1

    Zhengqi Xu, Ke Yuan, Huiqiong Wang, Yong Wang, Mingli Song, Jie Song

    • The proposed model merging framework addresses the challenge of balancing unit similarity inconsistencies between weight and activation spaces during model merging by linearly combining similarity matrices of both, resulting in better multi-task model performance.
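
A toy sketch of that idea: compute one unit-similarity matrix in weight space and one in activation space, then take a convex combination before matching units across the two models. The cosine-similarity choice and the λ = 0.5 weighting are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

# Toy sketch of combining weight-space and activation-space unit similarity
# before matching units from two models. Cosine similarity and lambda = 0.5
# are illustrative assumptions, not the paper's exact method.

def cosine_sim_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def combined_similarity(w_a, w_b, act_a, act_b, lam=0.5):
    """Linearly combine weight-space and activation-space similarity matrices."""
    s_weight = cosine_sim_matrix(w_a, w_b)       # (units_a, units_b)
    s_act = cosine_sim_matrix(act_a.T, act_b.T)  # activations: (samples, units)
    return lam * s_weight + (1.0 - lam) * s_act

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_a, w_b = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))  # unit weight rows
    x = rng.normal(size=(128, 64))                                   # shared probe inputs
    act_a, act_b = x @ w_a.T, x @ w_b.T                              # (samples, units)
    match = combined_similarity(w_a, w_b, act_a, act_b).argmax(axis=1)
    print("greedy unit matching:", match[:8])
```
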
  • Evolutionary Optimization of Model Merging Recipes, arXiv, 2403.13187, arxiv, pdf, cication: -1

    Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha

    · (evolutionary-model-merge - sakanaai) Star

  • Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM, arXiv, 2403.07816, arxiv, pdf, cication: -1

    Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston

  • AutoMerger - mlabonne 🤗

    · (huggingface)

  • Learning to Decode Collaboratively with Multiple Language Models, arXiv, 2403.03870, arxiv, pdf, cication: -1

    Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, David Sontag · (co-llm - clinicalml) Star

  • FuseChat: Knowledge Fusion of Chat Models, arXiv, 2402.16107, arxiv, pdf, cication: -1

    Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, Wei Bi · (FuseLLM - fanqiwan) Star

  • Knowledge Fusion of Large Language Models, arXiv, 2401.10491, arxiv, pdf, cication: -1

    Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, Shuming Shi · (FuseLLM - fanqiwan) Star

  • Beagle14-7B - mlabonne 🤗

  • Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM, arXiv, 2401.02994, arxiv, pdf, cication: -1

    Xiaoding Lu, Adian Liusie, Vyas Raina, Yuwen Zhang, William Beauchamp

  • LLM Augmented LLMs: Expanding Capabilities through Composition, arXiv, 2401.02412, arxiv, pdf, cication: 8

    Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Shikhar Vashishth, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, Partha Talukdar

  • Model Merging - a osanseviero Collection

  • Papers about model merging - a julien-c Collection

  • mergekit - cg123 Star

    Tools for merging pretrained large language models.

  • LM-Cocktail: Resilient Tuning of Language Models via Model Merging, arXiv, 2311.13534, arxiv, pdf, cication: -1

    Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing · (FlagEmbedding - FlagOpen) Star

  • Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, arXiv, 2311.08692, arxiv, pdf, cication: -1

    Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou

  • Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch, arXiv, 2311.03099, arxiv, pdf, cication: -1

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li · (mergelm - yule-buaa) Star

  • LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion, arXiv, 2306.02561, arxiv, pdf, cication: 16

    Dongfu Jiang, Xiang Ren, Bill Yuchen Lin · (LLM-Blender - yuchenlin) Star · (mp.weixin.qq)

Other

MoE

  • A Closer Look into Mixture-of-Experts in Large Language Models, arXiv, 2406.18219, arxiv, pdf, cication: -1

    Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu · (Look-into-MoEs - kamanphoebe) Star

  • Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts, arXiv, 2406.12034, arxiv, pdf, cication: -1

    Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter

  • Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models, arXiv, 2406.06563, arxiv, pdf, cication: -1

    Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng

  • MoEUT: Mixture-of-Experts Universal Transformers, arXiv, 2405.16039, arxiv, pdf, cication: -1

    Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning · (moeut - robertcsordas) Star

  • Multi-Head Mixture-of-Experts, arXiv, 2404.15045, arxiv, pdf, cication: -1

    Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei

  • mergoo - Leeroo-AI Star

    A library for easily merging multiple LLM experts and efficiently training the merged LLM.

  • Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, arXiv, 2404.02258, arxiv, pdf, cication: -1

    David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro · (qbitai) · (OLMo - thepowerfuldeez) Star · (twitter)

  • JetMoE - myshell-ai Star

    Reaching LLaMA2 Performance with 0.1M Dollars

    · (research.myshell)

    • JetMoE-8B has 24 blocks where each block has two MoE layers: Mixture of Attention heads (MoA) and Mixture of MLP Experts (MoE); each MoA and MoE layer has 8 experts, and 2 experts are activated for each input token with 2.2B active parameters.
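
A quick sanity check on those numbers, covering the per-token active share under top-2-of-8 routing and the roughly 2.2B-of-8B active-parameter share implied by the model name; the helper below is purely illustrative arithmetic.

```python
# Illustrative arithmetic for top-k expert routing, using the JetMoE-8B figures
# quoted above (8 experts per MoA/MoE layer, 2 activated per token).

def active_expert_fraction(num_experts: int = 8, top_k: int = 2) -> float:
    """Fraction of each MoE layer's expert parameters touched per token."""
    return top_k / num_experts

if __name__ == "__main__":
    print(f"top-2 of 8 experts -> {active_expert_fraction():.0%} of expert params per token")
    # The repo note quotes ~2.2B active parameters; against the ~8B total implied
    # by the model name, that is roughly a 2.2 / 8 ≈ 28% active share overall.
    print(f"overall active share: {2.2 / 8:.0%}")
```
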
  • megablocks - databricks Star

  • Scattered Mixture-of-Experts Implementation, arXiv, 2403.08245, arxiv, pdf, cication: -1

    Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville

  • Scaling Laws for Fine-Grained Mixture of Experts, arXiv, 2402.07871, arxiv, pdf, cication: -1

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski

  • BlackMamba: Mixture of Experts for State-Space Models, arXiv, 2402.01771, arxiv, pdf, cication: -1

    Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge · (BlackMamba - Zyphra) Star

  • LocMoE: A Low-overhead MoE for Large Language Model Training, arXiv, 2401.13920, arxiv, pdf, cication: -1

    Jing Li, Zhijie Sun, Xuan He, Li Zeng, Yi Lin, Entong Li, Binfan Zheng, Rongqian Zhao, Xin Chen · (jiqizhixin)

  • MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts, arXiv, 2401.04081, arxiv, pdf, cication: -1

    Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Sebastian Jaszczur

  • DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, arXiv, 2401.06066, arxiv, pdf, cication: -1

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu · (DeepSeek-MoE - deepseek-ai) Star · (huggingface)

  • Mixtral of Experts, arXiv, 2401.04088, arxiv, pdf, cication: -1

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand

  • Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning, arXiv, 2312.12379, arxiv, pdf, cication: -1

    Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang · (qbitai)

  • SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, arXiv, 2312.07987, arxiv, pdf, cication: -1

    Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber

    · (moe_attention - robertcsordas) Star

  • Mixture of Experts Explained

  • megablocks-public - mistralai Star

    · (qbitai)

  • llama-mistral - dzhulgakov Star

    Inference code for Mistral and Mixtral hacked up into original Llama implementation

  • SmartMoE - zms1999 Star

    A MoE implementation for PyTorch, [ATC'23] SmartMoE · (jiqizhixin)

  • Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models, arXiv, 2305.14705, arxiv, pdf, cication: 5

    Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen · (jiqizhixin)

  • OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models, arXiv, 2402.01739, arxiv, pdf, cication: -1

    Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You

  • OpenMoE - XueFuzhao Star

    A family of open-sourced Mixture-of-Experts (MoE) Large Language Models

    · (OpenMoE - XueFuzhao) Star

  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts, proceedings of machine learning and systems, 2023, arxiv, pdf, cication: -1

    Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, the journal of machine learning research, 2022, arxiv, pdf, cication: -1

    William Fedus, Barret Zoph, Noam Shazeer

  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, arXiv, 2006.16668, arxiv, pdf, cication: -1

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, arXiv, 1701.06538, arxiv, pdf, cication: -1

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

Other

Online Learning

  • Unlocking Continual Learning Abilities in Language Models, arXiv, 2406.17245, arxiv, pdf, cication: -1

    Wenyu Du, Shuang Cheng, Tongxu Luo, Zihan Qiu, Zeyu Huang, Ka Chun Cheung, Reynold Cheng, Jie Fu · (MIGU - wenyudu) Star

  • Online Training of Large Language Models: Learn while chatting, arXiv, 2403.04790, arxiv, pdf, cication: -1

    Juhao Liang, Ziwei Wang, Zhuoheng Ma, Jianquan Li, Zhiyi Zhang, Xiangbo Wu, Benyou Wang

Toolkits

  • mistral-finetune - mistralai Star

  • ZeRO++ - DeepSpeed

  • torchtitan - pytorch Star

    A native PyTorch Library for large model training

  • ColossalAI - hpcaitech Star

    Making large AI models cheaper, faster and more accessible

  • xtuner - InternLM Star

    An efficient, flexible and full-featured toolkit for fine-tuning large models (InternLM, Llama, Baichuan, Qwen, ChatGLM)

  • corenet - apple Star

    CoreNet: A library for training deep neural networks

  • maxtext - google Star

    A simple, performant and scalable Jax LLM!

  • lightning-thunder - Lightning-AI Star

    Source-to-source compiler for PyTorch. It makes PyTorch programs faster both on single accelerators and in distributed settings.

  • zero-bubble-pipeline-parallelism - sail-sg Star

    Zero Bubble Pipeline Parallelism · (mp.weixin.qq)

  • levanter - stanford-crfm Star

    Legible, Scalable, Reproducible Foundation Models with Named Tensors and Jax

  • axolotl - OpenAccess-AI-Collective Star

    Go ahead and axolotl questions

  • LLMtuner - promptslab Star

    Tune LLM in few lines of code

  • LLM-FineTuning-Large-Language-Models - rohan-paul Star

    LLM (Large Language Model) FineTuning

  • Megatron-LM - NVIDIA Star

    Ongoing research training transformer models at scale

  • saturn - knagrecha Star

    Saturn accelerates the training of large-scale deep learning models with a novel joint optimization approach.

  • SynapseML - microsoft Star

    Simple and Distributed Machine Learning

  • gpt-llm-trainer - mshumer Star

    · (qbitai)

  • LLaMA-Factory - hiyouga Star

    Easy-to-use LLM fine-tuning framework (LLaMA, BLOOM, Mistral, Baichuan, Qwen, ChatGLM)

  • Megatron-LLaMA - alibaba Star

    Best practice for training LLaMA models in Megatron-LM · (jiqizhixin)

  • Efficient-PyTorch - Lyken17 Star

    My best practices for training on large datasets with PyTorch.

Misc

  • Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs, arXiv, 2311.02262, arxiv, pdf, cication: -1

    Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, Tuo Zhao

  • Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks, arXiv, 2310.02244, arxiv, pdf, cication: -1

    Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou · (qbitai)

  • Think before you speak: Training Language Models With Pause Tokens, arXiv, 2310.02226, arxiv, pdf, cication: -1

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan · (qbitai)

  • Textbooks Are All You Need II: phi-1.5 technical report, arXiv, 2309.05463, arxiv, pdf, cication: 9

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee · (jiqizhixin)

  • Towards Robust and Efficient Continual Language Learning, arXiv, 2307.05741, arxiv, pdf, cication: 1

    Adam Fisch, Amal Rannen-Triki, Razvan Pascanu, Jörg Bornschein, Angeliki Lazaridou, Elena Gribovskaya, Marc'Aurelio Ranzato

Courses

Other

Extra reference

  • llm-alignment-survey - Magnetic2014 Star

    A curated reading list for large language model (LLM) alignment. Take a look at our new survey "Large Language Model Alignment: A Survey" for more details!