From fbc0385fef90f613650d1fbbc53e9963ee49f1bb Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Thu, 1 Feb 2024 07:27:50 -0800
Subject: [PATCH 01/30] add optional dependency for preview to environment.yml

---
 environment.yml | 1 +
 1 file changed, 1 insertion(+)
diff --git a/environment.yml b/environment.yml
index 9ab48dedc..af421b3c6 100644
--- a/environment.yml
+++ b/environment.yml
@@ -27,6 +27,7 @@ dependencies:
   - conda-forge::monkeytype   # infer type annotations
   - conda-forge::rich         # better, colored tracebacks, etc
   - conda-forge::pytest-sugar # better pytest output
+  # - conda-forge::nodejs       # for `doc-builder preview` (optional)
 
 ## ENV CREATION - steps to reproduce:
 # mamba env remove -n bnb

From 84b5fc001c72d92fbd987b44db1d976c0d0a19fd Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Thu, 1 Feb 2024 08:32:18 -0800
Subject: [PATCH 02/30] Add additional sections, first optimizers, MacOS WIP

---
 docs/source/_toctree.yml                    |  16 ++-
 docs/source/installation.mdx                |   8 ++
 docs/source/integrations.mdx                |   5 +
 docs/source/{index.mdx => introduction.mdx} |   8 +-
 docs/source/optimizers.mdx                  | 103 ++++++++++++++++++++
 docs/source/quantization.mdx                |   1 +
 6 files changed, 133 insertions(+), 8 deletions(-)
 create mode 100644 docs/source/integrations.mdx
 rename docs/source/{index.mdx => introduction.mdx} (96%)
 create mode 100644 docs/source/optimizers.mdx
 create mode 100644 docs/source/quantization.mdx

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 043597177..8f63a6339 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -1,8 +1,16 @@
-- sections:
-  - local: index
-    title: Bits & Bytes
+- title: Get started
+  sections:
+  - local: introduction
+    title: Introduction
   - local: quickstart
     title: Quickstart
   - local: installation
     title: Installation
-  title: Get started
+- title: Features & Integrations
+  sections:
+  - local: quantization
+    title: Quantization
+  - local: optimizers
+    title: Optimizers
+  - local: integrations
+    title: Integrations
diff --git a/docs/source/installation.mdx b/docs/source/installation.mdx
index 50031acf7..860acb35b 100644
--- a/docs/source/installation.mdx
+++ b/docs/source/installation.mdx
@@ -4,6 +4,7 @@ Note currently `bitsandbytes` is only supported on CUDA GPU hardwares, support f
 
 <hfoptions id="OS system">
 <hfoption id="Linux">
+<hfoption id="MacOS">
 
 ## Linux
 
@@ -39,5 +40,12 @@ python -m build --wheel
 
 Big thanks to [wkpark](https://github.com/wkpark), [Jamezo97](https://github.com/Jamezo97), [rickardp](https://github.com/rickardp), [akx](https://github.com/akx) for their amazing contributions to make bitsandbytes compatible with Windows.
 
+</hfoption>
+<hfoption id="Windows">
+
+## MacOS
+
+Mac support is still a work in progress.
+
 </hfoption>
 </hfoptions>
diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
new file mode 100644
index 000000000..a12dd31ef
--- /dev/null
+++ b/docs/source/integrations.mdx
@@ -0,0 +1,5 @@
+# Transformers
+
+# PEFT
+
+# Trainer for the optimizers
diff --git a/docs/source/index.mdx b/docs/source/introduction.mdx
similarity index 96%
rename from docs/source/index.mdx
rename to docs/source/introduction.mdx
index 67c928309..7506992bc 100644
--- a/docs/source/index.mdx
+++ b/docs/source/introduction.mdx
@@ -1,10 +1,10 @@
-# bitsandbytes
+# `bitsandbytes`
 
-The bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
+The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
 
+There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
 
-
-Resources:
+# Resources:
 - [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861) --  [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) -- [Docs](https://bitsandbytes.readthedocs.io/en/latest/)
 
 - [LLM.int8() Paper](https://arxiv.org/abs/2208.07339) -- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration) -- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
new file mode 100644
index 000000000..1ac80b593
--- /dev/null
+++ b/docs/source/optimizers.mdx
@@ -0,0 +1,103 @@
+Here we provide a short description and usage examples for each optimizer in `bitsandbytes.optim. We'll start by explaining the core optimizer class `Optimizer8bit`, followed by the specific implementations `Adagrad`, `Adagrad8bit` and `Adagrad32bit`.
+
+Each of these optimizers can be utilized depending on the specific requirements of the task at hand, such as memory constraints, computational efficiency and the need for precision.
+
+# Optimizer base class
+
+## `Optimizer8bit`
+
+The `Optimizer8bit` class serves as a base class for all 8-bit optimizers, providing common functionalities required for quantized optimization. The class is designed to support both 32-bit and 8-bit computations, where 8-bit optimizations can significantly reduce memory footprint and increase computation speed.
+
+### Usage:
+
+```python
+import torch
+from bitsandbytes.optim import Optimizer8bit
+
+model = YourModel()
+params = model.parameters()
+
+# Initialize the optimizer with your model's parameters
+optimizer = Optimizer8bit(params, defaults={
+    'lr': 0.001,
+    'betas': (0.9, 0.999),
+    'eps': 1e-08,
+    'weight_decay': 0
+}, optim_bits=8)  # Use optim_bits=32 for 32-bit optimization
+
+# In your training loop
+optimizer.zero_grad()
+loss = compute_loss()  # Implement your loss computation
+loss.backward()
+optimizer.step()
+```
+
+# Adagrad implementations
+
+## `Adagrad`
+
+The `Adagrad` class is an implementation of the Adagrad optimizer, which adapts the learning rate for each parameter based on the historical gradient information. This version allows for both 32-bit and 8-bit representations, with specific classes for each.
+
+### `Adagrad` Usage:
+
+```python
+import torch
+from bitsandbytes.optim import Adagrad
+
+model = YourModel()
+params = model.parameters()
+
+# Initialize the optimizer with your model's parameters
+optimizer = Adagrad(params, lr=0.01)
+
+# In your training loop
+optimizer.zero_grad()
+loss = compute_loss()  # Implement your loss computation
+loss.backward()
+optimizer.step()
+```
+
+## `Adagrad8bit`
+
+The `Adagrad8bit` class is specifically tailored for 8-bit optimization, inheriting from `Optimizer1State`. It is designed for models where memory efficiency is crucial and it operates with reduced precision to save memory and increase computation speed.
+
+### `Adagrad8bit` Usage:
+
+```python
+import torch
+from bitsandbytes.optim import Adagrad8bit
+
+model = YourModel()
+
+# Initialize the optimizer with your model's parameters
+optimizer = Adagrad8bit(params, lr=0.01)
+
+# In your training loop
+optimizer.zero_grad()
+loss = compute_loss()  # Implement your loss computation
+loss.backward()
+optimizer.step()
+```
+
+## Adagrad32bit
+
+The `Adagrad32bit` class is similar to `Adagrad` but ensures that all computations are carried out with 32-bit precision. This class is preferable when numerical precision is more critical than memory efficiency.
+
+### Adagrad32bit Usage:
+
+```python
+import torch
+from bitsandbytes.optim import Adagrad32bit
+
+model = YourModel()
+params = model.parameters()
+
+# Initialize the optimizer with your model's parameters
+optimizer = Adagrad32bit(params, lr=0.01)
+
+# In your training loop
+optimizer.zero_grad()
+loss = compute_loss()  # Implement your loss computation
+loss.backward()
+optimizer.step()
+```
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
new file mode 100644
index 000000000..b4bb9d17d
--- /dev/null
+++ b/docs/source/quantization.mdx
@@ -0,0 +1 @@
+Linear8bitLt & Linear4bit

From 725d29af6c4118ba0ea7557bc960a2a1ea0c0f5f Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Thu, 1 Feb 2024 15:36:50 -0800
Subject: [PATCH 03/30] drafting + refactoring new docs

---
 README.md                    |   3 -
 docs/source/_toctree.yml     |  16 ++++
 docs/source/contributing.mdx |   6 ++
 docs/source/faqs.mdx         |   7 ++
 docs/source/integrations.mdx |   3 +
 docs/source/introduction.mdx |  58 +-----------
 docs/source/moduletree.mdx   |   5 ++
 docs/source/optimizers.mdx   | 166 ++++++++++++++++++++---------------
 docs/source/qlora.mdx        |   1 +
 docs/source/quantization.mdx |   6 +-
 docs/source/resources.mdx    |  90 +++++++++++++++++++
 howto_config_override.md     |  40 ---------
 12 files changed, 231 insertions(+), 170 deletions(-)
 create mode 100644 docs/source/contributing.mdx
 create mode 100644 docs/source/faqs.mdx
 create mode 100644 docs/source/moduletree.mdx
 create mode 100644 docs/source/qlora.mdx
 create mode 100644 docs/source/resources.mdx
 delete mode 100644 howto_config_override.md

diff --git a/README.md b/README.md
index 61dede8c1..35a03dbcb 100644
--- a/README.md
+++ b/README.md
@@ -4,10 +4,7 @@ The bitsandbytes is a lightweight wrapper around CUDA custom functions, in parti
 
 
 
-Resources:
-- [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861) --  [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) -- [Docs](https://bitsandbytes.readthedocs.io/en/latest/)
 
-- [LLM.int8() Paper](https://arxiv.org/abs/2208.07339) -- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration) -- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)
 
 ## TL;DR
 **Requirements**
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 8f63a6339..b1a957c6c 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -6,6 +6,8 @@
     title: Quickstart
   - local: installation
     title: Installation
+  - local: moduletree
+    title: Module Tree
 - title: Features & Integrations
   sections:
   - local: quantization
@@ -14,3 +16,17 @@
     title: Optimizers
   - local: integrations
     title: Integrations
+  - local: qlora
+    title: QLoRA
+- title: Support & Learning
+  sections:
+  - local: resources
+    title: Papers, related resources & how to cite
+  - local: faqs
+    title: FAQs (Frequently Asked Questions)
+- title: Contributors Guidelines
+  sections:
+  - local: contributing
+    title: Contributing
+  # - local: code_of_conduct
+  #   title: Code of Conduct
diff --git a/docs/source/contributing.mdx b/docs/source/contributing.mdx
new file mode 100644
index 000000000..45bb72ce9
--- /dev/null
+++ b/docs/source/contributing.mdx
@@ -0,0 +1,6 @@
+# Contributors guidelines
+... still under construction ... (feel free to propose materials, `bitsandbytes` is a community project)
+
+## Documentation
+- [guideline for documentation syntax](https://github.com/huggingface/doc-builder#readme)
+- images shall be uploaded via PR in the `bitsandbytes/` directory [here](https://huggingface.co/datasets/huggingface/documentation-images)
diff --git a/docs/source/faqs.mdx b/docs/source/faqs.mdx
new file mode 100644
index 000000000..b9549e9d8
--- /dev/null
+++ b/docs/source/faqs.mdx
@@ -0,0 +1,7 @@
+# FAQs
+
+Please submit your questions in [this GitHub Discussion thread](https://github.com/TimDettmers/bitsandbytes/discussions/1013) if you think they are likely to affect many other users and are not sufficiently covered in the documentation.
+
+We'll pick the most generally applicable ones and post the QAs here or integrate them into the general documentation (also feel free to submit doc PRs, please).
+
+# ... under construction ...
diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index a12dd31ef..25deb839b 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -1,5 +1,8 @@
 # Transformers
+... TODO: to be filled out ...
 
 # PEFT
+... TODO: to be filled out ...
 
 # Trainer for the optimizers
+... TODO: to be filled out ...
diff --git a/docs/source/introduction.mdx b/docs/source/introduction.mdx
index 7506992bc..b7bf499b9 100644
--- a/docs/source/introduction.mdx
+++ b/docs/source/introduction.mdx
@@ -1,39 +1,11 @@
+TODO: Many parts of this doc will still be redistributed among the new doc structure.
+
 # `bitsandbytes`
 
 The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
 
 There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
 
-# Resources:
-- [8-bit Optimizer Paper](https://arxiv.org/abs/2110.02861) --  [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE) -- [Docs](https://bitsandbytes.readthedocs.io/en/latest/)
-
-- [LLM.int8() Paper](https://arxiv.org/abs/2208.07339) -- [LLM.int8() Software Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration) -- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)
-
-## TL;DR
-**Requirements**
-Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
-
-(Deprecated: CUDA 10.0 is deprecated and only CUDA >= 11.0) will be supported with release 0.39.0)
-
-**Installation**:
-
-``pip install bitsandbytes``
-
-In some cases it can happen that you need to compile from source. If this happens please consider submitting a bug report with `python -m bitsandbytes` information. What now follows is some short instructions which might work out of the box if `nvcc` is installed. If these do not work see further below.
-
-Compilation quickstart:
-```bash
-git clone https://github.com/timdettmers/bitsandbytes.git
-cd bitsandbytes
-
-# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 120}
-# make argument in {cuda110, cuda11x, cuda12x}
-# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes
-CUDA_VERSION=117 make cuda11x
-python setup.py install
-```
-
-**Using Int8 inference with HuggingFace Transformers**
 
 ```python
 from transformers import AutoModelForCausalLM
@@ -89,9 +61,6 @@ The bitsandbytes library is currently only supported on Linux distributions. Win
 
 The requirements can best be fulfilled by installing pytorch via anaconda. You can install PyTorch by following the ["Get Started"](https://pytorch.org/get-started/locally/) instructions on the official website.
 
-To install run:
-
-``pip install bitsandbytes``
 
 ## Using bitsandbytes
 
@@ -166,26 +135,3 @@ For more detailed instruction, please follow the [compile_from_source.md](compil
 The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: Pytorch is licensed under the BSD license.
 
 We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization.
-
-## How to cite us
-If you found this library and found LLM.int8() useful, please consider citing our work:
-
-```bibtex
-@article{dettmers2022llmint8,
-  title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
-  author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
-  journal={arXiv preprint arXiv:2208.07339},
-  year={2022}
-}
-```
-
-For 8-bit optimizers or quantization routines, please consider citing the following work:
-
-```bibtex
-@article{dettmers2022optimizers,
-  title={8-bit Optimizers via Block-wise Quantization},
-  author={Dettmers, Tim and Lewis, Mike and Shleifer, Sam and Zettlemoyer, Luke},
-  journal={9th International Conference on Learning Representations, ICLR},
-  year={2022}
-}
-```
diff --git a/docs/source/moduletree.mdx b/docs/source/moduletree.mdx
new file mode 100644
index 000000000..2bd10a4a6
--- /dev/null
+++ b/docs/source/moduletree.mdx
@@ -0,0 +1,5 @@
+# Module tree overview
+
+- **bitsandbytes.functional**: Contains quantization functions and stateless 8-bit optimizer update functions.
+- **bitsandbytes.nn.modules**: Contains the stable embedding layer with automatic 32-bit optimizer overrides (important for NLP stability).
+- **bitsandbytes.optim**: Contains 8-bit optimizers.
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index 1ac80b593..a71478adc 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -1,103 +1,129 @@
-Here we provide a short description and usage examples for each optimizer in `bitsandbytes.optim. We'll start by explaining the core optimizer class `Optimizer8bit`, followed by the specific implementations `Adagrad`, `Adagrad8bit` and `Adagrad32bit`.
+# Introduction: 8-bit optimizers
+With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:
 
-Each of these optimizers can be utilized depending on the specific requirements of the task at hand, such as memory constraints, computational efficiency and the need for precision.
+- Faster (e.g. 4x faster than regular Adam)
+- 75% less memory, same performance
+- No hyperparameter tuning needed
 
-# Optimizer base class
+8-bit optimizers are mostly useful for finetuning large models that did not fit into memory before. They also make it easier to pretrain larger models and have great synergy with sharded data parallelism. 8-bit Adam, for example, is already used across multiple teams at Facebook. This optimizer saves a substantial amount of memory with no loss in accuracy.
 
-## `Optimizer8bit`
+See here the biggest models
 
-The `Optimizer8bit` class serves as a base class for all 8-bit optimizers, providing common functionalities required for quantized optimization. The class is designed to support both 32-bit and 8-bit computations, where 8-bit optimizations can significantly reduce memory footprint and increase computation speed.
+We feature 8-bit Adam/AdamW, SGD momentum, LARS, LAMB, and RMSProp.
 
-### Usage:
+It only requires a two-line code change to get started.
+```python
+import bitsandbytes as bnb
 
-```python
-import torch
-from bitsandbytes.optim import Optimizer8bit
-
-model = YourModel()
-params = model.parameters()
-
-# Initialize the optimizer with your model's parameters
-optimizer = Optimizer8bit(params, defaults={
-    'lr': 0.001,
-    'betas': (0.9, 0.999),
-    'eps': 1e-08,
-    'weight_decay': 0
-}, optim_bits=8)  # Use optim_bits=32 for 32-bit optimization
-
-# In your training loop
-optimizer.zero_grad()
-loss = compute_loss()  # Implement your loss computation
-loss.backward()
-optimizer.step()
+# before: adam = torch.optim.Adam(...)
+adam = bnb.optim.Adam8bit(...)
+
+# recommended for NLP models
+# before: torch.nn.Embedding(...)
+bnb.nn.StableEmbedding(...)
 ```
 
-# Adagrad implementations
+The arguments passed are the same as for standard Adam. For NLP models we also recommend using the `StableEmbedding` layer, which improves results and helps with stable 8-bit optimization.
 
-## `Adagrad`
+## Overview of expected gradients
 
-The `Adagrad` class is an implementation of the Adagrad optimizer, which adapts the learning rate for each parameter based on the historical gradient information. This version allows for both 32-bit and 8-bit representations, with specific classes for each.
+TODO: add pics here, no idea how to do that
 
-### `Adagrad` Usage:
+Want to add both pics in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/bitsandbytes
 
-```python
-import torch
-from bitsandbytes.optim import Adagrad
+# Research Background
 
-model = YourModel()
-params = model.parameters()
+Stateful optimizers maintain gradient statistics over time, e.g. the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. `bitsandbytes` optimizers use 8-bit statistics, while maintaining the performance levels of using 32-bit optimizer states.
 
-# Initialize the optimizer with your model's parameters
-optimizer = Adagrad(params, lr=0.01)
+To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:
+1) **Block-wise quantization** divides input tensors into smaller blocks that are independently quantized, thereby isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization.
+2) **Dynamic quantization** quantizes both small and large values with high precision.
+3) A **stable embedding layer** improves stability during optimization for models with word embeddings.
 
-# In your training loop
-optimizer.zero_grad()
-loss = compute_loss()  # Implement your loss computation
-loss.backward()
-optimizer.step()
-```
+With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.
 
-## `Adagrad8bit`
+We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers.
 
-The `Adagrad8bit` class is specifically tailored for 8-bit optimization, inheriting from `Optimizer1State`. It is designed for models where memory efficiency is crucial and it operates with reduced precision to save memory and increase computation speed.
+For more details, please refer to the paper [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861).
 
-### `Adagrad8bit` Usage:
+## Stable Embedding Layer
 
-```python
-import torch
-from bitsandbytes.optim import Adagrad8bit
+The Stable Embedding Layer enhances the standard word embedding layer for improved training stability in NLP tasks. It addresses the challenge of non-uniform input distributions and mitigates extreme gradient variations, ensuring smoother training processes.
+
+### Features:
+
+- **Initialization**: Utilizes Xavier uniform initialization to maintain consistent variance, reducing the likelihood of large gradients.
+- **Normalization**: Incorporates layer normalization before adding positional embeddings, aiding in output stability.
+- **Optimizer States**: Employs 32-bit optimizer states exclusively for this layer to enhance stability, while the rest of the model may use standard 16-bit precision.
+
+### Benefits:
+
+- Designed to support more aggressive quantization strategies without compromising training stability.
+- Helps in achieving stable training outcomes, particularly important for models dealing with diverse and complex language data.
 
-model = YourModel()
+# Usage
 
-# Initialize the optimizer with your model's parameters
-optimizer = Adagrad8bit(params, lr=0.01)
+Some more examples of how you can replace your old optimizer with the 8-bit optimizer:
 
-# In your training loop
-optimizer.zero_grad()
-loss = compute_loss()  # Implement your loss computation
-loss.backward()
-optimizer.step()
 ```
+import bitsandbytes as bnb
 
-## Adagrad32bit
+# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
+adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
+adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent
 
-The `Adagrad32bit` class is similar to `Adagrad` but ensures that all computations are carried out with 32-bit precision. This class is preferable when numerical precision is more critical than memory efficiency.
+# use 32-bit Adam with 5th percentile clipping
+adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995),
+                      optim_bits=32, percentile_clipping=5)
+```
+
+# How to override config hyperparameters for particular weights/parameters
+
+If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things:
+
+1) Register the parameters while they are still on the CPU,
+2) override the config with the new desired hyperparameters (anytime, anywhere)
 
-### Adagrad32bit Usage:
+For global overrides in many different places in your code you can do:
 
 ```python
 import torch
-from bitsandbytes.optim import Adagrad32bit
+import bitsandbytes as bnb
 
-model = YourModel()
-params = model.parameters()
+mng = bnb.optim.GlobalOptimManager.get_instance()
 
-# Initialize the optimizer with your model's parameters
-optimizer = Adagrad32bit(params, lr=0.01)
+model = MyModel()
+mng.register_parameters(model.parameters()) # 1. register parameters while still on CPU
 
-# In your training loop
-optimizer.zero_grad()
-loss = compute_loss()  # Implement your loss computation
-loss.backward()
-optimizer.step()
+model = model.cuda()
+# use 8-bit optimizer states for all parameters
+adam = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)
+
+# 2a. override: the parameter model.fc1.weight now uses 32-bit Adam
+mng.override_config(model.fc1.weight, 'optim_bits', 32)
+
+# 2b. override: the two special layers use
+# sparse optimization + different learning rate + different Adam betas
+mng.override_config([model.special.weight, model.also_special.weight],
+                    key_value_dict={'is_sparse': True, 'lr': 1e-5, 'betas': (0.9, 0.98)})
 ```
+Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`
+
+For overrides for particular layers we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
+```python
+class MyModule(torch.nn.Module):
+  def __init__(self, din, dout):
+    super(MyModule, self).__init__()
+    self.linear = torch.nn.Linear(din, dout)
+    # optimization will happen in 32-bit and
+    # learning rate will be set to 0.0001 independent of the main learning rate
+    config = {'optim_bits': 32, 'lr' : 0.0001}
+    bnb.optim.GlobalOptimManager.get_instance().register_module_override(self.linear, 'weight', config)
+
+```
+
+# API Docs
+
+... under construction ...
+
+Here we'll provide auto-generated API docs soon. Please feel free to contribute doc-strings for the respective optimizers, as `bitsandbytes` is a community effort.
diff --git a/docs/source/qlora.mdx b/docs/source/qlora.mdx
new file mode 100644
index 000000000..3eb24a5e9
--- /dev/null
+++ b/docs/source/qlora.mdx
@@ -0,0 +1 @@
+# ... under construction ...(contributions welcome)
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index b4bb9d17d..c020df642 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -1 +1,5 @@
-Linear8bitLt & Linear4bit
+# Linear8bitLt
+... TODO: to be filled out ...
+
+# Linear4bit
+... TODO: to be filled out ...
diff --git a/docs/source/resources.mdx b/docs/source/resources.mdx
new file mode 100644
index 000000000..cafaf189b
--- /dev/null
+++ b/docs/source/resources.mdx
@@ -0,0 +1,90 @@
+# Papers, related resources & how to cite
+
+The academic work below is listed in reverse chronological order.
+
+## [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Jun 2023)](https://arxiv.org/abs/2306.03078)
+Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh
+
+- [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1666076553665744896)
+
+```
+@article{dettmers2023spqr,
+  title={SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression},
+  author={Dettmers, Tim and Svirschevski, Ruslan and Egiazarian, Vage and Kuznedelev, Denis and Frantar, Elias and Ashkboos, Saleh and Borzunov, Alexander and Hoefler, Torsten and Alistarh, Dan},
+  journal={arXiv preprint arXiv:2306.03078},
+  year={2023}
+}
+```
+
+## [QLoRA: Efficient Finetuning of Quantized LLMs (May 2023)](https://arxiv.org/abs/2305.14314)
+Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
+
+- [Video](https://www.youtube.com/watch?v=y9PHWGOa8HA&ab_channel=LondonMachineLearningMeetup)
+- [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1661379354507476994)
+
+```
+@article{dettmers2023qlora,
+  title={Qlora: Efficient finetuning of quantized llms},
+  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
+  journal={arXiv preprint arXiv:2305.14314},
+  year={2023}
+}
+```
+## [The case for 4-bit precision: k-bit Inference Scaling Laws (Dec 2022)](https://arxiv.org/abs/2212.09720)
+Authors: Tim Dettmers, Luke Zettlemoyer
+
+- [Video](https://www.youtube.com/watch?v=odlQa6AE1gY&ab_channel=TheInsideView)
+- [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1605209171758284805)
+
+```
+@inproceedings{dettmers2023case,
+  title={The case for 4-bit precision: k-bit inference scaling laws},
+  author={Dettmers, Tim and Zettlemoyer, Luke},
+  booktitle={International Conference on Machine Learning},
+  pages={7750--7774},
+  year={2023},
+  organization={PMLR}
+}
+```
+
+## [LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Nov 2022)](https://arxiv.org/abs/2208.07339)
+Authors: Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer
+
+- [LLM.int8() Blog Post](https://huggingface.co/blog/hf-bitsandbytes-integration)
+- [LLM.int8() Emergent Features Blog Post](https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/)
+- [Introduction to Weight Quantization](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c)
+- [Poster](https://twitter.com/Tim_Dettmers/status/1598351301942951937)
+
+```
+@article{dettmers2022llm,
+  title={Llm. int8 (): 8-bit matrix multiplication for transformers at scale},
+  author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
+  journal={arXiv preprint arXiv:2208.07339},
+  year={2022}
+}
+```
+
+## [8-bit Optimizers via Block-wise Quantization (Oct 2021)](https://arxiv.org/abs/2110.02861)
+Authors: Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer
+
+- [Video](https://www.youtube.com/watch?v=IxrlHAJtqKE)
+- [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1446472128979562499)
+
+```
+@article{DBLP:journals/corr/abs-2110-02861,
+  author       = {Tim Dettmers and
+                  Mike Lewis and
+                  Sam Shleifer and
+                  Luke Zettlemoyer},
+  title        = {8-bit Optimizers via Block-wise Quantization},
+  journal      = {CoRR},
+  volume       = {abs/2110.02861},
+  year         = {2021},
+  url          = {https://arxiv.org/abs/2110.02861},
+  eprinttype    = {arXiv},
+  eprint       = {2110.02861},
+  timestamp    = {Thu, 21 Oct 2021 16:20:08 +0200},
+  biburl       = {https://dblp.org/rec/journals/corr/abs-2110-02861.bib},
+  bibsource    = {dblp computer science bibliography, https://dblp.org}
+}
+```
diff --git a/howto_config_override.md b/howto_config_override.md
deleted file mode 100644
index 55b24e3ab..000000000
--- a/howto_config_override.md
+++ /dev/null
@@ -1,40 +0,0 @@
-# How to override config hyperparameters for particular weights/parameters
-
-If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things: (1) register the parameter while they are still on the CPU, (2) override the config with the new desired hyperparameters (anytime, anywhere). See our [guide](howto_config_override.md) for more details
-
-For global overrides in many different places in your code you can do:
-```python
-import torch
-import bitsandbytes as bnb
-
-mng = bnb.optim.GlobalOptimManager.get_instance()
-
-model = MyModel()
-mng.register_parameters(model.parameters()) # 1. register parameters while still on CPU
-
-model = model.cuda()
-# use 8-bit optimizer states for all parameters
-adam = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)
-
-# 2a. override: the parameter model.fc1.weight now uses 32-bit Adam
-mng.override_config(model.fc1.weight, 'optim_bits', 32)
-
-# 2b. override: the two special layers use
-# sparse optimization + different learning rate + different Adam betas
-mng.override_config([model.special.weight, model.also_special.weight],
-                    key_value_dict ={'is_sparse': True, 'lr': 1e-5, 'betas'=(0.9, 0.98)})
-```
-Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`
-
-For overrides for particular layers we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
-```python
-class MyModule(torch.nn.Module):
-  def __init__(din, dout):
-    super(MyModule, self).__init__()
-    self.linear = torch.nn.Linear(din, dout)
-    # optimization will happen in 32-bit and
-    # learning rate will be set to 0.0001 independent of the main learning rate
-    config = {'optim_bits': 32, 'lr' : 0.0001}
-    GlobalOptimManager.get_instance().register_module_override(self, 'weight', config)
-
-```

From 58566e2638bf4aedfb2f83ad6ac707728c77b7f3 Mon Sep 17 00:00:00 2001
From: younesbelkada <younesbelkada@gmail.com>
Date: Fri, 2 Feb 2024 03:26:21 +0000
Subject: [PATCH 04/30] some changes

---
 docs/source/_toctree.yml     |  2 --
 docs/source/faqs.mdx         |  2 +-
 docs/source/installation.mdx |  8 ++++----
 docs/source/integrations.mdx |  3 +++
 docs/source/introduction.mdx | 15 +++------------
 docs/source/moduletree.mdx   |  4 ++--
 docs/source/optimizers.mdx   | 32 +++++++++++++++++++++-----------
 docs/source/qlora.mdx        |  1 -
 docs/source/quantization.mdx |  4 ++--
 docs/source/quickstart.mdx   |  2 +-
 docs/source/resources.mdx    |  1 +
 11 files changed, 38 insertions(+), 36 deletions(-)
 delete mode 100644 docs/source/qlora.mdx

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index b1a957c6c..182d10900 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -16,8 +16,6 @@
     title: Optimizers
   - local: integrations
     title: Integrations
-  - local: qlora
-    title: QLoRA
 - title: Support & Learning
   sections:
   - local: resources
diff --git a/docs/source/faqs.mdx b/docs/source/faqs.mdx
index b9549e9d8..801a27b15 100644
--- a/docs/source/faqs.mdx
+++ b/docs/source/faqs.mdx
@@ -4,4 +4,4 @@ Please submit your questions in [this Github Discussion thread](https://github.c
 
 We'll pick the most generally applicable ones and post the QAs here or integrate them into the general documentation (also feel free to submit doc PRs, please).
 
-# ... under construction ...
+# ... under construction ...
\ No newline at end of file
diff --git a/docs/source/installation.mdx b/docs/source/installation.mdx
index 860acb35b..d28431f77 100644
--- a/docs/source/installation.mdx
+++ b/docs/source/installation.mdx
@@ -4,7 +4,6 @@ Note currently `bitsandbytes` is only supported on CUDA GPU hardwares, support f
 
 <hfoptions id="OS system">
 <hfoption id="Linux">
-<hfoption id="MacOS">
 
 ## Linux
 
@@ -22,7 +21,7 @@ CUDA_VERSION=XXX make cuda12x
 python setup.py install
 ```
 
-with `XXX` being your CUDA version, for <12.0 call `make cuda 11x`
+with `XXX` being your CUDA version; for versions below 12.0, call `make cuda11x` instead. Support for non-CUDA GPUs (e.g. AMD, Intel) is also coming soon.
 
 </hfoption>
 <hfoption id="Windows">
@@ -41,11 +40,12 @@ python -m build --wheel
 Big thanks to [wkpark](https://github.com/wkpark), [Jamezo97](https://github.com/Jamezo97), [rickardp](https://github.com/rickardp), [akx](https://github.com/akx) for their amazing contributions to make bitsandbytes compatible with Windows.
 
 </hfoption>
-<hfoption id="Windows">
+<hfoption id="MacOS">
 
 ## MacOS
 
-Mac support is still a work in progress.
+Mac support is still a work in progress. Follow the latest bitsandbytes issues for updates on macOS integration.
 
 </hfoption>
+
 </hfoptions>
diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index 25deb839b..ba3ee218f 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -1,8 +1,11 @@
 # Transformers
+
 ... TODO: to be filled out ...
 
 # PEFT
+
 ... TODO: to be filled out ...
 
 # Trainer for the optimizers
+
 ... TODO: to be filled out ...
diff --git a/docs/source/introduction.mdx b/docs/source/introduction.mdx
index b7bf499b9..bc5c7a6d0 100644
--- a/docs/source/introduction.mdx
+++ b/docs/source/introduction.mdx
@@ -5,20 +5,10 @@ TODO: Many parts of this doc will still be redistributed among the new doc struc
 The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
 
 There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
+The library includes quantization primitives for 8-bit and 4-bit operations through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit`, and 8-bit optimizers through the `bitsandbytes.optim` module.
 
+**Using 8-bit optimizers**:
 
-```python
-from transformers import AutoModelForCausalLM
-model = AutoModelForCausalLM.from_pretrained(
-  'decapoda-research/llama-7b-hf',
-  device_map='auto',
-  load_in_8bit=True,
-  max_memory=f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB')
-```
-
-A more detailed example, can be found in [examples/int8_inference_huggingface.py](examples/int8_inference_huggingface.py).
-
-**Using 8-bit optimizer**:
 1. Comment out optimizer: ``#torch.optim.Adam(....)``
 2. Add 8-bit optimizer of your choice ``bnb.optim.Adam8bit(....)`` (arguments stay the same)
 3. Replace embedding layer if necessary: ``torch.nn.Embedding(..) -> bnb.nn.Embedding(..)``
@@ -40,6 +30,7 @@ out = linear(x.to(torch.float16))
 
 
 ## Features
+
 - 8-bit Matrix multiplication with mixed precision decomposition
 - LLM.int8() inference
 - 8-bit Optimizers: Adam, AdamW, RMSProp, LARS, LAMB, Lion (saves 75% memory)
diff --git a/docs/source/moduletree.mdx b/docs/source/moduletree.mdx
index 2bd10a4a6..ec372f9a0 100644
--- a/docs/source/moduletree.mdx
+++ b/docs/source/moduletree.mdx
@@ -1,5 +1,5 @@
 # Module tree overview
 
-- **bitsandbytes.functional**: Contains quantization functions and stateless 8-bit optimizer update functions.
+- **bitsandbytes.functional**: Contains quantization functions (4-bit & 8-bit) and stateless 8-bit optimizer update functions.
 - **bitsandbytes.nn.modules**: Contains stable embedding layer with automatic 32-bit optimizer overrides (important for NLP stability)
-- **bitsandbytes.optim**: Contains 8-bit optimizers.
+- **bitsandbytes.optim**: Contains 8-bit optimizers.
\ No newline at end of file
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index a71478adc..04738a439 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -1,4 +1,5 @@
 # Introduction: 8-bit optimizers
+
 With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:
 
 - Faster (e.g. 4x faster than regular Adam)
@@ -12,7 +13,7 @@ See here the biggest models
 We feature 8-bit Adam/AdamW, SGD momentum, LARS, LAMB, and RMSProp.
 
 It only requires a two-line code change to get started.
-```
+```py
 import bitsandbytes as bnb
 
 # before: adam = torch.optim.Adam(...)
@@ -25,20 +26,30 @@ bnb.nn.StableEmbedding(...)
 
 The arguments passed are the same as standard Adam. For NLP models we recommend also to use the StableEmbedding layers which improves results and helps with stable 8-bit optimization.
 
+## Overview of supported 8-bit optimizers 
+
+TOOD: List here all optimizers in `bitsandbytes/optim/__init__.py`
+TODO (future) have an automated API docs through doc-builder
+
 ## Overview of expected gradients
 
-TODO: add pics here, no idea how to do that
+<div style="text-align: center">
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bitsandbytes/optimizer_comparison.png", width="50%">
+</div>
 
-Want to add both pics in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/bitsandbytes
+<div style="text-align: center">
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bitsandbytes/optimizer_largest_model.png", width="50%">
+</div>
 
 # Research Background
 
 Stateful optimizers maintain gradient statistics over time, e.g. the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. `bitsandbytes` optimizers use 8-bit statistics, while maintaining the performance levels of using 32-bit optimizer states.
 
 To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:
-1) **Block-wise quantization** divides input tensors into smaller blocks that are independently quantized, therein isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization.
-2) **dynamic quantization**, which quantizes both small and large values with high precision,
-3) a **stable embedding layer** improves stability during optimization for models with word embeddings.
+
+1. **Block-wise quantization** divides input tensors into smaller blocks that are quantized independently, thereby isolating outliers and distributing the error more evenly over all bits. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization.
+2. **Dynamic quantization** quantizes both small and large values with high precision.
+3. A **stable embedding layer** improves stability during optimization for models with word embeddings.
 
 With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.
 
@@ -65,15 +76,14 @@ The Stable Embedding Layer enhances the standard word embedding layer for improv
 
 Some more examples of how you can replace your old optimizer with the 8-bit optimizer:
 
-```
+```diff
 import bitsandbytes as bnb
 
-# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
-adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
-adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent
+- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
++ adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
 
 # use 32-bit Adam with 5th percentile clipping
-adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995),
++ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995),
                       optim_bits=32, percentile_clipping=5)
 ```
 
diff --git a/docs/source/qlora.mdx b/docs/source/qlora.mdx
deleted file mode 100644
index 3eb24a5e9..000000000
--- a/docs/source/qlora.mdx
+++ /dev/null
@@ -1 +0,0 @@
-# ... under construction ...(contributions welcome)
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index c020df642..a09d90c2d 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -1,5 +1,5 @@
-# Linear8bitLt
+# Linear8bitLt (LLM.int8)
 ... TODO: to be filled out ...
 
-# Linear4bit
+# Linear4bit (QLoRA)
 ... TODO: to be filled out ...
diff --git a/docs/source/quickstart.mdx b/docs/source/quickstart.mdx
index d1028c655..3a560ff6b 100644
--- a/docs/source/quickstart.mdx
+++ b/docs/source/quickstart.mdx
@@ -8,5 +8,5 @@
 
 The following code illustrates the steps above.
 
-```python
+```py
 ```
diff --git a/docs/source/resources.mdx b/docs/source/resources.mdx
index cafaf189b..d3ac952d5 100644
--- a/docs/source/resources.mdx
+++ b/docs/source/resources.mdx
@@ -3,6 +3,7 @@
 The below academic work is ordered in reverse chronological order.
 
 ## [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Jun 2023)](https://arxiv.org/abs/2306.03078)
+
 Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh
 
 - [Twitter summary thread](https://twitter.com/Tim_Dettmers/status/1666076553665744896)

From 47cc3e9f699007971a21c9f11cc5c548aac199ab Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Thu, 1 Feb 2024 20:33:59 -0800
Subject: [PATCH 05/30] run pre-commit hooks

---
 .pre-commit-config.yaml    | 2 +-
 docs/source/faqs.mdx       | 2 +-
 docs/source/moduletree.mdx | 2 +-
 docs/source/optimizers.mdx | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 039139b95..feb6c766e 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.1.15
+    rev: v0.2.0
     hooks:
       - id: ruff
         args:
diff --git a/docs/source/faqs.mdx b/docs/source/faqs.mdx
index 801a27b15..b9549e9d8 100644
--- a/docs/source/faqs.mdx
+++ b/docs/source/faqs.mdx
@@ -4,4 +4,4 @@ Please submit your questions in [this Github Discussion thread](https://github.c
 
 We'll pick the most generally applicable ones and post the QAs here or integrate them into the general documentation (also feel free to submit doc PRs, please).
 
-# ... under construction ...
\ No newline at end of file
+# ... under construction ...
diff --git a/docs/source/moduletree.mdx b/docs/source/moduletree.mdx
index ec372f9a0..d117f90c0 100644
--- a/docs/source/moduletree.mdx
+++ b/docs/source/moduletree.mdx
@@ -2,4 +2,4 @@
 
 - **bitsandbytes.functional**: Contains quantization functions (4-bit & 8-bit) and stateless 8-bit optimizer update functions.
 - **bitsandbytes.nn.modules**: Contains stable embedding layer with automatic 32-bit optimizer overrides (important for NLP stability)
-- **bitsandbytes.optim**: Contains 8-bit optimizers.
\ No newline at end of file
+- **bitsandbytes.optim**: Contains 8-bit optimizers.
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index 04738a439..3a6c8ca1f 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -26,7 +26,7 @@ bnb.nn.StableEmbedding(...)
 
 The arguments passed are the same as standard Adam. For NLP models we recommend also to use the StableEmbedding layers which improves results and helps with stable 8-bit optimization.
 
-## Overview of supported 8-bit optimizers 
+## Overview of supported 8-bit optimizers
 
 TOOD: List here all optimizers in `bitsandbytes/optim/__init__.py`
 TODO (future) have an automated API docs through doc-builder

From c26645b2ab811fa222a8cb8d19792e10295739ff Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Thu, 1 Feb 2024 20:50:56 -0800
Subject: [PATCH 06/30] add mention of pre-commit to contributing

---
 docs/source/contributing.mdx | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/source/contributing.mdx b/docs/source/contributing.mdx
index 45bb72ce9..18ad66b4c 100644
--- a/docs/source/contributing.mdx
+++ b/docs/source/contributing.mdx
@@ -1,6 +1,13 @@
 # Contributors guidelines
 ... stil under construction ... (feel free to propose materials, `bitsandbytes` is a community project)
 
+# Setup pre-commit hooks
+- Install pre-commit with `pip install pre-commit`.
+- Run `pre-commit install` once to set up the git hooks.
+- Run `pre-commit autoupdate` every time a new hook is added.
+
+The pre-commit hooks now run automatically whenever you commit. If they modify any files, re-add the changed files before committing and pushing.
+
 ## Documentation
 - [guideline for documentation syntax](https://github.com/huggingface/doc-builder#readme)
 - images shall be uploaded via PR in the `bitsandbytes/` directory [here](https://huggingface.co/datasets/huggingface/documentation-images)

From ab42c5f1248fec73310e9b427ccef8971a6739eb Mon Sep 17 00:00:00 2001
From: younesbelkada <younesbelkada@gmail.com>
Date: Fri, 2 Feb 2024 05:55:54 +0000
Subject: [PATCH 07/30] fix

---
 docs/source/optimizers.mdx | 4 ++--
 docs/source/resources.mdx  | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index 3a6c8ca1f..6f111610a 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -96,7 +96,7 @@ If you want to optimize some unstable parameters with 32-bit Adam and others wit
 
 For global overrides in many different places in your code you can do:
 
-```python
+```py
 import torch
 import bitsandbytes as bnb
 
@@ -120,7 +120,7 @@ mng.override_config([model.special.weight, model.also_special.weight],
 Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`
 
 For overrides for particular layers we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
-```python
+```py
 class MyModule(torch.nn.Module):
   def __init__(din, dout):
     super(MyModule, self).__init__()
diff --git a/docs/source/resources.mdx b/docs/source/resources.mdx
index d3ac952d5..56330175a 100644
--- a/docs/source/resources.mdx
+++ b/docs/source/resources.mdx
@@ -30,6 +30,7 @@ Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
   journal={arXiv preprint arXiv:2305.14314},
   year={2023}
 }
+```
 
 ## [The case for 4-bit precision: k-bit Inference Scaling Laws (Dec 2022)](https://arxiv.org/abs/2212.09720)
 Authors: Tim Dettmers, Luke Zettlemoyer

From a71efa8b7660eebff5317d34783ee156074eb519 Mon Sep 17 00:00:00 2001
From: younesbelkada <younesbelkada@gmail.com>
Date: Fri, 2 Feb 2024 06:17:10 +0000
Subject: [PATCH 08/30] test autodoc

---
 bitsandbytes/nn/modules.py   | 42 ++++++++++++++++++++++++++++++++++++
 docs/source/quantization.mdx | 10 +++++++--
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/bitsandbytes/nn/modules.py b/bitsandbytes/nn/modules.py
index 922feae15..304c1d405 100644
--- a/bitsandbytes/nn/modules.py
+++ b/bitsandbytes/nn/modules.py
@@ -397,9 +397,51 @@ def maybe_rearrange_weight(state_dict, prefix, local_metadata, strict, missing_k
 
 
 class Linear8bitLt(nn.Linear):
+    """
+    This class is the base module for the [LLM.int8()](https://arxiv.org/abs/2208.07339) algorithm.
+    To read more about it, have a look at the paper.
+
+    To quantize a linear layer, first load the original fp16 / bf16 weights into
+    the Linear8bitLt module, then call `int8_module.to("cuda")` to quantize the weights.
+
+    Example:
+
+    ```python
+    import torch
+    import torch.nn as nn
+
+    import bitsandbytes as bnb
+    from bitsandbytes.nn import Linear8bitLt
+
+    fp16_model = nn.Sequential(
+        nn.Linear(64, 64),
+        nn.Linear(64, 64)
+    )
+
+    int8_model = nn.Sequential(
+        Linear8bitLt(64, 64, has_fp16_weights=False),
+        Linear8bitLt(64, 64, has_fp16_weights=False)
+    )
+
+    int8_model.load_state_dict(fp16_model.state_dict())
+    int8_model = int8_model.to(0) # Quantization happens here
+    ```
+    """
     def __init__(self, input_features, output_features, bias=True, has_fp16_weights=True,
                        memory_efficient_backward=False, threshold=0.0, index=None, device=None):
         super().__init__(input_features, output_features, bias, device)
+        """
+        Initialize Linear8bitLt class.
+
+        Args:
+            input_features (`int`):
+                Number of input features of the linear layer.
+            output_features (`int`):
+                Number of output features of the linear layer.
+            bias (`bool`, defaults to `True`):
+                Whether the linear class uses the bias term as well.
+        """
+        
         assert not memory_efficient_backward, "memory_efficient_backward is no longer required and the argument is deprecated in 0.37.0 and will be removed in 0.39.0"
         self.state = bnb.MatmulLtState()
         self.index = index
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index a09d90c2d..71f15abac 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -1,5 +1,11 @@
-# Linear8bitLt (LLM.int8)
-... TODO: to be filled out ...
+# Quantization primitives
+
+Below you will find the docstring of the quantization primitives exposed in bitsandbytes.
+
+## Linear8bitLt
+
+[[autodoc]] bitsandbytes.nn.Linear8bitLt
+
 
 # Linear4bit (QLoRA)
 ... TODO: to be filled out ...

From c1ec5f8a6bf9ddfa3a2523f7d177f54ae7095b6d Mon Sep 17 00:00:00 2001
From: younesbelkada <younesbelkada@gmail.com>
Date: Fri, 2 Feb 2024 06:26:16 +0000
Subject: [PATCH 09/30] new additions

---
 .pre-commit-config.yaml      |  2 +-
 bitsandbytes/nn/modules.py   | 44 ++++++++++++++++++++++++++++++++++--
 docs/source/quantization.mdx |  3 ++-
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index feb6c766e..039139b95 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.2.0
+    rev: v0.1.15
     hooks:
       - id: ruff
         args:
diff --git a/bitsandbytes/nn/modules.py b/bitsandbytes/nn/modules.py
index 304c1d405..8597e9503 100644
--- a/bitsandbytes/nn/modules.py
+++ b/bitsandbytes/nn/modules.py
@@ -222,9 +222,50 @@ def to(self, *args, **kwargs):
 
 
 class Linear4bit(nn.Linear):
+    """
+    This class is the base module for the 4-bit quantization algorithm presented in [QLoRA](https://arxiv.org/abs/2305.14314).
+    QLoRA 4-bit linear layers use blockwise k-bit quantization under the hood, with the possibility of selecting various
+    quantization data types such as FP4 and NF4.
+
+    To quantize a linear layer, first load the original fp16 / bf16 weights into
+    the Linear4bit module, then call `quantized_module.to("cuda")` to quantize the fp16 / bf16 weights.
+
+    Example:
+
+    ```python
+    import torch
+    import torch.nn as nn
+
+    import bitsandbytes as bnb
+    from bitsandbytes.nn import Linear4bit
+
+    fp16_model = nn.Sequential(
+        nn.Linear(64, 64),
+        nn.Linear(64, 64)
+    )
 
+    quantized_model = nn.Sequential(
+        Linear4bit(64, 64),
+        Linear4bit(64, 64)
+    )
+
+    quantized_model.load_state_dict(fp16_model.state_dict())
+    quantized_model = quantized_model.to(0) # Quantization happens here
+    ```
+    """
     def __init__(self, input_features, output_features, bias=True, compute_dtype=None, compress_statistics=True, quant_type='fp4', quant_storage=torch.uint8, device=None):
         super().__init__(input_features, output_features, bias, device)
+        """
+        Initialize Linear4bit class.
+
+        Args:
+            input_features (`int`):
+                Number of input features of the linear layer.
+            output_features (`int`):
+                Number of output features of the linear layer.
+            bias (`bool`, defaults to `True`):
+                Whether the linear class uses the bias term as well.
+        """
         self.weight = Params4bit(self.weight.data, requires_grad=False, compress_statistics=compress_statistics, quant_type=quant_type, quant_storage=quant_storage, module=self)
         # self.persistent_buffers = []  # TODO consider as way to save quant state
         self.compute_dtype = compute_dtype
@@ -440,8 +481,7 @@ def __init__(self, input_features, output_features, bias=True, has_fp16_weights=
                 Number of output features of the linear layer.
             bias (`bool`, defaults to `True`):
                 Whether the linear class uses the bias term as well.
-        """
-        
+        """ 
         assert not memory_efficient_backward, "memory_efficient_backward is no longer required and the argument is deprecated in 0.37.0 and will be removed in 0.39.0"
         self.state = bnb.MatmulLtState()
         self.index = index
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index 71f15abac..287b2b87a 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -8,4 +8,5 @@ Below you will find the docstring of the quantization primitives exposed in bits
 
 
 # Linear4bit (QLoRA)
-... TODO: to be filled out ...
+
+[[autodoc]] bitsandbytes.nn.Linear4bit
\ No newline at end of file

From 544114df3a1f00de59327783289e48592452622d Mon Sep 17 00:00:00 2001
From: younesbelkada <younesbelkada@gmail.com>
Date: Fri, 2 Feb 2024 06:42:49 +0000
Subject: [PATCH 10/30] add subtitle

---
 docs/source/quantization.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index 287b2b87a..038295707 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -6,7 +6,7 @@ Below you will find the docstring of the quantization primitives exposed in bits
 
 [[autodoc]] bitsandbytes.nn.Linear8bitLt
 
-
-# Linear4bit (QLoRA)
+ 
+## Linear4bit (QLoRA)
 
 [[autodoc]] bitsandbytes.nn.Linear4bit
\ No newline at end of file

From f735b35683f402d3b9998357aec88d0145fa468a Mon Sep 17 00:00:00 2001
From: younesbelkada <younesbelkada@gmail.com>
Date: Fri, 2 Feb 2024 06:49:22 +0000
Subject: [PATCH 11/30] add some content

---
 bitsandbytes/nn/modules.py   |  3 +++
 docs/source/optimizers.mdx   | 20 ++++++++++----------
 docs/source/quantization.mdx |  9 ++++++---
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/bitsandbytes/nn/modules.py b/bitsandbytes/nn/modules.py
index 8597e9503..c2c19344d 100644
--- a/bitsandbytes/nn/modules.py
+++ b/bitsandbytes/nn/modules.py
@@ -19,6 +19,9 @@
 
 
 class StableEmbedding(torch.nn.Embedding):
+    """
+    TODO: @titus fill this with some info
+    """
     def __init__(
         self,
         num_embeddings: int,
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index 6f111610a..9906423d2 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -1,4 +1,4 @@
-# Introduction: 8-bit optimizers
+## Introduction: 8-bit optimizers
 
 With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:
 
@@ -29,7 +29,7 @@ The arguments passed are the same as standard Adam. For NLP models we recommend
 ## Overview of supported 8-bit optimizers
 
 TOOD: List here all optimizers in `bitsandbytes/optim/__init__.py`
-TODO (future) have an automated API docs through doc-builder
+TODO (future) document all optimizers as we did for Linear4bit / Linear8bitLt classes
 
 ## Overview of expected gradients
 
@@ -41,7 +41,7 @@ TODO (future) have an automated API docs through doc-builder
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bitsandbytes/optimizer_largest_model.png", width="50%">
 </div>
 
-# Research Background
+### Research Background
 
 Stateful optimizers maintain gradient statistics over time, e.g. the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. `bitsandbytes` optimizers use 8-bit statistics, while maintaining the performance levels of using 32-bit optimizer states.
 
@@ -61,18 +61,18 @@ For more details, please refer to the paper [8-bit Optimizers via Block-wise Qua
 
 The Stable Embedding Layer enhances the standard word embedding layer for improved training stability in NLP tasks. It addresses the challenge of non-uniform input distributions and mitigates extreme gradient variations, ensuring smoother training processes.
 
-### Features:
+#### Features:
 
 - **Initialization**: Utilizes Xavier uniform initialization to maintain consistent variance, reducing the likelihood of large gradients.
 - **Normalization**: Incorporates layer normalization before adding positional embeddings, aiding in output stability.
 - **Optimizer States**: Employs 32-bit optimizer states exclusively for this layer to enhance stability, while the rest of the model may use standard 16-bit precision.
 
-### Benefits:
+#### Benefits:
 
 - Designed to support more aggressive quantization strategies without compromising training stability.
 - Helps in achieving stable training outcomes, particularly important for models dealing with diverse and complex language data.
 
-# Usage
+## Usage
 
 Some more examples of how you can replace your old optimizer with the 8-bit optimizer:
 
@@ -83,11 +83,11 @@ import bitsandbytes as bnb
 + adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
 
 # use 32-bit Adam with 5th percentile clipping
-+ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995),
-                      optim_bits=32, percentile_clipping=5)
++ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=32, percentile_clipping=5)
+- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
 ```
 
-# How to override config hyperparameters for particular weights/parameters
+### How to override config hyperparameters for particular weights/parameters
 
 If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things:
 
@@ -132,7 +132,7 @@ class MyModule(torch.nn.Module):
 
 ```
 
-# API Docs
+## API Docs
 
 ... under construction ...
 
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index 038295707..456f2740b 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -2,11 +2,14 @@
 
 Below you will find the docstring of the quantization primitives exposed in bitsandbytes.
 
+## Linear4bit (QLoRA)
+
+[[autodoc]] bitsandbytes.nn.Linear4bit
+
 ## Linear8bitLt
 
 [[autodoc]] bitsandbytes.nn.Linear8bitLt
 
- 
-## Linear4bit (QLoRA)
+## StableEmbedding
 
-[[autodoc]] bitsandbytes.nn.Linear4bit
\ No newline at end of file
+[[autodoc]] bitsandbytes.nn.StableEmbedding
\ No newline at end of file

From daff94c0428892d341d0203eb7c32233f5358985 Mon Sep 17 00:00:00 2001
From: younesbelkada <younesbelkada@gmail.com>
Date: Fri, 2 Feb 2024 06:56:04 +0000
Subject: [PATCH 12/30] add more methods

---
 docs/source/quantization.mdx | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index 456f2740b..8fbff809f 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -5,10 +5,12 @@ Below you will find the docstring of the quantization primitives exposed in bits
 ## Linear4bit (QLoRA)
 
 [[autodoc]] bitsandbytes.nn.Linear4bit
+    - __init__
 
 ## Linear8bitLt
 
 [[autodoc]] bitsandbytes.nn.Linear8bitLt
+    - __init__
 
 ## StableEmbedding
 

From 301ee803d291aa265ae0a765a84adfbd778ed030 Mon Sep 17 00:00:00 2001
From: younesbelkada <younesbelkada@gmail.com>
Date: Fri, 2 Feb 2024 07:01:19 +0000
Subject: [PATCH 13/30] fix

---
 bitsandbytes/nn/modules.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
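
Note on why the call order matters: Python only treats a string literal as a docstring when it is the first statement in the function body. With `super().__init__(...)` placed ahead of it, the string is an ordinary expression that gets discarded, which is presumably why the `__init__` docs added in the previous patch would not have been picked up by autodoc. A minimal sketch of that behaviour, using two hypothetical functions rather than the real `Linear4bit` / `Linear8bitLt` classes:

```python
def before_fix():
    setup_done = True  # any statement placed ahead of the string literal...
    """Initialize the class."""  # ...turns it into a discarded expression
    return setup_done


def after_fix():
    """Initialize the class."""
    setup_done = True  # the real work now follows the docstring
    return setup_done


# Only the second function exposes its documentation via __doc__.
print(before_fix.__doc__)
print(after_fix.__doc__)
```

The runtime behaviour of both functions is the same; only the value of `__doc__` differs, and that is what documentation tools read.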

diff --git a/bitsandbytes/nn/modules.py b/bitsandbytes/nn/modules.py
index c2c19344d..4cd2da153 100644
--- a/bitsandbytes/nn/modules.py
+++ b/bitsandbytes/nn/modules.py
@@ -257,7 +257,6 @@ class Linear4bit(nn.Linear):
     ```
     """
     def __init__(self, input_features, output_features, bias=True, compute_dtype=None, compress_statistics=True, quant_type='fp4', quant_storage=torch.uint8, device=None):
-        super().__init__(input_features, output_features, bias, device)
         """
         Initialize Linear4bit class.
 
@@ -269,6 +268,7 @@ def __init__(self, input_features, output_features, bias=True, compute_dtype=Non
             bias (`bool`, defaults to `True`):
                 Whether the linear class uses the bias term as well.
         """
+        super().__init__(input_features, output_features, bias, device)
         self.weight = Params4bit(self.weight.data, requires_grad=False, compress_statistics=compress_statistics, quant_type=quant_type, quant_storage=quant_storage, module=self)
         # self.persistent_buffers = []  # TODO consider as way to save quant state
         self.compute_dtype = compute_dtype
@@ -473,7 +473,6 @@ class Linear8bitLt(nn.Linear):
     """
     def __init__(self, input_features, output_features, bias=True, has_fp16_weights=True,
                        memory_efficient_backward=False, threshold=0.0, index=None, device=None):
-        super().__init__(input_features, output_features, bias, device)
         """
         Initialize Linear8bitLt class.
 
@@ -484,7 +483,8 @@ def __init__(self, input_features, output_features, bias=True, has_fp16_weights=
                 Number of output features of the linear layer.
             bias (`bool`, defaults to `True`):
                 Whether the linear class uses the bias term as well.
-        """ 
+        """
+        super().__init__(input_features, output_features, bias, device)
         assert not memory_efficient_backward, "memory_efficient_backward is no longer required and the argument is deprecated in 0.37.0 and will be removed in 0.39.0"
         self.state = bnb.MatmulLtState()
         self.index = index

From 683a72ba042c3ed441f2bbab33ed0501c85e6605 Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Fri, 2 Feb 2024 13:55:04 -0800
Subject: [PATCH 14/30] further docs updates

---
 compile_from_source.md                        | 39 ---------
 docs/source/_toctree.yml                      | 10 ++-
 docs/source/algorithms.mdx                    | 12 +++
 docs/source/compiling.mdx                     | 41 +++++++++
 docs/source/contributing.mdx                  |  6 +-
 .../source/errors.mdx                         | 16 ++--
 docs/source/installation.mdx                  |  8 ++
 docs/source/integrations.mdx                  |  8 ++
 docs/source/introduction.mdx                  | 84 +------------------
 docs/source/moduletree.mdx                    |  5 --
 .../source/nonpytorchcuda.mdx                 |  8 +-
 docs/source/optimizers.mdx                    | 50 ++++++++---
 12 files changed, 135 insertions(+), 152 deletions(-)
 delete mode 100644 compile_from_source.md
 create mode 100644 docs/source/algorithms.mdx
 create mode 100644 docs/source/compiling.mdx
 rename errors_and_solutions.md => docs/source/errors.mdx (57%)
 delete mode 100644 docs/source/moduletree.mdx
 rename how_to_use_nonpytorch_cuda.md => docs/source/nonpytorchcuda.mdx (76%)
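
Note on the `min_8bit_size` paragraph that this patch moves into `docs/source/optimizers.mdx`: the rule it describes can be sketched as a plain element-count threshold. `uses_8bit_state` below is a hypothetical helper written for illustration, not a bitsandbytes function, and it assumes the comparison is exactly "fewer than `min_8bit_size` elements stays in 32-bit", as the docs state.

```python
def uses_8bit_state(numel: int, min_8bit_size: int = 4096) -> bool:
    """Return True if a tensor with `numel` elements would get 8-bit optimizer state.

    Illustrative only: mirrors the documented rule, not the library's code.
    """
    return numel >= min_8bit_size


# Hypothetical parameter sizes for a small language model.
param_sizes = {"bias": 1024, "layer_norm.weight": 4095, "embedding.weight": 50_000 * 768}

for name, numel in param_sizes.items():
    precision = "8-bit" if uses_8bit_state(numel) else "32-bit"
    print(f"{name}: {numel} elements -> {precision} optimizer state")
```

Raising the threshold, as in the `min_8bit_size=16384` example from the docs, keeps more of the small tensors in 32-bit.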

diff --git a/compile_from_source.md b/compile_from_source.md
deleted file mode 100644
index 6310fd6c6..000000000
--- a/compile_from_source.md
+++ /dev/null
@@ -1,39 +0,0 @@
-# Compiling from source
-
-Basic steps.
-1. `CUDA_VERSION=XXX make [target]` where `[target]` is among `cuda92, cuda10x, cuda110, cuda11x, cuda12x, cpuonly`
-2. `python setup.py install`
-
-To run these steps you will need to have the nvcc compiler installed that comes with a CUDA installation. If you use anaconda (recommended) then you can figure out which version of CUDA you are using with PyTorch via the command `conda list | grep cudatoolkit`. Then you can install the nvcc compiler by downloading and installing the same CUDA version from the [CUDA toolkit archive](https://developer.nvidia.com/cuda-toolkit-archive).
-
-You can install CUDA locally without sudo by following the following steps:
-
-```bash
-wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/install_cuda.sh
-# Syntax cuda_install CUDA_VERSION INSTALL_PREFIX EXPORT_TO_BASH
-#   CUDA_VERSION in {110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121, 122}
-#   EXPORT_TO_BASH in {0, 1} with 0=False and 1=True
-
-# For example, the following installs CUDA 11.7 to ~/local/cuda-11.7 and exports the path to your .bashrc
-bash install_cuda.sh 117 ~/local 1
-```
-
-By default, the Makefile will look at your `CUDA_HOME` environmental variable to find your CUDA version for compiling the library. If this path is not set it is inferred from the path of your `nvcc` compiler.
-
-Either `nvcc` needs to be in path for the `CUDA_HOME` variable needs to be set to the CUDA directory root (e.g. `/usr/local/cuda`) in order for compilation to succeed
-
-If you type `nvcc` and it cannot be found, you might need to add to your path or set the CUDA_HOME variable. You can run `python -m bitsandbytes` to find the path to CUDA. For example if `python -m bitsandbytes` shows you the following:
-```
-++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
-/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudart.so
-```
-You can set `CUDA_HOME` to `/usr/local/cuda-11.7`. For example, you might be able to compile like this.
-
-``CUDA_HOME=~/local/cuda-11.7 CUDA_VERSION=117 make cuda11x``
-
-
-If you have problems compiling the library with these instructions from source, please open an issue.
-
-## Compilation with Kepler
-
-Since 0.39.1 bitsandbytes installed via pip no longer provides Kepler binaries and these need to be compiled from source. Follow the steps above and instead of `cuda11x_nomatmul` etc use `cuda11x_nomatmul_kepler`
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 182d10900..9a8628786 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -16,10 +16,18 @@
     title: Optimizers
   - local: integrations
     title: Integrations
+  - local: algorithms
+    title: Algorithms
 - title: Support & Learning
   sections:
   - local: resources
-    title: Papers, related resources & how to cite
+    title: Papers, resources & how to cite
+  - local: errors
+    title: Errors & Solutions
+  - local: nonpytorchcuda
+    title: Non-PyTorch CUDA
+  - local: compiling
+    title: Compilation from Source (extended)
   - local: faqs
     title: FAQs (Frequently Asked Questions)
 - title: Contributors Guidelines
diff --git a/docs/source/algorithms.mdx b/docs/source/algorithms.mdx
new file mode 100644
index 000000000..558b5673e
--- /dev/null
+++ b/docs/source/algorithms.mdx
@@ -0,0 +1,12 @@
+# Other algorithms
+_WIP: Still incomplete... Community contributions would be greatly welcome!_
+
+This is an overview of the algorithms in `bitsandbytes` that we think would also be useful as standalone entities.
+
+## Using Int8 Matrix Multiplication
+
+For straight Int8 matrix multiplication with mixed precision decomposition you can use ``bnb.matmul(...)``. To enable mixed precision decomposition, use the threshold parameter:
+
+```py
+bnb.matmul(..., threshold=6.0)
+```
diff --git a/docs/source/compiling.mdx b/docs/source/compiling.mdx
new file mode 100644
index 000000000..fc8c58769
--- /dev/null
+++ b/docs/source/compiling.mdx
@@ -0,0 +1,41 @@
+# Compiling from Source[[compiling]]
+
+To compile from source, the CUDA Toolkit is required. Ensure `nvcc` is installed; if not, follow these steps to install it along with the CUDA Toolkit:
+
+```bash
+wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/install_cuda.sh
+# Use the following syntax: cuda_install CUDA_VERSION INSTALL_PREFIX EXPORT_TO_BASH
+#   CUDA_VERSION options include 110 to 122
+#   EXPORT_TO_BASH: 0 for False, 1 for True
+
+# Example for installing CUDA 11.7 at ~/local/cuda-11.7 and exporting the path to .bashrc:
+bash install_cuda.sh 117 ~/local 1
+```
+
+For a single compile run with a specific CUDA version, set `CUDA_HOME` to point to your CUDA installation directory. For instance, to compile using CUDA 11.7 located at `~/local/cuda-11.7`, use:
+
+```
+CUDA_HOME=~/local/cuda-11.7 CUDA_VERSION=117 make cuda11x
+```
+
+## General Compilation Steps
+
+1. Use `CUDA_VERSION=XXX make [target]` to compile, where `[target]` includes options like `cuda92`, `cuda10x`, `cuda11x`, and others.
+2. Install with `python setup.py install`.
+
+Ensure `nvcc` is available on your system. If you use Anaconda, find the CUDA version your PyTorch installation uses with `conda list | grep cudatoolkit`, then download the matching version from the [CUDA Toolkit Archive](https://developer.nvidia.com/cuda-toolkit-archive).
+
+To install CUDA locally without administrative rights:
+
+```bash
+wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/install_cuda.sh
+# Follow the same syntax and example as mentioned earlier
+```
+
+The compilation process relies on the `CUDA_HOME` environment variable to locate CUDA. If `CUDA_HOME` is unset, it will attempt to infer the location from `nvcc`. If `nvcc` is not in your path, you may need to add it or set `CUDA_HOME` manually. For example, if `python -m bitsandbytes` indicates your CUDA path as `/usr/local/cuda-11.7`, you can set `CUDA_HOME` to this path.
+
+If you run into problems compiling the library from source, please open an issue.
+
+## Compilation for Kepler Architecture
+
+From version 0.39.1, bitsandbytes no longer includes Kepler binaries in pip installations, requiring manual compilation. Follow the general steps and use `cuda11x_nomatmul_kepler` for Kepler-targeted compilation.
diff --git a/docs/source/contributing.mdx b/docs/source/contributing.mdx
index 18ad66b4c..a9d915ef7 100644
--- a/docs/source/contributing.mdx
+++ b/docs/source/contributing.mdx
@@ -1,13 +1,17 @@
 # Contributors guidelines
 ... stil under construction ... (feel free to propose materials, `bitsandbytes` is a community project)
 
-# Setup pre-commit hooks
+## Setup pre-commit hooks
 - Install pre-commit hooks with `pip install pre-commit`.
 - Run `pre-commit autoupdate` once to configure the hooks.
 - Re-run `pre-commit autoupdate` every time a new hook got added.
 
 Now all the pre-commit hooks will be automatically run when you try to commit and if they introduce some changes, you need to re-add the changed files before being able to commit and push.
 
+## Doc-string syntax
+
+TODO: Add description + reference of HF docstring best practices.
+
 ## Documentation
 - [guideline for documentation syntax](https://github.com/huggingface/doc-builder#readme)
 - images shall be uploaded via PR in the `bitsandbytes/` directory [here](https://huggingface.co/datasets/huggingface/documentation-images)
diff --git a/errors_and_solutions.md b/docs/source/errors.mdx
similarity index 57%
rename from errors_and_solutions.md
rename to docs/source/errors.mdx
index 5b8cbcdd5..68fb7f938 100644
--- a/errors_and_solutions.md
+++ b/docs/source/errors.mdx
@@ -1,21 +1,25 @@
-# No kernel image available
+# Errors & Solutions
 
-This problem arises with the cuda version loaded by bitsandbytes is not supported by your GPU, or if you pytorch CUDA version mismatches. To solve this problem you need to debug ``$LD_LIBRARY_PATH``, ``$CUDA_HOME``, ``$PATH``. You can print these via ``echo $PATH``. You should look for multiple paths to different CUDA versions. This can include versions in your anaconda path, for example ``$HOME/anaconda3/lib``. You can check those versions via ``ls -l $HOME/anaconda3/lib/*cuda*`` or equivalent paths. Look at the CUDA versions of files in these paths. Does it match with ``nvidia-smi``?
+## No kernel image available
 
-If you are feeling lucky, you can also try to compile the library from source. This can be still problematic if your PATH variables have multiple cuda versions. As such, it is recommended to figure out path conflicts before you proceed with compilation.
+This problem arises when the CUDA version loaded by bitsandbytes is not supported by your GPU, or when it does not match your PyTorch CUDA version.
+
+To solve this problem you need to debug ``$LD_LIBRARY_PATH``, ``$CUDA_HOME``, ``$PATH``. You can print these via ``echo $PATH``. You should look for multiple paths to different CUDA versions. This can include versions in your anaconda path, for example ``$HOME/anaconda3/lib``. You can check those versions via ``ls -l $HOME/anaconda3/lib/*cuda*`` or equivalent paths. Look at the CUDA versions of files in these paths. Does it match with ``nvidia-smi``?
 
+If you are feeling lucky, you can also try to compile the library from source. This can still be problematic if your path variables point to multiple CUDA versions, so it is recommended to resolve path conflicts before you proceed with compilation.
 
 __If you encounter any other error not listed here please create an issue. This will help resolve your problem and will help out others in the future.
 
 
-# fatbinwrap
+## fatbinwrap
+
+This error occurs if there is a mismatch between CUDA versions in the C++ library and the CUDA part. Make sure you have the right CUDA version in your `$PATH` and `$LD_LIBRARY_PATH` variables. In the conda base environment you can find the library under:
 
-This error occurs if there is a mismatch between CUDA versions in the C++ library and the CUDA part. Make sure you have right CUDA in your $PATH and $LD_LIBRARY_PATH variable. In the conda base environment you can find the library under:
 ```bash
 ls $CONDA_PREFIX/lib/*cudart*
 ```
 Make sure this path is appended to the `LD_LIBRARY_PATH` so bnb can find the CUDA runtime environment library (cudart).
 
-If this does not fix the issue, please try [compilation from source](compile_from_source.md) next.
+If this does not fix the issue, please try compilation from source next.
 
 If this does not work, please open an issue and paste the printed environment if you call `make` and the associated error when running bnb.
diff --git a/docs/source/installation.mdx b/docs/source/installation.mdx
index d28431f77..e6a7b851e 100644
--- a/docs/source/installation.mdx
+++ b/docs/source/installation.mdx
@@ -5,6 +5,12 @@ Note currently `bitsandbytes` is only supported on CUDA GPU hardwares, support f
 <hfoptions id="OS system">
 <hfoption id="Linux">
 
+## Hardware requirements:
+ - LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
+ - 8-bit optimizers and quantization: NVIDIA Kepler GPU or newer (>=GTX 78X).
+
+Supported CUDA versions: 10.2 - 12.0  #TODO: check currently supported versions
+
 ## Linux
 
 ### From Pypi
@@ -23,6 +29,8 @@ python setup.py install
 
 with `XXX` being your CUDA version, for <12.0 call `make cuda 11x`. Note support for non-CUDA GPUs (e.g. AMD, Intel), is also coming soon.
 
+For a more detailed guide, head to the [dedicated page on the topic](#compiling)
+
 </hfoption>
 <hfoption id="Windows">
 
diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index ba3ee218f..5cb8bc91e 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -9,3 +9,11 @@
 # Trainer for the optimizers
 
 ... TODO: to be filled out ...
+
+Here we point to the relevant doc sections in transformers / peft / Trainer and very briefly explain how these are integrated:
+e.g. for transformers, state that you can load any model in 8-bit / 4-bit precision; for PEFT, that you can use QLoRA out of the box with `LoraConfig` + a 4-bit base model; for Trainer, that all bnb optimizers are supported by passing the correct string in `TrainingArguments`: https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/training_args.py#L134
+
+A few references:
+
+- transformers: https://huggingface.co/docs/transformers/quantization#bitsandbytes
+- PEFT: https://huggingface.co/docs/peft/developer_guides/quantization
diff --git a/docs/source/introduction.mdx b/docs/source/introduction.mdx
index bc5c7a6d0..e1f55c5eb 100644
--- a/docs/source/introduction.mdx
+++ b/docs/source/introduction.mdx
@@ -7,46 +7,11 @@ The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom fu
 There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
 The library includes quantization primitives for 8-bit & 4-bit operations, through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit` and 8bit optimizers through `bitsandbytes.optim` module.
 
-**Using 8-bit optimizers**:
-
-1. Comment out optimizer: ``#torch.optim.Adam(....)``
-2. Add 8-bit optimizer of your choice ``bnb.optim.Adam8bit(....)`` (arguments stay the same)
-3. Replace embedding layer if necessary: ``torch.nn.Embedding(..) -> bnb.nn.Embedding(..)``
-
-
-**Using 8-bit Inference**:
-1. Comment out torch.nn.Linear: ``#linear = torch.nn.Linear(...)``
-2. Add bnb 8-bit linear light module: ``linear = bnb.nn.Linear8bitLt(...)`` (base arguments stay the same)
-3. There are two modes:
-   - Mixed 8-bit training with 16-bit main weights. Pass the argument ``has_fp16_weights=True`` (default)
-   - Int8 inference. Pass the argument ``has_fp16_weights=False``
-4. To use the full LLM.int8() method, use the ``threshold=k`` argument. We recommend ``k=6.0``.
-```python
-# LLM.int8()
-linear = bnb.nn.Linear8bitLt(dim1, dim2, bias=True, has_fp16_weights=False, threshold=6.0)
-# inputs need to be fp16
-out = linear(x.to(torch.float16))
-```
-
-
-## Features
-
-- 8-bit Matrix multiplication with mixed precision decomposition
-- LLM.int8() inference
-- 8-bit Optimizers: Adam, AdamW, RMSProp, LARS, LAMB, Lion (saves 75% memory)
-- Stable Embedding Layer: Improved stability through better initialization, and normalization
-- 8-bit quantization: Quantile, Linear, and Dynamic quantization
-- Fast quantile estimation: Up to 100x faster than other algorithms
-
 ## Requirements & Installation
 
 Requirements: anaconda, cudatoolkit, pytorch
 
-Hardware requirements:
- - LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
- - 8-bit optimizers and quantization: NVIDIA Kepler GPU or newer (>=GTX 78X).
 
-Supported CUDA versions: 10.2 - 12.0
 
 The bitsandbytes library is currently only supported on Linux distributions. Windows is not supported at the moment.
 
@@ -55,36 +20,10 @@ The requirements can best be fulfilled by installing pytorch via anaconda. You c
 
 ## Using bitsandbytes
 
-### Using Int8 Matrix Multiplication
-
-For straight Int8 matrix multiplication with mixed precision decomposition you can use ``bnb.matmul(...)``. To enable mixed precision decomposition, use the threshold parameter:
-```python
-bnb.matmul(..., threshold=6.0)
-```
+###
 
 For instructions how to use LLM.int8() inference layers in your own code, see the TL;DR above or for extended instruction see [this blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).
 
-### Using the 8-bit Optimizers
-
-With bitsandbytes 8-bit optimizers can be used by changing a single line of code in your codebase. For NLP models we recommend also to use the StableEmbedding layers (see below) which improves results and helps with stable 8-bit optimization.  To get started with 8-bit optimizers, it is sufficient to replace your old optimizer with the 8-bit optimizer in the following way:
-```python
-import bitsandbytes as bnb
-
-# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
-adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
-adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent
-
-
-torch.nn.Embedding(...) ->  bnb.nn.StableEmbedding(...) # recommended for NLP models
-```
-
-Note that by default all parameter tensors with less than 4096 elements are kept at 32-bit even if you initialize those parameters with 8-bit optimizers. This is done since such small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm). You can change this behavior like so:
-```python
-# parameter tensors with less than 16384 values are optimized in 32-bit
-# it is recommended to use multiplies of 4096
-adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
-```
-
 ### Change Bits and other Hyperparameters for Individual Parameters
 
 If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things: (1) register the parameter while they are still on the CPU, (2) override the config with the new desired hyperparameters (anytime, anywhere). See our [guide](howto_config_override.md) for more details
@@ -97,29 +36,8 @@ To use the Stable Embedding Layer, override the respective `build_embedding(...)
 
 For upcoming features and changes and full history see [Patch Notes](CHANGELOG.md).
 
-## Errors
-
-1. RuntimeError: CUDA error: no kernel image is available for execution on the device. [Solution](errors_and_solutions.md#No-kernel-image-available)
-2. __fatbinwrap_.. [Solution](errors_and_solutions.md#fatbinwrap_)
-
-## Compile from source
-To compile from source, you need an installation of CUDA. If `nvcc` is not installed, you can install the CUDA Toolkit with nvcc through the following commands.
-
-```bash
-wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/install_cuda.sh
-# Syntax cuda_install CUDA_VERSION INSTALL_PREFIX EXPORT_TO_BASH
-#   CUDA_VERSION in {110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121, 122}
-#   EXPORT_TO_BASH in {0, 1} with 0=False and 1=True
-
-# For example, the following installs CUDA 11.7 to ~/local/cuda-11.7 and exports the path to your .bashrc
-bash install_cuda.sh 117 ~/local 1
-```
-
-To use a specific CUDA version just for a single compile run, you can set the variable `CUDA_HOME`, for example the following command compiles `libbitsandbytes_cuda117.so` using compiler flags for cuda11x with the cuda version at `~/local/cuda-11.7`:
 
-``CUDA_HOME=~/local/cuda-11.7 CUDA_VERSION=117 make cuda11x``
 
-For more detailed instruction, please follow the [compile_from_source.md](compile_from_source.md) instructions.
 
 ## License
 
diff --git a/docs/source/moduletree.mdx b/docs/source/moduletree.mdx
deleted file mode 100644
index d117f90c0..000000000
--- a/docs/source/moduletree.mdx
+++ /dev/null
@@ -1,5 +0,0 @@
-# Module tree overview
-
-- **bitsandbytes.functional**: Contains quantization functions (4-bit & 8-bit) and stateless 8-bit optimizer update functions.
-- **bitsandbytes.nn.modules**: Contains stable embedding layer with automatic 32-bit optimizer overrides (important for NLP stability)
-- **bitsandbytes.optim**: Contains 8-bit optimizers.
diff --git a/how_to_use_nonpytorch_cuda.md b/docs/source/nonpytorchcuda.mdx
similarity index 76%
rename from how_to_use_nonpytorch_cuda.md
rename to docs/source/nonpytorchcuda.mdx
index 566b0170e..099a6961b 100644
--- a/how_to_use_nonpytorch_cuda.md
+++ b/docs/source/nonpytorchcuda.mdx
@@ -1,6 +1,6 @@
-## How to use a CUDA version that is different from PyTorch
+# How to use a CUDA version that is different from PyTorch
 
-Some features of bitsandbytes may need a newer CUDA version than regularly supported by PyTorch binaries from conda / pip. In that case you can use the following instructions to load a precompiled bitsandbytes binary that works for you.
+Some features of `bitsandbytes` may need a newer CUDA version than regularly supported by PyTorch binaries from conda / pip. In that case you can use the following instructions to load a precompiled `bitsandbytes` binary that works for you.
 
 ## Installing or determining the CUDA installation
 
@@ -12,7 +12,7 @@ Determine the path of the CUDA version that you want to use. Common paths paths
 
 where XX.X is the CUDA version number.
 
-You can also install CUDA version that you need locally with a script provided by bitsandbytes as follows:
+You can also install CUDA version that you need locally with a script provided by `bitsandbytes` as follows:
 
 ```bash
 wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/install_cuda.sh
@@ -25,7 +25,7 @@ wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/install_cud
 bash cuda_install.sh 117 ~/local 1
 ```
 
-## Setting the environmental variables BNB_CUDA_VERSION, and LD_LIBRARY_PATH
+## Setting the environmental variables `BNB_CUDA_VERSION`, and `LD_LIBRARY_PATH`
 
 To manually override the PyTorch installed CUDA version you need to set to variable, like so:
 
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index 9906423d2..8380ed861 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -1,4 +1,4 @@
-## Introduction: 8-bit optimizers
+# Introduction: 8-bit optimizers
 
 With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:
 
@@ -8,10 +8,10 @@ With 8-bit optimizers, larger models can be finetuned with the same GPU memory c
 
 8-bit optimizers are mostly useful to finetune large models that did not fit into memory before. They also make it easier to pretrain larger models and have great synergy with sharded data parallelism. 8-bit Adam, for example, is already used across multiple teams in Facebook. This optimizer saves a ton of memory at no accuracy hit.
 
-See here the biggest models
-
 We feature 8-bit Adam/AdamW, SGD momentum, LARS, LAMB, and RMSProp.
 
+## Using 8-bit optimizers
+
 It only requires a two-line code change to get started.
 ```py
 import bitsandbytes as bnb
@@ -26,17 +26,41 @@ bnb.nn.StableEmbedding(...)
 
 The arguments passed are the same as standard Adam. For NLP models we recommend also to use the StableEmbedding layers which improves results and helps with stable 8-bit optimization.
 
+Note that by default all parameter tensors with less than 4096 elements are kept at 32-bit even if you initialize those parameters with 8-bit optimizers. This is done since such small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm). You can change this behavior like so:
+
+```py
+# parameter tensors with fewer than 16384 values are optimized in 32-bit
+# it is recommended to use multiples of 4096
+adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
+```
+
 ## Overview of supported 8-bit optimizers
 
-TOOD: List here all optimizers in `bitsandbytes/optim/__init__.py`
-TODO (future) document all optimizers as we did for Linear4bit / Linear8bitLt classes
+Currently, `bitsandbytes` supports the following optimizers:
 
-## Overview of expected gradients
+- `Adagrad`, `Adagrad8bit`, `Adagrad32bit`
+- `Adam`, `Adam8bit`, `Adam32bit`, `PagedAdam`, `PagedAdam8bit`, `PagedAdam32bit`
+- `AdamW`, `AdamW8bit`, `AdamW32bit`, `PagedAdamW`, `PagedAdamW8bit`, `PagedAdamW32bit`
+- `LAMB`, `LAMB8bit`, `LAMB32bit`
+- `LARS`, `LARS8bit`, `LARS32bit`, `PytorchLARS`
+- `Lion`, `Lion8bit`, `Lion32bit`, `PagedLion`, `PagedLion8bit`, `PagedLion32bit`
+- `RMSprop`, `RMSprop8bit`, `RMSprop32bit`
+- `SGD`, `SGD8bit`, `SGD32bit`
+
+Additionally, there's `GlobalOptimManager`, which is explained [below](#optim_manager).
+
+Find the API docs [here](#optim_api_docs). (still under construction)
+
+
+## Overview of expected gains
 
 <div style="text-align: center">
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bitsandbytes/optimizer_comparison.png", width="50%">
 </div>
 
+
+The following figure gives an overview of the largest models that can be trained, depending on the optimizer used:
+
 <div style="text-align: center">
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bitsandbytes/optimizer_largest_model.png", width="50%">
 </div>
@@ -47,9 +71,9 @@ Stateful optimizers maintain gradient statistics over time, e.g. the exponential
 
 To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:
 
-1- **Block-wise quantization** divides input tensors into smaller blocks that are independently quantized, therein isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization.
-2- **dynamic quantization**, which quantizes both small and large values with high precision,
-3- a **stable embedding layer** improves stability during optimization for models with word embeddings.
+1. **Block-wise quantization** divides input tensors into smaller blocks that are independently quantized, thereby isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization.
+2. **Dynamic quantization** quantizes both small and large values with high precision.
+3. A **stable embedding layer** improves stability during optimization for models with word embeddings.
 
 With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.
 
@@ -87,12 +111,12 @@ import bitsandbytes as bnb
 - adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
 ```
 
-### How to override config hyperparameters for particular weights/parameters
+### How to override config hyperparameters for particular weights/parameters[[optim_manager]]
 
 If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things:
 
-1) Register the parameter while they are still on the CPU,
-2) override the config with the new desired hyperparameters (anytime, anywhere)
+1. Register the parameters while they are still on the CPU.
+2. Override the config with the new desired hyperparameters (anytime, anywhere).
 
 For global overrides in many different places in your code you can do:
 
@@ -132,7 +156,7 @@ class MyModule(torch.nn.Module):
 
 ```
 
-## API Docs
+## API Docs[[optim_api_docs]]
 
 ... under construction ...
 

From 60a7699708b95160329a1150999a77f2052cb075 Mon Sep 17 00:00:00 2001
From: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date: Sat, 3 Feb 2024 00:30:52 +0100
Subject: [PATCH 15/30] Update _toctree.yml

---
 docs/source/_toctree.yml | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 9a8628786..21f79e288 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -6,8 +6,6 @@
     title: Quickstart
   - local: installation
     title: Installation
-  - local: moduletree
-    title: Module Tree
 - title: Features & Integrations
   sections:
   - local: quantization
@@ -34,5 +32,3 @@
   sections:
   - local: contributing
     title: Contributing
-  # - local: code_of_conduct
-  #   title: Code of Conduct

From 543a7b1f5a436f40cbce2deb9b55070a10c3ed7d Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sat, 3 Feb 2024 14:39:05 -0800
Subject: [PATCH 16/30] fix link

---
 docs/source/installation.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/installation.mdx b/docs/source/installation.mdx
index e6a7b851e..fc559471d 100644
--- a/docs/source/installation.mdx
+++ b/docs/source/installation.mdx
@@ -29,7 +29,7 @@ python setup.py install
 
 with `XXX` being your CUDA version, for <12.0 call `make cuda 11x`. Note support for non-CUDA GPUs (e.g. AMD, Intel), is also coming soon.
 
-For a more detailed guide, head to the [dedicated page on the topic](#compiling)
+For a more detailed guide, head to the [dedicated page on the topic](./compiling)
 
 </hfoption>
 <hfoption id="Windows">

From 2d73f4d350ad7490c7bb5d0b5f64e25a69591819 Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sat, 3 Feb 2024 14:45:03 -0800
Subject: [PATCH 17/30] run pre-commit hooks

---
 .pre-commit-config.yaml      | 2 +-
 bitsandbytes/nn/modules.py   | 8 ++++----
 docs/source/quantization.mdx | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 039139b95..feb6c766e 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.1.15
+    rev: v0.2.0
     hooks:
       - id: ruff
         args:
diff --git a/bitsandbytes/nn/modules.py b/bitsandbytes/nn/modules.py
index 4cd2da153..3b729b8da 100644
--- a/bitsandbytes/nn/modules.py
+++ b/bitsandbytes/nn/modules.py
@@ -226,11 +226,11 @@ def to(self, *args, **kwargs):
 
 class Linear4bit(nn.Linear):
     """
-    This class is the base module for the 4-bit quantization algorithm presented in [QLoRA](https://arxiv.org/abs/2305.14314). 
+    This class is the base module for the 4-bit quantization algorithm presented in [QLoRA](https://arxiv.org/abs/2305.14314).
     QLoRA 4-bit linear layers uses blockwise k-bit quantization under the hood, with the possibility of selecting various
     compute datatypes such as FP4 and NF4.
 
-    In order to quantize a linear layer one should first load the original fp16 / bf16 weights into 
+    In order to quantize a linear layer one should first load the original fp16 / bf16 weights into
     the Linear8bitLt module, then call `quantized_module.to("cuda")` to quantize the fp16 / bf16 weights.
 
     Example:
@@ -442,10 +442,10 @@ def maybe_rearrange_weight(state_dict, prefix, local_metadata, strict, missing_k
 
 class Linear8bitLt(nn.Linear):
     """
-    This class is the base module for the [LLM.int8()](https://arxiv.org/abs/2208.07339) algorithm. 
+    This class is the base module for the [LLM.int8()](https://arxiv.org/abs/2208.07339) algorithm.
     To read more about it, have a look at the paper.
 
-    In order to quantize a linear layer one should first load the original fp16 / bf16 weights into 
+    In order to quantize a linear layer one should first load the original fp16 / bf16 weights into
     the Linear8bitLt module, then call `int8_module.to("cuda")` to quantize the fp16 weights.
 
     Example:
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index 8fbff809f..e106c4401 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -14,4 +14,4 @@ Below you will find the docstring of the quantization primitives exposed in bits
 
 ## StableEmbedding
 
-[[autodoc]] bitsandbytes.nn.StableEmbedding
\ No newline at end of file
+[[autodoc]] bitsandbytes.nn.StableEmbedding

From 8f0fd8a620aab19d28b4b46cecc4f4f65082dae3 Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sat, 3 Feb 2024 16:47:52 -0800
Subject: [PATCH 18/30] refactor + further docs

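Note on the paged-optimizer overhead estimate that this patch adds to
docs/source/optimizers.mdx: the arithmetic in that paragraph can be sketched
as a small helper. This is only an illustration under the assumptions stated
in the docs (16 GB/s for PCIe 3.0 with 16 lanes, about 50% of that reachable
in the best case). The function name and its parameters are hypothetical and
are not part of the `bitsandbytes` API.

```python
def paging_overhead_ms(evicted_gb: float,
                       pcie_gb_per_s: float = 16.0,
                       efficiency: float = 0.5) -> float:
    """Estimate the per-step paging overhead in milliseconds.

    evicted_gb:    memory evicted per forward-backward-optimizer loop (GB)
    pcie_gb_per_s: nominal PCIe bandwidth (assumed 16 GB/s, PCIe 3.0 x16)
    efficiency:    fraction of the nominal bandwidth actually reached
                   (assumed 0.5, the best case quoted in the docs)
    """
    if evicted_gb < 0:
        raise ValueError("evicted_gb must be non-negative")
    if pcie_gb_per_s <= 0 or not (0 < efficiency <= 1):
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    effective_gb_per_s = pcie_gb_per_s * efficiency
    # seconds = GB / (GB/s); convert to milliseconds
    return evicted_gb / effective_gb_per_s * 1000.0


# The example from the docs: 1 GB evicted per step
print(paging_overhead_ms(1.0))  # 125.0
```

If nothing is evicted, the estimate is zero, which matches the statement that
paging adds no overhead while everything fits into GPU memory.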
---
 bitsandbytes/nn/modules.py   | 46 +++++++++++++++++++++++++++++++++++-
 docs/source/integrations.mdx |  8 +++++++
 docs/source/introduction.mdx | 37 ++---------------------------
 docs/source/optimizers.mdx   | 41 +++++++++++++++++++++++++++-----
 docs/source/quantization.mdx |  4 ----
 5 files changed, 90 insertions(+), 46 deletions(-)

diff --git a/bitsandbytes/nn/modules.py b/bitsandbytes/nn/modules.py
index 3b729b8da..6eeecc273 100644
--- a/bitsandbytes/nn/modules.py
+++ b/bitsandbytes/nn/modules.py
@@ -20,7 +20,40 @@
 
 class StableEmbedding(torch.nn.Embedding):
     """
-    TODO: @titus fill this with some info
+    Custom embedding layer designed for stable training in NLP tasks. The stable
+    embedding layer improves stability during optimization for models with word
+    embeddings, addressing issues related to the non-uniform distribution of input
+    tokens.
+
+    This stable embedding layer is initialized with Xavier uniform initialization,
+    followed by layer normalization. It is designed to support aggressive quantization,
+    addressing extreme gradient variations in non-uniform input distributions. The
+    stability of training is enhanced by using 32-bit optimizer states specifically
+    for this layer.
+
+    Example:
+
+    ```
+    # Initialize StableEmbedding layer with vocabulary size 1000, embedding dimension 300
+    embedding_layer = StableEmbedding(num_embeddings=1000, embedding_dim=300)
+
+    # Reset embedding parameters
+    embedding_layer.reset_parameters()
+
+    # Perform a forward pass with input tensor
+    input_tensor = torch.tensor([1, 2, 3])
+    output_embedding = embedding_layer(input_tensor)
+    ```
+
+    Attributes:
+        norm (torch.nn.LayerNorm): Layer normalization applied after the embedding.
+
+    Methods:
+        reset_parameters(): Reset embedding parameters using Xavier uniform initialization.
+        forward(input: Tensor) -> Tensor: Forward pass through the stable embedding layer.
+
+    Reference:
+        - [8-bit optimizer paper](https://arxiv.org/pdf/2110.02861.pdf)
     """
     def __init__(
         self,
@@ -35,6 +68,17 @@ def __init__(
         device=None,
         dtype=None,
     ) -> None:
+        """
+        Args:
+            num_embeddings (`int`): The number of unique embeddings (vocabulary size).
+            embedding_dim (`int`): The dimensionality of the embedding.
+            padding_idx (`Optional[int]`): If specified, the entries at `padding_idx` do not contribute to the gradient, so the embedding vector at that index is not updated during training.
+            max_norm (`Optional[float]`): If given, renormalizes embeddings to have a maximum L2 norm.
+            norm_type (`float`, defaults to `2.0`): The p-norm to compute for the max_norm option.
+            scale_grad_by_freq (`bool`): Scale gradient by frequency during backpropagation.
+            sparse (`bool`): If True, computes sparse gradients; False, computes dense gradients.
+            _weight (`Optional[Tensor]`): Pre-trained embeddings.
+        """
         super().__init__(
             num_embeddings,
             embedding_dim,
diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index 5cb8bc91e..a131ad105 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -17,3 +17,11 @@ Few references:
 
 - transformers: https://huggingface.co/docs/transformers/quantization#bitsandbytes
 - PEFT: https://huggingface.co/docs/peft/developer_guides/quantization
+
+# Blog posts
+
+- [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
+
+###
+
+For instructions how to use LLM.int8() inference layers in your own code, see the TL;DR above or for extended instruction see [this blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).
diff --git a/docs/source/introduction.mdx b/docs/source/introduction.mdx
index e1f55c5eb..c86623c98 100644
--- a/docs/source/introduction.mdx
+++ b/docs/source/introduction.mdx
@@ -1,43 +1,10 @@
-TODO: Many parts of this doc will still be redistributed among the new doc structure.
-
 # `bitsandbytes`
 
-The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
+The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 + 4-bit quantization functions.
 
-There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
 The library includes quantization primitives for 8-bit & 4-bit operations, through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit` and 8bit optimizers through `bitsandbytes.optim` module.
 
-## Requirements & Installation
-
-Requirements: anaconda, cudatoolkit, pytorch
-
-
-
-The bitsandbytes library is currently only supported on Linux distributions. Windows is not supported at the moment.
-
-The requirements can best be fulfilled by installing pytorch via anaconda. You can install PyTorch by following the ["Get Started"](https://pytorch.org/get-started/locally/) instructions on the official website.
-
-
-## Using bitsandbytes
-
-###
-
-For instructions how to use LLM.int8() inference layers in your own code, see the TL;DR above or for extended instruction see [this blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).
-
-### Change Bits and other Hyperparameters for Individual Parameters
-
-If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things: (1) register the parameter while they are still on the CPU, (2) override the config with the new desired hyperparameters (anytime, anywhere). See our [guide](howto_config_override.md) for more details
-
-### Fairseq Users
-
-To use the Stable Embedding Layer, override the respective `build_embedding(...)` function of your model. Make sure to also use the `--no-scale-embedding` flag to disable scaling of the word embedding layer (nor replaced with layer norm). You can use the optimizers by replacing the optimizer in the respective file (`adam.py` etc.).
-
-## Release and Feature History
-
-For upcoming features and changes and full history see [Patch Notes](CHANGELOG.md).
-
-
-
+There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
 
 ## License
 
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index 8380ed861..8199b73e1 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -8,9 +8,20 @@ With 8-bit optimizers, larger models can be finetuned with the same GPU memory c
 
 8-bit optimizers are mostly useful to finetune large models that did not fit into memory before. They also make it easier to pretrain larger models and have great synergy with sharded data parallelism. 8-bit Adam, for example, is already used across multiple teams in Facebook. This optimizer saves a ton of memory at no accuracy hit.
 
-We feature 8-bit Adam/AdamW, SGD momentum, LARS, LAMB, and RMSProp.
+Our 8-bit optimizers have three components:
+1. **block-wise quantization** isolates outliers and distributes the error more equally over all bits,
+2. **dynamic quantization** quantizes both small and large values with high precision,
+3. a **stable embedding layer** improves stability during optimization for models with word embeddings.
+
+With these components, performing an optimizer update with 8-bit states is straightforward, and on GPUs this makes 8-bit optimizers faster than regular 32-bit optimizers. [Further details below](#research-background).
+
+We feature 8-bit `Adagrad`, `Adam`, `AdamW`, `LAMB`, `LARS`, `Lion`, `RMSprop` and `SGD` (momentum).
+
+## Caveats
 
-## Using 8-bit optimizers
+8-bit optimizers reduce the memory footprint and accelerate optimization on a wide range of tasks. However, since 8-bit optimizers reduce only the part of the memory footprint that is proportional to the number of parameters, **models that use large amounts of activation memory, such as convolutional networks, benefit little from using 8-bit optimizers**. Thus, 8-bit optimizers are most beneficial for training or finetuning models with many parameters on highly memory-constrained GPUs.
+
+## Usage
 
 It only requires a two-line code change to get started.
 ```py
@@ -47,11 +58,10 @@ Currently, `bitsandbytes` supports the following optimizers:
 - `RMSprop`, `RMSprop8bit`, `RMSprop32bit`
 - `SGD`, `SGD8bit`, `SGD32bit`
 
-Additionally, there's `GlobalOptimManager`, which is explained [below](#optim_manager).
+Additionally, for cases in which you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`, which is explained [below](#optim_manager).
 
 Find the API docs [here](#optim_api_docs). (still under construction)
 
-
 ## Overview of expected gains
 
 <div style="text-align: center">
@@ -79,7 +89,7 @@ With these components, performing an optimizer update with 8-bit states is strai
 
 We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers.
 
-For more details, please refer to the paper [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861)
+For more details, please refer to the paper [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861).
 
 ## Stable Embedding Layer
 
@@ -96,6 +106,20 @@ The Stable Embedding Layer enhances the standard word embedding layer for improv
 - Designed to support more aggressive quantization strategies without compromising training stability.
 - Helps in achieving stable training outcomes, particularly important for models dealing with diverse and complex language data.
 
+## Paged Optimizers
+
+Paged optimizers are built on top of the [unified memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) feature of CUDA. PyTorch does not support this feature, so we added it to `bitsandbytes`.
+
+It works like regular CPU paging, which means that it only becomes active _if one runs out of GPU memory_. Only then will the memory be transferred, page-by-page, from GPU to CPU. The memory is mapped, meaning that pages are preallocated on the CPU, but they are not updated automatically. They are only updated if the memory is accessed, or a swapping operation is launched.
+
+The unified memory feature is less efficient than regular asynchronous memory transfers. This means one usually will not be able to get full PCIe memory bandwidth utilization. With a manual prefetch, transfer speeds can be high but still reach only about half of the full PCIe memory bandwidth or less (tested on PCIe 3.0 with 16 lanes).
+
+This all means performance depends highly on the particular use-case. If one evicts, say, 1 GB of memory per forward-backward-optimizer loop, one can expect to reach about 50% of the PCIe bandwidth in the best case. For PCIe 3.0 with 16 lanes, which runs at 16 GB/s, evicting 1 GB therefore adds `1/(16*0.5) = 1/8 s = 125 ms` of overhead per optimizer step. Other overheads can be estimated for the particular use-case given the PCIe interface, the number of lanes, and the memory that is evicted in each iteration.
+
+Compared to CPU offloading, this has the advantage that there is zero overhead if all the memory fits into the device and only some overhead if some of the memory needs to be evicted. For offloading, one would usually offload fixed parts of the model and need to offload and onload all of this memory with each iteration through the model (sometimes twice, for both the forward and the backward pass).
+
+[Find more details in this discussion](https://github.com/TimDettmers/bitsandbytes/issues/962).
+
 ## Usage
 
 Some more examples of how you can replace your old optimizer with the 8-bit optimizer:
@@ -160,4 +184,9 @@ class MyModule(torch.nn.Module):
 
 ... under construction ...
 
-Here we'll provide auto-generated API docs soon. Please feel free to contribute doc-strings for the respective optimizers, as `bitsandbytes` is a community effort.
+Here we'll provide further auto-generated API docs soon. Please feel free to contribute doc-strings for the respective optimizers, as `bitsandbytes` is a community effort.
+
+## StableEmbedding
+
+[[autodoc]] bitsandbytes.nn.StableEmbedding
+    - __init__
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index e106c4401..82045f039 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -11,7 +11,3 @@ Below you will find the docstring of the quantization primitives exposed in bits
 
 [[autodoc]] bitsandbytes.nn.Linear8bitLt
     - __init__
-
-## StableEmbedding
-
-[[autodoc]] bitsandbytes.nn.StableEmbedding

From a3c45d31b1091e4c12b6821d42dd0a90e4bbb053 Mon Sep 17 00:00:00 2001
From: Titus <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 10:29:51 -0300
Subject: [PATCH 19/30] Update README.md with new docs link

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 35a03dbcb..9abb282a8 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 The bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
 
 
-
+Read more about bitsandbytes on its dedicated documentation page: https://huggingface.co/docs/bitsandbytes/main
 
 
 ## TL;DR

From b370cee564ff9e240411c9c9a93493c441335b40 Mon Sep 17 00:00:00 2001
From: Titus <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 10:31:45 -0300
Subject: [PATCH 20/30] list of blog posts

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
 docs/source/integrations.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index a131ad105..55a685779 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -21,7 +21,7 @@ Few references:
 # Blog posts
 
 - [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
-
+- [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration)
 ###
 
 For instructions how to use LLM.int8() inference layers in your own code, see the TL;DR above or for extended instruction see [this blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).

From fd64f21052c66526525d764ed0176ed21a78eb07 Mon Sep 17 00:00:00 2001
From: Titus <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 10:31:56 -0300
Subject: [PATCH 21/30] list of blog posts

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
 docs/source/integrations.mdx | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index 55a685779..56e585db5 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -22,6 +22,3 @@ Few references:
 
 - [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
 - [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration)
-###
-
-For instructions how to use LLM.int8() inference layers in your own code, see the TL;DR above or for extended instruction see [this blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).

From 38d323abfeb351d3014c876c6342675fc466ab45 Mon Sep 17 00:00:00 2001
From: Titus <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 10:36:56 -0300
Subject: [PATCH 22/30] accept change suggestion

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
 docs/source/algorithms.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/algorithms.mdx b/docs/source/algorithms.mdx
index 558b5673e..53619bed2 100644
--- a/docs/source/algorithms.mdx
+++ b/docs/source/algorithms.mdx
@@ -1,7 +1,7 @@
 # Other algorithms
 _WIP: Still incomplete... Community contributions would be greatly welcome!_
 
-This is an overview of the algorithms in `bitsandbytes` that we think would also be useful as standalone entities.
+This is an overview of the parts of the functional API in `bitsandbytes` that we think would also be useful as standalone entities.
 
 ## Using Int8 Matrix Multiplication
 

From 82485d01c3a96791ca2ff9c4fb5e1d6066787c07 Mon Sep 17 00:00:00 2001
From: Titus <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 10:37:56 -0300
Subject: [PATCH 23/30] accept suggestion

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
 docs/source/integrations.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index 56e585db5..c3493fd38 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -15,7 +15,7 @@ e.g. for transformers state that you can load any model in 8-bit / 4-bit precisi
 
 Few references:
 
-- transformers: https://huggingface.co/docs/transformers/quantization#bitsandbytes
+- [transformers documentation]( https://huggingface.co/docs/transformers/quantization#bitsandbytes)
 - PEFT: https://huggingface.co/docs/peft/developer_guides/quantization
 
 # Blog posts

From 75cfb1c6e61710b4f64c78a6e5b2d2231907ecaf Mon Sep 17 00:00:00 2001
From: Titus <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 10:38:26 -0300
Subject: [PATCH 24/30] accept suggestion

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
 docs/source/integrations.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index c3493fd38..2a4e870e4 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -11,7 +11,7 @@
 ... TODO: to be filled out ...
 
 Here we point out to relevant doc sections in transformers / peft / Trainer + very briefly explain how these are integrated:
-e.g. for transformers state that you can load any model in 8-bit / 4-bit precision, for PEFT, you can use QLoRA out of the box with `LoraConfig` + 4-bit base model, for Trainer: all bnb optimizers are supported by passing the correct string in `TrainingArguments` : https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/training_args.py#L134
+e.g. for transformers state that you can load any model in 8-bit / 4-bit precision, for PEFT, you can use QLoRA out of the box with `LoraConfig` + 4-bit base model, for Trainer: all bnb optimizers are supported by passing the correct string in `TrainingArguments`'s `optim` attribute - e.g. (`paged_adamw_32bit`):
 
 Few references:
 

From 7a713902ddde957cfce58a6e141ee5480d628160 Mon Sep 17 00:00:00 2001
From: Titus <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 10:38:40 -0300
Subject: [PATCH 25/30] Update docs/source/integrations.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
 docs/source/integrations.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index 2a4e870e4..a2acc2680 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -16,7 +16,7 @@ e.g. for transformers state that you can load any model in 8-bit / 4-bit precisi
 Few references:
 
 - [transformers documentation]( https://huggingface.co/docs/transformers/quantization#bitsandbytes)
-- PEFT: https://huggingface.co/docs/peft/developer_guides/quantization
+- [PEFT documentation](https://huggingface.co/docs/peft/developer_guides/quantization)
 
 # Blog posts
 

From a84afcf171ac4c9e35fcab1a3af85ba2188838d8 Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 05:51:27 -0800
Subject: [PATCH 26/30] index instead of intro

---
 docs/source/_toctree.yml                    | 4 ++--
 docs/source/{introduction.mdx => index.mdx} | 8 +++++++-
 docs/source/optimizers.mdx                  | 1 -
 docs/source/quantization.mdx                | 4 ++--
 4 files changed, 11 insertions(+), 6 deletions(-)
 rename docs/source/{introduction.mdx => index.mdx} (84%)

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 21f79e288..ede41bb6c 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -1,7 +1,7 @@
 - title: Get started
   sections:
-  - local: introduction
-    title: Introduction
+  - local: index
+    title: Index
   - local: quickstart
     title: Quickstart
   - local: installation
diff --git a/docs/source/introduction.mdx b/docs/source/index.mdx
similarity index 84%
rename from docs/source/introduction.mdx
rename to docs/source/index.mdx
index c86623c98..e7e15ab4c 100644
--- a/docs/source/introduction.mdx
+++ b/docs/source/index.mdx
@@ -6,7 +6,13 @@ The library includes quantization primitives for 8-bit & 4-bit operations, throu
 
 There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
 
-## License
+## API documentation
+
+- [Linear4bit](quantization#linear4bit)
+- [Linear8bit](quantization#linear8bit)
+- [StableEmbedding](optimizers#stableembedding)
+
+# License
 
 The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: Pytorch is licensed under the BSD license.
 
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index 8199b73e1..d4597dd89 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -68,7 +68,6 @@ Find the API docs [here](#optim_api_docs). (still under construction)
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bitsandbytes/optimizer_comparison.png", width="50%">
 </div>
 
-
 See here an overview of the biggest models that can be trained based on optimizer usage:
 
 <div style="text-align: center">
diff --git a/docs/source/quantization.mdx b/docs/source/quantization.mdx
index 82045f039..3880cc089 100644
--- a/docs/source/quantization.mdx
+++ b/docs/source/quantization.mdx
@@ -2,12 +2,12 @@
 
 Below you will find the docstring of the quantization primitives exposed in bitsandbytes.
 
-## Linear4bit (QLoRA)
+## Linear4bit (QLoRA)[[linear4bit]]
 
 [[autodoc]] bitsandbytes.nn.Linear4bit
     - __init__
 
-## Linear8bitLt
+## Linear8bitLt[[linear8bit]]
 
 [[autodoc]] bitsandbytes.nn.Linear8bitLt
     - __init__

From d3709f4381f6583b7bb1f660c9bab64c69745292 Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 06:14:32 -0800
Subject: [PATCH 27/30] fixup README, add docs link

---
 README.md                    | 189 ++---------------------------------
 docs/source/contributing.mdx |   1 +
 docs/source/index.mdx        |   2 +-
 3 files changed, 10 insertions(+), 182 deletions(-)

diff --git a/README.md b/README.md
index 9abb282a8..a9fb7f4e5 100644
--- a/README.md
+++ b/README.md
@@ -1,192 +1,19 @@
-# bitsandbytes
+# `bitsandbytes`
 
-The bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
+The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 + 4-bit quantization functions.
 
+The library includes quantization primitives for 8-bit & 4-bit operations, through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit`, and 8-bit optimizers through the `bitsandbytes.optim` module.
 
-Read more about bitsandbytes on its dedicated documentation page: https://huggingface.co/docs/bitsandbytes/main
+There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
 
+**Please head to the official documentation page:**
 
-## TL;DR
-**Requirements**
-Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
+**[https://huggingface.co/docs/bitsandbytes/main](https://huggingface.co/docs/bitsandbytes/main)**
 
-(Deprecated: CUDA 10.0 is deprecated and only CUDA >= 11.0) will be supported with release 0.39.0)
 
-**Installation**:
 
-``pip install bitsandbytes``
+# License
 
-In some cases it can happen that you need to compile from source. If this happens please consider submitting a bug report with `python -m bitsandbytes` information. What now follows is some short instructions which might work out of the box if `nvcc` is installed. If these do not work see further below.
-
-Compilation quickstart:
-```bash
-git clone https://github.com/timdettmers/bitsandbytes.git
-cd bitsandbytes
-
-# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122}
-# make argument in {cuda110, cuda11x, cuda12x}
-# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes
-CUDA_VERSION=117 make cuda11x
-python setup.py install
-```
-
-**Using Int8 inference with HuggingFace Transformers**
-
-```python
-from transformers import AutoModelForCausalLM
-model = AutoModelForCausalLM.from_pretrained(
-  'decapoda-research/llama-7b-hf',
-  device_map='auto',
-  load_in_8bit=True,
-  max_memory={
-    i: f'{int(torch.cuda.mem_get_info(i)[0]/1024**3)-2}GB'
-    for i in range(torch.cuda.device_count())
-  }
-)
-```
-
-A more detailed example, can be found in [examples/int8_inference_huggingface.py](examples/int8_inference_huggingface.py).
-
-**Using 8-bit optimizer**:
-1. Comment out optimizer: ``#torch.optim.Adam(....)``
-2. Add 8-bit optimizer of your choice ``bnb.optim.Adam8bit(....)`` (arguments stay the same)
-3. Replace embedding layer if necessary: ``torch.nn.Embedding(..) -> bnb.nn.Embedding(..)``
-
-
-**Using 8-bit Inference**:
-1. Comment out torch.nn.Linear: ``#linear = torch.nn.Linear(...)``
-2. Add bnb 8-bit linear light module: ``linear = bnb.nn.Linear8bitLt(...)`` (base arguments stay the same)
-3. There are two modes:
-   - Mixed 8-bit training with 16-bit main weights. Pass the argument ``has_fp16_weights=True`` (default)
-   - Int8 inference. Pass the argument ``has_fp16_weights=False``
-4. To use the full LLM.int8() method, use the ``threshold=k`` argument. We recommend ``k=6.0``.
-```python
-# LLM.int8()
-linear = bnb.nn.Linear8bitLt(dim1, dim2, bias=True, has_fp16_weights=False, threshold=6.0)
-# inputs need to be fp16
-out = linear(x.to(torch.float16))
-```
-
-
-## Features
-- 8-bit Matrix multiplication with mixed precision decomposition
-- LLM.int8() inference
-- 8-bit Optimizers: Adam, AdamW, RMSProp, LARS, LAMB, Lion (saves 75% memory)
-- Stable Embedding Layer: Improved stability through better initialization, and normalization
-- 8-bit quantization: Quantile, Linear, and Dynamic quantization
-- Fast quantile estimation: Up to 100x faster than other algorithms
-
-## Requirements & Installation
-
-Requirements: anaconda, cudatoolkit, pytorch
-
-Hardware requirements:
- - LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
- - 8-bit optimizers and quantization: NVIDIA Kepler GPU or newer (>=GTX 78X).
-
-Supported CUDA versions: 10.2 - 12.2
-
-The bitsandbytes library is currently only supported on Linux distributions. Windows is not supported at the moment.
-
-The requirements can best be fulfilled by installing pytorch via anaconda. You can install PyTorch by following the ["Get Started"](https://pytorch.org/get-started/locally/) instructions on the official website.
-
-To install run:
-
-``pip install bitsandbytes``
-
-## Using bitsandbytes
-
-### Using Int8 Matrix Multiplication
-
-For straight Int8 matrix multiplication with mixed precision decomposition you can use ``bnb.matmul(...)``. To enable mixed precision decomposition, use the threshold parameter:
-```python
-bnb.matmul(..., threshold=6.0)
-```
-
-For instructions how to use LLM.int8() inference layers in your own code, see the TL;DR above or for extended instruction see [this blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).
-
-### Using the 8-bit Optimizers
-
-With bitsandbytes 8-bit optimizers can be used by changing a single line of code in your codebase. For NLP models we recommend also to use the StableEmbedding layers (see below) which improves results and helps with stable 8-bit optimization.  To get started with 8-bit optimizers, it is sufficient to replace your old optimizer with the 8-bit optimizer in the following way:
-```python
-import bitsandbytes as bnb
-
-# adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
-adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
-adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=8) # equivalent
-
-
-torch.nn.Embedding(...) ->  bnb.nn.StableEmbedding(...) # recommended for NLP models
-```
-
-Note that by default all parameter tensors with less than 4096 elements are kept at 32-bit even if you initialize those parameters with 8-bit optimizers. This is done since such small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm). You can change this behavior like so:
-```python
-# parameter tensors with less than 16384 values are optimized in 32-bit
-# it is recommended to use multiplies of 4096
-adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
-```
-
-### Change Bits and other Hyperparameters for Individual Parameters
-
-If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things: (1) register the parameter while they are still on the CPU, (2) override the config with the new desired hyperparameters (anytime, anywhere). See our [guide](howto_config_override.md) for more details
-
-### Fairseq Users
-
-To use the Stable Embedding Layer, override the respective `build_embedding(...)` function of your model. Make sure to also use the `--no-scale-embedding` flag to disable scaling of the word embedding layer (nor replaced with layer norm). You can use the optimizers by replacing the optimizer in the respective file (`adam.py` etc.).
-
-## Release and Feature History
-
-For upcoming features and changes and full history see [Patch Notes](CHANGELOG.md).
-
-## Errors
-
-1. RuntimeError: CUDA error: no kernel image is available for execution on the device. [Solution](errors_and_solutions.md#No-kernel-image-available)
-2. __fatbinwrap_.. [Solution](errors_and_solutions.md#fatbinwrap_)
-
-## Compile from source
-To compile from source, you need an installation of CUDA. If `nvcc` is not installed, you can install the CUDA Toolkit with nvcc through the following commands.
-
-```bash
-wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/install_cuda.sh
-# Syntax cuda_install CUDA_VERSION INSTALL_PREFIX EXPORT_TO_BASH
-#   CUDA_VERSION in {110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121, 122}
-#   EXPORT_TO_BASH in {0, 1} with 0=False and 1=True
-
-# For example, the following installs CUDA 11.7 to ~/local/cuda-11.7 and exports the path to your .bashrc
-bash install_cuda.sh 117 ~/local 1
-```
-
-To use a specific CUDA version just for a single compile run, you can set the variable `CUDA_HOME`, for example the following command compiles `libbitsandbytes_cuda117.so` using compiler flags for cuda11x with the cuda version at `~/local/cuda-11.7`:
-
-``CUDA_HOME=~/local/cuda-11.7 CUDA_VERSION=117 make cuda11x``
-
-For more detailed instruction, please follow the [compile_from_source.md](compile_from_source.md) instructions.
-
-## License
-
-The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: Pytorch is licensed under the BSD license.
+The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms, as the parts adapted from Pytorch are licensed under the BSD license.
 
 We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization.
-
-## How to cite us
-If you found this library and found LLM.int8() useful, please consider citing our work:
-
-```bibtex
-@article{dettmers2022llmint8,
-  title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
-  author={Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke},
-  journal={arXiv preprint arXiv:2208.07339},
-  year={2022}
-}
-```
-
-For 8-bit optimizers or quantization routines, please consider citing the following work:
-
-```bibtex
-@article{dettmers2022optimizers,
-  title={8-bit Optimizers via Block-wise Quantization},
-  author={Dettmers, Tim and Lewis, Mike and Shleifer, Sam and Zettlemoyer, Luke},
-  journal={9th International Conference on Learning Representations, ICLR},
-  year={2022}
-}
-```
diff --git a/docs/source/contributing.mdx b/docs/source/contributing.mdx
index a9d915ef7..8c0ad2dea 100644
--- a/docs/source/contributing.mdx
+++ b/docs/source/contributing.mdx
@@ -15,3 +15,4 @@ TODO: Add description + reference of HF docstring best practices.
 ## Documentation
 - [guideline for documentation syntax](https://github.com/huggingface/doc-builder#readme)
 - images shall be uploaded via PR in the `bitsandbytes/` directory [here](https://huggingface.co/datasets/huggingface/documentation-images)
+- find the documentation builds for each PR in a link posted to the PR, such as https://moon-ci-docs.huggingface.co/docs/bitsandbytes/pr_1012/en/introduction
diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index e7e15ab4c..0b033c3a9 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -14,6 +14,6 @@ There are ongoing efforts to support further hardware backends, i.e. Intel CPU +
 
 # License
 
-The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms: Pytorch is licensed under the BSD license.
+The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms, as the parts adapted from Pytorch are licensed under the BSD license.
 
 We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization.

From e00cbc941ba3ff45d4097df181a0bb9335e6c91d Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 06:17:20 -0800
Subject: [PATCH 28/30] add instructions for creating docstrings

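For reviewers, a minimal sketch of what a doc-string in this style could look like, assuming NumPy section layout with Markdown backticks for inline code. `clip_to_threshold` is a hypothetical helper used only for illustration and is not part of bitsandbytes; wrapping the types in backticks is likewise an assumption.

```python
def clip_to_threshold(values, threshold=6.0):
    """
    Clamp every entry of `values` to the range `[-threshold, threshold]`.

    Parameters
    ----------
    values : `list[float]`
        Input numbers; the list itself is not modified.
    threshold : `float`, defaults to `6.0`
        Largest absolute value kept in the output.

    Returns
    -------
    `list[float]`
        A new list with out-of-range entries replaced by the nearest bound.
    """
    # Clamp each value from above first, then from below.
    return [max(-threshold, min(threshold, v)) for v in values]


print(clip_to_threshold([-10.0, 0.5, 7.0]))
```

The section headers follow the NumPy layout, while inline code uses Markdown backticks.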
---
 docs/source/contributing.mdx | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/source/contributing.mdx b/docs/source/contributing.mdx
index 8c0ad2dea..b28e91936 100644
--- a/docs/source/contributing.mdx
+++ b/docs/source/contributing.mdx
@@ -10,7 +10,9 @@ Now all the pre-commit hooks will be automatically run when you try to commit an
 
 ## Doc-string syntax
 
-TODO: Add description + reference of HF docstring best practices.
+We're following NumPy doc-string conventions with the only notable difference being that we use Markdown instead of reStructuredText (reST) for markup within the doc-strings.
+
+Please see the existing documentation to see how to generate autodocs.
 
 ## Documentation
 - [guideline for documentation syntax](https://github.com/huggingface/doc-builder#readme)

From 8a67759cd91f707db3aa36b6dc1e5ab2b10dca35 Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 10:46:48 -0800
Subject: [PATCH 29/30] final polish (except integrations)

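The optimizer docs polished here say that block-wise quantization isolates outliers. A rough sketch of that idea in plain Python follows. It assumes simple linear absmax int8 quantization as a stand-in, which is not the dynamic quantization bitsandbytes actually uses, and all function names are hypothetical.

```python
def quantize_blockwise(values, block_size):
    """Quantize floats to int8 codes, with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(round(v / scale * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size):
    """Map int8 codes back to floats using the scale of their block."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


def small_value_error(values, block_size, n_small=4):
    """Largest round-trip error among the first `n_small` entries."""
    codes, scales = quantize_blockwise(values, block_size)
    restored = dequantize_blockwise(codes, scales, block_size)
    return max(abs(a - b) for a, b in zip(values[:n_small], restored[:n_small]))


# Seven small values and one outlier that lands in the second block of four.
state = [0.01, -0.02, 0.03, 0.015, 0.02, -0.01, 0.025, 100.0]

print("one block of 8 :", small_value_error(state, block_size=8))
print("two blocks of 4:", small_value_error(state, block_size=4))
```

With a single block the outlier sets the scale for every value, so the small entries collapse to zero. With two blocks the outlier only affects the block it sits in.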
---
 README.md                    | 12 +++----
 docs/source/algorithms.mdx   |  2 +-
 docs/source/errors.mdx       |  7 ++--
 docs/source/installation.mdx |  8 +++--
 docs/source/optimizers.mdx   | 69 ++++++++++++++++++------------------
 docs/source/quickstart.mdx   |  5 ++-
 6 files changed, 51 insertions(+), 52 deletions(-)

diff --git a/README.md b/README.md
index a9fb7f4e5..43eadf5a3 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,17 @@
 # `bitsandbytes`
 
-The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 + 4-bit quantization functions.
+The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions.
 
-The library includes quantization primitives for 8-bit & 4-bit operations, through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit` and 8bit optimizers through `bitsandbytes.optim` module.
+The library includes quantization primitives for 8-bit & 4-bit operations, through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit` and 8-bit optimizers through `bitsandbytes.optim` module.
 
-There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is on its way as well.
+There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is quite far along as well.
 
 **Please head to the official documentation page:**
 
 **[https://huggingface.co/docs/bitsandbytes/main](https://huggingface.co/docs/bitsandbytes/main)**
 
+## License
 
-
-# License
-
-The majority of bitsandbytes is licensed under MIT, however portions of the project are available under separate license terms, as the parts adapted from Pytorch are licensed under the BSD license.
+The majority of bitsandbytes is licensed under MIT; however, small portions of the project are available under separate license terms, as the parts adapted from PyTorch are licensed under the BSD license.
 
 We thank Fabio Cannizzo for his work on [FastBinarySearch](https://github.com/fabiocannizzo/FastBinarySearch) which we use for CPU quantization.
diff --git a/docs/source/algorithms.mdx b/docs/source/algorithms.mdx
index 53619bed2..d9db5cb04 100644
--- a/docs/source/algorithms.mdx
+++ b/docs/source/algorithms.mdx
@@ -1,7 +1,7 @@
 # Other algorithms
 _WIP: Still incomplete... Community contributions would be greatly welcome!_
 
-This is an overview of the functional API in `bitsandbytes` that we think would also be useful as standalone entities.
+This is an overview of the `bnb.functional` API in `bitsandbytes` that we think would also be useful as standalone entities.
 
 ## Using Int8 Matrix Multiplication
 
diff --git a/docs/source/errors.mdx b/docs/source/errors.mdx
index 68fb7f938..293017173 100644
--- a/docs/source/errors.mdx
+++ b/docs/source/errors.mdx
@@ -4,14 +4,11 @@
 
 This problem arises with the cuda version loaded by bitsandbytes is not supported by your GPU, or if you pytorch CUDA version mismatches.
 
-To solve this problem you need to debug ``$LD_LIBRARY_PATH``, ``$CUDA_HOME``, ``$PATH``. You can print these via ``echo $PATH``. You should look for multiple paths to different CUDA versions. This can include versions in your anaconda path, for example ``$HOME/anaconda3/lib``. You can check those versions via ``ls -l $HOME/anaconda3/lib/*cuda*`` or equivalent paths. Look at the CUDA versions of files in these paths. Does it match with ``nvidia-smi``?
+To solve this problem you need to debug ``$LD_LIBRARY_PATH``, ``$CUDA_HOME`` as well as ``$PATH``. You can print these via ``echo $PATH``. You should look for multiple paths to different CUDA versions. This can include versions in your anaconda path, for example ``$HOME/anaconda3/lib``. You can check those versions via ``ls -l $HOME/anaconda3/lib/*cuda*`` or equivalent paths. Look at the CUDA versions of files in these paths. Does it match with ``nvidia-smi``?
 
 If you are feeling lucky, you can also try to compile the library from source. This can be still problematic if your PATH variables have multiple cuda versions. As such, it is recommended to figure out path conflicts before you proceed with compilation.
 
-__If you encounter any other error not listed here please create an issue. This will help resolve your problem and will help out others in the future.
-
-
-## fatbinwrap
+## `fatbinwrap`
 
 This error occurs if there is a mismatch between CUDA versions in the C++ library and the CUDA part. Make sure you have right CUDA in your `$PATH` and `$LD_LIBRARY_PATH` variable. In the conda base environment you can find the library under:
 
diff --git a/docs/source/installation.mdx b/docs/source/installation.mdx
index fc559471d..ecdcdeb28 100644
--- a/docs/source/installation.mdx
+++ b/docs/source/installation.mdx
@@ -29,14 +29,14 @@ python setup.py install
 
 with `XXX` being your CUDA version, for <12.0 call `make cuda 11x`. Note support for non-CUDA GPUs (e.g. AMD, Intel), is also coming soon.
 
-For a more detailed guide, head to the [dedicated page on the topic](./compiling)
+For a more detailed compilation guide, head to the [dedicated page on the topic](./compiling)
 
 </hfoption>
 <hfoption id="Windows">
 
 ## Windows
 
-Currently for Windows users, you need to build bitsandbytes from source
+Currently for Windows users, you need to build bitsandbytes from source:
 
 ```bash
 git clone https://github.com/TimDettmers/bitsandbytes.git && cd bitsandbytes/
@@ -47,12 +47,14 @@ python -m build --wheel
 
 Big thanks to [wkpark](https://github.com/wkpark), [Jamezo97](https://github.com/Jamezo97), [rickardp](https://github.com/rickardp), [akx](https://github.com/akx) for their amazing contributions to make bitsandbytes compatible with Windows.
 
+For a more detailed compilation guide, head to the [dedicated page on the topic](./compiling)
+
 </hfoption>
 <hfoption id="MacOS">
 
 ## MacOS
 
-Mac support is still a work in progress. Please make sure to check out the latest bitsandbytes issues to get notified about the progress with respect to MacOS integration.
+Mac support is still a work in progress. Please make sure to check out the [Apple Silicon implementation coordination issue](https://github.com/TimDettmers/bitsandbytes/issues/1020) to get notified about the discussions and progress with respect to MacOS integration.
 
 </hfoption>
 
diff --git a/docs/source/optimizers.mdx b/docs/source/optimizers.mdx
index d4597dd89..18d20de1d 100644
--- a/docs/source/optimizers.mdx
+++ b/docs/source/optimizers.mdx
@@ -1,6 +1,6 @@
 # Introduction: 8-bit optimizers
 
-With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers:
+With 8-bit optimizers, larger models can be finetuned with the same GPU memory compared to standard 32-bit optimizer training. 8-bit optimizers are a drop-in replacement for regular optimizers, with the following properties:
 
 - Faster (e.g. 4x faster than regular Adam)
 - 75% less memory, same performance
@@ -8,12 +8,12 @@ With 8-bit optimizers, larger models can be finetuned with the same GPU memory c
 
 8-bit optimizers are mostly useful to finetune large models that did not fit into memory before. They also make it easier to pretrain larger models and have great synergy with sharded data parallelism. 8-bit Adam, for example, is already used across multiple teams in Facebook. This optimizer saves a ton of memory at no accuracy hit.
 
-Our 8-bit optimizers have three components:
+Generally, our 8-bit optimizers have three components:
 1. **block-wise quantization** isolates outliers and distributes the error more equally over all bits,
 2. **dynamic quantization** quantizes both small and large values with high precision,
 3. a **stable embedding layer** improves stability during optimization for models with word embeddings.
 
-With these components, performing an optimizer update with 8-bit states is straightforward and for GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers. [Further details below](#research-background)
+With these components, performing an optimizer update with 8-bit states is straightforward, and on GPUs this makes 8-bit optimizers much faster than regular 32-bit optimizers. [Further details below](#research-background)
 
 We feature 8-bit `Adagrad`, `Adam`, `AdamW`, `LAMB`, `LARS`, `Lion`, `RMSprop` and `SGD` (momentum).
 
@@ -24,27 +24,40 @@ We feature 8-bit `Adagrad`, `Adam`, `AdamW`, `LAMB`, `LARS`, `Lion`, `RMSprop` a
 ## Usage
 
 It only requires a two-line code change to get started.
-```py
+```diff
 import bitsandbytes as bnb
 
-# before: adam = torch.optim.Adam(...)
-adam = bnb.optim.Adam8bit(...)
+- adam = torch.optim.Adam(...)
++ adam = bnb.optim.Adam8bit(...)
 
 # recommended for NLP models
-# before: torch.nn.Embedding(...)
-bnb.nn.StableEmbedding(...)
+- torch.nn.Embedding(...)
++ bnb.nn.StableEmbedding(...)
 ```
 
-The arguments passed are the same as standard Adam. For NLP models we recommend also to use the StableEmbedding layers which improves results and helps with stable 8-bit optimization.
+The arguments passed are the same as standard Adam. For NLP models we recommend to also use the StableEmbedding layers which improves results and helps with stable 8-bit optimization.
 
 Note that by default all parameter tensors with less than 4096 elements are kept at 32-bit even if you initialize those parameters with 8-bit optimizers. This is done since such small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm). You can change this behavior like so:
 
 ```py
-# parameter tensors with less than 16384 values are optimized in 32-bit
-# it is recommended to use multiplies of 4096
+# Parameter tensors with fewer than 16384 values are optimized in 32-bit;
+# it is recommended to use multiples of 4096:
 adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
 ```
 
+Some more examples of how you can replace your old optimizer with the 8-bit optimizer:
+
+```diff
+import bitsandbytes as bnb
+
+- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
++ adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
+
+# use 32-bit Adam with 5th percentile clipping
++ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=32, percentile_clipping=5)
+- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
+```
+
 ## Overview of supported 8-bit optimizers
 
 Currently, `bitsandbytes` supports the following optimizers:
@@ -58,9 +71,9 @@ Currently, `bitsandbytes` supports the following optimizers:
 - `RMSprop`, `RMSprop8bit`, `RMSprop32bit`
 - `SGD`, `SGD8bit`, `SGD32bit`
 
-Additionally, for cases in which you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`, which is explained [below](#optim_manager).
+Additionally, for cases in which you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`, [as explained in greater detail below](#optim_manager).
 
-Find the API docs [here](#optim_api_docs). (still under construction)
+Find the API docs [here](#optim_api_docs) (still under construction).
 
 ## Overview of expected gains
 
@@ -81,12 +94,12 @@ Stateful optimizers maintain gradient statistics over time, e.g. the exponential
 To overcome the resulting computational, quantization and stability challenges, 8-bit optimizers have three components:
 
 1. **Block-wise quantization** divides input tensors into smaller blocks that are independently quantized, therein isolating outliers and distributing the error more equally over all bits. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization.
-2. **dynamic quantization**, which quantizes both small and large values with high precision,
+2. **Dynamic quantization** quantizes both small and large values with high precision, and
 3. a **stable embedding layer** improves stability during optimization for models with word embeddings.
 
 With these components, performing an optimizer update with 8-bit states is straightforward. We dequantize the 8-bit optimizer states to 32-bit, perform the update and then quantize the states back to 8-bit for storage.
 
-We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers.
+We do this 8-bit to 32-bit conversion element-by-element in registers, which means no slow copies to GPU memory or additional temporary memory are needed to perform quantization and dequantization. For GPUs, this makes 8-bit optimizers much faster than regular 32-bit optimizers.
 
 For more details, please refer to the paper [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861).
 
@@ -105,7 +118,7 @@ The Stable Embedding Layer enhances the standard word embedding layer for improv
 - Designed to support more aggressive quantization strategies without compromising training stability.
 - Helps in achieving stable training outcomes, particularly important for models dealing with diverse and complex language data.
 
-## Paged Optimizers
+## Paged optimizers
 
 Paged optimizers are build on top of the [unified memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) feature of CUDA. This feature is not supported by PyTorch and we added it to `bitsandbytes`.
 
@@ -119,27 +132,13 @@ Compared to CPU offloading, this has the advantage that there is zero overhead i
 
 [Find more details in this discussion](https://github.com/TimDettmers/bitsandbytes/issues/962).
 
-## Usage
-
-Some more examples of how you can replace your old optimizer with the 8-bit optimizer:
-
-```diff
-import bitsandbytes as bnb
-
-- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
-+ adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # add bnb optimizer
-
-# use 32-bit Adam with 5th percentile clipping
-+ adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=32, percentile_clipping=5)
-- adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995)) # comment out old optimizer
-```
 
-### How to override config hyperparameters for particular weights/parameters[[optim_manager]]
+## `GlobalOptimManager`: How to override config hyperparameters for particular weights/parameters[[optim_manager]]
 
 If you want to optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, you can use the `GlobalOptimManager`. With this, we can also configure specific hyperparameters for particular layers, such as embedding layers. To do that, we need two things:
 
-1. Register the parameter while they are still on the CPU,
-2. override the config with the new desired hyperparameters (anytime, anywhere)
+1. Register the parameters while they are still on the CPU.
+2. Override the config with the new desired hyperparameters (anytime, anywhere).
 
 For global overrides in many different places in your code you can do:
 
@@ -164,9 +163,9 @@ mng.override_config(model.fc1.weight, 'optim_bits', 32)
 mng.override_config([model.special.weight, model.also_special.weight],
                     key_value_dict ={'is_sparse': True, 'lr': 1e-5, 'betas'=(0.9, 0.98)})
 ```
-Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`
+Possible options for the config override are: `betas, eps, weight_decay, lr, optim_bits, min_8bit_size, percentile_clipping, block_wise, max_unorm`.
 
-For overrides for particular layers we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
+For overrides for particular layers, we recommend overriding locally in each module. You can do this by passing the module, the parameter, and its attribute name to the GlobalOptimManager:
 ```py
 class MyModule(torch.nn.Module):
   def __init__(din, dout):
diff --git a/docs/source/quickstart.mdx b/docs/source/quickstart.mdx
index 3a560ff6b..ed92c896b 100644
--- a/docs/source/quickstart.mdx
+++ b/docs/source/quickstart.mdx
@@ -4,9 +4,12 @@
 
 ... work in progress ...
 
-## Minimal example
+(Community contributions would be very welcome!)
+
+## Minimal examples
 
 The following code illustrates the steps above.
 
 ```py
+# code examples will soon follow
 ```

From d6325311c35b040f67300c34840d359905143d6a Mon Sep 17 00:00:00 2001
From: Titus von Koeller <9048635+Titus-von-Koeller@users.noreply.github.com>
Date: Sun, 4 Feb 2024 11:08:16 -0800
Subject: [PATCH 30/30] fill out integrations section

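Background for the bf16 note added to integrations.mdx: a small standard-library sketch of why `bfloat16` is less prone to overflow than `float16`. It emulates `bfloat16` by truncating the float32 bit pattern, whereas real hardware rounds, so treat the exact values as illustrative. `to_bfloat16` and `fits_float16` are hypothetical helpers, not part of any library.

```python
import struct


def to_bfloat16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_float16(x):
    """Return True if x can be packed as an IEEE float16 without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


for value in (65504.0, 70000.0, 1e30):
    print(value, "fits float16:", fits_float16(value), "| bfloat16:", to_bfloat16(value))
```

`bfloat16` keeps the 8 exponent bits of `float32` and gives up mantissa bits instead, so it trades precision for range.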
---
 docs/source/integrations.mdx | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/docs/source/integrations.mdx b/docs/source/integrations.mdx
index a2acc2680..7857abf4c 100644
--- a/docs/source/integrations.mdx
+++ b/docs/source/integrations.mdx
@@ -1,23 +1,41 @@
 # Transformers
 
-... TODO: to be filled out ...
+With Transformers it's very easy to load any model in 4 or 8-bit, quantizing them on the fly with bitsandbytes primitives.
+
+Please review the [bitsandbytes section in the Transformers docs](https://huggingface.co/docs/transformers/v4.37.2/en/quantization#bitsandbytes).
+
+Details about the `BitsAndBytesConfig` can be found [here](https://huggingface.co/docs/transformers/v4.37.2/en/main_classes/quantization#transformers.BitsAndBytesConfig).
+
+## Beware: bf16 is the optimal compute data type
+If your hardware supports it, `bf16` is the optimal compute dtype. The default is `float32` for backward compatibility and numerical stability. `float16` often leads to numerical instabilities, but `bfloat16` provides the best of both worlds: numerical stability and significant computation speedup. Therefore, be sure to check if your hardware supports `bf16` and configure it using the `bnb_4bit_compute_dtype` parameter in `BitsAndBytesConfig`:
+
+```py
+import torch
+from transformers import BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
+```
 
 # PEFT
+With `PEFT`, you can use QLoRA out of the box with `LoraConfig` and a 4-bit base model.
+
+Please review the [bitsandbytes section in the PEFT docs](https://huggingface.co/docs/peft/developer_guides/quantization#quantize-a-model).
 
-... TODO: to be filled out ...
+# Accelerate
+
+Bitsandbytes is also easily usable from within Accelerate.
+
+Please review the [bitsandbytes section in the Accelerate docs](https://huggingface.co/docs/accelerate/en/usage_guides/quantization).
 
 # Trainer for the optimizers
 
-... TODO: to be filled out ...
+You can use any of the 8-bit and/or paged optimizers by simply passing them to the `transformers.Trainer` class on initialization. All bnb optimizers are supported by passing the correct string in `TrainingArguments`'s `optim` attribute, e.g. `paged_adamw_32bit`.
+
+See the [official API docs for reference](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer).
 
 Here we point out to relevant doc sections in transformers / peft / Trainer + very briefly explain how these are integrated:
 e.g. for transformers state that you can load any model in 8-bit / 4-bit precision, for PEFT, you can use QLoRA out of the box with `LoraConfig` + 4-bit base model, for Trainer: all bnb optimizers are supported by passing the correct string in `TrainingArguments`'s `optim` attribute - e.g. (`paged_adamw_32bit`):
 
-Few references:
-
-- [transformers documentation]( https://huggingface.co/docs/transformers/quantization#bitsandbytes)
-- [PEFT documentation](https://huggingface.co/docs/peft/developer_guides/quantization)
-
 # Blog posts
 
 - [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)