Squashed commit of the following:
commit 9250546
Author: S1ro <54212263+S1ro1@users.noreply.github.com>
Date:   Sun Sep 8 03:44:04 2024 +0200

    Feat: add kl div to readme (linkedin#229)

    ## Summary
    Adds newly implemented kl divergence loss to readme. Closes linkedin#188
    finally.

    ## Testing Done
    No code changes

    ---------

    Co-authored-by: Shao Tang <tangshao28@gmail.com>
    Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>

commit 1cdb7f0
Author: S1ro <54212263+S1ro1@users.noreply.github.com>
Date:   Sun Sep 8 03:19:19 2024 +0200

    Refactor/benchmarking visualizer (linkedin#212)

    ## Summary
    Implements a new script, `benchmark/benchmarks_visualizer.py`, that
    replaces the functionality of the current
    `benchmark/benchmarks_visualizer.ipynb`. Resolves linkedin#211.

    ## Details
    ```console
    $ python3 benchmarks_visualizer.py --help
    usage: benchmarks_visualizer.py [-h] --kernel-name KERNEL_NAME --metric-name METRIC_NAME --kernel-operation-mode KERNEL_OPERATION_MODE [--display] [--overwrite]

    options:
      -h, --help            show this help message and exit
      --kernel-name KERNEL_NAME
                            Kernel name to benchmark
      --metric-name METRIC_NAME
                            Metric name to visualize (speed/memory)
      --kernel-operation-mode KERNEL_OPERATION_MODE
                            Kernel operation mode to visualize (forward/backward/full)
      --display             Display the visualization
      --overwrite           Overwrite existing visualization, if none exist this flag has no effect as one are always created
    ```

    ## Testing Done

    - Hardware Type: <BLANK>
    - [ ] run `make test` to ensure correctness
    - [ ] run `make checkstyle` to ensure code style
    - [ ] run `make test-convergence` to ensure convergence

    ---------

    Co-authored-by: Shao Tang <tangshao28@gmail.com>

commit 18fd280
Author: Wizyoung <happyyanghehe@gmail.com>
Date:   Sun Sep 8 07:56:04 2024 +0800

    (fix) fix pyproject.toml (linkedin#226)

    ## Summary
    In linkedin#218, I fixed the
    `tool.setuptools.packages.find` field and tested it only in editable
    mode with `pip install -e .`. However, in production mode with `pip
    install .`, only the env_report.py file is copied to the Python
    site-packages directory. To fix this, adding "liger_kernel.*" to the
    include list will ensure that setuptools correctly includes all
    subpackages within liger_kernel.
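
To illustrate why the extra pattern matters, here is a minimal sketch. It assumes that setuptools compares each `include` entry against dotted package names as a shell-style pattern, so `liger_kernel` matches only the top-level package while `liger_kernel.*` matches its subpackages. The helper `select_packages` and the list of discovered names are hypothetical stand-ins for illustration, not setuptools code.

```python
from fnmatch import fnmatchcase

# Hypothetical dotted names a package scan might discover under src/.
DISCOVERED = ["liger_kernel", "liger_kernel.ops", "liger_kernel.transformers"]


def select_packages(discovered, include):
    """Keep packages whose dotted name matches at least one include pattern."""
    return [
        name
        for name in discovered
        if any(fnmatchcase(name, pattern) for pattern in include)
    ]


# Include list without the wildcard: only the top-level package survives.
before = select_packages(DISCOVERED, ["liger_kernel"])
# Include list with "liger_kernel.*": subpackages are selected as well.
after = select_packages(DISCOVERED, ["liger_kernel", "liger_kernel.*"])

print(before)
print(after)
```

Under that assumption, the first list contains only the top-level package, which is consistent with only top-level files such as `env_report.py` being installed.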

    ## Testing Done

    - Hardware Type: <BLANK>
    - [ ] run `make test` to ensure correctness
    - [ ] run `make checkstyle` to ensure code style
    - [ ] run `make test-convergence` to ensure convergence

    ---------

    Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>

commit 638b310
Author: Wizyoung <happyyanghehe@gmail.com>
Date:   Sat Sep 7 11:53:15 2024 +0800

    add repr information for layer_norm and rms_norm (linkedin#220)

    ## Summary
    Add repr information to the layernorm and rmsnorm classes so that useful
    layer information is displayed when the model is printed. Other
    classes are not modified because they either inherit from the related
    torch.nn classes or contain torch.nn sub-modules.

    ## Testing Done

    - Hardware Type: <BLANK>
    - [x] run `make test` to ensure correctness
    - [x] run `make checkstyle` to ensure code style
    - [x] run `make test-convergence` to ensure convergence

    ---------

    Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
    Co-authored-by: Shao Tang <tangshao28@gmail.com>

commit 07804e4
Author: Ivan Yashchuk <IvanYashchuk@users.noreply.github.com>
Date:   Sat Sep 7 06:30:32 2024 +0300

    Update swiglu and geglu forward: zeros_like -> empty_like (linkedin#217)

    ## Summary
    This PR improves the performance of swiglu and geglu forward by
    replacing `zeros_like` with `empty_like`. The difference is that
    `empty_like` doesn't require a separate kernel launch.

    ## Testing Done
    Testing is covered by existing `test_geglu.py` and `test_swiglu.py`.

    - Hardware Type: A100-80G-PCIe
    - [x] run `make test` to ensure correctness
    - [x] run `make checkstyle` to ensure code style
    - [x] run `make test-convergence` to ensure convergence

    ---------

    Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
    Co-authored-by: Shao Tang <tangshao28@gmail.com>

commit 6a75ddc
Author: Byron Hsu <byronhsu1230@gmail.com>
Date:   Fri Sep 6 20:16:05 2024 -0700

    Update README.md

commit 8cf49e2
Author: Byron Hsu <byronhsu1230@gmail.com>
Date:   Fri Sep 6 20:13:51 2024 -0700

    Update README.md

commit 53dcf02
Author: Wizyoung <happyyanghehe@gmail.com>
Date:   Sat Sep 7 07:13:28 2024 +0800

    (fix) fix pyproject.toml (linkedin#218)

    ## Summary
    Fix the `tool.setuptools.packages.find` field in pyproject.toml. Otherwise,
    in local build mode with `pip install .`, Python fails to locate
    liger_kernel.

    Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>

commit b42a27b
Author: Steven Shimizu <shimizust@gmail.com>
Date:   Fri Sep 6 14:16:41 2024 -0700

    Added HF use-case benchmark script (linkedin#223)

    ## Summary
    - Added Hugging Face training benchmarking script used for tech report
    - Writes files to
    `/results/${MODEL_TYPE}_use_liger_${USE_LIGER}_batch_size_${BATCH_SIZE}_rep_${i}.log`

    ## Testing Done
    - Ran benchmarking script

    - Hardware Type: A100
    - [x] run `make test` to ensure correctness
    - [x] run `make checkstyle` to ensure code style
    - [x] run `make test-convergence` to ensure convergence

commit 43cbd4e
Author: Tcc0403 <76503978+Tcc0403@users.noreply.github.com>
Date:   Sat Sep 7 05:07:01 2024 +0800

    Add label smoothing for cross entropy (linkedin#198)

    ## Summary
    Aim to solve linkedin#81.

    ## Details

    ### For loss:
    Label smoothing regularization (LSR) replaces the label
    distribution $q(k) = \delta_{k,y}$ with
    ```math
    q'(k) = (1 - \epsilon)\delta_{k,y} + \frac{\epsilon}{K}
    ```
    The cross entropy with LSR is then

    ```math
    \begin{align}
    L' = H(q', p) &= -\sum^K_{k=1} q'(k)\log p(k) = -\sum^K_{k=1}\left((1 - \epsilon)\delta_{k,y} + \frac{\epsilon}{K}\right)\log p(k)\\
                  &= -(1 - \epsilon)\sum^K_{k=1} q(k)\log p(k) - \frac{\epsilon}{K}\sum^K_{k=1}\log p(k)\\
                  &= (1 - \epsilon)H(q,p) - \frac{\epsilon}{K}\sum^K_{k=1}\log softmax(x_k)\\
                  &= (1 - \epsilon)L + \frac{\epsilon}{K}\ SmoothLoss,
    \end{align}
    ```
    where $L = H(q,p)$ is the original loss and
    $SmoothLoss = -\sum^K_{k=1}\log softmax(x_k)$ is the smooth loss.

    ### For gradients:
    The original:
    ```math
    \begin{align}
    \frac{\partial L}{\partial x_i} &= p(i) - q(i)\\
                                    &= \begin{cases}
                                           softmax(x_i),     & i \neq y \\
                                           softmax(x_i) - 1, & i = y
                                       \end{cases}
    \end{align}
    ```
    With LSR:
    ```math
    \begin{align}
    \frac{\partial L'}{\partial x_i} &= p(i) - q'(i)\\
                                     &= softmax(x_i) - (1 - \epsilon)\delta_{i,y} - \frac{\epsilon}{K}\\
                                     &= \begin{cases}
                                            softmax(x_i) - \frac{\epsilon}{K},                  & i \neq y \\
                                            softmax(x_i) - \frac{\epsilon}{K} - (1 - \epsilon), & i = y
                                        \end{cases}
    \end{align}
    ```

    We can handle the $i = y$ case by simply adding $-(1-\epsilon)$ after
    computing all $i$.

    Reference:
    [Rethinking the Inception Architecture for Computer
    Vision](https://arxiv.org/abs/1512.00567)
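
The loss and gradient described above can be checked numerically with a short pure-Python sketch. It is not the Triton kernel from this PR: it handles a single sample given as a plain list of logits, and the names `log_softmax` and `smoothed_ce` are hypothetical. The smooth loss is taken here as the negative sum of the log-probabilities, so that it is added to the original loss with weight $\epsilon / K$.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def smoothed_ce(logits, y, eps):
    """Return (loss, grad) of label-smoothed cross entropy for one sample.

    loss   = (1 - eps) * L + (eps / K) * SmoothLoss
    grad_i = softmax(x_i) - eps / K, with an extra -(1 - eps) at i == y
    """
    K = len(logits)
    logp = log_softmax(logits)
    ce = -logp[y]            # original loss L = H(q, p)
    smooth = -sum(logp)      # SmoothLoss = -sum_k log softmax(x_k)
    loss = (1.0 - eps) * ce + (eps / K) * smooth
    # Compute the i != y expression for every index first ...
    grad = [math.exp(lp) - eps / K for lp in logp]
    # ... then handle i == y by adding -(1 - eps) to that single entry.
    grad[y] -= 1.0 - eps
    return loss, grad


loss, grad = smoothed_ce([2.0, -1.0, 0.5, 0.0], y=2, eps=0.1)
print(loss, grad)
```

Because both the softmax and the smoothed label distribution sum to 1, the gradient entries sum to zero, which makes a quick sanity check alongside a finite-difference comparison.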

    ## Testing Done
    Add a unit test for label smoothing.

    - Hardware Type: RTX-3080
    - [x] run `make test` to ensure correctness
    - [x] run `make checkstyle` to ensure code style
    - [x] run `make test-convergence` to ensure convergence
    ```bash
    ❯ python3 -m pytest test/transformers/test_cross_entropy.py
    ============================================ test session starts =============================================
    platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
    rootdir: /home/tcc/Liger-Kernel
    collected 94 items

    test/transformers/test_cross_entropy.py .............................................................. [ 65%]
    ...............................F                                                                       [100%]

    ================================================== FAILURES ==================================================
    __________________________________ test_large_no_exception[8-16384-128256] ___________________________________

    B = 8, T = 16384, V = 128256

        @pytest.mark.parametrize(
            "B, T, V",
            [
                (
                    8,
                    8192,
                    128256,
                ),  # _input = 16GB, total = ~32GB, 8405385216 > 2,147,483,647, so we need int64
                (8, 16384, 128256),  # _input = 32GB, total = ~64GB
            ],
        )
        # @pytest.mark.skipif(
        #     torch.cuda.get_device_properties(0).total_memory < 64 * 1000 * 1000 * 1000,
        #     reason="Needs 64GB+ GPU memory.",
        # )
        def test_large_no_exception(B, T, V):
            # The large inputs were hitting cuda illegal memory access because of
            # triton-lang/triton#1058
    >       _full_pass_once(B, T, V)

    test/transformers/test_cross_entropy.py:401:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    B = 8, T = 16384, V = 128256

        def _full_pass_once(B, T, V):
            torch.manual_seed(0)
            liger_ce = LigerCrossEntropyLoss()

    >       _input = torch.randn(
                B * T, V, requires_grad=True, device="cuda", dtype=torch.bfloat16
            )
    E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.31 GiB. GPU 0 has a total capacity of 10.00 GiB of which 8.84 GiB is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

    test/transformers/test_cross_entropy.py:374: OutOfMemoryError
    ========================================== short test summary info ===========================================
    FAILED test/transformers/test_cross_entropy.py::test_large_no_exception[8-16384-128256] - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.31 GiB. GPU 0 has a total capacity of 10...
    ================================== 1 failed, 93 passed in 130.88s (0:02:10) ==================================
    ```
    ```bash
    ❯ make test
    python -m pytest --disable-warnings test/ --ignore=test/convergence
    ============================================ test session starts =============================================
    platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
    rootdir: /home/tcc/Liger-Kernel
    collected 256 items

    test/transformers/test_auto_model.py .                                                                 [  0%]
    test/transformers/test_cross_entropy.py ssssssssssssssssssssssss............ssssssssssssssssssssssssss [ 24%]
    ssssssssssssssssssssssssssssssss                                                                       [ 37%]
    test/transformers/test_embedding.py ...........                                                        [ 41%]
    test/transformers/test_fused_linear_cross_entropy.py ................                                  [ 47%]
    test/transformers/test_geglu.py ............                                                           [ 52%]
    test/transformers/test_layer_norm.py ................                                                  [ 58%]
    test/transformers/test_monkey_patch.py .....                                                           [ 60%]
    test/transformers/test_rms_norm.py ............................................................        [ 83%]
    test/transformers/test_rope.py ..................                                                      [ 91%]
    test/transformers/test_swiglu.py ....................                                                  [ 98%]
    test/transformers/test_trainer_integration.py .                                                        [ 99%]
    test/triton/test_triton_monkey_patch.py ..                                                             [100%]

    ================================ 174 passed, 82 skipped in 123.06s (0:02:03) =================================
    ```
    ```bash
    ❯ make checkstyle
    flake8 .; flake8_status=$?; \
    isort .; isort_status=$?; \
    black .; black_status=$?; \
    if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
            exit 1; \
    fi
    Skipped 2 files
    All done! ✨ 🍰 ✨
    68 files left unchanged.
    ```
    ```bash
    ❯ make test-convergence
    HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
    ============================================ test session starts =============================================
    platform linux -- Python 3.10.12, pytest-8.3.2, pluggy-1.5.0
    rootdir: /home/tcc/Liger-Kernel
    collected 30 items

    test/convergence/test_mini_models.py ..............                                                    [ 46%]
    test/convergence/test_mini_models_no_logits.py ................                                        [100%]

    ======================================= 30 passed in 223.18s (0:03:43) =======================================
    ```

commit 376fe0c
Author: Yanning Chen <momochenonline@gmail.com>
Date:   Fri Sep 6 13:10:02 2024 -0700

    Reference Unsloth in header (linkedin#216)

    ## Summary
    Reference Unsloth in header section

    ## Testing Done

    - Hardware Type: <BLANK>
    - [x] run `make test` to ensure correctness
    - [x] run `make checkstyle` to ensure code style
    - [x] run `make test-convergence` to ensure convergence

commit c844f78
Author: Byron Hsu <byronhsu1230@gmail.com>
Date:   Fri Sep 6 13:08:22 2024 -0700

    Update README.md

commit ec68ac0
Author: Byron Hsu <byronhsu1230@gmail.com>
Date:   Fri Sep 6 13:07:18 2024 -0700

    Add license in ack section (linkedin#224)

    ## Summary

    ## Testing Done

    - Hardware Type: <BLANK>
    - [ ] run `make test` to ensure correctness
    - [ ] run `make checkstyle` to ensure code style
    - [ ] run `make test-convergence` to ensure convergence

commit ec63200
Author: Byron Hsu <byronhsu1230@gmail.com>
Date:   Fri Sep 6 12:58:33 2024 -0700

    Elaborate ack section (linkedin#222)

    ## Summary

    ## Testing Done

    - Hardware Type: <BLANK>
    - [ ] run `make test` to ensure correctness
    - [ ] run `make checkstyle` to ensure code style
    - [ ] run `make test-convergence` to ensure convergence
wizyoung committed Sep 8, 2024
1 parent d1e8240 commit 08981d3
Showing 12 changed files with 497 additions and 187 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -10,4 +10,7 @@ site/

# Build
build/
dist/
dist/

# Benchmark images
benchmark/visualizations
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -55,7 +55,7 @@ The `/benchmark` directory contains benchmarking scripts for the individual kern
- Existing entries that are the same (based on `kernel_name`, `kernel_provider`, `kernel_operation_mode`, `metric_name`, `x_name`, `x_value`, `extra_benchmark_config_str`, and `gpu_name`) will not be overwritten.
2. Run `make run-benchmarks OVERWRITE=1` to overwrite any existing entries that have the same configuration.
3. Run `python benchmark/scripts/benchmark_{kernel_name}.py` to run an individual benchmark.
4. You can use the `benchmark/benchmarks_visualizer.ipynb` notebook as an example to load the CSV and perform data visualization/analysis.
4. You can use the `benchmark/benchmarks_visualizer.py` script to generate visualizations from the CSV; these are saved to the `benchmark/visualizations` directory (note: this directory is not tracked by git).

## Submit PR
Fork the repo, copy and paste the successful test logs in the PR and submit the PR followed by the PR template (**[example PR](https://github.com/linkedin/Liger-Kernel/pull/21)**).
34 changes: 27 additions & 7 deletions README.md
@@ -40,18 +40,19 @@

<img src="https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/logo-banner.png">

[Installation](#installation) | [Getting Started](#getting-started) | [Examples](#examples) | [APIs](#apis) | [Structure](#structure) | [Contributing](#contributing) | [Contact](#contact)
[Installation](#installation) | [Getting Started](#getting-started) | [Examples](#examples) | [APIs](#apis) | [Structure](#structure) | [Contributing](#contributing) | [Acknowledgement](#acknowledgement)

<details>
<summary>Latest News 🔥</summary>

- [2024/9/6] We release v0.2.1 ([X post](https://x.com/liger_kernel/status/1832168197002510649)). 2500+ Stars, 10+ New Contributors, 50+ PRs, 50k Downloads in two weeks!
- [2024/8/31] CUDA MODE talk, [Liger-Kernel: Real-world Triton kernel for LLM Training](https://youtu.be/gWble4FreV4?si=dxPeIchhkJ36Mbns), [Slides](https://github.com/cuda-mode/lectures?tab=readme-ov-file#lecture-28-liger-kernel)
- [2024/8/23] Official release: check out our [X post](https://x.com/hsu_byron/status/1827072737673982056)

</details>


**Liger Kernel** is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU **training throughput by 20%** and reduce **memory usage by 60%**. We have implemented **Hugging Face Compatible** `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, and more to come. The kernel works out of the box with [Flash Attention](https://github.com/Dao-AILab/flash-attention), [PyTorch FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html), and [Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed). We welcome contributions from the community to gather the best kernels for LLM training.
**Liger Kernel** is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU **training throughput by 20%** and reduces **memory usage by 60%**. We have implemented **Hugging Face Compatible** `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, and more to come. The kernel works out of the box with [Flash Attention](https://github.com/Dao-AILab/flash-attention), [PyTorch FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html), and [Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed). We welcome contributions from the community to gather the best kernels for LLM training.

## Supercharge Your Model with Liger Kernel

@@ -131,7 +132,7 @@ pip install -e .
```
## Getting Started

There are a couple ways to apply Liger kernels, depending on the level of customization required.
There are a couple of ways to apply Liger kernels, depending on the level of customization required.

### 1. Use AutoLigerKernelForCausalLM

@@ -241,6 +242,7 @@ loss.backward()
| GeGLU | `liger_kernel.transformers.LigerGEGLUMLP` |
| CrossEntropy | `liger_kernel.transformers.LigerCrossEntropyLoss` |
| FusedLinearCrossEntropy | `liger_kernel.transformers.LigerFusedLinearCrossEntropyLoss`|
| KLDivergence | `liger_kernel.transformers.LigerKLDIVLoss` |

- **RMSNorm**: [RMSNorm](https://arxiv.org/pdf/1910.07467), which normalizes activations using their root mean square, is implemented by fusing the normalization and scaling steps into a single Triton kernel, and achieves ~3X speedup with ~3X peak memory reduction.
- **LayerNorm**: [LayerNorm](https://arxiv.org/pdf/1607.06450), which centers and normalizes activations across the feature dimension, is implemented by fusing the centering, normalization and scaling steps into a single Triton kernel, and achieves ~2X speedup.
@@ -254,7 +256,7 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
- **CrossEntropy**: [Cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) is implemented by computing both the loss and gradient in the forward pass with inplace replacement of input to reduce the peak memory by avoiding simultaneous materialization of both input logits and gradient. It achieves >2X speedup and >4X memory reduction for common vocab sizes (e.g., 32K, 128K, etc.).
<!-- TODO: verify vocab sizes are accurate -->
- **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.

- **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward pass into a single Triton kernel, with the reduction done outside the kernel. It achieves ~1.5X speedup and ~15% memory reduction for 128K vocab size.
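
As a rough illustration of the "elementwise pass first, reduction afterwards" split mentioned for KLDivergence, here is a pure-Python sketch. It is not the Triton kernel; the function names are hypothetical. It assumes the `torch.nn.KLDivLoss` convention in which the input holds log-probabilities and the target holds probabilities, with `batchmean` dividing the summed loss by the batch size.

```python
import math


def kl_div_pointwise(log_q, p):
    """Elementwise terms p * (log p - log q), using the convention 0 * log 0 = 0."""
    return [
        [pi * (math.log(pi) - lqi) if pi > 0.0 else 0.0
         for lqi, pi in zip(lq_row, p_row)]
        for lq_row, p_row in zip(log_q, p)
    ]


def kl_div(log_q, p, reduction="batchmean"):
    """Apply the reduction as a separate step after the elementwise pass."""
    terms = kl_div_pointwise(log_q, p)
    if reduction == "none":
        return terms
    total = sum(sum(row) for row in terms)
    if reduction == "sum":
        return total
    if reduction == "batchmean":
        return total / len(terms)  # divide by the number of rows (batch size)
    raise ValueError(f"unsupported reduction: {reduction}")


# Two rows of target probabilities p and predicted probabilities q.
p = [[0.5, 0.5], [1.0, 0.0]]
q = [[0.25, 0.75], [0.5, 0.5]]
log_q = [[math.log(v) for v in row] for row in q]
print(kl_div(log_q, p))
```

Keeping the reduction outside the elementwise step means one elementwise routine can serve every reduction mode.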

### Experimental Kernels

@@ -290,12 +292,30 @@ Since Liger Kernel is 100% Triton-based, it works seamlessly with [`torch.compil

## Acknowledgement


### Design

- [@claire_yishan](https://twitter.com/claire_yishan) for the LOGO design
- [flash-attn](https://github.com/Dao-AILab/flash-attention) and [Unsloth](https://github.com/unslothai/unsloth) for inspiration in Triton kernels for training
- [tiny shakespeare dataset](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) by Andrej Karpathy for convergence testing
- [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy) for lm_head + cross entropy inspiration
- [Wave Snippets](https://www.wavesnippets.com/) for generating the animated code snippets

### Code

We referenced or used the following projects:



| # | Project | Description | Location | License |
|---|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| 1 | [Unsloth](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/unsloth/kernels/utils.py#L43) | `calculate_settings` to determine block size and warp; We reuse it for Norm and MLP | [Liger Kernel Utils](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/utils.py#L23) | [Apache](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/LICENSE) |
| 2 | [Unsloth](https://github.com/unslothai/unsloth/blob/976d11a10d54383aeb7a692c69e01151a20bfd72/unsloth/kernels/rms_layernorm.py#L48) | We modified and added dW calculation on top of Unsloth implementation | [Liger Kernel RMS Norm](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/rms_norm.py#L50) | [Apache](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/LICENSE) |
| 3 | [Triton tutorial](https://triton-lang.org/main/index.html) | We modified on top of triton tutorials | [Liger Kernel RMS Norm](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/rms_norm.py#L50) | [MIT](https://github.com/triton-lang/triton/blob/main/LICENSE) |
| 4 | [tiny shakespeare dataset](https://huggingface.co/datasets/karpathy/tiny_shakespeare) | We use tiny shakespeare dataset to conduct convergence test on mini model | [Liger Kernel Convergence](https://github.com/linkedin/Liger-Kernel/tree/main/test/convergence) | N/A |
| 5 | [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy) | We use the idea of gradient-in-forward and chunking | [Liger Kernel Linear Cross Entropy](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py) | [MIT](https://github.com/mgmalek/efficient_cross_entropy/blob/main/LICENSE) |
| 6 | [Flash attn](https://github.com/Dao-AILab/flash-attention) | We take many optimization ideas from the work, such as tiling and recomputation | | [BSD](https://github.com/Dao-AILab/flash-attention/blob/main/LICENSE) |
| 7 | [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) | We reference the design of automodel | [Liger Kernel Auto Model](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/transformers/auto_model.py) | [MIT](https://github.com/casper-hansen/AutoAWQ/blob/main/LICENSE) |
| 8 | [llm.c](https://github.com/karpathy/llm.c) | We reference the design of end-to-end testing | [Liger Kernel Convergence Tests](https://github.com/linkedin/Liger-Kernel/tree/main/test/convergence) | [MIT](https://github.com/karpathy/llm.c/blob/master/LICENSE) |

Many thanks to the contributors to these projects for their invaluable work that helped make Liger possible.

## License

132 changes: 0 additions & 132 deletions benchmark/benchmarks_visualizer.ipynb

This file was deleted.
