bugfix: Fix Documentation Build Process (#655)
The documentation build has been failing since #654, probably because of the directory structure change (the API reference pages moved from docs/api/python/ to docs/api/, so the relative :toctree: paths no longer resolved); this PR fixes the issue.
It also improves the AOT compilation documentation.
yzh119 authored Dec 12, 2024
1 parent e07c4a3 commit a693507
Showing 15 changed files with 38 additions and 33 deletions.
25 changes: 15 additions & 10 deletions README.md
@@ -36,31 +36,36 @@ Using our PyTorch API is the easiest way to get started:

### Installation

-We provide prebuilt wheels for Linux and you can try out FlashInfer with the following command:
+We provide prebuilt wheels for Linux. You can install FlashInfer with the following command:

```bash
# For CUDA 12.4 & torch 2.4
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html
```
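
As a quick sanity check after installing a wheel (an illustrative sketch, not part of this diff; it only assumes the package imports and that your local torch build matches the wheel index you chose):

```python
import torch
import flashinfer  # raises ImportError if the wheel is incompatible with this environment

# The wheel index encodes the CUDA and torch versions it was built against
# (cu124 / torch2.4 above); these should line up with the local install.
print(torch.__version__, torch.version.cuda)
```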

-or you can build from source:
+We also offer nightly-built wheels to try the latest features from the main branch:

```bash
-git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
-cd flashinfer
-pip install -e .
+pip install flashinfer -i https://flashinfer.ai/whl/nightly/cu124/torch2.4
```

-to reduce binary size during build and testing:
+Alternatively, you can build FlashInfer from source:

```bash
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
-# ref https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html#torch.cuda.get_device_capability
-export TORCH_CUDA_ARCH_LIST=8.0
-pip install -e .
+pip install -e . -v
```
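
The removed comment above points at `torch.cuda.get_device_capability` as the way to pick a `TORCH_CUDA_ARCH_LIST` value. A minimal, illustrative sketch of deriving it for the local GPU (the value can also list several architectures separated by spaces, as the install docs show):

```python
import torch

# Compute capability of GPU 0, e.g. (8, 0) on A100 or (8, 9) on RTX 4090.
major, minor = torch.cuda.get_device_capability(0)
print(f"export TORCH_CUDA_ARCH_LIST={major}.{minor}")
```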

+By default, FlashInfer uses Just-In-Time (JIT) compilation for its kernels. To pre-compile essential kernels, set the environment variable `FLASHINFER_ENABLE_AOT=1` before running the installation command:
+
+```bash
+FLASHINFER_ENABLE_AOT=1 pip install -e . -v
+```
+
+For more details, refer to the [Install from Source documentation](https://docs.flashinfer.ai/installation.html#install-from-source).

### Trying it out

Below is a minimal example of using FlashInfer's single-request decode/append/prefill attention kernels:
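
The example referenced here is collapsed in this diff view. A minimal sketch of a single-request decode call, based on the `flashinfer.single_decode_with_kv_cache` API listed later in this commit (tensor shapes, dtypes, and sizes are illustrative assumptions):

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 32, 128
kv_len = 2048

# Query for one decode step and the K/V cache of a single request (NHD layout).
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

o = flashinfer.single_decode_with_kv_cache(q, k, v)  # output: [num_qo_heads, head_dim]
```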
@@ -118,7 +123,7 @@ FlashInfer also provides C++ API and TVM bindings, please refer to [documentation

## Adoption

-Currently FlashInfer is adopted by the following projects:
+We are thrilled to share that FlashInfer is being adopted by many cutting-edge projects, including but not limited to:
- [MLC-LLM](https://github.com/mlc-ai/mlc-llm)
- [Punica](https://github.com/punica-ai/punica)
- [SGLang](https://github.com/sgl-project/sglang)
2 changes: 1 addition & 1 deletion docs/api/python/activation.rst → docs/api/activation.rst
@@ -11,7 +11,7 @@ Up/Gate output activation
-------------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

silu_and_mul
gelu_tanh_and_mul
2 changes: 1 addition & 1 deletion docs/api/python/cascade.rst → docs/api/cascade.rst
@@ -11,7 +11,7 @@ Merge Attention States
----------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

merge_state
merge_state_in_place
2 changes: 1 addition & 1 deletion docs/api/python/decode.rst → docs/api/decode.rst
@@ -9,7 +9,7 @@ Single Request Decoding
-----------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

single_decode_with_kv_cache

2 changes: 1 addition & 1 deletion docs/api/python/gemm.rst → docs/api/gemm.rst
@@ -11,7 +11,7 @@ FP8 Batch GEMM
--------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

bmm_fp8

2 changes: 1 addition & 1 deletion docs/api/python/norm.rst → docs/api/norm.rst
@@ -8,7 +8,7 @@ Kernels for normalization layers.
.. currentmodule:: flashinfer.norm

.. autosummary::
-:toctree: _generate
+:toctree: ../generated

rmsnorm
fused_add_rmsnorm
2 changes: 1 addition & 1 deletion docs/api/python/page.rst → docs/api/page.rst
@@ -11,7 +11,7 @@ Append new K/V tensors to Paged KV-Cache
----------------------------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

append_paged_kv_cache
get_batch_indices_positions
2 changes: 1 addition & 1 deletion docs/api/python/prefill.rst → docs/api/prefill.rst
@@ -11,7 +11,7 @@ Single Request Prefill/Append Attention
---------------------------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

single_prefill_with_kv_cache
single_prefill_with_kv_cache_return_lse
1 change: 0 additions & 1 deletion docs/api/python/.gitignore

This file was deleted.

2 changes: 1 addition & 1 deletion docs/api/python/quantization.rst → docs/api/quantization.rst
@@ -8,7 +8,7 @@ Quantization related kernels.
.. currentmodule:: flashinfer.quantization

.. autosummary::
-:toctree: _generate
+:toctree: ../generated

packbits
segment_packbits
2 changes: 1 addition & 1 deletion docs/api/python/rope.rst → docs/api/rope.rst
@@ -8,7 +8,7 @@ Kernels for applying rotary embeddings.
.. currentmodule:: flashinfer.rope

.. autosummary::
-:toctree: _generate
+:toctree: ../generated

apply_rope_inplace
apply_llama31_rope_inplace
2 changes: 1 addition & 1 deletion docs/api/python/sampling.rst → docs/api/sampling.rst
@@ -8,7 +8,7 @@ Kernels for LLM sampling.
.. currentmodule:: flashinfer.sampling

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

sampling_from_probs
top_p_sampling_from_probs
File renamed without changes.
22 changes: 11 additions & 11 deletions docs/index.rst
@@ -27,14 +27,14 @@ FlashInfer is a library and kernel generator for Large Language Models that provides
:maxdepth: 2
:caption: PyTorch API Reference

-api/python/decode
-api/python/prefill
-api/python/cascade
-api/python/sparse
-api/python/page
-api/python/sampling
-api/python/gemm
-api/python/norm
-api/python/rope
-api/python/activation
-api/python/quantization
+api/decode
+api/prefill
+api/cascade
+api/sparse
+api/page
+api/sampling
+api/gemm
+api/norm
+api/rope
+api/activation
+api/quantization
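
With the toctree paths updated as above, rebuilding the docs locally is the quickest way to confirm that every entry resolves. A hedged sketch only; the requirements file path and build directory are assumptions, since this diff does not show the docs build setup:

```bash
pip install -r docs/requirements.txt   # assumed location of the Sphinx doc dependencies
cd docs
sphinx-build -b html . _build/html     # warnings flag unresolved toctree or autosummary paths
```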
3 changes: 2 additions & 1 deletion docs/installation.rst
@@ -118,6 +118,7 @@ AOT mode
- Core CUDA kernels are pre-compiled and included in the library, reducing runtime compilation overhead.
- If a required kernel is not pre-compiled, it will be compiled at runtime using JIT. AOT mode is recommended for production environments.

+JIT mode is the default installation mode. To enable AOT mode, set the environment variable ``FLASHINFER_ENABLE_AOT=1`` before installing FlashInfer.
You can follow the steps below to install FlashInfer from source code:

1. Clone the FlashInfer repository:
@@ -156,7 +157,7 @@ You can follow the steps below to install FlashInfer from source code:
cd flashinfer
TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a" FLASHINFER_ENABLE_AOT=1 pip install --no-build-isolation --verbose --editable .
-5. Create FlashInfer distributions
+5. Create FlashInfer distributions (optional):

.. tabs::

