bugfix: Fix Documentation Build Process (#655)
The documentation build has been failing since #654, probably because of the directory structure change (the API reference pages moved from docs/api/python/ to docs/api/, so the relative :toctree: paths no longer resolved); this PR fixes the issue.
It also improves the AOT compilation documentation.
yzh119 authored Dec 12, 2024
1 parent e07c4a3 commit a693507
Showing 15 changed files with 38 additions and 33 deletions.
25 changes: 15 additions & 10 deletions README.md
@@ -36,31 +36,36 @@ Using our PyTorch API is the easiest way to get started:

### Installation

-We provide prebuilt wheels for Linux and you can try out FlashInfer with the following command:
+We provide prebuilt wheels for Linux. You can install FlashInfer with the following command:

```bash
# For CUDA 12.4 & torch 2.4
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html
```
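
As a quick sanity check after installing a wheel (an illustrative sketch, not part of this diff; it only assumes the package imports and that your local torch build matches the wheel index you chose):

```python
import torch
import flashinfer  # raises ImportError if the wheel is incompatible with this environment

# The wheel index encodes the CUDA and torch versions it was built against
# (cu124 / torch2.4 above); these should line up with the local install.
print(torch.__version__, torch.version.cuda)
```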

-or you can build from source:
+We also offer nightly-built wheels to try the latest features from the main branch:

```bash
-git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
-cd flashinfer
-pip install -e .
+pip install flashinfer -i https://flashinfer.ai/whl/nightly/cu124/torch2.4
```

-to reduce binary size during build and testing:
+Alternatively, you can build FlashInfer from source:

```bash
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
-# ref https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html#torch.cuda.get_device_capability
-export TORCH_CUDA_ARCH_LIST=8.0
-pip install -e .
+pip install -e . -v
```
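
The removed comment above points at `torch.cuda.get_device_capability` as the way to pick a `TORCH_CUDA_ARCH_LIST` value. A minimal, illustrative sketch of deriving it for the local GPU (the value can also list several architectures separated by spaces, as the install docs show):

```python
import torch

# Compute capability of GPU 0, e.g. (8, 0) on A100 or (8, 9) on RTX 4090.
major, minor = torch.cuda.get_device_capability(0)
print(f"export TORCH_CUDA_ARCH_LIST={major}.{minor}")
```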

+By default, FlashInfer uses Just-In-Time (JIT) compilation for its kernels. To pre-compile essential kernels, set the environment variable `FLASHINFER_ENABLE_AOT=1` before running the installation command:
+
+```bash
+FLASHINFER_ENABLE_AOT=1 pip install -e . -v
+```
+
+For more details, refer to the [Install from Source documentation](https://docs.flashinfer.ai/installation.html#install-from-source).

### Trying it out

Below is a minimal example of using FlashInfer's single-request decode/append/prefill attention kernels:
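
The example referenced here is collapsed in this diff view. A minimal sketch of a single-request decode call, based on the `flashinfer.single_decode_with_kv_cache` API listed later in this commit (tensor shapes, dtypes, and sizes are illustrative assumptions):

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 32, 128
kv_len = 2048

# Query for one decode step and the K/V cache of a single request (NHD layout).
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

o = flashinfer.single_decode_with_kv_cache(q, k, v)  # output: [num_qo_heads, head_dim]
```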
@@ -118,7 +123,7 @@ FlashInfer also provides C++ API and TVM bindings, please refer to [documentation

## Adoption

-Currently FlashInfer is adopted by the following projects:
+We are thrilled to share that FlashInfer is being adopted by many cutting-edge projects, including but not limited to:
- [MLC-LLM](https://github.com/mlc-ai/mlc-llm)
- [Punica](https://github.com/punica-ai/punica)
- [SGLang](https://github.com/sgl-project/sglang)
2 changes: 1 addition & 1 deletion docs/api/python/activation.rst → docs/api/activation.rst
@@ -11,7 +11,7 @@ Up/Gate output activation
-------------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

silu_and_mul
gelu_tanh_and_mul
2 changes: 1 addition & 1 deletion docs/api/python/cascade.rst → docs/api/cascade.rst
@@ -11,7 +11,7 @@ Merge Attention States
----------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

merge_state
merge_state_in_place
2 changes: 1 addition & 1 deletion docs/api/python/decode.rst → docs/api/decode.rst
@@ -9,7 +9,7 @@ Single Request Decoding
-----------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

single_decode_with_kv_cache

2 changes: 1 addition & 1 deletion docs/api/python/gemm.rst → docs/api/gemm.rst
@@ -11,7 +11,7 @@ FP8 Batch GEMM
--------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

bmm_fp8

2 changes: 1 addition & 1 deletion docs/api/python/norm.rst → docs/api/norm.rst
@@ -8,7 +8,7 @@ Kernels for normalization layers.
.. currentmodule:: flashinfer.norm

.. autosummary::
-:toctree: _generate
+:toctree: ../generated

rmsnorm
fused_add_rmsnorm
2 changes: 1 addition & 1 deletion docs/api/python/page.rst → docs/api/page.rst
@@ -11,7 +11,7 @@ Append new K/V tensors to Paged KV-Cache
----------------------------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

append_paged_kv_cache
get_batch_indices_positions
2 changes: 1 addition & 1 deletion docs/api/python/prefill.rst → docs/api/prefill.rst
@@ -11,7 +11,7 @@ Single Request Prefill/Append Attention
---------------------------------------

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

single_prefill_with_kv_cache
single_prefill_with_kv_cache_return_lse
1 change: 0 additions & 1 deletion docs/api/python/.gitignore

This file was deleted.

2 changes: 1 addition & 1 deletion docs/api/python/quantization.rst → docs/api/quantization.rst
@@ -8,7 +8,7 @@ Quantization related kernels.
.. currentmodule:: flashinfer.quantization

.. autosummary::
-:toctree: _generate
+:toctree: ../generated

packbits
segment_packbits
2 changes: 1 addition & 1 deletion docs/api/python/rope.rst → docs/api/rope.rst
@@ -8,7 +8,7 @@ Kernels for applying rotary embeddings.
.. currentmodule:: flashinfer.rope

.. autosummary::
-:toctree: _generate
+:toctree: ../generated

apply_rope_inplace
apply_llama31_rope_inplace
2 changes: 1 addition & 1 deletion docs/api/python/sampling.rst → docs/api/sampling.rst
@@ -8,7 +8,7 @@ Kernels for LLM sampling.
.. currentmodule:: flashinfer.sampling

.. autosummary::
-:toctree: ../../generated
+:toctree: ../generated

sampling_from_probs
top_p_sampling_from_probs
File renamed without changes.
22 changes: 11 additions & 11 deletions docs/index.rst
@@ -27,14 +27,14 @@ FlashInfer is a library and kernel generator for Large Language Models that provides
:maxdepth: 2
:caption: PyTorch API Reference

-api/python/decode
-api/python/prefill
-api/python/cascade
-api/python/sparse
-api/python/page
-api/python/sampling
-api/python/gemm
-api/python/norm
-api/python/rope
-api/python/activation
-api/python/quantization
+api/decode
+api/prefill
+api/cascade
+api/sparse
+api/page
+api/sampling
+api/gemm
+api/norm
+api/rope
+api/activation
+api/quantization
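
With the toctree paths updated as above, rebuilding the docs locally is the quickest way to confirm that every entry resolves. A hedged sketch only; the requirements file path and build directory are assumptions, since this diff does not show the docs build setup:

```bash
pip install -r docs/requirements.txt   # assumed location of the Sphinx doc dependencies
cd docs
sphinx-build -b html . _build/html     # warnings flag unresolved toctree or autosummary paths
```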
3 changes: 2 additions & 1 deletion docs/installation.rst
@@ -118,6 +118,7 @@ AOT mode
- Core CUDA kernels are pre-compiled and included in the library, reducing runtime compilation overhead.
- If a required kernel is not pre-compiled, it will be compiled at runtime using JIT. AOT mode is recommended for production environments.

+JIT mode is the default installation mode. To enable AOT mode, set the environment variable ``FLASHINFER_ENABLE_AOT=1`` before installing FlashInfer.
You can follow the steps below to install FlashInfer from source code:

1. Clone the FlashInfer repository:
@@ -156,7 +157,7 @@ You can follow the steps below to install FlashInfer from source code:
cd flashinfer
TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a" FLASHINFER_ENABLE_AOT=1 pip install --no-build-isolation --verbose --editable .
-5. Create FlashInfer distributions
+5. Create FlashInfer distributions (optional):

.. tabs::

