feat(docs): overhaul of the documentation
baptistecolle committed Jan 15, 2025
1 parent c0dee09 commit fc2dae5
Showing 27 changed files with 1,047 additions and 115 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/doc-build.yml
@@ -53,7 +53,9 @@ jobs:
- name: Make documentation
shell: bash
run: |
doc-builder notebook-to-mdx examples/ --output_dir docs/source/howto/ --open_notebook_prefix https://colab.research.google.com/github/huggingface/optimum-tpu/blob/main
python docs/scripts/add_examples_to_docs.py
doc-builder build optimum.tpu docs/source/ --repo_name optimum-tpu --build_dir tpu-doc-build/ --version ${{ env.VERSION }} --version_tag_suffix "" --html --clean
cd tpu-doc-build/
mv optimum.tpu optimum-tpu
doc-builder push optimum-tpu --doc_build_repo_id "hf-doc-build/doc-build" --token "${{ secrets.HF_DOC_BUILD_PUSH }}" --commit_msg "Updated with commit $COMMIT_SHA See: https://github.com/huggingface/optimum-tpu/commit/$COMMIT_SHA" --n_retries 5
doc-builder push optimum-tpu --doc_build_repo_id "hf-doc-build/doc-build" --token "${{ secrets.HF_DOC_BUILD_PUSH }}" --commit_msg "Updated with commit $COMMIT_SHA See: https://github.com/huggingface/optimum-tpu/commit/$COMMIT_SHA" --n_retries 5
2 changes: 2 additions & 0 deletions .github/workflows/doc-pr-build.yml
@@ -38,6 +38,8 @@ jobs:
- name: Make documentation
shell: bash
run: |
doc-builder notebook-to-mdx examples/ --output_dir docs/source/howto/ --open_notebook_prefix https://colab.research.google.com/github/huggingface/optimum-tpu/blob/main
python docs/scripts/add_examples_to_docs.py
doc-builder build optimum.tpu docs/source/ --repo_name optimum-tpu --build_dir tpu-doc-build/ --version pr_${{ env.PR_NUMBER }} --version_tag_suffix "" --html --clean
- name: Save commit_sha & pr_number
5 changes: 4 additions & 1 deletion .gitignore
@@ -135,4 +135,7 @@ dmypy.json
.vscode
.idea/

jetstream-pt-deps
jetstream-pt-deps

# Optimum TPU artifacts
tpu-doc-build/
3 changes: 3 additions & 0 deletions Makefile
@@ -117,3 +117,6 @@ tgi_test: test_installs tgi_server
tgi_docker_test:
python -m pip install -r text-generation-inference/integration-tests/requirements.txt
python -m pytest -sv text-generation-inference/integration-tests

preview_doc:
doc-builder preview optimum-tpu docs/source --not_python_module
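For local use, a hedged sketch of the preview workflow (assuming the `doc-builder` CLI comes from the `hf-doc-builder` package; the Makefile target simply wraps the command above):

```bash
# Sketch: preview the documentation locally
python -m pip install hf-doc-builder   # provides the doc-builder CLI (assumed package name)
make preview_doc                       # runs: doc-builder preview optimum-tpu docs/source --not_python_module
```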
29 changes: 29 additions & 0 deletions docs/scripts/auto-generate-examples.py
@@ -0,0 +1,29 @@
import yaml

# Read the examples list
with open('docs/scripts/examples_list.yml', 'r') as f:
examples = yaml.safe_load(f)

# Read the main toctree
with open('docs/source/_toctree.yml', 'r') as f:
toc = yaml.safe_load(f)

# Find the howto section and insert before more_examples
# Iterate through the list to find the sections with howto
for item in toc:
if isinstance(item, dict) and 'sections' in item:
for section in item['sections']:
if isinstance(section, dict) and 'sections' in section:
howto_items = section['sections']
for i, subitem in enumerate(howto_items):
if subitem.get('local') == 'howto/more_examples':
# Insert the new examples before this position
for example in reversed(examples):
howto_items.insert(i, example)
break

# Write back the modified toctree
with open('docs/source/_toctree.yml', 'w') as f:
yaml.dump(toc, f, sort_keys=False, allow_unicode=True, default_flow_style=False)

print("Added examples to the howto section of the toctree")
4 changes: 4 additions & 0 deletions docs/scripts/examples_list.yml
@@ -0,0 +1,4 @@
- local: howto/gemma_tuning.md
title: Gemma Fine-Tuning Example
- local: howto/llama_tuning.md
title: Llama Fine-Tuning Example
40 changes: 34 additions & 6 deletions docs/source/_toctree.yml
@@ -3,19 +3,47 @@
title: 🤗 Optimum-TPU
- local: supported-architectures
title: Supported Models
- local: installation
title: Installation
- local: optimum_container
title: Optimum Container
- sections:
- local: tutorials/overview
title: Overview
- local: tutorials/tpu_setup
title: TPU Setup
- local: tutorials/inference_on_tpu
title: Inference on TPU
- local: tutorials/training_on_tpu
title: Training on TPU
title: Tutorials
- sections:
- local: howto/overview
title: Overview
- local: howto/deploy
title: Deploying a Google Cloud TPU instance
- local: howto/gcloud_cli
title: Using the GCloud CLI for TPU deployment and SSH connection
- local: howto/serving
title: Deploying a TGI server on a Google Cloud TPU instance
- local: howto/training
title: Training on a Google Cloud TPU instance
- local: howto/deploy_instance_on_ie
title: How to Deploy a TGI server on IE
- local: howto/advanced-tgi-serving
title: Advanced TGI Server Configuration
- local: howto/more_examples
title: Find More Examples
title: How-To Guides
- sections:
- local: conceptual_guides/tpu_hardware_support
title: TPU Hardware Support
- local: conceptual_guides/difference_between_jetstream_and_xla
title: Difference between Jetstream and XLA
title: Conceptual Guides
- sections:
- local: reference/fsdp_v2
title: FSDPv2
- local: reference/tgi_advanced_options
title: TGI Advanced Options
title: Reference
- sections:
- local: contributing
title: Contributing
title: Contributing
title: Optimum-TPU
isExpanded: true
17 changes: 17 additions & 0 deletions docs/source/conceptual_guides/difference_between_jetstream_and_xla.mdx
@@ -0,0 +1,17 @@
# Differences between JetStream and PyTorch XLA

| Feature | JetStream | PyTorch XLA |
|---------|-----------|-------------|
| Training | ❌ | ✅ |
| Serving | ✅ | ✅ |
| Performance | Higher serving performance | Standard performance |
| Flexibility | Limited to serving | Full PyTorch ecosystem |
| Use Case | Production inference | Development and training |
| Integration | Optimized for deployment | Standard PyTorch workflow |

**Note:** By default, optimum-tpu uses PyTorch XLA for training and JetStream for serving.

You can find more information about:
- PyTorch XLA: https://pytorch.org/xla/ and https://github.com/pytorch/xla
- JetStream: https://github.com/google/jaxon/tree/main/jetstream
54 changes: 54 additions & 0 deletions docs/source/conceptual_guides/tpu_hardware_support.mdx
@@ -0,0 +1,54 @@
# TPU hardware support
Optimum-TPU supports and is optimized for v5e, v5p, and v6e TPUs.

## When to use TPU
TPUs excel at large-scale machine learning workloads with matrix computations, extended training periods, and large batch sizes. In contrast, GPUs offer more flexibility for models with custom operations or mixed CPU/GPU workloads. TPUs aren't ideal for workloads needing frequent branching, high-precision arithmetic, or custom training loop operations. More information can be found at https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus

## TPU naming convention
The TPU naming follows this format: `<tpu_version>-<number_of_tpus>`

TPU versions available:
- v5litepod (v5e)
- v5p
- v6e

For example, a v5litepod-8 is a v5e TPU with 8 TPU chips.
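
As an illustration, here is a hedged sketch of creating such an instance with the gcloud CLI (the instance name and zone are placeholders; the `--version` images are the ones listed in the Recommended Runtime section below):

```bash
# Sketch only: create a v5e TPU VM with 8 chips (v5litepod-8).
# Replace the name and zone with your own values; check the TPU
# regions/zones documentation for where v5e is available.
gcloud compute tpus tpu-vm create my-optimum-tpu \
  --zone=us-west4-a \
  --accelerator-type=v5litepod-8 \
  --version=v2-alpha-tpuv5-lite
```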

## Memory on TPU
The HBM (High Bandwidth Memory) capacity per chip is 16 GB for v5e, 95 GB for v5p, and 32 GB for v6e. So a v5e-8 (v5litepod-8) has 16 GB × 8 = 128 GB of HBM memory.
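
As a rough, hedged illustration of what this capacity means for serving (using the standard approximations of 2 bytes per parameter in bf16 and 1 byte in int8; the real footprint also includes the KV cache and activations):

```bash
# Back-of-envelope sketch: does a model fit in a v5e-8's HBM?
PARAMS_B=70       # model size in billions of parameters (example value)
CHIP_GB=16        # HBM per v5e chip
NUM_CHIPS=8       # v5litepod-8
echo "Total HBM:      $((CHIP_GB * NUM_CHIPS)) GB"
echo "bf16 weights ~  $((PARAMS_B * 2)) GB"   # 2 bytes per parameter
echo "int8 weights ~  $((PARAMS_B * 1)) GB"   # 1 byte per parameter
```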

## Performance on TPU
There are several key metrics to consider when evaluating TPU performance:
- Peak compute per chip (bf16/int8): measures the maximum theoretical computing power in floating point or integer operations per second. Higher values indicate faster processing capability for machine learning workloads.
- HBM (High Bandwidth Memory) metrics:
  - Capacity: amount of available high-speed memory per chip
  - Bandwidth: speed at which data can be read from or written to memory
  These affect how much data can be processed and how quickly it can be accessed.
- Inter-chip interconnect (ICI) bandwidth: determines how fast TPU chips can communicate with each other, which is crucial for distributed training across multiple chips.
- Pod-level metrics:
  - Peak compute per Pod: total computing power when multiple chips work together
  These indicate performance at scale for large training or serving jobs.

The actual performance you achieve will depend on your specific workload characteristics and how well it matches these hardware capabilities.

## Recommended Runtime for TPU

When creating the TPU VM, use one of the following TPU VM base images for optimum-tpu:
- v2-alpha-tpuv6e (TPU v6e) (recommended)
- v2-alpha-tpuv5 (TPU v5p) (recommended)
- v2-alpha-tpuv5-lite (TPU v5e) (recommended)
- tpu-ubuntu2204-base (default)

For installation instructions, refer to our [TPU setup tutorial](./tutorials/tpu_setup). We recommend using the *alpha* images with optimum-tpu, as optimum-tpu is tested and optimized for them.

More information at https://cloud.google.com/tpu/docs/runtimes#pytorch_and_jax

## Next steps
For more information on the different TPU hardware, see:
- https://cloud.google.com/tpu/docs/v6e
- https://cloud.google.com/tpu/docs/v5p
- https://cloud.google.com/tpu/docs/v5e

Pricing information can be found at https://cloud.google.com/tpu/pricing

TPU availability can be found at https://cloud.google.com/tpu/docs/regions-zones
95 changes: 95 additions & 0 deletions docs/source/contributing.mdx
@@ -0,0 +1,95 @@
# Contributing to Optimum TPU

We're excited that you're interested in contributing to Optimum TPU! Whether you're fixing bugs, adding new features, improving documentation, or sharing your experiences, your contributions are highly valued 😄

## Getting Started

1. Fork and clone the repository:
```bash
git clone https://github.com/YOUR_USERNAME/optimum-tpu.git
cd optimum-tpu
```

2. Install the package locally:
```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install . -f https://storage.googleapis.com/libtpu-releases/index.html

```
3. Install testing dependencies:
```bash
make test_installs
```

## Development Tools

The project includes a comprehensive Makefile with commands for various development tasks:

### Testing
```bash
make tests # Run all tests
make tgi_test # Run TGI tests with PyTorch/XLA
make tgi_test_jetstream # Run TGI tests with Jetstream backend
make tgi_docker_test # Run TGI integration tests in Docker
```

### Code Quality
```bash
make style # Auto-fix code style issues
make style_check # Check code style without fixing
```

### Documentation
```bash
make preview_doc # Preview documentation locally
```

### Docker Images
```bash
make tpu-tgi # Build TGI Docker image
make tpu-tgi-ie # Build TGI inference endpoint image
make tpu-tgi-gcp # Build TGI Google Cloud image
```

### TGI Development
When working on Text Generation Inference (the `/text-generation-inference` folder), you will also want to build the TGI image from scratch, as discussed in the manual image building section of the [serving how-to guide](./howto/serving).

1. Build the standalone server:
```bash
make tgi_server
```

## Pull Request Process

1. Create a new branch:
```bash
git checkout -b your-feature-name
```

2. Make your changes

3. Run tests:
```bash
make tests
# Run more specialized tests if needed, such as make tgi_test, make tgi_test_jetstream, or make tgi_docker_test
make style_check
```

4. Submit your PR with:
- Clear description of changes
- Test results
- Documentation updates if needed

5. Check that the CI tests pass:
- Verify all CI workflows have passed
- Address any CI failures

## Need Help?

- Open an issue for bugs or feature requests
- Check the [documentation](https://huggingface.co/docs/optimum/tpu/overview)

## License

By contributing to Optimum TPU, you agree that your contributions will be licensed under the Apache License, Version 2.0.
71 changes: 71 additions & 0 deletions docs/source/howto/advanced-tgi-serving.mdx
@@ -0,0 +1,71 @@
# Advanced Options for the TGI Server

## Jetstream Pytorch and Pytorch XLA backends

[Jetstream Pytorch](https://github.com/AI-Hypercomputer/jetstream-pytorch) is a highly optimized Pytorch engine for serving LLMs on Cloud TPU. This engine is selected by default if the dependency is available.

We recommend using Jetstream with TGI for the best performance. If for some reason you want to use the Pytorch/XLA backend instead, you can set the `JETSTREAM_PT_DISABLE=1` environment variable.

For more information, see our discussion of the [differences between Jetstream and PyTorch XLA](./conceptual_guides/difference_between_jetstream_and_xla).
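
As a sketch of what this looks like in practice (reusing the flags and image tag from the Docker example in the memory section below), disabling Jetstream and falling back to PyTorch/XLA could be done like this:

```bash
# Sketch: serve with the PyTorch/XLA backend instead of Jetstream
docker run -p 8080:80 \
  --shm-size 16G \
  --privileged \
  --net host \
  -v ~/hf_data:/data \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  -e JETSTREAM_PT_DISABLE=1 \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it \
  --max-input-length 512 \
  --max-total-tokens 1024
```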

## Quantization
When using the Jetstream Pytorch engine, it is possible to enable quantization to reduce the memory footprint and increase throughput. To enable quantization, set the `QUANTIZATION=1` environment variable. For instance, on a 2x4 TPU v5e (16 GB per chip × 8 chips = 128 GB per pod), you can serve models up to 70B parameters, such as Llama 3.3-70B. Quantization is done in `int8` on the fly as the weights are loaded. As with any quantization option, you can expect a small drop in model accuracy. Without quantization enabled, the model is served in bf16.

## How to solve memory issues

If you encounter a `Backend(NotEnoughMemory(2048))` error, here are some options that can help reduce memory usage in TGI:

```bash
docker run -p 8080:80 \
--shm-size 16G \
--privileged \
--net host \
-e QUANTIZATION=1 \
-e MAX_BATCH_SIZE=2 \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e SKIP_WARMUP=1 \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
--max-total-tokens 1024 \
--max-batch-prefill-tokens 512 \
--max-batch-total-tokens 1024
```

- `-e QUANTIZATION=1`: enables quantization, which should reduce memory requirements by almost half
- `-e MAX_BATCH_SIZE=n`: manually limits the batch size
- `--max-input-length`: Maximum input sequence length
- `--max-total-tokens`: Maximum combined input and output tokens
- `--max-batch-prefill-tokens`: Maximum tokens for batch processing
- `--max-batch-total-tokens`: Maximum total tokens in a batch

To reduce memory usage, try smaller values for `--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, and `--max-batch-total-tokens`.

<Tip warning={true}>
`max-batch-prefill-tokens` must be ≤ `max-input-length` × `MAX_BATCH_SIZE`; otherwise the configuration does not make sense and you will get an error. If `max-batch-prefill-tokens` were larger, you could never fill a batch with that many prefill tokens, so no request could be processed.
</Tip>
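
A quick, hedged sanity check of that constraint, using the values from the example command above (plain shell arithmetic, not an optimum-tpu tool):

```bash
# Sketch: verify max-batch-prefill-tokens <= max-input-length * MAX_BATCH_SIZE
MAX_BATCH_SIZE=2
MAX_INPUT_LENGTH=512
MAX_BATCH_PREFILL_TOKENS=512
if [ "$MAX_BATCH_PREFILL_TOKENS" -le $((MAX_INPUT_LENGTH * MAX_BATCH_SIZE)) ]; then
  echo "OK: $MAX_BATCH_PREFILL_TOKENS <= $((MAX_INPUT_LENGTH * MAX_BATCH_SIZE))"
else
  echo "Invalid: max-batch-prefill-tokens exceeds max-input-length * MAX_BATCH_SIZE"
fi
```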

## Sharding
Sharding is done automatically by the TGI server, so your model uses all the TPUs that are available. We use tensor parallelism, so the layers are automatically split across all available TPUs. However, the TGI router will only see one shard.

More information on tensor parallelism can be found at https://huggingface.co/docs/text-generation-inference/conceptual/tensor_parallelism

## Understanding the configuration

Key parameters explained:
- `--shm-size 16G`: Shared memory allocation
- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
- `-v ~/hf_data:/data`: Volume mount for model storage
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production)

<Tip warning={true}>
`--privileged --shm-size 16G --net host` is required, as specified in https://github.com/pytorch/xla
</Tip>

## Next steps
Please check the [TGI docs](https://huggingface.co/docs/text-generation-inference) for more TGI server configuration options.