feat(docs): overhaul of the documentation
1 parent c0dee09 · commit fc2dae5
Showing 27 changed files with 1,047 additions and 115 deletions.
@@ -135,4 +135,7 @@ dmypy.json
.vscode
.idea/

jetstream-pt-deps

# Optimum TPU artifacts
tpu-doc-build/
@@ -0,0 +1,29 @@
import yaml

# Read the examples list
with open('docs/scripts/examples_list.yml', 'r') as f:
    examples = yaml.safe_load(f)

# Read the main toctree
with open('docs/source/_toctree.yml', 'r') as f:
    toc = yaml.safe_load(f)

# Find the howto section and insert the examples before the more_examples entry.
# Iterate through the top-level items to find the nested sections containing howto.
for item in toc:
    if isinstance(item, dict) and 'sections' in item:
        for section in item['sections']:
            if isinstance(section, dict) and 'sections' in section:
                howto_items = section['sections']
                for i, subitem in enumerate(howto_items):
                    if subitem.get('local') == 'howto/more_examples':
                        # Insert the new examples before this position; iterating in
                        # reverse keeps them in their original order after insertion.
                        for example in reversed(examples):
                            howto_items.insert(i, example)
                        break

# Write back the modified toctree
with open('docs/source/_toctree.yml', 'w') as f:
    yaml.dump(toc, f, sort_keys=False, allow_unicode=True, default_flow_style=False)

print("Added examples to the howto section of the toctree")
@@ -0,0 +1,4 @@
- local: howto/gemma_tuning.md
  title: Gemma Fine-Tuning Example
- local: howto/llama_tuning.md
  title: Llama Fine-Tuning Example
docs/source/conceptual_guides/difference_between_jetstream_and_xla.mdx (17 additions, 0 deletions)
@@ -0,0 +1,17 @@
# Differences between JetStream and PyTorch XLA

| Feature | JetStream | PyTorch XLA |
|---------|-----------|-------------|
| Training | ❌ | ✅ |
| Serving | ✅ | ✅ |
| Performance | Higher serving performance | Standard performance |
| Flexibility | Limited to serving | Full PyTorch ecosystem |
| Use Case | Production inference | Development and training |
| Integration | Optimized for deployment | Standard PyTorch workflow |

**Note:**
By default, optimum-tpu uses PyTorch XLA for training and JetStream for serving.

You can find more information about:
- PyTorch XLA: https://pytorch.org/xla/ and https://github.com/pytorch/xla
- JetStream: https://github.com/google/jaxon/tree/main/jetstream
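To make the distinction concrete, here is a sketch of how the serving backend is selected on the TGI container. The image tag, model, and Docker flags are taken from the serving examples elsewhere in these docs, and `JETSTREAM_PT_DISABLE` is the switch described in the advanced TGI options guide:

```bash
# Default: the JetStream Pytorch engine is selected automatically when its dependencies are installed.
docker run -p 8080:80 --shm-size 16G --privileged --net host \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it

# Force the Pytorch/XLA backend instead of JetStream:
docker run -p 8080:80 --shm-size 16G --privileged --net host \
  -e JETSTREAM_PT_DISABLE=1 \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it
```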
@@ -0,0 +1,54 @@
# TPU hardware support
Optimum-TPU supports and is optimized for v5e, v5p, and v6e TPUs.

## When to use TPU
TPUs excel at large-scale machine learning workloads with matrix computations, extended training periods, and large batch sizes. In contrast, GPUs offer more flexibility for models with custom operations or mixed CPU/GPU workloads. TPUs are not ideal for workloads that need frequent branching, high-precision arithmetic, or custom training-loop operations. More information can be found at https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus

## TPU naming convention
The TPU naming follows this format: `<tpu_version>-<number_of_tpus>`

TPU versions available:
- v5litepod (v5e)
- v5p
- v6e

For example, a v5litepod-8 is a v5e TPU with 8 TPU chips.

## Memory on TPU
The HBM (High Bandwidth Memory) capacity per chip is 16 GB for v5e and v5p, and 32 GB for v6e. A v5e-8 (v5litepod-8) therefore has 16 GB × 8 = 128 GB of HBM.
## Performance on TPU
There are several key metrics to consider when evaluating TPU performance:
- Peak compute per chip (bf16/int8): Measures the maximum theoretical computing power in floating-point or integer operations per second. Higher values indicate faster processing capability for machine learning workloads.
- HBM (High Bandwidth Memory) metrics:
  - Capacity: Amount of available high-speed memory per chip
  - Bandwidth: Speed at which data can be read from or written to memory
  These affect how much data can be processed and how quickly it can be accessed.
- Inter-chip interconnect (ICI) bandwidth: Determines how fast TPU chips can communicate with each other, which is crucial for distributed training across multiple chips.
- Pod-level metrics:
  - Peak compute per Pod: Total computing power when multiple chips work together
  These indicate performance at scale for large training or serving jobs.

The actual performance you achieve will depend on your specific workload characteristics and how well it matches these hardware capabilities.
## Recommended Runtime for TPU

When creating the TPU VM, use one of the following TPU VM base images for optimum-tpu:
- v2-alpha-tpuv6e (TPU v6e) (recommended)
- v2-alpha-tpuv5 (TPU v5p) (recommended)
- v2-alpha-tpuv5-lite (TPU v5e) (recommended)
- tpu-ubuntu2204-base (default)

For installation instructions, refer to our [TPU setup tutorial](./tutorials/tpu_setup). We recommend using the *alpha* images with optimum-tpu, as optimum-tpu is tested and optimized for them.
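As an illustration, creating a v5e TPU VM with one of the recommended runtimes could look roughly like the sketch below; the VM name and zone are placeholders, and the flags follow the standard `gcloud compute tpus tpu-vm create` command:

```bash
# A minimal sketch; adjust the name, zone, accelerator type, and runtime version to your setup.
gcloud compute tpus tpu-vm create my-optimum-tpu \
  --zone=us-west4-a \
  --accelerator-type=v5litepod-8 \
  --version=v2-alpha-tpuv5-lite
```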
More information at https://cloud.google.com/tpu/docs/runtimes#pytorch_and_jax

## Next steps
For more information on the different TPU hardware generations, you can look at:
- https://cloud.google.com/tpu/docs/v6e
- https://cloud.google.com/tpu/docs/v5p
- https://cloud.google.com/tpu/docs/v5e

Pricing information can be found at https://cloud.google.com/tpu/pricing

TPU availability can be found at https://cloud.google.com/tpu/docs/regions-zones
@@ -0,0 +1,95 @@
# Contributing to Optimum TPU

We're excited that you're interested in contributing to Optimum TPU! Whether you're fixing bugs, adding new features, improving documentation, or sharing your experiences, your contributions are highly valued 😄

## Getting Started

1. Fork and clone the repository:
```bash
git clone https://github.com/YOUR_USERNAME/optimum-tpu.git
cd optimum-tpu
```

2. Install the package locally:
```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install . -f https://storage.googleapis.com/libtpu-releases/index.html
```

3. Install testing dependencies:
```bash
make test_installs
```
## Development Tools

The project includes a comprehensive Makefile with commands for various development tasks:

### Testing
```bash
make tests               # Run all tests
make tgi_test            # Run TGI tests with PyTorch/XLA
make tgi_test_jetstream  # Run TGI tests with the Jetstream backend
make tgi_docker_test     # Run TGI integration tests in Docker
```

### Code Quality
```bash
make style        # Auto-fix code style issues
make style_check  # Check code style without fixing
```

### Documentation
```bash
make preview_doc  # Preview documentation locally
```

### Docker Images
```bash
make tpu-tgi      # Build TGI Docker image
make tpu-tgi-ie   # Build TGI inference endpoint image
make tpu-tgi-gcp  # Build TGI Google Cloud image
```
### TGI Development
When working on Text Generation Inference (the `/text-generation-inference` folder), you will also want to build the TGI image from scratch, as discussed in the manual image-building section of the [serving how-to guide](./howto/serving).

1. Build the standalone server:
```bash
make tgi_server
```
## Pull Request Process

1. Create a new branch:
```bash
git checkout -b your-feature-name
```

2. Make your changes

3. Run tests:
```bash
make tests
# Run more specialized tests if needed, such as make tgi_test, make tgi_test_jetstream, or make tgi_docker_test
make style_check
```

4. Submit your PR with:
- A clear description of the changes
- Test results
- Documentation updates if needed

5. Check that the CI tests pass:
- Verify all CI workflows have passed
- Address any CI failures

## Need Help?

- Open an issue for bugs or feature requests
- Check the [documentation](https://huggingface.co/docs/optimum/tpu/overview)

## License

By contributing to Optimum TPU, you agree that your contributions will be licensed under the Apache License, Version 2.0.
@@ -0,0 +1,71 @@
# Advanced Options for the TGI server

## Jetstream Pytorch and Pytorch XLA backends

[Jetstream Pytorch](https://github.com/AI-Hypercomputer/jetstream-pytorch) is a highly optimized Pytorch engine for serving LLMs on Cloud TPU. This engine is selected by default if its dependencies are available.

We recommend using Jetstream with TGI for the best performance. If for some reason you want to use the Pytorch/XLA backend instead, you can set the `JETSTREAM_PT_DISABLE=1` environment variable.

For more information, see our discussion of the [differences between Jetstream and Pytorch XLA](./conceptual_guide/difference).

## Quantization
When using the Jetstream Pytorch engine, it is possible to enable quantization to reduce the memory footprint and increase throughput. To enable quantization, set the `QUANTIZATION=1` environment variable, as in the example in the next section. For instance, on a 2x4 TPU v5e (16 GB per chip × 8 = 128 GB per pod), you can serve models with up to 70B parameters, such as Llama 3.3-70B. Quantization is done in `int8` on the fly as the weights load. As with any quantization option, you can expect a small drop in model accuracy. Without quantization enabled, the model is served in bf16.
## How to solve memory requirements

If you encounter `Backend(NotEnoughMemory(2048))`, here are some solutions that can help reduce memory usage in TGI:

```bash
docker run -p 8080:80 \
  --shm-size 16G \
  --privileged \
  --net host \
  -e QUANTIZATION=1 \
  -e MAX_BATCH_SIZE=2 \
  -e LOG_LEVEL=text_generation_router=debug \
  -v ~/hf_data:/data \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  -e SKIP_WARMUP=1 \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it \
  --max-input-length 512 \
  --max-total-tokens 1024 \
  --max-batch-prefill-tokens 512 \
  --max-batch-total-tokens 1024
```
- `-e QUANTIZATION=1`: Enables quantization, which should reduce memory requirements by almost half
- `-e MAX_BATCH_SIZE=n`: Manually caps the batch size
- `--max-input-length`: Maximum input sequence length
- `--max-total-tokens`: Maximum combined input and output tokens
- `--max-batch-prefill-tokens`: Maximum tokens for batch prefill processing
- `--max-batch-total-tokens`: Maximum total tokens in a batch

To reduce memory usage, try smaller values for `--max-input-length`, `--max-total-tokens`, `--max-batch-prefill-tokens`, and `--max-batch-total-tokens`.

<Tip warning={true}>
Make sure that `max-batch-prefill-tokens ≤ max-input-length * MAX_BATCH_SIZE`; otherwise the configuration does not make sense and the server will report an error, because a larger prefill budget could never be filled by any batch. For example, with `MAX_BATCH_SIZE=2` and `--max-input-length 512`, `--max-batch-prefill-tokens` should be at most 1024.
</Tip>
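Once the container above is running, a quick request can confirm that the reduced limits still serve correctly. This is a minimal sketch that assumes the port mapping from the command above and TGI's standard `/generate` endpoint:

```bash
# Minimal sanity check against the server started above.
# Adjust the host/port if your networking setup differs (e.g. with --net host the
# server may listen directly on the container's own port).
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What are TPUs good at?", "parameters": {"max_new_tokens": 64}}'
```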
## Sharding
Sharding is done automatically by the TGI server, so your model uses all the TPUs that are available. We use tensor parallelism, so the layers are automatically split across all available TPUs; however, the TGI router will only see one shard.

More information on tensor parallelism can be found at https://huggingface.co/docs/text-generation-inference/conceptual/tensor_parallelism

## Understanding the configuration

Key parameters explained:
- `--shm-size 16G`: Shared memory allocation
- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
- `-v ~/hf_data:/data`: Volume mount for model storage
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production)

<Tip warning={true}>
`--privileged --shm-size 16G --net host` is required, as specified in https://github.com/pytorch/xla
</Tip>

## Next steps
Please check the [TGI docs](https://huggingface.co/docs/text-generation-inference) for more TGI server configuration options.