
📝 overhaul of the documentation, now 4.5x bigger (better?) #144

Open · wants to merge 2 commits into main from improve-documentation
Conversation

@baptistecolle (Collaborator) commented on Jan 15, 2025

What does this PR do?

This is a complete overhaul of the documentation:

  • We went from 1,686 to 7,565 words (4.5x bigger)
  • We auto-generate documentation for our examples (see the sketch after this list)
  • New formatting and organization of the docs to make them easier to follow
  • Added new tutorials, how-to guides, conceptual guides, and references following the Diátaxis method

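For the curious, the example auto-generation could work roughly as follows. This is a minimal sketch, not the actual contents of `docs/scripts/auto-generate-examples.py`, and the `path`/`title` schema assumed for `examples_list.yml` is an illustration:

```python
# Hypothetical sketch of docs/scripts/auto-generate-examples.py:
# read the list of examples and emit one .mdx page per entry.
import pathlib

import yaml  # third-party: pyyaml

examples = yaml.safe_load(pathlib.Path("docs/scripts/examples_list.yml").read_text())
out_dir = pathlib.Path("docs/source/howto")

# Build the code fence programmatically to avoid embedding a literal fence here.
fence = "`" * 3

for example in examples:
    source = pathlib.Path(example["path"])  # "path" key is a hypothetical schema
    title = example["title"]                # "title" key is a hypothetical schema
    page = f"# {title}\n\n{fence}python\n{source.read_text()}\n{fence}\n"
    (out_dir / f"{source.stem}.mdx").write_text(page)
```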
What is missing (could be added):

  • I think more examples would be nice, showing more diverse use cases
  • I believe a FAQ and a glossary would be nice to add, but this PR is big enough already
  • A guide and examples for Google Colab Pro, since you can launch a v5e-1 TPU from there; a one-click example would be nice
  • An example using a GCE VM on Colab via the GCP Marketplace
  • More diagrams and figures of the internal workings of optimum-tpu, to give some details, would be interesting
  • A how-to guide on adding new models, for new contributors
  • Docs for GKE are in the works but not published yet, as there are some blockers: https://github.com/huggingface/optimum-tpu/blob/doc-deploy-gke/docs/source/howto/deploy-gke.md
  • The current preview docs for GKE are CLI-only; a GUI guide would be interesting too

New Files Added

  • docs/scripts/auto-generate-examples.py
  • docs/scripts/examples_list.yml
  • docs/source/conceptual_guides/difference_between_jetstream_and_xla.mdx
  • docs/source/conceptual_guides/tpu_hardware_support.mdx
  • docs/source/contributing.mdx
  • docs/source/howto/advanced-tgi-serving.mdx
  • docs/source/howto/deploy_instance_on_ie.mdx
  • docs/source/howto/installation_inside_a_container.mdx
  • docs/source/installation.mdx
  • docs/source/optimum_container.mdx
  • docs/source/reference/fsdp_v2.mdx
  • docs/source/reference/tgi_advanced_options.mdx
  • docs/source/tutorials/inference_on_tpu.mdx
  • docs/source/tutorials/tpu_setup.mdx
  • docs/source/tutorials/training_on_tpu.mdx

Modified Files

  • docs/source/howto/training.mdx
  • docs/source/index.mdx
  • docs/source/supported-architectures.mdx

@baptistecolle changed the title from "📝 overhaul of the documentation, now 4.5 bigger (better?)" to "📝 overhaul of the documentation, now 4.5x bigger (better?)" on Jan 15, 2025
@baptistecolle force-pushed the improve-documentation branch 2 times, most recently from 2ea95f2 to 1be1304, on January 15, 2025 13:13
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@baptistecolle marked this pull request as ready for review on January 15, 2025 13:50
@baptistecolle (Collaborator, Author) commented on Jan 15, 2025

BTW just for reference. We also now link to the optimum-tpu docs from:

The goal is to increase the visibility of the docs.

@pagezyhf self-requested a review on January 15, 2025 15:00
@tengomucho (Collaborator) left a comment


Thanks for the huge work! Some general comments

  • I would prefer to avoid repetition: having information repeated in several places can be confusing, and it is harder to maintain. E.g.: Docker arguments, TGI args
  • You specify version numbers; I think it would be best if we could generate those, otherwise they will be a burden to maintain
  • Try to keep titles and the toctree in sync
  • There is a bit of repetition between the tutorials and how-tos. Maybe you can rationalize that.
  • The conceptual guides should be more focused on optimum-tpu IMO, what do you think?

@@ -0,0 +1,17 @@
# Differences between JetStream and PyTorch XLA

What about mentioning that you are talking about TGI? Also, "JetStream PyTorch" might be more precise, as JetStream has 2 implementations.
Also, I find this page a little bit confusing. We use PyTorch XLA everywhere; even JetStream uses PyTorch XLA. Optimum TPU's TGI implementation can use JetStream PyTorch or PyTorch XLA, but keep in mind the latter should be deprecated, as we will probably remove it in the future.


You can find more information about:
- PyTorch XLA: https://pytorch.org/xla/ and https://github.com/pytorch/xla
- JetStream: https://github.com/google/jaxon/tree/main/jetstream

@@ -0,0 +1,54 @@
# TPU hardware support
Optimum-TPU support and is optimized for V5e, V5p, and V6e TPUs.

I think the V in V5e etc. should be lowercase (that is how they write it). Also, remove v5p; we have never tested it.

## TPU naming convention
The TPU naming follows this format: `<tpu_version>-<number_of_tpus>`

TPU versions available:

shouldn't this be "TPU available versions"?
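As context for the convention under discussion: a name like `v5e-8` encodes the TPU version and the chip count. A minimal illustration, not part of the PR:

```python
def parse_tpu_name(name: str) -> tuple[str, int]:
    """Split a TPU name like 'v5e-8' into (version, chip_count)."""
    version, count = name.rsplit("-", 1)
    return version, int(count)

# v5e-8 is a v5e node with 8 chips (a v5litepod-8 in GCP terms).
assert parse_tpu_name("v5e-8") == ("v5e", 8)
```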

# TPU hardware support
Optimum-TPU support and is optimized for V5e, V5p, and V6e TPUs.

## When to use TPU

What about renaming this "Why choose TPUs"?


1. Select TPU type:
- We'll use a TPU `v5e-8` (corresponds to a v5litepod8). This is a TPU node containing 8 v5e TPU chips
- For detailed specifications about TPU types, refer to our TPU types documentation

provide a link

Comment on lines +75 to +76
- For deploying existing models, start with Model Serving
- For training new models, begin with Model Training

provide links


## 1. Start the Jupyter Container

Launch the container with the following command:

You need to clone the optimum-tpu repo and install Jupyter Notebook before you can run it, and you will need to mount the notebook into the container too.

docker run --rm --net host --privileged \
-v$(pwd)/artifacts:/tmp/output \
-e HF_TOKEN=${HF_TOKEN} \
us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310 \

I do not think we should provide a link to an image URL that does not exist yet 😢

- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
- `-v ~/hf_data:/data`: Volume mount for model storage
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production)

remove this
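Related aside: once a TGI container launched with these flags is up, a minimal smoke test against its REST API could look like this. The URL, port, and generation parameters are assumptions (with `--net host`, TGI's default port 80 is exposed directly on the host), not from the PR:

```python
import requests  # third-party: requests

# POST a prompt to TGI's /generate endpoint and print the completion.
resp = requests.post(
    "http://localhost:80/generate",  # adjust to match your container's configuration
    json={"inputs": "What are TPUs good at?", "parameters": {"max_new_tokens": 64}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```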
