
Commit: [skip ci] fix
merrymercy committed Jun 23, 2022
1 parent 2c90024 commit 81f3cc1
Showing 3 changed files with 41 additions and 15 deletions.
48 changes: 37 additions & 11 deletions docs/tutorials/opt_serving.rst
@@ -3,12 +3,11 @@ Serving OPT-175B using Alpa

This tutorial shows how to set up a serving system to serve the largest available pretrained language model, OPT-175B.


As a serving system, Alpa provides the following unique advantages:

- **Support for commodity hardware**: With Alpa, you can serve OPT-175B using your in-house GPU cluster, without needing the latest generations of A100 80GB GPUs or fancy InfiniBand connections -- no hardware constraints!

- **Flexible parallelism strategies**: Alpa will automatically figure out the appropriate model-parallelism strategies based on your cluster setup.
- **Flexible parallelism strategies**: Alpa will automatically figure out the appropriate model-parallel strategies based on your cluster setup.


In this example, we use Alpa to serve the open-source OPT model, supporting all sizes ranging from 125M to 175B.
@@ -20,15 +19,42 @@ Specifically, Alpa provides:

.. note::

    The trained OPT model weights can be obtained from `Metaseq download page <https://github.com/facebookresearch/metaseq/tree/main/projects/OPT>`_. Usages of
    The pre-trained OPT model weights can be obtained from the `Metaseq download page <https://github.com/facebookresearch/metaseq/tree/main/projects/OPT>`_. Usage of
    the pretrained model weights is subject to their `license <https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md>`_.

.. note::

    You will need at least 350GB memory to serve the OPT-175B model. You can also follow this guide to set up a serving system to serve smaller versions of OPT,
    such as OPT-66B, OPT-30B, etc. Pick an appropriate size from `OPT weight release page <https://github.com/facebookresearch/metaseq/tree/main/projects/OPT>`_ based on
    your available resources.

    You will need at least 350GB of GPU memory to serve the OPT-175B model. For example, you can use 4 AWS p3.16xlarge instances, which provide 4 instances x 8 (GPUs/instance) x 16 (GB/GPU) = 512 GB of GPU memory.
    You can also follow this guide to set up a serving system to serve smaller versions of OPT, such as OPT-66B or OPT-30B.
    Pick an appropriate size from the `OPT weight release page <https://github.com/facebookresearch/metaseq/tree/main/projects/OPT>`_ based on your available resources.
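The 350GB figure follows from the parameter count: in fp16 precision each parameter takes 2 bytes, so 175B parameters occupy about 350GB before any activation or cache memory. Below is a minimal back-of-the-envelope sketch of this estimate (weights only; real deployments, like the example cluster above, should provision extra headroom):

.. code:: python

    # Rough estimate of the GPU memory needed just to hold OPT weights
    # in fp16 (2 bytes per parameter). Activations, workspaces, and the
    # KV cache require additional headroom on top of this.
    OPT_NUM_PARAMS = {"opt-125m": 0.125e9, "opt-2.7b": 2.7e9, "opt-30b": 30e9,
                      "opt-66b": 66e9, "opt-175b": 175e9}

    def weight_memory_gb(model_name, bytes_per_param=2):
        return OPT_NUM_PARAMS[model_name] * bytes_per_param / 1e9

    for name in sorted(OPT_NUM_PARAMS, key=OPT_NUM_PARAMS.get):
        print(f"{name}: ~{weight_memory_gb(name):.0f} GB")  # opt-175b: ~350 GB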

Demo
----
Use huggingface/transformers interface and Alpa backend for distributed inference.

.. code:: python

    from transformers import AutoTokenizer
    from examples.opt_serving.model.wrapper import get_model

    # Load the tokenizer. We have to use the 30B version because
    # other versions have some issues. The 30B version works for all OPT models.
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)
    tokenizer.add_bos_token = False

    # Load the model
    model = get_model(model_name="alpa/opt-2.7b",
                      device="cuda",
                      path="/home/ubuntu/opt_weights/")

    # Generate
    prompt = "Paris is the capital city of "

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)
    generated_string = tokenizer.batch_decode(output, skip_special_tokens=True)

    print(generated_string)
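To serve a different model size, change ``model_name`` above and point ``path`` at the corresponding weights; as noted earlier, Alpa supports all OPT sizes from 125M to 175B.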
Requirements
------------
@@ -57,12 +83,12 @@ There are two ways you can obtain the pretrained OPT weights.
then use our script `convert_to_numpy_weights.py <scripts/convert_to_numpy_weights.py>`_ to convert it into the Alpa-compatible format (a conceptual sketch follows this list).

2. We provide links to download the preprocessed 125M and 2.7B models below. For other sizes of OPT, please join `Alpa slack <https://forms.gle/YEZTCrtZD6EAVNBQ7>`_ to request a copy from the Alpa developer team.

- `OPT-125M weights <https://drive.google.com/file/d/1Ps7DFD80wNO7u2t39YCYcBX-9XwypGzl/view?usp=sharing>`_
- `OPT-2.7B weights <https://drive.google.com/file/d/1ayIaKRhxF9osZWgcFG-3vSkjcepSWdQd/view?usp=sharing>`_
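As a rough illustration of option 1, the sketch below shows the core idea of such a conversion under simplifying assumptions: it loads a single consolidated PyTorch checkpoint and dumps every tensor as an fp16 numpy array. The checkpoint filename is hypothetical, and the actual ``convert_to_numpy_weights.py`` script handles the real metaseq checkpoint layout (which may be sharded), so treat this as a conceptual outline rather than a drop-in replacement:

.. code:: python

    # Conceptual sketch only: load a PyTorch state dict and save each
    # tensor as an fp16 numpy array. The real convert_to_numpy_weights.py
    # handles metaseq's actual (possibly sharded) checkpoint format.
    import os

    import numpy as np
    import torch

    def convert_to_numpy(ckpt_path, out_dir):
        os.makedirs(out_dir, exist_ok=True)
        state_dict = torch.load(ckpt_path, map_location="cpu")
        for name, tensor in state_dict.items():
            np.save(os.path.join(out_dir, f"{name}.npy"),
                    tensor.to(torch.float16).numpy())

    convert_to_numpy("consolidated_checkpoint.pt",  # hypothetical filename
                     "/home/ubuntu/opt_weights/")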


Run Generation in Command Line
------------------------------
Run and Benchmark Generation in Command Line
--------------------------------------------

For a small model that can fit into one GPU, such as OPT-125M, we can run single-GPU generation using either the PyTorch backend or the JAX backend.
For example:
@@ -110,4 +136,4 @@ Then open ``https://[IP-ADDRESS]:10001`` in your browser to try out the model!
License
-------

The Use of the OPT pretrained weights are subject to the `Model Licence <https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md>`_ by Metaseq.
The use of the OPT pretrained weights is subject to the `Model License <https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md>`_ by Metaseq.
6 changes: 3 additions & 3 deletions examples/opt_serving/README.md
@@ -9,12 +9,12 @@ Specifically, Alpa provides:
- A backend to perform model-parallel distributed inference for the large OPT models;
- A web frontend to collect and batch inference requests from users.

**Note**: the OPT model weights can be obtained from [Metaseq](https://github.com/facebookresearch/metaseq), subject to their license.
**Note**: the pre-trained OPT model weights can be obtained from [Metaseq](https://github.com/facebookresearch/metaseq), subject to their license.

## Example
## Demo
Use huggingface/transformers interface and Alpa backend for distributed inference.

```pyhton
```python
from transformers import AutoTokenizer
from examples.opt_serving.model.wrapper import get_model

2 changes: 1 addition & 1 deletion examples/opt_serving/textgen_demo.py
@@ -1,4 +1,4 @@
"""Use huggingface/transformers' interface and Alpa backend for distributed inference."""
"""Use huggingface/transformers interface and Alpa backend for distributed inference."""
from transformers import AutoTokenizer
from examples.opt_serving.model.wrapper import get_model
