From 81f3cc11d5337c0b8f295014708e28d75fc032fb Mon Sep 17 00:00:00 2001
From: Lianmin Zheng
Date: Thu, 23 Jun 2022 18:59:32 +0000
Subject: [PATCH] [skip ci] fix

---
 docs/tutorials/opt_serving.rst       | 48 +++++++++++++++++++++-------
 examples/opt_serving/README.md       |  6 ++--
 examples/opt_serving/textgen_demo.py |  2 +-
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/docs/tutorials/opt_serving.rst b/docs/tutorials/opt_serving.rst
index aa51fc512..174204769 100644
--- a/docs/tutorials/opt_serving.rst
+++ b/docs/tutorials/opt_serving.rst
@@ -3,12 +3,11 @@ Serving OPT-175B using Alpa
 
 This tutorial provides guides to setup a serving system to serve the largest available pretrained language model OPT-175B.
-
 As a serving system, Alpa provides the following unique advantages:
 
 - **Support commodity hardware**: With Alpa, you can serve OPT-175B using your in-house GPU cluster, without needing the latest generations of A100 80GB GPUs nor fancy InfiniBand connections -- no hardware constraints!
-- **Flexible parallelism strategies**: Alpa will automatically figure out the appropriate model-parallelism strategies based on your cluster setup.
+- **Flexible parallelism strategies**: Alpa will automatically figure out the appropriate model-parallel strategies based on your cluster setup.
 
 In this example, we use Alpa to serve the open-source OPT model, supporting all sizes ranging from 125M to 175B.
@@ -20,15 +19,42 @@ Specifically, Alpa provides:
 
 .. note::
 
-  The trained OPT model weights can be obtained from `Metaseq download page `_. Usages of
+  The pre-trained OPT model weights can be obtained from `Metaseq download page `_. Usages of
   the pretrained model weights are subject to their `license `_ .
 
 .. note::
 
-  You will need at least 350GB memory to to serve the OPT-175B model. You can also follow this guide to setup a serving system to serve smaller versions of OPT,
-  such as OPT-66B, OPT-30B, etc. Pick an appropriate size from `OPT weight release page `_ based on
-  your available resources.
-
+  You will need at least 350GB memory to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 instances x 8 (GPU/instance) x 16 (GB/GPU) = 512 GB memory.
+  You can also follow this guide to set up a serving system to serve smaller versions of OPT, such as OPT-66B, OPT-30B, etc.
+  Pick an appropriate size from `OPT weight release page `_ based on your available resources.
+
+Demo
+----
+Use huggingface/transformers interface and Alpa backend for distributed inference.
+
+.. code:: python
+
+    from transformers import AutoTokenizer
+    from examples.opt_serving.model.wrapper import get_model
+
+    # Load the tokenizer. We have to use the 30B version because
+    # other versions have some issues. The 30B version works for all OPT models.
+    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)
+    tokenizer.add_bos_token = False
+
+    # Load the model
+    model = get_model(model_name="alpa/opt-2.7b",
+                      device="cuda",
+                      path="/home/ubuntu/opt_weights/")
+
+    # Generate
+    prompt = "Paris is the capital city of "
+
+    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
+    output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)
+    generated_string = tokenizer.batch_decode(output, skip_special_tokens=True)
+
+    print(generated_string)
 
 Requirements
 ------------
@@ -57,12 +83,12 @@ There are two ways you can obtain the pretrained OPT weights.
   then use our script `convert_to_numpy_weight.py `_ to convert it into Alpa-compatible formats.
 2. We provide links to download the preprocessed 125M and 2.7B model below. For other sizes of OPT, please join `Alpa slack `_ to request a copy from the Alpa developer team.
+
   - `OPT-125M weights `_
   - `OPT-2.7B weights `_
-
-Run Generation in Command Line
-------------------------------
+Run and Benchmark Generation in Command Line
+--------------------------------------------
 
 For a small model that can fit into one GPU, such as the OPT-125M, we can run single-GPU generation using either PyTorch backend or JAX backend. For examples:
@@ -110,4 +136,4 @@ Then open ``https://[IP-ADDRESS]:10001`` in your browser to try out the model!
 
 License
 -------
-The Use of the OPT pretrained weights are subject to the `Model Licence `_ by Metaseq.
+The use of the OPT pretrained weights is subject to the `Model Licence `_ by Metaseq.
diff --git a/examples/opt_serving/README.md b/examples/opt_serving/README.md
index 33e27a104..f6b4c07fd 100644
--- a/examples/opt_serving/README.md
+++ b/examples/opt_serving/README.md
@@ -9,12 +9,12 @@ Specifically, Alpa provides:
 - A backend to perform model-parallel distributed inference for the large OPT models;
 - A web frontend to collect and batch inference requests from users.
 
-**Note**: the OPT model weights can be obtained from [Metaseq](https://github.com/facebookresearch/metaseq), subject to their license.
+**Note**: the pre-trained OPT model weights can be obtained from [Metaseq](https://github.com/facebookresearch/metaseq), subject to their license.
 
-## Example
+## Demo
 Use huggingface/transformers interface and Alpa backend for distributed inference.
 
-```pyhton
+```python
 from transformers import AutoTokenizer
 from examples.opt_serving.model.wrapper import get_model
diff --git a/examples/opt_serving/textgen_demo.py b/examples/opt_serving/textgen_demo.py
index 22483f1cf..cc5a7cc2d 100644
--- a/examples/opt_serving/textgen_demo.py
+++ b/examples/opt_serving/textgen_demo.py
@@ -1,4 +1,4 @@
-"""Use huggingface/transformers' interface and Alpa backend for distributed inference."""
+"""Use huggingface/transformers interface and Alpa backend for distributed inference."""
 from transformers import AutoTokenizer
 from examples.opt_serving.model.wrapper import get_model
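
For readers adapting the demo added in this patch: the same wrapper interface also supports deterministic output. Below is a minimal sketch (not part of the patch) that reuses only the calls shown above (`get_model`, `generate`, `batch_decode`); the `alpa/opt-125m` model name, the weight path, and the `do_sample=False` greedy setting are illustrative assumptions, not something this patch establishes.

```python
# A minimal sketch, assuming the Alpa wrapper exposes the HF-style
# generate() API exactly as in the demo above.
from transformers import AutoTokenizer

from examples.opt_serving.model.wrapper import get_model

# Per the demo, the 30B tokenizer works for all OPT model sizes.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)
tokenizer.add_bos_token = False

# Model name and weight path below are illustrative assumptions.
model = get_model(model_name="alpa/opt-125m",
                  device="cuda",
                  path="/home/ubuntu/opt_weights/")

prompt = "Computer science is the study of"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Greedy decoding is an assumption; the patch's demo uses do_sample=True.
output = model.generate(input_ids=input_ids, max_length=64, do_sample=False)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```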