From 81f3cc11d5337c0b8f295014708e28d75fc032fb Mon Sep 17 00:00:00 2001
From: Lianmin Zheng
Date: Thu, 23 Jun 2022 18:59:32 +0000
Subject: [PATCH] [skip ci] fix

---
 docs/tutorials/opt_serving.rst       | 48 +++++++++++++++++++++-------
 examples/opt_serving/README.md       |  6 ++--
 examples/opt_serving/textgen_demo.py |  2 +-
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/docs/tutorials/opt_serving.rst b/docs/tutorials/opt_serving.rst
index aa51fc512..174204769 100644
--- a/docs/tutorials/opt_serving.rst
+++ b/docs/tutorials/opt_serving.rst
@@ -3,12 +3,11 @@ Serving OPT-175B using Alpa
 
 This tutorial provides guides to setup a serving system to serve the largest available pretrained language model OPT-175B.
-
 As a serving system, Alpa provides the following unique advantages:
 
 - **Support commodity hardware**: With Alpa, you can serve OPT-175B using your in-house GPU cluster, without needing the latest generations of A100 80GB GPUs nor fancy InfiniBand connections -- no hardware constraints!
-- **Flexible parallelism strategies**: Alpa will automatically figure out the appropriate model-parallelism strategies based on your cluster setup.
+- **Flexible parallelism strategies**: Alpa will automatically figure out the appropriate model-parallel strategies based on your cluster setup.
 
 In this example, we use Alpa to serve the open-source OPT model, supporting all sizes ranging from 125M to 175B.
@@ -20,15 +19,42 @@ Specifically, Alpa provides:
 
 .. note::
 
-  The trained OPT model weights can be obtained from `Metaseq download page `_. Usages of
+  The pre-trained OPT model weights can be obtained from `Metaseq download page `_. Usages of
   the pretrained model weights are subject to their `license `_ .
 
 .. note::
 
-  You will need at least 350GB memory to to serve the OPT-175B model. You can also follow this guide to setup a serving system to serve smaller versions of OPT,
-  such as OPT-66B, OPT-30B, etc. Pick an appropriate size from `OPT weight release page `_ based on
-  your available resources.
-
+  You will need at least 350GB memory to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 instances x 8 (GPU/instance) x 16 (GB/GPU) = 512 GB memory.
+  You can also follow this guide to set up a serving system to serve smaller versions of OPT, such as OPT-66B, OPT-30B, etc.
+  Pick an appropriate size from `OPT weight release page `_ based on your available resources.
+
+Demo
+----
+Use huggingface/transformers interface and Alpa backend for distributed inference.
+
+.. code:: python
+
+    from transformers import AutoTokenizer
+    from examples.opt_serving.model.wrapper import get_model
+
+    # Load the tokenizer. We have to use the 30B version because
+    # other versions have some issues. The 30B version works for all OPT models.
+    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)
+    tokenizer.add_bos_token = False
+
+    # Load the model
+    model = get_model(model_name="alpa/opt-2.7b",
+                      device="cuda",
+                      path="/home/ubuntu/opt_weights/")
+
+    # Generate
+    prompt = "Paris is the capital city of "
+
+    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
+    output = model.generate(input_ids=input_ids, max_length=256, do_sample=True)
+    generated_string = tokenizer.batch_decode(output, skip_special_tokens=True)
+
+    print(generated_string)
 
 Requirements
 ------------
@@ -57,12 +83,12 @@ There are two ways you can obtain the pretrained OPT weights.
   then use our script `convert_to_numpy_weight.py `_ to convert it into Alpa-compatible formats.
 2. We provide links to download the preprocessed 125M and 2.7B model below. For other sizes of OPT, please join `Alpa slack `_ to request a copy from the Alpa developer team.
+
   - `OPT-125M weights `_
   - `OPT-2.7B weights `_
-
-Run Generation in Command Line
-------------------------------
+Run and Benchmark Generation in Command Line
+--------------------------------------------
 
 For a small model that can fit into one GPU, such as the OPT-125M, we can run single-GPU generation using either PyTorch backend or JAX backend. For examples:
@@ -110,4 +136,4 @@ Then open ``https://[IP-ADDRESS]:10001`` in your browser to try out the model!
 
 License
 -------
-The Use of the OPT pretrained weights are subject to the `Model Licence `_ by Metaseq.
+The use of the OPT pretrained weights is subject to the `Model Licence `_ by Metaseq.
diff --git a/examples/opt_serving/README.md b/examples/opt_serving/README.md
index 33e27a104..f6b4c07fd 100644
--- a/examples/opt_serving/README.md
+++ b/examples/opt_serving/README.md
@@ -9,12 +9,12 @@ Specifically, Alpa provides:
 - A backend to perform model-parallel distributed inference for the large OPT models;
 - A web frontend to collect and batch inference requests from users.
 
-**Note**: the OPT model weights can be obtained from [Metaseq](https://github.com/facebookresearch/metaseq), subject to their license.
+**Note**: the pre-trained OPT model weights can be obtained from [Metaseq](https://github.com/facebookresearch/metaseq), subject to their license.
 
-## Example
+## Demo
 Use huggingface/transformers interface and Alpa backend for distributed inference.
 
-```pyhton
+```python
 from transformers import AutoTokenizer
 from examples.opt_serving.model.wrapper import get_model
diff --git a/examples/opt_serving/textgen_demo.py b/examples/opt_serving/textgen_demo.py
index 22483f1cf..cc5a7cc2d 100644
--- a/examples/opt_serving/textgen_demo.py
+++ b/examples/opt_serving/textgen_demo.py
@@ -1,4 +1,4 @@
-"""Use huggingface/transformers' interface and Alpa backend for distributed inference."""
+"""Use huggingface/transformers interface and Alpa backend for distributed inference."""
 from transformers import AutoTokenizer
 from examples.opt_serving.model.wrapper import get_model
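
For readers adapting the demo added in this patch: the same wrapper interface also supports deterministic output. Below is a minimal sketch (not part of the patch) that reuses only the calls shown above (`get_model`, `generate`, `batch_decode`); the `alpa/opt-125m` model name, the weight path, and the `do_sample=False` greedy setting are illustrative assumptions, not something this patch establishes.

```python
# A minimal sketch, assuming the Alpa wrapper exposes the HF-style
# generate() API exactly as in the demo above.
from transformers import AutoTokenizer

from examples.opt_serving.model.wrapper import get_model

# Per the demo, the 30B tokenizer works for all OPT model sizes.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)
tokenizer.add_bos_token = False

# Model name and weight path below are illustrative assumptions.
model = get_model(model_name="alpa/opt-125m",
                  device="cuda",
                  path="/home/ubuntu/opt_weights/")

prompt = "Computer science is the study of"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Greedy decoding is an assumption; the patch's demo uses do_sample=True.
output = model.generate(input_ids=input_ids, max_length=64, do_sample=False)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```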