diff --git a/.gitignore b/.gitignore index 8b03c06..fabcc5d 100644 --- a/.gitignore +++ b/.gitignore @@ -1,2 +1,3 @@ site/ .DS_Store +venv \ No newline at end of file diff --git a/docs/gpt-in-a-box/kubernetes/v0.1/custom_model.md b/docs/gpt-in-a-box/kubernetes/v0.1/custom_model.md deleted file mode 100644 index 8e1be37..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.1/custom_model.md +++ /dev/null @@ -1,31 +0,0 @@ -# Custom Model Support -We provide the capability to generate a MAR file with custom models and start an inference server using Kubeflow serving.
-!!! note - A model is recognised as a custom model if its model name is not present in the model_config file. - -## Generate Model Archive File for Custom Models -To generate the MAR file, run the following: -``` -python3 $WORK_DIR/llm/download.py --no_download [--repo_version --handler ] --model_name --model_path --output -``` - -* **no_download**: Set this flag to skip downloading the model files; it must be set for custom models -* **model_name**: Name of the custom model; this name must not be present in model_config -* **repo_version**: Any model version, defaults to "1.0" (optional) -* **model_path**: Absolute path of the custom model files (should be a non-empty folder) -* **output**: Mount path of your NFS server used in the kube PV, where config.properties and the model archive file will be stored -* **handler**: Path to custom handler, defaults to llm/handler.py (optional)
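For illustration, a MAR-generation call for a custom model might look like the following sketch; the model name `my_custom_model` and the paths shown are placeholders, not values shipped with the package:
```
# generate a MAR file from local model files, skipping the HuggingFace download
python3 $WORK_DIR/llm/download.py --no_download --model_name my_custom_model --model_path /mnt/llm/my_custom_model/model_files --output /mnt/llm
```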
- -## Start Inference Server with Custom Model Archive File -Run the following command for starting Kubeflow serving and running inference on the given input with a custom MAR file: -``` -bash $WORK_DIR/llm/run.sh -n -g -f -m -e [OPTIONAL -d ] -``` - -* **n**: Name of custom model, this name must not be in model_config -* **d**: Absolute path of input data folder (Optional) -* **g**: Number of gpus to be used to execute (Set 0 to use cpu) -* **f**: NFS server address with share path information -* **m**: Mount path to your nfs server to be used in the kube PV where model files and model archive file be stored -* **e**: Name of the deployment metadata - diff --git a/docs/gpt-in-a-box/kubernetes/v0.1/generating_mar.md b/docs/gpt-in-a-box/kubernetes/v0.1/generating_mar.md deleted file mode 100644 index b2172a6..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.1/generating_mar.md +++ /dev/null @@ -1,28 +0,0 @@ -## Download model files and Generate MAR file -Run the following command for downloading model files and generating MAR file: -``` -python3 $WORK_DIR/llm/download.py [--repo_version ] --model_name --output --hf_token -``` - -* **model_name**: Name of model -* **output**: Mount path to your nfs server to be used in the kube PV where model files and model archive file be stored -* **repo_version**: Commit id of model's repo from HuggingFace (optional, if not provided default set in model_config will be used) -* **hf_token**: Your HuggingFace token. Needed to download LLAMA(2) models. - -The available LLMs are mpt_7b (mosaicml/mpt_7b), falcon_7b (tiiuae/falcon-7b), llama2_7b (meta-llama/Llama-2-7b-hf). - -### Examples -The following are example commands to generate the model archive file. - -Download MPT-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/download.py --model_name mpt_7b --output /mnt/llm -``` -Download Falcon-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/download.py --model_name falcon_7b --output /mnt/llm -``` -Download Llama2-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/download.py --model_name llama2_7b --output /mnt/llm --hf_token -``` diff --git a/docs/gpt-in-a-box/kubernetes/v0.1/getting_started.md b/docs/gpt-in-a-box/kubernetes/v0.1/getting_started.md deleted file mode 100644 index 3dcbb81..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.1/getting_started.md +++ /dev/null @@ -1,85 +0,0 @@ -# Getting Started -This is a guide on getting started with GPT-in-a-Box 1.0 deployment on a Kubernetes Cluster. You can find the open source repository for the K8s version [here](https://github.com/nutanix/nai-llm-k8s). - -## Setup - -Inference experiments are done on a single NKE Cluster with Kubernetes version 1.25.6-0. The NKE Cluster has 3 non-gpu worker nodes with 12 vCPUs and 16G memory and 120 GB Storage. The cluster includes at least 1 gpu worker node with 12 vCPUs and 40G memory, 120 GB Storage and 1 A100-40G GPU passthrough. - -!!! note - Tested with python 3.10, a python virtual environment is preferred to managed dependencies. - -### Spec -**Jump node:** -OS: 22.04 -Resources: 1 VM with 8CPUs, 16G memory and 300 GB storage - -**NKE:** -NKE Version: 2.8 -K8s version: 1.25.6-0 -Resources: 3 cpu nodes with 12 vCPUs, 16G memory and 120 GB storage. 
- At least 1 gpu node with 12 vCPUs, 40G memory and 120 GB storage (1 A100-40G GPU passthrough) - -**NFS Server:** -Resources: 3 FSVMs with 4 vCPUs, 12 GB memory and 1 TB storage - - -| Software Dependency Matrix(Installed) | | -| --- | --- | -| Istio | 1.17.2 | -| Knative serving | 1.10.1 | -| Cert manager(Jetstack) | 1.3.0 | -| Kserve | 0.11.1 | - -### Jump machine setup -All commands are executed inside the jump machine. -Prerequisites are kubectl and helm. Both are required to orchestrate and set up necessary items in the NKE cluster. - -* [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) -* [helm](https://helm.sh/docs/intro/install/) - -Have a NFS mounted into your jump machine at a specific location. This mount location is required to be supplied as parameter to the execution scripts - -Command to mount NFS to local folder -``` -mount -t nfs : -``` -![Screenshot of a Jump Machine Setup.](image1.png) - - -**Follow the steps below to install the necessary prerequisites.** - -### Download and set up KubeConfig -Download and set up KubeConfig by following the steps outlined in [Downloading the Kubeconfig](https://portal.nutanix.com/page/documents/details?targetId=Nutanix-Kubernetes-Engine-v2_5:top-download-kubeconfig-t.html) on the Nutanix Support Portal. - -### Configure Nvidia Driver in the cluster using helm commands -For NKE 2.8, run the following command as per the [official documentaton](https://portal.nutanix.com/page/documents/details?targetId=Release-Notes-Nutanix-Kubernetes-Engine-v2_8:top-validated-config-r.html): -``` -helm repo add nvidia https://nvidia.github.io/gpu-operator && helm repo update -helm install --wait -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --version=v23.3.1 --set toolkit.version=v1.13.1-centos7 -``` - -For NKE 2.9, refer the [official documentation](https://portal.nutanix.com/page/documents/details?targetId=Release-Notes-Nutanix-Kubernetes-Engine-v2_9:top-validated-config-r.html) for the validated config. - -### Download nutanix package and Install python libraries -Download the **v0.1** release version from [NAI-LLM-K8s Releases](https://github.com/nutanix/nai-llm-k8s/releases/tag/v0.1) and untar the release. Set the working directory to the root folder containing the extracted release. -``` -export WORK_DIR=absolute_path_to_empty_release_directory -mkdir $WORK_DIR -tar -xvf -C $WORK_DIR --strip-components=1 -``` - -### Kubeflow serving installation into the cluster -``` -curl -s "https://raw.githubusercontent.com/kserve/kserve/v0.11.1/hack/quick_install.sh" | bash -``` -Now we have our cluster ready for inference. - -### Install pip3 -``` -sudo apt-get install python3-pip -``` - -### Install required packages -``` -pip install -r $WORK_DIR/llm/requirements.txt -``` diff --git a/docs/gpt-in-a-box/kubernetes/v0.1/image1.png b/docs/gpt-in-a-box/kubernetes/v0.1/image1.png deleted file mode 100644 index 5be8e71..0000000 Binary files a/docs/gpt-in-a-box/kubernetes/v0.1/image1.png and /dev/null differ diff --git a/docs/gpt-in-a-box/kubernetes/v0.1/inference_requests.md b/docs/gpt-in-a-box/kubernetes/v0.1/inference_requests.md deleted file mode 100644 index 796591f..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.1/inference_requests.md +++ /dev/null @@ -1,58 +0,0 @@ -Kubeflow serving can be inferenced and managed through it's Inference APIs. Find out more about Kubeflow serving APIs in the official [Inference API](https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/#model-inference) documentation. 
-### Set HOST and PORT -The first step is to [determine the ingress IP and ports](https://kserve.github.io/website/0.8/get_started/first_isvc/#4-determine-the-ingress-ip-and-ports) and set INGRESS_HOST and INGRESS_PORT. -The following command assigns the IP address of the host where the Istio Ingress Gateway pod is running to the INGRESS_HOST variable: -``` -export INGRESS_HOST=$(kubectl get po -l istio=ingressgateway -n istio-system -o jsonpath='{.items[0].status.hostIP}') -``` -The following command assigns the node port used for the HTTP2 service of the Istio Ingress Gateway to the INGRESS_PORT variable: -``` -export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}') -``` - -### Set Service Host Name -Next step is to determine service hostname. -This command retrieves the hostname of a specific InferenceService in a Kubernetes environment by extracting it from the status.url field and assigns it to the SERVICE_HOSTNAME variable: -``` -SERVICE_HOSTNAME=$(kubectl get inferenceservice -o jsonpath='{.status.url}' | cut -d "/" -f 3) -``` -#### Example: -``` -SERVICE_HOSTNAME=$(kubectl get inferenceservice llm-deploy -o jsonpath='{.status.url}' | cut -d "/" -f 3) -``` - -### Curl request to get inference -In the next step inference can be done on the deployed model. -The following is the template command for inferencing with a json file: -``` -curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/{model_name}/infer -d @{input_file_path} -``` -#### Examples: -Curl request for MPT-7B model -``` -curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/mpt_7b/infer -d @$WORK_DIR/data/qa/sample_test1.json -``` -Curl request for Falcon-7B model -``` -curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/falcon_7b/infer -d @$WORK_DIR/data/summarize/sample_test1.json -``` -Curl request for Llama2-7B model -``` -curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/llama2_7b/infer -d @$WORK_DIR/data/translate/sample_test1.json -``` - -### Input data format -Input data should be in **JSON** format. 
The input should be a '.json' file containing the prompt in the format below: -``` -{ - "id": "42", - "inputs": [ - { - "name": "input0", - "shape": [-1], - "datatype": "BYTES", - "data": ["Capital of India?"] - } - ] -} -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/kubernetes/v0.1/inference_server.md b/docs/gpt-in-a-box/kubernetes/v0.1/inference_server.md deleted file mode 100644 index 3ea3166..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.1/inference_server.md +++ /dev/null @@ -1,45 +0,0 @@ -## Start and run Kubeflow Serving - -Run the following command for starting Kubeflow serving and running inference on the given input: -``` -bash $WORK_DIR/llm/run.sh -n -g -f -m -e [OPTIONAL -d -v -t ] -``` - -* **n**: Name of model -* **d**: Absolute path of input data folder (Optional) -* **g**: Number of gpus to be used to execute (Set 0 to use cpu) -* **f**: NFS server address with share path information -* **m**: Mount path to your nfs server to be used in the kube PV where model files and model archive file be stored -* **e**: Name of the deployment metadata -* **v**: Commit id of model's repo from HuggingFace (optional, if not provided default set in model_config will be used) -* **t**: Your HuggingFace token. Needed for LLAMA(2) model. - -The available LLMs model names are mpt_7b (mosaicml/mpt_7b), falcon_7b (tiiuae/falcon-7b), llama2_7b (meta-llama/Llama-2-7b-hf). -Should print "Inference Run Successful" as a message once the Inference Server has successfully started. - -### Examples -The following are example commands to start the Inference Server. - -For 1 GPU Inference with official MPT-7B model and keep inference server alive: -``` -bash $WORK_DIR/llm/run.sh -n mpt_7b -d data/translate -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm -``` -For 1 GPU Inference with official Falcon-7B model and keep inference server alive: -``` -bash $WORK_DIR/llm/run.sh -n falcon_7b -d data/qa -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm -``` -For 1 GPU Inference with official Llama2-7B model and keep inference server alive: -``` -bash $WORK_DIR/llm/run.sh -n llama2_7b -d data/summarize -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm -t -``` - -### Cleanup Inference deployment - -Run the following command to stop the inference server and unmount PV and PVC. -``` -python3 $WORK_DIR/llm/cleanup.py --deploy_name -``` -Example: -``` -python3 $WORK_DIR/llm/cleanup.py --deploy_name llm-deploy -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/kubernetes/v0.2/custom_model.md b/docs/gpt-in-a-box/kubernetes/v0.2/custom_model.md deleted file mode 100644 index 5709696..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.2/custom_model.md +++ /dev/null @@ -1,33 +0,0 @@ -# Custom Model Support -In some cases you may want to use a custom model, e.g. a custom fine-tuned model. We provide the capability to generate a MAR file with custom models and start an inference server using Kubeflow serving.
- -## Generate Model Archive File for Custom Models - -!!! note - The model files should be placed in an NFS share accessible by the Nutanix package. This directory will be passed to the --model_path argument. You'll also need to provide the --output path where you want the model archive export to be stored. - -To generate the MAR file, run the following: -``` -python3 $WORK_DIR/llm/generate.py --skip_download [--repo_version --handler ] --model_name --model_path --output -``` - -* **skip_download**: Set this flag to skip downloading the model files; it must be set for custom models -* **model_name**: Name of the custom model -* **repo_version**: Any model version, defaults to "1.0" (optional) -* **model_path**: Absolute path of the custom model files (should be a non-empty folder) -* **output**: Mount path of your NFS server used in the kube PV, where config.properties and the model archive file will be stored -* **handler**: Path to custom handler, defaults to llm/handler.py (optional)
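As a sketch only, with a placeholder model name and NFS paths (substitute your own custom model name and mount locations):
```
# generate a MAR file from model files already present on the NFS share
python3 $WORK_DIR/llm/generate.py --skip_download --model_name my_custom_model --model_path /mnt/llm/my_custom_model/model_files --output /mnt/llm
```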
- -## Start Inference Server with Custom Model Archive File -Run the following command for starting Kubeflow serving and running inference on the given input with a custom MAR file: -``` -bash $WORK_DIR/llm/run.sh -n -g -f -m -e [OPTIONAL -d ] -``` - -* **n**: Name of custom model, this name must not be in model_config -* **d**: Absolute path of input data folder (Optional) -* **g**: Number of gpus to be used to execute (Set 0 to use cpu) -* **f**: NFS server address with share path information -* **m**: Mount path to your nfs server to be used in the kube PV where model files and model archive file be stored -* **e**: Name of the deployment metadata - diff --git a/docs/gpt-in-a-box/kubernetes/v0.2/generating_mar.md b/docs/gpt-in-a-box/kubernetes/v0.2/generating_mar.md deleted file mode 100644 index 1e8ccd6..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.2/generating_mar.md +++ /dev/null @@ -1,28 +0,0 @@ -# Generate PyTorch Model Archive File -We will download the model files and generate a Model Archive file for the desired LLM, which will be used by TorchServe to load the model. Find out more about Torch Model Archiver [here](https://github.com/pytorch/serve/blob/master/model-archiver/README.md). - -Run the following command for downloading model files and generating MAR file: -``` -python3 $WORK_DIR/llm/generate.py [--hf_token --repo_version ] --model_name --output -``` - -* **model_name**: Name of a [validated model](validated_models.md) -* **output**: Mount path to your nfs server to be used in the kube PV where model files and model archive file be stored -* **repo_version**: Commit ID of model's HuggingFace repository (optional, if not provided default set in model_config will be used) -* **hf_token**: Your HuggingFace token. Needed to download LLAMA(2) models. (It can alternatively be set using the environment variable 'HF_TOKEN') - -### Examples -The following are example commands to generate the model archive file. - -Download MPT-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/generate.py --model_name mpt_7b --output /mnt/llm -``` -Download Falcon-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/generate.py --model_name falcon_7b --output /mnt/llm -``` -Download Llama2-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/generate.py --model_name llama2_7b --output /mnt/llm --hf_token -``` diff --git a/docs/gpt-in-a-box/kubernetes/v0.2/getting_started.md b/docs/gpt-in-a-box/kubernetes/v0.2/getting_started.md deleted file mode 100644 index e3c52af..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.2/getting_started.md +++ /dev/null @@ -1,85 +0,0 @@ -# Getting Started -This is a guide on getting started with GPT-in-a-Box 1.0 deployment on a Kubernetes Cluster. You can find the open source repository for the K8s version [here](https://github.com/nutanix/nai-llm-k8s). - -## Setup - -Inference experiments are done on a single NKE Cluster with Kubernetes version 1.25.6-0. The NKE Cluster has 3 non-gpu worker nodes with 12 vCPUs and 16G memory and 120 GB Storage. The cluster includes at least 1 gpu worker node with 12 vCPUs and 40G memory, 120 GB Storage and 1 A100-40G GPU passthrough. - -!!! note - Tested with python 3.10, a python virtual environment is preferred to managed dependencies. - -### Spec -**Jump node:** -OS: 22.04 -Resources: 1 VM with 8CPUs, 16G memory and 300 GB storage - -**NKE:** -NKE Version: 2.8 -K8s version: 1.25.6-0 -Resources: 3 cpu nodes with 12 vCPUs, 16G memory and 120 GB storage. 
- At least 1 gpu node with 12 vCPUs, 40G memory and 120 GB storage (1 A100-40G GPU passthrough) - -**NFS Server:** -Resources: 3 FSVMs with 4 vCPUs, 12 GB memory and 1 TB storage - - -| Software Dependency Matrix(Installed) | | -| --- | --- | -| Istio | 1.17.2 | -| Knative serving | 1.10.1 | -| Cert manager(Jetstack) | 1.3.0 | -| Kserve | 0.11.1 | - -### Jump machine setup -All commands are executed inside the jump machine. -Prerequisites are kubectl and helm. Both are required to orchestrate and set up necessary items in the NKE cluster. - -* [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) -* [helm](https://helm.sh/docs/intro/install/) - -Have a NFS mounted into your jump machine at a specific location. This mount location is required to be supplied as parameter to the execution scripts - -Command to mount NFS to local folder -``` -mount -t nfs : -``` -![Screenshot of a Jump Machine Setup.](image1.png) - - -**Follow the steps below to install the necessary prerequisites.** - -### Download and set up KubeConfig -Download and set up KubeConfig by following the steps outlined in [Downloading the Kubeconfig](https://portal.nutanix.com/page/documents/details?targetId=Nutanix-Kubernetes-Engine-v2_5:top-download-kubeconfig-t.html) on the Nutanix Support Portal. - -### Configure Nvidia Driver in the cluster using helm commands -For NKE 2.8, run the following command as per the [official documentaton](https://portal.nutanix.com/page/documents/details?targetId=Release-Notes-Nutanix-Kubernetes-Engine-v2_8:top-validated-config-r.html): -``` -helm repo add nvidia https://nvidia.github.io/gpu-operator && helm repo update -helm install --wait -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator --version=v23.3.1 --set toolkit.version=v1.13.1-centos7 -``` - -For NKE 2.9, refer the [official documentation](https://portal.nutanix.com/page/documents/details?targetId=Release-Notes-Nutanix-Kubernetes-Engine-v2_9:top-validated-config-r.html) for the validated config. - -### Download nutanix package and Install python libraries -Download the **v0.2.2** release version from [NAI-LLM-K8s Releases](https://github.com/nutanix/nai-llm-k8s/releases/tag/v0.2.2) and untar the release. Set the working directory to the root folder containing the extracted release. -``` -export WORK_DIR=absolute_path_to_empty_release_directory -mkdir $WORK_DIR -tar -xvf -C $WORK_DIR --strip-components=1 -``` - -### Kubeflow serving installation into the cluster -``` -curl -s "https://raw.githubusercontent.com/kserve/kserve/v0.11.1/hack/quick_install.sh" | bash -``` -Now we have our cluster ready for inference. - -### Install pip3 -``` -sudo apt-get install python3-pip -``` - -### Install required packages -``` -pip install -r $WORK_DIR/llm/requirements.txt -``` diff --git a/docs/gpt-in-a-box/kubernetes/v0.2/huggingface_model.md b/docs/gpt-in-a-box/kubernetes/v0.2/huggingface_model.md deleted file mode 100644 index 9c2f5be..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.2/huggingface_model.md +++ /dev/null @@ -1,46 +0,0 @@ -# HuggingFace Model Support -!!! Note - To start the inference server for the [**Validated Models**](validated_models.md), refer to the [**Deploying Inference Server**](inference_server.md) documentation. - -We provide the capability to download model files from any HuggingFace repository and generate a MAR file to start an inference server using Kubeflow serving.
- -To start the Inference Server for any other HuggingFace model, follow the steps below. - -## Generate Model Archive File for HuggingFace Models -Run the following command to download the HuggingFace model files and generate the Model Archive File (MAR): -``` -python3 $WORK_DIR/llm/generate.py [--hf_token --repo_version --handler ] --model_name --repo_id --model_path --output -``` - -* **model_name**: Name of the HuggingFace model -* **repo_id**: HuggingFace Repository ID of the model -* **repo_version**: Commit ID of the model's HuggingFace repository, defaults to the latest HuggingFace commit ID (optional) -* **model_path**: Absolute path of the model files (should be an empty folder) -* **output**: Mount path of your NFS server used in the kube PV, where config.properties and the model archive file will be stored -* **handler**: Path to custom handler, defaults to llm/handler.py (optional)
-* **hf_token**: Your HuggingFace token. Needed to download and verify LLAMA(2) models. - -### Example -Download model files and generate model archive for codellama/CodeLlama-7b-hf: -``` -python3 $WORK_DIR/llm/generate.py --model_name codellama_7b_hf --repo_id codellama/CodeLlama-7b-hf --model_path /models/codellama_7b_hf/model_files --output /mnt/llm -``` - -## Start Inference Server with HuggingFace Model Archive File -Run the following command for starting Kubeflow serving and running inference on the given input with a custom MAR file: -``` -bash $WORK_DIR/llm/run.sh -n -g -f -m -e [OPTIONAL -d ] -``` - -* **n**: Name of HuggingFace model -* **d**: Absolute path of input data folder (Optional) -* **g**: Number of gpus to be used to execute (Set 0 to use cpu) -* **f**: NFS server address with share path information -* **m**: Mount path to your nfs server to be used in the kube PV where model files and model archive file be stored -* **e**: Name of the deployment metadata - -### Example -To start Inference Server with codellama/CodeLlama-7b-hf: -``` -bash $WORK_DIR/llm/run.sh -n codellama_7b_hf -d data/qa -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm -``` diff --git a/docs/gpt-in-a-box/kubernetes/v0.2/image1.png b/docs/gpt-in-a-box/kubernetes/v0.2/image1.png deleted file mode 100644 index 5be8e71..0000000 Binary files a/docs/gpt-in-a-box/kubernetes/v0.2/image1.png and /dev/null differ diff --git a/docs/gpt-in-a-box/kubernetes/v0.2/inference_requests.md b/docs/gpt-in-a-box/kubernetes/v0.2/inference_requests.md deleted file mode 100644 index eb2d101..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.2/inference_requests.md +++ /dev/null @@ -1,59 +0,0 @@ -Kubeflow serving can be inferenced and managed through its Inference APIs. Find out more about Kubeflow serving APIs in the official [Inference API](https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/#model-inference) documentation. - -### Set HOST and PORT -The first step is to [determine the ingress IP and ports](https://kserve.github.io/website/0.8/get_started/first_isvc/#4-determine-the-ingress-ip-and-ports) and set INGRESS_HOST and INGRESS_PORT. -The following command assigns the IP address of the host where the Istio Ingress Gateway pod is running to the INGRESS_HOST variable: -``` -export INGRESS_HOST=$(kubectl get po -l istio=ingressgateway -n istio-system -o jsonpath='{.items[0].status.hostIP}') -``` -The following command assigns the node port used for the HTTP2 service of the Istio Ingress Gateway to the INGRESS_PORT variable: -``` -export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}') -``` - -### Set Service Host Name -Next step is to determine service hostname. -This command retrieves the hostname of a specific InferenceService in a Kubernetes environment by extracting it from the status.url field and assigns it to the SERVICE_HOSTNAME variable: -``` -SERVICE_HOSTNAME=$(kubectl get inferenceservice -o jsonpath='{.status.url}' | cut -d "/" -f 3) -``` -#### Example: -``` -SERVICE_HOSTNAME=$(kubectl get inferenceservice llm-deploy -o jsonpath='{.status.url}' | cut -d "/" -f 3) -``` - -### Curl request to get inference -In the next step inference can be done on the deployed model. 
-The following is the template command for inferencing with a json file: -``` -curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/{model_name}/infer -d @{input_file_path} -``` -#### Examples: -Curl request for MPT-7B model -``` -curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/mpt_7b/infer -d @$WORK_DIR/data/qa/sample_text1.json -``` -Curl request for Falcon-7B model -``` -curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/falcon_7b/infer -d @$WORK_DIR/data/summarize/sample_text1.json -``` -Curl request for Llama2-7B model -``` -curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/llama2_7b/infer -d @$WORK_DIR/data/translate/sample_text1.json -``` - -### Input data format -Input data should be in **JSON** format. The input should be a '.json' file containing the prompt in the format below: -``` -{ - "id": "42", - "inputs": [ - { - "name": "input0", - "shape": [-1], - "datatype": "BYTES", - "data": ["Capital of India?"] - } - ] -} -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/kubernetes/v0.2/inference_server.md b/docs/gpt-in-a-box/kubernetes/v0.2/inference_server.md deleted file mode 100644 index 58cb9b0..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.2/inference_server.md +++ /dev/null @@ -1,43 +0,0 @@ -## Start and run Kubeflow Serving - -Run the following command for starting Kubeflow serving and running inference on the given input: -``` -bash $WORK_DIR/llm/run.sh -n -g -f -m -e [OPTIONAL -d -v ] -``` - -* **n**: Name of a [validated model](validated_models.md) -* **d**: Absolute path of input data folder (Optional) -* **g**: Number of gpus to be used to execute (Set 0 to use cpu) -* **f**: NFS server address with share path information -* **m**: Mount path to your nfs server to be used in the kube PV where model files and model archive file be stored -* **e**: Desired name of the deployment metadata (will be created) -* **v**: Commit ID of model's HuggingFace repository (optional, if not provided default set in model_config will be used) - -Should print "Inference Run Successful" as a message once the Inference Server has successfully started. - -### Examples -The following are example commands to start the Inference Server. - -For 1 GPU Inference with official MPT-7B model and keep inference server alive: -``` -bash $WORK_DIR/llm/run.sh -n mpt_7b -d data/translate -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm -``` -For 1 GPU Inference with official Falcon-7B model and keep inference server alive: -``` -bash $WORK_DIR/llm/run.sh -n falcon_7b -d data/qa -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm -``` -For 1 GPU Inference with official Llama2-7B model and keep inference server alive: -``` -bash $WORK_DIR/llm/run.sh -n llama2_7b -d data/summarize -g 1 -e llm-deploy -f '1.1.1.1:/llm' -m /mnt/llm -``` - -### Cleanup Inference deployment - -Run the following command to stop the inference server and unmount PV and PVC. 
-``` -python3 $WORK_DIR/llm/cleanup.py --deploy_name -``` -Example: -``` -python3 $WORK_DIR/llm/cleanup.py --deploy_name llm-deploy -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/kubernetes/v0.2/validated_models.md b/docs/gpt-in-a-box/kubernetes/v0.2/validated_models.md deleted file mode 100644 index 0e31b95..0000000 --- a/docs/gpt-in-a-box/kubernetes/v0.2/validated_models.md +++ /dev/null @@ -1,16 +0,0 @@ -# Validated Models for Kubernetes Version - -GPT-in-a-Box 1.0 has been validated on a curated set of HuggingFace models Information pertaining to these models is stored in the ```llm/model_config.json``` file. - -The Validated Models are : - -| Model Name | HuggingFace Repository ID | -| --- | --- | -| mpt_7b | [mosaicml/mpt_7b](https://huggingface.co/mosaicml/mpt-7b) | -| falcon_7b | [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b) | -| llama2_7b | [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) | -| codellama_7b_python | [codellama/CodeLlama-7b-Python-hf](https://huggingface.co/codellama/CodeLlama-7b-Python-hf) | -| llama2_7b_chat | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | - -!!! note - To start the inference server with any HuggingFace model, refer to [**HuggingFace Model Support**](huggingface_model.md) documentation. \ No newline at end of file diff --git a/docs/gpt-in-a-box/overview.md b/docs/gpt-in-a-box/overview.md deleted file mode 100644 index 8f3d6c2..0000000 --- a/docs/gpt-in-a-box/overview.md +++ /dev/null @@ -1,11 +0,0 @@ -# Nutanix GPT-in-a-Box 1.0 Documentation - -Welcome to the official home dedicated to documenting how to run Nutanix GPT-in-a-Box 1.0. Nutanix GPT-in-a-Box 1.0 is a turnkey solution that includes everything needed to build AI-ready infrastructure. Here, you'll find information and code to run Nutanix GPT-in-a-Box 1.0 on Virtual Machines or Kubernetes Clusters. - -This new solution includes: - -- Software-defined Nutanix Cloud Platform™ infrastructure supporting GPU-enabled server nodes for seamless scaling of virtualized compute, storage, and networking supporting both traditional virtual machines and Kubernetes-orchestrated containers -- Files and Objects storage; to fine-tune and run a choice of GPT models -- Open source software to deploy and run AI workloads including PyTorch framework & KubeFlow MLOps platform -- The management interface for enhanced terminal UI or standard CLI -- Support for a curated set of LLMs including Llama2, Falcon and MPT diff --git a/docs/gpt-in-a-box/support.md b/docs/gpt-in-a-box/support.md deleted file mode 100644 index 7ac49a3..0000000 --- a/docs/gpt-in-a-box/support.md +++ /dev/null @@ -1,14 +0,0 @@ -# Nutanix GPT-in-a-Box 1.0 Support - -Nutanix maintains public GitHub repositories for GPT in a box. Support is handled directly via the repository. Issues and enhancement requests can be submitted in the Issues tab of the relevant repository. Search for and review existing open issues before submitting a new issue. To report a new issue navigate to the GitHub repository: - -[GitHub - nutanix/nai-llm ](https://github.com/nutanix/nai-llm) - -This is the official repository for the virtual machine version of Nutanix GPT-in-a-Box 1.0. - -[GitHub - nutanix/nai-llm-k8s](https://github.com/nutanix/nai-llm-k8s) - -This is the official repository for the Kubernetes version of Nutanix GPT-in-a-Box 1.0. 
- -The support procedure is documented in [KB 16159](https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0VO0000000dJ70AI). - diff --git a/docs/gpt-in-a-box/vm/v0.2/custom_model.md b/docs/gpt-in-a-box/vm/v0.2/custom_model.md deleted file mode 100644 index 997a4bb..0000000 --- a/docs/gpt-in-a-box/vm/v0.2/custom_model.md +++ /dev/null @@ -1,29 +0,0 @@ -# Custom Model Support -We provide the capability to generate a MAR file with custom models and start an inference server using it with Torchserve. -!!! note - A model is recognised as a custom model if it's model name is not present in the model_config file. - -## Generate Model Archive File for Custom Models -Run the following command for generating the Model Archive File (MAR) with the Custom Model files : -``` -python3 $WORK_DIR/llm/download.py --no_download [--repo_version --handler ] --model_name --model_path --mar_output -``` -Where the arguments are : - -- **model_name**: Name of custom model -- **repo_version**: Any model version, defaults to "1.0" (optional) -- **model_path**: Absolute path of custom model files (should be a non empty folder) -- **mar_output**: Absolute path of export of MAR file (.mar) -- **no_download**: Flag to skip downloading the model files, must be set for custom models -- **handler**: Path to custom handler, defaults to llm/handler.py (optional) - -## Start Inference Server with Custom Model Archive File -Run the following command to start TorchServe (Inference Server) and run inference on the provided input for custom models: -``` -bash $WORK_DIR/llm/run.sh -n -a [OPTIONAL -d ] -``` -Where the arguments are : - -- **n**: Name of custom model -- **d**: Absolute path of input data folder (optional) -- **a**: Absolute path to the Model Store directory \ No newline at end of file diff --git a/docs/gpt-in-a-box/vm/v0.2/generating_mar.md b/docs/gpt-in-a-box/vm/v0.2/generating_mar.md deleted file mode 100644 index 4aed2fb..0000000 --- a/docs/gpt-in-a-box/vm/v0.2/generating_mar.md +++ /dev/null @@ -1,38 +0,0 @@ -# Generate PyTorch Model Archive File -We will download the model files and generate a Model Archive file for the desired LLM, which will be used by TorchServe to load the model. Find out more about Torch Model Archiver [here](https://github.com/pytorch/serve/blob/master/model-archiver/README.md). - -Make two new directories, one to store the model files (model_path) and another to store the Model Archive files (mar_output). - -!!! note - The model store directory (i.e, mar_output) can be the same for multiple Model Archive files. But model files directory (i.e, model_path) should be empty if you're downloading the model. - -Run the following command for downloading model files and generating the Model Archive File (MAR) of the desired LLM: -``` -python3 $WORK_DIR/llm/download.py [--no_download --repo_version ] --model_name --model_path --mar_output --hf_token -``` -Where the arguments are : - -- **model_name**: Name of model -- **repo_version**: Commit ID of model's HuggingFace repository (optional, if not provided default set in model_config will be used) -- **model_path**: Absolute path of model files (should be empty if downloading) -- **mar_output**: Absolute path of export of MAR file (.mar) -- **no_download**: Flag to skip downloading the model files -- **hf_token**: Your HuggingFace token. Needed to download and verify LLAMA(2) models. - -The available LLMs are mpt_7b (mosaicml/mpt_7b), falcon_7b (tiiuae/falcon-7b), llama2_7b (meta-llama/Llama-2-7b-hf). 
- -## Examples -The following are example commands to generate the model archive file. - -Download MPT-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/download.py --model_name mpt_7b --model_path /home/ubuntu/models/mpt_7b/model_files --mar_output /home/ubuntu/models/model_store -``` -Download Falcon-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/download.py --model_name falcon_7b --model_path /home/ubuntu/models/falcon_7b/model_files --mar_output /home/ubuntu/models/model_store -``` -Download Llama2-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/download.py --model_name llama2_7b --model_path /home/ubuntu/models/llama2_7b/model_files --mar_output /home/ubuntu/models/model_store --hf_token -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/vm/v0.2/getting_started.md b/docs/gpt-in-a-box/vm/v0.2/getting_started.md deleted file mode 100644 index d5ac8d3..0000000 --- a/docs/gpt-in-a-box/vm/v0.2/getting_started.md +++ /dev/null @@ -1,49 +0,0 @@ -# Getting Started -This is a guide on getting started with GPT-in-a-Box 1.0 deployment on a Virtual Machine. You can find the open source repository for the virtual machine version [here](https://github.com/nutanix/nai-llm). - -Tested Specifications: - -| Specification | Tested Version | -| --- | --- | -| Python | 3.10 | -| Operating System | Ubuntu 20.04 | -| GPU | NVIDIA A100 40G | -| CPU | 8 vCPUs | -| System Memory | 32 GB | - -Follow the steps below to install the necessary prerequisites. - -### Install openjdk, pip3 -Run the following command to install pip3 and openjdk -``` -sudo apt-get install openjdk-17-jdk python3-pip -``` - -### Install NVIDIA Drivers -To install the NVIDIA Drivers, refer to the official [Installation Reference](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#runfile). - -Proceed to downloading the latest [Datacenter NVIDIA drivers](https://www.nvidia.com/download/index.aspx) for your GPU type. - -For NVIDIA A100, Select A100 in Datacenter Tesla for Linux 64 bit with CUDA toolkit 11.7, latest driver is 515.105.01. - -``` -curl -fSsl -O https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run -sudo sh NVIDIA-Linux-x86_64-515.105.01.run -s -``` -!!! note - We don’t need to install CUDA toolkit separately as it is bundled with PyTorch installation. Just NVIDIA driver installation is enough. - -### Download Nutanix package -Download the **v0.2** release version from the [NAI-LLM Releases](https://github.com/nutanix/nai-llm/releases/tag/v0.2) and untar the release on the node. Set the working directory to the root folder containing the extracted release. - -``` -export WORK_DIR=absolute_path_to_empty_release_directory -mkdir $WORK_DIR -tar -xvf -C $WORK_DIR --strip-components=1 -``` - -### Install required packages -Run the following command to install the required python packages. -``` -pip install -r $WORK_DIR/llm/requirements.txt -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/vm/v0.2/inference_requests.md b/docs/gpt-in-a-box/vm/v0.2/inference_requests.md deleted file mode 100644 index b69243a..0000000 --- a/docs/gpt-in-a-box/vm/v0.2/inference_requests.md +++ /dev/null @@ -1,82 +0,0 @@ -# Inference Requests -The Inference Server can be inferenced through the TorchServe Inference API. Find out more about it in the official [TorchServe Inference API](https://pytorch.org/serve/inference_api.html) documentation. 
- -**Server Configuration** - -| Variable | Value | -| --- | --- | -| inference_server_endpoint | localhost | -| inference_port | 8080 | - -The following are example cURL commands to send inference requests to the Inference Server. - -## Ping Request -To find out the status of a TorchServe server, you can use the ping API that TorchServe supports: -``` -curl http://{inference_server_endpoint}:{inference_port}/ping -``` -### Example -``` -curl http://localhost:8080/ping -``` -!!! note - This only provides information on whether the TorchServe server is running. To check whether a model is successfully registered, use the "List Registered Models" request in the [Management Requests](management_requests.md#list-registered-models) documentation. - -## Inference Requests -The following is the template command for inferencing with a text file: -``` -curl -v -H "Content-Type: application/text" http://{inference_server_endpoint}:{inference_port}/predictions/{model_name} -d @path/to/data.txt -``` - -The following is the template command for inferencing with a json file: -``` -curl -v -H "Content-Type: application/json" http://{inference_server_endpoint}:{inference_port}/predictions/{model_name} -d @path/to/data.json -``` - -Input data files can be found in the `$WORK_DIR/data` folder. - -### Examples - -For MPT-7B model -``` -curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/mpt_7b -d @$WORK_DIR/data/qa/sample_text1.txt -``` -``` -curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/mpt_7b -d @$WORK_DIR/data/qa/sample_text4.json -``` - -For Falcon-7B model -``` -curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/falcon_7b -d @$WORK_DIR/data/summarize/sample_text1.txt -``` -``` -curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/falcon_7b -d @$WORK_DIR/data/summarize/sample_text3.json -``` - -For Llama2-7B model -``` -curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/llama2_7b -d @$WORK_DIR/data/translate/sample_text1.txt -``` -``` -curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/llama2_7b -d @$WORK_DIR/data/translate/sample_text3.json -``` - -### Input data format -Input data can be in either **text** or **JSON** format. - -1. For text format, the input should be a '.txt' file containing the prompt - -2. For JSON format, the input should be a '.json' file containing the prompt in the format below: -``` -{ - "id": "42", - "inputs": [ - { - "name": "input0", - "shape": [-1], - "datatype": "BYTES", - "data": ["Capital of India?"] - } - ] -} -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/vm/v0.2/inference_server.md b/docs/gpt-in-a-box/vm/v0.2/inference_server.md deleted file mode 100644 index a89a807..0000000 --- a/docs/gpt-in-a-box/vm/v0.2/inference_server.md +++ /dev/null @@ -1,37 +0,0 @@ -# Deploying Inference Server -Run the following command to start TorchServe (Inference Server) and run inference on the provided input: -``` -bash $WORK_DIR/llm/run.sh -n -a [OPTIONAL -d -v ] -``` -Where the arguments are : - -- **n**: Name of model -- **v**: Commit ID of model's HuggingFace repository (optional, if not provided default set in model_config will be used) -- **d**: Absolute path of input data folder (optional) -- **a**: Absolute path to the Model Store directory - -The available LLMs model names are mpt_7b (mosaicml/mpt_7b), falcon_7b (tiiuae/falcon-7b), llama2_7b (meta-llama/Llama-2-7b-hf). 
- -Once the Inference Server has successfully started, you should see a "Ready For Inferencing" message. - -### Examples -The following are example commands to start the Inference Server. - -For Inference with official MPT-7B model: -``` -bash $WORK_DIR/llm/run.sh -n mpt_7b -d $WORK_DIR/data/translate -a /home/ubuntu/models/model_store -``` -For Inference with official Falcon-7B model: -``` -bash $WORK_DIR/llm/run.sh -n falcon_7b -d $WORK_DIR/data/qa -a /home/ubuntu/models/model_store -``` -For Inference with official Llama2-7B model: -``` -bash $WORK_DIR/llm/run.sh -n llama2_7b -d $WORK_DIR/data/summarize -a /home/ubuntu/models/model_store -``` - -## Stop Inference Server and Cleanup -Run the following command to stop the Inference Server and clean up temporarily generate files. -``` -python3 $WORK_DIR/llm/cleanup.py -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/vm/v0.2/management_requests.md b/docs/gpt-in-a-box/vm/v0.2/management_requests.md deleted file mode 100644 index cb9819c..0000000 --- a/docs/gpt-in-a-box/vm/v0.2/management_requests.md +++ /dev/null @@ -1,133 +0,0 @@ -# Management Requests -The Inference Server can be managed through the TorchServe Management API. Find out more about it in the official [TorchServe Management API](https://pytorch.org/serve/management_api.html) documentation - -**Server Configuration** - -| Variable | Value | -| --- | --- | -| inference_server_endpoint | localhost | -| management_port | 8081 | - -The following are example cURL commands to send management requests to the Inference Server. - -## List Registered Models -To describe all registered models, the template command is: -``` -curl http://{inference_server_endpoint}:{management_port}/models -``` - -### Example -For all registered models -``` -curl http://localhost:8081/models -``` - -## Describe Registered Models -Once a model is loaded on the Inference Server, we can use the following request to describe the model and it's configuration. - -The following is the template command for the same: -``` -curl http://{inference_server_endpoint}:{management_port}/models/{model_name} -``` -Example response of the describe models request: -``` -[ - { - "modelName": "llama2_7b", - "modelVersion": "6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9", - "modelUrl": "llama2_7b_6fdf2e6.mar", - "runtime": "python", - "minWorkers": 1, - "maxWorkers": 1, - "batchSize": 1, - "maxBatchDelay": 200, - "loadedAtStartup": false, - "workers": [ - { - "id": "9000", - "startTime": "2023-11-28T06:39:28.081Z", - "status": "READY", - "memoryUsage": 0, - "pid": 57379, - "gpu": true, - "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::13423 MiB" - } - ], - "jobQueueStatus": { - "remainingCapacity": 1000, - "pendingRequests": 0 - } - } -] -``` - -!!! note - From this request, you can validate if a model is ready for inferencing. You can do this by referring to the values under the "workers" -> "status" keys of the response. - -### Examples -For MPT-7B model -``` -curl http://localhost:8081/models/mpt_7b -``` -For Falcon-7B model -``` -curl http://localhost:8081/models/falcon_7b -``` -For Llama2-7B model -``` -curl http://localhost:8081/models/llama2_7b -``` - -## Register Additional Models -TorchServe allows the registering (loading) of multiple models simultaneously. To register multiple models, make sure that the Model Archive Files for the concerned models are stored in the same directory. 
- -The following is the template command for the same: -``` -curl -X POST "http://{inference_server_endpoint}:{management_port}/models?url={model_archive_file_name}.mar&initial_workers=1&synchronous=true" -``` - -### Examples -For MPT-7B model -``` -curl -X POST "http://localhost:8081/models?url=mpt_7b.mar&initial_workers=1&synchronous=true" -``` -For Falcon-7B model -``` -curl -X POST "http://localhost:8081/models?url=falcon_7b.mar&initial_workers=1&synchronous=true" -``` -For Llama2-7B model -``` -curl -X POST "http://localhost:8081/models?url=llama2_7b.mar&initial_workers=1&synchronous=true" -``` -!!! note - Make sure the Model Archive file name given in the cURL request is correct and is present in the model store directory. - -## Edit Registered Model Configuration -The model can be configured after registration using the Management API of TorchServe. - -The following is the template command for the same: -``` -curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}?min_workers={number}&max_workers={number}&batch_size={number}&max_batch_delay={delay_in_ms}" -``` - -### Examples -For MPT-7B model -``` -curl -v -X PUT "http://localhost:8081/models/mpt_7b?min_worker=2&max_worker=2" -``` -For Falcon-7B model -``` -curl -v -X PUT "http://localhost:8081/models/falcon_7b?min_worker=2&max_worker=2" -``` -For Llama2-7B model -``` -curl -v -X PUT "http://localhost:8081/models/llama2_7b?min_worker=2&max_worker=2" -``` -!!! note - Make sure to have enough GPU and System Memory before increasing number of workers, else the additional workers will fail to load. - -## Unregister a Model -The following is the template command to unregister a model from the Inference Server: -``` -curl -X DELETE "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}" -``` diff --git a/docs/gpt-in-a-box/vm/v0.2/model_version.md b/docs/gpt-in-a-box/vm/v0.2/model_version.md deleted file mode 100644 index 8816593..0000000 --- a/docs/gpt-in-a-box/vm/v0.2/model_version.md +++ /dev/null @@ -1,8 +0,0 @@ -# Model Version Support -We provide the capability to download and register various commits of the single model from HuggingFace. By specifying the commit ID as "repo_version", you can produce MAR files for multiple iterations of the same model and register them simultaneously. To transition between these versions, you can set a default version within TorchServe while it is running and inference the desired version. - -## Set Default Model Version -If multiple versions of the same model are registered, we can set a particular version as the default for inferencing by running the following command: -``` -curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/{model_name}/{repo_version}/set-default" -``` diff --git a/docs/gpt-in-a-box/vm/v0.3/custom_model.md b/docs/gpt-in-a-box/vm/v0.3/custom_model.md deleted file mode 100644 index f6abf94..0000000 --- a/docs/gpt-in-a-box/vm/v0.3/custom_model.md +++ /dev/null @@ -1,31 +0,0 @@ -# Custom Model Support -In some cases you may want to use a custom model, e.g. a custom fine-tuned model. We provide the capability to generate a MAR file with custom model files and start an inference server using it with Torchserve. - -## Generate Model Archive File for Custom Models - -!!! note - The model archive files should be placed in a directory accessible by the Nutanix package, e.g. /home/ubuntu/models/<custom_model_name>/model_files. This directory will be passed to the --model_path argument. 
You'll also need to provide the --mar_output path where you want the model archive export to be stored. - -Run the following command for generating the Model Archive File (MAR) with the Custom Model files : -``` -python3 $WORK_DIR/llm/generate.py --skip_download [--repo_version --handler ] --model_name --model_path --mar_output -``` -Where the arguments are : - -- **model_name**: Name of custom model -- **repo_version**: Any model version, defaults to "1.0" (optional) -- **model_path**: Absolute path of custom model files (should be a non empty folder) -- **mar_output**: Absolute path of export of MAR file (.mar) -- **skip_download**: Flag to skip downloading the model files, must be set for custom models -- **handler**: Path to custom handler, defaults to llm/handler.py (optional) - -## Start Inference Server with Custom Model Archive File -Run the following command to start TorchServe (Inference Server) and run inference on the provided input for custom models: -``` -bash $WORK_DIR/llm/run.sh -n -a [OPTIONAL -d ] -``` -Where the arguments are : - -- **n**: Name of custom model -- **d**: Absolute path of input data folder (optional) -- **a**: Absolute path to the Model Store directory \ No newline at end of file diff --git a/docs/gpt-in-a-box/vm/v0.3/generating_mar.md b/docs/gpt-in-a-box/vm/v0.3/generating_mar.md deleted file mode 100644 index a1b6f49..0000000 --- a/docs/gpt-in-a-box/vm/v0.3/generating_mar.md +++ /dev/null @@ -1,36 +0,0 @@ -# Generate PyTorch Model Archive File -We will download the model files and generate a Model Archive file for the desired LLM, which will be used by TorchServe to load the model. Find out more about Torch Model Archiver [here](https://github.com/pytorch/serve/blob/master/model-archiver/README.md). - -Make two new directories, one to store the model files (model_path) and another to store the Model Archive files (mar_output). - -!!! note - The model store directory (i.e, mar_output) can be the same for multiple Model Archive files. But model files directory (i.e, model_path) should be empty if you're downloading the model. - -Run the following command for downloading model files and generating the Model Archive File (MAR) of the desired LLM: -``` -python3 $WORK_DIR/llm/generate.py [--skip_download --repo_version --hf_token ] --model_name --model_path --mar_output -``` -Where the arguments are : - -- **model_name**: Name of a [validated model](validated_models.md) -- **repo_version**: Commit ID of model's HuggingFace repository (optional, if not provided default set in model_config will be used) -- **model_path**: Absolute path of model files (should be empty if downloading) -- **mar_output**: Absolute path of export of MAR file (.mar) -- **skip_download**: Flag to skip downloading the model files -- **hf_token**: Your HuggingFace token. Needed to download and verify LLAMA(2) models. (It can alternatively be set using the environment variable 'HF_TOKEN') - -## Examples -The following are example commands to generate the model archive file. 
- -Download MPT-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/generate.py --model_name mpt_7b --model_path /home/ubuntu/models/mpt_7b/model_files --mar_output /home/ubuntu/models/model_store -``` -Download Falcon-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/generate.py --model_name falcon_7b --model_path /home/ubuntu/models/falcon_7b/model_files --mar_output /home/ubuntu/models/model_store -``` -Download Llama2-7B model files and generate model archive for it: -``` -python3 $WORK_DIR/llm/generate.py --model_name llama2_7b --model_path /home/ubuntu/models/llama2_7b/model_files --mar_output /home/ubuntu/models/model_store --hf_token -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/vm/v0.3/getting_started.md b/docs/gpt-in-a-box/vm/v0.3/getting_started.md deleted file mode 100644 index 2603c5f..0000000 --- a/docs/gpt-in-a-box/vm/v0.3/getting_started.md +++ /dev/null @@ -1,49 +0,0 @@ -# Getting Started -This is a guide on getting started with GPT-in-a-Box 1.0 deployment on a Virtual Machine. You can find the open source repository for the virtual machine version [here](https://github.com/nutanix/nai-llm). - -Tested Specifications: - -| Specification | Tested Version | -| --- | --- | -| Python | 3.10 | -| Operating System | Ubuntu 20.04 | -| GPU | NVIDIA A100 40G | -| CPU | 8 vCPUs | -| System Memory | 32 GB | - -Follow the steps below to install the necessary prerequisites. - -### Install openjdk, pip3 -Run the following command to install pip3 and openjdk -``` -sudo apt-get install openjdk-17-jdk python3-pip -``` - -### Install NVIDIA Drivers -To install the NVIDIA Drivers, refer to the official [Installation Reference](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#runfile). - -Proceed to downloading the latest [Datacenter NVIDIA drivers](https://www.nvidia.com/download/index.aspx) for your GPU type. - -For NVIDIA A100, Select A100 in Datacenter Tesla for Linux 64 bit with CUDA toolkit 11.7, latest driver is 515.105.01. - -``` -curl -fSsl -O https://us.download.nvidia.com/tesla/515.105.01/NVIDIA-Linux-x86_64-515.105.01.run -sudo sh NVIDIA-Linux-x86_64-515.105.01.run -s -``` -!!! note - There is no need to install CUDA toolkit separately as it is bundled with PyTorch installation. The NVIDIA driver installation is sufficient. - -### Download Nutanix package -Download the **v0.3** release version from the [NAI-LLM Releases](https://github.com/nutanix/nai-llm/releases/tag/v0.3) and untar the release on the node. Set the working directory to the root folder containing the extracted release. - -``` -export WORK_DIR=absolute_path_to_empty_release_directory -mkdir $WORK_DIR -tar -xvf -C $WORK_DIR --strip-components=1 -``` - -### Install required packages -Run the following command to install the required python packages. -``` -pip install -r $WORK_DIR/llm/requirements.txt -``` diff --git a/docs/gpt-in-a-box/vm/v0.3/huggingface_model.md b/docs/gpt-in-a-box/vm/v0.3/huggingface_model.md deleted file mode 100644 index 6abf283..0000000 --- a/docs/gpt-in-a-box/vm/v0.3/huggingface_model.md +++ /dev/null @@ -1,45 +0,0 @@ -# HuggingFace Model Support -!!! Note - To start the inference server for the [**Validated Models**](validated_models.md), refer to the [**Deploying Inference Server**](inference_server.md) documentation. - -We provide the capability to download model files from any HuggingFace repository and generate a MAR file to start an inference server using it with Torchserve. 
- -To start the Inference Server for any other HuggingFace model, follow the steps below. - -## Generate Model Archive File for HuggingFace Models -Run the following command for downloading and generating the Model Archive File (MAR) with the HuggingFace Model files : -``` -python3 $WORK_DIR/llm/generate.py [--hf_token --repo_version --handler ] --model_name --repo_id --model_path --mar_output -``` -Where the arguments are : - -- **model_name**: Name of HuggingFace model -- **repo_id**: HuggingFace Repository ID of the model -- **repo_version**: Commit ID of model's HuggingFace repository, defaults to latest HuggingFace commit ID (optional) -- **model_path**: Absolute path of model files (should be an empty folder) -- **mar_output**: Absolute path of export of MAR file (.mar) -- **handler**: Path to custom handler, defaults to llm/handler.py (optional) -- **hf_token**: Your HuggingFace token. Needed to download and verify LLAMA(2) models. - -### Example -Download model files and generate model archive for codellama/CodeLlama-7b-hf: -``` -python3 $WORK_DIR/llm/generate.py --model_name codellama_7b_hf --repo_id codellama/CodeLlama-7b-hf --model_path /models/codellama_7b_hf/model_files --mar_output /models/model_store -``` - -## Start Inference Server with HuggingFace Model -Run the following command to start TorchServe (Inference Server) and run inference on the provided input for HuggingFace models: -``` -bash $WORK_DIR/llm/run.sh -n -a [OPTIONAL -d ] -``` -Where the arguments are : - -- **n**: Name of HuggingFace model -- **d**: Absolute path of input data folder (optional) -- **a**: Absolute path to the Model Store directory - -### Example -To start Inference Server with codellama/CodeLlama-7b-hf: -``` -bash $WORK_DIR/llm/run.sh -n codellama_7b_hf -a /models/model_store -d $WORK_DIR/data/summarize -``` diff --git a/docs/gpt-in-a-box/vm/v0.3/inference_requests.md b/docs/gpt-in-a-box/vm/v0.3/inference_requests.md deleted file mode 100644 index 22c6905..0000000 --- a/docs/gpt-in-a-box/vm/v0.3/inference_requests.md +++ /dev/null @@ -1,82 +0,0 @@ -# Inference Requests -The Inference Server can be inferenced through the TorchServe Inference API. Find out more about it in the official [TorchServe Inference API](https://pytorch.org/serve/inference_api.html) documentation. - -**Server Configuration** - -| Variable | Value | -| --- | --- | -| inference_server_endpoint | localhost | -| inference_port | 8080 | - -The following are example cURL commands to send inference requests to the Inference Server. - -## Ping Request -To find out the status of a TorchServe server, you can use the ping API that TorchServe supports: -``` -curl http://{inference_server_endpoint}:{inference_port}/ping -``` -### Example -``` -curl http://localhost:8080/ping -``` -!!! note - This only provides information on whether the TorchServe server is running. To check whether a model is successfully registered on TorchServe, you can [**list all models**](management_requests.md#list-registered-models) and [**describe a registered model**](management_requests.md#describe-registered-models). 
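For instance, assuming the default localhost endpoint and management port 8081 described in the [Management Requests](management_requests.md) page, a quick registration check for the Llama2-7B model could be:
```
# list all registered models
curl http://localhost:8081/models
# describe the llama2_7b model and check its worker status
curl http://localhost:8081/models/llama2_7b
```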
- -## Inference Requests -The following is the template command for inferencing with a text file: -``` -curl -v -H "Content-Type: application/text" http://{inference_server_endpoint}:{inference_port}/predictions/{model_name} -d @path/to/data.txt -``` - -The following is the template command for inferencing with a json file: -``` -curl -v -H "Content-Type: application/json" http://{inference_server_endpoint}:{inference_port}/predictions/{model_name} -d @path/to/data.json -``` - -Input data files can be found in the `$WORK_DIR/data` folder. - -### Examples - -For MPT-7B model -``` -curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/mpt_7b -d @$WORK_DIR/data/qa/sample_text1.txt -``` -``` -curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/mpt_7b -d @$WORK_DIR/data/qa/sample_text4.json -``` - -For Falcon-7B model -``` -curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/falcon_7b -d @$WORK_DIR/data/summarize/sample_text1.txt -``` -``` -curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/falcon_7b -d @$WORK_DIR/data/summarize/sample_text3.json -``` - -For Llama2-7B model -``` -curl -v -H "Content-Type: application/text" http://localhost:8080/predictions/llama2_7b -d @$WORK_DIR/data/translate/sample_text1.txt -``` -``` -curl -v -H "Content-Type: application/json" http://localhost:8080/predictions/llama2_7b -d @$WORK_DIR/data/translate/sample_text3.json -``` - -### Input data format -Input data can be in either **text** or **JSON** format. - -1. For text format, the input should be a '.txt' file containing the prompt - -2. For JSON format, the input should be a '.json' file containing the prompt in the format below: -``` -{ - "id": "42", - "inputs": [ - { - "name": "input0", - "shape": [-1], - "datatype": "BYTES", - "data": ["Capital of India?"] - } - ] -} -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/vm/v0.3/inference_server.md b/docs/gpt-in-a-box/vm/v0.3/inference_server.md deleted file mode 100644 index 4a899d9..0000000 --- a/docs/gpt-in-a-box/vm/v0.3/inference_server.md +++ /dev/null @@ -1,36 +0,0 @@ -# Deploying Inference Server - -Run the following command to start TorchServe (Inference Server) and run inference on the provided input: -``` -bash $WORK_DIR/llm/run.sh -n -a [OPTIONAL -d -v ] -``` -Where the arguments are : - -- **n**: Name of a [validated model](validated_models.md) -- **v**: Commit ID of model's HuggingFace repository (optional, if not provided default set in model_config will be used) -- **d**: Absolute path of input data folder (optional) -- **a**: Absolute path to the Model Store directory - -Once the Inference Server has successfully started, you should see a "Ready For Inferencing" message. - -### Examples -The following are example commands to start the Inference Server. - -For Inference with official MPT-7B model: -``` -bash $WORK_DIR/llm/run.sh -n mpt_7b -d $WORK_DIR/data/translate -a /home/ubuntu/models/model_store -``` -For Inference with official Falcon-7B model: -``` -bash $WORK_DIR/llm/run.sh -n falcon_7b -d $WORK_DIR/data/qa -a /home/ubuntu/models/model_store -``` -For Inference with official Llama2-7B model: -``` -bash $WORK_DIR/llm/run.sh -n llama2_7b -d $WORK_DIR/data/summarize -a /home/ubuntu/models/model_store -``` - -## Stop Inference Server and Cleanup -Run the following command to stop the Inference Server and clean up temporarily generate files. 
-``` -python3 $WORK_DIR/llm/cleanup.py -``` \ No newline at end of file diff --git a/docs/gpt-in-a-box/vm/v0.3/management_requests.md b/docs/gpt-in-a-box/vm/v0.3/management_requests.md deleted file mode 100644 index cb9819c..0000000 --- a/docs/gpt-in-a-box/vm/v0.3/management_requests.md +++ /dev/null @@ -1,133 +0,0 @@ -# Management Requests -The Inference Server can be managed through the TorchServe Management API. Find out more about it in the official [TorchServe Management API](https://pytorch.org/serve/management_api.html) documentation - -**Server Configuration** - -| Variable | Value | -| --- | --- | -| inference_server_endpoint | localhost | -| management_port | 8081 | - -The following are example cURL commands to send management requests to the Inference Server. - -## List Registered Models -To describe all registered models, the template command is: -``` -curl http://{inference_server_endpoint}:{management_port}/models -``` - -### Example -For all registered models -``` -curl http://localhost:8081/models -``` - -## Describe Registered Models -Once a model is loaded on the Inference Server, we can use the following request to describe the model and it's configuration. - -The following is the template command for the same: -``` -curl http://{inference_server_endpoint}:{management_port}/models/{model_name} -``` -Example response of the describe models request: -``` -[ - { - "modelName": "llama2_7b", - "modelVersion": "6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9", - "modelUrl": "llama2_7b_6fdf2e6.mar", - "runtime": "python", - "minWorkers": 1, - "maxWorkers": 1, - "batchSize": 1, - "maxBatchDelay": 200, - "loadedAtStartup": false, - "workers": [ - { - "id": "9000", - "startTime": "2023-11-28T06:39:28.081Z", - "status": "READY", - "memoryUsage": 0, - "pid": 57379, - "gpu": true, - "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::13423 MiB" - } - ], - "jobQueueStatus": { - "remainingCapacity": 1000, - "pendingRequests": 0 - } - } -] -``` - -!!! note - From this request, you can validate if a model is ready for inferencing. You can do this by referring to the values under the "workers" -> "status" keys of the response. - -### Examples -For MPT-7B model -``` -curl http://localhost:8081/models/mpt_7b -``` -For Falcon-7B model -``` -curl http://localhost:8081/models/falcon_7b -``` -For Llama2-7B model -``` -curl http://localhost:8081/models/llama2_7b -``` - -## Register Additional Models -TorchServe allows the registering (loading) of multiple models simultaneously. To register multiple models, make sure that the Model Archive Files for the concerned models are stored in the same directory. - -The following is the template command for the same: -``` -curl -X POST "http://{inference_server_endpoint}:{management_port}/models?url={model_archive_file_name}.mar&initial_workers=1&synchronous=true" -``` - -### Examples -For MPT-7B model -``` -curl -X POST "http://localhost:8081/models?url=mpt_7b.mar&initial_workers=1&synchronous=true" -``` -For Falcon-7B model -``` -curl -X POST "http://localhost:8081/models?url=falcon_7b.mar&initial_workers=1&synchronous=true" -``` -For Llama2-7B model -``` -curl -X POST "http://localhost:8081/models?url=llama2_7b.mar&initial_workers=1&synchronous=true" -``` -!!! note - Make sure the Model Archive file name given in the cURL request is correct and is present in the model store directory. - -## Edit Registered Model Configuration -The model can be configured after registration using the Management API of TorchServe. 
- -The following is the template command for the same: -``` -curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/models/{model_name}?min_workers={number}&max_workers={number}&batch_size={number}&max_batch_delay={delay_in_ms}" -``` - -### Examples -For MPT-7B model -``` -curl -v -X PUT "http://localhost:8081/models/mpt_7b?min_worker=2&max_worker=2" -``` -For Falcon-7B model -``` -curl -v -X PUT "http://localhost:8081/models/falcon_7b?min_worker=2&max_worker=2" -``` -For Llama2-7B model -``` -curl -v -X PUT "http://localhost:8081/models/llama2_7b?min_worker=2&max_worker=2" -``` -!!! note - Make sure to have enough GPU and System Memory before increasing number of workers, else the additional workers will fail to load. - -## Unregister a Model -The following is the template command to unregister a model from the Inference Server: -``` -curl -X DELETE "http://{inference_server_endpoint}:{management_port}/models/{model_name}/{repo_version}" -``` diff --git a/docs/gpt-in-a-box/vm/v0.3/model_version.md b/docs/gpt-in-a-box/vm/v0.3/model_version.md deleted file mode 100644 index 647199c..0000000 --- a/docs/gpt-in-a-box/vm/v0.3/model_version.md +++ /dev/null @@ -1,12 +0,0 @@ -# Model Version Support -We provide the capability to download and register various commits of the single model from HuggingFace. Follow the steps below for the same : - -- [Generate MAR files](generating_mar.md) for the required HuggingFace commits by passing it's commit ID in the "--repo_version" argument -- [Deploy TorchServe](inference_server.md) with any one of the versions passed through the "--repo_version" argument -- Register the rest of the required versions through the [register additional models](management_requests.md#register-additional-models) request. - -## Set Default Model Version -If multiple versions of the same model are registered, we can set a particular version as the default for inferencing by running the following command: -``` -curl -v -X PUT "http://{inference_server_endpoint}:{management_port}/{model_name}/{repo_version}/set-default" -``` diff --git a/docs/gpt-in-a-box/vm/v0.3/validated_models.md b/docs/gpt-in-a-box/vm/v0.3/validated_models.md deleted file mode 100644 index 0f4aebd..0000000 --- a/docs/gpt-in-a-box/vm/v0.3/validated_models.md +++ /dev/null @@ -1,16 +0,0 @@ -# Validated Models for Virtual Machine Version - -GPT-in-a-Box 1.0 has been validated on a curated set of HuggingFace models. Information pertaining to these models is stored in the ```llm/model_config.json``` file. - -The Validated Models are : - -| Model Name | HuggingFace Repository ID | -| --- | --- | -| mpt_7b | [mosaicml/mpt_7b](https://huggingface.co/mosaicml/mpt-7b) | -| falcon_7b | [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b) | -| llama2_7b | [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) | -| codellama_7b_python | [codellama/CodeLlama-7b-Python-hf](https://huggingface.co/codellama/CodeLlama-7b-Python-hf) | -| llama2_7b_chat | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | - -!!! note - To start the inference server with any HuggingFace model, refer to [**HuggingFace Model Support**](huggingface_model.md) documentation. 
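Putting the management requests above together, the following sketch registers an additional Model Archive File and waits until its workers report READY. It assumes the default localhost management port, the Python `requests` package, and an example MAR file name that must already be present in the model store:
```
import time

import requests

MANAGEMENT = "http://localhost:8081"
MAR_FILE = "falcon_7b.mar"   # example; must exist in the model store directory
MODEL_NAME = "falcon_7b"

# Register (load) an additional model alongside the one started by run.sh.
resp = requests.post(
    f"{MANAGEMENT}/models",
    params={"url": MAR_FILE, "initial_workers": 1, "synchronous": "true"},
    timeout=600,
)
resp.raise_for_status()

# Validate readiness via the describe request: every "workers" -> "status" must be READY.
while True:
    workers = requests.get(f"{MANAGEMENT}/models/{MODEL_NAME}", timeout=5).json()[0]["workers"]
    if workers and all(w["status"] == "READY" for w in workers):
        break
    time.sleep(5)
print(f"{MODEL_NAME} is ready for inferencing")
```
With `synchronous=true` the registration call itself blocks until the initial worker is created; the polling loop simply re-checks the documented readiness signal before any traffic is sent.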
\ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 6f94729..1bc9ac0 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -187,43 +187,6 @@ nav: - "Manual": "anthos/install/manual/index.md" - "Amazon EKS Anywhere": - "Install": "eksa/install/index.md" - - "GPT-in-a-Box 1.0": - - "Overview": "gpt-in-a-box/overview.md" - - "Deploy on Virtual Machine": - - "v0.3": - - "Getting Started": "gpt-in-a-box/vm/v0.3/getting_started.md" - - "Validated Models": "gpt-in-a-box/vm/v0.3/validated_models.md" - - "Generating Model Archive File": "gpt-in-a-box/vm/v0.3/generating_mar.md" - - "Deploying Inference Server": "gpt-in-a-box/vm/v0.3/inference_server.md" - - "Inference Requests": "gpt-in-a-box/vm/v0.3/inference_requests.md" - - "Model Version Support": "gpt-in-a-box/vm/v0.3/model_version.md" - - "HuggingFace Model Support": "gpt-in-a-box/vm/v0.3/huggingface_model.md" - - "Custom Model Support": "gpt-in-a-box/vm/v0.3/custom_model.md" - - "Management Requests": "gpt-in-a-box/vm/v0.3/management_requests.md" - - "v0.2": - - "Getting Started": "gpt-in-a-box/vm/v0.2/getting_started.md" - - "Generating Model Archive File": "gpt-in-a-box/vm/v0.2/generating_mar.md" - - "Deploying Inference Server": "gpt-in-a-box/vm/v0.2/inference_server.md" - - "Inference Requests": "gpt-in-a-box/vm/v0.2/inference_requests.md" - - "Model Version Support": "gpt-in-a-box/vm/v0.2/model_version.md" - - "Custom Model Support": "gpt-in-a-box/vm/v0.2/custom_model.md" - - "Management Requests": "gpt-in-a-box/vm/v0.2/management_requests.md" - - "Deploy on Kubernetes": - - "v0.2": - - "Getting Started": "gpt-in-a-box/kubernetes/v0.2/getting_started.md" - - "Validated Models": "gpt-in-a-box/kubernetes/v0.2/validated_models.md" - - "Generating Model Archive File": "gpt-in-a-box/kubernetes/v0.2/generating_mar.md" - - "Deploying Inference Server": "gpt-in-a-box/kubernetes/v0.2/inference_server.md" - - "Inference Requests": "gpt-in-a-box/kubernetes/v0.2/inference_requests.md" - - "HuggingFace Model Support": "gpt-in-a-box/kubernetes/v0.2/huggingface_model.md" - - "Custom Model Support": "gpt-in-a-box/kubernetes/v0.2/custom_model.md" - - "v0.1": - - "Getting Started": "gpt-in-a-box/kubernetes/v0.1/getting_started.md" - - "Generating Model Archive File": "gpt-in-a-box/kubernetes/v0.1/generating_mar.md" - - "Deploying Inference Server": "gpt-in-a-box/kubernetes/v0.1/inference_server.md" - - "Inference Requests": "gpt-in-a-box/kubernetes/v0.1/inference_requests.md" - - "Custom Model Support": "gpt-in-a-box/kubernetes/v0.1/custom_model.md" - - "Support": "gpt-in-a-box/support.md" - "Guides": - "Cloud Native": - "Red Hat OpenShift": @@ -238,7 +201,7 @@ markdown_extensions: - tables - toc: permalink: true -copyright: Copyright © 2021 - 2023 Nutanix, Inc. +copyright: Copyright © 2021 - 2024 Nutanix, Inc. extra: generator: false repo_url: https://github.com/nutanix-cloud-native/opendocs