Jetstream support #677

Merged · 16 commits · May 24, 2024
8 changes: 4 additions & 4 deletions benchmarks/README.md
@@ -34,7 +34,7 @@ cd infra/stage-1

# Copy the sample variables and update the project ID, cluster name and other
# parameters as needed in the `terraform.tfvars` file.
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/gpu-sample.tfvars terraform.tfvars

# Initialize the Terraform modules.
terraform init
@@ -67,7 +67,7 @@ cd infra/stage-2
# and the project name and bucket name parameters as needed in the
# `terraform.tfvars` file. You can specify a new bucket name in which case it
# will be created.
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/gpu-sample.tfvars terraform.tfvars

# Initialize the Terraform modules.
terraform init
@@ -88,7 +88,7 @@ cd inference-server/text-generation-inference
# Copy the sample variables and update the project number and cluster name in
# the fleet_host variable "https://connectgateway.googleapis.com/v1/projects/<project-number>/locations/global/gkeMemberships/<cluster-name>"
# in the `terraform.tfvars` file.
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/gpu-sample.tfvars terraform.tfvars

# Initialize the Terraform modules.
terraform init
@@ -120,7 +120,7 @@ cd benchmark/tools/locust-load-inference
# Copy the sample variables and update the project number and cluster name in
# the fleet_host variable "https://connectgateway.googleapis.com/v1/projects/<project-number>/locations/global/gkeMemberships/<cluster-name>"
# in the `terraform.tfvars` file.
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/tgi-sample.tfvars terraform.tfvars

# Initialize the Terraform modules.
terraform init
10 changes: 6 additions & 4 deletions benchmarks/benchmark/tools/locust-load-inference/README.md
@@ -37,6 +37,7 @@ The Locust benchmarking tool currently supports these frameworks:
- tensorrt_llm_triton
- text generation inference (tgi)
- vllm
- jetstream

## Instructions

@@ -49,7 +50,7 @@ This is my first prompt.\n
This is my second prompt.\n
```

Example prompt datasets are available in the "../../dataset" folder with python scripts and instructions on how to make the dataset available for consumption by this benchmark. The dataset used in the `sample-terraform.tfvars` is the "ShareGPT_v3_unflitered_cleaned_split".
Example prompt datasets are available in the "../../dataset" folder with python scripts and instructions on how to make the dataset available for consumption by this benchmark. The dataset used in the `./sample-tfvars/tgi-sample.tfvars` is the "ShareGPT_v3_unflitered_cleaned_split".

You will set the `gcs_path` in your `terraform.tfvars` to this gcs path containing your prompts.
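
For example, if the filtered prompts file were uploaded to a hypothetical bucket named `my-prompt-bucket`, the entry would look something like:

```
gcs_path = "gs://my-prompt-bucket/filtered_prompts.txt"
```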

@@ -100,10 +101,10 @@ gcloud artifacts repositories create ai-benchmark --location=us-central1 --repos

### Step 6: create and configure terraform.tfvars

Create a `terraform.tfvars` file. `sample-terraform.tfvars` is provided as an example file. You can copy the file as a starting point. Note that at a minimum you will have to change the existing `credentials_config`, `project_id`, and `artifact_registry`.
Create a `terraform.tfvars` file. `./sample-tfvars/tgi-sample.tfvars` is provided as an example file. You can copy the file as a starting point. Note that at a minimum you will have to change the existing `credentials_config`, `project_id`, and `artifact_registry`.

```bash
cp sample-terraform.tfvars terraform.tfvars
cp ./sample-tfvars/tgi-sample.tfvars terraform.tfvars
```

Fill out your `terraform.tfvars` with the desired model and server configuration, referring to the list of required and optional variables [here](#variables). The following variables are required:
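
As a minimal sketch (every value below is a placeholder to replace with your own):

```
credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/<project-number>/locations/global/gkeMemberships/<cluster-name>"
}
project_id        = "<project-id>"
artifact_registry = "<registry-location>"
```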
@@ -265,5 +266,6 @@ To change the benchmark configuration, you will have to rerun terraform destroy
| <a name="input_sax_model"></a> [sax\_model](#input\_sax\_model) | Benchmark server configuration for sax model. Only required if framework is sax. | `string` | `""` | no |
| <a name="input_tokenizer"></a> [tokenizer](#input\_tokenizer) | Benchmark server configuration for tokenizer. | `string` | `"tiiuae/falcon-7b"` | yes |
| <a name="input_use_beam_search"></a> [use\_beam\_search](#input\_use\_beam\_search) | Benchmark server configuration for use beam search. | `bool` | `false` | no |
<a name="huggingface_secret"></a> [huggingface_secret](#input\_huggingface_secret) | Name of the kubectl huggingface secret token | `string` | `huggingface-secret` | no |
<a name="huggingface_secret"></a> [huggingface_secret](#input\_huggingface_secret) | Name of the secret holding the huggingface token. Stored in GCP Secrets Manager. | `string` | `huggingface-secret` | no |
<a name="k8s_hf_secret"></a> [k8s_hf_secret](#input\_huggingface_secret) | Name of the secret holding the huggingface token. Stored in K8s. Key is expected to be named: `HF_TOKEN`. See [here](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kubectl/#use-raw-data) for more. | `string` | `huggingface-secret` | no |
<!-- END_TF_DOCS -->
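
For reference, a K8s secret suitable for `k8s_hf_secret` can be created with a command along these lines (the name `hf-token` is only an example, matching the sample tfvars; the key must be named `HF_TOKEN`):

```
kubectl create secret generic hf-token \
  --from-literal=HF_TOKEN=<your-huggingface-token>
```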
1 change: 1 addition & 0 deletions benchmarks/benchmark/tools/locust-load-inference/main.tf
@@ -47,6 +47,7 @@ locals {
tokenizer = var.tokenizer
use_beam_search = var.use_beam_search
hugging_face_token_secret_list = local.hugging_face_token_secret == null ? [] : [local.hugging_face_token_secret]
k8s_hf_secret_list = var.k8s_hf_secret == null ? [] : [var.k8s_hf_secret]
stop_timeout = var.stop_timeout
request_type = var.request_type
})) : data]
@@ -48,6 +48,13 @@ spec:
- name: USE_BEAM_SEARCH
value: ${use_beam_search}
%{ for hugging_face_token_secret in hugging_face_token_secret_list ~}
- name: HUGGINGFACE_TOKEN
valueFrom:
secretKeyRef:
name: hf-key
key: HF_TOKEN
%{ endfor ~}
%{ for hf_token in k8s_hf_secret_list ~}
- name: HUGGINGFACE_TOKEN
valueFrom:
secretKeyRef:
@@ -0,0 +1,29 @@
credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/PROJECT_NUMBER/locations/global/gkeMemberships/ai-benchmark"
}

project_id = "PROJECT_ID"

namespace = "default"
ksa = "benchmark-sa"
request_type = "grpc"

k8s_hf_secret = "hf-token"


# Locust service configuration
artifact_registry = "REGISTRY_LOCATION"
inference_server_service = "jetstream-svc:9000"
locust_runner_kubernetes_service_account = "sample-runner-sa"
output_bucket = "${PROJECT_ID}-benchmark-output-bucket-01"
gcs_path = "PATH_TO_PROMPT_BUCKET"

# Benchmark configuration for Locust Docker accessing inference server
inference_server_framework = "jetstream"
tokenizer = "google/gemma-7b"

# Benchmark configuration for triggering single test via Locust Runner
test_duration = 60
# Increase test_users to allow more parallelism (especially when testing HPA)
test_users = 1
test_rate = 5
@@ -197,8 +197,16 @@ variable "run_test_automatically" {
default = false
}

// TODO: add validation to make k8s_hf_secret & hugging_face_secret mutually exclusive once terraform is updated with: https://discuss.hashicorp.com/t/experiment-feedback-input-variable-validation-can-cross-reference-other-objects/66644
variable "k8s_hf_secret" {
description = "Name of secret for huggingface token; stored in k8s "
type = string
nullable = true
default = null
}

variable "hugging_face_secret" {
description = "name of the kubectl huggingface secret token"
description = "name of the kubectl huggingface secret token; stored in Secret Manager. Security considerations: https://kubernetes.io/docs/concepts/security/secrets-good-practices/"
type = string
nullable = true
default = null
151 changes: 151 additions & 0 deletions benchmarks/inference-server/jetstream/README.md
@@ -0,0 +1,151 @@
# AI on GKE Benchmarking for JetStream

Deploying and benchmarking JetStream on TPU has many similarities with the standard GPU path, but the differences are distinct enough to warrant a separate README. If you are familiar with deploying on GPU, much of this should be familiar. For a more detailed explanation of each step, refer to our primary benchmarking [README](https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/benchmarks).

## Pre-requisites
- [kaggle user/token](https://www.kaggle.com/docs/api)
- [huggingface user/token](https://huggingface.co/docs/hub/en/security-tokens)

### Creating K8s infra

To create our TPU cluster, run:

```
# Stage 1 creates the cluster.
cd infra/stage-1

# Copy the sample variables and update the project ID, cluster name and other
# parameters as needed in the `terraform.tfvars` file.
cp sample-tfvars/jetstream-sample.tfvars terraform.tfvars

# Initialize the Terraform modules.
terraform init

# Run plan to see the changes that will be made.
terraform plan

# Run apply if the changes look good by confirming the prompt.
terraform apply
```
To verify that the cluster has been set up correctly, run:
```
# Get credentials using fleet membership
gcloud container fleet memberships get-credentials <cluster-name>

# Run a kubectl command to verify
kubectl get nodes
```
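
Because this is a TPU cluster, you can also confirm that the TPU node pool registered by filtering on the accelerator label used later in this guide:

```
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice
```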

## Configure the cluster

To configure the cluster to run inference workloads we need to set up workload identity and GCS Fuse.
```
# Stage 2 configures the cluster for running inference workloads.
cd infra/stage-2

# Copy the sample variables and update the project number and cluster name in
# the fleet_host variable "https://connectgateway.googleapis.com/v1/projects/<project-number>/locations/global/gkeMemberships/<cluster-name>"
# and the project name and bucket name parameters as needed in the
# `terraform.tfvars` file. You can specify a new bucket name in which case it
# will be created.
cp sample-tfvars/jetstream-sample.tfvars terraform.tfvars

# Initialize the Terraform modules.
terraform init

# Run plan to see the changes that will be made.
terraform plan

# Run apply if the changes look good by confirming the prompt.
terraform apply
```

### Convert Gemma model weights to MaxText weights

JetStream has [two engine implementations](https://github.com/google/JetStream?tab=readme-ov-file#jetstream-engine-implementation): a Jax variant (via MaxText) and a PyTorch variant. This guide uses the Jax backend.

JetStream currently requires that models be converted to MaxText weights. This example deploys a Gemma-7b model; much of this information follows the guide [here](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-tpu-jetstream#convert-checkpoints).

*SKIP IF ALREADY COMPLETED*

Create the Kaggle secret:
```
kubectl create secret generic kaggle-secret \
--from-file=kaggle.json
```

Replace `GEMMA_BUCKET_NAME` in `model-conversion/kaggle_converter.yaml` with the bucket name where you would like the model to be stored.

***NOTE:*** If you are using a different bucket than the ones you created, give the service account Storage Admin permissions on that bucket. This can be done in the UI or by running:
```
gcloud projects add-iam-policy-binding PROJECT_ID \
--member "serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
--role roles/storage.admin
```
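
Alternatively, to scope the grant to just that bucket rather than the whole project, a bucket-level binding along these lines should work:

```
gcloud storage buckets add-iam-policy-binding gs://GEMMA_BUCKET_NAME \
  --member "serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
  --role roles/storage.admin
```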

Run:
```
kubectl apply -f model-conversion/kaggle_converter.yaml
```

This should take ~10 minutes to complete.
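
You can follow the conversion job's progress with, for example (the job name `data-loader-7b` comes from `model-conversion/kaggle_converter.yaml`):

```
kubectl logs -f job/data-loader-7b
```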

### Deploy JetStream

Replace `GEMMA_BUCKET_NAME` in `jetstream.yaml` with the same bucket name as above.

Run:
```
kubectl apply -f jetstream.yaml
```

Verify the pod is running with:
```
kubectl get pods
```
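
Or block until the deployment reports ready, for example:

```
kubectl wait --for=condition=Available deployment/maxengine-server --timeout=10m
```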

Get the external IP with:

```
kubectl get services
```
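
If you want the IP in a shell variable for the request below, a jsonpath query like this should work (`jetstream-svc` is the service defined in `jetstream.yaml`):

```
JETSTREAM_EXTERNAL_IP=$(kubectl get service jetstream-svc \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
```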

You can then send a prompt request with:
```
curl --request POST \
--header "Content-type: application/json" \
-s \
JETSTREAM_EXTERNAL_IP:8000/generate \
--data \
'{
"prompt": "What is a TPU?",
"max_tokens": 200
}'
```

### Deploy the benchmark

To prepare the dataset for the Locust inference benchmark, view the README.md file in:
```
cd benchmark/dataset/ShareGPT_v3_unflitered_cleaned_split
```

To deploy the Locust inference benchmark with the above model, run
```
cd benchmark/tools/locust-load-inference

# Copy the sample variables and update the project number and cluster name in
# the fleet_host variable "https://connectgateway.googleapis.com/v1/projects/<project-number>/locations/global/gkeMemberships/<cluster-name>"
# in the `terraform.tfvars` file.
cp sample-tfvars/jetstream-sample.tfvars terraform.tfvars

# Initialize the Terraform modules.
terraform init

# Run plan to see the changes that will be made.
terraform plan

# Run apply if the changes look good by confirming the prompt.
terraform apply
```
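
Once a benchmark run completes, results are written to the output bucket configured in your tfvars; you can list them with something like:

```
gcloud storage ls gs://<project-id>-benchmark-output-bucket-01
```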

To further interact with the Locust inference benchmark, view the README.md file in `benchmark/tools/locust-load-inference`.
63 changes: 63 additions & 0 deletions benchmarks/inference-server/jetstream/jetstream.yaml
@@ -0,0 +1,63 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: maxengine-server
spec:
replicas: 1
selector:
matchLabels:
app: maxengine-server
template:
metadata:
labels:
app: maxengine-server
spec:
serviceAccountName: benchmark-sa
nodeSelector:
cloud.google.com/gke-tpu-topology: 2x2
cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
containers:
- name: maxengine-server
image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.0
args:
- model_name=gemma-7b
- tokenizer_path=assets/tokenizer.gemma
- per_device_batch_size=4
- max_prefill_predict_length=1024
- max_target_length=2048
- async_checkpointing=false
- ici_fsdp_parallelism=1
- ici_autoregressive_parallelism=-1
- ici_tensor_parallelism=1
- scan_layers=false
- weight_dtype=bfloat16
- load_parameters_path=gs://GEMMA_BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
ports:
- containerPort: 9000
resources:
requests:
google.com/tpu: 4
limits:
google.com/tpu: 4
- name: jetstream-http
image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.0
ports:
- containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
name: jetstream-svc
spec:
selector:
app: maxengine-server
ports:
- protocol: TCP
name: http
port: 8000
targetPort: 8000
- protocol: TCP
name: grpc
port: 9000
targetPort: 9000
type: LoadBalancer
@@ -0,0 +1,33 @@
apiVersion: batch/v1
kind: Job
metadata:
name: data-loader-7b
spec:
ttlSecondsAfterFinished: 30
template:
spec:
serviceAccountName: benchmark-sa
restartPolicy: Never
containers:
- name: inference-checkpoint
image: us-docker.pkg.dev/cloud-tpu-images/inference/inference-checkpoint:v0.2.0
args:
- -b=GEMMA_BUCKET_NAME
- -m=google/gemma/maxtext/7b-it/2
volumeMounts:
- mountPath: "/kaggle/"
name: kaggle-credentials
readOnly: true
resources:
requests:
google.com/tpu: 4
limits:
google.com/tpu: 4
nodeSelector:
cloud.google.com/gke-tpu-topology: 2x2
cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
volumes:
- name: kaggle-credentials
secret:
defaultMode: 0400
secretName: kaggle-secret