Accuracy eval mlperf #76

Merged
merged 10 commits on May 7, 2024
26 changes: 23 additions & 3 deletions benchmarks/README.md
@@ -7,7 +7,7 @@ cd ~/JetStream/benchmarks
pip install -r requirements.in
```

## Benchmark
## Benchmark with shareGPT

### Prepare DataSet

@@ -61,11 +61,31 @@ python benchmark_serving.py \

```

## Benchmark with openorca dataset (openorca is used by MLPerf inference for LLaMA2 models)
```
python JetStream/benchmarks/benchmark_serving.py \
--tokenizer ~/maxtext/assets/tokenizer.llama2 \
--warmup-first true \
--save-result \
--save-request-outputs \
--request-outputs-file-path outputs.json \
--num-prompts 1000 \
--max-output-length 1024 \
--dataset openorca

```

## Standalone Evaluation Run

If you used `--save-request-outputs`, you can separately evaluate against the generated outputs.

```
python eval_accuracy.py
python eval_accuracy.py outputs.json

```

With the openorca dataset and llama2-chat models (as used by MLPerf), here are the reference accuracy numbers:
```
llama2-7b-chat {'rouge1': 42.0706, 'rouge2': 19.8021, 'rougeL': 26.8474, 'rougeLsum': 39.5952, 'gen_len': 1146679, 'gen_num': 998}
llama2-70b-chat {'rouge1': 44.4312, 'rouge2': 22.0352, 'rougeL': 28.6162}
```
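
As a rough cross-check of these numbers, ROUGE F-measures can be computed directly from (reference, generated) text pairs. The sketch below uses the `rouge_score` package and a simple per-pair averaging scheme; both are illustrative assumptions, not necessarily what `eval_accuracy.py` does internally.

```
# Illustrative sketch only: average ROUGE F-measures (scaled to 0-100)
# over paired reference/prediction strings.
from rouge_score import rouge_scorer

def rouge_f1(references, predictions):
  kinds = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
  scorer = rouge_scorer.RougeScorer(kinds, use_stemmer=True)
  totals = {k: 0.0 for k in kinds}
  for ref, pred in zip(references, predictions):
    scores = scorer.score(ref, pred)
    for k in kinds:
      totals[k] += scores[k].fmeasure
  n = max(len(predictions), 1)
  return {k: round(100.0 * totals[k] / n, 4) for k in kinds}
```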
26 changes: 18 additions & 8 deletions benchmarks/benchmark_serving.py
@@ -66,13 +66,17 @@
import random
import time
from typing import Any, AsyncGenerator, Optional
import os


import grpc
from jetstream.core.proto import jetstream_pb2
from jetstream.core.proto import jetstream_pb2_grpc
from jetstream.engine.token_utils import load_vocab
import numpy as np
from tqdm.asyncio import tqdm # pytype: disable=pyi-error
import pandas

from eval_accuracy import eval_accuracy


@@ -163,14 +167,20 @@ def load_sharegpt_dataset(
return dataset


def load_openorca_dataset(dataset_path: str) -> list[tuple[Any, Any]]:
# Load the dataset.
with open(dataset_path, "r", encoding="utf-8") as f:
dataset = json.load(f)
def load_openorca_dataset_pkl():
# read pickle file
samples = pandas.read_pickle(
os.path.join(
os.path.dirname(os.path.relpath(__file__)),
"open_orca_gpt4_tokenized_llama.calibration_1000.pkl",
)
)

# Tokenize the prompts and completions.
prompts = dataset["prompts"]
outputs = dataset["results"]
prompts = []
outputs = []
for _, row in samples.iterrows():
prompts.append(row["input"])
outputs.append(row["output"])

return [(prompt, output) for prompt, output in zip(prompts, outputs)]
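
A quick, standalone way to sanity-check the pickle schema this loader expects (an illustrative sketch, not part of the diff):

```
# Not part of this PR: verify the pickle has the "input"/"output" columns
# that load_openorca_dataset_pkl reads from each row.
import pandas as pd

samples = pd.read_pickle("open_orca_gpt4_tokenized_llama.calibration_1000.pkl")
assert {"input", "output"} <= set(samples.columns)
print(len(samples), "samples")
print(samples.iloc[0]["input"][:120])  # preview of the first prompt
```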

@@ -542,7 +552,7 @@ def main(args: argparse.Namespace):
) # e.g. [("AB", 2, "AB", 3)]
else:
if args.dataset == "openorca":
dataset = load_openorca_dataset(args.dataset_path)
dataset = load_openorca_dataset_pkl()
elif args.dataset == "sharegpt":
dataset = load_sharegpt_dataset(
args.dataset_path,
Binary file not shown.