
BERT nightly benchmark on Inferentia1 #2167

Merged: 11 commits into pytorch:master on Mar 20, 2023

Conversation

@namannandan (Collaborator) commented Mar 2, 2023

Description

Benchmark BERT model on Inferentia1 instance

Model artifacts:

Self-hosted runner (inf1.6xlarge):

  • 24 vCPUs
  • 4 Inferentia1 chips (4 neuron cores per chip)

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature testing

Checkpoint file generation

Note: The artifacts above were traced using transformers version 4.6.0, as documented in the Inferentia tutorial. With more recent transformers versions, the model traced for Neuron may generate incorrect inference results (model output is NaN).

$ cd examples/Huggingface_Transformers/
$ cat setup_config.json
{
 "model_name":"bert-base-uncased",
 "mode":"sequence_classification",
 "do_lower_case":true,
 "num_labels":"2",
 "save_mode":"torchscript",
 "max_length":"150",
 "captum_explanation":false,
 "embedding_name": "bert",
 "FasterTransformer":false,
 "BetterTransformer":false,
 "model_parallel":false,
 "hardware": "neuron",
 "batch_size": "2"
}
$ python Download_Transformer_models.py setup_config.json
$ ls Transformer_model/
traced_bert-base-uncased_model_neuron_batch_2.pt
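For reference, the traced checkpoint's filename encodes fields from setup_config.json. A minimal sketch of that naming convention (a hypothetical helper written for illustration, not part of the repo):

```python
import json

def traced_model_filename(config_path):
    """Build the expected traced-model filename from a Huggingface
    setup_config.json (naming mirrors the `ls` output above)."""
    with open(config_path) as f:
        cfg = json.load(f)
    return (f"traced_{cfg['model_name']}_model_neuron"
            f"_batch_{cfg['batch_size']}.pt")
```

With the setup_config.json shown above, this returns traced_bert-base-uncased_model_neuron_batch_2.pt, matching the file produced by Download_Transformer_models.py.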

MAR file generation

$ cat requirements.txt
torch-neuron
$ torch-model-archiver --model-name BERTSeqClassification_torchscript_neuron_batch_2 --version 1.0 --serialized-file ./examples/Huggingface_Transformers/Transformer_model/traced_bert-base-uncased_model_neuron_batch_2.pt --handler ./examples/Huggingface_Transformers/Transformer_handler_generalized_neuron.py --extra-files "./examples/Huggingface_Transformers/setup_config.json,./examples/Huggingface_Transformers/Seq_classification_artifacts/index_to_name.json,./examples/Huggingface_Transformers/Transformer_handler_generalized.py" --requirements-file requirements.txt

Benchmark run

$ cat benchmarks/benchmark_config_neuron.yaml
# TorchServe version to be installed. It can be one of the following options:
#  - branch : "master"
#  - nightly: "2022.3.16"
#  - release: "0.5.3"
# The nightly build will be installed if "ts_version" is not specified
#ts_version:
#    branch: &ts_version "master"

# a list of model configure yaml files defined in benchmarks/models_config
# or a list of model configure yaml files with full path
models:
  - "bert_neuron_batch_2.yaml"

# benchmark on "cpu", "gpu" or "neuron".
# "cpu" is set if "hardware" is not specified
hardware: &hardware "neuron"
$
$ cat benchmarks/models_config/bert_neuron_batch_2.yaml
---
bert:
  scripted_mode:
    benchmark_engine: "ab"
    url: "file:///home/ubuntu/pytorch/model_store/BERTSeqClassification_torchscript_neuron_batch_2.mar"
    workers:
      - 4
    batch_delay: 100
    batch_size:
      - 2
    input: "./examples/Huggingface_Transformers/Seq_classification_artifacts/sample_text.txt"
    requests: 10000
    concurrency: 100
    backend_profiling: False
    exec_env: "local"
    processors:
      - "neuron"
$
$ python benchmarks/auto_benchmark.py --input benchmarks/benchmark_config_neuron.yaml
$ 
$ cat /tmp/ts_benchmark/scripted_mode_bert_w4_b2/ab_report.csv
Benchmark,Batch size,Batch delay,Workers,Model,Concurrency,Input,Requests,TS failed requests,TS throughput,TS latency P50,TS latency P90,TS latency P99,TS latency mean,TS error rate,Model_p50,Model_p90,Model_p99,predict_mean,handler_time_mean,waiting_time_mean,worker_thread_mean,cpu_percentage_mean,memory_percentage_mean,gpu_percentage_mean,gpu_memory_percentage_mean,gpu_memory_used_mean
AB,2,100,4,[.mar](file:///home/ubuntu/pytorch/model_store/BERTSeqClassification_torchscript_neuron_batch_2.mar),100,[input](./examples/Huggingface_Transformers/Seq_classification_artifacts/sample_text.txt),10000,0,436.38,225,234,254,229.157,0.0,17.14,18.07,22.93,17.44,17.31,207.2,0.27,0.0,0.0,0.0,0.0,0.0
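For anyone consuming these reports programmatically, a small sketch that pulls the headline metrics out of ab_report.csv (it assumes only the column names shown in the header above):

```python
import csv
import io

def summarize_ab_report(csv_text):
    """Extract the headline metrics from a benchmark ab_report.csv."""
    row = next(csv.DictReader(io.StringIO(csv_text)))
    return {
        "throughput_rps": float(row["TS throughput"]),
        "latency_p99_ms": float(row["TS latency P99"]),
        "error_rate": float(row["TS error rate"]),
    }
```

On the report above this yields a throughput of 436.38 req/s, P99 latency of 254 ms, and a 0.0 error rate.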

Workflow test

Test branch: test-neuron-benchmark-workflow
Successful workflow run and artifacts: https://github.com/pytorch/serve/actions/runs/4352309159

Benchmark results:

TorchServe Benchmark on neuron

Date: 2023-03-07 23:18:08

TorchServe Version: torchserve-nightly==2023.3.6

scripted_mode_bert_neuron_batch_1

version Benchmark Batch size Batch delay Workers Model Concurrency Input Requests TS failed requests TS throughput TS latency P50 TS latency P90 TS latency P99 TS latency mean TS error rate Model_p50 Model_p90 Model_p99 predict_mean handler_time_mean waiting_time_mean worker_thread_mean cpu_percentage_mean memory_percentage_mean gpu_percentage_mean gpu_memory_percentage_mean gpu_memory_used_mean
torchserve-nightly==2023.3.6 AB 1 100 4 .mar 100 input 10000 0 196.42 505 601 678 509.115 0.0 14.31 39.43 51.18 19.52 19.42 485.66 0.23 0.0 8.5 0.0 0.0 0.0

scripted_mode_bert_neuron_batch_2

version Benchmark Batch size Batch delay Workers Model Concurrency Input Requests TS failed requests TS throughput TS latency P50 TS latency P90 TS latency P99 TS latency mean TS error rate Model_p50 Model_p90 Model_p99 predict_mean handler_time_mean waiting_time_mean worker_thread_mean cpu_percentage_mean memory_percentage_mean gpu_percentage_mean gpu_memory_percentage_mean gpu_memory_used_mean
torchserve-nightly==2023.3.6 AB 2 100 4 .mar 100 input 10000 0 575.7 163 198 230 173.702 0.0 12.28 14.42 34.95 13.15 13.06 156.46 0.18 0.0 0.0 0.0 0.0 0.0

scripted_mode_bert_neuron_batch_4

version Benchmark Batch size Batch delay Workers Model Concurrency Input Requests TS failed requests TS throughput TS latency P50 TS latency P90 TS latency P99 TS latency mean TS error rate Model_p50 Model_p90 Model_p99 predict_mean handler_time_mean waiting_time_mean worker_thread_mean cpu_percentage_mean memory_percentage_mean gpu_percentage_mean gpu_memory_percentage_mean gpu_memory_used_mean
torchserve-nightly==2023.3.6 AB 4 100 4 .mar 100 input 10000 0 652.79 149 150 240 153.188 0.0 22.97 23.16 23.65 23.12 23.02 125.6 0.48 0.0 0.0 0.0 0.0 0.0

scripted_mode_bert_neuron_batch_8

version Benchmark Batch size Batch delay Workers Model Concurrency Input Requests TS failed requests TS throughput TS latency P50 TS latency P90 TS latency P99 TS latency mean TS error rate Model_p50 Model_p90 Model_p99 predict_mean handler_time_mean waiting_time_mean worker_thread_mean cpu_percentage_mean memory_percentage_mean gpu_percentage_mean gpu_memory_percentage_mean gpu_memory_used_mean
torchserve-nightly==2023.3.6 AB 8 100 4 .mar 100 input 10000 0 649.43 150 164 171 153.98 0.0 46.95 47.26 48.51 47.05 46.93 101.99 0.48 0.0 0.0 0.0 0.0 0.0
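To put the four runs above side by side, a quick sketch that computes throughput speedup relative to batch size 1 (numbers copied from the tables above):

```python
# (TS throughput req/s, TS latency mean ms) per batch size, from the
# nightly benchmark tables above.
results = {1: (196.42, 509.115), 2: (575.7, 173.702),
           4: (652.79, 153.188), 8: (649.43, 153.98)}

def scaling_vs_batch_1(results):
    """Throughput speedup of each batch size relative to batch size 1."""
    base = results[1][0]
    return {bs: round(tput / base, 2) for bs, (tput, _) in results.items()}
```

This yields roughly 2.93x at batch 2 but plateaus near 3.3x at batches 4 and 8, i.e. throughput stops scaling past batch size 4 on this instance.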

Consolidated benchmark workflow test

Test branch: test-neuron-benchmark-workflow
Successful workflow run: https://github.com/pytorch/serve/actions/runs/4400212613

Regression test

CPU: neuron_benchmark_regression_log_cpu.txt
GPU: neuron_benchmark_regression_log_gpu.txt

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

codecov bot commented Mar 2, 2023

Codecov Report

Merging #2167 (8a27cd5) into master (1768902) will increase coverage by 0.03%.
The diff coverage is n/a.

❗ Current head 8a27cd5 differs from pull request most recent head fd3743d. Consider uploading reports for the commit fd3743d to get more accurate results

@@            Coverage Diff             @@
##           master    #2167      +/-   ##
==========================================
+ Coverage   71.41%   71.45%   +0.03%     
==========================================
  Files          73       73              
  Lines        3296     3296              
  Branches       57       57              
==========================================
+ Hits         2354     2355       +1     
+ Misses        942      941       -1     

see 2 files with indirect coverage changes


@namannandan namannandan marked this pull request as ready for review March 4, 2023 01:41
@lxning (Collaborator) commented Mar 6, 2023

Please file a ticket with the Inferentia team about the transformers==4.6.0 issue.

@agunapal (Collaborator) left a comment

@namannandan If we are going to run this nightly, it would be good to add it to the benchmark_gpu workflow and modify it to run on both machines with the matrix command. The yaml file can be changed with an if/else statement.

@namannandan (Collaborator, Author) commented Mar 7, 2023

@lxning Inferentia team now has an internal ticket to track the issue with torch-neuron library being unable to correctly trace models when transformers package version is >4.19. Currently, the model archives I've prepared have checkpoint files that were traced using transformers version 4.6.0. I tested that the inference with these model archives works as expected with the latest transformers package version as of this writing which is 4.26.1. So the issue is only with tracing the model. We should be able to go ahead and start running the Inf1 benchmark and once the Inferentia team fixes the issue with torch-neuron we can just re-create the model archives and upload them to the model zoo. Please let me know your thoughts.

@agunapal since these models are traced and intended to run on Inferentia1 which is an additional hardware platform alongside CPU and GPU, would it make sense to maintain this in a separate workflow? Unless there is a downside to creating separate workflows for different hardware platforms. Please let me know your thoughts.

@lxning (Collaborator) commented Mar 7, 2023

@namannandan you can test the workflows you created in this PR and link the results here.

@namannandan (Collaborator, Author):

@lxning I tested the workflow and linked the benchmark results in the PR summary above.

Based on an offline discussion with @agunapal, consolidating the benchmark workflow files and using the matrix command to run on different runners makes sense. This can be implemented in a separate PR.

@lxning (Collaborator) commented Mar 8, 2023

@namannandan

  • Can you run the regression test to make sure there are no breaks in the existing tests for HF transformers?
  • As @ankithagunapal suggested, update the GitHub workflow by using a matrix in this PR so that a single PR covers everything for this task.

Thanks a lot.

@namannandan (Collaborator, Author) commented Mar 13, 2023

@lxning, @agunapal

sudo apt-get install -y apache2-utils
pip install -r benchmarks/requirements-ab.txt
export omp_num_threads=1
sudo apt-get update -y
@agunapal (Collaborator) commented:

Thanks for consolidating the yml files.
L41-43 are common across platforms and can sit outside the if block.

@msaroufim @min-jean-cho
Is export omp_num_threads=1 applicable to CPU benchmarks only?

@min-jean-cho (Collaborator) commented:

Yes, currently it's only applicable to CPU benchmarks. By the way, I noticed export omp_num_threads=1 doesn't correctly set OMP_NUM_THREADS to 1, #2151 You may want to double check in the github action.
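The lowercase/uppercase mix-up noted above bites because POSIX environment variable names are case-sensitive, so OpenMP never sees the lowercase variable. A minimal repro (plain Python, nothing TorchServe-specific):

```python
import os

# `export omp_num_threads=1` defines a *different* variable than
# OMP_NUM_THREADS: environment variable names are case-sensitive.
os.environ.pop("OMP_NUM_THREADS", None)  # start from a clean slate
os.environ["omp_num_threads"] = "1"      # what the lowercase export does

print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))  # None
print("omp_num_threads:", os.environ.get("omp_num_threads"))  # 1
```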

@namannandan (Collaborator, Author) commented Mar 13, 2023

@agunapal makes sense, I'll update the install dependencies step.

@min-jean-cho thanks for spotting that! I'll fix the env var.

@min-jean-cho (Collaborator) commented:

Thanks @namannandan. By the way, I recall that setting the environment variable with export OMP_NUM_THREADS=1 did not seem to correctly set the number of threads: https://github.com/pytorch/serve/pull/2151/files#diff-cac3a24029ba9498c7e1735f8fc6e65b5a8a090d7f015bd3c35051f57a9981caR178-R179. You may also want to double-check, thanks!

@namannandan (Collaborator, Author) commented Mar 13, 2023

Ah, I see; I'll double-check that. I wonder if using the env key in the Benchmark cpu nightly step would do the trick (it is documented here). I'll try this method as well.

@namannandan (Collaborator, Author) commented Mar 14, 2023

Looks like using the env key works as expected:
https://github.com/pytorch/serve/actions/runs/4410687740/workflow#L48

cpu benchmark logs:

OMP_NUM_THREADS:  1
torch.get_num_threads:  1
NEURON_RT_NUM_CORES: 

gpu benchmark logs:

OMP_NUM_THREADS: 
torch.get_num_threads:  24
NEURON_RT_NUM_CORES:

Inf1 benchmark logs:

OMP_NUM_THREADS: 
torch.get_num_threads:  12
NEURON_RT_NUM_CORES:  4
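The env key approach that produced the logs above can be sketched as follows; step names, config file names, and values here are illustrative, not the exact workflow contents:

```yaml
# Illustrative fragment of a benchmark workflow; the actual step
# names and config paths live in the workflow file in the repo.
- name: Benchmark cpu nightly
  if: ${{ matrix.hardware == 'cpu' }}
  env:
    OMP_NUM_THREADS: 1   # set per-step, avoids the lowercase-export pitfall
  run: python benchmarks/auto_benchmark.py --input benchmarks/benchmark_config_cpu.yaml

- name: Benchmark inf1 nightly
  if: ${{ matrix.hardware == 'inf1' }}
  env:
    NEURON_RT_NUM_CORES: 4
  run: python benchmarks/auto_benchmark.py --input benchmarks/benchmark_config_neuron.yaml
```

Setting the variables through the step-level env key is what makes them visible in the benchmark logs shown above.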

Naman Nandan and others added 2 commits March 13, 2023 18:16
Add necessary env variables for cpu and inf1
Disable fail-fast to enable all benchmarks to run even if one of them fails
@agunapal (Collaborator) left a comment

LGTM

@namannandan (Collaborator, Author):

Successful consolidated benchmark workflow run: https://github.com/pytorch/serve/actions/runs/4451500075

@namannandan namannandan merged commit 6daaa42 into pytorch:master Mar 20, 2023
morgandu pushed a commit to morgandu/pytorch-serve that referenced this pull request Mar 25, 2023
* BERT nightly benchmark on Inferentia1

* Consolidate neuron benchmark model config files into a single file for BERT
Set the NEURON_RT_NUM_CORES value as a string in the inf1 nightly benchmark workflow file

* Update transformer model downloader documentation

* test workflow before merge

* Consolidate benchmark workflows

* Update runs-on syntax

* Remove hardware specific benchmark workflow files

* Consolidate install dependencies step
Add necessary env variables for cpu and inf1
Disable fail-fast to enable all benchmarks to run even if one of them fails

* update documentation

---------

Co-authored-by: Naman Nandan <namannan@amazon.com>