[release-test] A100 3803mnist_hogwild-latency increase 10x on linux.aws.a100 vs [a100-runner] #2573

atalman · 2025-01-21T22:15:40Z

Looking at the results of the run for 2.6.0 vs 2.5.1:
https://github.com/pytorch/benchmark/actions/runs/12878326305/job/35904096937
Benchmark,pytorch-2.5.1-cuda-12.4,pytorch-2.6.0-cuda-12.4
mnist-cpu_memory,1118.67,1146.76
mnist-gpu_memory,0.0,0.0
mnist-latency,42.46,40.00
mnist_hogwild-cpu_memory,556.57,601.289
mnist_hogwild-gpu_memory,0.0,0.0
mnist_hogwild-latency,671.28,586.02
wlm_cpu_lstm-cpu_memory,885.141,907.066
wlm_cpu_lstm-gpu_memory,0.0,0.0
wlm_cpu_lstm-latency,1266.83,1079.37
wlm_cpu_trans-cpu_memory,852.113,899.531
wlm_cpu_trans-gpu_memory,0.0,0.0
wlm_cpu_trans-latency,1081.98,1078.99
wlm_gpu_lstm-cpu_memory,995.402,954.391
wlm_gpu_lstm-gpu_memory,0.0,0.0
wlm_gpu_lstm-latency,54.78,52.76
wlm_gpu_trans-cpu_memory,1007.86,993.949
wlm_gpu_trans-gpu_memory,0.0,0.0
wlm_gpu_trans-latency,56.41,55.54
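To make the version-to-version comparison easier to read, here is a minimal sketch that computes the percent change for each latency row of the table above. The function name `latency_changes` and the constant `RESULTS_CSV` are hypothetical names chosen for illustration; the numbers are copied from the table, and the units are whatever the benchmark reports.

```python
import csv
import io

# Latency rows copied from the 2.5.1 vs 2.6.0 comparison above.
RESULTS_CSV = """\
Benchmark,pytorch-2.5.1-cuda-12.4,pytorch-2.6.0-cuda-12.4
mnist-latency,42.46,40.00
mnist_hogwild-latency,671.28,586.02
wlm_cpu_lstm-latency,1266.83,1079.37
wlm_cpu_trans-latency,1081.98,1078.99
wlm_gpu_lstm-latency,54.78,52.76
wlm_gpu_trans-latency,56.41,55.54
"""


def latency_changes(csv_text):
    """Return {benchmark: percent change from the first to the second version column}."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    changes = {}
    for name, old, new in reader:
        old_v, new_v = float(old), float(new)
        changes[name] = (new_v - old_v) / old_v * 100.0
    return changes


if __name__ == "__main__":
    # A negative value means the second column (2.6.0) has lower latency.
    for name, pct in latency_changes(RESULTS_CSV).items():
        print(f"{name}: {pct:+.1f}%")
```

In this run every latency row is lower on 2.6.0 than on 2.5.1, so the table does not show a regression between the two releases on the same runner.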

Run 2.4.1 vs 2.5.0 (mnist_hogwild only):
https://github.com/pytorch/benchmark/actions/runs/12895573722
Benchmark,pytorch-2.4.1-cuda-12.4,pytorch-2.5.0-cuda-12.4
mnist_hogwild-cpu_memory,561.797,556.758
mnist_hogwild-gpu_memory,0.0,0.0
mnist_hogwild-latency,613.91,610.53

Run 2.5.1 vs 2.6.0 (mnist_hogwild only):
https://github.com/pytorch/benchmark/actions/runs/12894636482
Benchmark,pytorch-2.5.1-cuda-12.4,pytorch-2.6.0-cuda-12.4
mnist_hogwild-cpu_memory,561.73,579.324
mnist_hogwild-gpu_memory,0.0,0.0
mnist_hogwild-latency,592.67,599.23

Comparing the mnist_hogwild-latency number with a run on the A100 hosted on GCP, I see a 10x difference:
Run 2.4.1 vs 2.5.0:
3803mnist_hogwild-latency,61.42,62.19
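As a quick check of the "10x" figure, the sketch below divides the linux.aws.a100 latencies from the 2.4.1 vs 2.5.0 run by the GCP numbers for the same versions. The dictionary and function names are hypothetical; the values are copied from this thread.

```python
# mnist_hogwild-latency for the 2.4.1 vs 2.5.0 run, copied from the thread.
aws_latency = {"2.4.1": 613.91, "2.5.0": 610.53}  # linux.aws.a100
gcp_latency = {"2.4.1": 61.42, "2.5.0": 62.19}    # A100 hosted on GCP


def slowdown(aws, gcp):
    """Return {version: AWS latency divided by GCP latency}."""
    return {version: aws[version] / gcp[version] for version in aws}


if __name__ == "__main__":
    for version, ratio in slowdown(aws_latency, gcp_latency).items():
        print(f"pytorch-{version}: {ratio:.2f}x")
```

Both ratios come out just under 10, so the gap is between the two runners and is about the same for both PyTorch versions.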

xuzhao9 · 2025-01-22T00:05:14Z

mnist_hogwild is a CPU-only benchmark.

CPU hardware spec of GCP A100 ("a2-highgpu-1g"):

Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70 GHz, 12 threads, 85 GB DRAM

jeanschmidt · 2025-01-22T15:52:52Z

Yes, those AWS instances have much weaker CPUs than the GCP A100 instance. They have only 10 or 11 cores available.
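One way to confirm the core count on a given runner is to print what the Python process can actually see. This is a minimal sketch using only the standard library; the helper name `visible_cpus` is hypothetical, and `os.sched_getaffinity` is only available on some platforms (Linux among them), so the code falls back to `os.cpu_count()` elsewhere.

```python
import os


def visible_cpus():
    """Return (logical CPUs on the host, CPUs this process may run on)."""
    total = os.cpu_count() or 1  # cpu_count() can return None
    try:
        # Honors CPU affinity masks, which container runners often set.
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity is not available on this platform.
        usable = total
    return total, usable


if __name__ == "__main__":
    total, usable = visible_cpus()
    print(f"logical CPUs: {total}, usable by this process: {usable}")
```

Running this as a step on both runners would show whether the CPU-only benchmark has the same number of cores to work with in each environment.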

Maybe we should consider moving these simple CPU benchmarks to cheaper instances, given the very limited pool we have at hand and the constrained budget for A100 instances.

atalman changed the title ~~[release-test] A100 3803mnist_hogwild-latency increase 10x~~ [release-test] A100 3803mnist_hogwild-latency increase 10x on linux.aws.a100 vs [a100-runner] Jan 21, 2025
