[release-test] A100 3803mnist_hogwild-latency increase 10x on linux.aws.a100 vs [a100-runner] #2573

Open
atalman opened this issue Jan 21, 2025 · 2 comments

Comments

@atalman
Contributor

atalman commented Jan 21, 2025

Looking at the results of the 2.5.1 vs 2.6.0 run:
https://github.com/pytorch/benchmark/actions/runs/12878326305/job/35904096937
Benchmark,pytorch-2.5.1-cuda-12.4,pytorch-2.6.0-cuda-12.4
mnist-cpu_memory,1118.67,1146.76
mnist-gpu_memory,0.0,0.0
mnist-latency,42.46,40.00
mnist_hogwild-cpu_memory,556.57,601.289
mnist_hogwild-gpu_memory,0.0,0.0
mnist_hogwild-latency,671.28,586.02
wlm_cpu_lstm-cpu_memory,885.141,907.066
wlm_cpu_lstm-gpu_memory,0.0,0.0
wlm_cpu_lstm-latency,1266.83,1079.37
wlm_cpu_trans-cpu_memory,852.113,899.531
wlm_cpu_trans-gpu_memory,0.0,0.0
wlm_cpu_trans-latency,1081.98,1078.99
wlm_gpu_lstm-cpu_memory,995.402,954.391
wlm_gpu_lstm-gpu_memory,0.0,0.0
wlm_gpu_lstm-latency,54.78,52.76
wlm_gpu_trans-cpu_memory,1007.86,993.949
wlm_gpu_trans-gpu_memory,0.0,0.0
wlm_gpu_trans-latency,56.41,55.54

Run 2.4.1 vs 2.5.0 (mnist_hogwild only):
https://github.com/pytorch/benchmark/actions/runs/12895573722
Benchmark,pytorch-2.4.1-cuda-12.4,pytorch-2.5.0-cuda-12.4
mnist_hogwild-cpu_memory,561.797,556.758
mnist_hogwild-gpu_memory,0.0,0.0
mnist_hogwild-latency,613.91,610.53

Run 2.5.1 vs 2.6.0 (mnist_hogwild only):
https://github.com/pytorch/benchmark/actions/runs/12894636482
Benchmark,pytorch-2.5.1-cuda-12.4,pytorch-2.6.0-cuda-12.4
mnist_hogwild-cpu_memory,561.73,579.324
mnist_hogwild-gpu_memory,0.0,0.0
mnist_hogwild-latency,592.67,599.23
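For reading tables like the ones above, a minimal sketch of computing the version-to-version change per row may help. It assumes the CSV layout shown in this comment (benchmark name, baseline value, new value); the helper name `relative_changes` is hypothetical and not part of the benchmark tooling.

```python
import csv
import io

# Rows copied from the mnist_hogwild-only 2.5.1 vs 2.6.0 run above.
CSV_TEXT = """Benchmark,pytorch-2.5.1-cuda-12.4,pytorch-2.6.0-cuda-12.4
mnist_hogwild-cpu_memory,561.73,579.324
mnist_hogwild-gpu_memory,0.0,0.0
mnist_hogwild-latency,592.67,599.23
"""


def relative_changes(csv_text):
    """Return {benchmark: percent change from the first to the second column}.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    changes = {}
    for name, base, new in reader:
        base, new = float(base), float(new)
        # A zero baseline (gpu_memory on a CPU-only benchmark) has no defined ratio.
        changes[name] = None if base == 0 else (new - base) / base * 100.0
    return changes


changes = relative_changes(CSV_TEXT)
for name, pct in changes.items():
    print(name, "n/a" if pct is None else f"{pct:+.2f}%")
```

On these rows the latency change between 2.5.1 and 2.6.0 comes out at roughly one percent, which is small next to the gap between hosts discussed below.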

Comparing the mnist_hogwild-latency numbers with a run on an A100 hosted on GCP, I see a roughly 10x difference:
Run 2.4.1 vs 2.5.0:
3803mnist_hogwild-latency ,61.42,62.19
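As a quick check of the 10x figure, the ratio can be computed directly from the numbers quoted above. This is only a sketch: it assumes the two GCP values are in the same column order as the AWS table (2.4.1 first, then 2.5.0), and the variable names are made up for illustration.

```python
# mnist_hogwild-latency for the 2.4.1 vs 2.5.0 comparison, as reported above.
# Assumption: GCP values follow the same column order as the AWS table.
aws_latency = {"2.4.1": 613.91, "2.5.0": 610.53}  # linux.aws.a100 run
gcp_latency = {"2.4.1": 61.42, "2.5.0": 62.19}    # GCP-hosted A100 run

# Ratio of AWS latency to GCP latency for each PyTorch version.
ratios = {version: aws_latency[version] / gcp_latency[version] for version in aws_latency}

for version, ratio in ratios.items():
    print(f"pytorch {version}: AWS/GCP latency ratio = {ratio:.2f}x")
```

Both ratios land just under 10, so the gap is the same for both PyTorch versions and tracks the host rather than the release.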

@atalman atalman changed the title [release-test] A100 3803mnist_hogwild-latency increase 10x [release-test] A100 3803mnist_hogwild-latency increase 10x on linux.aws.a100 vs [a100-runner] Jan 21, 2025
@xuzhao9
Contributor

xuzhao9 commented Jan 22, 2025

mnist_hogwild is a CPU-only benchmark.

CPU hardware spec of GCP A100 ("a2-highgpu-1g"):

Intel(R) Xeon(R) Platinum 8481C CPU 12 threads @2.70 GHz with 85 GB DRAM

@jeanschmidt
Contributor

Yes, those AWS instances have much weaker CPUs than the GCP A100 instance. They have only 10 or 11 cores available.

Maybe we should consider moving those simple CPU benchmarks to cheaper instances, given the very limited pool we have on hand and the constrained budget for A100 instances.
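To confirm the core counts on a given runner, something like the following could be run as a step before the benchmark. It is a minimal sketch using only the Python standard library; `usable_cpu_count` is a hypothetical helper name, and nothing here is part of the existing workflow.

```python
import os


def usable_cpu_count():
    """Number of logical CPUs this process is allowed to run on.

    Hypothetical helper for illustration only.
    """
    # sched_getaffinity reflects the affinity mask (e.g. taskset or a cpuset),
    # not CPU time quotas, and is not available on every platform.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: logical CPUs on the host; os.cpu_count() may return None.
    return os.cpu_count() or 1


print("logical CPUs on host:", os.cpu_count())
print("CPUs usable by this process:", usable_cpu_count())
```

Logging these two numbers next to each latency result would make it easier to tell a host difference from a real regression.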
