Improvements
- Support for XPU and HPU
- New measure methods based on timing events
- New compatibility layer for multi vendor support
- New argument placeholders
  - `{arch}`: accelerator architecture (cuda, xpu, hpu, etc.)
  - `{ccl}`: collective communication library (nccl, rccl, ccl, hccl, etc.)
  - `{cpu_count}`: number of CPUs available on the machine
  - `{cpu_per_gpu}`: number of CPUs available per GPU (`cpu_count / device_count`)
  - `{n_worker}`: recommended number of workers (`min(cpu_per_gpu, 16)`)
New Benchmarks
- RL
- brax (jax)
- dqn (jax)
- ppo (jax)
- torchatari (torch)
- Graph (torch geometric)
- dimenet
- recursiongfn
- Vision
- diffusion
- lightning
- dinov2 - vision transformer
- jepa - video
- llm
- llm-lora-single: lora (llama3.1 8B)
- llm-lora-ddp-gpus: ddp + lora (llama3.1 8B)
- llm-lora-ddp-nodes: multi nodes + ddp + lora (llama3.1 8B)
- llm-lora-mp-gpus: mp + lora (llama3.1 70B)
- llm-full-mp-gpus: mp + full (llama3.1 70B)
- llm-full-mp-nodes: multi nodes + mp + full (llama3.1 70B)
- rlhf monogpu
- rlhf multi gpu
- llava
Reference Run - Single Node
=================
Benchmark results
=================
System
------
cpu: AMD EPYC 7742 64-Core Processor
n_cpu: 128
product: NVIDIA A100-SXM4-80GB
n_gpu: 8
memory: 81920.0 MiB
Breakdown
---------
bench | fail | n | ngpu | perf | sem% | std% | peak_memory | score | weight
brax | 0 | 1 | 8 | 730035.71 | 0.1% | 0.4% | 2670 | 730035.71 | 1.00
diffusion-gpus | 0 | 1 | 8 | 117.67 | 1.5% | 11.7% | 59944 | 117.67 | 1.00
diffusion-single | 0 | 8 | 1 | 25.02 | 0.8% | 17.9% | 53994 | 202.10 | 1.00
dimenet | 0 | 8 | 1 | 366.85 | 0.7% | 16.2% | 2302 | 2973.32 | 1.00
dinov2-giant-gpus | 0 | 1 | 8 | 445.68 | 0.4% | 3.0% | 69614 | 445.68 | 1.00
dinov2-giant-single | 0 | 8 | 1 | 53.54 | 0.4% | 9.5% | 74646 | 432.65 | 1.00
dqn | 0 | 8 | 1 | 23089954554.91 | 1.1% | 89.9% | 62106 | 184480810548.20 | 1.00
bf16 | 0 | 8 | 1 | 293.43 | 0.2% | 6.3% | 1788 | 2361.16 | 0.00
fp16 | 0 | 8 | 1 | 289.26 | 0.1% | 3.6% | 1788 | 2321.65 | 0.00
fp32 | 0 | 8 | 1 | 19.14 | 0.0% | 0.7% | 2166 | 153.21 | 0.00
tf32 | 0 | 8 | 1 | 146.63 | 0.1% | 3.6% | 2166 | 1177.04 | 0.00
bert-fp16 | 0 | 8 | 1 | 263.73 | 1.1% | 16.7% | nan | 2165.37 | 0.00
bert-fp32 | 0 | 8 | 1 | 44.84 | 0.6% | 9.6% | 21170 | 364.52 | 0.00
bert-tf32 | 0 | 8 | 1 | 141.95 | 0.9% | 14.1% | 1764 | 1162.94 | 0.00
bert-tf32-fp16 | 0 | 8 | 1 | 265.04 | 1.0% | 15.6% | nan | 2175.59 | 3.00
reformer | 0 | 8 | 1 | 62.29 | 0.3% | 6.0% | 25404 | 501.89 | 1.00
t5 | 0 | 8 | 1 | 51.40 | 0.5% | 9.9% | 34390 | 416.14 | 2.00
whisper | 0 | 8 | 1 | 481.95 | 1.0% | 21.4% | 8520 | 3897.53 | 1.00
lightning | 0 | 8 | 1 | 680.22 | 1.0% | 22.7% | 27360 | 5506.90 | 1.00
lightning-gpus | 0 | 1 | 8 | 3504.74 | 7.9% | 62.9% | 28184 | 3504.74 | 1.00
llava-single | 1 | 8 | 1 | 2.28 | 0.4% | 9.6% | 72556 | 14.12 | 1.00
llama | 0 | 8 | 1 | 484.86 | 4.4% | 80.0% | 27820 | 3680.86 | 1.00
llm-full-mp-gpus | 0 | 1 | 8 | 193.92 | 3.1% | 16.2% | 48470 | 193.92 | 1.00
llm-lora-ddp-gpus | 0 | 1 | 8 | 16738.58 | 0.4% | 2.0% | 36988 | 16738.58 | 1.00
llm-lora-mp-gpus | 0 | 1 | 8 | 1980.63 | 2.2% | 11.8% | 55972 | 1980.63 | 1.00
llm-lora-single | 0 | 8 | 1 | 2724.95 | 0.2% | 3.0% | 49926 | 21861.99 | 1.00
ppo | 0 | 8 | 1 | 3114264.32 | 1.6% | 57.2% | 62206 | 24915954.98 | 1.00
recursiongfn | 0 | 8 | 1 | 7080.67 | 1.2% | 27.1% | 10292 | 57038.34 | 1.00
rlhf-gpus | 0 | 1 | 8 | 6314.94 | 2.1% | 11.2% | 21730 | 6314.94 | 1.00
rlhf-single | 0 | 8 | 1 | 1143.72 | 0.4% | 8.4% | 19566 | 9174.52 | 1.00
focalnet | 0 | 8 | 1 | 375.07 | 0.7% | 14.9% | 23536 | 3038.83 | 2.00
torchatari | 0 | 8 | 1 | 5848.88 | 0.6% | 12.7% | 3834 | 46613.34 | 1.00
convnext_large-fp16 | 0 | 8 | 1 | 330.93 | 1.5% | 22.9% | 27376 | 2711.46 | 0.00
convnext_large-fp32 | 0 | 8 | 1 | 59.49 | 0.6% | 9.8% | 55950 | 483.84 | 0.00
convnext_large-tf32 | 0 | 8 | 1 | 155.41 | 0.9% | 14.3% | 49650 | 1273.31 | 0.00
convnext_large-tf32-fp16 | 0 | 8 | 1 | 322.28 | 1.6% | 24.5% | 27376 | 2637.88 | 3.00
regnet_y_128gf | 0 | 8 | 1 | 119.46 | 0.5% | 10.0% | 29762 | 966.96 | 2.00
resnet152-ddp-gpus | 0 | 1 | 8 | 3843.06 | 5.2% | 39.3% | 27980 | 3843.06 | 0.00
resnet50 | 0 | 8 | 1 | 932.95 | 2.4% | 52.2% | 14848 | 7524.25 | 1.00
resnet50-noio | 0 | 8 | 1 | 1163.88 | 0.3% | 6.7% | 27480 | 9385.35 | 0.00
vjepa-gpus | 0 | 1 | 8 | 130.13 | 5.9% | 46.8% | 64244 | 130.13 | 1.00
vjepa-single | 0 | 8 | 1 | 21.29 | 1.0% | 22.4% | 58552 | 172.11 | 1.00
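In the table above, `perf` is the mean throughput over the runs, while `std%` and `sem%` express spread relative to that mean. A minimal sketch of that computation, assuming `perf` is the sample mean and `std%`/`sem%` are the standard deviation and standard error as percentages of it (milabench's exact estimators may differ):

```python
import statistics


def summarize(samples: list[float]) -> dict:
    """Summarize per-run throughput samples as the breakdown table reports them.

    Assumption: perf = sample mean; std% and sem% are the sample standard
    deviation and standard error of the mean, as a percentage of perf.
    """
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)          # sample standard deviation
    sem = std / len(samples) ** 0.5          # standard error of the mean
    return {
        "perf": mean,
        "std%": 100 * std / mean,
        "sem%": 100 * sem / mean,
    }
```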
Scores
------
Failure rate: 0.38% (PASS)
Score: 4175.57
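One way the per-bench `score` and `weight` columns can be combined into a single figure is a weighted geometric mean. This is a sketch under that assumption, not milabench's exact aggregation (which may handle failed or zero-weight benches differently):

```python
import math


def aggregate_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted geometric mean of per-benchmark scores.

    Assumption: benches with weight 0 (e.g. the fp16/fp32 variants above)
    are excluded from the aggregate.
    """
    total_weight = sum(weights.values())
    log_sum = sum(
        weights[name] * math.log(score)
        for name, score in scores.items()
        if weights[name] > 0  # zero-weight benches do not contribute
    )
    return math.exp(log_sum / total_weight)
```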
Errors
------
1 error, details in the HTML report.
What's Changed
- Improve exception parsing by @Delaunay in #222
- enable long trace by default by @Delaunay in #223
- Live report by @Delaunay in #146
- Do NOT run pretrained llama by @Delaunay in #227
- Add worker resolution by @Delaunay in #225
- Update observer.py by @Delaunay in #230
- Phase lock by @Delaunay in #228
- Multi node check by @Delaunay in #234
- update templates by @Delaunay in #235
- New lightning bench by @Delaunay in #236
- Update scaling.yaml by @Delaunay in #229
- Dino by @Delaunay in #238
- Update recipes.rst by @Delaunay in #242
- Llama 3 by @Delaunay in #240
- Initial commit Torch_PPO_Cleanrl_Atari_Envpool by @roger-creus in #243
- recursiongfn benchmark by @josephdviviano in #249
- Multi node tweaks by @Delaunay in #248
- Create execution_modes.rst by @Delaunay in #241
- Add Dimenet by @Delaunay in #251
- Rlhf 2 by @Delaunay in #253
- Benchmark Batch by @Delaunay in #252
- Update pins for CUDA by @Delaunay in #259
- Rl argparse by @Delaunay in #264
- Cleanrl jax by @Delaunay in #263
- Tweaks 3 by @Delaunay in #261
- Staging by @Delaunay in #265
- Adding LlaVa by @rabiulcste in #266
- Fix diffusion by @satyaog in #267
- Attempt fix on dinov2-giant-nodes by @satyaog in #268
- Generate llama instead of downloading it by @satyaog in #250
- Staging by @Delaunay in #269
- Update pins by @Delaunay in #272
- new RLHF benchmark by @Delaunay in #273
- Rlhf hf by @Delaunay in #275
- Fixes loss NaN issue for LlaVa by @rabiulcste in #279
- Geo gnn fixes by @bouthilx in #284
- Staging by @Delaunay in #283
- Sync Stable with master by @Delaunay in #143
- Batch resizing by @Delaunay in #286
- Staging by @Delaunay in #291
- Force exactly one monitor tag by @satyaog in #288
- Fix llm with torchtune v0.3 by @satyaog in #289
- Fix rlhf on trl v0.11.0 by @satyaog in #290
- Update report.py by @Delaunay in #310
- H100 by @Delaunay in #309
- Hpu by @Delaunay in #292
- Rocm by @Delaunay in #293
- Multirun system by @Delaunay in #308
- Staging by @Delaunay in #311
- Add missing tags to tests config by @Delaunay in #312
New Contributors
- @roger-creus made their first contribution in #243
- @josephdviviano made their first contribution in #249
- @rabiulcste made their first contribution in #266
Full Changelog: v0.1.0...v1.0.0