🐛 Bug
Applying the following patch dumps the GPU peak memory usage of a given benchmark:
diff --git a/benchmarks/experiment_runner.py b/benchmarks/experiment_runner.py
index 9f55e3f02..a71cc20a2 100644
--- a/benchmarks/experiment_runner.py
+++ b/benchmarks/experiment_runner.py
@@ -202,6 +202,7 @@ class ExperimentRunner:
input_tensor):
tracing_time = None
total_time_start = time.perf_counter()
+ torch.cuda.reset_peak_memory_stats()
# Invoke iteration function and measure tracing time w/o waiting on the
# result.
if benchmark_experiment.xla:
@@ -210,7 +211,8 @@ class ExperimentRunner:
input_tensor, collect_full_output=self._args.collect_full_output)
if benchmark_experiment.xla:
tracing_time = time.perf_counter() - t_trace_start
-
+ print("> Max MEM (GB):", torch.cuda.max_memory_allocated() / 10**9)
# Mark step.
self._mark_step(benchmark_experiment)
total_time = time.perf_counter() - total_time_start
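For reference, the same measurement pattern as a standalone sketch (not the actual harness code; measure_peak and step_fn are made-up names for illustration):

import torch

def measure_peak(step_fn, *args, **kwargs):
    # Clear the allocator's running peak so each call reports only its own usage.
    torch.cuda.reset_peak_memory_stats()
    out = step_fn(*args, **kwargs)
    # Wait for all queued kernels so the peak reflects the full step.
    torch.cuda.synchronize()
    print("> Max MEM (GB):", torch.cuda.max_memory_allocated() / 10**9)
    return out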
Running the opacus_cifar10 benchmark with the following command (see below) makes the memory leak explicit: the amount of allocated memory keeps growing with each iteration.
# This should run 4 (2x2) training iterations.
python xla/benchmarks/experiment_runner.py \
--suite-name torchbench --accelerator cuda --xla None --dynamo inductor --test train \
--repeat 2 --iterations-per-run 2 \
--no-resume --print-subprocess \
-k opacus_cifar10
> Max MEM (GB): 3.061852672
> Max MEM (GB): 5.957279232
> Max MEM (GB): 8.821041152
> Max MEM (GB): 11.683754496
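Quick arithmetic on the peaks above, just to make the growth rate explicit (numbers copied from the log):

peaks = [3.061852672, 5.957279232, 8.821041152, 11.683754496]
deltas = [b - a for a, b in zip(peaks, peaks[1:])]
print([round(d, 2) for d in deltas])  # [2.9, 2.86, 2.86] -> ~2.9 GB of extra memory per iteration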
Expected behavior
After the first training iteration, memory usage stays roughly constant.
I think this measurement only works for inductor (I did try it with PyTorch/XLA, but it didn't work). That's probably because only the native CUDA caching allocator is instrumented to gather this information (guess).
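If someone wants to track this on the PyTorch/XLA side, one possible (unverified) alternative is the device memory query sketched below; I'm assuming the installed torch_xla exposes xm.get_memory_info() and that it reports kb_free/kb_total (field names may differ across versions):

import torch_xla.core.xla_model as xm

def xla_used_gb():
    # Query the XLA device's memory counters instead of the CUDA caching allocator.
    device = xm.xla_device()
    info = xm.get_memory_info(device)  # assumed to return {'kb_free': ..., 'kb_total': ...}
    return (info["kb_total"] - info["kb_free"]) / 10**6  # KB -> GB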
Environment
cc @miladm @JackCaoG