AutoBatch: CUDA anomaly detected #9287
👋 Hello @alexk-ede, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution. If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you. If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available. For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:
git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
Hi and happy Monday to you. I use AutoBatch quite frequently and also just updated my local YOLOv5 today, so I took a look. GTX 1080 (8GB), PyTorch 1.12, Nvidia driver 515, CUDA 11.7, Fedora 36. When running a training similar to yours, with input size 416 on COCO128 (I assume that is what you mean by a slice of COCO), I get the warning too.
When I do not pass an input size and 640 is used, I do not get the CUDA environment warning.
I too have this random chunk of 2.53G in my GPU memory. I do not know what it is either, and it does not match my usage in nvitop before training starts (around 500-600 MB, with the GNOME desktop and Xorg on). Checking back against AutoBatch trainings on the initial release of v6.2, I do see:
Hi @Denizzje and a happy start of the week to you, too. I'll add another data point today.
But manually setting a batch size of 16 works fine. So yeah, it looks like the 2.52G reserved does interfere: it distorts the batch-size testing and then makes the interpolation invalid. Instead of just saying "anomaly detected", it would also be useful to hint below that the initial VRAM usage/reserved is suspiciously high. Update:
@alexk-ede AutoBatch may produce inaccurate results under certain circumstances, i.e. when previous trainings are in progress or have terminated early, or when not all CUDA memory has been released. If you find ways to improve it, please let us know; the relevant code is here:
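For readers following along, here is a minimal sketch of the AutoBatch idea as I understand it from that file: profile a few batch sizes, fit a line of memory use versus batch size, then solve for the batch size that hits the requested memory fraction. This is an illustration under those assumptions, not the actual utils/autobatch.py code, and it assumes a CUDA device and a callable model.

```python
import numpy as np
import torch

def estimate_batch_size(model, imgsz=640, fraction=0.8, device=torch.device("cuda:0")):
    gib = 1024 ** 3
    total = torch.cuda.get_device_properties(device).total_memory / gib   # card capacity (GiB)
    reserved = torch.cuda.memory_reserved(device) / gib
    allocated = torch.cuda.memory_allocated(device) / gib
    free = total - (reserved + allocated)                                 # memory still available to us

    batch_sizes, memory_used = [1, 2, 4, 8, 16], []
    for b in batch_sizes:                                                 # measure peak memory per batch size
        torch.cuda.reset_peak_memory_stats(device)
        img = torch.empty(b, 3, imgsz, imgsz, device=device)
        model(img)
        memory_used.append(torch.cuda.max_memory_allocated(device) / gib)
        del img

    slope, intercept = np.polyfit(batch_sizes, memory_used, deg=1)        # memory ≈ slope * batch + intercept
    return int((free * fraction - intercept) / slope)                     # batch size at the target fraction
```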
Yes, I saw that file while investigating where the warning message came from. That's where I learned about the interpolation, too.
I could understand that: if something had been using the GPU before, that would be plausible. But as I said, this is a completely fresh boot and a fresh start of the environment. Nothing was run before, and there are obviously no trainings in progress, as it says 0.16G allocated. This looks like it makes more sense. I'd obviously prefer a command that gives me the same output as nvtop. I'll try later to use memory_allocated here instead (Line 189 in 1aea74c),
as it's called by autobatch here (Line 51 in 1aea74c):
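For reference, a minimal sketch of the per-process counters being discussed (assuming a CUDA device; note that both values only cover the current process, which becomes relevant further down in this thread):

```python
import torch

device = torch.device("cuda:0")
gib = 1024 ** 3

total = torch.cuda.get_device_properties(device).total_memory / gib  # card capacity
reserved = torch.cuda.memory_reserved(device) / gib    # held by this process' caching allocator
allocated = torch.cuda.memory_allocated(device) / gib  # occupied by this process' live tensors

print(f"{total:.2f}G total, {reserved:.2f}G reserved, {allocated:.2f}G allocated")
```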
I decided to test what happens when I run this during a training session that is already using most of the GPU memory.
Turns out it fails on the first try and the results list never gets initialized at all.
But in this case 0.03G allocated is also completely wrong, because the real usage is far higher. So, as planned, I tried memory_allocated instead, but it also yields weird results.
This is quite weird. I just quickly tested this demo code while a training is running and using 6.6GB of VRAM,
but instead I now only get:
So I don't know how to fix that using only the interface that torch provides. The command
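One thing worth trying (my suggestion, not something confirmed in this thread as the fix): recent PyTorch releases expose torch.cuda.mem_get_info, which queries the driver and therefore also sees memory held by other processes, so its free value should line up much better with nvtop:

```python
import torch

gib = 1024 ** 3

# Per-process counters miss memory used by other processes (e.g. a training already running):
allocated = torch.cuda.memory_allocated() / gib   # only this process' live tensors
reserved = torch.cuda.memory_reserved() / gib     # only this process' caching allocator

# Driver-level query: whole-device free/total, across all processes.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"this process: {allocated:.2f}G allocated, {reserved:.2f}G reserved")
print(f"whole device: {free_bytes / gib:.2f}G free of {total_bytes / gib:.2f}G")
```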
There does seem to be something very wrong with the auto batch size at the moment. I believe it started after this "CUDA anomaly detected" check was implemented, though I did not do many trainings after a big batch right after the release of 6.2. This time I tried with the latest YOLOv5 from master, PyTorch 1.12, Ubuntu 20.04, Python 3.8 with Nvidia driver 515 and CUDA 11.7, on an A100 80GB SXM GPU. The dataset is my regular dataset of ~40k training pictures this time, so not COCO128. It spits out the CUDA anomaly warning and then proceeds with a batch size of 16...
For reference, this was an earlier training on the same machine but with PyTorch 1.10 and CUDA 11.3 on YOLOv5 release 6.2 (not master from that time), with an earlier version of the same dataset (the size is roughly the same though).
After my fixed-size training run (batch size 128) is finished I will try to redo AutoBatch on YOLOv5 release 6.2. If this 80GB card is convinced there as well that it can only fit a batch size of 16 in its memory, then the cause is somewhere else. I am curious to see what happens if I retry with PyTorch 1.10 and the latest master code.
@Denizzje yes I'm able to reproduce in Colab. Something is not correct. I'll add a TODO to investigate.
May resolve #9287 Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
@Denizzje good news 😃! Your original issue may now be fixed ✅ in PR #9448. This avoids setting `cudnn.benchmark=True`. To receive this update:
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
* AutoBatch `cudnn.benchmark=True` fix (may resolve #9287)
* Update autobatch.py
* Update autobatch.py
* Update general.py
Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
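For context, a rough sketch of the general pattern behind a `cudnn.benchmark=True` fix (my reading of the PR title above, not the actual diff): cuDNN benchmark mode autotunes kernels for a fixed input shape, and that autotuning can allocate sizeable workspaces, which inflates reserved memory while AutoBatch is measuring.

```python
import torch

# Keep cuDNN autotuning off while AutoBatch profiles batch sizes, so the measurements
# are not inflated by autotuning workspaces (illustrative pattern, not the PR code).
torch.backends.cudnn.benchmark = False
# ... run AutoBatch / batch-size estimation here ...

# Re-enable autotuning afterwards for the actual fixed-shape training, if desired.
torch.backends.cudnn.benchmark = True
```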
Awesome @glenn-jocher, did not expect this on a Friday evening hehe. "Unfortunately" the A100 is still training and my GTX 1080 really can't handle my dataset properly anymore, so I will wait until it's finished, then give it another try after pulling and report back ASAP if it can find its memory this time ;).
Top of the morning, @glenn-jocher. Happy to confirm that the A100 is now convinced it actually has 80GB of VRAM, and AutoBatch now gives me a batch size of 192. The "CUDA anomaly detected" warning is also gone. This was even a "dirty" start: I didn't open a new terminal or reboot the system after my previous training.
Glad to see this very useful function back in action, and thanks again for your quick work last night 😄. It's not my issue, so I can't close it, but @alexk-ede is hopefully fine too after pulling the latest code from master.
@Denizzje great! BTW we used to target 90% memory utilization but had some issues with smaller cards going over during training, which is why we dropped back to an 80% target. You can modify this here (Lines 21 to 28 in 5e1a955):
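Besides editing the default in that file, a lower target can also be passed at call time. A minimal sketch, assuming the `fraction` keyword in the `autobatch()` signature linked above (verify against your checkout) and running from the yolov5 repo root:

```python
import torch
from models.yolo import Model            # YOLOv5 repo import
from utils.autobatch import autobatch

# Build a model and ask AutoBatch to target ~70% of GPU memory instead of the default 0.8.
model = Model("models/yolov5s.yaml").to("cuda")
batch_size = autobatch(model, imgsz=640, fraction=0.7)
print(f"AutoBatch suggests batch size {batch_size}")
```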
Hi everyone, looks like it's going to be a good Monday today ;) And indeed, it seems to work fine right now.
I'm just not sure where the (75%) ✅ is coming from if fraction=0.8 ... I'll have a few training runs to do soon, so I'll report back. And yes, having it <= 80% makes sense, because I also noticed that despite GPU_mem showing 6.42G during the epoch, the actual used GPU memory is what nvtop reports: 7.711G. @Denizzje what does your nvtop report when you have
@alexk-ede 80% is the requested utilization, 75% is the predicted utilization (actual utilization will vary and is sometimes substantially different). It's possible some of the difference comes from AutoBatch running only on the free memory vs the total memory displayed later.
@alexk-ede maybe I should re-add allocated and reserved amounts to the predicted amount for the final utilisation. This should be closer to 80%.
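To make that concrete, a tiny worked example with made-up numbers (a hypothetical 8GB card, not values from this thread):

```python
total = 8.0                    # GiB on a hypothetical card
reserved_plus_allocated = 0.4  # GiB already held before AutoBatch runs
predicted_batch_mem = 6.0      # GiB the fitted line predicts the chosen batch size will use

shown_before = predicted_batch_mem / total                             # prediction alone: 75%
shown_after = (reserved_plus_allocated + predicted_batch_mem) / total  # closer to the requested 80%
print(f"{shown_before:.0%} vs {shown_after:.0%}")
```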
May resolve #9287 (comment) Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
May resolve #9287 (comment) Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com> Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
@alexk-ede good news 😃! Your original issue may now be fixed ✅ in PR #9491. This PR adds reserved and allocated memory to the final estimated utilization rate displayed, which should result in a value closer to the default requested 80%. To receive this update:
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
Hi, this one worked (but that was with PR #9448 and before PR #9491):
but these ones failed:
Then I tried setting it to 75% instead:
It ran a bit longer, but then failed. Still rather odd that it ran for over 10 seconds; usually it fails instantly when it runs out of VRAM. And a last try with 70%:
That seems to run. Anyway, these 8GB cards will just have to work with a lower fraction; there is no other way around that.
Update to the last ones: those results may be invalid. The dataset itself is around 24GB cached in RAM. Were there any other changes that could affect that?
@alexk-ede dataset caching is independent of CUDA usage; it either uses RAM or disk space.
Yes, I know, I'm using the --cache option to use RAM. Otherwise the CPU load is just insane and the CPU can't keep up with the GPU. |
Hello @alexk-ede, I cannot check right now because I am doing a training on release 6.1 at the moment (no ClearML, and it got deallocated overnight, so I am missing the original logs from the beginning). Have you tried, however, something other than yolov5n (yolov5m, for example) on that slice of COCO? Is it actually representative of your dataset / use case? Because I do remember that when mucking about with COCO128 I actually crashed my training with AutoBatch and a yolov5n, for instance.
@Denizzje yeah, I tried various yolov5 sizes, mostly n, s, m (sometimes l just for testing).
Search before asking
Question
So I'm testing the AutoBatch feature, which is pretty cool.
It seemed to work fine last week, but this week for whatever reason (maybe because it's Monday, who knows ...) I'm having issues with it.
I'm running yolov5s (latest git checkout, of course) and getting this (when using --batch -1):
Dataset is a slice from COCO
Meanwhile, this is the nvtop output before running train.py:
So there isn't really anything in the GPU memory.
I am unsure about this output from AutoBatch:
The 2.20G reserved is weird, because I stopped everything (including gdm3), so nothing is running on the GPU.
(besides the training process later).
And I can easily set batch to 80 and it works fine:
I obviously did the recommended environment restart and even restarted the machine. AutoBatch still complained about around 2.20G reserved.
Any ideas how I can investigate this?
My guess is that the 2.2GB messes up the interpolation for AutoBatch, because the GPU_mem (GB) column doesn't make much sense.
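To illustrate that guess with made-up numbers (a toy example, not values from this run): shifting the same measurements up by a constant 2.2G changes the solved batch size noticeably when the target is a fraction of total memory.

```python
import numpy as np

batch_sizes = [1, 2, 4, 8, 16]
clean = [0.6, 0.9, 1.5, 2.7, 5.1]            # hypothetical GPU_mem readings (GB)
inflated = [m + 2.2 for m in clean]          # same readings with 2.2G already reserved

total, fraction = 8.0, 0.8                   # 8GB card, 80% target
for name, mem in (("clean", clean), ("inflated", inflated)):
    slope, intercept = np.polyfit(batch_sizes, mem, deg=1)     # mem ≈ slope * batch + intercept
    print(name, round((total * fraction - intercept) / slope)) # solved batch size (smaller when inflated)
```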
Additional
Maybe the issue title should be changed to AutoBatch: CUDA anomaly detected
Some additional system info:
During training, it shows me usage around
So I'm not sure where the rest went (i.e. the difference from the 7.2GB in nvtop) ...