
get_data on GPU using cupy #241

Open
wants to merge 2 commits into master

Conversation


@Rad-hi Rad-hi commented Sep 24, 2024

Previously, I tried to extend the Python API with the ability to keep the data on the GPU (#230) and ran into some behaviors that seemed weird at the time; in hindsight, they simply came from my lack of understanding of how the data is laid out in memory.

This PR, however, provides a fully functional extension.

NOTE: this change adds an extra dependency: cupy.

The targeted function is get_data(); both modes of providing data (memory view and deep copy) are implemented for the GPU as well.
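
For context, here's a minimal usage sketch (my own illustration; it assumes the extended get_data() exposes the GPU path through memory_type/deep_copy-style parameters and returns a cupy.ndarray when sl.MEM.GPU is requested):

import pyzed.sl as sl
import cupy as cp

zed = sl.Camera()
if zed.open(sl.InitParameters()) != sl.ERROR_CODE.SUCCESS:
    raise RuntimeError("Failed to open the camera")

left = sl.Mat()
if zed.grab() == sl.ERROR_CODE.SUCCESS:
    # Keep the frame in device memory instead of copying it back to the host.
    zed.retrieve_image(left, sl.VIEW.LEFT, sl.MEM.GPU)

    # Hypothetical calls into the extended API: a zero-copy view of the GPU buffer ...
    gpu_view = left.get_data(memory_type=sl.MEM.GPU, deep_copy=False)
    # ... or an independent copy that stays on the device.
    gpu_copy = left.get_data(memory_type=sl.MEM.GPU, deep_copy=True)

    assert isinstance(gpu_view, cp.ndarray)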

This was tested on an NVIDIA AGX Orin 32 GB with JetPack 5.1.2 and ZED SDK 4.1.4.

Shoutout to @andreacelani for the discussion that led to figuring out how to implement this correctly (see the closed PR #230 for details).

Benchmarking with an ML pipeline:

@andreacelani did some benchmarking with impressive results: #230 (comment)

Additionally, I tested it myself using a real feed from a ZED Mini with a simple pipeline (see picture), and here are my findings:

[image: test pipeline diagram]

TL;DR:

  • grabbing is 60% faster
  • preprocessing on the GPU would be faster (when implemented correctly)

Details:

"""
HD2K @15FPS:

GPU:
[GPU_GRAB]             Mean: 8.531 ms, Std: 2.563 ms, Max: 15.460 ms, Min: 5.205 ms, N Samples: 200.
[GPU_PREP_RESIZE]      Mean: 5.205 ms, Std: 1.553 ms, Max: 7.745 ms, Min: 2.473 ms, N Samples: 200.
[GPU_PREP]             Mean: 6.004 ms, Std: 1.554 ms, Max: 8.721 ms, Min: 3.259 ms, N Samples: 200.
[GPU_ROT]              Mean: 0.916 ms, Std: 0.061 ms, Max: 1.162 ms, Min: 0.827 ms, N Samples: 200.
[GPU_INF]              Mean: 24.066 ms, Std: 0.701 ms, Max: 28.860 ms, Min: 23.353 ms, N Samples: 200.
[GPU_STEP]             Mean: 39.537 ms, Std: 1.452 ms, Max: 44.720 ms, Min: 38.024 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 30.065 ms, Std: 0.003 ms, Max: 30.084 ms, Min: 30.046 ms, N Samples: 200.

Throughput: ~13 iter/s

CPU:
[CPU_GRAB]             Mean: 21.728 ms, Std: 1.193 ms, Max: 25.891 ms, Min: 20.530 ms, N Samples: 200.
[CPU_PREP_RESIZE]      Mean: 5.252 ms, Std: 0.167 ms, Max: 6.051 ms, Min: 5.183 ms, N Samples: 200.
[CPU_PREP_D2H]         Mean: 1.123 ms, Std: 0.066 ms, Max: 1.445 ms, Min: 0.772 ms, N Samples: 200.
[CPU_PREP]             Mean: 13.468 ms, Std: 0.468 ms, Max: 15.780 ms, Min: 13.130 ms, N Samples: 200.
[CPU_ROT]              Mean: 1.767 ms, Std: 0.475 ms, Max: 3.314 ms, Min: 1.053 ms, N Samples: 200.
[CPU_INF]              Mean: 24.054 ms, Std: 1.301 ms, Max: 31.345 ms, Min: 23.337 ms, N Samples: 200.
[CPU_STEP]             Mean: 61.058 ms, Std: 2.245 ms, Max: 70.546 ms, Min: 58.555 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 30.064 ms, Std: 0.012 ms, Max: 30.084 ms, Min: 30.016 ms, N Samples: 200.

Throughput: ~10 iter/s

HD1080 @30FPS:

GPU:
[GPU_GRAB]             Mean: 6.146 ms, Std: 1.574 ms, Max: 11.672 ms, Min: 4.429 ms, N Samples: 200.
[GPU_PREP_RESIZE]      Mean: 6.188 ms, Std: 1.396 ms, Max: 7.494 ms, Min: 1.917 ms, N Samples: 200.
[GPU_PREP]             Mean: 6.907 ms, Std: 1.404 ms, Max: 8.313 ms, Min: 2.610 ms, N Samples: 200.
[GPU_ROT]              Mean: 0.851 ms, Std: 0.051 ms, Max: 1.244 ms, Min: 0.795 ms, N Samples: 200.
[GPU_INF]              Mean: 23.864 ms, Std: 0.697 ms, Max: 30.536 ms, Min: 22.047 ms, N Samples: 200.
[GPU_STEP]             Mean: 37.785 ms, Std: 0.774 ms, Max: 44.811 ms, Min: 35.756 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 0.005 ms, Std: 0.003 ms, Max: 0.038 ms, Min: 0.003 ms, N Samples: 200.

Throughput: ~26 iter/s


CPU:
[CPU_GRAB]             Mean: 18.501 ms, Std: 1.092 ms, Max: 22.510 ms, Min: 17.040 ms, N Samples: 200.
[CPU_PREP_RESIZE]      Mean: 4.796 ms, Std: 0.139 ms, Max: 5.671 ms, Min: 4.714 ms, N Samples: 200.
[CPU_PREP_D2H]         Mean: 1.107 ms, Std: 0.062 ms, Max: 1.447 ms, Min: 0.901 ms, N Samples: 200.
[CPU_PREP]             Mean: 11.538 ms, Std: 0.361 ms, Max: 13.599 ms, Min: 11.297 ms, N Samples: 200.
[CPU_ROT]              Mean: 1.319 ms, Std: 0.350 ms, Max: 1.848 ms, Min: 0.862 ms, N Samples: 200.
[CPU_INF]              Mean: 24.247 ms, Std: 1.295 ms, Max: 31.933 ms, Min: 22.330 ms, N Samples: 200.
[CPU_STEP]             Mean: 55.640 ms, Std: 2.117 ms, Max: 69.769 ms, Min: 52.252 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 0.009 ms, Std: 0.011 ms, Max: 0.163 ms, Min: 0.003 ms, N Samples: 200.

Throughput: ~17 iter/s
"""

Notes:

  • I used the generic YOLO (from ultralytics import YOLO) and a custom-trained PyTorch YOLOv8 model.
  • I added the sleep because, in the HD2K case, my pipeline wasn't saturating the 15 FPS grab rate, so grabbing appeared slower on the GPU (a faulty reading).
  • The preprocessing includes a 4-channel to 3-channel reduction, resizing (to meet the 640x640 expected input), and normalization (a rough cupy sketch follows this list).
  • There's a step that I didn't include in the pipeline: a rotation of the point cloud (PCL) around the X axis, just to simulate real work (code details are in Feature/get data gpu #230 (comment)).
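
A rough sketch of what this GPU-side preprocessing can look like with cupy (an illustration under my assumptions, not the exact code behind the numbers above; cupyx.scipy.ndimage.zoom stands in for whatever resize was actually used):

import cupy as cp
from cupyx.scipy import ndimage as cnd

def preprocess_gpu(frame: cp.ndarray, size: int = 640) -> cp.ndarray:
    # BGRA (H, W, 4) uint8 frame on the device -> (1, 3, size, size) float32 tensor.
    bgr = cp.ascontiguousarray(frame[:, :, :3])                  # 4-channel -> 3-channel
    h, w, _ = bgr.shape
    resized = cnd.zoom(bgr, (size / h, size / w, 1.0), order=1)  # resize on the GPU
    chw = resized.transpose(2, 0, 1).astype(cp.float32) / 255.0  # HWC -> CHW, normalize
    return chw[cp.newaxis, ...]                                  # add the batch dimension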

@Rad-hi Rad-hi marked this pull request as ready for review September 27, 2024 14:06
@adujardin
Member

Thank you for implementing this! This looks good; I'll make sure there are no compatibility issues before merging it, though. This will probably come in the next release as a binary to avoid any breaking change.

@adujardin
Member

I tested it, and it seems to handle every config correctly 🎉
I made a few modifications to make the cupy dependency optional. Unfortunately, I won't be able to merge it here, as it will ship in the next version with some modifications, but you'll be credited in the release notes 😉

@Rad-hi
Author

Rad-hi commented Nov 29, 2024

This is great news, thank you!
Happy to be potentially helpful to the community 😄

@kwadl

kwadl commented Dec 11, 2024

This is a great piece of work and will really help me speed up a particular pipeline. However, could I suggest the following very minor modification to handle grayscale images? (cupy doesn't accept shape and strides parameters of different lengths.)

if self.mat.getChannels() == 1:
    # Single-channel (grayscale) Mat: shape and strides must both be 2-tuples.
    shape = (self.mat.getHeight(), self.mat.getWidth())
    strides = (self.get_step_bytes(memory_type), self.get_pixel_bytes())
else:
    # Multi-channel Mat: 3-tuples, with the item size as the innermost stride.
    shape = (self.mat.getHeight(), self.mat.getWidth(), self.mat.getChannels())
    strides = (self.get_step_bytes(memory_type), self.get_pixel_bytes(), itemsize)
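
For context, this is roughly where those tuples end up: cupy.ndarray requires shape and strides to have the same number of entries, which is why the single-channel case needs its own branch. A self-contained sketch with stand-in values (the real binding wraps the Mat's device pointer rather than allocating a dummy buffer):

import cupy as cp

# Stand-in for a 720x1280 single-channel U8 Mat whose rows are padded to a
# 1536-byte step, as the SDK may return.
height, width, step = 720, 1280, 1536
buffer = cp.zeros(height * step, dtype=cp.uint8)
memptr = buffer.data  # in the binding this would wrap the Mat's GPU pointer

shape = (height, width)  # 2-tuple for grayscale ...
strides = (step, 1)      # ... so strides must also be a 2-tuple: (step bytes, pixel bytes)
gray = cp.ndarray(shape, dtype=cp.uint8, memptr=memptr, strides=strides)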

@Rad-hi
Author

Rad-hi commented Dec 17, 2024

@kwadl good catch!

I was able to reproduce this issue by pulling the confidence map:

confidence_map = sl.Mat(width, height, sl.MAT_TYPE.U8_C1, sl.MEM.GPU)
zed.retrieve_measure(confidence_map, sl.MEASURE.CONFIDENCE, sl.MEM.GPU)

And the proposed fix works like a charm. Thanks!
