Add candle `CudaDevice` and `MetalDevice` to avoid creating a new unique device each time #2290

laggui · 2024-09-20T15:17:44Z

Checklist

Confirmed that the `run-checks all` script has been executed.

Related Issues/PRs

Flagged by a discord user: https://discord.com/channels/1038839012602941528/1091796857996451942/1286409544280576110

Candle's device handling is clumsy. It turns out that every time you create a new device struct for the same device index, a new unique identifier is generated, so two structs that represent the same device do not compare equal.

The original issue flagged a problem with the Metal device caused by this inequality, but in my tests the same behavior also leads to an insane training time with CUDA (ETA > 2hrs).

This is because the `tensor.to_device(device)` method in Candle actually transfers the data via the CPU when the source and target CUDA devices are considered different, a terrible side effect of using different unique identifiers to represent the same device index.
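To make the failure mode concrete, here is a minimal toy model of it. Nothing below is candle code: `ToyDevice`, `Transfer`, and `plan_transfer` are hypothetical names, and the only assumption carried over from the description above is that device identity is decided by a per-instance unique id rather than by the hardware index.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for a backend device handle (not candle's type).
static NEXT_DEVICE_ID: AtomicUsize = AtomicUsize::new(0);

#[derive(Clone, Debug)]
struct ToyDevice {
    ordinal: usize, // hardware index, e.g. GPU 0
    id: usize,      // fresh value for every call to `new`
}

impl ToyDevice {
    fn new(ordinal: usize) -> Self {
        let id = NEXT_DEVICE_ID.fetch_add(1, Ordering::Relaxed);
        Self { ordinal, id }
    }

    // Assumption: identity is decided by the unique id, not by the ordinal.
    fn same_device(&self, other: &Self) -> bool {
        self.id == other.id
    }
}

#[derive(Debug, PartialEq)]
enum Transfer {
    None,    // already on the target device, nothing to copy
    ViaHost, // copy device -> CPU -> device
}

// Decide how a tensor on `src` would reach `dst` in this toy model.
fn plan_transfer(src: &ToyDevice, dst: &ToyDevice) -> Transfer {
    if src.same_device(dst) {
        Transfer::None
    } else {
        Transfer::ViaHost
    }
}
```

In this model, two handles built with `ToyDevice::new(0)` share an ordinal but not an id, so `plan_transfer` picks the slow host round trip, while a clone of an existing handle is recognized as the same device.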

Changes

The CandleDevice enum now captures the underlying device structs to avoid creating a new one with a unique identifier each time.

To keep the API simple (and similar to the previous usage), I've added creation methods for Cuda and Metal variants:

// Create a Cuda device from its index
let device = CandleDevice::cuda(0);
// Create a Metal device from its index
let device = CandleDevice::metal(0);

Also had to remove the Copy implementation because the candle devices do not implement it.
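The shape of this change can be sketched with a self-contained toy model. This is not the actual `CandleDevice` definition: `ToyHandle` and `ToyCandleDevice` are hypothetical, and the unique id stands in for whatever the real candle device structs use internally. The sketch only illustrates why the enum stores the handle and why it can be `Clone` but not `Copy`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static NEXT_HANDLE_ID: AtomicUsize = AtomicUsize::new(0);

// Hypothetical device handle: cloneable, not `Copy`, unique id per creation.
#[derive(Clone, Debug, PartialEq)]
struct ToyHandle {
    index: usize,
    id: usize,
}

impl ToyHandle {
    fn new(index: usize) -> Self {
        Self {
            index,
            id: NEXT_HANDLE_ID.fetch_add(1, Ordering::Relaxed),
        }
    }
}

// The enum stores the handle itself, so cloning the enum reuses the
// same identity instead of minting a new one.
#[derive(Clone, Debug, PartialEq)]
enum ToyCandleDevice {
    Cpu,
    Cuda(ToyHandle),
    Metal(ToyHandle),
}

impl ToyCandleDevice {
    // Creation helpers mirroring the `cuda(index)` / `metal(index)` usage.
    fn cuda(index: usize) -> Self {
        Self::Cuda(ToyHandle::new(index))
    }

    fn metal(index: usize) -> Self {
        Self::Metal(ToyHandle::new(index))
    }
}
```

Under these assumptions, the intended usage is to create the device once and pass clones around: `device.clone()` compares equal to `device`, whereas a second call to the constructor in this toy model produces a handle with a different id.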

Testing

Tested (locally) the mnist example with a new candle feature flag to use the candle backend. The training now completes in approx. 3-4 minutes.

codecov · 2024-09-20T16:02:30Z

Codecov Report

Attention: Patch coverage is 17.02128% with 39 lines in your changes missing coverage. Please review.

Project coverage is 85.42%. Comparing base (a6f7a5e) to head (6c99a64).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/burn-candle/src/backend.rs | 0.00% | 37 Missing ⚠️ |
| crates/burn-candle/src/ops/base.rs | 80.00% | 1 Missing ⚠️ |
| crates/burn-candle/src/ops/int_tensor.rs | 66.66% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2290      +/-   ##
==========================================
- Coverage   85.44%   85.42%   -0.03%     
==========================================
  Files         766      766              
  Lines       97916    97948      +32     
==========================================
+ Hits        83667    83669       +2     
- Misses      14249    14279      +30


Add candle CudaDevice and MetalDevice to avoid creating a new unique device each time

1d2b68e

laggui requested a review from nathanielsimard September 20, 2024 15:17

laggui added 2 commits September 20, 2024 11:26

Fix doc example

d58770e

Change enum usage

fa72dc9

ivnsch added a commit to ivnsch/tns_brn that referenced this pull request Sep 22, 2024

use candle/metal backend

3a9232b

for better training speed needs tracel-ai/burn#2290, thus temporary switch to main dep

nathanielsimard approved these changes Sep 25, 2024

Merge branch 'main' into fix/candle/device

6c99a64

laggui merged commit 112f09e into main Sep 25, 2024
11 checks passed

laggui deleted the fix/candle/device branch September 25, 2024 17:08

ivnsch mentioned this pull request Sep 28, 2024

Error with metal device huggingface/candle#2484

Closed
