Implementing abs/exp/div/sum_to cuda kernels #331
Conversation
auto tmp = inp[inp_strided_i];

unsigned int out_strided_i = get_strided_index(i, num_dims, dims, out_strides);
atomicAdd(out + out_strided_i, tmp);
While this is okay to start with, it will perform really badly when reducing a large tensor down to a single element, since every thread does an atomicAdd into the same output location. Need to add an issue to create a special kernel for that case (there are standard ways to write high-performance reductions).
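For reference, below is a minimal sketch of the standard shared-memory tree reduction such a special kernel would typically use; the kernel name, float-only signature, and power-of-two block size are assumptions for illustration, not code from this PR. Each block accumulates a grid-strided chunk of the input in shared memory and issues a single atomicAdd per block, instead of one atomicAdd per element as above.

// Sketch only: sum `n` floats into a single output value.
// Launch with dynamic shared memory of blockDim.x * sizeof(float).
__global__ void sum_all_sketch(const float *inp, float *out, size_t n) {
    extern __shared__ float smem[];
    unsigned int tid = threadIdx.x;

    // Each thread accumulates a grid-strided slice of the input.
    float acc = 0.0f;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + tid; i < n;
         i += (size_t)blockDim.x * gridDim.x) {
        acc += inp[i];
    }
    smem[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            smem[tid] += smem[tid + s];
        }
        __syncthreads();
    }

    // One atomicAdd per block instead of one per input element.
    if (tid == 0) {
        atomicAdd(out, smem[0]);
    }
}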
let out_strides: Src::Concrete =
    BroadcastStridesTo::<Src, Ax>::broadcast_strides(&dst, dst.strides());
The only difference between the normal forward impl and this one is this call, where the strides are broadcast. Maybe there is some way around that?
let mut storage = self.dev.alloc_zeros_async::<f32>(numel)?;

let fwd_fn = self.dev.get_func(K::MODULE_NAME, K::FWD_FN_NAME).unwrap();
let cfg = LaunchConfig::for_num_elems(numel as u32);
Should probably make a helper method for computing a good version of this launch config - it needs to take advantage of both threads and blocks.
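For illustration, here is a sketch of what such a helper usually computes, written in plain CUDA C++ rather than the Rust host code; the 256-thread block size and the 4096-block grid cap are arbitrary assumptions to tune, not values from this PR. The idea is a fixed block size, a ceil-divided and capped grid, and a grid-stride loop in the kernels so every element is still covered.

#include <algorithm>
#include <cstddef>

// Sketch of a typical launch-config helper; the constants are starting points.
struct LaunchCfg {
    unsigned int grid_dim;
    unsigned int block_dim;
};

inline LaunchCfg cfg_for_num_elems(std::size_t numel) {
    const unsigned int block_dim = 256;
    std::size_t blocks = (numel + block_dim - 1) / block_dim; // ceil division
    // Cap the grid; kernels then rely on a grid-stride loop for full coverage.
    blocks = std::min<std::size_t>(blocks, 4096);
    return LaunchCfg{static_cast<unsigned int>(blocks), block_dim};
}

The grid-stride loop in the kernel body (i += blockDim.x * gridDim.x, as in the reduction sketch above) is what makes capping the grid safe.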
#[cfg(feature = "cuda")]
mod cuda {
    pub fn build_ptx() {
        // TODO build ptx file in source tree and don't call nvcc if so
I think this can be done later; I'm not even sure it's necessary. Once we have all the kernels in place we can see, but there's no need to complicate something that's pretty simple at the moment.
let dims: CudaSlice<usize> = self.dev.take_async(lhs.shape.concrete().into())?;
let lhs_strides: CudaSlice<usize> = self.dev.take_async(lhs.strides.into())?;
let rhs_strides: CudaSlice<usize> = self.dev.take_async(rhs.strides.into())?;
let out_strides: CudaSlice<usize> = self.dev.take_async(grad_out.strides.into())?;
These same values were also allocated in the forward call - a potential future improvement is to pre-allocate them. That said, these are only used in binary ops; if a tensor is only ever used in unary ops, it never needs to allocate them.
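A sketch of the pre-allocation idea, written in plain CUDA C++ rather than the Rust/cudarc host code used in this PR; the struct and function names are illustrative assumptions. The point is that the device copies of dims/strides could be uploaded once per tensor (lazily, only when it first participates in a binary op) and then reused by both the forward and backward launches.

#include <cuda_runtime.h>
#include <cstddef>

// Illustrative only: upload dims/strides to the device once and reuse the
// pointers across kernel launches instead of re-uploading them per op.
// Error handling omitted for brevity.
struct DeviceShape {
    std::size_t *dims;     // device pointer, num_dims elements
    std::size_t *strides;  // device pointer, num_dims elements
    std::size_t num_dims;
};

inline DeviceShape upload_shape_once(const std::size_t *dims,
                                     const std::size_t *strides,
                                     std::size_t num_dims) {
    DeviceShape s{nullptr, nullptr, num_dims};
    cudaMalloc((void **)&s.dims, num_dims * sizeof(std::size_t));
    cudaMalloc((void **)&s.strides, num_dims * sizeof(std::size_t));
    cudaMemcpy(s.dims, dims, num_dims * sizeof(std::size_t), cudaMemcpyHostToDevice);
    cudaMemcpy(s.strides, strides, num_dims * sizeof(std::size_t), cudaMemcpyHostToDevice);
    return s;
}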
Also adds some necessary scaffolding for building/implementing them.
Resolves #184