
I want to propose a PR for a new op, which could be in the form of a Triton or a CUDA kernel? #20658

Open
pass-lin opened this issue Dec 18, 2024 · 1 comment
Labels
keras-team-review-pending Pending review by a Keras team member. type:feature The user is asking for a new feature.

Comments

pass-lin commented Dec 18, 2024

RWKV is a new-generation RNN model with pre-trained versions of different sizes, ranging from 0.3B to 14B parameters. It offers performance comparable to Transformer LLMs together with the inference advantages of Mamba.
I want to contribute the RNN part of RWKV to Keras, but I have several questions. First, the core operator of RWKV, time mix, iterates quite fast across versions. Should I wait for a stable version before submitting a PR, or should I submit a new op for each minor version?
Second, we have implemented RWKV-6-Keras and found that efficiency is relatively low if we only use Keras ops. To achieve high efficiency, we need an implementation based on CUDA or Triton. Personally, I prefer to provide a Triton implementation: torch ships with the Triton library by default, and for JAX we only need to additionally install jax-triton. A CUDA implementation requires a complete CUDA environment, and the jax and torch builds we usually install with pip cannot directly compile CUDA operators. The Triton implementation therefore seems more user-friendly.
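To illustrate the jax-triton route mentioned above: a minimal sketch, assuming the `jax_triton.triton_call` API shown in the jax-triton README. The kernel and names below (`add_kernel`, `add`) are illustrative placeholders, not the RWKV time-mix op.

```python
import jax
import jax.numpy as jnp
import jax_triton as jt  # pip install jax-triton
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, BLOCK_SIZE: tl.constexpr):
    # One program instance per BLOCK_SIZE-wide tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    y = tl.load(y_ptr + offsets)
    tl.store(out_ptr + offsets, x + y)

def add(x, y, block_size=1024):
    # triton_call launches the Triton kernel on JAX arrays;
    # extra keyword arguments are forwarded to the kernel by name.
    return jt.triton_call(
        x, y,
        kernel=add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        grid=(x.size // block_size,),  # assumes size is a multiple of block_size
        BLOCK_SIZE=block_size,
    )

x = jnp.arange(4096, dtype=jnp.float32)
print(add(x, x)[:4])  # [0. 2. 4. 6.]
```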

mehtamansi29 added the type:feature (The user is asking for a new feature.) and keras-team-review-pending (Pending review by a Keras team member.) labels Dec 18, 2024
Mr-back007 commented
To propose a PR for a new operation (op) in the form of either a Triton or CUDA kernel, here's a concise solution outline:

1. Triton Kernel Implementation

```python
import triton
import triton.language as tl

@triton.jit
def my_op_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail when n_elements % BLOCK_SIZE != 0
    x_data = tl.load(x_ptr + offsets, mask=mask)
    y_data = tl.load(y_ptr + offsets, mask=mask)
    result = x_data + y_data
    tl.store(output_ptr + offsets, result, mask=mask)

def launch_triton_kernel(x, y, output, n_elements, block_size=1024):
    # One program per tile, rounded up so the whole array is covered.
    grid = (triton.cdiv(n_elements, block_size),)
    my_op_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=block_size)
```
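For illustration, a possible call site using PyTorch GPU tensors (torch bundles Triton, and Triton kernels accept torch tensors as device pointers):

```python
import torch

N = 4096
x = torch.randn(N, device="cuda")
y = torch.randn(N, device="cuda")
out = torch.empty_like(x)

launch_triton_kernel(x, y, out, N)
assert torch.allclose(out, x + y)
```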
2. CUDA Kernel Implementation

```cpp
#include <cuda_runtime.h>

__global__ void my_op_kernel(const float *x, const float *y, float *output, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        output[idx] = x[idx] + y[idx];
    }
}

void launch_cuda_kernel(const float *x, const float *y, float *output, int N) {
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    my_op_kernel<<<blocks, threadsPerBlock>>>(x, y, output, N);
    cudaDeviceSynchronize();  // block the host until the kernel finishes
}
```
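As the issue author notes, building a CUDA op requires a full toolchain. One possible route for PyTorch, sketched here under the assumption that a local nvcc is available, is JIT compilation via `torch.utils.cpp_extension.load_inline`; the module name `my_op_ext` is illustrative:

```python
from torch.utils.cpp_extension import load_inline

# CUDA source pairing the kernel above with a tensor-level wrapper.
cuda_source = r"""
__global__ void my_op_kernel(const float *x, const float *y, float *out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) out[idx] = x[idx] + y[idx];
}

torch::Tensor my_op(torch::Tensor x, torch::Tensor y) {
    auto out = torch::empty_like(x);
    int N = x.numel();
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    my_op_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), y.data_ptr<float>(), out.data_ptr<float>(), N);
    return out;
}
"""

# Compiles with nvcc on first use; fails without a local CUDA toolkit,
# which is exactly the deployment friction described in the issue.
ext = load_inline(
    name="my_op_ext",  # illustrative extension name
    cpp_sources="torch::Tensor my_op(torch::Tensor x, torch::Tensor y);",
    cuda_sources=cuda_source,
    functions=["my_op"],
)
```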
3. Proposed PR Structure

Title: "Add Custom Operation Kernel (Triton or CUDA)"
Description:
Triton Kernel: Optimized for integration in ML workloads.
CUDA Kernel: Offers low-level control for maximum performance.
Provide the user with the ability to toggle between the two implementations:

```python
def new_op(x, y, output, N, use_triton=True):
    if use_triton:
        launch_triton_kernel(x, y, output, N)
    else:
        launch_cuda_kernel(x, y, output, N)
```

Performance: Show that both implementations deliver faster execution for large data.
Testing: Add tests for both kernels (see the sketch below).
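A minimal test sketch (pytest), assuming the hypothetical `new_op` toggle above and a CUDA-capable machine; it checks both code paths against a tensor-level reference:

```python
import pytest
import torch

@pytest.mark.parametrize("use_triton", [True, False])
def test_new_op_matches_reference(use_triton):
    n = 4096
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    out = torch.empty_like(x)
    new_op(x, y, out, n, use_triton=use_triton)  # hypothetical toggle from above
    torch.testing.assert_close(out, x + y)
```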
