NPU support in whisper.cpp #1557

Open · 3 tasks
bobqianic opened this issue Nov 27, 2023 · 17 comments
Labels
good first issue Good for newcomers performance CPU and memory usage - results and comparisons research🔬

Comments

@bobqianic
Collaborator

bobqianic commented Nov 27, 2023

Christmas is coming soon, and I want to take some time to research something interesting, such as low-power inference at the edge. Although the current whisper.cpp can run on a Raspberry Pi, inference is not fast enough for real-time transcription. Fortunately, there are now development boards whose processors include NPUs, which could be used to achieve real-time transcription even with larger models. My primary goal is to support the RK3566 and RK3588 first.

Roadmap:

  • MatMul offloading
  • Conv-Gelu offloading
  • LayerNorm offloading
    ...

Reference:

https://github.com/rockchip-linux/rknpu2

@bobqianic bobqianic added good first issue Good for newcomers performance CPU and memory usage - results and comparisons research🔬 labels Nov 27, 2023
@ggerganov
Owner

Would be great if we can find a way to utilize the NPUs! Keep us in the loop!

@Leeviber

Leeviber commented Nov 30, 2023

I tried converting the Whisper encoder model to the RKNPU format (.rknn). The conversion succeeded, but the estimated runtime is quite slow, even slower than running on the CPU. I think the NPU does not fully support transformers, so some operators still run on the CPU.
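For reference, a minimal conversion sketch using Rockchip's rknn-toolkit2 might look roughly like the following (the ONNX export, file names, and options are illustrative assumptions, not necessarily what was used above):

    # Sketch only: convert an ONNX export of the Whisper encoder to .rknn
    # with rknn-toolkit2. File names and options are illustrative assumptions.
    from rknn.api import RKNN

    rknn = RKNN()
    rknn.config(target_platform='rk3588')          # or 'rk3566'
    rknn.load_onnx(model='whisper_encoder.onnx')   # hypothetical ONNX export of the encoder
    rknn.build(do_quantization=False)              # typically runs as FP16 on the NPU
    rknn.export_rknn('whisper_encoder.rknn')
    rknn.release()

Even when such a conversion succeeds, any operators the NPU does not support fall back to the CPU, which matches the slowdown described above.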

@RoboMagus

Some interesting development was done here: https://github.com/usefulsensors/useful-transformers.

However, not everything runs on the NPU, and I've personally had mixed success running non-English models.

@bobqianic
Collaborator Author

Some interesting development was done here: https://github.com/usefulsensors/useful-transformers.

Yes, I've seen that. But I'm looking to enhance the ggml tensor library by adding some operators. That way, not only will whisper.cpp be able to utilize the NPU, but other ggml-based projects like llama.cpp will as well. I've ordered an OrangePi 5 Plus with 32 GiB of RAM from AliExpress, which is still in transit : )

However, not everything runs on the NPU, and I've personally had mixed success running non-English models.

Hopefully, we'll be able to run all models, regardless of their size and whether they are English-only or multilingual.

@bobqianic
Collaborator Author

The most challenging aspect I've encountered thus far is finding an appropriate driver for the RK3588 & RK3566 NPU. Most Linux distributions don't include an NPU driver, with this one being the notable exception.

https://github.com/unifreq/linux-5.10.y-rk35xx/tree/main/drivers/rknpu

@bobqianic
Collaborator Author

I tried converting the Whisper encoder model to the RKNPU format (.rknn). The conversion succeeded, but the estimated runtime is quite slow, even slower than running on the CPU. I think the NPU does not fully support transformers, so some operators still run on the CPU.

You're right. From my experiments, it seems the NPU on the RK3588 is only effective for 3x3 convolutions. Unfortunately, its GEMM performance is quite poor. Despite being equipped with a 3x2 TOPS NPU (three cores at 2 TOPS each), each core only delivers about 10 GFLOPS for FP16 GEMM or 20 GOPS for INT8 GEMM. It's quite a letdown. I regret to share such disappointing news during the holiday.
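(For context, these figures use the standard GEMM throughput metric: roughly 2·M·N·K operations divided by wall-clock time. A quick sketch of the arithmetic, where the matrix sizes and the timing below are made-up numbers chosen only to illustrate the ~10 GFLOPS ballpark:)

    # Rough GEMM throughput estimate (sketch). 'elapsed_s' is assumed to be a
    # wall-clock measurement of a single matmul; the numbers below are invented.
    def gemm_gops(M, N, K, elapsed_s):
        return 2.0 * M * N * K / elapsed_s / 1e9

    # e.g. a 512x512x512 matmul taking ~27 ms works out to roughly 10 GFLOPS
    print(gemm_gops(512, 512, 512, 0.027))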


@bobqianic bobqianic changed the title from "Rockchip NPU support in whisper.cpp" to "NPU support in whisper.cpp" on Dec 23, 2023
@bobqianic
Collaborator Author

bobqianic commented Dec 24, 2023

I discovered that someone else did the exact same thing but didn't find success. @ggerganov

The challenge with the Rockchip NPU stems from its peculiar input and output layouts. To attain maximum speed, you have to rearrange each 2D matrix into a particular packed layout. If you don't, the driver takes over and does it for you, but it operates much slower. After processing, you then need to convert the result back to its original layout. This whole process is quite inefficient, and I'm sharing it here to save others from spending unnecessary time trying to implement it.

With the RK3588, when you're working with a matrix A of size (N, K) and a matrix B of size (K, M), you'll need to reshape matrix A to the new dimensions of (K/8, N, 8). Similarly, reshape matrix B to (M/16, K/32, 16, 32). After these transformations, the resulting output matrix C will have the dimensions of (N/4, M, 4), instead of the expected (N, M).
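A rough numpy sketch of that reordering (the exact interleaving order inside the packed blocks is an assumption here; the ggml-rknpu2 backend linked below is the authoritative reference):

    # Illustration of the RK3588 matmul layouts described above (numpy sketch).
    # Assumes N % 4 == 0, K % 32 == 0, M % 16 == 0; block interleaving order is a guess.
    import numpy as np

    N, K, M = 64, 256, 128
    A = np.random.rand(N, K).astype(np.float16)
    B = np.random.rand(K, M).astype(np.float16)

    # A: (N, K) -> (K/8, N, 8)
    A_npu = A.reshape(N, K // 8, 8).transpose(1, 0, 2)

    # B: (K, M) -> (M/16, K/32, 16, 32)
    B_npu = B.reshape(K // 32, 32, M // 16, 16).transpose(2, 0, 3, 1)

    # The NPU returns C as (N/4, M, 4); it has to be unpacked back to (N, M)
    C_npu = np.zeros((N // 4, M, 4), dtype=np.float16)   # placeholder for the NPU output
    C = C_npu.transpose(0, 2, 1).reshape(N, M)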

Links:
https://clehaxze.tw/gemlog/2023/12-17-update-on-ggml-rknpu2-backend-and-rknpu2-1_6_0.gmi
https://github.com/marty1885/llama.cpp/tree/rknpu2-backend

Layout diagrams for Matrix A, Matrix B, and Matrix C (images).

@solarsamuel

@bobqianic this is a great idea. The question is how we can implement whisper.cpp on an NPU/TPU on an embedded device.

I have an OrangePi 5 and was hoping the NPU would provide benefits, but it looks like it won't be very useful. Thank you for looking into it.

I have one idea that may be theoretically possible, but it would require a good amount of work and $$$. The idea is to use 4 Google Coral Edge TPUs in a pipeline (see the pipeline example here: https://coral.ai/examples/) and in essence jailbreak them (George Hotz is working on this in these videos: https://www.youtube.com/watch?v=rArv2NUXGU8) to run models other than TensorFlow ones (for example Whisper models). The Coral Edge TPUs would take up all of the USB slots on a Raspberry Pi (maybe a USB hub could be used too), so there would be a bandwidth constraint. Each TPU has up to 8 MB of SRAM to store the models, but in reality it's more like 6.5 MB each, so probably a maximum model size of 26 MB across 4 of these units. The quantized 4-bit tiny model comes in under this. The entire setup may be possible and run quickly, but the accuracy of the tiny model isn't that great.

Another idea would be to take TPUs or FPGAs and connect them to a Raspberry Pi via USB or as a Raspberry Pi HAT. That will be bandwidth-limited by the communication protocol (serial, I2C, etc.).

Maybe one day when chips like this come out things will be easier for embedded AI: https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55

@ggerganov
Owner

@bobqianic Thank you for the updates! The work in marty1885/llama.cpp@rknpu2-backend is interesting and I will be following the progress

@marty1885

For reference: people have worked around the matrix reordering specifically for Whisper by designing the entire implementation around that fact.

useful-transformers is a very successful implementation. https://github.com/usefulsensors/useful-transformers

@Lhemamou

Lhemamou commented Jun 8, 2024

Hey :) Since Raspberry Pi is launching a new AI accelerator HAT (https://www.raspberrypi.com/products/ai-kit/), I am reopening the topic. Do you by any chance have news, or ideas on how to start improving performance with this HAT? I guess it would be easier than the Coral since we don't need to jailbreak it.

@marty1885

@Lhemamou I actually talked to Hailo about this during Computex. Long story short: no, unless someone wants to form a company and sign an NDA to gain low-level access.

@solarsamuel

@marty1885 I have a company and I'd be open to signing an NDA as long as it looks reasonable, but before I go too far, my main concern is the hardware.

Does anyone know what the Hailo hardware limit is with regard to model size? Feel free to send links.

For example, the Google Coral TPU stick ASIC has 8MB of SRAM built into the chip. Something like 1.5MB of overhead is used, so a model can only be 6.5MB max. https://coral.ai/docs/edgetpu/compiler/#parameter-data-caching

For the Google Coral TPU, the Whisper tiny model is too big; even the 4-bit quantized version of the tiny model is around 24 MB.

tiny: 75 MiB on disk, ~273 MB of memory

I'm assuming the Hailo chip does the matrix multiply internally and the results are stored in a pipeline in internal SRAM, but I could be wrong.

@marty1885

marty1885 commented Jun 11, 2024

@solarsamuel I can't tell without knowing NDA'd information. Here is what I gathered from their sales rep (at least I think he was a sales rep):

  1. The Hailo-8 can fit YOLOv5s and a modified version of YOLOv5m (I assume quantized).
  2. If their compiler cannot fit the model onto the chip, they can split the model and swap the weights on the fly:
    • There will be a performance impact, limited by PCIe bandwidth.
    • Or the model can be split across multiple chips.
  3. The Hailo-10H has DRAM, so you can put large models there. That eliminates PCIe transfers; the bottleneck then becomes DRAM bandwidth.
  4. whisper.cpp requires low-level access to the accelerator: it needs to be able to command the accelerator to do matmuls directly, so a compiler layer is useless in this case. If you want to sign an NDA, you need to check that you also get that level of access.

@solarsamuel

@marty1885 I can reach out. Who would be a good person to contact? I'm definitely not making any guarantees that any of this will work out.

@marty1885

marty1885 commented Jun 15, 2024

@solarsamuel Sorry for the late reply. I got caught up in some personal issues. Let's not misuse the issue tracker - can we talk through email instead? You can find mine on my website via the link on my GitHub profile.

Your GH profile links to a company and I'm not sure if that's the one you want to use for discussion.

I don't have an email in mind - I don't have a business card from them since the NDA was a big show stopper for me.

@jenskastensson

@bobqianic - Would you benefit from having a driver package for the Mali GPU kernel drivers on the RK3588 (specifically for Debian Bullseye)? Let me know if this is something that would improve inference performance!

/opt/libmali# dmesg | grep mali
[    3.406183] mali fb000000.gpu: Kernel DDK version g21p0-01eac0
[    3.406423] mali fb000000.gpu: Looking up mali-supply from device tree
[    3.406569] mali fb000000.gpu: Looking up mem-supply from device tree
[    3.406849] mali fb000000.gpu: Looking up mali-supply from device tree
[    3.407154] mali fb000000.gpu: bin=0
[    3.407392] mali fb000000.gpu: leakage=16
[    3.407516] mali fb000000.gpu: Looking up mali-supply from device tree
[    3.407547] debugfs: Directory 'fb000000.gpu-mali' with parent 'vdd_gpu_s0' already present!
[    3.408956] mali fb000000.gpu: pvtm=865
[    3.409284] mali fb000000.gpu: pvtm-volt-sel=3
[    3.409338] mali fb000000.gpu: Looking up mali-supply from device tree
[    3.409358] debugfs: Directory 'fb000000.gpu-mali' with parent 'vdd_gpu_s0' already present!
[    3.409375] mali fb000000.gpu: Looking up mem-supply from device tree
[    3.410738] mali fb000000.gpu: avs=0
[    3.410849] W : [File] : drivers/gpu/arm/bifrost/platform/rk/mali_kbase_config_rk.c; [Line] : 144; [Func] : kbase_platform_rk_init(); power-off-delay-ms not available.
[    3.411650] mali fb000000.gpu: Register LUT 000a0800 initialized for GPU arch 0x000a0806
[    3.411678] mali fb000000.gpu: r0p0 status 5 not found in HW issues table;
[    3.411689] mali fb000000.gpu: falling back to closest match: r0p0 status 0
[    3.411698] mali fb000000.gpu: Execution proceeding normally with fallback match
[    3.411706] mali fb000000.gpu: GPU identified as 0x7 arch 10.8.6 r0p0 status 0
[    3.411810] mali fb000000.gpu: No priority control manager is configured
[    3.411819] mali fb000000.gpu: Large page allocation set to false after hardware feature check
[    3.412265] mali fb000000.gpu: No memory group manager is configured
[    3.412300] mali fb000000.gpu: Protected memory allocator not available
[    3.413941] mali fb000000.gpu: EM: OPP:600000 is inefficient
[    3.413953] mali fb000000.gpu: EM: OPP:500000 is inefficient
[    3.413961] mali fb000000.gpu: EM: OPP:400000 is inefficient
[    3.413968] mali fb000000.gpu: EM: OPP:300000 is inefficient
[    3.414342] mali fb000000.gpu: EM: created perf domain
[    3.414989] mali fb000000.gpu: l=10000 h=85000 hyst=5000 l_limit=0 h_limit=800000000 h_table=0
[    3.415901] mali fb000000.gpu: * MALI kbase_mmap_min_addr compiled to CONFIG_DEFAULT_MMAP_MIN_ADDR, no runtime update possible! *
[    3.415920] mali fb000000.gpu: Probed as mali0
[    7.404098] mali fb000000.gpu: Loading Mali firmware 0x1010000
[    7.406892] mali fb000000.gpu: Mali firmware git_sha: ee476db42870778306fa8d559a605a73f13e455c
