
Referencing yuv420sp mpp output to ffmpeg frames without memcopy #18

Closed
hbiyik opened this issue Apr 4, 2023 · 438 comments

hbiyik commented Apr 4, 2023

Hello

I am experimenting with an approach where the mmapped MppBuffer pointer is directly referenced by AVFrame->data[0,1,2], so that I can get rid of the memcopy entirely. I expected some alignment issues, and without any conversion I can only get yuv420sp, but the current problem is that I have huge memory leaks.

Do you have any suggestions on what I am doing wrong? It seems the mpp frame is not released even though I release it explicitly.

Prototype here:
https://github.com/hbiyik/FFmpeg/blob/61c629b2a6b65a319b767fafac3f01221d9c16f7/libavcodec/rkmppdec.c

@JeffyCN
Copy link
Owner

JeffyCN commented Apr 6, 2023

maybe:
1/ make sure the mpp frame has been deinited at the end
2/ try to locate the leak with valgrind

maybe:

+            // free old buffers
+            av_buffer_unref(&frame->buf[0]);
+            av_buffer_unref(&frame->buf[1]);
+            av_buffer_unref(&frame->buf[2]);
+
+            frame->buf[0] = // create buffer wrapper
+
+            frame->linesize[0] = hstride;
+            frame->linesize[1] = hstride / 2;
+            frame->linesize[2] = hstride / 2;
+
+            frame->data[0] = frame->buf[0]->data;
+            frame->data[1] = frame->data[0] + hstride * vstride;
+            frame->data[2] = frame->data[1] + hstride * vstride / 4;
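A minimal sketch of the elided "create buffer wrapper" step, assuming the MppFrame should stay alive until FFmpeg drops the last reference (mpp_dframe is a placeholder name for the decoder's MppFrame; only public FFmpeg/MPP calls are used):

static void mpp_frame_free_cb(void *opaque, uint8_t *data)
{
    // called by FFmpeg when the last AVBufferRef is unreffed;
    // releasing the MppFrame also returns the MppBuffer to the pool
    MppFrame mpp_dframe = (MppFrame)opaque;
    mpp_frame_deinit(&mpp_dframe);
}

...
MppBuffer mpp_buf = mpp_frame_get_buffer(mpp_dframe);
frame->buf[0] = av_buffer_create((uint8_t *)mpp_buffer_get_ptr(mpp_buf),
                                 mpp_buffer_get_size(mpp_buf),
                                 mpp_frame_free_cb, mpp_dframe,
                                 AV_BUFFER_FLAG_READONLY);
if (!frame->buf[0])
    return AVERROR(ENOMEM);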


hbiyik commented Apr 6, 2023

Thanks, I think I have done all the unreffing and frame releasing, but apparently somewhere I haven't. I will go with valgrind; maybe this helps fix my shit code. Closing the ticket.

@hbiyik hbiyik closed this as completed Apr 6, 2023

hbiyik commented Apr 6, 2023

Aha, I fixed the issue, thanks. Together with valgrind and your tips I realized that I had not released the old buffers.


hbiyik commented Apr 16, 2023

@JeffyCN
I am reopening this issue because I think I made some real improvements.

  1. I am using the Y plane from the MppBuffer without copying it back to the AVBuffer.
  2. I am using the existing AVBuffer to convert only the UV planes with libyuv. libyuv has fast SIMD implementations of buffer de-interleaving.

I tested 8K 30fps and 60fps with this without any frame drops in normal players (mpv, ffplay).
Firefox plays 4K nice and smooth. It sometimes drops frames, but that's a huge step forward imho.

Prototype code below:
hbiyik@adaeb45

Still, I am not so sure about referencing the MppBuffer directly: my theory is that releasing the mpp buffer only when the actual display is finished may keep the internal VPU buffer pool full for longer, so copying it out with SIMD could still be faster.
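For reference, the UV de-interleave in point 2 boils down to a single libyuv call; a sketch assuming src_uv points at the interleaved NV12 chroma plane in the MppBuffer and uv_stride is its stride:

#include "libyuv/planar_functions.h"

// split the interleaved UV plane into separate U and V planes;
// libyuv dispatches to NEON/SIMD code paths internally
SplitUVPlane(src_uv, uv_stride,
             frame->data[1], frame->linesize[1],   // U destination
             frame->data[2], frame->linesize[2],   // V destination
             width / 2, height / 2);               // chroma plane dimensions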

@hbiyik hbiyik reopened this Apr 16, 2023

hbiyik commented Apr 16, 2023

@avafinger

@avafinger

I've missed the whole issue and your improvement. I can only test this in ffplay and report back.
Maybe @JeffyCN can give some hints.


hbiyik commented Apr 16, 2023

@avafinger ok, make sure libyuv is installed. I did not check whether the configure script properly tests for libyuv.

Full branch is here
https://github.com/hbiyik/FFmpeg/tree/dualbuffers

@avafinger

Is libyuv this one?
git clone https://chromium.googlesource.com/libyuv/libyuv


hbiyik commented Apr 16, 2023

Yes


hbiyik commented Apr 16, 2023

And my branch is dualbuffers


JeffyCN commented Apr 17, 2023

there are some clients that might use their own custom get_buffer hook; there might be issues in those cases


hbiyik commented Apr 17, 2023

This branch converts the whole NV12 frame to yuv420p without overriding get_buffer2. Performance is similar to, yet slightly worse than, dualbuffers.
https://github.com/hbiyik/FFmpeg/tree/libyuv

This branch directly references the NV12 frame output with get_buffer2; there is no conversion at all.
https://github.com/hbiyik/FFmpeg/tree/norga
The interesting part is that even this struggles with 8K videos: CPU load is very low, yet some frames are dropped when mpv plays the NV12 output directly. So I started to think that with FFmpeg's unaccelerated decoder pipeline it is impossible to output 8K, or that either MPP or the VPU is really slow when it comes to 8K. I know the VPU is dual core, but I don't know whether mpp can really leverage that, or whether the ffmpeg pipeline is really the bottleneck.

I have run some further tests with dualbuffers; depending on the video it also drops frames even at 4K, so I might have overstated the performance. It is still a big improvement over the soft conversion, though, and could be a fallback where there is no RGA.


JeffyCN commented Apr 17, 2023

guessing it might be related to the render flow; at least for the Mali GPU, there is a memcpy when importing a normal texture


hbiyik commented Apr 17, 2023

ah, I think you are referring to the YUV to RGB conversion at some point, but I thought this was done by the players (I am no expert in video stuff at all, I might guess stupid things). Can you point out where this is happening? Or some file in ffmpeg where this happens?


JeffyCN commented Apr 17, 2023

it's in the player (ffplay/mpv/chromium/firefox)

if they use the GPU (EGL) to render the video frame, the buffer-to-texture (glTexImage2D) step causes a memcpy (neon version).

the reason we use drm prime in ffmpeg is to get zero-copy dmabuf importing (with custom mpv/gst xvimage)

so:
1/ mpp returns a dma buf
2/ ffmpeg converts it to a normal buf (memcpy)
3/ the player imports it to the GPU (memcpy)
4/ the GPU renders it to the window buf
5/ the GPU renders the window buf to the screen buf


JeffyCN commented Apr 17, 2023

for VPU performance, you can check with driver log:
echo 0x100 >/sys/module*/rk_vcodec/parameter*/*debug
dmesg

@avafinger

if they use the GPU (EGL) to render the video frame, the buffer-to-texture (glTexImage2D) step causes a memcpy (neon version).

For instance, would you have an example of how to import dma buf to texture (egl)?


danergo commented Apr 17, 2023

Quick (n00b) question here @hbiyik: if I wish to have an ffmpeg that can do the task below with hardware acceleration on RK3568B2:

  • transcode h264 or hevc source to h264 with lower fps/resolution/bitrate

Could any of your repositories be my choice?

I tried with https://github.com/jjm2473/ffmpeg-rk, but it segfaults constantly, and the repository owner is not responding to my questions.


hbiyik commented Apr 17, 2023

@danergo the libuv branch should do the trick


danergo commented Apr 17, 2023

Thank you.

You mean libyuv branch?

I have an RK3568 CPU (nanopi r5c) with this OS: rk3568-eflasher-debian-bullseye-core-5.10-arm64-YYYYMMDD.img.gz

How shall I (from which repos) build or (from which packages) install the dependencies?

What are the dependencies? Librga and mpp?

What is the recommended configure command?

Thank you very much.


hbiyik commented Apr 17, 2023

I suggest you open a ticket in my repo so Jeffy's repo doesn't get spammed with these questions; there I can try to help as best I can.


JeffyCN commented Apr 17, 2023

For instance, would you have an example of how to import dma buf to texture (egl)?

check:
https://github.com/JeffyCN/drm-cursor/blob/master/drm_egl.c#L375

I was using a custom mpv & xserver:
JeffyCN/mpv@3d668a7
https://github.com/JeffyCN/xorg-xserver/blob/1.20.4/glamor/glamor_xv.c#L464
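For a rough idea of what the linked code does, the core of the zero-copy import is eglCreateImageKHR with the EGL_LINUX_DMA_BUF_EXT target; a single-plane sketch (in real code the KHR/OES entry points come from eglGetProcAddress, and the linked drm_egl.c and glamor_xv.c handle the multi-plane YUV cases):

EGLint attrs[] = {
    EGL_WIDTH,                     width,
    EGL_HEIGHT,                    height,
    EGL_LINUX_DRM_FOURCC_EXT,      drm_fourcc,   // plane format, e.g. from drm_fourcc.h
    EGL_DMA_BUF_PLANE0_FD_EXT,     dma_fd,
    EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
    EGL_DMA_BUF_PLANE0_PITCH_EXT,  pitch,
    EGL_NONE
};
EGLImageKHR image = eglCreateImageKHR(display, EGL_NO_CONTEXT,
                                      EGL_LINUX_DMA_BUF_EXT, NULL, attrs);
// binds the dmabuf as a texture without the memcpy that glTexImage2D would do
glBindTexture(GL_TEXTURE_2D, tex);
glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, image);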


danergo commented Apr 18, 2023

@hbiyik: that was my intention, but unfortunately your repo doesn't allow me to open tickets.

If you can let me do that I'll open it there. I appreciate your help.

Thank you!

Sorry Jeffy. :)

@hbiyik
Copy link
Author

hbiyik commented Apr 18, 2023

god, VPU981 maxes out at 8K@15fps. Ridiculous.


hbiyik commented Apr 18, 2023

The only option I see now is to use libdav1d together with the VPU, so that maybe it can catch up to 8K@30 for YouTube.

@JeffyCN Thanks for your help so far. If I were to use the libdav1d decoder in parallel in rkmppdec.c, the decoded frames I return would not need to be ordered, right? FFmpeg should reorder them according to pts?

In that case, how do I feed 2 different decoders so that each decoder only gets the packets for the frames it needs? Is it even possible?


JeffyCN commented Apr 18, 2023

The TRM's max fps could be lower than the real world; maybe you can:
1/ try gst (fakesink+fpsdisplaysink)
2/ try the afbc format
3/ try performance mode (cpu/ddr) (see the sketch below)
4/ check the driver performance log

and maybe you can ask @HermanChen about the actual max fps
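On typical Rockchip kernels, performance mode boils down to pinning the cpufreq and dmc devfreq governors; the sysfs paths vary per board and kernel, so treat these as examples:

# pin CPU and DDR governors to performance (paths may differ on your board)
echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
echo performance | sudo tee /sys/class/devfreq/dmc/governor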


hbiyik commented Apr 18, 2023

thanks, will do. btw the previous libdav1d idea was silly: even the compression algorithm depends on previous frames, let alone that you cannot align packets with frames. So that's impossible.

@nyanmisaka

I think I'm making some progress.

RGA does require the user to import the dma_fd/vir_addr/phy_addr buffer as a handle in advance, storing it inside the RGA driver via ioctl(RGA_IOC_IMPORT_BUFFER), in order to correctly cascade multiple BLIT tasks.

Before the entire RGA singleton is destroyed, all previously imported handles need to be released via ioctl(RGA_IOC_RELEASE_BUFFER) to prevent leaks.

If cascading is not required, the import buffer step can be skipped.
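If I read librga's im2d API right, this maps to importbuffer_fd/releasebuffer_handle; a sketch, assuming a dmabuf fd from MPP (dma_fd, width and height are placeholders):

#include <im2d.h>

im_handle_param_t param = { width, height, RK_FORMAT_YCbCr_420_SP };

// import once, then reuse the handle across cascaded blit tasks
rga_buffer_handle_t handle = importbuffer_fd(dma_fd, &param);
rga_buffer_t src = wrapbuffer_handle(handle, width, height,
                                     RK_FORMAT_YCbCr_420_SP);

// ... run one or more blits against src (improcess/imcvtcolor) ...

// must be released before the RGA singleton goes away, or the handle leaks
releasebuffer_handle(handle);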


hbiyik commented Nov 12, 2023

ah, that's why it required the C++ interface: to have the buffer wrapper

@nyanmisaka

I'm revisiting the mpp encoder. I've now done AFBC/FBCE. All that's left is to find a way to take advantage of the dual core rkvenc. FFmpeg's existing framework is not suitable for enabling multi-threading in the encoder; maybe we should explore mpp's advanced MppTask interface and use its poll/dequeue/enqueue?


hbiyik commented Nov 23, 2023

one question: is rkvenc dual core? I thought it was single core and that only the vdpu381 decoder is dual core.


hbiyik commented Nov 23, 2023

and could you find a solution to the in/out fence problem?

@nyanmisaka

When using multi-threading, you can see in top that there are two rkvenc-core irq workers at different addresses.

irq/135-fdbd0000.rkvenc-core
irq/136-fdbe0000.rkvenc-core


@nyanmisaka

and could you find a solution to the in/out fence problem?

Not yet. Replacing the fd with a handle can only reduce, not avoid, the issue. I plan to put it on hold for now, wait until the encoder is completed, and then ask the MPP developers to help us hand the ffmpeg source code to the RGA developers for testing.

@HermanChen

I'm revisiting the mpp encoder. Now I've done AFBC/FBCE. All that's left is to find a way to take advantage of dual core rkvenc. FFmpeg's existing framework is not suitable for enabling multi-threading for the encoder, maybe we should explore mpp's advanced interface MppTask and use its poll/dequeue/enqueue?

MppTask is not very good at multi-threading; it is not efficient because of too many wait / signal operations.
It is better to refer to mpi_dec/enc_mt_test for multi-threading (see the sketch below).

one question, is rkvenc dual core? i thought it was single core, only vdpu 381 decoder is dual core.

Both vepu580 and vdpu381 are dual core.

and could you find a solution to the in/out fence problem?

The mpp decoder provides a callback mode. Refer to MppDecCbCfg; it is in testing and not documented yet.
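For context, the mpi_enc_mt_test flow he recommends is essentially two free-running threads over the plain MppApi; a stripped-down sketch (EncCtx, get_next_frame and write_packet are hypothetical app-side helpers, error handling omitted):

// input thread: feed frames as fast as the encoder accepts them
static void *put_thread(void *arg)
{
    EncCtx *ctx = arg;
    while (!ctx->done) {
        MppFrame frame = get_next_frame(ctx);            // app-specific source
        ctx->mpi->encode_put_frame(ctx->mpp_ctx, frame);
    }
    return NULL;
}

// output thread: drain packets independently of the input side,
// so both encoder cores can stay busy
static void *get_thread(void *arg)
{
    EncCtx *ctx = arg;
    while (!ctx->done) {
        MppPacket packet = NULL;
        if (!ctx->mpi->encode_get_packet(ctx->mpp_ctx, &packet) && packet) {
            write_packet(ctx, packet);                   // app-specific sink
            mpp_packet_deinit(&packet);
        }
    }
    return NULL;
}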

@HermanChen

Mpp has already handled all the dual core issues. The user can ignore this part.


nyanmisaka commented Nov 23, 2023

Mpp has already handled all the dual core issues. The user can ignore this part.

@HermanChen
The problem is that mpi_enc_test cannot get the dual core VEPU fully loaded to encode 4K@120fps / 1080p@480fps, but mpi_enc_mt_test can. Any insight on this?


hbiyik#14 (comment)

@HermanChen

Mpp has already handled all the dual core issues. The user can ignore this part.

@HermanChen The problem is that mpi_enc_test cannot get the dual core VEPU fully loaded to encode 4K@120fps / 1080p@480fps, but mpi_enc_mt_test can. Any insight on this?

emm... The dual core mode differs between H.264 and H.265.
The H.265 dual core mode splits one frame into 2 tiles and encodes each tile on its own core, so over-4K encoding will fully use both cores.
The H.264 dual core mode encodes two frames at the same time, one frame per core.
So it comes out that H.265 below 4K will use only one core, and H.264 in serial encoding mode also uses only one core.

@HermanChen

mpi_enc_test can use dual core on H.265 above 4K, or when the auto_tile option is enabled.

@HermanChen

mpi_enc_test cannot use dual core on H.264 due to its block input mode.


hbiyik commented Nov 23, 2023

@nyanmisaka
I think there is a misunderstanding about this NV20 format, on either the FFmpeg side or the Rockchip side.
I think Rockchip is more accurate in this regard, because their definition is in line with the Linux kernel DRM subsystem:

https://github.com/FFmpeg/FFmpeg/blob/4adb93dff05dd947878c67784d98c9a4e13b57a7/libavutil/pixdesc.c#L2088

FFmpeg defines NV20 as a 2-byte format with 6 bits of padding, but NV20 is a 10-bit format without padding.
It also has LE and BE variants, which I cannot make sense of: there can be no little or big endian alignment when there is no padding.

P210, the 2-byte variant of NV20, which corresponds to YUV422SP_10bit uncompact align=0 in RGA, seems to be correct:
https://github.com/FFmpeg/FFmpeg/blob/4adb93dff05dd947878c67784d98c9a4e13b57a7/libavutil/pixdesc.c#L2088

I defined NV15 with the AV_PIX_FMT_FLAG_BITSTREAM flag to describe it correctly; I think the same should have been done for NV20 as well:
https://github.com/hbiyik/FFmpeg/blob/65d4cf4bd74904dcbcd935685d2160da83f8f72d/libavutil/pixdesc.c#L2730

What do you think?
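To illustrate, a bitstream-style NV20 descriptor would look something like this (illustrative only; it mirrors the NV15 entry in the branch above, with step/offset given in bits because of AV_PIX_FMT_FLAG_BITSTREAM):

[AV_PIX_FMT_NV20] = {
    .name = "nv20",
    .nb_components = 3,
    .log2_chroma_w = 1,              // 4:2:2 subsampling
    .log2_chroma_h = 0,
    .comp = {
        { 0, 10,  0, 0, 10 },        // Y: packed 10-bit samples, no padding
        { 1, 20,  0, 0, 10 },        // U: interleaved with V
        { 1, 20, 10, 0, 10 },        // V
    },
    .flags = AV_PIX_FMT_FLAG_BITSTREAM,
},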

@nyanmisaka

@hbiyik I'm also skeptical of the current AV_PIX_FMT_NV20, which was introduced 10 years ago; very little code in FFmpeg refers to it. However, NV15/NV20/NV30 in DRM and V4L2 were only discussed and added in recent years.

Also, the fate test for imgutils shows the same values for NV20 and P210.
This conflicts with RK's NV20, which does not have padding:

nv20le          planes: 2, linesizes: 128 128   0   0, plane_sizes:  6144  6144     0     0, plane_offsets:  6144     0     0, total_size: 12288
nv20be          planes: 2, linesizes: 128 128   0   0, plane_sizes:  6144  6144     0     0, plane_offsets:  6144     0     0, total_size: 12288
p210be          planes: 2, linesizes: 128 128   0   0, plane_sizes:  6144  6144     0     0, plane_offsets:  6144     0     0, total_size: 12288
p210le          planes: 2, linesizes: 128 128   0   0, plane_sizes:  6144  6144     0     0, plane_offsets:  6144     0     0, total_size: 12288


hbiyik commented Nov 26, 2023

@nyanmisaka
https://github.com/hbiyik/linux/tree/panthor+mpp+rga

here is the port of mpp+rga+dmaheaps to mainline 6.7rc2 with panthor. I tested it and it is working pretty well.
to build with mpp, mpp_defconfig is here; to build with rga, rga_multi_defconfig is here. And for your reference, here is a build script for the kernel.

[alarm@alarm testfiles]$ uname -a
Linux alarm 6.7.0-rc1-panthor-g3255db267a2b #1 SMP PREEMPT Sun Nov 26 22:16:06 UTC 2023 aarch64 GNU/Linux
[alarm@alarm testfiles]$ sudo cat /sys/kernel/debug/rkrga/hardware
[sudo] password for alarm: 
===================================
rga3_core0, core 1: version: 3.0.76831
input range: 68x2 ~ 8176x8176
output range: 68x2 ~ 8128x8128
scale limit: 1/8 ~ 8
byte_stride_align: 16
max_byte_stride: 32768
csc: RGB2YUV 0xf YUV2RGB 0xf
feature: 0x4
mmu: RK_IOMMU
-----------------------------------
rga3_core1, core 2: version: 3.0.76831
input range: 68x2 ~ 8176x8176
output range: 68x2 ~ 8128x8128
scale limit: 1/8 ~ 8
byte_stride_align: 16
max_byte_stride: 32768
csc: RGB2YUV 0xf YUV2RGB 0xf
feature: 0x4
mmu: RK_IOMMU
-----------------------------------
rga2, core 4: version: 3.2.63318
input range: 2x2 ~ 8192x8192
output range: 2x2 ~ 4096x4096
scale limit: 1/16 ~ 16
byte_stride_align: 4
max_byte_stride: 32768
csc: RGB2YUV 0x3 YUV2RGB 0x7
feature: 0x205f
mmu: RGA_MMU
-----------------------------------
[alarm@alarm testfiles]$ ls /dev/rga* /dev/dma_heap/* /dev/mpp_service 
/dev/dma_heap/cma  /dev/dma_heap/cma-uncached  /dev/dma_heap/system  /dev/dma_heap/system-dma32  /dev/dma_heap/system-uncached  /dev/dma_heap/system-uncached-dma32  /dev/mpp_service  /dev/rga

dmesg of mpp

[    0.090807] mpp_service mpp-srv: 665456e38c4d author: boogie 2023-11-26 mpp: build all drivers, we can not use CONFIG_CPU_RK* in mainline
[    0.090816] mpp_service mpp-srv: probe start
[    0.092453] mpp_rkvdec2 fdc30000.rkvdec-ccu: rkvdec-ccu, probing start
[    0.092501] mpp_rkvdec2 fdc30000.rkvdec-ccu: ccu-mode: 1
[    0.092506] mpp_rkvdec2 fdc30000.rkvdec-ccu: probing finish
[    0.092708] mpp_rkvenc2 rkvenc-ccu: probing start
[    0.092714] mpp_rkvenc2 rkvenc-ccu: probing finish
[    0.092833] mpp_service mpp-srv: probe success
[    0.386016] mpp_vdpu2 fdb50400.vdpu: Adding to iommu group 3
[    0.387218] mpp_vdpu2 fdb50400.vdpu: probe device
[    0.388063] mpp_vdpu2 fdb50400.vdpu: reset_group->rw_sem_on=0
[    0.388586] mpp_vdpu2 fdb50400.vdpu: reset_group->rw_sem_on=0
[    0.389563] mpp_vdpu2 fdb50400.vdpu: probing finish
[    0.390237] mpp_rkvdec2 fdc38100.rkvdec-core: Adding to iommu group 4
[    0.391132] mpp_rkvdec2 fdc38100.rkvdec-core: rkvdec-core, probing start
[    0.392019] mpp_rkvdec2 fdc38100.rkvdec-core: shared_niu_a is not found!
[    0.393143] mpp_rkvdec2 fdc38100.rkvdec-core: shared_niu_h is not found!
[    0.394427] mpp_rkvdec2 fdc38100.rkvdec-core: core_mask=00010001
[    0.394965] mpp_rkvdec2 fdc38100.rkvdec-core: attach ccu as core 0
[    0.395914] mpp_rkvdec2 fdc38100.rkvdec-core: sram_start 0x00000000ff001000
[    0.396536] mpp_rkvdec2 fdc38100.rkvdec-core: rcb_iova 0x00000000fff00000
[    0.397138] mpp_rkvdec2 fdc38100.rkvdec-core: sram_size 491520
[    0.397656] mpp_rkvdec2 fdc38100.rkvdec-core: rcb_size 1048576
[    0.398175] mpp_rkvdec2 fdc38100.rkvdec-core: min_width 512
[    0.398672] mpp_rkvdec2 fdc38100.rkvdec-core: rcb_info_count 20
[    0.399197] mpp_rkvdec2 fdc38100.rkvdec-core: [136, 24576]
[    0.399702] mpp_rkvdec2 fdc38100.rkvdec-core: [137, 49152]
[    0.400191] mpp_rkvdec2 fdc38100.rkvdec-core: [141, 90112]
[    0.400678] mpp_rkvdec2 fdc38100.rkvdec-core: [140, 49152]
[    0.401166] mpp_rkvdec2 fdc38100.rkvdec-core: [139, 180224]
[    0.401662] mpp_rkvdec2 fdc38100.rkvdec-core: [133, 49152]
[    0.402149] mpp_rkvdec2 fdc38100.rkvdec-core: [134, 8192]
[    0.402629] mpp_rkvdec2 fdc38100.rkvdec-core: [135, 4352]
[    0.403109] mpp_rkvdec2 fdc38100.rkvdec-core: [138, 13056]
[    0.403620] mpp_rkvdec2 fdc38100.rkvdec-core: [142, 291584]
[    0.404169] mpp_rkvdec2 fdc38100.rkvdec-core: probing finish
[    0.404909] mpp_rkvdec2 fdc48100.rkvdec-core: Adding to iommu group 5
[    0.405787] mpp_rkvdec2 fdc48100.rkvdec-core: rkvdec-core, probing start
[    0.406641] mpp_rkvdec2 fdc48100.rkvdec-core: shared_niu_a is not found!
[    0.407783] mpp_rkvdec2 fdc48100.rkvdec-core: shared_niu_h is not found!
[    0.408986] mpp_rkvdec2 fdc48100.rkvdec-core: core_mask=00020002
[    0.409535] mpp_rkvdec2 fdc48100.rkvdec-core: attach ccu as core 1
[    0.410880] mpp_rkvdec2 fdc48100.rkvdec-core: sram_start 0x00000000ff079000
[    0.411517] mpp_rkvdec2 fdc48100.rkvdec-core: rcb_iova 0x00000000ffe00000
[    0.412120] mpp_rkvdec2 fdc48100.rkvdec-core: sram_size 487424
[    0.412638] mpp_rkvdec2 fdc48100.rkvdec-core: rcb_size 1048576
[    0.413157] mpp_rkvdec2 fdc48100.rkvdec-core: min_width 512
[    0.413655] mpp_rkvdec2 fdc48100.rkvdec-core: rcb_info_count 20
[    0.414180] mpp_rkvdec2 fdc48100.rkvdec-core: [136, 24576]
[    0.414667] mpp_rkvdec2 fdc48100.rkvdec-core: [137, 49152]
[    0.415154] mpp_rkvdec2 fdc48100.rkvdec-core: [141, 90112]
[    0.415649] mpp_rkvdec2 fdc48100.rkvdec-core: [140, 49152]
[    0.416138] mpp_rkvdec2 fdc48100.rkvdec-core: [139, 180224]
[    0.416634] mpp_rkvdec2 fdc48100.rkvdec-core: [133, 49152]
[    0.417122] mpp_rkvdec2 fdc48100.rkvdec-core: [134, 8192]
[    0.417601] mpp_rkvdec2 fdc48100.rkvdec-core: [135, 4352]
[    0.418082] mpp_rkvdec2 fdc48100.rkvdec-core: [138, 13056]
[    0.418569] mpp_rkvdec2 fdc48100.rkvdec-core: [142, 291584]
[    0.419118] mpp_rkvdec2 fdc48100.rkvdec-core: probing finish
[    0.419922] mpp_rkvenc2 fdbd0000.rkvenc-core: Adding to iommu group 6
[    0.420744] mpp_rkvenc2 fdbd0000.rkvenc-core: probing start
[    0.421594] mpp_rkvenc2 fdbd0000.rkvenc-core: attach ccu as core 0
[    0.422494] mpp_rkvenc2 fdbd0000.rkvenc-core: probing finish
[    0.423275] mpp_rkvenc2 fdbe0000.rkvenc-core: Adding to iommu group 7
[    0.424090] mpp_rkvenc2 fdbe0000.rkvenc-core: probing start
[    0.424892] mpp_rkvenc2 fdbe0000.rkvenc-core: attach ccu as core 1
[    0.425778] mpp_rkvenc2 fdbe0000.rkvenc-core: probing finish

dmesg of rga

[    0.358299] rga3_core0 fdb60000.rga: Adding to iommu group 1
[    0.358945] rga: rga3_core0, irq = 51, match scheduler
[    0.359709] rga: rga3_core0 hardware loaded successfully, hw_version:3.0.76831.
[    0.360397] rga: rga3_core0 probe successfully
[    0.360939] rga3_core1 fdb70000.rga: Adding to iommu group 2
[    0.361504] rga: rga3_core1, irq = 52, match scheduler
[    0.362212] rga: rga3_core1 hardware loaded successfully, hw_version:3.0.76831.
[    0.362867] rga: rga3_core1 probe successfully
[    0.363432] rga: rga2, irq = 63, match scheduler
[    0.364101] rga: rga2 hardware loaded successfully, hw_version:3.2.63318.
[    0.364724] rga: rga2 probe successfully
[    0.365089] rga_iommu: IOMMU binding successfully, default mapping core[0x1]
[    0.365855] rga: Module initialized. v1.3.0

@nyanmisaka

@hbiyik
Actually it's still rc1, Collabora has rc2 in rk3588-test.

I'm curious if the non-essential features (qos, devfreq) of MPP you disabled earlier have any impact on performance.

Btw I added the drm GEM related changes so I could test it in my FFmpeg.


hbiyik commented Nov 27, 2023

Actually it's still rc1, Collabora has rc2 in rk3588-test.

yeah, I think I overlooked it; rc1 then

I'm curious if the non-essential features (qos, devfreq) of MPP you disabled earlier have any impact on performance.

it must have an impact on either performance, power consumption, or thermals, but currently I think mainline boots with the default clock values, which are the performance values, and the mpp-related hardware does not have a governor. If it causes problems, I think the right way would be to port mpp to the mainline interfaces, because the rkr interfaces are quite different and depend on the rest of the drivers, up to pvtm.

Btw I added the drm GEM related changes so I could test it in my FFmpeg.

What does GEM help with? I have no idea what it is; I read the kernel docs and it sounds complicated.

Also: I noticed mpv cannot do an atomic swap when using drm output. I did not dig in too deep, but I suspect that when the buffer is requested from rockchip_dma* it cannot be handled properly when outputting to drm? Maybe it is related to this GEM thing.

Also, AV1 is not working because its MMU driver is different, and rkr and mainline differ too much between versions; I did not want to make the port bigger. AV1 is already supported in mainline over V4L2 anyway, so currently I would rather ignore it.

@nyanmisaka

What does GEM help for? I have no idea what this is, read the kernel docs and it sounds complicated.

I mainly use it to create drm dumb buffers. The drm allocator inside MPP also relies on this.

The RGA seems to be underperforming. While the MPP decoder gives 1500fps on HEVC 1080p, RGA is capped at about 200fps. Can you test it using dma_heap on your side?
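For the dma_heap side, allocation goes through the standard kernel UAPI; a minimal sketch against /dev/dma_heap/system:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/dma-heap.h>

// allocate one dmabuf from the system heap and get its fd
int heap_fd = open("/dev/dma_heap/system", O_RDWR | O_CLOEXEC);
struct dma_heap_allocation_data alloc = {
    .len      = buf_size,            // e.g. hstride * vstride * 3 / 2 for NV12
    .fd_flags = O_RDWR | O_CLOEXEC,
};
// on success alloc.fd is a dmabuf fd that can be handed to MPP/RGA
if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0)
    perror("DMA_HEAP_IOCTL_ALLOC");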


hbiyik commented Nov 27, 2023

yes, I can confirm: rga never reaches 100%, it hovers around 50% load. I get 700fps for 1080p, but it should still be higher

[alarm@alarm testfiles]$ FFMPEG_RKMPP_PIXFMT=nv12 ffmpeg -stream_loop -1 -loglevel info -i ~/hiresvids/jelly/jellyfish-10-mbps-hd-hevc.mkv -an -sn -f null -
ffmpeg version n6.0-102-g65d4cf4bd7 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 12.1.0 (GCC)
  configuration: --enable-rkmpp --enable-version3 --enable-libdrm --disable-optimizations --disable-static --enable-shared
  libavutil      58.  2.100 / 58.  2.100
  libavcodec     60.  3.100 / 60.  3.100
  libavformat    60.  3.100 / 60.  3.100
  libavdevice    60.  1.100 / 60.  1.100
  libavfilter     9.  3.100 /  9.  3.100
  libswscale      7.  1.100 /  7.  1.100
  libswresample   4. 10.100 /  4. 10.100
Input #0, matroska,webm, from '/home/alarm/hiresvids/jelly/jellyfish-10-mbps-hd-hevc.mkv':
  Metadata:
    COMPATIBLE_BRANDS: iso4hvc1iso6
    MAJOR_BRAND     : iso4
    MINOR_VERSION   : 1
    ENCODER         : Lavf56.3.100
  Duration: 00:00:30.10, start: 0.067000, bitrate: 9978 kb/s
  Stream #0:0(und): Video: hevc (Main), yuv420p(tv), 1920x1080 [SAR 1:1 DAR 16:9], 29.97 fps, 29.97 tbr, 1k tbn (default)
    Metadata:
      CREATION_TIME   : 2016-02-04 22:41:00
      LANGUAGE        : und
      HANDLER_NAME    : hevc@GPAC0.5.2-DEV-rev565-g71748d7-ab-suite
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (hevc_rkmpp_decoder) -> wrapped_avframe (native))
Press [q] to stop, [?] for help
[hevc_rkmpp_decoder @ 0xaaaae3c1c1c0] Pixfmt (nv12), Conversion (nv12[FBC]->nv12)
rga_api version 1.9.3_[2]
Output #0, null, to 'pipe:':
  Metadata:
    COMPATIBLE_BRANDS: iso4hvc1iso6
    MAJOR_BRAND     : iso4
    MINOR_VERSION   : 1
    encoder         : Lavf60.3.100
  Stream #0:0(und): Video: wrapped_avframe, nv12(tv, progressive), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 200 kb/s, 29.97 fps, 29.97 tbn (default)
    Metadata:
      CREATION_TIME   : 2016-02-04 22:41:00
      LANGUAGE        : und
      HANDLER_NAME    : hevc@GPAC0.5.2-DEV-rev565-g71748d7-ab-suite
      encoder         : Lavc60.3.100 wrapped_avframe
frame= 4720 fps=703 q=-0.0 Lsize=N/A time=00:02:38.29 bitrate=N/A speed=23.6x   



hbiyik commented Nov 27, 2023

we can rule out the clocks; I have a tool to inspect the registers directly, and the rga3 core is configured exactly the same on both the mainline and vendor kernels.
CPLL with M=250, P=2, S=1, K=0 gives 1500MHz; DIV=1 gives 750MHz

-c rk3588 -d CRU -r CPLL_CON0 -p M = 250, (default=64), (values=[64~1023])
-c rk3588 -d CRU -r CPLL_CON0 -p BYPASS = 0, (default=0), (values=[0~1])
-c rk3588 -d CRU -r CPLL_CON1 -p P = 2, (default=1), (values=[1~63])
-c rk3588 -d CRU -r CPLL_CON1 -p S = 1, (default=0), (values=[0~6])
-c rk3588 -d CRU -r CPLL_CON1 -p RESETB = 0, (default=0), (values=[0~1])
-c rk3588 -d CRU -r CPLL_CON2 -p K = 0, (default=0), (values=[0~65536])
-c rk3588 -d CRU -r CPLL_CLOCK -p clock = 1500.0 Mhz, (default=0)
-c rk3588 -d CRU -r RGA3CORE -p ROOT_DIV = 1, (default=1), (values=[0~32])
-c rk3588 -d CRU -r RGA3CORE -p ACLK_ROOT_SEL = CPLL, (default=CPLL), (values=GPLL,CPLL,AUPLL)
-c rk3588 -d CRU -r RGA3CORE -p HCLK_ROOT_SEL = 200MM, (default=200MM), (values=200MM,100MM,50MM,OSC)
-c rk3588 -d CRU -r RGA3CORE -p CORE_DIV = 1, (default=1), (values=[0~32])
-c rk3588 -d CRU -r RGA3CORE -p CORE_SEL = CPLL, (default=CPLL), (values=GPLL,CPLL,AUPLL)
-c rk3588 -d CRU -r RGA3CORE -p core_clock = 750.0 Mhz, (default=0)
-c rk3588 -d CRU -r RGA3CORE -p a_clock = 750.0 Mhz, (default=0)

and here is mainline:

-c rk3588 -d CRU -r CPLL_CON0 -p M = 250, (default=64), (values=[64~1023])
-c rk3588 -d CRU -r CPLL_CON0 -p BYPASS = 0, (default=0), (values=[0~1])
-c rk3588 -d CRU -r CPLL_CON1 -p P = 2, (default=1), (values=[1~63])
-c rk3588 -d CRU -r CPLL_CON1 -p S = 1, (default=0), (values=[0~6])
-c rk3588 -d CRU -r CPLL_CON1 -p RESETB = 0, (default=0), (values=[0~1])
-c rk3588 -d CRU -r CPLL_CON2 -p K = 0, (default=0), (values=[0~65536])
-c rk3588 -d CRU -r CPLL_CLOCK -p clock = 1500.0 Mhz, (default=0)
-c rk3588 -d CRU -r RGA3CORE -p ROOT_DIV = 1, (default=1), (values=[0~32])
-c rk3588 -d CRU -r RGA3CORE -p ACLK_ROOT_SEL = CPLL, (default=CPLL), (values=GPLL,CPLL,AUPLL)
-c rk3588 -d CRU -r RGA3CORE -p HCLK_ROOT_SEL = 200MM, (default=200MM), (values=200MM,100MM,50MM,OSC)
-c rk3588 -d CRU -r RGA3CORE -p CORE_DIV = 1, (default=1), (values=[0~32])
-c rk3588 -d CRU -r RGA3CORE -p CORE_SEL = CPLL, (default=CPLL), (values=GPLL,CPLL,AUPLL)
-c rk3588 -d CRU -r RGA3CORE -p core_clock = 750.0 Mhz, (default=0)
-c rk3588 -d CRU -r RGA3CORE -p a_clock = 750.0 Mhz, (default=0)

@nyanmisaka

It doesn't even use the second core.



hbiyik commented Nov 27, 2023

that's weird; maybe async rga is not active in the config. btw I am also on the same kernel with the GEM fixes

zcat /proc/config.gz | grep RGA

CONFIG_ROCKCHIP_MULTI_RGA=y
CONFIG_ROCKCHIP_RGA_ASYNC=y
CONFIG_ROCKCHIP_RGA_PROC_FS=y
CONFIG_ROCKCHIP_RGA_DEBUG_FS=y
CONFIG_ROCKCHIP_RGA_DEBUGGER=y

@nyanmisaka

radxa@rock-5a:~$ sudo dmesg | grep rga
[    0.377613] rga3_core0 fdb60000.rga: Adding to iommu group 0
[    0.377796] rga: rga3_core0, irq = 49, match scheduler
[    0.378080] rga: rga3_core0 hardware loaded successfully, hw_version:3.0.76831.
[    0.378132] rga: rga3_core0 probe successfully
[    0.378293] rga3_core1 fdb70000.rga: Adding to iommu group 1
[    0.378368] rga: rga3_core1, irq = 50, match scheduler
[    0.378627] rga: rga3_core1 hardware loaded successfully, hw_version:3.0.76831.
[    0.378650] rga: rga3_core1 probe successfully
[    0.378848] rga: rga2, irq = 63, match scheduler
[    0.379102] rga: rga2 hardware loaded successfully, hw_version:3.2.63318.
[    0.379136] rga: rga2 probe successfully
[    0.379160] rga_iommu: IOMMU binding successfully, default mapping core[0x1]
[    0.379322] rga: Module initialized. v1.3.0

radxa@rock-5a:~$ zcat /proc/config.gz | grep RGA
# CONFIG_VIDEO_ROCKCHIP_RGA is not set
CONFIG_ROCKCHIP_MULTI_RGA=y
CONFIG_ROCKCHIP_RGA_ASYNC=y
CONFIG_ROCKCHIP_RGA_PROC_FS=y
CONFIG_ROCKCHIP_RGA_DEBUG_FS=y
CONFIG_ROCKCHIP_RGA_DEBUGGER=y

Can you try my branch on your kernel and see how it performs?

--enable-gpl --enable-version3 --enable-libdrm --enable-rkmpp --enable-rkrga

./ffmpeg -hwaccel rkmpp -hwaccel_output_format drm_prime -afbc_mode 1 -i /path/to/video -an -sn -vf scale_rkrga=format=nv12 -f null -

The problem may be drm dumb buffer specific, or I'm missing a certain kernel option.


hbiyik commented Nov 27, 2023

./ffmpeg -loglevel info -stream_loop -1 -hwaccel rkmpp -hwaccel_output_format drm_prime -afbc_mode 1 -i ~/hiresvids/jelly/jellyfish-10-mbps-hd-hevc.mkv -an -sn -vf scale_rkrga=w=1280:h=720:format=nv12 -f null -

I had to scale it to make it work; if I do not set w=1280,h=720 it goes into passthrough and rga is not applied (I think it does not check whether the input is afbc or not).

Nevertheless, it sometimes loads core1 to 1%, otherwise it is zero; core0 fluctuates around 30~50% and fps is around ~200/250. Maybe if I enable rga debug logging it will tell us why the scheduler is making such decisions...

I scaled the same file to the same dims (1280x720) with my code; it loads both cores up to 50% each and fps is around 500~600.

Your ffmpeg hits 1500fps with the same file when no rga is involved; mine tops out at 950. However, I know mine was also hitting 1500 on the vendor kernel, so it seems my fork has an overall slowdown regardless of RGA (maybe something is wrong in the internal decoder async loop); your problem seems different.


hbiyik commented Nov 28, 2023

https://github.com/hbiyik/linux/blob/56db31a2d9bc06e897883d4193e6e521008fd25e/drivers/dma-buf/heaps/rk_system_heap.c#L744

I had disabled a TLB-related section in the rk_heap driver; the reason was that swiotlb_max_segment() was removed in 6.3, and I suspect this might be why I get a slowdown in mainline, since I am explicitly using dma buffers. Maybe it is also possible to verify the same in your version of ffmpeg; I will check this when I have the environment.

EDIT: Never mind, this block is only useful when memory is >4GB; I am testing on the 4GB model, so this cannot be the root cause.

@nyanmisaka

@HermanChen

emm... The dual core mode differs between H.264 and H.265.
The H.265 dual core mode splits one frame into 2 tiles and encodes each tile on its own core, so over-4K encoding will fully use both cores.

[mpp/mpp_enc]: Add async encoder flow

[mpp_enc]: Fix h265e async issue

https://github.com/rockchip-linux/mpp/blob/318bfc1b783831424529d483ef74baadc7478807/mpp/codec/enc/h265/h265e_ps.c#L436-L458

So, can I interpret this as: the h265e hardware on rk3588 doesn't support frame-parallel/async IO, and it's not a driver limitation, right?

The H.264 dual core mode encodes two frames at the same time, one frame per core.
So it comes out that H.265 below 4K will use only one core, and H.264 in serial encoding mode also uses only one core.
