Referencing yuv420sp mpp output to ffmpeg frames without memcopy #18
Comments
maybe: maybe:
|
Thanks. I think I have done all the unreffing and frame releasing, but apparently I have missed it somewhere. I will go with valgrind; maybe this helps to fix my shit code. Closing the ticket. |
Aha, I fixed the issue, thanks. Together with valgrind and your tips, I realized that I had not released the old buffers. |
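For readers following the leak discussion, here is a minimal sketch of the ownership pattern involved, written as a self-contained mock rather than against the real libraries. `DecFrame`, `BufRef` and every function below are hypothetical stand-ins: `DecFrame` plays the role of the decoder-owned frame, and `BufRef` imitates the idea of a refcounted buffer with a free callback (FFmpeg's `av_buffer_create` accepts such a callback and an opaque pointer). The point it illustrates is that the decoder frame is handed back only when the last reference is dropped, so forgetting an unref anywhere keeps the old buffers alive.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a decoder-owned frame (not the real MPP API). */
typedef struct DecFrame {
    uint8_t *pixels;
    int     *live_count;   /* how many decoder frames are still held */
} DecFrame;

static DecFrame *dec_frame_get(int *live_count, size_t size)
{
    DecFrame *f = malloc(sizeof(*f));
    if (!f)
        return NULL;
    f->pixels = calloc(1, size);
    if (!f->pixels) {
        free(f);
        return NULL;
    }
    f->live_count = live_count;
    (*live_count)++;
    return f;
}

static void dec_frame_release(DecFrame *f)
{
    (*f->live_count)--;
    free(f->pixels);
    free(f);
}

/* Hypothetical refcounted wrapper with a free callback. */
typedef struct BufRef {
    uint8_t *data;              /* points into decoder memory, no copy */
    int      refcount;
    void   (*free_cb)(void *opaque);
    void    *opaque;
} BufRef;

static BufRef *buf_wrap(uint8_t *data, void (*free_cb)(void *), void *opaque)
{
    BufRef *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->data = data;
    b->refcount = 1;
    b->free_cb = free_cb;
    b->opaque = opaque;
    return b;
}

static BufRef *buf_ref(BufRef *b)
{
    b->refcount++;
    return b;
}

/* Drop one reference; the decoder frame is released with the last one. */
static void buf_unref(BufRef **pb)
{
    BufRef *b = *pb;
    *pb = NULL;
    if (--b->refcount == 0) {
        b->free_cb(b->opaque);
        free(b);
    }
}

/* Free callback: hands the frame back to the (mock) decoder. */
static void release_dec_frame(void *opaque)
{
    dec_frame_release(opaque);
}
```

In this mock, wrapping `f->pixels` with `buf_wrap(f->pixels, release_dec_frame, f)` and then dropping every reference brings the live frame count back to zero; skipping one `buf_unref` leaves it at one, which is the kind of leak valgrind reports.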
@JeffyCN
I tested 8k Prototype code below: Still, I am not so sure about referencing MppBuffer directly, because in my theory, releasing the mpp buffer only when the actual display is finished may keep the internal VPU buffers full for longer, so copying it with SIMD may still be faster. |
I've missed the whole issue and your improvement. I can only test this in ffplay and report back. |
@avafinger ok, make sure libyuv is installed. I did not check if configure script properly tests libyuv. Full branch is here |
Is libyuv this one? |
Yes |
And my branch is dualbuffers |
there are some clients that might try to use their own custom get buffer hook, there might be issues in those cases |
This branch converts the whole NV12 to yuv420p without overriding get_buffer2. Performance is similar to 'dual buffers', yet slightly worse. This branch directly references the NV12 frame output with get_buffer2. There is no conversion at all. I have run some further tests with dualbuffers; depending on the video it also drops even in 4k, so I might have overstated the performance. It is still a big improvement over the soft conversion, and could be a fallback where there is no rga. |
Guessing it might be related to the render flow; at least for Mali GPU, there is a memcpy when importing a normal texture. |
Ah, I think you are referring to the yuv to rgb conversion at some point, but I thought this was done by the players (I am no expert in video stuff at all, I might be guessing stupid things). Can you point out where this is happening? Or some file in ffmpeg where this is happening? |
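To make the NV12 to yuv420p step concrete, here is a plain scalar sketch of what such a conversion has to do: copy the luma plane and split the interleaved UV plane into separate U and V planes. This is not the code from either branch; the function name and parameters are made up for illustration. libyuv provides a SIMD-optimized routine for the same job (`NV12ToI420`), which is what a real build would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one NV12 frame (Y plane + interleaved UV plane) to planar
 * yuv420p. Strides are in bytes; width and height must be even. */
static void nv12_to_yuv420p(const uint8_t *src_y, size_t src_y_stride,
                            const uint8_t *src_uv, size_t src_uv_stride,
                            uint8_t *dst_y, size_t dst_y_stride,
                            uint8_t *dst_u, size_t dst_u_stride,
                            uint8_t *dst_v, size_t dst_v_stride,
                            size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    /* Luma is already planar: copy row by row so both strides are honoured. */
    for (size_t row = 0; row < height; row++)
        memcpy(dst_y + row * dst_y_stride, src_y + row * src_y_stride, width);

    /* Chroma: split UVUVUV... into separate U and V rows. */
    for (size_t row = 0; row < height / 2; row++) {
        const uint8_t *uv = src_uv + row * src_uv_stride;
        uint8_t *u = dst_u + row * dst_u_stride;
        uint8_t *v = dst_v + row * dst_v_stride;
        for (size_t col = 0; col < width / 2; col++) {
            u[col] = uv[2 * col];
            v[col] = uv[2 * col + 1];
        }
    }
}
```

Every output byte is touched once here, which is why this path costs a full frame copy, while the get_buffer2 path that references the NV12 output directly does not.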
it's in the player(ffplay/mpv/chromium/firefox) if they use gpu(egl) to render video frame, the buffer to texture(glTexImage2D) step would cause a memcpy(neon version). the reason we use drm prime in ffmpeg is to use zero-copy dmabuf importing(with custom mpv/gst xvimage) so: |
for VPU performance, you can check with driver log: |
For instance, would you have an example of how to import dma buf to texture (egl)? |
Quick (n00b) question here @hbiyik: if I wish to have an ffmpeg which can do the below task by hardware acceleration on RK3568B2:
Can any of your repository be my choice? I tried with https://github.com/jjm2473/ffmpeg-rk, but it segfaults constantly, and repository owner is not responding to my questions. |
@danergo libuv branch should do your trick |
Thank you. You mean libyuv branch? I have a RK3568 CPU (nanopi r5c), with this os: rk3568-eflasher-debian-bullseye-core-5.10-arm64-YYYYMMDD.img.gz How shall I (from which repos) build or (from which packages) install the dependencies? What are the dependencies? Librga and mpp? What is the recommended configure command? Thank you very much. |
I suggest you open a ticket in my repo so Jeffy's repo doesn't get spammed by those questions. There I can try to help the best I can. |
check: i was using custom mpv&xserver: |
@hbiyik: that was my intention, but unfortunately your repo doesn't allow me to open tickets. If you can let me do that I'll open it there. I appreciate your help. Thank you! Sorry Jeffy. :) |
The only option I see now is to use libdav1d together with the VPU, so that maybe it can catch up to 8k@30 for youtube. @JeffyCN Thanks for your help so far. If I were to use the libdav1d decoder in parallel in rkmppdec.c, the decoded frames I return do not need to be ordered, right? FFmpeg should reorder them according to pts? In that case, how do I feed 2 different decoders so that each decoder only gets the packets it needs for its frames? Is it even possible? |
TRM's max fps could be lower than the real world, maybe you can: and maybe you can ask @HermanChen about the actual max fps |
Thanks, will do. Btw the previous libdav1d thing was silly. Even the compression algorithm depends on previous frames, let alone that you cannot align packets with frames. So that's impossible. |
I think I'm making some progress. RGA does require the user to import the dma_fd/vir_addr/phy_addr buffer as a handle in advance and store it inside the RGA driver. Before the entire RGA singleton is destroyed, all previously imported handles need to be released to prevent leaks. If cascading is not required, the import buffer step can be skipped. |
Ah, that's why it required the C++ interface: to have a buffer wrapper. |
I'm revisiting the mpp encoder. Now I've done AFBC/FBCE. All that's left is to find a way to take advantage of dual core rkvenc. FFmpeg's existing framework is not suitable for enabling multi-threading for the encoder, maybe we should explore mpp's advanced interface MppTask and use its poll/dequeue/enqueue? |
one question, is rkvenc dual core? i thought it was single core, only vdpu 381 decoder is dual core. |
and could you find a solution to in/out fence problem? |
Not yet. Replacing fd with handle can only reduce but not avoid the issue. I plan to put it on hold for now and wait until the encoder is completed, then ask MPP developers to help us transfer the ffmpeg source code to RGA developers for testing. |
The MppTask is not very good at multi-threading and it is not efficient because of too many wait / signal operations.
Both vepu580 and vdpu381 are dual core.
The Mpp decoder provides a callback mode. Refer to MppDecCbCfg; it is in testing and not documented. |
Mpp has already handled all the dual core issues. The user can ignore this part. |
@HermanChen |
emm... The dual core mode differs between H.264 and H.265. |
mpi_enc_test can use dual core on H.265 over 4K or enable auto_tile option. |
mpi_enc_test can not use dual core on H.264 for its block input mode. |
@nyanmisaka FFmpeg defines NV20 as a 2-byte format with 6 bits of padding, but NV20 is a 10-bit format without padding. P210, which is the 2-byte variant of NV20 and corresponds to YUV422SP_10bit uncompact align=0 in RGA, seems to be correct. I had defined NV15 with the AV_PIX_FMT_FLAG_BITSTREAM flag to describe it correctly; I think something similar should have been done for NV20 as well. What do you think? |
@hbiyik I'm also skeptical of the current AV_PIX_FMT_NV20, which was introduced 10 years ago and very little code in FFmpeg refers to it. However, the NV15/NV20/NV30 in DRM and V4L2 have only been discussed and added in recent years. Also, the fate test for imgutils shows the same value for NV20 and P210.
|
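The difference between a packed 10-bit layout and a 2-bytes-per-sample layout can be sketched in a few lines. The unpacking below assumes the layout that the V4L2 documentation describes for NV15: four 10-bit components stored contiguously over 5 bytes in little-endian bit order, with no padding. Whether NV20 in a given stack follows exactly the same packing is an assumption here, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack four 10-bit samples stored in 5 bytes, little-endian bit order,
 * no padding (assumed NV15/NV20-style packing). */
static void unpack_4x10(const uint8_t in[5], uint16_t out[4])
{
    out[0] = (uint16_t)(in[0]        | ((in[1] & 0x03) << 8));
    out[1] = (uint16_t)((in[1] >> 2) | ((in[2] & 0x0F) << 6));
    out[2] = (uint16_t)((in[2] >> 4) | ((in[3] & 0x3F) << 4));
    out[3] = (uint16_t)((in[3] >> 6) | (in[4] << 2));
}

/* Bytes per luma row for the packed layout: 10 bits per sample. */
static size_t packed10_row_bytes(size_t width)
{
    assert(width % 4 == 0);   /* samples come in groups of four */
    return width * 10 / 8;
}

/* Bytes per luma row when each 10-bit sample sits in a 16-bit word
 * with 6 bits of padding (P010/P210 style). */
static size_t padded16_row_bytes(size_t width)
{
    return width * 2;
}
```

Under these assumptions a 1920-pixel luma row takes 2400 bytes packed versus 3840 bytes padded, so describing a packed format with a 2-byte-per-sample descriptor gives wrong plane sizes and offsets.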
@nyanmisaka here is the port of mpp+rga+dmaheaps to mainline 6.7rc2 with panthor. I tested and it is working pretty fine.
dmesg of mpp
dmesg of rga
|
@hbiyik I'm curious if the non-essential features (qos, devfreq) of MPP you disabled earlier have any impact on performance. Btw I added the drm GEM related changes so I could test it in my FFmpeg. |
Yeah, I think I overlooked it, rc1 then.
It must have an impact on either performance, power consumption or thermals, but currently I think mainline is booting with the default clock values, which are the performance values, and the mpp related hardware does not have a governor. I think if it causes problems, the right way to do it would be to port mpp to mainline interfaces, because the rkr interfaces are quite different and dependent on the rest of the drivers, up to pvtm.
What does GEM help with? I have no idea what this is; I read the kernel docs and it sounds complicated. Also: I noticed that mpv cannot do atomic swap when using drm output. I did not dig in too deep, but I suspect that when the buffer is requested from rockchip_dma* it cannot handle this properly when outputting to drm. Maybe it is related to this GEM thing. Also AV1 is not working because the MMU driver is different for it, and rkr and mainline are too different between versions; I did not want to make the port bigger. Also AV1 is already supported in mainline over V4L2, so currently I would rather ignore it. |
I mainly use it to create drm dumb buffers. The drm allocator inside MPP also relies on this. The RGA seems to be underperforming. While the MPP decoder gives 1500fps on HEVC 1080p, RGA is capped at about 200fps. Can you test it using dma_heap on your side? |
Yes, I can confirm: rga never reaches 100%, it hovers around 50% load. I get 700fps for 1080p, but it should still be higher.
|
We can rule out a clock relation. I have a tool to inspect the registers directly, and the rga3 core is configured exactly the same for both the mainline and the vendor kernel.
and here is the mainline
|
That's weird, maybe async rga is not active in the config. Btw I am also on the same kernel with the GEM fixes: zcat /proc/config.gz | grep RGA
|
Can you try my branch on your kernel and see how it performs?
The problem may be drm dumb buffer specific, or I'm missing a certain kernel option. |
I had to scale it to make it work; if I do not set w=1280,h=720 it goes pass-through and rga is not applied (I think it does not check whether the input is afbc or not). Nevertheless, it sometimes loads core1 to 1%, otherwise it is zero; core0 fluctuates around 30% ~ 50%, and fps is around 200~250. Maybe if I enable rga debug logging it might tell us why the scheduler is making such decisions... I scaled the same file to the same dims (1280*720) with my code: it loads 2 cores, both up to 50%, and fps is around 500~600. Your ffmpeg hits 1500fps with the same file when no rga is involved. My ffmpeg tops out at 950; however, I know mine was also hitting 1500 on the vendor kernel. So it seems like my fork has an overall slowdown regardless of RGA (maybe something is wrong in the internal decoder async loop); your problem seems different. |
I had disabled a section related to TLB in the rk_heap driver, the reason was EDIT: Never mind, this block is only useful when memory is >4gb; I am testing on a 4gb model, so this cannot be the root cause. |
[mpp/mpp_enc]: Add async encoder flow [mpp_enc]: Fix h265e async issue So, can i interpret this as - the h265e hardware on rk3588 doesn't support frame parallel/async io, and it's not a driver limitation, right?
|
Hello
I am experimenting with an approach where the mmapped MppBuffer pointer is directly referenced by AVFrame->data[0,1,2], so that I can get rid of the memcpy totally. I expect some alignment issues, and without any conversion I can only get yuv420sp, but the current problem is that I have huge memory leaks.
Do you have any suggestions as to what I am doing wrong? It seems that the mpp frame is not released even though I release it explicitly.
Prototype here:
https://github.com/hbiyik/FFmpeg/blob/61c629b2a6b65a319b767fafac3f01221d9c16f7/libavcodec/rkmppdec.c