mmal: Decoder stops sending output buffers #417
Comments
Some more detail: The issue does indeed seem to be triggered by the recreation of image_fx. Due to a design flaw in VLC's core, the image_fx plugin is recreated twice in quick succession when an aspect ratio change occurs. I added an ugly hack to recreate it only once and haven't seen the error since. I'll try to find a proper way in the VLC core to avoid the double recreation. |
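For illustration only, here is a minimal sketch of one way to collapse recreation requests that arrive in quick succession. It is not the VLC hack mentioned above, which is not shown in this thread; the function name, the millisecond timestamps and the 100 ms window are all assumptions.

```python
def coalesce_requests(timestamps_ms, window_ms):
    """Collapse requests that follow each other within window_ms.

    Returns the timestamps at which a recreation would actually run.
    A request arriving within window_ms of the previous kept request
    supersedes it, so only the last request of each burst survives.
    """
    kept = []
    for t in sorted(timestamps_ms):
        if kept and t - kept[-1] <= window_ms:
            kept[-1] = t  # a newer request in the same burst replaces the older one
        else:
            kept.append(t)
    return kept


# Hypothetical aspect ratio change firing two requests 10 ms apart,
# followed by an unrelated request five seconds later.
runs = coalesce_requests([0, 10, 5000], window_ms=100)
```

With the sample values, the burst at 0 ms and 10 ms collapses into a single recreation and the later request is kept separately.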
Anything in the normal (non-assert) log? I had some issues with creating/destroying of deinterlace component being unreliable (although in my case video_decode is tunnelled to image_fx - not sure if yours is the same). From advice by @6by9 I moved the pool to the video_render component (which doesn't get disabled) and enabled "zero copy" mode (which seems necessary for reference counting of opaque buffers), and it does seem more stable when disabling/enabling deinterlace "on-the-fly". @6by9 and @luked99 may have a better understanding of exactly what is safe and what is unsafe. |
@popcornmix Which log do you mean? The msg I posted is from In my case the components are not tunnelled, but controlled individually. I do use zerocopy though and the owner of the mmal_buffer_header pool is the video_output component. |
The message comes from:
Presumably this is something being sent from VC->ARM. Don't know what it is, but it is 292 bytes.
which might give a clue as to what the 292 byte buffer is. |
You can also enable mmal logging with instructions here: Using the start_db.elf firmware will produce more debug messages. |
Host-side VCHIQ logging might be interesting. I think you can enable it via some sysfs entries (would need to check the kernel driver to see for sure). Usually when I see that error it means the host process has died and VCHIQ has started failing messages. So it's possible something bad happened earlier, the host process died and you then got this message. |
It's a buffer, but can't say anything more:
so 292 is sizeof(mmal_worker_buffer_from_host) (defined in mmal_vc_msgs.h). Shame it doesn't print something more useful as the function does have the full MMAL buffer header and port :-( Agreed that it normally means that the client has died, or otherwise VCHIQ can't queue the buffer for delivery. |
Yes, now you mention it I think I've seen that vchiq LOG_ERROR myself when killing the arm user process. If VC is sending a message to arm user app when it exits (e.g. control-c or seg-fault) then it is normal behaviour to see a vchiq error like that, so it may be a false alarm, unless you are sure it occurs without the process getting killed. |
It might indeed be a false alarm then. I added an abort when I detect the stalled decoder because of input buffers being all in use. So the arm side process dies immediately and the error might just come from one of the still running mmal components. I will try with extended logging and without the abort tomorrow. |
I just hit the stall with debug logging enabled. See the log here: As I am out of the office right now I haven't had a chance to look deeper into the logs myself yet. I will do so later today, but maybe one of you already sees something relevant... |
Looks like decode has done as much as it can, but the frames haven't been passed on to either deinterlace or the renderer, and from there recycled. https://gist.github.com/julianscheel/540d2da085b502dc3e3d#file-gistfile1-txt-L639 onwards, each "video_decode:45:RIL:cb:" is an output being generated by the decoder. No further calls to image_fxRIL or video_render afterwards. Has the ARM passed the buffers on? MMAL logging would be of use here, either on the ARM side or VC. |
@6by9 In fact the host did not receive the buffers. At some point the output port callback simply stops being called. From then on the input port continues reading data until all input buffers are in use and I call abort(). How would I enable VC-side mmal logging beyond what I already have? |
Looks like you only have the ril component logging. Have you done this:
|
Yes, actually I did. Just copied from shell history:
|
In turning on RIL logging you have probably turned off all VCOS logging. |
@6by9 Thanks, that makes sense. Restarting the test now with proper level set. |
Hm, now the log looks like this: https://gist.github.com/julianscheel/04050bff4326b07e6f97 |
So |
Hm, something is fishy with the logs for me. This is the exact sequence I do, trying to activate mmal_dbg_log. But
Generated log while playing video (not in error case yet): https://gist.github.com/julianscheel/03fc6be84be0e334a6f0 |
You've been given duff info for the MMAL logging I'm afraid. The symbol you actually want is |
I think the method for setting it from the ARM was to rely on the ELF symbols, which is easier to set up than a vcdbg symbol and works fine... unless you don't have the ELF symbol file for some reason :-( |
It looks like you don't even bother to register it as a proper VCOS category, so:
which ought to work returns:
|
Can't this just be open-sourced :-) |
Looks like: |
Ah, it is registered on first component creation, but also deregistered on last component destruction. So while the app is running, you should be able to do |
Overlapping comments. |
Yes, that works :) Test running with enabled mmal_dbg_log now. Will post the log when hitting the error case. |
Ok, having the next log: https://gist.github.com/julianscheel/bc0169a9ec8295adf7c7 update: gist url changed, wrong one in initial post, sorry |
At the risk of swamping the log, you could try adding some output from vchiq to try to separate cause from effect:
Replace |
@pelwell Ok, test with vchiq_core info is running |
Next hit, this time with vchiq core logging enabled as well: Sequence of error is the same as before. Can you see anything suspicious in the log? |
I'm no MMAL expert, but it looks like everything is working fine until line 1759:
This is the first of the CLOSE messages received by the VPU from the ARM, and it is for the vcgencmd service. A further 7 services are then closed, all instigated by the ARM. Unless the MMAL layer (or one of the other services) has done something to upset the ARM, it looks as though the VPU is in the clear. |
You can enable similar tracing on the ARM side using the /sys/kernel/debug interface. If that directory isn't mounted automatically you will need to do:
You can then access the vchiq logging controls in /sys/kernel/debug/vchiq/log/*. Enabling "core" logging is unlikely to shed any light on things as it will be a mirror of the VC view, but here are some logging suggestions - choose some or all:
Setting arm tracing to "info" is just enough to allow us to distinguish between an orderly close, which looks like this:
and client death:
but using "trace" instead will show more detail. |
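As a rough sketch of scripting the controls under /sys/kernel/debug/vchiq/log/, the helper below writes a level string into one control file. The category names "arm" and "core" and the levels "info" and "trace" are taken from the comment above; the levels "error" and "warning", and the helper itself, are assumptions that should be checked against the kernel driver.

```python
from pathlib import Path

# "info" and "trace" are mentioned in this thread; "error" and "warning"
# are assumed here and should be verified against the kernel driver.
VALID_LEVELS = ("error", "warning", "info", "trace")


def set_vchiq_log_level(category, level, log_dir="/sys/kernel/debug/vchiq/log"):
    """Write a log level into one vchiq logging control file.

    category is the file name under log_dir (for example "arm" or "core").
    Returns the path that was written.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    control = Path(log_dir) / category
    if not control.exists():
        raise FileNotFoundError(f"no such logging control: {control}")
    control.write_text(level + "\n")
    return control
```

On a real system this would need root and a mounted debugfs, for example `set_vchiq_log_level("arm", "info")`. The `log_dir` parameter exists only so the helper can be tried against an ordinary directory.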
Do we know if it's actually necessary to destroy and re-create the image_fx
components? Can they just be taken back to LOADED, reconfigured and then
brought back to IDLE and then EXECUTING?
I don't have the code to look at so I don't know, but changing things like
resolution _ought_ to be possible once it's in LOADED.
|
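The state sequence suggested above can be sketched as a toy state machine. This models only the LOADED, IDLE and EXECUTING states named in the comment and does not call any real MMAL or IL API; the class, its methods and the sample resolutions are hypothetical.

```python
# Legal single-step transitions in this toy model.
TRANSITIONS = {
    "LOADED": {"IDLE"},
    "IDLE": {"LOADED", "EXECUTING"},
    "EXECUTING": {"IDLE"},
}


class Component:
    def __init__(self, name, width, height):
        self.name = name
        self.width, self.height = width, height
        self.state = "LOADED"
        self.history = ["LOADED"]

    def set_state(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise RuntimeError(f"{self.name}: {self.state} -> {new_state} not allowed")
        self.state = new_state
        self.history.append(new_state)

    def configure(self, width, height):
        # In this model, reconfiguration is only permitted in LOADED.
        if self.state != "LOADED":
            raise RuntimeError(f"{self.name}: cannot reconfigure in {self.state}")
        self.width, self.height = width, height


def reconfigure_in_place(comp, width, height):
    """Step down to LOADED, reconfigure, then step back up to EXECUTING."""
    while comp.state != "LOADED":
        comp.set_state("IDLE" if comp.state == "EXECUTING" else "LOADED")
    comp.configure(width, height)
    comp.set_state("IDLE")
    comp.set_state("EXECUTING")
```

The point of the sketch is that the component object survives the whole sequence, so nothing is destroyed or re-created while the format changes.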
@pelwell Thanks, let me try with that logging as well. @luked99 Not for sure. I do know that it does not happen if image_fx is not in the pipeline at all. The reason for the recreation of image_fx is some pessimistic assumptions in VLC's core. In fact, for aspect ratio changes we wouldn't even need to stop/start the components at all. I do have a patched version of VLC in testing which avoids the recreation of image_fx completely. But this can't be pushed upstream without further rework, because it would quite likely break non-mmal use cases. Anyway, if the tests succeed with this hack I can implement another hack to see how it behaves when doing a |
Agreed that the VPU appears to be in the clear from that log. The buffers have all been sent to the ARM, and it hasn't returned any to image_fx or render to consume them. As well as the VCHI logging, you can enable extra logging in the ARM side MMAL library. I always tend to edit interface/mmal/core/mmal_logging.c to set mmal_log_level to VCOS_LOG_TRACE and rebuild userland, but I think @pelwell has a magic trick for setting that log level without rebuilding. There is also |
before running your test. |
Sorry, it took me a while to capture the next logs. Here we are: Contains dmesg output after the error was hit, plus the vlc output with mmal trace logging interleaved. The vlc log starts a while before the last aspect change occurs (line 111), which triggers two recreations of image_fx, after which the decoder no longer emits any frames and the codec finally runs out of buffers for input data and aborts. |
The previous time it had 'Filter deinterlace appended to chain' there was no mmal_codec error, so I guess that may be significant. I don't have the source code so I can only guess as to what that means, but perhaps out of buffers on the input port of the decoder? EDIT: that's what you've said above, sorry. |
Sorry, at least to me, that logging doesn't add much :-( The output from |
@luked99 Yes, that error means that we're running out of buffers for video_decode's input port. But in fact this happens because the codec stops outputting frames. Normally no more than 2 or 3 input buffers are in use, as we feed frame-packetized data into the codec. When the error occurs, the codec stops sending frames out and thus no longer frees the incoming buffers, up to the point where all buffers are in use and we abort. Is there a more verbose level than mmal:trace for the mmal-side logging? I've just pushed my current vlc testing tree (the mmal part is quite a bit ahead of upstream git due to recent refactorings) here: https://github.com/julianscheel/vlc/tree/mmal-debugging/modules/hw/mmal |
interface/mmal/core/mmal_port.c line 36
For this one we need to know all the buffer transfers. I thought that was under the normal trace, but obviously not. |
@6by9 It seems that adding the extra logging prevents the problem from being triggered... I'll keep trying though. Chances are it will still happen at some point, just less often now. |
I was still unable to trigger the issue with full trace logging enabled, but I had a run without full debug that prints out the pool states as well as mmal-stats and components in the error case: I don't think it reveals anything really useful, but at least you can see that video_decode received exactly 20 buffers more than it sent out. This matches the out-of-input-buffers scenario, as buffer_num is set to 20 on the input port. |
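The arithmetic behind that observation can be sketched with a deliberately simplified model: every input buffer is handed back only while the decoder still emits output, and nothing comes back once it stalls. The one-in, one-out behaviour, the function name and the stall point of 100 frames are assumptions made for illustration; the pool size of 20 is the buffer_num from the comment above.

```python
def feed_until_stall(buffer_num, frames_before_stall):
    """Feed input buffers until the pool is empty.

    While the decoder is still producing output, each submitted buffer
    is returned to the pool. After frames_before_stall submissions the
    decoder stops, so submitted buffers stay in use.
    Returns (sent, returned).
    """
    free = buffer_num
    sent = returned = 0
    while free > 0:
        free -= 1
        sent += 1
        if sent <= frames_before_stall:
            # Decoder still emitting: the input buffer is released again.
            free += 1
            returned += 1
    return sent, returned


sent, returned = feed_until_stall(buffer_num=20, frames_before_stall=100)
gap = sent - returned
```

In this model the gap between buffers sent and buffers returned at the moment the pool runs dry always equals the pool size, regardless of when the stall starts.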
With full debugging enabled, the test has now been running in an endless loop for a week without hitting the issue. So I assume the slowdown caused by printing the logs stops it from happening. Any other thoughts on what we could do to track it down? |
So the pool called "image_fxRIL: image pool:" is in state DESTROYING. It's been asked to go away but appears to be stuck, with one image still held by mmal-worker. I think that sounds a bit as though an image has been passed to the host but not returned, or it was returned by the host but VideoCore dropped it on the floor for some reason (most likely the former). Can you count buffers going across the VC/Host interface on the ARM side to check the host is returning everything expected? |
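One way to do that counting is sketched below as a small ledger that would be updated from the host-side callbacks. The class, its method names and the integer buffer ids are hypothetical and are not part of MMAL; in a real client the hooks would sit wherever buffers arrive from VideoCore and wherever they are handed back.

```python
from collections import Counter


class BufferLedger:
    """Track buffers that crossed from VC to the host and back."""

    def __init__(self):
        self.received = 0
        self.returned = 0
        self._held = Counter()

    def on_received(self, buffer_id):
        # Call when a buffer arrives on the host (e.g. in a port callback).
        self.received += 1
        self._held[buffer_id] += 1

    def on_returned(self, buffer_id):
        # Call when the host hands the buffer back towards VC.
        if self._held[buffer_id] == 0:
            raise ValueError(f"buffer {buffer_id} returned but not held")
        self.returned += 1
        self._held[buffer_id] -= 1

    def outstanding(self):
        """Ids of buffers the host has received but not yet returned."""
        return sorted(b for b, n in self._held.items() if n > 0)


ledger = BufferLedger()
for buf in (1, 2, 3):
    ledger.on_received(buf)
ledger.on_returned(1)
ledger.on_returned(3)
```

If `ledger.outstanding()` is non-empty when the stall occurs, the host is still holding those buffers; if it is empty, the host returned everything it was given.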
Sure, I'll add some counting and post results. |
@julianscheel how did it go with the counting? Has this issue been resolved? |
@Ruffio Admittedly I never found time to analyze this further. I was able to avoid the deadlock by avoiding fast recreation of image_fx filter, so we are not seeing this issue in real life anymore. But I am pretty sure it would still exist. |
@julianscheel it sounds good to me. @popcornmix ? |
Sure. I'll close. Feel free to reopen when you have more time to investigate. |
I am facing an issue where the mmal video_decode component suddenly stops sending decoded pictures on the output port. It seems to happen only right after an image_fx component is created, but I can't say for sure yet as it happens quite infrequently. I can see an error occurring via vcdbg log msg though:
When the component is in this state I can still send input buffers, but after a while they are no longer returned either (I assume the internal picture buffers are full, so the decoder cannot write any more pictures). The component can apparently be revived by flushing the output port.
Until now I have only seen it with mpeg2 content, but this might just be because image_fx is recreated much more often on mpeg2 streams, due to format changes on mpeg2/sd content, than on h264 hd content...
This message repeats several times. Assertions are not raised. Any thoughts what might go on there or how to further trace it?