Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mmal: Decoder stops sending output buffers #417

Closed
julianscheel opened this issue Apr 26, 2015 · 52 comments
Closed

mmal: Decoder stops sending output buffers #417

julianscheel opened this issue Apr 26, 2015 · 52 comments

Comments

@julianscheel
Copy link

I am facing an issue where the mmal video_decode component suddenly stops sending decoded pictures on the output port. It seems it is only happening right after an image_fx component is created, but I can't say for sure yet as it is quite infrequently happening. I can see an error occuring via vcdbg log msg though:

mmalsrv: send_buffer_to_host: tx failed:size 292 st -1

When the component is in this state I can still send input buffers, but they won't be returned after a while anymore, as well (I assume it's just that the internal pictures buffers are filled so the decoder can not write anymore picture). The component can be reanimated by flushing the output port, it seems.

Until now I have only seen in with mpeg2 content, but this might be just because of image_fx is much more often recreated on mpeg2 streams because of format changes on mpeg2/sd content compared to h264 hd content...

This message repeats several times. Assertions are not raised. Any thoughts what might go on there or how to further trace it?

@julianscheel
Copy link
Author

Some more detail: The issues seems indeed to be triggered by the recreation of image_fx. Actually due to a design flaw in VLCs core the image_fx plugin is recreated twice very fast after another when an aspect ratio change occurs. I added an ugly hack to only recreate it once and haven't seen the error raising since. I'll try to find a proper way in the VLC core to avoid the double recreation.
Still I think it shouldn't cause the decoder to stall :) - Could it be that due to the lazy destruction of pools the core runs out of resources or something like that?

@popcornmix
Copy link
Contributor

Anything in the normal (non-assert) log?

I had some issues with creating/destroying of deinterlace component being unreliable (although in my case video_decode is tunnelled to image_fx - not sure if yours is the same).
Turned out the images lose their ref counts in this situation (opaque buffers and disable port that image pool belongs to) and bad things could happen.

From advice by @6by9 I moved the pool to the video_render component (which doesn't get disabled) and enabled "zero copy" mode (which seems necessary for reference counting of opaque buffers), and it does seem more stable when disabling/enabling deinterlace "on-the-fly".
See: https://github.com/popcornmix/xbmc/commits/newclock4 for changes I made.

@6by9 and @luked99 may have a better understanding of exactly what is safe and what is unsafe.

@julianscheel
Copy link
Author

@popcornmix Which log do you mean? The msg I posted is from vcdbg log msg, the assert log is empty.

In my case the components are not tunnelled, but controlled individually. I do use zerocopy though and the owner of the mmal_buffer_header pool is the video_output component.

@popcornmix
Copy link
Contributor

The message comes from:

   vst = vchiq_queue_message(comp->control_service->vchiq_service,
                             elems, 1);

   if (vst != VCHIQ_SUCCESS)
   {
      if (using_bulk_transfer)
         vcos_mutex_unlock(&comp->control_service->bulk_lock);

      comp->control_service->server->stats.buffers.tx_fails++;
      LOG_ERROR("tx failed:size %d st %d", elems[0].size, vst);
      using_bulk_transfer = 0;
      // FIXME: do something!
   }

Presumably this is something being sent from VC->ARM. Don't know what it is, but it is 292 bytes.
You should be able to enable more logging with:

vcgencmd set_logging level=0x40 

which might give a clue as to what the 292 byte buffer is.

@popcornmix
Copy link
Contributor

You can also enable mmal logging with instructions here:
#377 (comment)

Using the start_db.elf firmware will produce more debug messages.

@luked99
Copy link

luked99 commented Apr 27, 2015

Host-side VCHIQ logging might be interesting. I think you can enable it via some sysfs entries (would need to check the kernel driver to see for sure).

Usually when I see that error it means the host process has died and VCHIQ has started failing messages. So it's possible something bad happened earlier, the host process died and you then got this message.

@6by9
Copy link

6by9 commented Apr 27, 2015

It's a buffer, but can't say anything more:

   mmal_worker_buffer_from_host msg;
   VCHIQ_ELEMENT_T elems[] = {{&msg, sizeof(msg)}};
...
   LOG_ERROR("tx failed:size %d st %d", elems[0].size, vst);

so 292 is sizeof(mmal_worker_buffer_from_host) (defined in mmal_vc_msgs.h). Shame it doesn't print something more useful as the function does have the full MMAL buffer header and port :-(

Agreed that it normally means that the client has died, or otherwise VCHIQ can't queue the buffer for delivery.

@popcornmix
Copy link
Contributor

Yes, now you mention it I think I've seen that vchiq LOG_ERROR myself when killing the arm user process. If VC is sending a message to arm user app when it exits (e.g. control-c or seg-fault) then it is normal behaviour to see a vchiq error like that, so it may be a false alarm, unless you are sure it occurs without the process getting killed.

@julianscheel
Copy link
Author

It might indeed be a false alarm then. I added an abort when I detect the stalled decoder because of input buffers being all in use. So the arm side process dies immediately and the error might just come from one of the still running mmal components. I will try with extended logging and without the abort tomorrow.

@julianscheel
Copy link
Author

I just hit the stall with enabled debug logging. See the log here:
https://gist.github.com/julianscheel/540d2da085b502dc3e3d

As I am out of office right now I had no chance to have a deeper look into the logs myself yet. Will do so later today. But maybe some of you sees something relevant already...

@6by9
Copy link

6by9 commented Apr 28, 2015

Looks like decode has done as much as it can, but the frames haven't been passed on to the either deinterlace or the renderer, and from there recycled.

https://gist.github.com/julianscheel/540d2da085b502dc3e3d#file-gistfile1-txt-L639 onwards, each "video_decode:45:RIL:cb:" is an output being generated by the decoder. No further calls to image_fxRIL or video_render afterwards. Has the ARM passed the buffers on?

MMAL logging would be of use here, either on the ARM side or VC.

@julianscheel
Copy link
Author

@6by9 The host did no receive the buffers in fact. There is a moment where the output port callback just stops being called. From there on the input port continues reading data up to the point where all input buffers are in use and I do the abort().

How would I enable VC side mmal logging besides what I already have?

@popcornmix
Copy link
Contributor

Looks like you only have the ril component logging. Have you done this:

You can also enable mmal logging with instructions here:
#377 (comment)

@julianscheel
Copy link
Author

Yes, actually I did. Just copied from shell history:

# /tmp/vcdbg syms 2>&1 | grep mmal_dbg_log                                   0x0efd3478         4 0xcf6dc040 mmal_dbg_log
# /tmp/vcdbg trash 0xcf6dc040 4

@6by9
Copy link

6by9 commented Apr 28, 2015

In turning on RIL logging you have probably turned off all VCOS logging.
It's a bitmask, and to keep VCOS logging you need 0x30000 still set. I'm assuming you've done
vcgencmd set_logging level=192
so you actually need level=196800

@julianscheel
Copy link
Author

@6by9 Thanks, that makes sense. Restarting the test now with proper level set.

@julianscheel
Copy link
Author

Hm, now the log looks like this: https://gist.github.com/julianscheel/04050bff4326b07e6f97
Still does not seem to contain mmal_dbg_log?

@6by9
Copy link

6by9 commented Apr 28, 2015

So vcgencmd set_logging level=... is setting the variable called current_logging_level.
sudo vcdbg syms |& grep current_logging_level will display the current value. On my system I get 0x40010000.
Unless I'm missing something then level=196800 should be equivalent to 0x300C0. 0x40 gives the basic RIL logging, 0x80 adds extra logging from RIL. For some reason HDMI audio is also on 0x80 even though there is another category for audio.
0x4000000 isn't critical. 0x00010000 must be set to get any of the VCOS logging messages (which MMAL uses). Sounds like your best bet then is going to be 0x40010040, or 1073807424, though I wouldn't expect much difference from 65600 (which will shut up the HDMI audio messages too).

@julianscheel
Copy link
Author

Hm, something is fishy with the logs for me. This is the exact sequence I do, trying to activate mmal_dbg_log. But vcdbg log msg still does not contain more logs than before. I do this with start_db.elf from Apr/6, btw.

# /tmp/vcdbg syms 2>&1 | grep mmal.*log
0x0efd3478         4 0xcf6dc040 mmal_dbg_log
0x0efd347c         4 0xcf6dd180 mmal_host_log
# /tmp/vcdbg trash 0xcf6dc040 4
# vcgencmd set_logging level=0x40010040
level=1073807424 
# /tmp/vcdbg syms 2>&1 | grep current_logging_level
0xcec02038         4 0x40010040 current_logging_level

Generated log while playing video (not in error case yet): https://gist.github.com/julianscheel/03fc6be84be0e334a6f0

@6by9
Copy link

6by9 commented Apr 29, 2015

You've been given duff info for the MMAL logging I'm afraid. mmal_dbg_log is associated with the mmal_vc_diag eventlog command, not adding extra logging to VC side. Likewise mmal_host_log is used by mmal_vc_diag host_log.

The symbol you actually want is mmal_log_category, but it's not exported as a debug symbol so isn't visible. Need to have a look at the debug tools again as we did have a mechanism for setting these logging levels from the ARM.

@luked99
Copy link

luked99 commented Apr 29, 2015

I think the method for setting it from the ARM was to rely on the ELF symbols. Which is easier to setup than a vcdbg symbol and works fine....unless you don't have the ELF symbol file for some reason :-(

@pelwell
Copy link
Contributor

pelwell commented Apr 29, 2015

It looks like you don't even bother to register it as a proper VCOS category, so:

vcgencmd vcos log set mmal_log_category trace

which ought to work returns:

Unrecognized category: 'mmal_log_category'

@luked99
Copy link

luked99 commented Apr 29, 2015

Can't this just be open-sourced :-)

@6by9
Copy link

6by9 commented Apr 29, 2015

@luked99 Agreed, it used the ELF file to find any symbol. Not helpful.

@pelwell It is registered from mmal_logging_init (interface/mmal/core/mmal_logging.c). Not sure why it isn't showing up, but thanks for the reminder over how to tweak vcos logging levels via vcgencmd.

@popcornmix
Copy link
Contributor

Looks like:
vcgencmd vcos log set mmal trace
does work, but only once mmal is running, which is a bit awkward.
We believe that the logging level will be retained on next run.

@6by9
Copy link

6by9 commented Apr 29, 2015

Ah, it is registered on first component creation, but also deregistered on last component destruction. So the app is running, you should be able to do
vcgencmd vcos log set mmal trace

@6by9
Copy link

6by9 commented Apr 29, 2015

Overlapping comments.
Yes, I agree that the log level should be retained for the next run.

@julianscheel
Copy link
Author

Yes, that works :) Test running with enabled mmal_dbg_log now. Will post the log when hitting the error case.

@julianscheel
Copy link
Author

Ok, having the next log: https://gist.github.com/julianscheel/bc0169a9ec8295adf7c7
In line 1023 the last image_fx creating is executed. I think from line 1066 things are weird. The video_decoder component asks for more data continously until all 20 input buffers must be in use. No picture callback seems to be fired anymore then. The amount of data received within those 20 frames must be enough to contain multiple pictures, I'd say... Especially as incoming data should be packetized already, so as long as VLCs mpegvideo packetizing works properly one input buffer shall lead to one output picture.

update: gist url changed, wrong one in initial post, sorry

@pelwell
Copy link
Contributor

pelwell commented Apr 29, 2015

At the risk of swamping the log, you could try adding some output from vchiq to try to separate cause from effect:

vcgencmd vcos log set vchiq_core info

Replace info with warn or error to shut it up.

@julianscheel
Copy link
Author

@pelwell Ok, test with vchiq_core info is running

@julianscheel
Copy link
Author

Next hit, this time with vchiq core logging enabled as well:
https://gist.github.com/julianscheel/b05ee61bf8b218817e3e

Sequence of error is the same as before. Can you see anything suspicious in the log?

@pelwell
Copy link
Contributor

pelwell commented Apr 30, 2015

I'm no MMAL expert, but it looks like everything is working fine until line 1759:

4151244.978: vchiq_core: 0: prs CLOSE@52c347c0 (12->1)

This is the first of CLOSE messages received by the VPU from the ARM, and it is for the vcgencmd service. A further 7 services are then closed, all instigated by the ARM.

Unless the MMAL layer (or one of the other services) has done something to upset the ARM, it looks as though the VPU is in the clear.

@pelwell
Copy link
Contributor

pelwell commented Apr 30, 2015

You can enable similar tracing on the ARM side using the /sys/kernel/debug interface. If that directory isn't mounted automatically you will need to do:

sudo mount -t debugfs nodev /sys/kernel/debug

You can then access the vchiq logging controls in /sys/kernel/debug/vchiq/log/*. Enabling "core" logging is unlikely to shed any light on things as it will be a mirror of the VC view, but here are some logging suggestions - choose some or all:

echo info | sudo tee /sys/kernel/debug/vchiq/log/core
echo info | sudo tee /sys/kernel/debug/vchiq/log/arm       // Might be enough to be useful with core=info
echo trace | sudo tee /sys/kernel/debug/vchiq/log/arm      // If not, try this

Setting arm tracing to "info" is just enough to allow us to distinguish between an orderly close, which looks like this:

[ 2193.368962] vchiq: vchiq_open
[ 2193.735233] vchiq: close_delivered(handle=b002)
[ 2193.735446] vchiq: vchiq_release: instance=ad76e000

and client death:

[ 2174.720906] vchiq: vchiq_open
[ 2174.728848] vchiq: vchiq_release: instance=ad613c00

but using "trace" instead will show more detail.

@luked99
Copy link

luked99 commented Apr 30, 2015 via email

@julianscheel
Copy link
Author

@pelwell Thanks, let me try with that logging as well.

@luked99 Not for sure. I do know that it does not happen if image_fx is not in the pipeline at all. The reason for the recreation of image_fx are some pessimistic assumptions in VLCs core. In fact for aspect ratio changes we wouldn't even need to stop/start the components at all. I do have a patched version of VLC in testing which avoids the recreation of image_fx completely. But this can't be pushed to upstream without further rework, because it will break non-mmal usecases pretty likely. Anyway, if the tests succeed with this hack I can implement another hack to see how it behaves when doing a loaded->idle->executing cycle on each aspect ratio change

@6by9
Copy link

6by9 commented Apr 30, 2015

Agreed that the VPU appears to be in the clear from that log. The buffers have all been sent to the ARM, and it hasn't returned any to image_fx or render to consume them.

As well as the VCHI logging, you can enable extra logging in the ARM side MMAL library. I always tend to edit interface/mmal/core/mmal_logging.c to set mmal_log_level to VCOS_LOG_TRACE and rebuild userland, but I think @pelwell has a magic trick for setting that log level without rebuilding.
That should list all the buffers back and forth.

There is also vcgencmd vcos log set mmalsrv trace which is the logging for the VC side service processing all MMAL requests.

@pelwell
Copy link
Contributor

pelwell commented Apr 30, 2015

export VC_LOGLEVEL=mmal:trace

before running your test.

@julianscheel
Copy link
Author

Sorry, took me a while to capture the next logs. Here we are:
https://gist.github.com/julianscheel/b0bee89ee181c10b053f

Contains dmesg output after error was hit plus the vlc output with meal trace logging interleaved. The vlc log starts a while before the last aspect change occurs (line 111) which triggers two recreations of image_fx after which no frames are spit out by the decoder anymore and the codec finally runs finally out of buffers for input data and aborts.

@luked99
Copy link

luked99 commented May 3, 2015

[01a85738] core video output debug: Filter 'deinterlace' (0x1ab6ad8) appended to chain
[b5a08370] mmal_codec decoder error: Failed to retrieve buffer header for input data

The previous time it had 'Filter deinterlace appended to chain' there was no mmal_codec error, so I guess that may be significant. I don't have the source code so I can only guess as to what that means, but perhaps out of buffers on the input port of the decoder?

EDIT: that's what you've said above, sorry.

@luked99
Copy link

luked99 commented May 3, 2015

Sorry, at least to me, that logging doesn't add much :-(

The output from vcdbg pools would be possibly helpful, even more so if you could capture it just after the image_fx component has been destroyed.

@julianscheel
Copy link
Author

@luked99 Yes that error means that we're running out of buffers for video_decodes input port. But in fact this happens because the codec stops outputting frames. Normally not more than 2 or 3 input buffers are in use, as we give frame-packetized data into the codec. Now when the error occurs the codec stops sending frames out and thus does not free the incoming buffers anymore. Up to the point where all buffers are in use and we abort.
I can add some more debug output in vlc to show when buffers are received from the video_decode component and how many input buffers are in use.

Is there a more verbose level then mmal:trace for the mmal-side logging?

I've just pushed my current vlc testing tree (the mmal part is quite a bit ahead of upstream git due to recent refactorings) here: https://github.com/julianscheel/vlc/tree/mmal-debugging/modules/hw/mmal

@6by9
Copy link

6by9 commented May 3, 2015

Is there a more verbose level then mmal:trace for the mmal-side logging?

interface/mmal/core/mmal_port.c line 36

/* Define this if you want to log all buffer transfers */
//#define ENABLE_MMAL_EXTRA_LOGGING

For this one we need to know all the buffer transfers. I thought that was under the normal trace, but obviously not.

@julianscheel
Copy link
Author

@6by9 It seems that adding the extra logging prevents the problem from being triggered... I'll try it further though. By chances it will happen sometime, just less likely now.

@julianscheel
Copy link
Author

I still was unable to trigger the issue with full trace logging enabled, but had a run without full debug and printing out the pool states as well as mmal-stats and components in error case:
https://gist.github.com/julianscheel/7d7d92b7b1fe4a690c80

I don't think it reveals something really useful, but at least you can see that video_decode received exactly 20 buffers more than it sent out. Which matches the out of input buffers scenario, as buffer_num in set to 20 on the input port.

@julianscheel
Copy link
Author

With full debugging the test is running for a week in an endless log without hitting the issue now. So I assume the slowdown because of printing the logs is stopping it from happening. Any other thoughts what we could do to track it?

@luked99
Copy link

luked99 commented May 11, 2015

So the pool called "image_fxRIL: image pool:" is in state DESTROYING. It's been asked to go away but appears to be stuck, with one image still held by mmal-worker. I think that sounds a bit as though an image has been passed to the host but not returned, or it was returned by the host but VideoCore dropped it on the floor for some reason (most likely the former).

Can you count buffers going across the VC/Host interface on the ARM side to check the host is returning everything expected?

@julianscheel
Copy link
Author

Sure, I'll add some counting and post results.

@Ruffio
Copy link

Ruffio commented Dec 30, 2016

@julianscheel how did it go with the counting? Has this issue been resolved?

@julianscheel
Copy link
Author

@Ruffio Admittedly I never found time to analyze this further. I was able to avoid the deadlock by avoiding fast recreation of image_fx filter, so we are not seeing this issue in real life anymore. But I am pretty sure it would still exist.
As nobody else seems to be suffering by this I think it would be ok if we close the issue. In case I get back to it someday I can still reopen. Agreed?

@Ruffio
Copy link

Ruffio commented Jan 2, 2017

@julianscheel it sounds good to me. @popcornmix ?

@popcornmix
Copy link
Contributor

Sure. I'll close. Feel free to reopen when you have more time to investigate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants