Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native image encoder intermittently fails to send final frame of image data causing hang #2

Closed
caprica opened this issue May 17, 2017 · 85 comments
Labels

Comments

@caprica
Copy link
Owner

caprica commented May 17, 2017

From time to time, when requesting an image, the native image encoder simply fails to send the final frame of image data.

The image data is fetched in chunks, this is repeated 1..n times by the native code invoking a callback with the image data. The return value of that callback signifies to the native encoder that it should send the next frame. Tracing the Java code shows that the callback is correctly invoked, and that the correct return value is being sent back to native code. Nevertheless, from time to time the native encoder simply seems to stall and will stop sending image data. It always seems to be the last chunk of data.

What this means in practical terms is that the Java application hangs while waiting for a native callback that will never occur. The JVM must then be terminated.

This is a show-stopper issue that precludes the use of this library.

There is similar strangeness, it may be unrelated, during native resource clean-up whereby native warning messages are intermittently dumped to stdout or stderr (e.g. freeing a resource that has already been freed).

This may point to there being a fundamental problem with the implementation of this library.

Having said that, the code for handling capture data from the native library has been implemented and debugged very carefully from the ground up on two entirely separate occasions. So either there is something wrong in native code that the JVM is exposing, or the exact same mistake has been made on two separate Java implementations.

Until the root cause of this issue can be identified, this library can not be used reliably.

Any hints from anyone and anywhere on what could be wrong here will be most welcome.

@AlexanderBykin
Copy link

hi, found your beautiful implementation, here is RaspiMJPEG project written with C language, this utility takes snapshots as fast as possible and flush them into the disk, i think this can help you a lot

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

Thanks!

The Java implementation here is based on the C code for the RaspiStill utility in the project that you linked to.

Nevertheless I will look at the particular code you linked and see if I can spot anything I did wrong.

@AlexanderBykin
Copy link

here is also parts of RaspiMJPEG:
RaspiMCmds
RaspiMJPEG
this utility run in a while loop and opens FIFO there you can send commams to control camera things

@AlexanderBykin
Copy link

yes you're right, this utility based on RaspiStill, but author of this utility has increased speed of taking pictures

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

That code does not really help too much, the key idea behind this Java implementation is to use the MMAL API directly and not rely on forking processes and using native commands.

Some more background on this issue...

I very carefully did two separate Java implementations of this, from scratch, based on the source code in the raspicam project, in particular the source code that used the native MMAL API directly. Both of those implementations I did suffered the exact same problems, so clearly I missed something non-obvious, twice.

It is very much possible to invoke the native command-line tools from Java, and there are already Java projects that do that. But I do not want to do that, I want a Java API that does not execute native commands in their own processes, but instead to implement the exact same thing that those native commands do when they talk to the MMAL API.

This is much more difficult, and requires me to write a lot more code, but the end result in my opinion will be much better. If it only worked 100% reliably!

@AlexanderBykin
Copy link

yes, exactly for direct calls to MMAL API from Java by using JNA i liked your project

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

I still hope to find a resolution to this issue, so the project is still very much alive, it's just stalled right now.

@AlexanderBykin
Copy link

maybe this is a bug of MMAL API?
try to update Raspberry rpi-update
and check there is the fresher VC libraries at /opt/vc

@AlexanderBykin
Copy link

also take me a piont where is that happening at code, i want to see how can i help you

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

I can't really point directly to the code that is broken.

The code dealing with handling picture data is in https://github.com/caprica/picam/blob/master/src/main/java/uk/co/caprica/picam/EncoderBufferCallback.java

This is a Java implementation of a native callback invoked by the MMAL API when camera data is ready to be sent. This callback accepts the picture data and then either finishes if it's the last data block, or requests more picture data if it is not the last data block. The problem is that this code executes correctly, it requests the next block when needed, but sometimes the native code simply refuses to invoke this callback with the last black of data. This failure happens outside of Java code.

Now, it could be that the Java code has exposed a rare bug in the native code, or it could be that there is something wrong in the Java code. Another problem is that there is a lot of stuff that has to be set up and linked together via the MMAL API, and any part of that could be slightly wrong.

All of the native setup code is here https://github.com/caprica/picam/blob/master/src/main/java/uk/co/caprica/picam/Camera.java

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

Your post inspired me to fully upgrade my Pi and after doing so I took 50 pictures without any problems - no hanging.

I'll do some more testing.

I can't say this issue is definitely gone, but so far it's working better than before.

@AlexanderBykin
Copy link

it's still here, i get it time to time

@AlexanderBykin
Copy link

can you say how did you test it?
you just running several times from terminal?

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

Oh well, my testing was just ad-hoc casual testing using my own test application. Doing that same test previously it failed about 10% of the time. Today it hasn't failed yet.

I was going to write a proper stress-test later.

I guess there's still an issue.

@AlexanderBykin
Copy link

i have added extra logs

11:51:01.738 [Thread-35] DEBUG uk.co.caprica.picam.EncoderBufferapply - [apply] apply()
11:51:01.738 [Thread-35] DEBUG uk.co.caprica.picam.EncoderBufferapply - [apply] flags=0
11:51:01.739 [Thread-35] DEBUG uk.co.caprica.picam.EncoderBufferapply - [apply] bufferLength=0
11:51:01.740 [Thread-35] INFO uk.co.caprica.picam.EncoderBufferapply - >>> [apply] BufferedLength=0
11:51:01.741 [Thread-35] DEBUG uk.co.caprica.picam.EncoderBufferapply - [sendNextPictureBuffer] sendNextPictureBuffer()
11:51:01.746 [Thread-35] DEBUG uk.co.caprica.picam.EncoderBufferapply - [sendNextPictureBuffer] result=0
11:51:01.747 [Thread-35] DEBUG uk.co.caprica.picam.EncoderBufferapply - [apply] finished=false
11:51:01.747 [Thread-35] DEBUG uk.co.caprica.picam.EncoderBufferapply - [apply] enabled=true

looks like buffer lost his data somewhere

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

That is indeed the same behaviour I was seeing previously.

And yeah, that's a native buffer, not a Java buffer.

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

Sadly my stress test did not show good results. Strange messages from MMAL dumped to the error console, looks like a native race to me. Still hangs intermittently.

So updating everything to latest does not resolve the issue, unfortunately. I still don't know what's actually wrong.

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

This old abandoned issue looks somewhat similar:

https://www.raspberrypi.org/forums/viewtopic.php?t=108707&p=749409

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

With debug logging turned on on the Pi, I occasionally see these warnings in the log (they do not correspond to a hang):

599057.418: gpioman: gpioman_get_pin_num: pin CAMERA_LED not defined
600910.823: mmal: mmal_port_event_send: event lost on port 1,0 (buffer header callback not defined)
600937.237: gpioman: gpioman_get_pin_num: pin CAMERA_LED not defined
604524.678: mmal: mmal_port_disable: port ril.image_encode:out:0(PNG )(3e6df080) is not enabled
604524.702: mmalsrv: mmal_do_port_disable: could not disable port (3:EINVAL)
604535.891: mmal: mmal_port_disable: port ril.image_encode:out:0(PNG )(3e6df080) is not enabled
604535.918: mmalsrv: mmal_do_port_disable: could not disable port (3:EINVAL)
604600.604: mmal: mmal_port_disable: port ril.image_encode:out:0(PNG )(3e6df080) is not enabled
604600.630: mmalsrv: mmal_do_port_disable: could not disable port (3:EINVAL)

When I CTRL+C a hanged app, this line pops up in the log:

775057.940: mmalsrv: send_buffer_to_host: tx failed:size 292 st -1

My stress-test app ran to 90 out of 100 pictures taken.

What's interesting is it runs better with more logging (so it seems) and that to me implies a race somewhere.

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

This also looks somewhat related:

raspberrypi/firmware#417

@AlexanderBykin
Copy link

with this part of code it looks like nothing happens
my repo

public class MainApp {
    public static void main(String[] agrs) {
        bcm.bcm_host_init(); // FIXME seems not needed?

        CameraConfiguration config = CameraConfiguration.cameraConfiguration()
                .width(1920)
                .height(1080);

        try {
            Camera camera = new Camera(config);
            camera.takePicture(new FilePictureCaptureHandler(new File("/dev/shm/picam.png")));
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        bcm.bcm_host_deinit(); // FIXME seems not needed?
    }
}

@AlexanderBykin
Copy link

ah just got it again, never mind :)
i just thought that is related to bcm_host initialization

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

Yeah, I've been down that path myself before.

The reason it's there is because it was in the original C code.

@caprica
Copy link
Owner Author

caprica commented Jun 12, 2017

I'm afraid I don't really know the MMAL API, nor the Pi platform. I just attempted to port the code from C to Java/JNA.

@AlexanderBykin
Copy link

i am in the same way :)

@AlexanderBykin
Copy link

snapshots also possible by DispmanX it is part of Raspberry VideoCore

@caprica
Copy link
Owner Author

caprica commented Jun 13, 2017

Ideally I'd like to find the minimal possible working native code to get a snapshot/still-picture. I don't care about video. I don't care what native API I use to get it.

@AlexanderBykin
Copy link

AlexanderBykin commented Jun 13, 2017

ok, then this is what you looking for DispmanX
all of this methods is places at libbcm_host.so
before i found your project i try to use it but i have got a problem, how to decode RAW data into PNG or Jpeg
my code is ugly, i dont want to publish it

@mikelduke
Copy link

mikelduke commented Oct 17, 2017 via email

@caprica
Copy link
Owner Author

caprica commented Oct 17, 2017

Yeah, to be honest I don't really care if we never solve your scenario #2, that approach was never intended to be the end-game.

@mikelduke
Copy link

mikelduke commented Oct 17, 2017 via email

@techyian
Copy link

Trying to digest everything that's been mentioned in this thread already, but during the initial development of MMALSharp I used to see MMAL hangs due to not clearing port pools correctly. The hang was because there was still a reference on the buffer and wasn't released by MMAL.

So in MMALSharp specifically, I've managed to get a clean shutdown of resources by doing the following:

  1. Once you've finished processing image data. Stop the output port(s) of your last connected component
  2. Close connections of your connected components
  3. Clean the port pools by calling mmal_port_pool_destroy - I can see this call in your code but it may be the timing that's slightly out. Also note, I've done this on input & output ports on all components, probably overkill but done just in case.

Happy to help if you need it. This problem was bugging me for months!

@caprica
Copy link
Owner Author

caprica commented Jan 22, 2018

Thanks for such a helpful comment.

I may revisit this. I did try to be very careful about allocating and freeing resources, but I just couldn't see where I may have went wrong... on three separate, from scratch, implementations.

@caprica
Copy link
Owner Author

caprica commented Jan 22, 2018

@techyian one follow-up question if you don't mind, in your clean-up suggestions, are you doing the create/process-image-data/cleanup for every image capture individually, or do you keep the camera "open" and do multiple captures before cleaning up?

@techyian
Copy link

Sure - if you take a look at the BeginProcessing method, I keep the camera component resources active, but cleanup ports and stop capture after each run. That way, people using the library can then add/remove components and create a different pipeline after each run if they wish to.

@vrosca
Copy link

vrosca commented Mar 26, 2018

Just wanted to mention that after taking the approach explained in my first post (23 Sep 2017) my surveillance camera now has 6 months of uptime without any freezes

@caprica
Copy link
Owner Author

caprica commented Mar 26, 2018

That is impressive, thanks for the update.

Would be so awesome if you could isolate the essential code changes, last time you said the code was "butchered" :-D

@vrosca
Copy link

vrosca commented Mar 26, 2018

I'll try - at some point in Q2 - to get over my anti-git bias, rebase my changes on your codebase and send a pull request

@AlexanderBykin
Copy link

@vrosca just share your code as a zip file if you not good with Git

@caprica
Copy link
Owner Author

caprica commented Nov 21, 2018

I finally found some time to revisit this.

I applied the changes suggested by @vrosca on 23rd September and so far I've been able to capture hundreds upon hundreds of pictures without any problems (my code uses the PNG encoder rather than JPG for what it's worth). My last test ran 1,000 straight captures perfectly.

I will work on cleaning up my code and committing it soon and maybe even pushing a release to Maven Central.

I still don't see anything "wrong" with the original code and I still worry about races, but so far these changes look good.

Thank you all in this thread who helped, especially @vrosca for the suggested code changes.

@AlexanderBykin
Copy link

Cool, do you have statistics on how many pictures it can take in one second?

@AlexanderBykin
Copy link

with C++ implementation it can take about 30 pictures per second and save them into Ramdisk

@caprica
Copy link
Owner Author

caprica commented Nov 21, 2018

Honestly I have no idea since I have not measured it, but it seems slow-ish to me - then again, my tests so far save all images to disk rather than ramdisk so that's going to slow it down, and using the JPG encoder would likely improve things.

On the other hand, raw performance is not the primary goal of this project, rather the goal is ease of use and Java is a lot easier to deal with than C++, especially for less experienced developers.

@AlexanderBykin
Copy link

@caprica try to use /dev/shm/ this is ramdisk at Raspberry

@caprica
Copy link
Owner Author

caprica commented Nov 21, 2018

Sure, but like I mentioned raw performance is not the primary goal of this project, although I would be interested in the numbers :)

To make any significant performance improvements likely would mean removing JNA and using JNI directly - and again, that will make things more difficult for people, not easier.

@AlexanderBykin
Copy link

I understand your position. when I found your project I thought it was a godsend for my project (controlled tank from a mobile phone) but when using it I was disappointed with extremely slow work
tank picture

@caprica
Copy link
Owner Author

caprica commented Nov 21, 2018

Well, now that it is hopefully working reliably, performance concerns could be looked at. But if you want best performance, use C/C++ rather than Java I suppose. I mean, the JVM is fundamentally slow on these low-power devices isn't it.

@AlexanderBykin
Copy link

I agree, native applications will always be faster, but in our hands to bring the speed close to the native, we (community) just need to try

@vrosca
Copy link

vrosca commented Nov 21, 2018 via email

@vrosca
Copy link

vrosca commented Nov 21, 2018 via email

@caprica
Copy link
Owner Author

caprica commented Nov 21, 2018

My understanding is that JNA does indeed allocate a new Java Thread object for each callback, but no actual thread gets allocated, so that's not as bad as it might initially seem.

And yes, your comment about getting the structures filled before them being locked now makes sense. Thanks a lot.

I do want to reiterate that initially the main goal of this project was to make it very easy for inexperienced developers to use the Pi camera (Pi is foremost a learning platform after all), and for me that meant just using Java without asking people to compile JNI wrappers and such.

If this issue is truly vanquished, I'm more motivated to advance this project now so if there are things that we can do to improve performance (or other things) I would like to hear about it.

@caprica
Copy link
Owner Author

caprica commented Nov 21, 2018

In hindsight, as is often the case, this bug is obvious and stupid.

In short, by using the JNA auto-mapped object in the callback method signature, the native memory was being read by the Java code before that memory buffer was explicitly locked. Clearly this is bad, m'kay.

Anyway, I have now captured literally thousands of pictures of varying sizes from small up to 10mb without a single problem during my most recent testing and I will commit the fix today.

@vrosca
Copy link

vrosca commented Nov 21, 2018 via email

caprica added a commit that referenced this issue Nov 21, 2018
Fix for #2.

Before the fix the JNA structures were present in the method signature. This
meant that those structures would be automatically read (which is usually
nice and convenient) but in this case the native memory was being accessed
before it was locked (and therefore there was nothing preventing the contents
of that memory from being changed while it was being accessed).

This caused a race and would lead to an intermittent hang, usually observed
when the native library seemingly stopped sending frames of picture data
(always seemed to be the last frame).

The fix therefore is to replace the JNA structure object in the method
signature with a simple Pointer, then to use the native API to lock that
pointer, and only then access the data in the object.
@caprica
Copy link
Owner Author

caprica commented Nov 21, 2018

Fixed at 32deac5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants