Skip to content
This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Fix uv_async crash (aka Option B) #2400

Closed
bleege opened this issue Sep 23, 2015 · 41 comments
Closed

Fix uv_async crash (aka Option B) #2400

bleege opened this issue Sep 23, 2015 · 41 comments
Assignees
Labels
Android Mapbox Maps SDK for Android crash

Comments

@bleege
Copy link
Contributor

bleege commented Sep 23, 2015

While investigating #2295 @ljbade discovered that OkHttp and uv_async were not working properly together. The result was the decision to ship 0.1.0 using curl instead. This ticket will be about restoring OkHttp by working out the issues with uv_async as documented by @ljbade. Specifically:

  • on curl all completed HTTPRequest handling happens in the same thread that creates HTTPContext as curl uses a loop pump to process everything (this is enforced via MBGL_VERIFY_THREAD macro)
  • on NSUrl and OkHTTP we farm HTTPRequest out to the system so we get the result back async on a different thread from the one that created HTTPContext
  • that is why both NSUrl and OkHTTP use uv_async to hand the final request processing back to the HTTPContext thread

So to remove uv_async we need an alternative way to get the request back to the HTTPContext thread.

I think @kkaefer will need to help us here with Option C, or he might be able to persue Option B.

@incanus @ljbade @kkaefer

@bleege bleege added the Android Mapbox Maps SDK for Android label Sep 23, 2015
@bleege bleege added this to the android-v0.2.0 milestone Sep 23, 2015
@jfirebaugh
Copy link
Contributor

uv_async is a fine way to communicate results back to the originating thread, and "remove use of uv_async" is not the correct long-term fix for "undiagnosed crash connected to uv_async use". Instead, we should diagnose the root cause of the crash.

@ljbade
Copy link
Contributor

ljbade commented Sep 23, 2015

@jfirebaugh Yes let's fix the uv_async bug.

@ljbade ljbade added the crash label Sep 23, 2015
@ljbade ljbade changed the title Refactor OkHttp To Not Use uv_async (aka Option C) Fix uv_async crash (aka Option B) Sep 23, 2015
@ljbade
Copy link
Contributor

ljbade commented Sep 23, 2015

@kkaefer Where else have you noticed uv_async crashing before? iOS/OSX/Linux?

@ljbade
Copy link
Contributor

ljbade commented Oct 24, 2015

I should revisit this since @mikemorris updated the libuv version.

@ljbade
Copy link
Contributor

ljbade commented Nov 3, 2015

@zugaldia I put some logging in the async class. I can confirm that when the app crashes, a async.send() happens after the async destructor is called.

The thread that does the send is the OkHTTP internal Java thread (from the onResponse callback). The FileSource thread does the destruction of the async.

@kkaefer It seems the problem that even cancelled OkHTTP requests from the Java side can outlive the C++ HTTPAndroidRequest object. Is there a way we can block destruction of HTTPAndroidRequest until we complete the outstanding async Java request?

@ljbade
Copy link
Contributor

ljbade commented Nov 3, 2015

I added a lock in the HTTPAndroidRequest blocks until it is cleared by either onResponse or onFailure and it seems to prevent crashing. Now I need to find out if FileSource gets deleted correctly.

@ljbade
Copy link
Contributor

ljbade commented Nov 3, 2015

It is still crashing. For some reason both onFailure and onResponse are called.

@ljbade
Copy link
Contributor

ljbade commented Nov 3, 2015

@zugaldia Do you know why OkHTTP will call both onFailure and onResponse?

I added a check to ignore the second callback and it seems to prevent crashing.

@zugaldia
Copy link
Member

zugaldia commented Nov 4, 2015

onFailure() doesn't like exceptions different from IOException so I'm catching the ProtocolException we saw above to rethrow it as IOException (see commit for details: fb502ca). That seems to avoid the previous crash and I now can't reproduce the original crash either. Can others confirm?

As a side note, see that I rethrow the ProtocolException to interrupt the execution as otherwise we'd have a call to nativeOnResponse() with a null body that would cause the JNI to complain with a fatal (code commented in the commit that we need to remove before merging with master).

/cc @bleege @tobrun

@ljbade
Copy link
Contributor

ljbade commented Nov 4, 2015

@zugaldia I think throwing an exception in onResponse will leave the JNI side waiting and cause a memory leak

I will add a null check to JNI which will switch to the failure code if it happens.

@ljbade
Copy link
Contributor

ljbade commented Nov 5, 2015

@zugaldia can you check my latest commit. I added the null check for body. I disabled the double response workaround to see if it is still needed. I also handle exceptions in onResponse as a failure to prevent outstanding exceptions.

It this all works fine, next step is to check for any memory leaks.

@zugaldia
Copy link
Member

zugaldia commented Nov 6, 2015

@ljbade check the latest commit. Good news, I cannot reproduce the crash.

@bleege
Copy link
Contributor Author

bleege commented Nov 6, 2015

I produced a new 2.3.0-SNAPSHOT this afternoon based off 2400-lock and ran it internally on Sirius. While the app itself didn't crash, the following error was logged. Is this a known issue?

11-06 17:03:39.303 31174-31274/com.mapbox.sirius I/mbgl: {Worker}[Sprite]: Can't find sprite named '-11'
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient: Callback failure for canceled call to https://a.tiles.mapbox.com/...
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient: java.net.ProtocolException: unexpected end of stream
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.squareup.okhttp.internal.http.HttpConnection$FixedLengthSource.read(HttpConnection.java:421)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.RealBufferedSource.read(RealBufferedSource.java:50)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.RealBufferedSource.exhausted(RealBufferedSource.java:60)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.InflaterSource.refill(InflaterSource.java:101)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.InflaterSource.read(InflaterSource.java:62)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.GzipSource.read(GzipSource.java:80)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.Buffer.writeAll(Buffer.java:956)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.RealBufferedSource.readByteArray(RealBufferedSource.java:92)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.squareup.okhttp.ResponseBody.bytes(ResponseBody.java:57)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.mapbox.mapboxsdk.http.HTTPContext$HTTPRequest.onResponse(HTTPContext.java:99)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.squareup.okhttp.Call$AsyncCall.execute(Call.java:168)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.squareup.okhttp.internal.NamedRunnable.run(NamedRunnable.java:33)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1112)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at java.lang.Thread.run(Thread.java:818)

@ljbade
Copy link
Contributor

ljbade commented Nov 9, 2015

That exception looks like something @zugaldia had before. It happens if the Internet drops out during onResponse

@bleege
Copy link
Contributor Author

bleege commented Nov 9, 2015

@ljbade So this is a known and acceptable issue then?

@ljbade
Copy link
Contributor

ljbade commented Nov 9, 2015

@bleege I'm going to double check that first.

@ljbade
Copy link
Contributor

ljbade commented Nov 9, 2015

The relevant code:

try {
                body = response.body().bytes();
            } catch (IOException e) {
                onFailure(null, e);
                throw e;
            } finally {
                response.body().close();
            }

@bleege What is happening is some sort of IO failure in read(). We catch this and then fail our pending request in C++ via onFailure. We then rethrow the exception so that OkHTTP catches it and cancels it's pending async side of the request. The stack trace is printed somewhere in OkHTTP when they handle it.

@ljbade
Copy link
Contributor

ljbade commented Nov 9, 2015

Next steps - rebase and condense PR into one tidy commit. Then wait for CI and merge.

ljbade pushed a commit that referenced this issue Nov 9, 2015
Update http_request_android.cpp for changes in #2727
Fix crash caused by calling both onFailure and onReponse in the same request
Fixes #2856
Fixes #2400
@bleege
Copy link
Contributor Author

bleege commented Nov 9, 2015

@ljbade Before you merge could you tell us more about the "some sort of IO failure in read()"? This strikes me as a bit odd and something that we should know more about. Otherwise it just seems like we're papering over the true issue. What do you think?

@ljbade
Copy link
Contributor

ljbade commented Nov 10, 2015

@bleege I realise I should have not used read()

The line in question is

That line does the actual IO work of downloading the HTTP request. If something happens during that function either because of a network failure, or a cancellation request that function will throw an IOException

You can see that the onReponse callback has a throws IOException thus we are expected to forward the exception after we handle it. It we don't rethrow the IOException, OkHTTP might not correctly clean up when a request fails.

@zugaldia Do you agree with my assesment?

@zugaldia
Copy link
Member

I've been running some scenarios and I've added a new application interceptor to OkHttp to help us with debugging: #2905 (comment)

To clarify the situation, we are not seeing the "callback gets called twice" situation anymore, instead, when we find an exception in onReponse we catch it and call onFailure manually to handle it. This is because OkHttp wouldn't do it otherwise -- it'd only call one ("signal the callback" in their terminology).

What I'm not sure at this point is if whether we should rethrow the IOException within onReponse. If I understand correctly OkHttp's behavior, that exception will simply get "swallowed" (see square/okhttp#1335) meaning that it just gets logged without further cleaning (the relevant method is here: https://github.com/square/okhttp/blob/master/okhttp/src/main/java/com/squareup/okhttp/Call.java#L159). That's why @bleege saw #2400 (comment) which is in turn harmless.

Another approach could be not to rethrow it and do a response.body().close() and nativeOnResponse() call only if no exception was found in onResponse. Otherwise, the SDK user will see exceptions logs potentially believing an exception was unhandled, when it was by our own onFailure.

@bleege
Copy link
Contributor Author

bleege commented Nov 12, 2015

@zugaldia @ljbade This is sounding like OkHttp is on pretty firm ground here. There's been no crashes reported on the internal testing app so far. If you two are both feeling good about this (I assume you both are?), let's polish the code up and get it into master.

Great job @zugaldia @ljbade for making this happen! 🚀

@incanus
Copy link
Contributor

incanus commented Nov 12, 2015 via email

ljbade pushed a commit that referenced this issue Nov 12, 2015
Update http_request_android.cpp for changes in #2727
Fix crash caused by calling both onFailure and onReponse in the same request
Fixes #2856
Fixes #2400
@ljbade
Copy link
Contributor

ljbade commented Nov 12, 2015

@zugaldia Thanks for digging into. Since it seems like everyone gets confused by OkHTTP logging the exception despite it being harmless, perhaps we should just swallow the exception? If you compare master to 2400-lock you will see I used to just return. However I suspect that might have been what caused the double callback somehow.

@tobrun
Copy link
Member

tobrun commented Nov 12, 2015

Great job @zugaldia @ljbade for making this happen! 🚀

👍

perhaps we should just swallow the exception?

Seems logical

@zugaldia
Copy link
Member

Agreed. As the exception is directed to onFailure I don't think there's any need for the extra logging. Also, it's consistent with the way OkHttp works and therefore it'd match the dev expectations.

@mb12
Copy link

mb12 commented Nov 12, 2015

@ljbade , @tobrun , @bleege Is it possible to measure the performance overhead of OkHttp as well? In this case we will be copying 25K to 100K+ bytes for each tile from java to native using JNI versus the older one where this copy and the additional overhead of managed code(GC, etc.) is missing.

@ljbade
Copy link
Contributor

ljbade commented Nov 12, 2015

@zugaldia I just pushed a commit with return instead of rethrow... can you check it doesnt bring back double onresponse/onfailure?

@zugaldia
Copy link
Member

@ljbade FWIW I just did a fresh build with the latest code and I couldn't find the double callback call.

@ljbade
Copy link
Contributor

ljbade commented Nov 16, 2015

@zugaldia Excellent I think we are ready to merge, except for figuring out how to benchmark the perf improvements. However we could do this after merge.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Android Mapbox Maps SDK for Android crash
Projects
None yet
Development

No branches or pull requests

8 participants