Fix uv_async crash (aka Option B) #2400

bleege · 2015-09-23T15:02:26Z

While investigating #2295 @ljbade discovered that OkHttp and uv_async were not working properly together. The result was the decision to ship 0.1.0 using curl instead. This ticket will be about restoring OkHttp by working out the issues with uv_async as documented by @ljbade. Specifically:

on curl all completed HTTPRequest handling happens in the same thread that creates HTTPContext as curl uses a loop pump to process everything (this is enforced via MBGL_VERIFY_THREAD macro)

on NSUrl and OkHTTP we farm HTTPRequest out to the system so we get the result back async on a different thread from the one that created HTTPContext

that is why both NSUrl and OkHTTP use uv_async to hand the final request processing back to the HTTPContext thread

So to remove uv_async we need an alternative way to get the request back to the HTTPContext thread.
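
As a rough Java analogue of the hand-off that uv_async provides (this is not libuv, and none of these names come from the codebase), a single-threaded executor can stand in for the HTTPContext thread. The HTTP library's thread posts the completion to that executor instead of handling it itself. `OwnerThreadHandoff` and `handlerRunsOnOwnerThread` are hypothetical names used only for this sketch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a single-thread executor plays the role of the
// thread that created HTTPContext; a second executor plays the HTTP library.
class OwnerThreadHandoff {
    // Returns true if the completion handler ran on the owner thread.
    static boolean handlerRunsOnOwnerThread() {
        ExecutorService owner = Executors.newSingleThreadExecutor();
        ExecutorService httpPool = Executors.newSingleThreadExecutor();
        try {
            // Ask the owner executor which thread it runs tasks on.
            Callable<Thread> whoAmI = Thread::currentThread;
            Thread ownerThread = owner.submit(whoAmI).get();

            CompletableFuture<Thread> handledOn = new CompletableFuture<>();
            httpPool.execute(() -> {
                // The response "arrives" here, on the library's thread.
                // Instead of processing it in place, post it to the owner.
                owner.execute(() -> handledOn.complete(Thread.currentThread()));
            });
            return handledOn.get(5, TimeUnit.SECONDS) == ownerThread;
        } catch (Exception e) {
            return false;
        } finally {
            httpPool.shutdown();
            owner.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("handled on owner thread: " + handlerRunsOnOwnerThread());
    }
}
```

The sketch only shows the threading shape. It says nothing about the lifetime problem, which is where the crash in this ticket turned out to be.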

I think @kkaefer will need to help us here with Option C, or he might be able to pursue Option B.

@incanus @ljbade @kkaefer

jfirebaugh · 2015-09-23T18:24:32Z

uv_async is a fine way to communicate results back to the originating thread, and "remove use of uv_async" is not the correct long-term fix for "undiagnosed crash connected to uv_async use". Instead, we should diagnose the root cause of the crash.

ljbade · 2015-09-23T22:34:16Z

@jfirebaugh Yes let's fix the uv_async bug.

ljbade · 2015-09-23T22:35:48Z

@kkaefer Where else have you noticed uv_async crashing before? iOS/OSX/Linux?

ljbade · 2015-10-24T12:48:04Z

I should revisit this since @mikemorris updated the libuv version.

ljbade · 2015-11-03T01:33:51Z

@zugaldia I put some logging in the async class. I can confirm that when the app crashes, an async.send() happens after the async destructor is called.

The thread that does the send is the OkHTTP internal Java thread (from the onResponse callback). The FileSource thread does the destruction of the async.

@kkaefer It seems the problem is that even cancelled OkHTTP requests on the Java side can outlive the C++ HTTPAndroidRequest object. Is there a way we can block destruction of HTTPAndroidRequest until we complete the outstanding async Java request?

ljbade · 2015-11-03T02:09:17Z

I added a lock in HTTPAndroidRequest that blocks until it is cleared by either onResponse or onFailure, and it seems to prevent crashing. Now I need to find out if FileSource gets deleted correctly.
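
The blocking idea can be sketched in Java with a latch. This is only an illustration under assumptions: the actual lock is in the C++ HTTPAndroidRequest, and `PendingRequest`, `awaitCallback` and the timeout are hypothetical additions for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; the real lock lives in the C++ HTTPAndroidRequest.
class PendingRequest {
    // Opens once the outstanding callback has run.
    private final CountDownLatch callbackFinished = new CountDownLatch(1);

    void onResponse() { callbackFinished.countDown(); }
    void onFailure()  { callbackFinished.countDown(); }

    // Called by the owning thread before it releases the request.
    // Returns false if no callback arrived in time or the wait was interrupted.
    boolean awaitCallback(long timeoutMillis) {
        try {
            return callbackFinished.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PendingRequest request = new PendingRequest();
        // Simulated HTTP thread that delivers the callback a little later.
        Thread httpThread = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            request.onResponse();
        });
        httpThread.start();
        System.out.println("safe to destroy: " + request.awaitCallback(5000));
        httpThread.join();
    }
}
```

The timeout is an assumption of the sketch, added so the owning thread cannot wait forever if a callback never arrives.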

ljbade · 2015-11-03T02:42:16Z

It is still crashing. For some reason both onFailure and onResponse are called.

ljbade · 2015-11-03T03:28:53Z

@zugaldia Do you know why OkHTTP will call both onFailure and onResponse?

I added a check to ignore the second callback and it seems to prevent crashing.
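
A guard of this kind might look like the following minimal sketch. It assumes an atomic flag is acceptable for the check; `OnceOnlyRequest` and `nativeCalls` are hypothetical names, and the counter merely stands in for the call into native code.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the request object; not the SDK's actual class.
class OnceOnlyRequest {
    private final AtomicBoolean completed = new AtomicBoolean(false);
    // Counts how many callbacks were forwarded (placeholder for the JNI call).
    final AtomicInteger nativeCalls = new AtomicInteger(0);

    void onResponse(byte[] body) {
        // compareAndSet succeeds for exactly one caller, even across threads.
        if (!completed.compareAndSet(false, true)) {
            return; // a callback was already delivered: ignore this one
        }
        nativeCalls.incrementAndGet();
    }

    void onFailure(Exception e) {
        if (!completed.compareAndSet(false, true)) {
            return;
        }
        nativeCalls.incrementAndGet();
    }

    public static void main(String[] args) {
        OnceOnlyRequest request = new OnceOnlyRequest();
        request.onFailure(new IOException("simulated failure"));
        request.onResponse(new byte[0]); // ignored by the guard
        System.out.println(request.nativeCalls.get()); // prints 1
    }
}
```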

zugaldia · 2015-11-04T17:09:22Z

onFailure() doesn't like exceptions other than IOException, so I'm catching the ProtocolException we saw above and rethrowing it as an IOException (see commit for details: fb502ca). That seems to avoid the previous crash, and I now can't reproduce the original crash either. Can others confirm?

As a side note, see that I rethrow the ProtocolException to interrupt the execution, as otherwise we'd have a call to nativeOnResponse() with a null body that would cause the JNI to complain with a fatal error (there is commented-out code in the commit that we need to remove before merging with master).

/cc @bleege @tobrun

ljbade · 2015-11-04T20:49:54Z

@zugaldia I think throwing an exception in onResponse will leave the JNI side waiting and cause a memory leak.

I will add a null check to JNI which will switch to the failure code if it happens.

ljbade · 2015-11-05T13:47:11Z

@zugaldia Can you check my latest commit? I added the null check for body. I disabled the double response workaround to see if it is still needed. I also handle exceptions in onResponse as a failure to prevent outstanding exceptions.

If this all works fine, the next step is to check for any memory leaks.

zugaldia · 2015-11-06T16:38:39Z

@ljbade check the latest commit. Good news, I cannot reproduce the crash.

bleege · 2015-11-06T23:11:12Z

I produced a new 2.3.0-SNAPSHOT this afternoon based off 2400-lock and ran it internally on Sirius. While the app itself didn't crash, the following error was logged. Is this a known issue?

11-06 17:03:39.303 31174-31274/com.mapbox.sirius I/mbgl: {Worker}[Sprite]: Can't find sprite named '-11'
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient: Callback failure for canceled call to https://a.tiles.mapbox.com/...
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient: java.net.ProtocolException: unexpected end of stream
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.squareup.okhttp.internal.http.HttpConnection$FixedLengthSource.read(HttpConnection.java:421)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.RealBufferedSource.read(RealBufferedSource.java:50)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.RealBufferedSource.exhausted(RealBufferedSource.java:60)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.InflaterSource.refill(InflaterSource.java:101)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.InflaterSource.read(InflaterSource.java:62)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.GzipSource.read(GzipSource.java:80)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.Buffer.writeAll(Buffer.java:956)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at okio.RealBufferedSource.readByteArray(RealBufferedSource.java:92)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.squareup.okhttp.ResponseBody.bytes(ResponseBody.java:57)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.mapbox.mapboxsdk.http.HTTPContext$HTTPRequest.onResponse(HTTPContext.java:99)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.squareup.okhttp.Call$AsyncCall.execute(Call.java:168)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at com.squareup.okhttp.internal.NamedRunnable.run(NamedRunnable.java:33)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1112)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
11-06 17:03:51.573 31174-31310/com.mapbox.sirius I/OkHttpClient:     at java.lang.Thread.run(Thread.java:818)

ljbade · 2015-11-09T22:05:15Z

That exception looks like something @zugaldia had before. It happens if the Internet drops out during onResponse

bleege · 2015-11-09T22:16:28Z

@ljbade So this is a known and acceptable issue then?

ljbade · 2015-11-09T22:18:30Z

@bleege I'm going to double check that first.

ljbade · 2015-11-09T22:22:22Z

The relevant code:

try {
    body = response.body().bytes();
} catch (IOException e) {
    onFailure(null, e);
    throw e;
} finally {
    response.body().close();
}

@bleege What is happening is some sort of IO failure in read(). We catch this and then fail our pending request in C++ via onFailure. We then rethrow the exception so that OkHTTP catches it and cancels its pending async side of the request. The stack trace is printed somewhere in OkHTTP when they handle it.

ljbade · 2015-11-09T22:23:31Z

Next steps - rebase and condense PR into one tidy commit. Then wait for CI and merge.

bleege · 2015-11-09T23:44:21Z

@ljbade Before you merge could you tell us more about the "some sort of IO failure in read()"? This strikes me as a bit odd and something that we should know more about. Otherwise it just seems like we're papering over the true issue. What do you think?

ljbade · 2015-11-10T07:07:04Z

@bleege I realise I should not have referred to read().

The line in question is

mapbox-gl-native/android/MapboxGLAndroidSDK/src/main/java/com/mapbox/mapboxsdk/http/HTTPContext.java

Line 99 in c66aec9

body = response.body().bytes();

That line does the actual IO work of downloading the HTTP response body. If something goes wrong during that call, either because of a network failure or a cancellation request, it will throw an IOException.

You can see that the onResponse callback is declared with throws IOException, so we are expected to forward the exception after we handle it. If we don't rethrow the IOException, OkHTTP might not correctly clean up when a request fails.

@zugaldia Do you agree with my assessment?

zugaldia · 2015-11-11T22:24:44Z

I've been running some scenarios and I've added a new application interceptor to OkHttp to help us with debugging: #2905 (comment)

To clarify the situation: we are not seeing the "callback gets called twice" situation anymore. Instead, when we find an exception in onResponse we catch it and call onFailure manually to handle it. This is because OkHttp wouldn't do it otherwise -- it'd only call one ("signal the callback" in their terminology).

What I'm not sure about at this point is whether we should rethrow the IOException within onResponse. If I understand OkHttp's behavior correctly, that exception will simply get "swallowed" (see square/okhttp#1335), meaning that it just gets logged without further cleanup (the relevant method is here: https://github.com/square/okhttp/blob/master/okhttp/src/main/java/com/squareup/okhttp/Call.java#L159). That's why @bleege saw #2400 (comment), which is in turn harmless.

Another approach could be not to rethrow it, and to do the response.body().close() and nativeOnResponse() calls only if no exception was found in onResponse. Otherwise, the SDK user will see exception logs and may believe an exception was unhandled when it was in fact handled by our own onFailure.
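
The "handle it and return" variant can be sketched without OkHttp as follows. Everything here is an assumption for illustration: `Body` is a hypothetical stand-in for the response body, the `events` list replaces the native calls, and closing the body in a `finally` block is a choice of this sketch rather than the committed code.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ResponseHandlingSketch {
    // Hypothetical stand-in for an HTTP response body.
    interface Body extends Closeable {
        byte[] bytes() throws IOException;
    }

    // Records what would have been forwarded to native code.
    final List<String> events = new ArrayList<>();

    void onResponse(Body responseBody) {
        byte[] body;
        try {
            body = responseBody.bytes(); // the actual download happens here
        } catch (IOException e) {
            onFailure(e); // route the error through the failure path
            return;       // and swallow it instead of rethrowing
        } finally {
            try {
                responseBody.close();
            } catch (IOException ignored) {
                // nothing useful to do if closing fails
            }
        }
        // Reached only when the body was read successfully.
        events.add("response:" + body.length);
    }

    void onFailure(IOException e) {
        events.add("failure:" + e.getMessage());
    }

    public static void main(String[] args) {
        ResponseHandlingSketch sketch = new ResponseHandlingSketch();
        sketch.onResponse(new Body() {
            public byte[] bytes() throws IOException {
                throw new IOException("unexpected end of stream");
            }
            public void close() { }
        });
        System.out.println(sketch.events); // prints [failure:unexpected end of stream]
    }
}
```

With this shape, exactly one of the two paths runs per request, and nothing propagates back to the caller to be logged.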

bleege · 2015-11-12T00:04:31Z

@zugaldia @ljbade This is sounding like OkHttp is on pretty firm ground here. There have been no crashes reported on the internal testing app so far. If you two are both feeling good about this (I assume you both are?), let's polish the code up and get it into master.

Great job @zugaldia @ljbade for making this happen! 🚀

incanus · 2015-11-12T01:30:00Z

🎉

ljbade · 2015-11-12T05:45:55Z

@zugaldia Thanks for digging into this. Since it seems like everyone gets confused by OkHTTP logging the exception despite it being harmless, perhaps we should just swallow the exception? If you compare master to 2400-lock you will see I used to just return. However, I suspect that might have been what caused the double callback somehow.

tobrun · 2015-11-12T05:50:52Z

> Great job @zugaldia @ljbade for making this happen! 🚀

👍

> perhaps we should just swallow the exception?

Seems logical

zugaldia · 2015-11-12T16:08:09Z

Agreed. As the exception is directed to onFailure I don't think there's any need for the extra logging. Also, it's consistent with the way OkHttp works and therefore it'd match the dev expectations.

mb12 · 2015-11-12T19:47:45Z

@ljbade, @tobrun, @bleege Is it possible to measure the performance overhead of OkHttp as well? In this case we will be copying 25K to 100K+ bytes for each tile from Java to native using JNI, versus the older approach where this copy and the additional overhead of managed code (GC, etc.) are absent.

ljbade · 2015-11-12T22:24:10Z

@zugaldia I just pushed a commit with return instead of rethrow... can you check it doesn't bring back the double onResponse/onFailure?

zugaldia · 2015-11-16T20:09:38Z

@ljbade FWIW I just did a fresh build with the latest code and I couldn't find the double callback call.

ljbade · 2015-11-16T22:59:53Z

@zugaldia Excellent, I think we are ready to merge, except for figuring out how to benchmark the perf improvements. However, we could do this after merge.

bleege added the Android Mapbox Maps SDK for Android label Sep 23, 2015

bleege added this to the android-v0.2.0 milestone Sep 23, 2015

bleege mentioned this issue Sep 23, 2015

Crash in HTTP request when activity destroyed #2295

Closed

ljbade added the crash label Sep 23, 2015

ljbade changed the title ~~Refactor OkHttp To Not Use uv_async (aka Option C)~~ Fix uv_async crash (aka Option B) Sep 23, 2015

This was referenced Sep 24, 2015

Implement FileSource in Java #823

Closed

Crashes in http_request_nsurl.mm #2417

Closed

bleege modified the milestones: android-v2.1.0, android-v2.2.0 Oct 2, 2015

bleege modified the milestones: android-v2.2.0, android-v2.3.0 Oct 28, 2015

This was referenced Oct 29, 2015

Try OkHTTP on Android again #2856

Closed

[android] Bring back OkHTTP #2857

Closed

ljbade self-assigned this Nov 3, 2015

ljbade added the in progress label Nov 3, 2015

ljbade mentioned this issue Nov 3, 2015

[android] Check for second callback in HttpRequestAndroid #2905

Closed

ljbade pushed a commit that referenced this issue Nov 9, 2015

[android] Bring back OkHTTP

c66aec9

Update http_request_android.cpp for changes in #2727 Fix crash caused by calling both onFailure and onReponse in the same request Fixes #2856 Fixes #2400

ljbade pushed a commit that referenced this issue Nov 12, 2015

[android] Bring back OkHTTP

a950128

Update http_request_android.cpp for changes in #2727 Fix crash caused by calling both onFailure and onReponse in the same request Fixes #2856 Fixes #2400

zugaldia mentioned this issue Nov 16, 2015

HTTP exception #2979

Closed

ljbade closed this as completed in af2d034 Nov 16, 2015

ljbade removed the in progress label Nov 16, 2015

ljbade mentioned this issue Nov 16, 2015

Benchmark OkHTTP vs curl #3048

Closed
