Simplify and optimize worker task scheduling #10417

mourner · 2021-02-26T14:52:02Z

Closes #10187. There are two independent commits here.

The first one simplifies the setData coalescing logic in the GeoJSON source, introduced in #5902. I did this first to make the worker logic easier to reason about, but in theory it should also improve performance: by moving the coalescing logic to the main thread, we avoid flooding the worker message queue with tasks that will get discarded. Technically, the flow and consequently the overall performance characteristics shouldn't change, as I tried to demonstrate on this very rough and clunky chart:

Previously, any setData call made while another one was still in progress would get sent to the worker, which remembered the latest call and returned the previous one as "abandoned". Once that first setData call finished processing successfully, a coalesce message was sent to tell the worker to do one more update for the last setData call it had caught.

Now, we simply don't send any worker messages on additional setData calls while a setData is already in progress, but we remember to issue one more setData with the last updated data if there are any updates which we previously refused.
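A minimal sketch of the main-thread coalescing described above, not the actual geojson_source.js code. The class name, the `_inProgress` and `_pending` fields, and the `sendToWorker(data, done)` helper are all hypothetical stand-ins for illustration:

```javascript
// Hypothetical sketch: `sendToWorker(data, done)` stands in for the real worker call.
class CoalescingSource {
    constructor(sendToWorker) {
        this._sendToWorker = sendToWorker;
        this._inProgress = false; // a setData round trip is currently running
        this._pending = false;    // newer data arrived while it was running
        this._data = undefined;
    }

    setData(data) {
        this._data = data;
        if (this._inProgress) {
            // Don't message the worker; just remember that an update was skipped.
            this._pending = true;
            return;
        }
        this._update();
    }

    _update() {
        this._inProgress = true;
        this._pending = false;
        this._sendToWorker(this._data, () => {
            this._inProgress = false;
            // Issue one more update with the latest data if any calls were skipped.
            if (this._pending) this._update();
        });
    }
}

// Usage with a fake worker that records what it was sent.
const sent = [];
const callbacks = [];
const source = new CoalescingSource((data, done) => {
    sent.push(data);
    callbacks.push(done);
});
source.setData('a');
source.setData('b'); // skipped: 'a' is still in progress
source.setData('c'); // skipped, but remembered as the latest data
callbacks.shift()(); // 'a' finishes, which triggers one update with 'c'
```

In this sketch the worker only ever sees the first call and the latest data, so the intermediate 'b' never reaches the message queue.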

The second commit is the one that fixes the Safari performance issue — it was caused by Safari being too slow to process worker tasks that are delayed to run after the current event loop (introduced in #8633), which we do to make sure <cancel> messages are processed before the tasks that were cancelled if both come together to the worker in a batch of postMessage calls. Previously, all messages were delayed, then #8913 made it delay only on the worker side, and finally #9031 introduced an explicit actor.send parameter to additionally delay getResource on the main thread to fix a perf regression. This PR changes the logic so that messages are handled immediately by default, and only delayed for <cancel> processing explicitly for calls where we commonly expect cancellations (mostly network-related) — in this case loadTile, loadDEMTile and getResource. I confirmed that this fixes the Safari setData performance issue in the ticket, while not increasing the percentage of unfulfilled cancellations when browsing the map quickly (it's about 40% before and after the PR).
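As a rough illustration of "immediate by default, delayed only when asked", here is a simplified sketch. It is not the real actor.js implementation: the `Receiver` class, the injected `defer` function and the message shapes are assumptions made for the example, and only the `mustQueue` flag and the `<cancel>` type come from the discussion.

```javascript
// Hypothetical sketch, not the actual actor.js code.
class Receiver {
    constructor(handle, defer = (fn) => setTimeout(fn, 0)) {
        this._handle = handle;
        this._defer = defer;
        this._queued = new Map(); // id -> message waiting for deferred processing
    }

    receive(message) {
        if (message.type === '<cancel>') {
            // If the task is still queued, drop it before any work is done.
            this._queued.delete(message.id);
        } else if (message.mustQueue) {
            // Delay so that a <cancel> arriving in the same batch is seen first.
            this._queued.set(message.id, message);
            this._defer(() => this._process(message.id));
        } else {
            // Default path: handle right away, with no extra event loop hop.
            this._handle(message);
        }
    }

    _process(id) {
        const message = this._queued.get(id);
        if (!message) return; // cancelled in the meantime
        this._queued.delete(id);
        this._handle(message);
    }
}

// Usage with a manual `defer`, so the ordering is easy to follow.
const handled = [];
const deferred = [];
const receiver = new Receiver((m) => handled.push(m.type), (fn) => deferred.push(fn));
receiver.receive({id: 1, type: 'loadTile', mustQueue: true});
receiver.receive({id: 2, type: 'loadData'}); // handled immediately
receiver.receive({id: 1, type: '<cancel>'}); // removes the queued loadTile
deferred.forEach((fn) => fn());              // nothing left to process
```

The design trade-off is that only the calls flagged with `mustQueue` pay the cost of the delay, which is what matters for browsers that are slow to run deferred tasks.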

Tagging @kkaefer @ChrisLoer just in case because it affects the code you added significantly — take a look if you have time but no worries if not.

Launch Checklist

briefly describe the changes in this PR
write tests for all new functionality (seems sufficiently covered by existing tests)
post benchmark scores
manually test the debug page
apply changelog label ('bug', 'feature', 'docs', etc) or use the label 'skip changelog'
add an entry inside this element for inclusion in the mapbox-gl-js changelog: <changelog>Fixes a performance regression in Safari on frequent GeoJSON setData calls</changelog>

mourner · 2021-02-26T16:37:17Z

Internal benchmarks show no difference in performance, which is expected:

ansis

The geojson simplification is great.

It looks like the callbacks from getGlyphs and getImages aren't getting added to the scheduler when I think they should be. This prioritization would matter if the glyphs for one tile loaded before the vector tile of another tile. Our benchmarks don't appear to be covering this case. But maybe that means it's fine as is.

By bypassing the scheduler a whole bunch of work is now not measured as part of the workerTask diagnostic metric. This could be fixed by adding something like this to the other path or by putting all work through the scheduler (and making some of it immediate there).

The scheduler already gave messages the highest priority by default. Making all those immediate would probably work well.

src/source/geojson_source.js

pepe-invest-git


mourner · 2021-03-08T15:43:21Z

It looks like the callbacks from getGlyphs and getImages aren't getting added to the scheduler when I think they should be. This prioritization would matter if the glyphs for one tile loaded before the vector tile of another tile. Our benchmarks don't appear to be covering this case. But maybe that means it's fine as is.

Good point! I also didn't realize responses after returning the result from the other thread also get scheduled, so I restored the condition that this only happens on the worker side, and additionally added mustQueue flags for the getGlyphs and getImages calls to make sure the behavior doesn't change.

By bypassing the scheduler a whole bunch of work is now not measured as part of the workerTask diagnostic metric. This could be fixed by adding something like this to the other path or by putting all work through the scheduler (and making some of it immediate there).

👍 went with the first option.

mourner · 2021-03-08T15:45:16Z

For the record, benchmark results after addressing feedback:

ansis · 2021-03-09T01:29:48Z

I also didn't realize responses after returning the result from the other thread also get scheduled, so I restored the condition that this only happens on the worker side, and additionally added mustQueue flags for the getGlyphs and getImages calls to make sure the behavior doesn't change.

From what I can tell it doesn't actually apply queuing to the responses to these calls. The messages have type === "<response>". I left notes in the two places I think we need changes

ansis · 2021-03-09T01:11:12Z

src/source/worker_tile.js

@@ -169,7 +169,7 @@ class WorkerTile {
                    glyphMap = result;
                    maybePrepare.call(this);
                }
-            }, undefined, undefined, taskMetadata);
+            }, undefined, true, taskMetadata);


I don't think we need to queue this on the main thread. We need to queue the response to this

So, the intent was to make mustQueue force queueing only on the worker thread, whether it's the task (if you're sending from the main thread) or the response (if you're sending from the worker).

ansis · 2021-03-09T01:27:09Z

src/util/actor.js

-                // executing the next task in our queue, postMessage preempts this and <cancel>
-                // messages can be processed. We're using a MessageChannel object to get throttle the
-                // process() flow to one at a time.
+            if (isWorker() && data.mustQueue) {


We need to pass the responses from getImages and getGlyphs to the scheduler. Checking for the presence of callback.metadata in actor.js might be enough to decide whether to do that but I think letting the scheduler decide that might be slightly cleaner

The queuing on the main thread was intentional but it looks like it might not be needed since we dropped IE. I think this is the only case where we did queuing on the main thread. It was also applied to iOS Safari < 12.1 but I don't think it was actually needed there... not sure though. @arindam1993 do you remember if the queuing was only needed for IE?

Also old Safari versions, where AbortController doesn't actually abort fetches.

mourner · 2021-03-09T08:10:18Z

@ansis OK, so the mustQueue (that I intended to only affect the response on the worker) wasn't passed in getGlyphs and getImages because of an existing blunder in the code (see commit earlier), but for now I just reverted the queueing of responses because the logic is getting too confusing. Maybe we should merge it without glyph/image queueing for now, but plan a more substantial refactor as a follow-up, e.g. use options instead of a long list of arguments to send, and maybe separately have mustQueue and mustQueueCallback options.
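One possible shape for the options-based follow-up, purely as a sketch. The `createMessage` helper, its parameters and the `hasCallback` field are hypothetical; only the `mustQueue` and `mustQueueCallback` option names are taken from the suggestion above.

```javascript
// Hypothetical helper: builds a message from an options object
// instead of a long list of positional arguments.
function createMessage(id, type, data, options = {}) {
    const {targetMapId = null, mustQueue = false, mustQueueCallback = false} = options;
    return {
        id,
        type,
        data,
        targetMapId,
        mustQueue,         // queue the task on the receiving side
        mustQueueCallback, // queue the response on the sending side
        hasCallback: typeof options.callback === 'function'
    };
}

// Usage: queue only the response, leaving every other option at its default.
const glyphMessage = createMessage('1', 'getGlyphs', {stacks: {}}, {
    mustQueueCallback: true,
    callback: () => {}
});
```

With named options, a call site states which side should queue without having to pass `undefined` placeholders for the arguments it doesn't use.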

arindam1993 · 2021-03-11T19:16:39Z

@ansis OK, so the mustQueue (that I intended to only affect the response on the worker) wasn't passed in getGlyphs and getImages because of an existing blunder in the code (see commit earlier), but for now I just reverted the queueing of responses because the logic is getting too confusing. Maybe we should merge it without glyph/image queueing for now, but plan a more substantial refactor as a follow-up, e.g. use options instead of a long list of arguments to send, and maybe separately have mustQueue and mustQueueCallback options.

This! Do you think it's worth moving all the queueing, throttling and cancellation logic into the ajax.js\RequestManager layer to simplify some of this?
Do we need that kind of functionality for anything other than network requests?

mourner added 2 commits February 25, 2021 23:52

simplify geojson source by coalescing on the main thread

6560f92

only queue worker tasks explicitly for loadTile/getResource

457f808

mourner added the bug 🐞 label Feb 26, 2021

mourner requested a review from ansis February 26, 2021 14:52

mourner self-assigned this Feb 26, 2021

mourner mentioned this pull request Feb 26, 2021

setData performance decreased in safari compared to v1.0.0 #10187

Closed

karimnaaji added this to the v2.2 milestone Mar 1, 2021

ansis reviewed Mar 4, 2021


pepe-invest-git approved these changes Mar 4, 2021


mourner added 3 commits March 8, 2021 17:17

limit scheduler to the worker, queue getGlyphs/Images callbacks

92c4ca9

make sure all worker tasks are measured in metrics

334c656

make sure we coalesce GeoJSON updates even after error

0651ec6

mourner requested a review from ansis March 8, 2021 15:45

don't measure non-scheduled tasks twice

83d06d8

ansis reviewed Mar 9, 2021


mourner added 2 commits March 9, 2021 08:58

one more queueing fix

0cce742

revert only queueing on the worker

e7aa5dc

let Scheduler decide which tasks should be immediate

32c29c5

ansis mentioned this pull request Mar 11, 2021

clean up work queuing and scheduling #10460

Open

ansis approved these changes Mar 11, 2021


ansis merged commit df9d24f into main Mar 11, 2021

ansis deleted the improve-task-scheduling branch March 11, 2021 22:02

arindam1993 mentioned this pull request Apr 12, 2021

Just start from v1.3.0, only the symbol layer will get stuck. Does anyone know what's going on? #10554

Closed
