Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify and optimize worker task scheduling #10417

Merged
merged 9 commits into from
Mar 11, 2021
Merged

Conversation

mourner
Copy link
Member

@mourner mourner commented Feb 26, 2021

Closes #10187. There are two independent commits here.

The first one simplifies setData coalescing logic in GeoJSON source, previously introduced in #5902. I did this first to make worker logic easier to reason about, but theoretically it should improve performance — by moving the coalescing logic to the main thread, we avoid flooding the worker message queue with tasks that will get discarded. Technically the flow and consequently the overall performance characteristics shouldn't change, as I tried to demonstrate on this very rough and clunky chart:

Image from iOS (1)

Previously, any setData calls that follow one that's still in progress would get sent to the worker, which remembers the last call while returning the previous one as "abandoned", and waits until that first setData call successfully finishes processing and then calls coalesce message that tells the worker to additionally do an update for the last setData call it caught before.

Now, we simply don't send any worker messages on additional setData calls while a setData is already in progress, but we remember to issue one more setData with the last updated data if there are any updates which we previously refused.

The second commit is the one that fixes the Safari performance issue — it was caused by Safari being too slow to process worker tasks that are delayed to run after the current event loop (introduced in #8633), which we do to make sure <cancel> messages are processed before the tasks that were cancelled if both come together to the worker in a batch of postMessage calls. Previously, all messages were delayed, then #8913 made it delay only on the worker side, and finally #9031 introduced an explicit actor.send parameter to additionally delay getResource on the main thread to fix a perf regression. This PR changes the logic so that messages are handled immediately by default, and only delayed for <cancel> processing explicitly for calls where we commonly expect cancellations (mostly network-related) — in this case loadTile, loadDEMTile and getResource. I confirmed that this fixes the Safari setData performance issue in the ticket, while not increasing the percentage of unfulfilled cancellations when browsing the map quickly (it's about 40% before and after the PR).

Tagging @kkaefer @ChrisLoer just in case because it affects the code you added significantly — take a look if you have time but no worries if not.

Launch Checklist

  • briefly describe the changes in this PR
  • write tests for all new functionality (seems sufficiently covered by existing tests)
  • post benchmark scores
  • manually test the debug page
  • apply changelog label ('bug', 'feature', 'docs', etc) or use the label 'skip changelog'
  • add an entry inside this element for inclusion in the mapbox-gl-js changelog: <changelog>Fixes a performance regression in Safari on frequent GeoJSON setData calls</changelog>

@mourner
Copy link
Member Author

mourner commented Feb 26, 2021

Internal benchmarks show no difference in performance, which is expected:

image

@karimnaaji karimnaaji added this to the v2.2 milestone Mar 1, 2021
Copy link
Contributor

@ansis ansis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The geojson simplification is great.

It looks like the callbacks from getGlyphs and getImages aren't getting added to the scheduler when I think they should be. This prioritization would matter if the glyphs for one time loaded before the vectortile of another tile. Our benchmarks don't appear to be covering this case. But maybe that means it's fine as is.

By bypassing the scheduler a whole bunch of work is now not measured as part of the workerTask diagnostic metric. This could be fixed by adding something like this to the other path or by putting all work through the scheduler (and making some of it immediate there).

The scheduler already gave messages the highest priority by default. Making all those immediate would probably work well.

src/source/geojson_source.js Outdated Show resolved Hide resolved
Copy link

@pepe-invest-git pepe-invest-git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\

@mourner
Copy link
Member Author

mourner commented Mar 8, 2021

It looks like the callbacks from getGlyphs and getImages aren't getting added to the scheduler when I think they should be. This prioritization would matter if the glyphs for one time loaded before the vectortile of another tile. Our benchmarks don't appear to be covering this case. But maybe that means it's fine as is.

Good point! I also didn't realize responses after returning the result from the other thread also get scheduled, so I restored the condition that this only happens on the worker side, and additionally added mustQueue flags for the getGlyphs and getImages calls to make sure the behavior doesn't change.

By bypassing the scheduler a whole bunch of work is now not measured as part of the workerTask diagnostic metric. This could be fixed by adding something like this to the other path or by putting all work through the scheduler (and making some of it immediate there).

👍 went with the first option.

@mourner
Copy link
Member Author

mourner commented Mar 8, 2021

For the record, benchmark results after addressing feedback:

image

@mourner mourner requested a review from ansis March 8, 2021 15:45
@ansis
Copy link
Contributor

ansis commented Mar 9, 2021

I also didn't realize responses after returning the result from the other thread also get scheduled, so I restored the condition that this only happens on the worker side, and additionally added mustQueue flags for the getGlyphs and getImages calls to make sure the behavior doesn't change.

From what I can tell it doesn't actually apply queuing to the responses to these calls. The messages have type === "<response>". I left notes in the two places I think we need changes

@@ -169,7 +169,7 @@ class WorkerTile {
glyphMap = result;
maybePrepare.call(this);
}
}, undefined, undefined, taskMetadata);
}, undefined, true, taskMetadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to queue this on the main thread. We need to queue the response to this

Copy link
Member Author

@mourner mourner Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the intent was to make mustQueue force queing only on the worker thread, whether it's the task (if you're sending from the main) or the response (if you're sending from the worker).

// executing the next task in our queue, postMessage preempts this and <cancel>
// messages can be processed. We're using a MessageChannel object to get throttle the
// process() flow to one at a time.
if (isWorker() && data.mustQueue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to pass the responses from getImages and getGlyphs to the scheduler. Checking for the presence of callback.metadata in actor.js might be enough to decide whether to do that but I think letting the scheduler decide that might be slightly cleaner

The queuing on the main thread was intentional but it looks like it might not be needed since we dropped IE. I think this is the only case where we did queuing on the main thread. It was also applied to iOS Safari < 12.1 but I don't think it was actually needed there... not sure though. @arindam1993 do you remember if the queuing was only needed for IE?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also old Safari verions wherein AbortController doesn't actually abort fetches.

@mourner
Copy link
Member Author

mourner commented Mar 9, 2021

@ansis OK, so the mustQueue (that I intended to only affect the response on the wokrer) wasn't passed in getGlyphs and getImages because of an existing blunder in the code (see commit earlier), but for now I just reverted making the responses queueing because the logic is getting too confusing now. Maybe we should merge it without glyph/image queueing for now, but plan a more substantial refactor as a follow-up — e.g. use options instead of a long list of arguments to send, and maybe separately have mustQueue and mustQueueCallback options.

@arindam1993
Copy link
Contributor

@ansis OK, so the mustQueue (that I intended to only affect the response on the wokrer) wasn't passed in getGlyphs and getImages because of an existing blunder in the code (see commit earlier), but for now I just reverted making the responses queueing because the logic is getting too confusing now. Maybe we should merge it without glyph/image queueing for now, but plan a more substantial refactor as a follow-up — e.g. use options instead of a long list of arguments to send, and maybe separately have mustQueue and mustQueueCallback options.

This! Do you think its worth moving all the queueing, throttling and cancellation logic into ajax.js\RequestManager layer to simplify some of this down.
Do we need that kind of functionality for anything other than network requests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

setData performance decreased in safari compared to v1.0.0
5 participants