fix(perf): improve perf of mutation and snapshot #1271

JonasBa · 2023-07-30T17:24:45Z

This PR builds on what @Juice10 had found - I basically identified the same thing using different tools.

Motivation:
We started collecting browser profiles on some Sentry sites and noticed rrweb being the bottleneck when diffing large trees (common for loading screens which populate a large DOM tree).

Example profile collected by the SDK from unknown arch (platform is macos):

Profile for matching route collected locally (arm64 macos)

By looking at the code, I managed to identify a couple of bottlenecks:

isBlocking has a runtime of On^2 when checking for ancestors - I tried swapping that out for el.matches(.block *), I think it works faster, but I need to run benchmarks. The other approach could be to keep a cache of blocked elements and use compareDocumentPosition, but that should be benchmarked before.
There is some inefficiencies around Array.from(nodelist) when iterating over nodes and their children. This is just unnecessary as we are creating and discarding arrays where we could be using NodeList.forEach
genAdds is a recursive fn with possible deep recursion - I converted this to an iterative procedure. It causes the snapshot tests to fail as the procedure now runs in breadth first iteration (the data is there, just in different places and id's are different)
some functions are just inefficient, they can either return early or avoid extra calls/allocations with simple modifications (example of refactoring to for loop vs doing .map().filter()). There are small changes like hoisting static regular expressions, calling .test instead of .match when discarding match groups and avoid creating a lot of new strings via lowerIfExists). Similar case with calling while(queue.length) queue.shift(). Since shift is On, it means the runtime for that is actually O(n^2)

Benchmark results:

master:#c6600e7
'append 70 x 70 x 70 elements' │   3   │ 1691.3333333333333 │ '1701, 1683, 1690'
'append 1000 elements and reverse their order' │   3   │   350    │ '342, 358, 350'
'real events recorded on bugs.chromium.org' │   3   │   3116   │ '3090, 3140, 3118'
'create 1000x10 DOM nodes' │ 'benchmark-dom-mutation.html' │ 'window.workload()' │  10   │  116.5   │ '114, 144, 119, 120, 111, 110, 127, 102, 112, 106'
'create 1000x10x2 DOM nodes and remove a bunch of them' │ 'benchmark-dom-mutation-add-and-remove.html' │ 'window.workload()' │  10   │  470.9   │ '526, 658, 455, 462, 415, 461, 415, 465, 441, 411'
'create 1000 DOM nodes and append into its previous looped node' │ 'benchmark-dom-mutation-multiple-descendant-add.html' │ 'window.workload()' │   5   │   69.2   │ '67, 62, 64, 89, 64'
'create 10000 DOM nodes and move it to new container' │ 'benchmark-dom-mutation-add-and-move.html' │ 'window.workload()' │   5   │  205.4   │ '208, 196, 214, 216, 193'

#bad990b
'append 70 x 70 x 70 elements' │   3   │ 1722.3333333333333 │ '1753, 1716, 1698'
'append 1000 elements and reverse their order' │   3   │ 358.6666666666667 │ '357, 341, 378'
'real events recorded on bugs.chromium.org' │   3   │ 3128.3333333333335 │ '3219, 3061, 3105
'create 1000x10 DOM nodes' │ 'benchmark-dom-mutation.html' │ 'window.workload()' │  10   │  121.2   │ '124, 168, 110, 130, 112, 107, 131, 115, 106, 109'
'create 1000x10x2 DOM nodes and remove a bunch of them' │ 'benchmark-dom-mutation-add-and-remove.html' │ 'window.workload()' │  10   │  196.6   │ '182, 201, 207, 167, 166, 210, 166, 170, 318, 179'
'create 1000 DOM nodes and append into its previous looped node' │ 'benchmark-dom-mutation-multiple-descendant-add.html' │ 'window.workload()' │   5   │   43.2   │ '44, 46, 48, 40, 38'
'create 10000 DOM nodes and move it to new container' │ 'benchmark-dom-mutation-add-and-move.html' │ 'window.workload()' │   5   │  223.6   │ '214, 225, 247, 220, 212'

Main performance improvement seems to be for the following cases (I would like to re-run these after review in case there are mistakes in my PR):
create 1000x10x2 DOM nodes and remove a bunch of them which decreased from ~500ms to ~200ms
create 1000 DOM nodes and append into its previous looped node which decreased from ~70ms to ~40ms

I suspect that by being a bit smarter about operations should also decrease memory pressure and avoid GC sweeps, but that is hard to quantify (basing this off my experience)

Feedback/critics on the PR are much appreciated. For what it's worth, I am not very familiar with rrweb codebase so take this with a grain of salt, I may have invalidated some assumptions the lib relies on and gotten invalid benchmark results. If I did, please point it out and I will address it. If someone is interested, feel free to cherry pick my changes as this PR has gotten quite large and risky. Hopefully it inspires someone to further improve the performance of operations (I am also happy to split it myself)

While this does address performance of the current implementation, I think the largest win in this case would be to process genAdds in an async loop so that we avoid blocking the main thread (I lack context on everything that might break?). Besides that, removing try/catch statements around hot code paths like isBlocking should be reevaluated as it is a common optimization bailout reason in v8 (are those really required?)

changeset-bot · 2023-07-30T17:24:48Z

⚠️ No Changeset found

Latest commit: bc2c8f2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

packages/rrweb/src/utils.ts

packages/rrweb-snapshot/src/snapshot.ts

packages/rrweb/src/record/mutation.ts

JonasBa · 2023-08-01T10:46:58Z

I re-ran the benchmark after 5d59621

append 70 x 70 x 70 elements' │   3   │ 1669.3333333333333 │ '1688, 1666, 1654
append 1000 elements and reverse their order' │   3   │   343    │ '347, 343, 339
real events recorded on bugs.chromium.org' │   3   │ 3020.6666666666665 │ '3128, 2970, 2964
create 1000x10 DOM nodes' │ 'benchmark-dom-mutation.html' │ 'window.workload()' │  10   │  117.6   │ '115, 120, 133, 112, 119, 112, 121, 114, 109, 121
create 1000x10x2 DOM nodes and remove a bunch of them' │ 'benchmark-dom-mutation-add-and-remove.html' │ 'window.workload()' │  10   │   179    │ '187, 209, 163, 165, 201, 163, 163, 211, 164, 164
create 1000 DOM nodes and append into its previous looped node' │ 'benchmark-dom-mutation-multiple-descendant-add.html' │ 'window.workload()' │   5   │   40.6   │ '41, 42, 44, 38, 38
create 10000 DOM nodes and move it to new container' │ 'benchmark-dom-mutation-add-and-move.html' │ 'window.workload()' │   5   │  218.8   │ '210, 217, 236, 219, 212

The timings are consistent and we can still see a boost for create + remove and create + append into previous looped node

JonasBa · 2023-08-01T14:31:51Z

Actually @eoghanmurray, running the benchmark here and testing matches with the .tagName combination and it seems to be even slower. I've looked at the blink implementation of matches (just because I was curious) and it seems that each individual selector is parsed and tested in a left to right order. I suspect the dip here is because the engine can short circuit the search if the selector it first checks after .block is a * selector as anything will match

eoghanmurray · 2023-08-03T16:13:32Z

@Juice10 is gonna get the performance/benchmark tests in main shortly so will hold off on committing #1272 until we can do a nice demonstration of that in action, and hopefully apply the same to this PR

JonasBa · 2023-08-03T16:52:29Z

@eoghanmurray makes sense, just keep me updated and let me know how I can help further.

On a small side note, I looked into the blink implementation of matches and it confirms the difference between closest and matches implementations - closest applies the same check as ::matches, but it does so for each parent element. I suspect that * selector could be faster as the engine has to only compare the parent -> child relationship between nodes and not match against the actual selector as * will match anything.

bool SelectorDataList::matches(Element& targetElement) const
{
    if (m_needsUpdatedDistribution)
        targetElement.updateDistribution();

    unsigned selectorCount = m_selectors.size();
    for (unsigned i = 0; i < selectorCount; ++i) {
        if (selectorMatches(*m_selectors[i], targetElement, targetElement))
            return true;
    }

    return false;
}

Element* SelectorDataList::closest(Element& targetElement) const
{
    if (m_needsUpdatedDistribution)
        targetElement.updateDistribution();

    unsigned selectorCount = m_selectors.size();
+    for (Element* currentElement = &targetElement; currentElement; currentElement = currentElement->parentElement()) {
        for (unsigned i = 0; i < selectorCount; ++i) {
            if (selectorMatches(*m_selectors[i], *currentElement, targetElement))
                return currentElement;
        }
+    }
    return nullptr;
}

Correction: I was looking at the wrong source, however the implementation in chromium (third_party/blink/renderer/core/css/selector_query.cc) has the same difference

eoghanmurray · 2023-08-03T16:56:17Z

closest applies the same check as ::matches, but it does so for each parent element

Ya but in this case you are also modifying the selector so it's not a straight comparison... but closest does do an unnecessary check on the element itself (we've already checked that via the classList check, so I wonder if an optimisation to remove the classList check would work just as well)

I suspect that * selector could be faster as the engine has to only compare the parent -> child relationship between nodes and not match against the actual selector as * will match anything.

Performance guestimation is always counter-intuitive to me!

eoghanmurray · 2023-08-03T16:56:30Z

I've merged a recent PR that is now on main

JonasBa · 2023-08-03T17:02:56Z

@eoghanmurray would you like me to rerun the benchmarks?

JonasBa · 2023-08-03T17:15:20Z

Ya but in this case you are also modifying the selector so it's not a straight comparison... but closest does do an unnecessary check on the element itself (we've already checked that via the classList check, so I wonder if an optimisation to remove the classList check would work just as well)

I updated the benchmark https://jsbench.me/7ilkpmi7rn/1 - it seems that matches is still faster (benchmark is pushing to an array in order to escape optimizations)

JonasBa · 2023-08-03T17:18:47Z

It seems that the changes from master broke lint rules

eoghanmurray · 2023-08-03T17:46:22Z

It seems that the changes from master broke lint rules

cannot find name addedIds

I did the merge conflict resolution in the github editor and messed it up!

JonasBa · 2023-08-03T17:58:50Z

I'll reset and rebase this correctly, no worries

JonasBa · 2023-08-07T19:41:19Z

@eoghanmurray I reverted the genAdds change and updated snapshots, the changes are now in the new PR I opened. lmk if you need me to change something else

This reverts commit e188253.

JonasBa · 2023-08-10T14:19:54Z

@eoghanmurray something I just noticed is that the SDK has a default blockClass option set to "rr-block". I'm not sure why that is, maybe you have an idea? Imo unless there are external users outside of the SDK relying on this behavior, then I think this should be opt-in with no default value. The consequence of the default value is that we are performing a lot of isBlocked calls on each mutation when the page might not even contain any rr-block elements... :/

This reverts commit 5b05e6b.

eoghanmurray · 2023-08-11T10:49:35Z

Ya this is so we can tell people to 'add rr-block class to anything you want blocked' without having to also have to explain how to modify their config (or redeploy to a customer). I don't think this can be changed now unless in a major version. We could provide a recommended installation which configures blockClass: false, // performance: omit default check class of 'rr-block' but this would need to be a recipe for backwards compatibility (IMO)

JonasBa · 2023-08-11T13:07:26Z

Ya this is so we can tell people to 'add rr-block class to anything you want blocked' without having to also have to explain how to modify their config (or redeploy to a customer). I don't think this can be changed now unless in a major version. We could provide a recommended installation which configures blockClass: false, // performance: omit default check class of 'rr-block' but this would need to be a recipe for backwards compatibility (IMO)

@eoghanmurray yes, that makes sense. Imo a recommendation is fine since this is definitely in the hot path

JonasBa · 2023-09-25T03:33:43Z

@eoghanmurray friendly ping :)

JonasBa · 2023-10-05T17:26:46Z

@eoghanmurray any chance we could get a pair of eyes on this PR?

marandaneto · 2024-01-26T09:32:55Z

@eoghanmurray any chance we could get a pair of eyes on this PR?

Maybe @Juice10 can help here as I've seen their activity on other PRs recently.

pauldambra

fly-by approval fwiw since I'd love to see this land 🚀 ❤️

pauldambra · 2024-01-26T09:22:56Z

packages/rrweb-snapshot/src/snapshot.ts

+  if (!blockClass && !blockSelector) {
+    return false;
+  }


pauldambra · 2024-01-26T09:26:10Z

packages/rrweb-snapshot/src/snapshot.ts

-        (lowerIfExists(sn.attributes.property).match(/^(og|twitter|fb):/) || // og = opengraph (facebook)
-          lowerIfExists(sn.attributes.name).match(/^(og|twitter):/) ||
-          lowerIfExists(sn.attributes.name) === 'pinterest')
+        (OG_TWITTER_OR_FB_REGEXP.test(snAttributeProperty) || // og = opengraph (facebook)


very tiny nitpick: that the comment makes more sense up at the declaration now 🙈

Yuyz0112 · 2024-01-26T09:41:23Z

Is there any chance @JonasBa can split these cool optimizations into multiple PRs? So some small tweaks should be merged more quickly.

Since I found there are already some conflicts in this PR(sorry for the late review).

JonasBa · 2024-07-18T19:56:02Z

Is there any chance @JonasBa can split these cool optimizations into multiple PRs? So some small tweaks should be merged more quickly.

Since I found there are already some conflicts in this PR(sorry for the late review).

Late to this, but I'm starting to do that yes.

JonasBa · 2024-09-19T15:47:49Z

Closing this in favor of the smaller PRs that I opened

JonasBa added 6 commits July 28, 2023 15:56

perf(snapshot): optimize code flow of slimDomExcluded

cc90576

perf(snapshot): remove lowerIfExists

18ca335

perf(snapshot): return early

ba66b77

perf(snapshot): test

dc4c7b1

perf(mutation): shift n^2

85b23bf

perf(mutation): improve process

bad990b