
[perf] remove extra combine logic #1515

Conversation

@lifeart (Contributor) commented Dec 5, 2023

The combine function already handles the 0- and 1-tag cases:

[screenshot: combine() implementation handling the 0- and 1-tag cases]

I think it's safe to remove the pre-checks here.
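
For illustration only, a rough sketch of what removing the pre-checks amounts to (the surrounding code and names here are hypothetical, not the actual glimmer-vm source):

// Before: the caller pre-checks the trivial cases itself
if (tags.length === 0) return CONSTANT_TAG;
if (tags.length === 1) return tags[0];
return combine(tags);

// After: rely on combine() handling the 0- and 1-tag cases internally
return combine(tags);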

    Benchmark Results Summary    

duration phase estimated improvement -90ms [-173ms to -7ms] OR -0.56% [-1.08% to -0.04%]
renderEnd phase no difference [-2ms to 1ms]
render1000Items1End phase no difference [-15ms to 10ms]
clearItems1End phase no difference [-1ms to 2ms]
render1000Items2End phase estimated improvement -12ms [-18ms to -1ms] OR -1.21% [-1.73% to -0.07%]
clearItems2End phase no difference [0ms to 0ms]
render5000Items1End phase no difference [-66ms to 0ms]
clearManyItems1End phase no difference [-4ms to 1ms]
render5000Items2End phase estimated improvement -44ms [-66ms to -22ms] OR -1.08% [-1.61% to -0.52%]
clearManyItems2End phase estimated improvement -3ms [-6ms to 0ms] OR -1.32% [-2.54% to -0.14%]
render1000Items3End phase no difference [-16ms to 0ms]
append1000Items1End phase no difference [-4ms to 7ms]
append1000Items2End phase no difference [-15ms to 0ms]
updateEvery10thItem1End phase estimated improvement -1ms [-5ms to 0ms] OR -0.82% [-2.88% to -0.27%]
updateEvery10thItem2End phase no difference [-1ms to 0ms]
selectFirstRow1End phase no difference [0ms to 0ms]
selectSecondRow1End phase estimated regression +1ms [0ms to 6ms] OR +0.43% [0.06% to 4.4%]
removeFirstRow1End phase no difference [0ms to 1ms]
removeSecondRow1End phase estimated regression +14ms [7ms to 16ms] OR +5.93% [3.2% to 7.05%]
swapRows1End phase estimated improvement -3ms [-7ms to 0ms] OR -1.77% [-4.55% to -0.15%]
swapRows2End phase no difference [-1ms to 0ms]
clearItems4End phase estimated regression +4ms [1ms to 7ms] OR +3.01% [0.9% to 5.14%]
paint phase no difference [0ms to 0ms]

[screenshot: benchmark results chart]

artifact-1.pdf


Set check ideas (no longer current)

In addition, we could check whether the item is already in the set; that seems about 20% faster (https://jsben.ch/ti5Ts) than letting the set overwrite it.

[screenshot: proposed Set has-check change]
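
A minimal sketch of the idea (hypothetical method and field names, not the actual diff):

// Proposed: return early when the tag is already tracked,
// instead of letting Set.prototype.add overwrite it
if (this.tags.has(tag)) return;
this.tags.add(tag);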

@lifeart lifeart force-pushed the perf-improvements-remove-extra-combine-logic branch from ff7fd6d to 6f2d044 on December 5, 2023 at 15:46
@chancancode (Contributor)

I don't think there is a particularly compelling reason to revisit this code. The original intention was probably to avoid having to allocate an array to pass to combine() in those cases just to throw it away immediately. Does it matter? Maybe not, but it would likely take more work to investigate that than any real-world macro gains from the change. As for the set change, those are the kind of micro-perf things we deliberately don't want to chase: the benchmark itself is too small to be reliable (and the site/harness has so many ads running 😐), and it's the kind of thing that can easily change between different browsers/browser versions.

@chancancode chancancode closed this Dec 5, 2023
@lifeart (Contributor, Author) commented Dec 5, 2023

@chancancode could we keep just one change from this PR?
An early return if the item is already in the set: if (this.tags.has(tag)) return;

@chancancode (Contributor)

No, that's what I was saying in the second half of my response: I specifically don't think we need to, or should, do that.

Unless we have evidence that this ends up being a hot spot in macro app benchmarks, I don't think that kind of change is a good idea to make based on just those microbenchmarks you did. Even if you set up the benchmark perfectly, it is still too small to be meaningful, and even if it happens to be slightly faster in the version of the browser you tested, that can easily change in a different browser, or in a different version. The current code makes a single call to Set.prototype.add; maybe that is somehow not particularly optimized, but if you switch it to make two separate calls in the happy path, that could easily end up being slower a few versions later when .add is optimized better, and no one has time to keep track of these things that are baked deeply into the codebase.

So we just do the straightforward thing everywhere unless there is specific evidence that one spot needs to be optimized in a particular way, and even then you should heavily document why it deviates from the normal pattern in code comments, etc.

@NullVoxPopuli NullVoxPopuli reopened this Dec 5, 2023
@NullVoxPopuli (Contributor)

I'm re-opening because there is no evidence one way or another, and the point of opening this PR is to communicate that we need to run benchmarks on this, before and after.

@lifeart (Contributor, Author) commented Dec 5, 2023

Some extra notes here: I found the V8 implementation details for has and add:

https://github.com/nodejs/node/blob/1e31a01f898a3586faf01480928161f2f5ef965d/deps/v8/src/builtins/builtins-collections-gen.cc#L1730

HAS

[screenshot: V8 builtin for Set.prototype.has]

ADD

[screenshot: V8 builtin for Set.prototype.add]

ADD_TO_ORDERED_HASH_TABLE

[screenshot: V8 AddToOrderedHashTable implementation]

And it seems has really is faster in terms of the logic involved. But I agree with @chancancode that we shouldn't move on this until we verify it really makes a difference in a macro benchmark.

@wycats (Contributor) commented Dec 5, 2023

A few things:

  • At minimum, we should wait until the errors feature has landed before considering this kind of change, because the errors feature touches on a lot of the core infrastructure.
  • I agree with Godfrey that there should be a high bar for tweaking the deep tags/reference implementation. Changes should either come from feature needs or after identifying performance issues at a macro level and pinning them on a piece of low-level implementation.
  • At bare minimum, changes like this should come along with reproducible benchmarks showing that the change makes a difference. Performance improvements also need to be verified in terms of higher-level benchmarks.

@lifeart your analysis of the v8 codebase is impressive and a really good start for the work that would go into a possible optimization.

Let's circle back around once the errors feature has landed and figure out what it would take to verify that the performance benefits are significant enough to meet the "changes to low-level features" bar.

In the meantime, you can feel free to apply the change to a local version of Ember and see if it impacts any of the ecosystem's macro-benchmarks (such as the https://github.com/krausest/js-framework-benchmark). Any information along those lines would be a good head start.

@lifeart (Contributor, Author) commented Dec 5, 2023

@chancancode @wycats @NullVoxPopuli

Left to right:
1. Original 5.3
2. Only the tags.has check
3. The tags.has check plus the extra combine logic removed (replaced with Array.from(this.tags))

Here are the results:

[screenshot: benchmark results for the three variants]

Memory allocation:

[screenshot: memory allocation comparison]

The results seem only a bit better (2-7%), but we get almost a 2 MB memory improvement when rendering large lists.

@NullVoxPopuli (Contributor)

that's amazing!!!! nice work!!

@chancancode (Contributor)

Sorry for being a bit terse and not fully unpacking the reasoning the first time; if I seemed a bit impatient, it is because this is not the first time we have had to restate these things. There is a particular piece of institutional knowledge/learned experience from the project that I am deliberately trying to pass on here.

Usually, it would be faster and sufficient to say "we don't prefer to accept these kinds of changes, maybe you can find something else to work on?", especially when there are plenty of other low-hanging fruit here. But since you seem to be particularly motivated to work on these things, you deserve a longer explanation, and hopefully you can internalize these principles, find places in the codebase to invest your effort in a way that is productive for everyone, and maybe pass it on in the future.

So let's unpack.

Let's take this PR as an example, but the principles apply more generally. There are two distinct changes that you are trying to make here:

  1. introducing a new micro-optimization
  2. removing an existing micro-optimization

Let's look at these separately.


The first one, at its core, is the claim that:

// Version A: Existing
set.add(potentiallyExistingItem);

...is slower than...

// Version B: Proposed
if (!set.has(potentiallyExistingItem)) {
  set.add(potentiallyExistingItem);
}

This seems counter-intuitive on the surface, as the new code appears to be doing more work, potentially duplicating work that is already happening inside set.add.

Now, let's say it actually is the case that the new code is universally faster across various engines and use cases, then I think the appropriate thing to do with that discovery is to upstream that to V8/etc. If it is always faster for us to do this in JavaScript/userland, surely they could do this even faster in native code.

If this is indeed universally faster, even if we don't propose the change, someone else will eventually notice it and optimize this in the engines. That's great, everyone/everywhere will benefit from the change!

But, in that world, if we had accepted this change now, then we would now be stuck with the slower code, because we would really be duplicating work that is already happening in the engine anyway, and the engine can't necessarily/easily optimize away the extra call we are making into the native code, and probably no one would be around to notice and remove the workaround. That would be a net negative in the long term. There are plenty of these workarounds in Ember and in here, accumulated over the years, and they are very difficult to maintain and even remove! (We will discuss that one later.)

Now, the above was with the assumption that "if this is universally faster". If you actually tried to upstream that change, I suspect this is the pushback you may get:

This is very much workload dependent, and not clearly a win across the board.

What does that mean? Well, take a step back and think about why version B may be faster in the case you measured:

  1. Let's assume set.has() takes the same amount of time, T1 on average to complete, for both existing and non-existing items
  2. set.add(existingItem) takes T2 on average to complete
  3. set.add(nonExistingItem) takes T3 on average to complete
  4. p represents the fraction of calls made with an existing item, and (1-p) the fraction made with non-existing items

For version B to be faster, the following has to hold:

                T(Version A) > T(Version B)
        p * T2 +  (1-p) * T3 > p * T1 + (1-p) * (T1 + T3)
        p * T2 + T3 - p * T3 > p * T1 + T1 + T3 - p * T1 - p * T3
        p * T2      - p * T3 > p * T1 + T1      - p * T1 - p * T3
        p * T2               > p * T1 + T1      - p * T1
        p * T2               >          T1
                           p > T1 / T2

Which is a fancy way to say, assuming the performance profiles of these methods are fixed (which is not even that great an assumption, as they can easily vary between desktop/mobile/browsers/CPU architecture etc.), whether version A or version B is faster depends entirely on how often you call this method with duplicated items!

You can intuit this as well: clearly, if you never call this method with duplicate items, version B has to be slower. It follows that there is a crossover point; once you call the method often enough with duplicate items, the extra code you added starts to pay off by eliminating the slower path in .add for those calls.
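
As a purely illustrative calculation (the timings below are made up, not measurements), the crossover is easy to see:

// Hypothetical costs in arbitrary units: has() costs T1 = 1,
// add() of an existing item costs T2 = 2, add() of a new item costs T3 = 3
const T1 = 1, T2 = 2, T3 = 3;
const costA = (p) => p * T2 + (1 - p) * T3;  // Version A: always call add()
const costB = (p) => T1 + (1 - p) * T3;      // Version B: has() first, add() only if new
console.log(costA(0.3), costB(0.3)); // 2.7 vs 3.1: B loses below the crossover p = T1/T2 = 0.5
console.log(costA(0.8), costB(0.8)); // 2.2 vs 1.6: B wins above it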

In order for this to be a win in the real world, we essentially need to:

  1. (Indirectly) measure T1/T2 across enough browsers/devices to feel confident that the change is worth it in enough browsers
  2. (Indirectly) measure what p is

(1) is frankly quite annoying to do and difficult to measure correctly, but at least it's tractable. But what is p? It's the famous "it depends". It's different in every app so the best you could do is probably measure it on a real app (ideally, a few real apps, even – you don't want to accidentally make Audit Board faster but slow everyone else down), and hope it's representative.

But you know what is not representative for these kinds of scenarios? Microbenchmarks and artificial benchmarks. You can probably intuit why: here we are talking about the percentage of calls to the auto-tracker where we attempt to add an already-existing tag to an auto-tracking frame that already contains the same tag.

Think about the diversity of auto-tracked getters/helpers/etc. in your app and the vast number of different things they do, and then go look at the code for the JS framework benchmark: how many tracked getters does it have? Does it remotely resemble the diversity of code in your production app? Does it give you confidence that optimizing for that benchmark in this case would definitely yield good results for your actual app? It doesn't for me. Can you see at least a possibility that this may actually make things slower in your actual app when you account for real-world usage patterns? I definitely can.

Do you see the problem here?

But there is more! Even if you did all the work to measure things right now, things can change over time. The browser can optimize its implementations, swinging T1/T2 from version to version (we already talked about that), but so could our code. If someone refactors the code, that could vastly change the value of p!

Speaking of which, here is something better to investigate: evidently, your measurements are showing that p is high enough – that is, we must be adding duplicate tags to the same tracking frame often enough for this to matter.

That seems quite surprising to me – how often do you reference the same piece of tracked state over and over again within, say, the same getter? I would expect it to happen occasionally, but not that often (other than isConst() tags, which we already guard against). Looking into it may reveal places where we are doing things unnecessarily, opportunities to refactor the JS framework microbenchmark code, or evidence for an RFC to add some kind of opt-out utility for tracking frames in the few places where it matters. Personally I am not a massive fan of the latter, as it tends to be quite distracting for developers (we are basically talking about {{unbound}}), and the same argument about fragile optimizations holds just as much in app code, but if you can find compelling cases in real apps where it actually matters a lot right now, and we can't easily fix it on our end, I think it would be hard not to at least entertain that as a power tool. A contrived sketch of the kind of pattern in question follows.
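
For concreteness, a made-up example (not taken from any real app or from glimmer-vm itself); every read of this.items during a render hands the same tag to the active tracking frame again:

import { tracked } from '@glimmer/tracking';

class Cart {
  @tracked items = [];

  // `this.items` is read several times here, so the same tag gets passed to
  // the active auto-tracking frame repeatedly; only the first add matters.
  get summary() {
    const count = this.items.length;
    const first = this.items[0] ?? 'none';
    const last = this.items[count - 1] ?? 'none';
    return `${count} items, first: ${first}, last: ${last}`;
  }
}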

Anyway. I don't think these hurdles are necessarily impossible to overcome, and I am not saying that there are zero cases where it is worth it. But I hope you can appreciate why it is so important to measure these things correctly, and why doing that properly is very difficult and a lot of work. These are not easy PRs to review either, despite the seemingly tiny code change – you are basically reviewing the methodology and data, as well as thinking about any unintended consequences elsewhere. Balancing all of this, hopefully you can see why chasing these tiny gains is, in the grand scheme of things, not very appealing and generally not worth everyone's time, and in the long term it can easily flip around and become a liability.


Now let's talk about the second part of your change to remove an existing micro-optimization that someone else already added.

Ironically, you are now witnessing the consequence of chasing these gains from the opposite side!!

Presumably, once upon a time, someone added that optimization with good intentions, perhaps even did some of those measurements we discussed and found it to be well worth it. But it wasn't written down in the code, the benchmarks weren't maintained, so, who knows. Even if it was benchmarked at the time, that was probably in the era well before Apple Silicon. Are those results still valid? Has anyone retested them lately? Do you see why these are a maintenance hazard in the long run?

Still, we can't be reckless about it. IMO the default/conservative attitude toward existing code is to assume that it is there for a good reason unless we can prove otherwise with reasonable confidence. I think it should be a lower bar to clear for removing existing optimizations compared to adding new ones, but still, the benchmarks you did aren't remotely enough to be conclusive.

Even the JS framework benchmark isn't that great: here it essentially boils down to the distribution of the number of items typically found in the set. The artificial benchmark has two components, and I would not consider that distribution representative of your real apps; I hope you can see why.

Or, you may be making an even stronger claim – that the existing optimization never pays off regardless of the number of items. That is perhaps possible to prove in a more targeted microbenchmark, but even then, the possibility of much more aggressive inlining (and the much more pronounced effect of deopts) tends to make even these targeted microbenchmarks not very representative.

Still, I think this would be a promising candidate to look into, especially from the memory angle of eliminating that extra field (though it is a little suspect: tracking frames are short-lived, transient objects with a tiny lifetime in the grand scheme of things, so it feels suspicious that they would have a strong impact on memory consumption – a higher high-water mark I can believe, but straight-up memory consumption would imply they are being held longer than expected, which could indicate other issues).

Anyway, if you understand what I am saying and are still down to do more comprehensive testing (at least a reasonable variety of browsers on real app workloads) to make sure we aren't doing anything bad, I am pretty open to removing that if we can't find a compelling reason to keep it around.


Hopefully I have said enough this time to show that I am not just turning you away for no reason or throwing up arbitrary roadblocks. I am not saying stop working on performance improvements, but I am saying that you may want to pick your battles wisely (also considering review bandwidth, among other things), and I think that would help you pick better areas to focus on that justify the overall cost.

If the starting point of this was "this is my work app, it is slow under these situations, we profiled it and landed on X being the culprit", I think that would be a very different conversation.

@lifeart (Contributor, Author) commented Dec 7, 2023

@chancancode thank you for the detailed answer!

I'm not pushing this change, and I really appreciate your time and patience in reviewing all of this and providing answers. I think it's a great round-up in general, and we could use it as a reference for future related PRs.

I also agree that we should raise the Set has/add performance finding with the V8 team (I have already sent it to the relevant people).

Regarding the already existing micro-optimizations:
I agree that we should be able to remove them, or at least have a way to confirm they are still relevant; otherwise it's not a "stability without stagnation" approach but keeping things as-is just because our grandpas did it that way. The JS world, and browsers especially, move fast, and we should be flexible here.

Contributor notes: I think micro-optimizations should be explicitly marked with comments (see the hypothetical example below); otherwise we should treat that code as a candidate for improvement.
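
A hypothetical illustration of what such a marker might look like (the comment text and surrounding line are invented, not from the codebase):

// PERF: intentional micro-optimization; benchmarked against calling combine()
// directly when it was added. Re-verify in current browsers before removing.
if (tags.length === 1) return tags[0];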

I do believe we need to try to cover cases where we can improve performance by 2-3%, because after 10-30 such iterations we could end up with a 7-10% improvement; it's small steps toward a notable difference.

You mentioned low-hanging perf fruit in the codebase. It would be great to have an issue-like thing with references to those places, so that anyone who would like to improve the codebase can look into them and make meaningful changes blessed by the core team.

@lifeart lifeart marked this pull request as draft December 7, 2023 11:01
@lifeart lifeart force-pushed the perf-improvements-remove-extra-combine-logic branch from 6f2d044 to 46c32e6 on December 23, 2023 at 13:24
@lifeart (Contributor, Author) commented Dec 23, 2023

Outdated bench results

Bench results (23 Dec 2023) (with has check)

    Benchmark Results Summary    

duration phase no difference [-66ms to 37ms]
render1000Items1End phase no difference [-7ms to 6ms]
clearItems1End phase no difference [0ms to 1ms]
render1000Items2End phase no difference [-4ms to 6ms]
clearItems2End phase no difference [-1ms to 0ms]
render10000Items1End phase estimated improvement -32ms [-65ms to -3ms] OR -0.83% [-1.69% to -0.08%]
clearItems3End phase no difference [-1ms to 2ms]
render1000Items3End phase estimated regression +9ms [2ms to 17ms] OR +2.03% [0.46% to 3.92%]
append1000Items1End phase no difference [-3ms to 9ms]
updateEvery10thItem1End phase no difference [0ms to 1ms]
selectFirstRow1End phase estimated regression +7ms [5ms to 8ms] OR +16.92% [12.61% to 21.22%]
selectSecondRow1End phase estimated improvement -4ms [-5ms to -4ms] OR -32.48% [-38.47% to -26.39%]
removeFirstRow1End phase no difference [-5ms to 0ms]
removeSecondRow1End phase no difference [0ms to 2ms]
swapRows1End phase no difference [-2ms to 1ms]
clearItems4End phase estimated regression +1ms [0ms to 2ms] OR +2.38% [0.41% to 4.29%]
paint phase no difference [0ms to 0ms]

artifact-1.pdf

Without has check

    Benchmark Results Summary    

duration phase estimated improvement -63ms [-91ms to -33ms] OR -1.03% [-1.49% to -0.54%]
render1000Items1End phase no difference [-9ms to 1ms]
clearItems1End phase no difference [0ms to 0ms]
render1000Items2End phase no difference [-3ms to 3ms]
clearItems2End phase no difference [-1ms to 0ms]
render10000Items1End phase estimated improvement -66ms [-90ms to -43ms] OR -1.72% [-2.35% to -1.13%]
clearItems3End phase estimated improvement -2ms [-4ms to -1ms] OR -1.14% [-1.76% to -0.56%]
render1000Items3End phase estimated regression +7ms [2ms to 12ms] OR +1.59% [0.49% to 2.72%]
append1000Items1End phase no difference [-2ms to 4ms]
updateEvery10thItem1End phase no difference [0ms to 0ms]
selectFirstRow1End phase estimated regression +6ms [5ms to 7ms] OR +15.69% [14% to 17.44%]
selectSecondRow1End phase estimated improvement -5ms [-5ms to -4ms] OR -35.49% [-39.39% to -30.53%]
removeFirstRow1End phase no difference [-2ms to 1ms]
removeSecondRow1End phase no difference [0ms to 2ms]
swapRows1End phase no difference [-2ms to 0ms]
clearItems4End phase no difference [0ms to 1ms]
paint phase no difference [0ms to 0ms]

[screenshot: benchmark results]

artifact-1.pdf

@lifeart lifeart force-pushed the perf-improvements-remove-extra-combine-logic branch from 010d935 to 9c480ca on December 24, 2023 at 20:13
@chancancode (Contributor)

> You mentioned low-hanging perf fruit in the codebase. It would be great to have an issue-like thing with references to those places, so that anyone who would like to improve the codebase can look into them and make meaningful changes blessed by the core team.

I meant to come back to this and write a longer response, but I never found the time to do it, so here is a short one:

I didn't mean low-hanging fruit in performance work. I think part of working on performance is to embrace that interactions at scale are messy and often unpredictable, to accept that most of the time you don't really know for sure until you are able to measure at the appropriate scale for the change, and that a lot of the time things don't end up panning out.

We should also get on the same page about what we mean by "macro" benchmarks. While certainly better than a micro isolated loop, the js-framework-benchmark is very focused on a particular (and IMO not representative) workload and skews things pretty massively in a certain direction. In general we really need to be testing in real apps, and that is what I said I would prefer as a starting point: start with a real app, find out what's slow in real life, and then hone in from that direction. It is much easier to validate the motivation/assumptions in a perf PR that way.

I think making it easier to link and test things in a real app would go a long way. #1523 is a step in the right direction, but there is much more to do. I don't know if you consider that "low hanging", but those are the things I meant. Other examples are implementing the (element) helper and the truth helpers. There are some previous attempts that stalled for various reasons.

I offered this to @NullVoxPopuli before (they ended up getting busy with other things, which is fine), and I'd extend it to anyone else: get the tests (e.g. for element) ported/rebased and failing, and we can pair for an hour or two on getting them to work, if you are down to follow up with the remaining polish work to get it across the finish line. I'm sorry it has to be that way, but with my time being pretty limited (I "retired" for a reason, after all), we'll have to make the most out of the time together, so things would have to be ready to go before we pair, and even then we'll have to mostly focus on the broad strokes. But given what you were able to figure out on your own in these other PRs, I don't think that would be much of a challenge for you, if that's something you are interested in.

@lifeart (Contributor, Author) commented Jan 11, 2024

Closing as not significant

@lifeart lifeart closed this Jan 11, 2024
@lifeart lifeart deleted the perf-improvements-remove-extra-combine-logic branch January 11, 2024 08:37
@lifeart lifeart mentioned this pull request Jan 11, 2024