[DataGrid] Filtering performance #9120

Closed

Contributor

@romgrk romgrk commented May 26, 2023

Summary

This PR is a POC to demonstrate how we can improve the performance of our filtering, and other operations. It's not meant to be merged as it is, I'm just exploring ways to evolve our API & architecture to improve performance. Please add comments.

https://deploy-preview-9120--material-ui-x.netlify.app/x/react-data-grid/filtering/#header-filters

Results

This PR improves the speed of one-column string-contains filtering by about 75%.
(edit: the first 2 commits of this PR do, and the later commits improve to 93%)

Changes & observations

Our biggest cost by far is memory allocations. The amount of objects & functions (closures) being allocated to pass data around creates much more CPU work than anything else.

1. Fast filters

The first change is to define some of our filter functions as "fast filters". What this means is that they only use the cellParams.value field of their input arguments, so we can avoid calling the expensive API function .getCellParams() and instead only call .getCellValue().

In practice, from what I've seen, all of our filters could be fast filters, so we could change the API. But the filter API is public so we can't change it without a major version increase.
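
To make the difference concrete, here is a minimal sketch of a "contains" filter written both ways. The types and the getCellParams/getCellValue helpers below are simplified stand-ins for illustration, not the actual grid API.

```typescript
// Hypothetical, simplified shapes; not the real MUI X types.
type Row = { id: number; name: string };

interface CellParams {
  id: number;
  field: keyof Row;
  value: unknown;
  row: Row;
  formattedValue: string;
}

// Slow path: builds a full params object for every cell.
function getCellParams(row: Row, field: keyof Row): CellParams {
  const value = row[field];
  return { id: row.id, field, value, row, formattedValue: String(value) };
}

// Fast path: returns only the value, with no extra allocation.
function getCellValue(row: Row, field: keyof Row): unknown {
  return row[field];
}

// The same "contains" filter in both styles.
const containsSlow = (needle: string) => (params: CellParams) =>
  String(params.value).toLowerCase().includes(needle);
const containsFast = (needle: string) => (value: unknown) =>
  String(value).toLowerCase().includes(needle);

const rows: Row[] = [
  { id: 1, name: 'Samantha' },
  { id: 2, name: 'Oliver' },
];

const slow = containsSlow('am');
const fast = containsFast('am');

const slowIds = rows.filter((r) => slow(getCellParams(r, 'name'))).map((r) => r.id);
const fastIds = rows.filter((r) => fast(getCellValue(r, 'name'))).map((r) => r.id);
```

Both paths select the same rows; the fast one simply never allocates the params object.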

2. Memoize filtered items

The second change is to avoid calling getFilterCallbackFromItem() during passFilterLogic(). The former is called simply to filter the model items, but it's a very expensive call (it basically recreates the filter function) and it was done for every row, so for N rows we were re-creating the filtering function N times (plus 1, for the filtering function actually used for filtering).
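
A rough sketch of the before/after, assuming a hypothetical buildApplier() that stands in for the expensive getFilterCallbackFromItem() call. The counter is only there to show how often the factory runs.

```typescript
// Hypothetical shapes for illustration only.
type FilterItem = { field: string; operator: 'contains'; value: string };
type Row = Record<string, string | undefined>;
type Applier = (row: Row) => boolean;

let buildCount = 0;

// Stand-in for an expensive factory such as getFilterCallbackFromItem().
function buildApplier(item: FilterItem): Applier {
  buildCount += 1;
  const needle = item.value.toLowerCase();
  return (row) => (row[item.field] ?? '').toLowerCase().includes(needle);
}

// Before: the applier is rebuilt inside the per-row loop.
function filterNaive(rows: Row[], items: FilterItem[]): Row[] {
  return rows.filter((row) => items.every((item) => buildApplier(item)(row)));
}

// After: appliers are built once per filter model, then reused for every row.
function filterMemoized(rows: Row[], items: FilterItem[]): Row[] {
  const appliers = items.map((item) => buildApplier(item));
  return rows.filter((row) => appliers.every((applier) => applier(row)));
}

const sampleRows: Row[] = [{ name: 'Samantha' }, { name: 'Oliver' }, { name: 'Liam' }];
const sampleItems: FilterItem[] = [{ field: 'name', operator: 'contains', value: 'am' }];

buildCount = 0;
const naiveResult = filterNaive(sampleRows, sampleItems);
const naiveBuilds = buildCount; // one build per row

buildCount = 0;
const memoizedResult = filterMemoized(sampleRows, sampleItems);
const memoizedBuilds = buildCount; // one build in total
```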

Edit: I've added more changes below, see the next comment for details. Each commit corresponds to one change, and it's probably easier to read each commit independently than the PR change as a whole.

Benchmarks

The measurements were done by filtering 100,000 rows with the string "am". The Elapsed (ms) results below are generated by wrapping the flatFilteringMethod with performance.now() timings.

Before: Screenshot from 2023-05-25 19-52-23
After: Screenshot from 2023-05-25 19-53-59

NOTE: The results above don't use the same set of 100,000 rows (it's the Employee dataset, randomly generated), which is why the Count differs. These results are approximate but give an accurate perspective and were consistently reproducible.

@romgrk romgrk added the performance and component: data grid labels May 26, 2023
@mui-bot

mui-bot commented May 26, 2023

Netlify deploy preview

Netlify deploy preview: https://deploy-preview-9120--material-ui-x.netlify.app/

Updated pages

No updates.

These are the results for the performance tests:

| Test case | Unit | Min | Max | Median | Mean | σ |
|---|---|---|---|---|---|---|
| Filter 100k rows | ms | 313.2 | 502.7 | 351.9 | 376.5 | 70.729 |
| Sort 100k rows | ms | 628.4 | 1,164.4 | 628.4 | 961.52 | 186.242 |
| Select 100k rows | ms | 185.7 | 333.9 | 277.3 | 272.5 | 49.567 |
| Deselect 100k rows | ms | 134 | 305.1 | 269.4 | 238.3 | 66.966 |

Generated by 🚫 dangerJS against 189acf0

@romgrk
Contributor Author

romgrk commented May 26, 2023

I've added more changes, some of them quite exotic, to see how much I could push the performance. They've improved the filtering by another 10%, so 85% compared to the baseline. They're described here.

Changes

3. Non-memoized selectors

The selectors created with createSelector() have a cost, which becomes apparent when they're called in a loop like filtering does. Most of our selectors don't need the memoization of createSelector(), so I've added createRawSelector() to avoid paying the memoization cost.

I recommend that we apply this change to all our existing selectors that don't need memoization (all those that don't derive their input argument).
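
As a rough illustration of where the cost comes from, the sketch below compares a minimal memoizing wrapper with a pass-through. Both functions are simplified assumptions written for this example; the real createSelector() and the createRawSelector() added in this PR may differ.

```typescript
// Simplified, hypothetical state shape (not the real grid state).
type State = { filter: { items: string[] } };

// Minimal memoizing wrapper: remembers the last state and result.
// This only shows where the extra work comes from on each call.
function createSelector<R>(compute: (state: State) => R): (state: State) => R {
  let lastState: State | undefined;
  let lastResult: R | undefined;
  return (state) => {
    if (state !== lastState) {
      lastResult = compute(state);
      lastState = state;
    }
    return lastResult as R;
  };
}

// Hypothetical raw variant: no cache and no comparison, just the function itself.
function createRawSelector<R>(compute: (state: State) => R): (state: State) => R {
  return compute;
}

const state: State = { filter: { items: ['name contains am'] } };

// A plain property read gains nothing from memoization.
const memoizedItems = createSelector((s) => s.filter.items);
const rawItems = createRawSelector((s) => s.filter.items);

const sameResult = memoizedItems(state) === rawItems(state);
```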

4. Reduce API methods indirection

The useGridApiMethod hook creates indirection by wrapping in a trampoline function. If we remove it, we avoid function calls and allocations (the rest arguments).

privateApiRef.current.register(visibility, {
  [methodName]: (...args: any[]) => {
    const fn = apiMethodsRef.current[methodName] as Function;
    return fn(...args);
  },
});

If there aren't apparent downsides to this, I would suggest that we do it.
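
The trade-off can be sketched in isolation. Here implRef is a hypothetical stand-in for apiMethodsRef: the trampoline keeps one stable reference that always forwards to the latest implementation, while direct registration skips the extra call but stays bound to the function it was given.

```typescript
type Add = (a: number, b: number) => number;

// Stand-in for apiMethodsRef: holds the latest implementation.
const implRef: { current: { add: Add } } = { current: { add: (a, b) => a + b } };

// Trampoline: one stable function that forwards to the latest implementation.
// Each call pays for an extra function call and a rest-arguments array.
const trampolineAdd: Add = (...args: [number, number]) => implRef.current.add(...args);

// Direct registration: no indirection, but the reference is tied to one implementation.
const directAdd: Add = implRef.current.add;

// Replace the implementation, as a re-render (or a test spy) would.
implRef.current.add = (a, b) => a + b + 100;

const viaTrampoline = trampolineAdd(1, 2); // forwards to the new implementation
const viaDirect = directAdd(1, 2); // still calls the old implementation
```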

5. Avoid dynamic object props

The pattern { [dynamicKey]: value } is expensive for JS engines, so I've used eval(...) to create a function with a static { key: value }. This is exotic/evil and I don't really recommend that we implement it, but it's an interesting optimization.

6. Direct state access

Avoiding selectors and accessing apiRef.current.state directly in our internal functions can improve performance; selector functions (even unmemoized) still have a cost.

7. Use Set

We often use simple objects for use cases where we do Set operations. Using the proper data structure improves performance, but these changes are also semver major because they're selectable by the public selectors.
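
A small sketch of the two shapes, using made-up ids. It also shows one behavioral difference worth keeping in mind when migrating: object keys are coerced to strings, Set members are not.

```typescript
type RowId = number | string;

const ids: RowId[] = [1, 2, 'a'];

// Before: a plain object used as a set. Keys are coerced to strings.
const lookupObject: Record<string, true> = {};
ids.forEach((id) => {
  lookupObject[String(id)] = true;
});

// After: a Set keeps the ids as-is and offers has/add/delete directly.
const lookupSet = new Set<RowId>(ids);

const objectHasStringOne = '1' in lookupObject; // 1 and '1' collapse to the same key
const setHasNumberOne = lookupSet.has(1);
const setHasStringOne = lookupSet.has('1'); // 1 and '1' are distinct members
```

The shape change is what makes this breaking for consumers: code that reads lookup[id] has to switch to lookup.has(id).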

Benchmarks

@flaviendelangle
Member

In practice, from what I've seen, all of our filters could be fast filters, so we could change the API. But the filter API is public so we can't change it without a major version increase.

Sorry if you already proposed it.
But for this kind of scenario where we want to allow people to access some rarely used data, it would probably be better to pass a callback so that it's only executed if the user actually wants it.

Something like:

cellFilterParams: {
  value: TValue;
  getCellParams: () => GridCellParams;
}

Or if we want to avoid creating the callback (not sure how negligible that is on a large dataset), we just say that people can now access the apiRef and call apiRef.current.getCellParams themselves, so we pass:

cellFilterParams: {
  value: TValue;
  id: GridRowId;
  field: string;
}

Member

@cherniavskii cherniavskii left a comment

Great work! 🚀

@@ -20,14 +20,10 @@ export function useGridApiMethod<
return;
}
apiMethodsNames.forEach((methodName) => {
if (!privateApiRef.current.hasOwnProperty(methodName)) {
Member

The goal here was to have stable references to API methods.
For example, with this change, it's now impossible to spy on API methods, because the spy gets overridden with the new function reference (this is why the tests fail).

Member

Could we use useEventCallback on the methods instead?

Contributor Author

Is there any reason other than testing for which we would like to keep the stable references?

Contributor Author

I've benchmarked this one: for the raw filtering time, this indirection degrades performance by 11% if we take the final results as the baseline.

Member

As far as I know, testing is the only reason to keep the stable references.

Contributor Author

Got it, I'll try to make it work without the trampoline then.

@romgrk
Contributor Author

romgrk commented May 28, 2023

Did some more profiling, I found two more substantial improvements:

8. Remove API Proxy wrapper

The wrapper around the API object is expensive, in particular if it needs unwrapping. Removing it would improve our performance across all the DataGrid functions.

9. Iterate rows directly

We are currently iterating row ids in flatFilteringMethod, which is slow and inefficient because we then need to do dataRowIdToModelLookup[id] to retrieve the row, and that lookup is costly when the object is huge. If we know we are filtering all the rows, then we're much better off iterating row objects directly. Below is a comparison.

Before (Builtins_KeyedLoadIC_Megamorphic is the relevant entry)
Screenshot from 2023-05-28 02-53-11

After
Screenshot from 2023-05-28 13-49-21
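
In code, the two iteration strategies look roughly like this. The rows, rowIds and idToRow values are small stand-ins for the grid state (idToRow plays the role of dataRowIdToModelLookup), and matches() is a hypothetical predicate.

```typescript
type Row = { id: number; name: string };

const rows: Row[] = [
  { id: 1, name: 'Samantha' },
  { id: 2, name: 'Oliver' },
  { id: 3, name: 'Liam' },
];

// Stand-ins for the id list and the id-to-row lookup kept in state.
const rowIds: number[] = rows.map((row) => row.id);
const idToRow: Record<number, Row> = {};
rows.forEach((row) => {
  idToRow[row.id] = row;
});

const matches = (row: Row): boolean => row.name.toLowerCase().includes('am');

// Before: iterate ids, then pay for a keyed lookup on a large object per row.
function filterByIds(): Set<number> {
  const passing = new Set<number>();
  rowIds.forEach((id) => {
    const row = idToRow[id];
    if (row !== undefined && matches(row)) passing.add(id);
  });
  return passing;
}

// After: iterate the row objects directly; no lookup is needed.
function filterByRows(): Set<number> {
  const passing = new Set<number>();
  for (let i = 0; i < rows.length; i += 1) {
    const row = rows[i];
    if (row !== undefined && matches(row)) passing.add(row.id);
  }
  return passing;
}

const byIds = filterByIds();
const byRows = filterByRows();
```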

Final results

The improvement compared to the baseline is 93% with these last changes. I haven't added the commits here yet, I'll clean up what I have and evaluate which commits we can merge cleanly without a semver major change.

Comment on lines +231 to +249
+ if (appliers.length === 1) {
+   const applier = appliers[0];
+   const applierFn = applier.fn;

-   const filteredAppliers = shouldApplyFilter
-     ? appliers.filter((applier) => shouldApplyFilter(applier.item.field))
-     : appliers;
+   const applierCall = applier.v7
+     ? 'applierFn(row)'
+     : 'applierFn(getRowId ? getRowId(row) : row.id)';
+   const fn = eval(`
+     (row, shouldApplyFilter) => {
+       // ${applierFn.name} <- Keep a ref, prevent the bundler from optimizing away

-   filteredAppliers.forEach((applier) => {
-     resultPerItemId[applier.item.id!] = applier.fn(rowId);
-   });
+       if (shouldApplyFilter && !shouldApplyFilter(applier.item.field)) {
+         return { '${applier.item.id!}': false };
+       }
+       return { '${applier.item.id!}': ${applierCall} };
+     }
+   `);

+   return fn;
Contributor Author

So if we consider the final results as the baseline, and we decide to not apply this change, the performance degrades by 20%. I know "eval is evil", but in this case we're getting a real performance improvement, so I feel like we should keep it.

Member

Exotic indeed, but also risky - try running this locally 😅

import * as React from 'react';
import { DataGrid } from '@mui/x-data-grid';

const columns = [{ field: 'id' }];

export default function InitialFilters() {
  const [rows, setRows] = React.useState<any[]>([]);

  React.useEffect(() => {
    setRows([{ id: 1 }]);
  }, []);

  return (
    <div style={{ height: 400, width: '100%' }}>
      <DataGrid
        columns={columns}
        rows={rows}
        filterModel={{
          items: [
            {
              field: 'id',
              operator: 'equals',
              value: '1',
              id: "1': alert('hello') } //",
            },
          ],
        }}
      />
    </div>
  );
}

Contributor Author

Fair point! Could we reliably escape that value with ... { ${JSON.stringify(String(applier.item.id))}: ...? I feel that if String is guaranteed to return a string, then JSON.stringify should be guaranteed to return valid javascript string representation.

Member

I'm not sure I follow.
${JSON.stringify(String(applier.item.id))} would return the same string, and it will be evaluated as before. Am I missing something?

Contributor Author

JSON.stringify is doing the string-quoting, which means it's escaped properly for javascript:

image

Alternatively using parseInt(applier.item.id) could also work. It would return NaN for invalid payloads, and it's guaranteed to return a number.

Member

Right, I forgot to remove ' ' around JSON.stringify(...) 😅
I agree that considering the performance gains, it's okay to use eval in this case 👍
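
For reference, the escaping discussed in this thread can be checked in isolation. This is only a sketch: it uses new Function instead of eval to stay self-contained, and the generated function is a simplified stand-in for the one in the diff.

```typescript
// The payload from the example in this thread.
const maliciousId = "1': alert('hello') } //";

// Unsafe: the id is pasted between single quotes, so a quote in the id ends the literal early.
const unsafeSource = `(value) => ({ '${maliciousId}': value })`;
const unsafeBreaksOut = unsafeSource.includes("'1': alert('hello')");

// Safer: JSON.stringify() produces a complete, escaped, double-quoted string literal.
const safeSource = `(value) => ({ ${JSON.stringify(String(maliciousId))}: value })`;

// Build the function from the safe source only.
const build = new Function(`return ${safeSource};`) as () => (
  value: boolean,
) => Record<string, boolean>;
const generated = build();

// The whole payload ends up as an inert property name.
const generatedKeys = Object.keys(generated(true));
```

With the escaped version the payload is just a property name, and nothing from the id is executed.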

Comment on lines +386 to +402
for (let i = 0; i < rows.length; i += 1) {
const row = rows[i];

isRowMatchingFilters(row, undefined, result);

const isRowPassing = passFilterLogicSingle(
result.passingFilterItems,
result.passingQuickFilterValues,
params.filterModel,
apiRef,
);

if (isRowPassing) filteredRowsLookup.add(row.id);
}

// XXX: Is props.rows what we want?
// XXX: Handle footer rows
Contributor Author

TODO: Add autogenerated rows after the loop.

Member

Why not use (tree[GRID_ROOT_GROUP_ID]).children; here as before?

Contributor Author

See point 9

Member

Right. We still use the tree for non-flat filtering though.

Maybe it's worth giving another try to Map for dataRowIdToModelLookup? Looking at the related discussion we had in #9120 (comment), it should make it faster, so maybe there was something else causing the slowdown?

Member

This quick benchmark shows that at least property access should be faster when using Map: https://jsperf.app/petawa

Contributor Author

I benchmarked by replacing our state model with a Map inside d8. The complexity of JS engines makes it hard to predict performance from a microbenchmark alone, so we should benchmark the whole process further before switching. I'll start with the other points; once I'm done I can look further into using a Map.

@github-actions github-actions bot added the PR: out-of-date label Jun 1, 2023
@github-actions

github-actions bot commented Jun 1, 2023

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@@ -154,6 +160,38 @@ export function useGridParamsApi(apiRef: React.MutableRefObject<GridPrivateApiCo
[apiRef, getBaseCellParams],
);

const getRowValue = React.useCallback<GridParamsApi['getRowValue']>(
(row, colDef) => {
const id = getRowId ? getRowId(row) : row.id;
Member

It seems to produce the same result as getCellValue. What is the added value here?

Contributor Author

This is related to point 9: if we're iterating the row objects directly, we need a way to know the cell value without going through dataRowIdToModelLookup, which .getCellValue does. Big lookup objects are expensive, in particular if we need to access them in a loop that's already O(n) to start with.

Member

Would using Set for the dataRowIdToModelLookup help here?

Contributor Author

*Map

No, I've tried using Maps and it slows down the benchmark. The issue is really the indirection & additional memory accesses. The row object is basically a direct pointer to the row data. Passing row.id to a hashmap to get the row data again is wasted cycles.

Member

Is dataRowIdToModelLookup slow, because we change its shape many times (adding more and more keys to it) and therefore all its properties are "slow properties" (according to https://v8.dev/blog/fast-properties)?
Is this correct?

Member

Gotcha. I have a suggestion for getCellValue then: can we support optional row and column arguments and use them if available, and if not, fall back to getRow() and getColumn()?
What do you think?

Contributor Author

What would be the signature?

Member

getCellValue: <V extends any = any>(
  id: GridRowId,
  field: string,
  row?: GridRowModel,
  colDef?: GridColDef,
) => V;
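
One way this could behave, as a sketch only: the types below are trimmed-down stand-ins for the grid types, getRow()/getColumn() are hypothetical fallbacks backed by small Maps, and the return type is simplified to unknown. The counter shows that the lookups are skipped when both objects are passed in.

```typescript
// Simplified stand-ins for the grid types; the real ones carry many more fields.
type GridRowId = number | string;
type GridRowModel = { id: GridRowId; [field: string]: unknown };
type GridColDef = { field: string; valueGetter?: (row: GridRowModel) => unknown };

const rowLookup = new Map<GridRowId, GridRowModel>([
  [1, { id: 1, first: 'Ada', last: 'Lovelace' }],
]);
const columnLookup = new Map<string, GridColDef>([
  ['fullName', { field: 'fullName', valueGetter: (row) => `${row.first} ${row.last}` }],
]);

let lookups = 0;

// Hypothetical fallbacks, standing in for getRow() and getColumn().
function getRow(id: GridRowId): GridRowModel | undefined {
  lookups += 1;
  return rowLookup.get(id);
}
function getColumn(field: string): GridColDef | undefined {
  lookups += 1;
  return columnLookup.get(field);
}

// Uses the provided row/colDef when available, otherwise falls back to lookups.
function getCellValue(
  id: GridRowId,
  field: string,
  row?: GridRowModel,
  colDef?: GridColDef,
): unknown {
  const resolvedRow = row ?? getRow(id);
  const resolvedColDef = colDef ?? getColumn(field);
  if (resolvedRow === undefined) return undefined;
  return resolvedColDef?.valueGetter
    ? resolvedColDef.valueGetter(resolvedRow)
    : resolvedRow[field];
}

const viaFallback = getCellValue(1, 'fullName'); // performs two lookups
const lookupsAfterFallback = lookups;

const knownRow = rowLookup.get(1);
const knownColDef = columnLookup.get('fullName');
const viaArguments = getCellValue(1, 'fullName', knownRow, knownColDef); // performs none
const lookupsAfterArguments = lookups;
```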

Contributor Author

I have a feeling that we'll always have either both objects or none. How about we keep the API like it's implemented but leave it undocumented, and we can keep iterating & refactoring it before v7? I'll open a PR for the v7 filters shortly, we can continue the discussion there.

  if (!filterItem.value) {
    return null;
  }

  const valueAsBoolean = filterItem.value === 'true';
- return ({ value }): boolean => {
+ return (value, _, __, ___): boolean => {
Member

Why not skip the unused arguments?

Contributor Author

Old habit. V8 used to have an expensive adaptor for functions with mismatched arity, but the overhead is mostly gone now. I still like to write JavaScript code that is easy & predictable for engines to optimize, in particular for functions like this one that run in a hot loop.

Member

Thanks for the article, I have learned a lot about v8 lately 🎉

Contributor Author

Np, I love learning about JS engine internals, it helps a lot with performance optimization. I'm less familiar with SpiderMonkey though, and I know very little about JSC. But V8 is the most common engine by far.

If you're interested in reading more, those links are all interesting:

@oliviertassinari
Member

oliviertassinari commented Jun 3, 2023

From what I can quickly test on

PR: https://deploy-preview-9120--material-ui-x.netlify.app/x/react-data-grid/filtering/#header-filters vs. HEAD: https://mui.com/x/react-data-grid/filtering/header-filters/

with:

Screenshot 2023-06-04 at 00 18 33

In the logs of applyStrategyProcessor I see that we move from 500ms to 20ms 🥹. I suspect that we should reduce the debounce, it feels like a lot:

(should likely be configurable)

/**
* The debounce time in milliseconds.
* @default 500
*/
debounceMs?: number;

What I would personally explore for client-side filters is to run the filter in idle time (likely overkill for column filter but more relevant for quick search), yield to the main thread when it asks for it, have no debounce or a very small one, and cancel the previous filtering tasks when the input changes.
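
A possible shape for that idea, as a sketch under stated assumptions: filterInChunks() and runFilter() are hypothetical names, setTimeout stands in for an idle-time scheduler, and a plain token object stands in for cancellation. The generator yields between chunks so the driver can give control back to the main thread or drop the task.

```typescript
type Row = { id: number; name: string };

// Filters in chunks and yields progress between chunks so a scheduler can pause or cancel.
function* filterInChunks(
  rows: Row[],
  predicate: (row: Row) => boolean,
  chunkSize: number,
): Generator<number, Row[], void> {
  const result: Row[] = [];
  for (let start = 0; start < rows.length; start += chunkSize) {
    const end = Math.min(start + chunkSize, rows.length);
    for (let i = start; i < end; i += 1) {
      const row = rows[i];
      if (row !== undefined && predicate(row)) result.push(row);
    }
    yield end; // number of rows processed so far
  }
  return result;
}

// Hypothetical driver: runs one chunk per task and stops as soon as the token is cancelled.
function runFilter(
  task: Generator<number, Row[], void>,
  token: { cancelled: boolean },
  onDone: (rows: Row[]) => void,
): void {
  const step = (): void => {
    if (token.cancelled) return;
    const next = task.next();
    if (next.done) {
      onDone(next.value);
      return;
    }
    setTimeout(step, 0); // stand-in for an idle-time scheduler
  };
  step();
}

const demoRows: Row[] = [
  { id: 1, name: 'Samantha' },
  { id: 2, name: 'Oliver' },
  { id: 3, name: 'Liam' },
];
const containsAm = (row: Row): boolean => row.name.toLowerCase().includes('am');

// On each input change: cancel the previous token, then start a fresh task.
const currentToken = { cancelled: false };
let latestResult: Row[] = [];
runFilter(filterInChunks(demoRows, containsAm, 2), currentToken, (rows) => {
  latestResult = rows;
});
```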

Member

@MBilalShafi MBilalShafi left a comment

Excellent work 🎉

If you plan something for v7, maybe add that as a point or GH issue in this umbrella issue, the goal is to have all the planned changes in a single place.

@@ -13,6 +13,7 @@ import {
import {
Member

Apart from the specific discussions going on, the performance initiatives in this PR are really solid. A few of them were new to me and strengthened my understanding of how things operate under the hood, so a huge thanks. I think the direction we are going in is outstanding.
I'd really appreciate it if these changes could be extracted (as we already discussed) into smaller PRs to:

  1. Put a stronger spotlight on the optimization each change brings
  2. Measure the impact of each change separately (and possibly push it further)
  3. Assess which changes could apply to other areas of the application

1, 2: Points 1 and 2 are certainly the most impactful ones. We are going to have a workaround (.v7), but with v7 around the corner and the benefits applied internally, that seems like a step forward.
3: The non-cached selector option is a nice improvement too. Why cache something that can be accessed with simple dot (.) notation?
4: It seems the indirection (or the property check) was there to avoid re-initializing or to keep stable references, but I am not sure why it was wrapped inside the register function. It would be good to simplify it if that doesn't cause side effects.
5: I have never used eval this way. Maybe discuss this one separately and see if there are any loose ends?
6, 7: These are good. I think we should do them where possible in a non-breaking way and plan the remaining part for v7.

Contributor Author

Yes, let's continue the ongoing discussions in the threads above, I'll open separate PRs as we reach an agreement for each of those.

@romgrk
Contributor Author

romgrk commented Jun 12, 2023

TODO: getDefaultFilterModel() allocates a new value each time and is called on every iteration.

@romgrk
Contributor Author

romgrk commented Jan 17, 2024

Superseded by the split PRs.

@romgrk romgrk closed this Jan 17, 2024
@romgrk romgrk deleted the perf-filtering branch January 17, 2024 10:57