Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use bloom filter for evaluating dynamic filters on strings #24528

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

raunaqmorarka
Copy link
Member

@raunaqmorarka raunaqmorarka commented Dec 19, 2024

Description

Benchmark                               (filterSize)  (inputDataSet)  (inputNullChance)  (nonNullsSelectivity)  (nullsAllowed)   Mode  Cnt     Before Score       After Score  Units
BenchmarkDynamicPageFilter.filterPages           100  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  145.858 ± 4.541  590.506 ± 28.510  ops/s
BenchmarkDynamicPageFilter.filterPages          1000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  136.995 ± 2.395  596.036 ± 22.694  ops/s
BenchmarkDynamicPageFilter.filterPages         10000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  136.990 ± 5.284  594.118 ± 15.764  ops/s
BenchmarkDynamicPageFilter.filterPages        100000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  114.591 ± 7.307  587.445 ±  9.818  ops/s
BenchmarkDynamicPageFilter.filterPages       1000000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20   43.234 ± 1.621  578.800 ± 15.694  ops/s
BenchmarkDynamicPageFilter.filterPages       5000000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20   40.018 ± 2.245  464.153 ± 20.914  ops/s

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## General
* Improve performance of selective joins on strings. ({issue}`24528`)

bloom = new long[bloomSize];
bloomSizeMask = bloomSize - 1;
for (Slice value : values) {
long hashCode = XxHash64.hash(value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slice has a hashCode that is using XxHash64 already (and is memoized). Just value.hashCode()

}

@VisibleForTesting
public boolean contains(Slice data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is only used in test. Can we use contains(data, offset, length) in test instead?


private boolean contains(Slice data, int offset, int length)
{
long hashCode = XxHash64.hash(data, offset, length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just data.hashCode(offset, length);


private static long bloomMask(long hashCode)
{
// returned mask sets 3 bits based on portions of given hash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

            // returned mask sets 3 bits based on portions of given hash
            // Extract 38th to 43rd bits
            return (1L << ((hashCode >> 21) & 63)) |
                    // Extract 32nd to 37th bits
                    (1L << ((hashCode >> 27) & 63)) |
                    // Extract 26th to 31st bits
                    (1L << ((hashCode >> 33) & 63));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the convention in existing code has been to keep the operator at start of each new line
e.g.

TrinoFileStatus that = (TrinoFileStatus) o;
        return isDirectory == that.isDirectory
                && length == that.length
                && modificationTime == that.modificationTime
                && blockLocations.equals(that.blockLocations)
                && path.equals(that.path);

@@ -119,7 +119,7 @@ public Expression toPredicate(TupleDomain<Symbol> tupleDomain)
return IrUtils.combineConjuncts(toPredicateConjuncts(tupleDomain));
}

public List<Expression> toPredicateConjuncts(TupleDomain<Symbol> tupleDomain)
private List<Expression> toPredicateConjuncts(TupleDomain<Symbol> tupleDomain)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had made it public for usage in DynamicPageFilter, now I don't need it so made it private again

@@ -132,7 +132,7 @@ public List<Expression> toPredicateConjuncts(TupleDomain<Symbol> tupleDomain)
.collect(toImmutableList());
}

private Expression toPredicate(Domain domain, Reference reference)
public Expression toPredicate(Domain domain, Reference reference)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using it in DynamicPageFilter

@findinpath
Copy link
Contributor

Could you please add a high-level description about where the oprimizations proposed in this PR would apply.
I'm particularly interested in a SQL sketch where you've observer/foresee that the engine will perform better.

@raunaqmorarka raunaqmorarka force-pushed the df-bloom branch 3 times, most recently from 27017ce to 13b8ccd Compare December 20, 2024 06:20
BenchmarkDynamicPageFilter.filterPages
(filterSize)  (inputDataSet)  (inputNullChance)  (nonNullsSelectivity)  (nullsAllowed)   Mode  Cnt     Before Score       After Score  Units
         100  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  145.858 ± 4.541  590.506 ± 28.510  ops/s
        1000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  136.995 ± 2.395  596.036 ± 22.694  ops/s
       10000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  136.990 ± 5.284  594.118 ± 15.764  ops/s
      100000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  114.591 ± 7.307  587.445 ±  9.818  ops/s
     1000000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20   43.234 ± 1.621  578.800 ± 15.694  ops/s
     5000000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20   40.018 ± 2.245  464.153 ± 20.914  ops/s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants