Use bloom filter for evaluating dynamic filters on strings #24528

Open

raunaqmorarka wants to merge 1 commit into trinodb:master from raunaqmorarka:df-bloom

+355 −39

Member

raunaqmorarka commented Dec 19, 2024

Description

Benchmark                               (filterSize)  (inputDataSet)  (inputNullChance)  (nonNullsSelectivity)  (nullsAllowed)   Mode  Cnt     Before Score       After Score  Units
BenchmarkDynamicPageFilter.filterPages           100  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  145.858 ± 4.541  590.506 ± 28.510  ops/s
BenchmarkDynamicPageFilter.filterPages          1000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  136.995 ± 2.395  596.036 ± 22.694  ops/s
BenchmarkDynamicPageFilter.filterPages         10000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  136.990 ± 5.284  594.118 ± 15.764  ops/s
BenchmarkDynamicPageFilter.filterPages        100000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  114.591 ± 7.307  587.445 ±  9.818  ops/s
BenchmarkDynamicPageFilter.filterPages       1000000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20   43.234 ± 1.621  578.800 ± 15.694  ops/s
BenchmarkDynamicPageFilter.filterPages       5000000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20   40.018 ± 2.245  464.153 ± 20.914  ops/s

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## General
* Improve performance of selective joins on strings. ({issue}`24528`)

cla-bot bot added the cla-signed label

raunaqmorarka added the performance label

raunaqmorarka requested review from lukasz-stec, martint, dain, sopel39 and Dith3r

December 19, 2024 11:21

raunaqmorarka force-pushed the df-bloom branch from 137c6e9 to 0edc826 Compare

December 19, 2024 11:28

wendigo reviewed

core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java

+                          bloom = new long[bloomSize];
+                          bloomSizeMask = bloomSize - 1;
+                          for (Slice value : values) {
+                              long hashCode = XxHash64.hash(value);

Contributor

wendigo Dec 19, 2024

Slice has a hashCode that is using XxHash64 already (and is memoized). Just value.hashCode()

core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java

+                      }
+                      @VisibleForTesting
+                      public boolean contains(Slice data)

Contributor

wendigo Dec 19, 2024

This method is only used in tests. Can we use contains(data, offset, length) in the tests instead?

core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java

+                      private boolean contains(Slice data, int offset, int length)
+                      {
+                          long hashCode = XxHash64.hash(data, offset, length);

Contributor

wendigo Dec 19, 2024

just data.hashCode(offset, length);

core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java

+                      private static long bloomMask(long hashCode)
+                      {
+                          // returned mask sets 3 bits based on portions of given hash

Contributor

wendigo Dec 19, 2024

nit:

            // returned mask sets 3 bits based on portions of given hash
            // Extract 38th to 43rd bits
            return (1L << ((hashCode >> 21) & 63)) |
                    // Extract 32nd to 37th bits
                    (1L << ((hashCode >> 27) & 63)) |
                    // Extract 26th to 31st bits
                    (1L << ((hashCode >> 33) & 63));

Member Author

raunaqmorarka Dec 19, 2024

I think the convention in existing code has been to keep the operator at the start of each new line, e.g.

TrinoFileStatus that = (TrinoFileStatus) o;
        return isDirectory == that.isDirectory
                && length == that.length
                && modificationTime == that.modificationTime
                && blockLocations.equals(that.blockLocations)
                && path.equals(that.path);
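
For readers following this thread, here is a minimal, self-contained sketch of a bloom filter that sets 3 bits per hash inside a single `long` word, using the same shifts as the `bloomMask` snippet quoted above. It is not the PR's implementation. The class name `BloomFilterSketch`, the rule for choosing the word (low bits of the hash), and the stand-in hash function (FNV-1a followed by the MurmurHash3 finalizer, used only because `XxHash64` is not part of the JDK) are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical illustration only; names and hashing are not taken from the PR.
final class BloomFilterSketch
{
    private final long[] bloom;
    private final int bloomSizeMask;

    BloomFilterSketch(List<String> values, int bloomSize)
    {
        // A power-of-two size lets us replace modulo with a bit mask.
        if (bloomSize <= 0 || Integer.bitCount(bloomSize) != 1) {
            throw new IllegalArgumentException("bloomSize must be a positive power of two");
        }
        bloom = new long[bloomSize];
        bloomSizeMask = bloomSize - 1;
        for (String value : values) {
            long hashCode = hash(value.getBytes(StandardCharsets.UTF_8));
            bloom[wordIndex(hashCode)] |= bloomMask(hashCode);
        }
    }

    boolean contains(String value)
    {
        long hashCode = hash(value.getBytes(StandardCharsets.UTF_8));
        long mask = bloomMask(hashCode);
        // All 3 bits must be present in the selected word.
        return (bloom[wordIndex(hashCode)] & mask) == mask;
    }

    // Assumption: the low bits of the hash select the 64-bit word.
    private int wordIndex(long hashCode)
    {
        return (int) (hashCode & bloomSizeMask);
    }

    // Sets up to 3 bits, each chosen by a 6-bit slice of the hash.
    // The slices can coincide, in which case fewer than 3 bits are set.
    static long bloomMask(long hashCode)
    {
        return (1L << ((hashCode >> 21) & 63))
                | (1L << ((hashCode >> 27) & 63))
                | (1L << ((hashCode >> 33) & 63));
    }

    // Stand-in 64-bit hash: FNV-1a, then the MurmurHash3 fmix64 finalizer.
    static long hash(byte[] data)
    {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }
}
```

In this layout a lookup reads one array element and performs one mask comparison, regardless of how many values were inserted. `contains` always returns true for an inserted value; for other values it can occasionally return true as well (a false positive), which is tolerable when an exact check follows later.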

core/trino-main/src/main/java/io/trino/sql/planner/DomainTranslator.java

                       return IrUtils.combineConjuncts(toPredicateConjuncts(tupleDomain));
                   }
-                  public List<Expression> toPredicateConjuncts(TupleDomain<Symbol> tupleDomain)
+                  private List<Expression> toPredicateConjuncts(TupleDomain<Symbol> tupleDomain)

Contributor

wendigo Dec 19, 2024

why?

Member Author

raunaqmorarka Dec 19, 2024

I had made it public for use in DynamicPageFilter. I no longer need that, so I made it private again.

core/trino-main/src/main/java/io/trino/sql/planner/DomainTranslator.java

                               .collect(toImmutableList());
                   }
-                  private Expression toPredicate(Domain domain, Reference reference)
+                  public Expression toPredicate(Domain domain, Reference reference)

Contributor

wendigo Dec 19, 2024

why?

Member Author

raunaqmorarka Dec 19, 2024

I'm using it in DynamicPageFilter

Contributor

findinpath commented Dec 19, 2024

Could you please add a high-level description of where the optimizations proposed in this PR would apply?
I'm particularly interested in a SQL sketch of a case where you've observed, or foresee, that the engine will perform better.

raunaqmorarka force-pushed the df-bloom branch 3 times, most recently from 27017ce to 13b8ccd Compare

December 20, 2024 06:20


          Use bloom filter for evaluating dynamic filters on strings

d8b44ff

BenchmarkDynamicPageFilter.filterPages
(filterSize)  (inputDataSet)  (inputNullChance)  (nonNullsSelectivity)  (nullsAllowed)   Mode  Cnt     Before Score       After Score  Units
         100  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  145.858 ± 4.541  590.506 ± 28.510  ops/s
        1000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  136.995 ± 2.395  596.036 ± 22.694  ops/s
       10000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  136.990 ± 5.284  594.118 ± 15.764  ops/s
      100000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20  114.591 ± 7.307  587.445 ±  9.818  ops/s
     1000000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20   43.234 ± 1.621  578.800 ± 15.694  ops/s
     5000000  VARCHAR_RANDOM               0.05                    0.2           false  thrpt   20   40.018 ± 2.245  464.153 ± 20.914  ops/s

raunaqmorarka force-pushed the df-bloom branch from 13b8ccd to d8b44ff Compare

December 20, 2024 07:13

Reviewers

wendigo left review comments

Awaiting requested review from lukasz-stec

Awaiting requested review from martint

Awaiting requested review from dain

Awaiting requested review from sopel39

Awaiting requested review from Dith3r

Labels

cla-signed performance