PARQUET-1660: align Bloom filter implementation with format #686

chenjunjiedada · 2019-09-26T09:25:22Z

Make sure you have checked all steps below.

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-XXX
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All public functions and classes in the PR contain Javadoc that explains what they do

jbapple · 2019-10-05T04:03:39Z

...column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java

-    int bucketIndex = (int)(hash >> 32) & (bitset.length / BYTES_PER_BLOCK - 1);
+    long numBlocks = bitset.length / BYTES_PER_BLOCK;
+    long lowHash = hash >>> 32;
+    int blockIndex = (int)(lowHash * numBlocks >> 32);


What happens if this product overflows? How does that behavior compare to this line operating on unsigned values in C++, which cannot overflow on multiplication?

chenjunjiedada · 2019-10-05 (edited)

The number of blocks is the byte length right-shifted by 5 bits first, so its value is less than 1<<27 and the product cannot overflow here.
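The bound in the reply above can be checked with a small sketch. This is an illustration, not the code from the PR: the class name `BlockIndexSketch` and the helper `worstCaseProduct` are hypothetical, and `BYTES_PER_BLOCK = 32` is an assumption inferred from the "right shift 5 bits" remark. It assumes the bitset is a Java byte array, so its length is at most `Integer.MAX_VALUE`.

```java
final class BlockIndexSketch {
    // Assumed from the "right shift 5 bits" remark: 32 bytes per block.
    static final int BYTES_PER_BLOCK = 32;

    // Maps the upper 32 bits of a 64-bit hash onto [0, numBlocks).
    static int blockIndex(long hash, int bitsetLengthInBytes) {
        long numBlocks = bitsetLengthInBytes / BYTES_PER_BLOCK;
        long upperHash = hash >>> 32; // unsigned shift, so 0 <= upperHash < 2^32
        return (int) ((upperHash * numBlocks) >> 32);
    }

    // Largest product the expression above can form for an int-sized byte array.
    static long worstCaseProduct() {
        long maxBlocks = Integer.MAX_VALUE / BYTES_PER_BLOCK; // 2^26 - 1
        long maxUpperHash = 0xFFFFFFFFL;                      // 2^32 - 1
        // multiplyExact throws ArithmeticException if the signed 64-bit product overflows.
        return Math.multiplyExact(maxUpperHash, maxBlocks);
    }

    public static void main(String[] args) {
        System.out.println("worst-case product: " + worstCaseProduct());
        System.out.println("index for all-ones hash, 1024 bytes: " + blockIndex(-1L, 1024));
    }
}
```

Because both factors are non-negative and their product stays well below 2^63, the signed `long` multiplication gives the same result as the unsigned 64-bit multiplication in C++. Unlike the mask in the removed line, the multiply-and-shift mapping does not require the number of blocks to be a power of two.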

jbapple · 2019-12-16T03:34:32Z

LGTM.

Fokko

LGTM, do you already use this internally?

Fokko · 2019-12-21T11:30:29Z

parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java

@@ -221,6 +231,12 @@ public boolean getPageWriteChecksumEnabled() {
    return bloomFilterExpectedDistinctNumbers;
  }

+  public Set<String> getBloomFilterColumns() {return bloomFilterColumns;}


Can we put the return on the next line, similar to getMaxBloomFilterBytes?
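A minimal sketch of the layout being requested, with the method body on its own line. `PropertiesSketch` is a hypothetical stand-in for `ParquetProperties`, and its constructor is an assumption made only so the example compiles; the getter name and field come from the diff above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for ParquetProperties, reduced to the one getter under review.
final class PropertiesSketch {
    private final Set<String> bloomFilterColumns;

    // Assumed constructor: copies the column names into an unmodifiable set.
    PropertiesSketch(Set<String> bloomFilterColumns) {
        this.bloomFilterColumns = Collections.unmodifiableSet(new HashSet<>(bloomFilterColumns));
    }

    // The return statement sits on its own line, matching the other getters.
    public Set<String> getBloomFilterColumns() {
        return bloomFilterColumns;
    }
}
```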

chenjunjiedada · 2019-12-28T13:48:18Z

@Fokko, @gszadovszky, could you take another look? Is it close to merging?

chenjunjiedada · 2020-01-02T01:58:26Z

@Fokko, I forgot to answer your last question. Yes, we already use it internally in some cases.

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java

chenjunjiedada · 2020-01-07T07:26:57Z

@gszadovszky, I updated the code. Would you please take another look?

chenjunjiedada · 2020-01-07T09:11:02Z

@gszadovszky, thanks for your review. I'd like to rebase this onto master before merging, so it may need another look from you afterwards. Thanks in advance!

chenjunjiedada · 2020-01-07T09:26:49Z

Looks like I need to merge instead of rebasing. Let me fix this. Sorry for the confusion.

chenjunjiedada · 2020-01-07T10:01:24Z

@gszadovszky @Fokko, I just realized that you may want to squash this PR, so it would be better to submit a separate PR for merging master. Please merge this one at your convenience.

Fokko · 2020-01-07T10:03:33Z

So you'll create a new PR from bloom-filter to master? I'm fine with that. @gszadovszky WDYT?

chenjunjiedada · 2020-01-07T10:08:43Z

@Fokko, I may have put it the wrong way: it is a PR to merge master into the bloom-filter branch. I have already done that on my local machine and can submit it if you prefer to squash one more merge PR.

gszadovszky · 2020-01-07T10:45:03Z

Let's squash+merge this PR to the feature branch first. Then, check the merge PR and push it to the feature branch as well.
If everything is ready, create a PR from the feature branch to master. There shall be no conflicts. After the final review and tests succeed, squash+merge the whole thing to master.

chenjunjiedada · 2020-01-07T11:15:54Z

SGTM

* PARQUET-1328: Add Bloom filter reader and writer (#587)
* PARQUET-1516: Store Bloom filters near to footer (#608)
* PARQUET-1391: Integrate Bloom filter logic (#619)
* PARQUET-1660: align Bloom filter implementation with format (#686)

PARQUET-1660: align Bloom filter implementation with format

0936a94

jbapple reviewed Oct 5, 2019

update format to 2.7.0

a55e7d1

Fokko approved these changes Dec 21, 2019

gszadovszky requested changes Jan 6, 2020

gszadovszky approved these changes Jan 7, 2020

chenjunjiedada force-pushed the bloom-filter branch 2 times, most recently from 81dd09b to 039ffdc, January 7, 2020 09:15

chenjunjiedada added 2 commits January 7, 2020 17:37

address review comments

0046546

address comments

b3d54e8

chenjunjiedada force-pushed the bloom-filter branch from 039ffdc to b3d54e8, January 7, 2020 09:45

gszadovszky merged commit ba28686 into apache:bloom-filter Jan 7, 2020
