-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1660: align Bloom filter implementation with format #686
Conversation
int bucketIndex = (int)(hash >> 32) & (bitset.length / BYTES_PER_BLOCK - 1); | ||
long numBlocks = bitset.length / BYTES_PER_BLOCK; | ||
long lowHash = hash >>> 32; | ||
int blockIndex = (int)(lowHash * numBlocks >> 32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if this product overflows? How does that behavior compare to this line operating on unsigned values in C++, which cannot overflow on multiplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number of blocks right shift 5 bits at first, so its value should be less than 1<<27
and the overflow should not happen here.
LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, do you already use this internally?
@@ -221,6 +231,12 @@ public boolean getPageWriteChecksumEnabled() { | |||
return bloomFilterExpectedDistinctNumbers; | |||
} | |||
|
|||
public Set<String> getBloomFilterColumns() {return bloomFilterColumns;} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put the return on the next line, similar to getMaxBloomFilterBytes
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
Show resolved
Hide resolved
...column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java
Show resolved
Hide resolved
@Fokko, @gszadovszky, could you help to have another look? Is it close to merging? |
@Fokko, forgot your last question. Yes, we already use it internally in some cases. |
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
Outdated
Show resolved
Hide resolved
@gszadovszky, I updated the code, would you please take another look? |
@gszadovszky , thanks for your review. I'd like to rebase this to master before merging. Maybe it needs your another look again. Thanks in advance! |
81dd09b
to
039ffdc
Compare
Looks like I need to merge instead of rebasing. Let me fix this. Sorry for confusing. |
039ffdc
to
b3d54e8
Compare
@gszadovszky @Fokko, I just realized that you may need squashing for this PR, so it would be better to submit a separated PR for merging master. So please help to merge this in your convenience. |
So you'll create a new PR from |
@Fokko, I may put in the wrong way, but it is a PR to merge master to bloom-filter branch. I have done that job local machine and can submit that if you prefer to squash one more merging PR. |
Let's squash+merge this PR to the feature branch first. Then, check the merge PR and push it to the feature branch as well. |
SGTM |
* PARQUET-1328: Add Bloom filter reader and writer (apache#587) * PARQUET-1516: Store Bloom filters near to footer (apache#608) * PARQUET-1391: Integrate Bloom filter logic (apache#619) * PARQUET-1660: align Bloom filter implementation with format (apache#686)
Make sure you have checked all steps below.
Jira
Tests
Commits
Documentation