BloomFilter and float point is ambiguous #407

Open
asfimport opened this issue Mar 13, 2023 · 5 comments

Comments

@asfimport

Currently, Parquet can use a BloomFilter for any physical type. However, when a BloomFilter is applied to floating-point values, two things are ambiguous:

  1. What do +0 and -0 mean? Are they equal?

  2. Should qNaN and sNaN be written to the BloomFilter? Are they equal?

Reporter: Xuwei Fu / @mapleFU

Note: This issue was originally created as PARQUET-2255. Please see the migration documentation for further details.

@asfimport

Gang Wu / @wgtmac:
These are good questions. Let me try to answer them from the perspective of Java:

  1. +0.0 and -0.0 are different values, but they compare equal on the Java side. https://stackoverflow.com/a/24238344

  2. Java does not have a signaling NaN, and it has only a single NaN representation. https://stackoverflow.com/a/25051746
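To make the Java behaviour concrete, here is a minimal sketch using only `java.lang.Double`. The class and method names are made up for illustration. It shows that the answer to "are they equal?" depends on whether the primitive `==` or the boxed `Double.equals` is used.

```java
final class FloatEqualityDemo {
    // Primitive comparison follows IEEE 754: +0.0 == -0.0, and NaN != NaN.
    static boolean primitiveEquals(double a, double b) {
        return a == b;
    }

    // Double.equals compares Double.doubleToLongBits, so both results flip.
    static boolean boxedEquals(double a, double b) {
        return Double.valueOf(a).equals(Double.valueOf(b));
    }

    public static void main(String[] args) {
        System.out.println(primitiveEquals(0.0, -0.0));              // true
        System.out.println(boxedEquals(0.0, -0.0));                  // false
        System.out.println(primitiveEquals(Double.NaN, Double.NaN)); // false
        System.out.println(boxedEquals(Double.NaN, Double.NaN));     // true

        // The two zeros differ only in the sign bit.
        System.out.println(Long.toHexString(Double.doubleToRawLongBits(0.0)));  // 0
        System.out.println(Long.toHexString(Double.doubleToRawLongBits(-0.0))); // 8000000000000000
    }
}
```

Since a bloom filter hashes the stored bytes, the two zeros would hash differently unless they are handled explicitly.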

To support better interoperability, I think we should do two things:

  • If +0.0 is inserted into the bloom filter, -0.0 should be inserted as well, and vice versa.
  • No NaN should be inserted into the bloom filter. I doubt any user really wants to test for the existence of NaN.
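The two rules above could be sketched roughly as follows. This is a toy filter, not the Parquet implementation: `ToyDoubleBloomFilter`, its two-probe layout, and the SplitMix64-style `mix` hash are all assumptions made for illustration. Answering "maybe" for a NaN probe is also an assumption, chosen so that a filter which never stores NaN cannot wrongly rule it out.

```java
import java.util.BitSet;

// Hypothetical toy bloom filter over doubles; not the Parquet API.
final class ToyDoubleBloomFilter {
    private final BitSet bits;
    private final int numBits;

    ToyDoubleBloomFilter(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    // SplitMix64-style mixer, used only as a stand-in hash function.
    private static long mix(long x) {
        x += 0x9e3779b97f4a7c15L;
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    private int probe1(long raw) {
        return (int) Long.remainderUnsigned(mix(raw), numBits);
    }

    private int probe2(long raw) {
        return (int) Long.remainderUnsigned(mix(mix(raw)), numBits);
    }

    private void insertBits(long raw) {
        bits.set(probe1(raw));
        bits.set(probe2(raw));
    }

    void insert(double v) {
        if (Double.isNaN(v)) {
            return; // rule 2: NaN is never written to the filter
        }
        if (v == 0.0) { // true for both +0.0 and -0.0
            // rule 1: inserting either zero inserts both bit patterns
            insertBits(Double.doubleToRawLongBits(0.0));
            insertBits(Double.doubleToRawLongBits(-0.0));
            return;
        }
        insertBits(Double.doubleToRawLongBits(v));
    }

    boolean mightContain(double v) {
        if (Double.isNaN(v)) {
            return true; // assumption: NaN is not tracked, so answer "maybe"
        }
        long raw = Double.doubleToRawLongBits(v);
        return bits.get(probe1(raw)) && bits.get(probe2(raw));
    }
}
```

With this sketch, inserting `-0.0` makes both `mightContain(0.0)` and `mightContain(-0.0)` return true, while inserting NaN leaves the bit set untouched.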

@asfimport

Gabor Szadovszky / @gszadovszky:
Bloom filters are for searching for exact values. Exact equality checks on floating-point numbers are usually a code smell; the usual advice is to check whether the difference is below an epsilon value instead. I wonder whether there is a real use case for searching for an exact floating-point number. Maybe disabling bloom filters completely for FP numbers is the simplest choice, and it probably won't bother anyone.
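As a small illustration of why exact matching on FP values is fragile, here is a sketch comparing exact equality with an epsilon check. The helper names and the `1e-9` tolerance are arbitrary choices for the example.

```java
final class EpsilonCompareDemo {
    // Epsilon comparison: equal if the absolute difference is small enough.
    static boolean nearlyEqual(double a, double b, double epsilon) {
        return Math.abs(a - b) <= epsilon;
    }

    // Bit-level comparison, which is what a hash-based filter effectively sees.
    static boolean sameBits(double a, double b) {
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2; // 0.30000000000000004 in double arithmetic

        System.out.println(sum == 0.3);                   // false
        System.out.println(sameBits(sum, 0.3));           // false
        System.out.println(nearlyEqual(sum, 0.3, 1e-9));  // true
    }
}
```

A bloom filter can only answer the bit-level question, so a lookup for `0.3` would not match a stored `0.1 + 0.2`, even though an epsilon comparison treats them as equal.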

If we still want to handle FP bloom filters, I agree with @wgtmac's proposal. (It is similar to the approach we implemented for min/max values.) Keep in mind that we need to handle the case where someone wants to filter on NaN.

@asfimport

Gang Wu / @wgtmac:
I think there is a similar issue in the dictionary encoding of floating-point types. @gszadovszky

@asfimport

Gabor Szadovszky / @gszadovszky:
But we build the dictionary for encoding, not for filtering. We should not add anything to it beyond what is in the pages, so any special handling belongs on the read path.

Maybe we do not need to handle +0.0 and -0.0 differently from the other values. (We needed to handle them separately for min/max values because the comparison is not trivial and there were actual issues.) If someone deals with FP numbers they should know about the difference between +0.0 and -0.0.

Because the FP spec allows multiple NaN values (even though Java uses one actual bit pattern for it), we need to avoid using the Bloom filter in this case. The dictionary is a different thing because we deserialize it to Java Double/Float values in a Set, so we will have one NaN value that is the very same one we are searching for. (It is more for the other implementations to deal with NaN if the language has several NaN values.)
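A rough sketch of the difference between the two cases, using only the JDK. `NaNPayloadDemo`, `dictionaryAsSet`, and `canUseBloomFilter` are hypothetical names, and the read-path guard is just one possible reading of "avoid using the Bloom filter in this case"; it is not taken from parquet-mr.

```java
import java.util.HashSet;
import java.util.Set;

final class NaNPayloadDemo {
    // A quiet NaN whose payload differs from the canonical Double.NaN pattern.
    static final double OTHER_NAN = Double.longBitsToDouble(0x7ff8000000000001L);

    // Mimics deserializing dictionary values into a Set of boxed doubles.
    static Set<Double> dictionaryAsSet(double... values) {
        Set<Double> set = new HashSet<>();
        for (double v : values) {
            set.add(v); // Double.equals/hashCode collapse every NaN to one value
        }
        return set;
    }

    // Hypothetical read-path guard: skip the bloom filter when searching for NaN,
    // because a writer may have stored any of several NaN bit patterns.
    static boolean canUseBloomFilter(double searchValue) {
        return !Double.isNaN(searchValue);
    }

    public static void main(String[] args) {
        // The raw bit patterns can differ between NaN values ...
        System.out.println(Long.toHexString(Double.doubleToRawLongBits(Double.NaN)));
        System.out.println(Long.toHexString(Double.doubleToRawLongBits(OTHER_NAN)));

        // ... but a Set<Double> sees a single NaN.
        Set<Double> dictionary = dictionaryAsSet(1.0, Double.NaN, OTHER_NAN);
        System.out.println(dictionary.size());               // 2
        System.out.println(dictionary.contains(Double.NaN)); // true

        System.out.println(canUseBloomFilter(Double.NaN));   // false
    }
}
```

The Set-based dictionary lookup works for NaN because of how `Double.equals` is defined, whereas a byte-hashing filter has no such canonicalization unless the writer and reader agree on one.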
