-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: add contains_nan to field_summary #1872
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -72,10 +73,11 @@ public GenericPartitionFieldSummary(Schema avroSchema) { | |||
} | |||
} | |||
|
|||
public GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound, | |||
public GenericPartitionFieldSummary(boolean containsNull, Boolean containsNaN, ByteBuffer lowerBound, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use Boolean
instead of the primitive here? Is there a place that constructs these without knowing whether there are NaN
values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for a test case that provide null here to test backward compatibility. I guess I'll mark this constructor as @VisibleForTesting
and create another public one that accepts boolean
?
This looks about ready to me. Please update now that #1747 is in and we'll get it merged. Thank you! |
c59ec69
to
1277887
Compare
* Returns true if at least one data file in the manifest has a nan value for the field. | ||
* Null if this information doesn't exist. | ||
*/ | ||
Boolean containsNaN(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense for this to have a default of null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think once this change is released, this field will be populated by default, and making it nullable is mostly for backward compatibility; also the unassigned value for this field is null in both the actual implementation and test implementation, so I think defaulting to null here might not help much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we don't expect other implementations of this interface, it is a fair point that we would accept other implementations in any method that uses ManifestFile
as an argument type. Adding the default would ensure backward-compatiblity in case anyone has an alternative implementation, and is really low cost. So I'd opt to add it. Thanks @holdenk!
@@ -203,8 +203,12 @@ default boolean hasDeletedFiles() { | |||
/** | |||
* Returns true if at least one data file in the manifest has a nan value for the field. | |||
* Null if this information doesn't exist. | |||
* <p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: do should we add a closing
in the docstring?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to this I think <p>
itself might be enough?
If you have more than one paragraph in the doc comment, separate the paragraphs with a <p> paragraph tag, as shown
* Default to return null to ensure backward compatibility. | ||
*/ | ||
default Boolean containsNaN() { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just going back to the discussion of default to make sure my understanding is correct, if the returned value is null, does it mean the manifest file would suggest the data file does not contain NaN?
This behavior seems consistent with the default false
in PartitionSummary.java
, but logically speaking it seems like returning true as default is more reasonable to suggest a file always might contain NaN. What is the take here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think PartitionSummary
is populated when trying to write a file, so if the code contains this change, it will populate and write NaN boolean correctly. I guess you might be thinking about GenericPartitionFieldSummary
when you mention returning true as default since that's the class to be constructed when reading from avro files that may not contain this info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, make sense, thanks for the explanation.
api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/GenericPartitionFieldSummary.java
Outdated
Show resolved
Hide resolved
@yyanyy, looks mostly good. I think we should be a little bit more careful in the evaluator, but otherwise there's nothing other than minor changes needed. Thanks! |
// containsNull encodes whether at least one partition value is null, lowerBound is null if all partition values | ||
// are null; in case bounds don't include NaN value, containsNaN needs to be checked against. | ||
return summary.containsNull() && summary.lowerBound() == null && | ||
summary.containsNaN() != null && !summary.containsNaN(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
containsNaN
could only be true if the type is float
or double
. Should this check the column type before checking containsNaN
? Then we would be able to ignore it for most cases. For example, the test below uses a string column but has no containsNaN
values, so it doesn't catch that all values are null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry I forgot to address this comment... Will update
Thanks @yyanyy! Looks great. I merged it. |
Will update manifest evaluator to use this new field after either this or #1747 is merged.