-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add NaN value count to content file #1803
Conversation
yyanyy
commented
Nov 21, 2020
- counterpart of API: add isNaN and notNaN predicates #1747
- will only add the new count in V2 table
- also added some tests for stats handling in manifest reader
public void testReadEntriesWithFilterAndSelectIncludesFullStats() throws IOException { | ||
ManifestFile manifest = writeManifest(1000L, FILE); | ||
try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, FILE_IO) | ||
.select(ImmutableSet.of("record_count")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I change this record_count
to something else it will result in NPE due to InclusiveMetrisEvaluator.eval
needing record count, however STATS_COLUMNS
in manifest reader doesn't have it. I know the reader normally will only be used internally so we don't expect to run into this often, but wonder if we want to ensure record_count
is always added when populating stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we do. Maybe we should do that in a separate update, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I'll create a separate pr for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR: #1820
core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java
Outdated
Show resolved
Hide resolved
@@ -272,6 +273,8 @@ public void testMetricsForNestedStructFields() throws IOException { | |||
assertCounts(6, 1L, 0L, metrics); | |||
assertBounds(6, BinaryType.get(), | |||
ByteBuffer.wrap("A".getBytes()), ByteBuffer.wrap("A".getBytes()), metrics); | |||
assertCounts(7, 1L, 0L, 1L, metrics); | |||
assertBounds(7, DoubleType.get(), Double.NaN, Double.NaN, metrics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are NaN values getting into the lower and upper bounds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was because I added NaN as the only value in this column during the creation of the record in buildNestedTestRecord
, and currently this will result in upper and lower bound being both NaN (similar behavior as in this test. I added this extra column in order to test NaN handling in metrics modes, and change to this test was a side effect. Do you want me to remove the bound check in this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess this will continue to happen until we ignore NaN values and keep track of the lower and upper bounds ourselves for Parquet and ORC?
This is fine for now, but I would want this to be correct eventually.
Nice work. Thanks @yyanyy! |