Add NaN value count to content file #1803

Merged: 3 commits, Nov 25, 2020

yyanyy commented Nov 21, 2020

public void testReadEntriesWithFilterAndSelectIncludesFullStats() throws IOException {
ManifestFile manifest = writeManifest(1000L, FILE);
try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, FILE_IO)
.select(ImmutableSet.of("record_count"))
Contributor Author

If I change this record_count to something else, the test fails with an NPE: InclusiveMetricsEvaluator.eval needs the record count, but STATS_COLUMNS in the manifest reader doesn't include it. I know the reader is normally only used internally, so we don't expect to run into this often, but I wonder whether we want to ensure record_count is always added when populating stats.
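To make the suggestion concrete, here is a minimal sketch of the idea, not the actual ManifestReader code. The class `StatsProjection`, the method `columnsToRead`, and the contents of its `STATS_COLUMNS` list are hypothetical stand-ins; the only point is that record_count gets added whenever stats columns are pulled into the projection.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; it does not use Iceberg classes.
final class StatsProjection {
  // Assumed stand-in for the reader's stats column list (names are an assumption).
  static final List<String> STATS_COLUMNS = List.of(
      "value_counts", "null_value_counts", "nan_value_counts", "lower_bounds", "upper_bounds");

  static final String RECORD_COUNT = "record_count";

  // Returns the columns to read. When stats are needed (a row filter is present,
  // or the caller selected a stats column), record_count is always included so
  // that a metrics evaluator never sees a missing record count.
  static Set<String> columnsToRead(Collection<String> selected, boolean hasRowFilter) {
    Set<String> result = new LinkedHashSet<>(selected);
    boolean needsStats = hasRowFilter || selected.stream().anyMatch(STATS_COLUMNS::contains);
    if (needsStats) {
      result.addAll(STATS_COLUMNS);
      result.add(RECORD_COUNT);
    }
    return result;
  }
}
```

With this shape, selecting only `file_path` together with a row filter would still read record_count, while a plain select without a filter is left untouched.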

Contributor

Yes, I think we do. Maybe we should do that in a separate update, though.

Contributor Author

Sounds good, I'll create a separate PR for that.

Contributor Author

PR: #1820

@@ -272,6 +273,8 @@ public void testMetricsForNestedStructFields() throws IOException {
assertCounts(6, 1L, 0L, metrics);
assertBounds(6, BinaryType.get(),
ByteBuffer.wrap("A".getBytes()), ByteBuffer.wrap("A".getBytes()), metrics);
assertCounts(7, 1L, 0L, 1L, metrics);
assertBounds(7, DoubleType.get(), Double.NaN, Double.NaN, metrics);
Contributor

Why are NaN values getting into the lower and upper bounds?

Contributor Author

This is because I added NaN as the only value in this column when creating the record in buildNestedTestRecord, and currently that results in both the upper and lower bounds being NaN (similar to the behavior in this test). I added this extra column to test NaN handling in metrics modes, and the change to this test was a side effect. Do you want me to remove the bound check in this test?

Contributor

So I guess this will continue to happen until we ignore NaN values and keep track of the lower and upper bounds ourselves for Parquet and ORC?

This is fine for now, but I would want this to be correct eventually.
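As a rough illustration of what "ignore NaN values and keep track of the bounds ourselves" could look like, here is a minimal sketch. `DoubleColumnStats` is a hypothetical class, not part of the Parquet or ORC writers: it counts NaN values separately and only lets non-NaN values move the bounds, so a column containing only NaN ends up with a NaN count of 1 and no bounds.

```java
// Hypothetical accumulator for illustration; not an Iceberg, Parquet, or ORC class.
final class DoubleColumnStats {
  long valueCount;   // all values seen, including NaN
  long nanCount;     // NaN values only
  Double lowerBound; // null until a non-NaN value is seen
  Double upperBound; // null until a non-NaN value is seen

  // Records one value: NaN is counted but never affects the bounds.
  void add(double value) {
    valueCount++;
    if (Double.isNaN(value)) {
      nanCount++;
      return;
    }
    if (lowerBound == null || value < lowerBound) {
      lowerBound = value;
    }
    if (upperBound == null || value > upperBound) {
      upperBound = value;
    }
  }

  // Convenience factory that accumulates stats over the given values.
  static DoubleColumnStats of(double... values) {
    DoubleColumnStats stats = new DoubleColumnStats();
    for (double value : values) {
      stats.add(value);
    }
    return stats;
  }
}
```

For example, `DoubleColumnStats.of(Double.NaN)` leaves both bounds null instead of reporting NaN as the lower and upper bound, which is the case the test above currently asserts.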

rdblue merged commit b1296bc into apache:master on Nov 25, 2020

rdblue commented Nov 25, 2020

Nice work. Thanks @yyanyy!
