Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: add contains_nan to field_summary #1872

Merged
merged 6 commits into from
Feb 5, 2021

Conversation

yyanyy
Copy link
Contributor

@yyanyy yyanyy commented Dec 3, 2020

Will update manifest evaluator to use this new field after either this or #1747 is merged.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -72,10 +73,11 @@ public GenericPartitionFieldSummary(Schema avroSchema) {
}
}

public GenericPartitionFieldSummary(boolean containsNull, ByteBuffer lowerBound,
public GenericPartitionFieldSummary(boolean containsNull, Boolean containsNaN, ByteBuffer lowerBound,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use Boolean instead of the primitive here? Is there a place that constructs these without knowing whether there are NaN values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for a test case that provide null here to test backward compatibility. I guess I'll mark this constructor as @VisibleForTesting and create another public one that accepts boolean?

@rdblue
Copy link
Contributor

rdblue commented Dec 6, 2020

This looks about ready to me. Please update now that #1747 is in and we'll get it merged. Thank you!

@yyanyy yyanyy force-pushed the nan_partition_field_summary branch from c59ec69 to 1277887 Compare December 7, 2020 22:47
* Returns true if at least one data file in the manifest has a nan value for the field.
* Null if this information doesn't exist.
*/
Boolean containsNaN();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for this to have a default of null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think once this change is released, this field will be populated by default, and making it nullable is mostly for backward compatibility; also the unassigned value for this field is null in both the actual implementation and test implementation, so I think defaulting to null here might not help much?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we don't expect other implementations of this interface, it is a fair point that we would accept other implementations in any method that uses ManifestFile as an argument type. Adding the default would ensure backward-compatiblity in case anyone has an alternative implementation, and is really low cost. So I'd opt to add it. Thanks @holdenk!

@@ -203,8 +203,12 @@ default boolean hasDeletedFiles() {
/**
* Returns true if at least one data file in the manifest has a nan value for the field.
* Null if this information doesn't exist.
* <p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: do should we add a closing

in the docstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to this I think <p> itself might be enough?

If you have more than one paragraph in the doc comment, separate the paragraphs with a <p> paragraph tag, as shown

* Default to return null to ensure backward compatibility.
*/
default Boolean containsNaN() {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just going back to the discussion of default to make sure my understanding is correct, if the returned value is null, does it mean the manifest file would suggest the data file does not contain NaN?

This behavior seems consistent with the default false in PartitionSummary.java, but logically speaking it seems like returning true as default is more reasonable to suggest a file always might contain NaN. What is the take here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think PartitionSummary is populated when trying to write a file, so if the code contains this change, it will populate and write NaN boolean correctly. I guess you might be thinking about GenericPartitionFieldSummary when you mention returning true as default since that's the class to be constructed when reading from avro files that may not contain this info?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, make sense, thanks for the explanation.

@rdblue
Copy link
Contributor

rdblue commented Feb 3, 2021

@yyanyy, looks mostly good. I think we should be a little bit more careful in the evaluator, but otherwise there's nothing other than minor changes needed. Thanks!

// containsNull encodes whether at least one partition value is null, lowerBound is null if all partition values
// are null; in case bounds don't include NaN value, containsNaN needs to be checked against.
return summary.containsNull() && summary.lowerBound() == null &&
summary.containsNaN() != null && !summary.containsNaN();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

containsNaN could only be true if the type is float or double. Should this check the column type before checking containsNaN? Then we would be able to ignore it for most cases. For example, the test below uses a string column but has no containsNaN values, so it doesn't catch that all values are null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry I forgot to address this comment... Will update

@rdblue rdblue merged commit 8a84715 into apache:master Feb 5, 2021
@rdblue
Copy link
Contributor

rdblue commented Feb 5, 2021

Thanks @yyanyy! Looks great. I merged it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants