Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor/rename metrics related classes for NaN support #1829

Merged
merged 7 commits into from
Dec 17, 2020

Conversation

yyanyy
Copy link
Contributor

@yyanyy yyanyy commented Nov 25, 2020

  • separate refactoring/renaming changes in Core: Implement NaN counts in ORC #1790 to this PR to reduce the size of that PR
  • no behavior change in metrics related classes
  • added extra supports in StructInternalRow

@@ -115,12 +120,30 @@ public short getShort(int ordinal) {

@Override
public int getInt(int ordinal) {
return struct.get(ordinal, Integer.class);
Object integer = struct.get(ordinal, Object.class);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for changing this class is discussed in this thread

Copy link

@giovannifumarola giovannifumarola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Yan for the patch.

}

public static MetricsModes.MetricsMode getMetricsMode(Schema inputSchema, MetricsConfig metricsConfig, int fieldId) {
String columnName = inputSchema.findColumnName(fieldId);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inputschema can be null at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a validation check to ensure they are not null

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Let's add checks to fail when metricsConfig or inputSchema is null. This is a useful method so I think it is worth keeping it public, but we would ideally not fail with a NullPointerException. This is not called in a tight loop, so it should be fine to add the checks on each call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry I overwrote my own earlier change that addressed this when I renamed this file...


public static MetricsModes.MetricsMode getMetricsMode(Schema inputSchema, MetricsConfig metricsConfig, int fieldId) {
String columnName = inputSchema.findColumnName(fieldId);
return metricsConfig.columnMode(columnName);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for metrics config.

private MetricsUtil() {
}

public static Map<Integer, Long> getNanValueCounts(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: add javadoc.

* exceptions when they are accessed.
*/
public class ParquetFieldMetrics extends FieldMetrics {
public class NaNFieldMetrics extends FieldMetrics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we expect to add lower and upper bounds to this class (since Parquet and ORC do not ignore NaN values), I think this should probably be called FloatFieldMetrics. That way we don't need to rename when this adds support for lower and upper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good; I'll keep the javadoc as is since they currently reflect reality, and will update comments to mention tracking float fields when the class starts to track upper/lower bound

}

@Override
public ByteBuffer upperBound() {
throw new IllegalStateException(
"Shouldn't access upperBound() within ParquetFieldMetrics, as this metric is tracked by Parquet footer. ");
"Shouldn't access upperBound() within NaNOnlyFieldMetrics, as this metric is tracked in file statistics. ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This demonstrates the drawback to adding too much context to an exception message. The method and class should be in the stack trace and renames require updating this message. Let's keep the good error message, but remove the class and method names.

@yyanyy yyanyy mentioned this pull request Dec 16, 2020
@rdblue
Copy link
Contributor

rdblue commented Dec 17, 2020

Thanks for updating this, @yyanyy! I'll merge it.

@rdblue rdblue merged commit 425a45f into apache:master Dec 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants