
Core: Implement NaN counts in ORC #1790

Merged: 2 commits into apache:master, Feb 3, 2021
Conversation

@yyanyy (Contributor) commented Nov 19, 2020:

  • Similar to Add NaN counter to Metrics and implement in Parquet writers #1641
  • Renamed a few classes to be shared between Parquet and ORC
  • Also noticed an issue where the Spark implementation of TestMergingMetrics didn't handle Date/Timestamp types as expected: they have no matching type in StructInternalRow.get, so the logic defaults the value to null. This didn't surface earlier because the test case declared them as optional fields; the ORC Spark writer handles null values differently and revealed the problem.

Edit: To reduce the size of this PR I separated some of the changes into #1829, and will rebase this on top of #1829 once it is merged. This PR still contains some of the changes from #1829 so that the code compiles and the tests run.
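For context, here is a minimal sketch of the kind of per-field NaN counting this PR adds for ORC float/double writers. The class and method names below are illustrative only, not the PR's actual code:

```java
// Illustrative sketch: count NaN values as floats are written, and expose
// the count so the file writer can fold it into the table-level Metrics.
class FloatNaNCounter {
  private final int fieldId;       // Iceberg field id this counter tracks
  private long nanValueCount = 0L;

  FloatNaNCounter(int fieldId) {
    this.fieldId = fieldId;
  }

  void write(float value) {
    if (Float.isNaN(value)) {
      nanValueCount += 1;
    }
    // ... hand the value off to the underlying ORC column vector ...
  }

  int fieldId() {
    return fieldId;
  }

  long nanValueCount() {
    return nanValueCount;
  }
}
```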

* exceptions when they are accessed.
*/
public class ParquetFieldMetrics extends FieldMetrics {
Contributor:

I thought the reason for having ParquetFieldMetrics extending FieldMetrics is to allow ORC and parquet to diverge if needed. If we are moving them back to a common class, why not just move everything back to FieldMetrics?

Contributor Author:

It's actually to allow Parquet and ORC to diverge from Avro, so we will use the actual FieldMetrics in Avro.

Contributor:

I am a bit confused; in that case, why not have three separate classes: ParquetFieldMetrics, ORCFieldMetrics, and AvroFieldMetrics?

Contributor Author:

Yeah, that's also an alternative approach. I wasn't sure I wanted to directly duplicate the code in ParquetFieldMetrics to create an ORC version, though, since I don't think the Parquet/ORC libraries will support NaN natively any time soon.

Contributor:

I don't see a need for 3 classes when 2 of them would be nearly identical. I like what is in this PR, with the one note about naming.

}

return fieldMetrics
.filter(metrics -> {
Contributor:

Since we have this util class, can we decompose this function into methods like metricsColumnName(FieldMetrics, Schema) and metricsMode(FieldMetrics, MetricsConfig)? They might be useful in other classes, and they would also make the lambda chain cleaner.

Contributor Author:

Sure, I'll break this down
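A rough sketch of that decomposition; only the method names come from the suggestion above, and the bodies are assumptions (they lean on Schema.findColumnName and MetricsConfig.columnMode, plus an extra Schema argument for the mode lookup):

```java
// Hypothetical helpers; bodies are sketches, not the actual MetricsUtil code.
static String metricsColumnName(FieldMetrics metrics, Schema schema) {
  // Resolve the full column name from the field id the metrics track.
  return schema.findColumnName(metrics.id());
}

static MetricsMode metricsMode(FieldMetrics metrics, MetricsConfig metricsConfig, Schema schema) {
  // Look up the configured metrics mode for that column.
  return metricsConfig.columnMode(metricsColumnName(metrics, schema));
}
```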

private MetricsUtil() {
}

public static Map<Integer, Long> getNanValueCounts(
Contributor:

nit: it seems like Iceberg prefers method names without get.

Contributor Author:

I was aware of that, but couldn't come up with a good name without get in this case (names like generateCount/createCount are longer and sound like workarounds for the word "get"), and I later assumed it was fine since it's not an ordinary getter. Do you have a recommendation?

Contributor:

The problem with get in this case is that it isn't clear where the return value is coming from. I think that create is a better option because it is clear that the return value is built from the input arguments.

An example of where the value may come from somewhere else is IPUtil.getHostName(String iface). The input value is used, but the actual return value would come from an external source.

Contributor Author:

Makes sense, I'll update both here and in #1829.

@@ -109,6 +112,10 @@ private WriteBuilder() {
iPrimitive, primitive));
}
}

private int getFieldId(TypeDescription typeDescription) {
Contributor:

nit: the private method feels redundant since the body is also just one line

Contributor Author:

I personally find it easier to read when the abstract logic of what's happening is separated from the underlying implementation, which is why I created the helper; do you feel strongly about this? I'll see if other people have the same comment.

Contributor:

I agree with @jackye1995. I probably wouldn't separate this out into its own method.

@@ -129,5 +139,9 @@ private WriteBuilder() {
"Invalid iceberg type %s corresponding to Flink logical type %s", iPrimitive, flinkPrimitive));
}
}

private int getFieldId(TypeDescription typeDescription) {
Contributor:

nit: same comment as before; do we need this private method for the one-line call?

@@ -46,8 +50,8 @@ private FlinkOrcWriter(RowType rowType, Schema iSchema) {
}
}

public static OrcRowWriter<RowData> buildWriter(RowType rowType, Schema iSchema) {
return new FlinkOrcWriter(rowType, iSchema);
public static OrcRowWriter<RowData> buildWriter(RowType rowType, Schema iSchema, TypeDescription schema) {
Contributor:

Is it necessary to pass in the type description? Can we get the id from the Iceberg schema?

Contributor Author:

Yeah, I tried to avoid changing the signature but wasn't able to find id information in the schema; I'll see if people more familiar with the project have comments on this.

@rdblue (Contributor) commented Dec 1, 2020:

The field IDs are kept in NestedField, not primitives. That's probably why you didn't find one that was usable in the primitive method. What we do in other visitors is add methods to the visitor that are called before and after visiting a struct field, array element, map key, and map value. Those methods are passed the field. Then the visitor just needs to implement the before/after to maintain a stack of field IDs.

Here's a visit method with the callbacks: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L334
And here's an example of using them to get the field IDs: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/TypeToSchema.java#L75-L83
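A skeletal illustration of that callback pattern (the class and method names here are simplified from the linked code and are not the exact API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.iceberg.types.Types;

abstract class ExampleSchemaVisitor<T> {
  // Stack of field ids maintained by the before/after callbacks.
  private final Deque<Integer> fieldIds = new ArrayDeque<>();

  void beforeField(Types.NestedField field) {
    fieldIds.push(field.fieldId());
  }

  void afterField(Types.NestedField field) {
    fieldIds.pop();
  }

  // While visiting a primitive, the id of its enclosing field is on top.
  protected Integer currentFieldId() {
    return fieldIds.peek();
  }
}
```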

Contributor Author:

Thanks for the info! Yeah, I noticed earlier that the ids exist in NestedField but wasn't able to find a good way to extract them without larger changes to the signature; I tried replacing StructType/NestedType with NestedField, but that would lose other information. I'll update to use the before/after pattern.

@@ -115,12 +120,30 @@ public short getShort(int ordinal) {

@Override
public int getInt(int ordinal) {
return struct.get(ordinal, Integer.class);
Object integer = struct.get(ordinal, Object.class);
Contributor:

For this issue about handling date and timestamp, can it be a separate PR? This PR is already very big.

Contributor Author:

This is needed for TestMergingMetrics to work for ORC (also mentioned in the PR description), so I think it makes more sense to keep it here.

Contributor:

If so, should we publish a separate PR for this? My biggest concern is that people might have different opinions about the way to handle these data types. When embedded in a big PR, it gets less attention from the people who are interested.

Contributor Author:

I made a similar change in this class in an earlier PR, and I think this change is specific to the test itself, so I'm a bit reluctant to move it out of the current context; without this context the change would be confusing to understand (see the sketch after this list):

  • Spark already supports dates and times the way this class handles them: Spark represents dates/times with internal long/int values (ref1, ref2), so the new code introduced here won't affect the behavior of this class, and Spark will continue to write and read these types with the same internal numeric types. This class is also only used for loading metadata tables, so the performance of the extra if-else check probably isn't a big concern either.
  • However, in TestMergingMetrics itself, the data are created by a random record generator that uses actual LocalDate objects to populate the fields, and since we wrap the generated records in this class for writing, we have to handle them specially here. But I don't think this usage pattern (writing Spark rows by wrapping them with Iceberg records) will be used in production, since the Spark engine uses the actual Spark InternalRow.
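For reference, a minimal sketch of the kind of check discussed for StructInternalRow.getInt; the LocalDate branch is an assumption about how the test's generated records get adapted, not necessarily the exact code in this PR:

```java
import java.time.LocalDate;

// Sketch: accept either Spark's internal int representation of a date or a
// LocalDate produced by the test's random record generator.
@Override
public int getInt(int ordinal) {
  Object integer = struct.get(ordinal, Object.class);
  if (integer instanceof LocalDate) {
    // Spark stores dates internally as days since the epoch.
    return (int) ((LocalDate) integer).toEpochDay();
  }
  return (Integer) integer;
}
```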

Contributor Author:

I guess since I'll send out a separate PR anyway, I'll include the change in this class as part of that PR.

Contributor:

This is okay with me.

@rdblue (Contributor) commented Nov 25, 2020:

This touches a lot of files. Can we separate the refactoring out into a separate PR and focus on just ORC implementations here?

@yyanyy (Contributor Author) commented Nov 25, 2020:

> This touches a lot of files. Can we separate the refactoring out into a separate PR and focus on just ORC implementations here?

Sure, I'll send out a PR that contains only the refactoring, and once that's merged I'll update this one to depend on it, unless you want it the other way around?

* @param id field id being tracked by the writer
* @param nanValueCount number of NaN values, will only be non-0 for double or float field.
*/
public ParquetFieldMetrics(int id,
public NaNOnlyFieldMetrics(int id,
Contributor:

I would probably change this to NaNFieldMetrics instead because it is likely that we will be adding lower/upper bounds to this in the near future. That would avoid another rename, but it's up to you.

Contributor Author:

Sounds good, I'll update both here and in #1829.

@@ -17,51 +17,50 @@
* under the License.
*/

package org.apache.iceberg.parquet;
Contributor:

Does this need to be in API or could it be in core instead?

If the class is never returned to users, then I would keep it in core. The classes in api are primarily those that users would interact with.

Contributor Author:

Sounds good, I'll move both this and FieldMetrics to core, both here and in #1829.

@@ -77,12 +80,12 @@ private GenericOrcWriters() {
return LongWriter.INSTANCE;
}

public static OrcValueWriter<Float> floats() {
return FloatWriter.INSTANCE;
public static OrcValueWriter<Float> floats(Integer id) {
Contributor:

The id is required, right? If so, then I think it could be int instead. It's an int in the constructor that gets called.

This could add a precondition to check that the id is non-null, but I think it would be better to do that before calling this method because the caller would probably know the field name rather than just ID. Using the field name would produce a better error message.
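A sketch of the suggested caller-side validation, using the field name for a clearer error (this assumes a Types.NestedField field and its TypeDescription primitive are in scope, and that ORCSchemaUtil.fieldId may return null when the id attribute is missing):

```java
// Hypothetical call-site check before invoking floats(int): fail fast with
// the column name rather than passing a possibly-null boxed id downstream.
Integer fieldId = ORCSchemaUtil.fieldId(primitive);
if (fieldId == null) {
  throw new IllegalArgumentException("Missing field id for column: " + field.name());
}
return GenericOrcWriters.floats(fieldId);
```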


abstract class FlinkSchemaVisitor<T> {

static <T> T visit(RowType flinkType, Schema schema, FlinkSchemaVisitor<T> visitor) {
return visit(flinkType, schema.asStruct(), visitor);
static <T> T visit(RowType flinkType, Schema schema, TypeDescription typeDesc, FlinkSchemaVisitor<T> visitor) {
Contributor:

I think it would be cleaner to add the beforeField/afterField callbacks instead of the typeDesc.

@@ -114,6 +120,7 @@ private static Metrics buildOrcMetrics(final long numOfRows, final TypeDescripti
columnSizes,
valueCounts,
nullCounts,
Maps.newHashMap(),
Contributor:

Why not use null when metrics are missing?

Contributor Author:

I think I missed updating this during refactoring. Will update!

* Returns a stream of {@link FieldMetrics} that this OrcRowWriter keeps track of.
* <p>
* Since ORC keeps track of most metrics via column statistics, for now OrcRowWriter only keeps track of NaN
* counters for double or float columns.
Contributor:

I don't think this paragraph needs to be here because it is a snapshot of how another component works. It could get stale really easily.

Contributor Author:

Makes sense, I'll remove the similar instances here and in OrcValueWriter and ParquetValueWriter.

@rdblue (Contributor) commented Dec 1, 2020:

This looks close to ready, but I made a few comments.

yyanyy added a commit to yyanyy/iceberg that referenced this pull request Dec 2, 2020
yyanyy added a commit to yyanyy/iceberg that referenced this pull request Dec 7, 2020
@rdblue (Contributor) commented Dec 28, 2020:

@yyanyy, could you rebase this? I think we can work on getting it in now, right? #1829 is in.

@@ -98,9 +106,9 @@ public SparkOrcValueWriter primitive(Type.PrimitiveType iPrimitive, TypeDescript
case LONG:
return SparkOrcValueWriters.longs();
case FLOAT:
return SparkOrcValueWriters.floats();
return SparkOrcValueWriters.floats(getFieldId(primitive));
Contributor:

nit: getFieldId is not used anywhere else; why not just use ORCSchemaUtil.fieldId?

/**
* Returns a stream of {@link FieldMetrics} that this OrcRowWriter keeps track of.
*/
Stream<FieldMetrics> metrics();
Contributor:

Why do some metrics method signatures have a default implementation while others below do not?

Contributor Author:

Currently the value writers all have a default, because we only track this metric for float and double types; declaring a default saves the other types from declaring an empty stream. The row writer, on the other hand, handles row writing and always needs to read from its value writers' metrics, so a default wouldn't help much there. That said, I guess we do want to avoid breaking people's code if they implement their own row writer, so I'll add a default here too.
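To illustrate the distinction, a simplified sketch with the writer interfaces reduced to just the metrics method (the defaults shown are illustrative, not the PR's exact code):

```java
import java.util.stream.Stream;

interface OrcValueWriter<T> {
  // Most value writers track no field metrics, so an empty default spares
  // every non-float/double writer from declaring this method.
  default Stream<FieldMetrics> metrics() {
    return Stream.empty();
  }
}

interface OrcRowWriter<T> {
  // Row writers always aggregate their value writers' metrics, so a default
  // mainly exists to avoid breaking custom implementations.
  default Stream<FieldMetrics> metrics() {
    return Stream.empty();
  }
}
```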

visitor.beforeListElement(elementField);
try {
element = visit(listType.getElementType(), iListType.elementType(), visitor);
} finally {
Contributor:

An error should be logged if we catch anything. Same for the try-finally block above.

Contributor Author:

I think we are not catching any exceptions here, so we don't have anything to log? Did I miss something?

Contributor:

I see. I was thinking about logging the exception once in the catch block, but it seems unnecessary.

@rdblue (Contributor) commented Feb 3, 2021:

Looks good now. Thanks for the latest update, I think it is a bit cleaner now that it doesn't pass the type description.

@rdblue merged commit 8e026f1 into apache:master on Feb 3, 2021
coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021