Core: update record_count behavior, include in manifest reader #1820

yyanyy · 2020-11-25T02:47:25Z

Please see this comment for the reason to have this change
Please note that this changes the behavior of recordCount in BaseFile. Originally, if a BaseFile was created by Avro schema reflection without populating recordCount, calling recordCount() would throw an NPE, because the field is a boxed Long while the method's return type is a primitive long. I'm currently following the same style as fileSizeInBytes and returning -1 when it is not populated.
~~One implication of this is that the NPE problem described in the original comment will no longer exist, instead metrics evaluators will not filter out anything.~~
Alternatively, I can refrain from changing this and accept that data.recordCount() could throw an NPE in tests, or change the return type of recordCount() to Long. I don't really have a strong preference, so suggestions are welcome!
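To make the trade-off concrete, here is a minimal sketch of the two behaviors being weighed. RecordCountSketch is a hypothetical stand-in for BaseFile and models only the record-count field; it is not the actual class.

```java
// Hypothetical stand-in for BaseFile; only the record-count field is modeled.
final class RecordCountSketch {
  // Boxed field: stays null when reflection-based construction never sets it.
  private Long boxedRecordCount = null;
  // Primitive field with a sentinel, mirroring the fileSizeInBytes style.
  private long sentinelRecordCount = -1L;

  // Unboxing a null Long into a primitive long throws NullPointerException.
  long recordCountBoxed() {
    return boxedRecordCount;
  }

  // Never throws, but callers must remember that -1 means "not populated".
  long recordCountSentinel() {
    return sentinelRecordCount;
  }

  public static void main(String[] args) {
    RecordCountSketch file = new RecordCountSketch();
    System.out.println(file.recordCountSentinel());
    try {
      file.recordCountBoxed();
    } catch (NullPointerException e) {
      System.out.println("NPE on unpopulated record count");
    }
  }
}
```

The boxed variant fails loudly when the count was never read, while the sentinel variant lets -1 flow into downstream arithmetic or comparisons unless every caller checks for it.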

rdblue · 2020-11-25T18:44:17Z

core/src/main/java/org/apache/iceberg/ManifestReader.java

+
+  // the difference between the two stats set below is to support ContentFile.copyWithoutStats(), which
+  // still keeps record count.
+  private static final Set<String> STATS_COLUMNS = Sets.newHashSet(


Keeps record count or discards record count?

I think it was an oversight to not include record count in stats. I think we should just have one list.

I think copyWithoutStats doesn't discard record count, while it discards all column-specific stats.

I do agree that having one list is simpler. The reason I kept two is:

If we add record_count to this list, it results in a behavior change: if people select record_count without any of the other stats listed here, previously they would not receive those stats, but now they would receive the full set. This is because dropStats relies on this list.

Alternatively, we can stop copying recordCount in copyWithoutStats, but I'm not entirely sure we want to do that: the metrics that can currently be discarded are all maps, whereas recordCount is a long. And if we no longer copy recordCount, we may as well not copy fileSizeInBytes, which is another long. After this change, since both attributes return a primitive type, they would return -1, which I'm not sure is the best thing to do.

I think the first approach is safer, but I wasn't sure if it's worth changing the behavior to keep the code simpler. Do you have a recommendation?

Can we special-case record_count? I don't think that record_count should be dropped in copyWithoutStats, but I also agree that simply selecting record_count should not select all stats columns.

This set is primarily for ensuring that all stats required to filter the instances of DataFile or DeleteFile stored in a manifest are present. I think we should limit it to that purpose. I don't think that this list should affect the behavior of copyWithoutStats.

Sounds good. I'll add record_count to the set so that it tracks all stats required for filtering data/delete files, and special-case record_count in dropStats() so that selecting only record_count still results in copyWithoutStats instead of a full copy. (This is actually the same as the current logic; it just consolidates the two sets into one.)
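The special case agreed on above could look roughly like the following. This is a sketch under assumptions, not the merged code: DropStatsSketch is a hypothetical class, the boolean hasRowFilter stands in for the Expression rowFilter argument, the early-return condition is a guess at the part of dropStats not shown in the diff, and plain java.util collections replace Guava's Sets.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class DropStatsSketch {
  // Single consolidated set: every stats column needed for filtering, including record_count.
  static final Set<String> STATS_COLUMNS = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
      "value_counts", "null_value_counts", "nan_value_counts",
      "lower_bounds", "upper_bounds", "record_count")));

  // hasRowFilter is a hypothetical simplification of the Expression rowFilter parameter.
  static boolean dropStats(boolean hasRowFilter, Collection<String> columns) {
    if (!hasRowFilter || columns == null) {
      return false; // assumption: nothing to drop when stats were not added for filtering
    }
    Set<String> selectedStats = new HashSet<>(columns);
    selectedStats.retainAll(STATS_COLUMNS);
    // Drop when no stats were requested, or when record_count is the only one requested.
    return selectedStats.isEmpty() || selectedStats.equals(Collections.singleton("record_count"));
  }

  public static void main(String[] args) {
    System.out.println(dropStats(true, Arrays.asList("file_path", "record_count")));
    System.out.println(dropStats(true, Arrays.asList("file_path", "lower_bounds")));
  }
}
```

With this shape, selecting record_count alone does not pull the potentially large stats maps into the returned files, while selecting any other stats column keeps the full copy.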

rdblue · 2020-11-25T18:45:59Z

core/src/main/java/org/apache/iceberg/ManifestReader.java

@@ -289,12 +296,12 @@ static boolean dropStats(Expression rowFilter, Collection<String> columns) {
        Sets.intersection(Sets.newHashSet(columns), STATS_COLUMNS).isEmpty();
  }

-  private static Collection<String> withStatsColumns(Collection<String> columns) {
+  static Collection<String> withStatsColumns(Collection<String> columns) {


If this returned List, then we wouldn't need to copy the list in ManifestGroup

In one of its usages, columns comes directly from the columns field, which is a Collection, so I didn't change that. But it seems that, given all its current usages, this field could be changed to a List. I'll need to update a few tests after this, but I think it's doable. Will do!
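A rough sketch of the suggestion, returning a List so that callers such as ManifestGroup need no extra copy. The class name is hypothetical, the ALL_COLUMNS constant and the wildcard check are assumptions for illustration, and java.util is used in place of Guava.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

final class WithStatsColumnsSketch {
  // Assumed wildcard projection; not taken from the diff above.
  static final List<String> ALL_COLUMNS = Arrays.asList("*");
  static final List<String> STATS_COLUMNS = Arrays.asList(
      "value_counts", "null_value_counts", "nan_value_counts",
      "lower_bounds", "upper_bounds", "record_count");

  // Accepts any Collection but returns a List, so the caller can use the result directly.
  static List<String> withStatsColumns(Collection<String> columns) {
    List<String> projected = new ArrayList<>(columns);
    if (!columns.containsAll(ALL_COLUMNS)) {
      projected.addAll(STATS_COLUMNS); // order does not matter for projection
    }
    return projected;
  }

  public static void main(String[] args) {
    System.out.println(withStatsColumns(Arrays.asList("file_path")));
  }
}
```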

rdblue · 2020-12-06T00:53:00Z

core/src/main/java/org/apache/iceberg/BaseFile.java

@@ -60,7 +60,7 @@ public PartitionData copy() {
  private String filePath = null;
  private FileFormat format = null;
  private PartitionData partitionData = null;
-  private Long recordCount = null;
+  private long recordCount = -1L;


I don't think we should make this change. I like that the NullPointerException ensures that an incorrect record count just can't be used.

I mostly did this to simplify testing, as I noticed that fileSizeInBytes, which is also a long, follows the pattern of defaulting to -1, but I don't have a strong opinion either way. Will revert!

rdblue · 2020-12-06T00:57:25Z

core/src/test/java/org/apache/iceberg/TestManifestReaderStats.java

+  }
+
+  private void assertNoStats(DataFile dataFile) {
+    Assert.assertEquals(-1L, dataFile.recordCount());


I think this should always contain the record count, even after copyWithoutStats. That's primarily to drop the stats maps, which can be really large.

This code is actually not testing results from copyWithoutStats; those are tested by another method, assertStatsDropped, which still keeps recordCount. In the test case that uses assertNoStats, only a select, and no filter, is applied to the manifest reader. The reader therefore doesn't project stats columns when reading (manifest entries don't have to go through evaluators when no filter is applied) and returns only the selected field (in this case file_path). But I can see that this name is confusing, so I updated it a bit to hopefully make things clearer.

rdblue · 2020-12-29T00:08:37Z

core/src/main/java/org/apache/iceberg/ManifestReader.java

@@ -136,7 +137,7 @@ public PartitionSpec spec() {
    return spec;
  }

-  public ManifestReader<F> select(Collection<String> newColumns) {
+  public ManifestReader<F> select(List<String> newColumns) {


Why was it necessary to change this to List instead of Collection? That's not a binary-compatible change. Couldn't we just make a copy of the collection if we need the field to be a list? I'm also not sure why columns can't be a Collection.

I think I modified this when trying to address this comment: I noticed that all usages of columns could work with a List, so I changed the type directly to avoid the list copy, without thinking about backward compatibility of this method. I guess your original suggestion was to move the copy from ManifestGroup to ManifestReader, but I misinterpreted it as getting rid of the list copy completely?
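Changing a public method's parameter type changes its descriptor, so callers compiled against select(Collection) would fail at link time with NoSuchMethodError even though their source still compiles. A minimal sketch of the alternative raised in the review, keeping the Collection parameter and copying inside, is below. SelectSketch is a hypothetical class, not ManifestReader itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

// Hypothetical reader; only the select() signature and the columns field are modeled.
final class SelectSketch {
  private List<String> columns = null;

  // The parameter stays Collection, so the descriptor seen by already-compiled callers is unchanged.
  SelectSketch select(Collection<String> newColumns) {
    this.columns = new ArrayList<>(newColumns); // copy once here; the field itself can be a List
    return this;
  }

  List<String> columns() {
    return columns;
  }

  public static void main(String[] args) {
    // A Set is still accepted, which a List-typed parameter would reject at compile time.
    SelectSketch reader = new SelectSketch().select(new HashSet<>(Arrays.asList("file_path")));
    System.out.println(reader.columns());
  }
}
```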

jackye1995

Just a few small details, otherwise it looks good to me.

jackye1995 · 2021-01-11T22:24:46Z

core/src/main/java/org/apache/iceberg/ManifestReader.java

-  static final Set<String> STATS_COLUMNS = Sets.newHashSet(
-      "value_counts", "null_value_counts", "nan_value_counts", "lower_bounds", "upper_bounds");
+
+  private static final Set<String> STATS_COLUMNS = Sets.newHashSet(


nit: use immutable set
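As a sketch of what the nit asks for: a set built with Sets.newHashSet can still be mutated through the static reference, while an immutable or unmodifiable set rejects writes. In a codebase that already uses Guava, ImmutableSet.of(...) would be the likely choice; the version below uses only java.util so that it is self-contained, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class ImmutableStatsColumnsSketch {
  // Unmodifiable view over a private copy; no other reference to the backing set escapes.
  static final Set<String> STATS_COLUMNS = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
      "value_counts", "null_value_counts", "nan_value_counts",
      "lower_bounds", "upper_bounds", "record_count")));

  public static void main(String[] args) {
    try {
      STATS_COLUMNS.add("file_path");
    } catch (UnsupportedOperationException e) {
      System.out.println("STATS_COLUMNS cannot be modified");
    }
  }
}
```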

jackye1995 · 2021-01-11T22:38:44Z

core/src/test/java/org/apache/iceberg/TestManifestReaderStats.java

      Assert.assertNull(dataFile.lowerBounds());
      Assert.assertNull(dataFile.upperBounds());
-      Assert.assertNull(dataFile.nanValueCounts());


nit: to minimize changes, no need to move line for this. Same comment for L206

rdblue · 2021-02-03T01:04:47Z

Looks great, thanks @yyanyy!

yyanyy mentioned this pull request Nov 25, 2020

Add NaN value count to content file #1803

Merged

github-actions bot added the core label Nov 25, 2020

rdblue reviewed Nov 25, 2020

rdblue reviewed Dec 6, 2020

yyanyy added 3 commits December 8, 2020 12:16

Core: update record_count behavior, include in manifest reader

287c82b

update columns type to list in manifest reader

106d5db

revert basefile change

76d6056

yyanyy force-pushed the manifest_reader_stats branch from bb09e68 to 76d6056 Compare December 8, 2020 20:28

update some comment

7e757de

rdblue reviewed Dec 29, 2020

rdblue reviewed Dec 29, 2020

address comments

7e177d1

jackye1995 reviewed Jan 11, 2021

minor changes

ca8e362

jackye1995 approved these changes Jan 14, 2021

rdblue approved these changes Feb 3, 2021

rdblue merged commit 97703fb into apache:master Feb 3, 2021

coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021

Core: Include record_count with stats in ManifestReader (apache#1820)

563aa2a


