Core: update record_count behavior, include in manifest reader #1820

Merged: 6 commits, Feb 3, 2021

Contributor

@yyanyy yyanyy commented Nov 25, 2020

  • Please see this comment for the reason to have this change
  • Please note that this changes the behavior of recordCount in BaseFile. Originally, if a BaseFile was created by Avro schema reflection without populating recordCount, calling recordCount() would throw an NPE, because the return type is primitive and the null field cannot be unboxed. I'm currently following the same style as fileSizeInBytes and returning -1 when it is not populated.
  • One implication is that the NPE problem described in the original comment will no longer exist; instead, metrics evaluators will not filter out anything.
  • Alternatively, I can refrain from changing this and accept that data.recordCount() could throw an NPE in tests, or change the return type of recordCount() to Long. I don't really have a strong preference, so suggestions are welcome!
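
The trade-off in the description can be illustrated with a minimal sketch. The class names below (RecordCountSketch, BoxedFile, SentinelFile) are hypothetical and are not Iceberg's BaseFile; they only contrast a null boxed field that fails on unboxing with a primitive field that defaults to -1.

```java
// Hypothetical illustration; these classes are not part of Iceberg.
class RecordCountSketch {
  // Original style: a boxed field that stays null when it is not populated.
  static class BoxedFile {
    private Long recordCount = null;

    long recordCount() {
      // Auto-unboxing a null Long throws NullPointerException here.
      return recordCount;
    }
  }

  // Proposed style: a primitive field with -1 as the "not populated" sentinel.
  static class SentinelFile {
    private long recordCount = -1L;

    long recordCount() {
      return recordCount;
    }
  }

  // Returns true if reading the unpopulated boxed count fails with an NPE.
  static boolean boxedThrows() {
    try {
      new BoxedFile().recordCount();
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("boxed throws NPE: " + boxedThrows());
    System.out.println("sentinel value: " + new SentinelFile().recordCount());
  }
}
```

With the boxed field, a missing count fails loudly at the call site; with the sentinel, callers must remember to check for -1 themselves.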


// the difference between the two stats set below is to support ContentFile.copyWithoutStats(), which
// still keeps record count.
private static final Set<String> STATS_COLUMNS = Sets.newHashSet(

Contributor

@rdblue rdblue Nov 25, 2020

Keeps record count or discards record count?

I think it was an oversight to not include record count in stats. I think we should just have one list.

Contributor Author

I think copyWithoutStats doesn't discard record count while discarding all column-specific stats.

I do agree that having one list is simpler. My reasons for keeping two are:

  • If we add record_count to this list, it will result in a behavior change: if people select record_count without any other stats listed here, they previously would not receive those stats, but now they would receive the full list. This is because dropStats relies on this list.
  • Alternatively, we can stop copying recordCount within copyWithoutStats, but I'm not entirely sure we want to do that, since the metrics that can currently be discarded are all maps and recordCount is a long. I guess if we no longer copy recordCount, we may as well not copy fileSizeInBytes, which is another long. After this change, since these two attributes return primitive types, they would return -1, and I'm not sure that's the best thing to do.

I think the first approach is safer, but I wasn't sure if it's worth changing the behavior to keep the code simpler. Do you have a recommendation?

Contributor

Can we special-case record_count? I don't think that record_count should be dropped in copyWithoutStats, but I also agree that simply selecting record_count should not select all stats columns.

This set is primarily for ensuring that all stats required to filter the instances of DataFile or DeleteFile stored in a manifest are present. I think we should limit it to that purpose. I don't think that this list should affect the behavior of copyWithoutStats.

Contributor Author

Sounds good. I'll add record_count to the set so that the set tracks all stats required for filtering data/delete files, and special-case record_count in dropStats() so that selecting only record_count still results in copyWithoutStats instead of a full copy. (This is actually the same as the current logic, just consolidating the two sets into one.)
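
The special case agreed on above could look roughly like the sketch below. It is a simplified, hypothetical stand-in and not the actual ManifestReader code: the class name DropStatsSketch is invented, the boolean hasRowFilter replaces the real Expression argument, and any other conditions the real method checks are omitted.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified sketch; not the actual ManifestReader implementation.
class DropStatsSketch {
  static final String RECORD_COUNT = "record_count";

  // A single set of all stats columns needed for filtering, including record_count.
  static final Set<String> STATS_COLUMNS = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
      "value_counts", "null_value_counts", "nan_value_counts",
      "lower_bounds", "upper_bounds", RECORD_COUNT)));

  // hasRowFilter is an assumed stand-in for the Expression check in the real method.
  static boolean dropStats(boolean hasRowFilter, Collection<String> columns) {
    if (!hasRowFilter || columns == null) {
      return false;
    }
    Set<String> selectedStats = new HashSet<>(columns);
    selectedStats.retainAll(STATS_COLUMNS);
    // Drop stats if no stats column was selected, or if record_count was the only one.
    return selectedStats.isEmpty() || selectedStats.equals(Collections.singleton(RECORD_COUNT));
  }
}
```

Under this sketch, selecting file_path and record_count still drops the large stats maps, while selecting any other stats column keeps the full stats.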

@@ -289,12 +296,12 @@ static boolean dropStats(Expression rowFilter, Collection<String> columns) {
Sets.intersection(Sets.newHashSet(columns), STATS_COLUMNS).isEmpty();
}

-private static Collection<String> withStatsColumns(Collection<String> columns) {
+static Collection<String> withStatsColumns(Collection<String> columns) {

Contributor

If this returned List, then we wouldn't need to copy the list in ManifestGroup

Contributor Author

In one of the usages, columns comes directly from the field column, which is a Collection, so I didn't change that. But from all of its current usages, it seems this field could be changed to a List. I'll need to update a few tests after this, but I think it's doable. Will do!
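
A rough sketch of the suggestion, with withStatsColumns accepting any Collection and returning a List so that callers need no further copy. The class name and the exact column list are assumptions for illustration, and the real method may handle more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch; names and column list are assumptions, not the merged code.
class WithStatsColumnsSketch {
  static final List<String> STATS_COLUMNS = Arrays.asList(
      "value_counts", "null_value_counts", "nan_value_counts",
      "lower_bounds", "upper_bounds", "record_count");

  // Accepts any Collection and returns a new List that also contains every stats column.
  static List<String> withStatsColumns(Collection<String> columns) {
    List<String> projected = new ArrayList<>(columns);
    for (String stat : STATS_COLUMNS) {
      if (!projected.contains(stat)) {
        projected.add(stat);
      }
    }
    return projected;
  }
}
```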

@@ -60,7 +60,7 @@ public PartitionData copy() {
private String filePath = null;
private FileFormat format = null;
private PartitionData partitionData = null;
-private Long recordCount = null;
+private long recordCount = -1L;

Contributor

I don't think we should make this change. I like that the NullPointerException ensures that an incorrect record count just can't be used.

Contributor Author

I mostly did this to simplify testing, as I noticed that fileSizeInBytes, which is also a long, follows the pattern of defaulting to -1, but I don't have a strong opinion either way. Will revert!

}

private void assertNoStats(DataFile dataFile) {
Assert.assertEquals(-1L, dataFile.recordCount());

Contributor

I think this should always contain the record count, even after copyWithoutStats. The purpose of copyWithoutStats is primarily to drop the stats maps, which can be really large.

Contributor Author

This code is actually not testing results from copyWithoutStats; those are tested by another method, assertStatsDropped, which still keeps recordCount. In the test case that uses assertNoStats, only a select but no filter operation is applied to the manifest reader, so the reader doesn't project stats columns when reading (manifest entries don't have to go through evaluators when no filter is applied) and thus only returns the field being selected (in this case file_path). But I can see that this name is confusing, so I updated it a bit to hopefully make things clearer.
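
The intended copyWithoutStats behavior discussed in this thread (keep the record count, drop the per-column stats maps) can be sketched as follows. FileEntry and its fields are hypothetical simplifications, not Iceberg's BaseFile.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the copy semantics discussed above; not Iceberg's BaseFile.
class CopyWithoutStatsSketch {
  static class FileEntry {
    final String filePath;
    final long recordCount;
    // Per-column stats map; this is the part that can grow large.
    final Map<Integer, Long> valueCounts;

    FileEntry(String filePath, long recordCount, Map<Integer, Long> valueCounts) {
      this.filePath = filePath;
      this.recordCount = recordCount;
      this.valueCounts = valueCounts;
    }

    // Keeps the cheap primitive record count, drops the stats map.
    FileEntry copyWithoutStats() {
      return new FileEntry(filePath, recordCount, null);
    }
  }

  static FileEntry sample() {
    Map<Integer, Long> counts = new HashMap<>();
    counts.put(1, 10L);
    return new FileEntry("/path/to/data.parquet", 10L, counts);
  }
}
```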

@yyanyy yyanyy force-pushed the manifest_reader_stats branch from bb09e68 to 76d6056 Compare December 8, 2020 20:28
@@ -136,7 +137,7 @@ public PartitionSpec spec() {
return spec;
}

-public ManifestReader<F> select(Collection<String> newColumns) {
+public ManifestReader<F> select(List<String> newColumns) {

Contributor

Why was it necessary to change this to List instead of Collection? That's not a binary-compatible change. Couldn't we just make a copy of the collection if we need the field to be a list? I'm also not sure why columns can't be a Collection.

Contributor Author

I think I modified this when trying to address this comment: I noticed that all usages of columns could be done via a List, so I changed the type directly to avoid the list copying, without thinking about backward compatibility of this method. I guess your original suggestion was to move the copying from ManifestGroup to ManifestReader, but I misinterpreted it as getting rid of the list copy completely?
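
One way to read the reviewer's suggestion is to keep the public signature on Collection and copy into a List internally. The sketch below assumes that reading; SelectSketch is a hypothetical stand-in for the reader class, not the merged code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch; SelectSketch stands in for the reader class.
class SelectSketch {
  private List<String> columns = null;

  // The parameter type stays Collection, so the method descriptor is unchanged
  // and callers compiled against the old signature still link.
  SelectSketch select(Collection<String> newColumns) {
    // Copy once here so the field can be a List without changing the public API.
    this.columns = new ArrayList<>(newColumns);
    return this;
  }

  List<String> columns() {
    return columns;
  }
}
```

Changing a parameter type from Collection to List changes the method descriptor, which is why previously compiled callers would fail to link even though most source code would still compile.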

Contributor

@jackye1995 jackye1995 left a comment

Just a few small details, otherwise it looks good to me.

static final Set<String> STATS_COLUMNS = Sets.newHashSet(
"value_counts", "null_value_counts", "nan_value_counts", "lower_bounds", "upper_bounds");

private static final Set<String> STATS_COLUMNS = Sets.newHashSet(

Contributor

nit: use immutable set
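
A minimal sketch of what an immutable set buys here, using only the JDK. The class name is hypothetical, and Collections.unmodifiableSet is an assumed substitute; in a codebase that already uses Guava's Sets, Guava's ImmutableSet would be a natural alternative.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch using JDK collections only.
class ImmutableStatsColumnsSketch {
  // Wrapping the set makes accidental modification of the shared constant fail fast.
  static final Set<String> STATS_COLUMNS = Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList(
      "value_counts", "null_value_counts", "nan_value_counts", "lower_bounds", "upper_bounds")));

  // Returns true if an attempt to add to the constant set is rejected.
  static boolean rejectsMutation() {
    try {
      STATS_COLUMNS.add("file_path");
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }
}
```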

Assert.assertNull(dataFile.lowerBounds());
Assert.assertNull(dataFile.upperBounds());
Assert.assertNull(dataFile.nanValueCounts());

Contributor

nit: to minimize changes, no need to move line for this. Same comment for L206

@rdblue rdblue merged commit 97703fb into apache:master Feb 3, 2021
Contributor

rdblue commented Feb 3, 2021

Looks great, thanks @yyanyy!

coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021