-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: update record_count behavior, include in manifest reader #1820
Conversation
|
||
// the difference between the two stats set below is to support ContentFile.copyWithoutStats(), which | ||
// still keeps record count. | ||
private static final Set<String> STATS_COLUMNS = Sets.newHashSet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeps record count or discards record count?
I think it was an oversight to not include record count in stats. I think we should just have one list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think copyWithoutStats
doesn't discard record count will discarding all column-specific stats.
I do agree that having one list is simpler, the reason for me to do this is
- If we add
record_count
to this list then it will result in a behavior change, that if people selectrecord_count
without other stats listed here, earlier they will not receive those stats, but now they will receive a full list. This is becausedropStats
relies on this list. - Alternatively we can stop copying
recordCount
over withincopyWithoutStats
but I'm not entirely sure if we want to do that since currently the metrics that can be discarded are all map, andrecordCount
islong
; and I guess if we no longer copyrecordCount
we may as well not copyfileSizeInBytes
which is anotherlong
. After this change since these two attributes return primitive type, they will return -1, which I'm not sure if it's the best thing to do.
I think the first approach is safer, but I wasn't sure if it's worth changing the behavior to keep the code simpler. Do you have a recommendation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we special-case record_count
? I don't think that record_count
should be dropped in copyWithoutStats
, but I also agree that simply selecting record_count
should not select all stats columns.
This set is primarily for ensuring that all stats required to filter the instances of DataFile
or DeleteFile
stored in a manifest are present. I think we should limit it to that purpose. I don't think that this list should affect the behavior or copyWithoutStats
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I'll add record_count
to the set so that the set tracks all stats required for filtering data/delete file, and special-case record_count
in dropStats()
to account for the case that selecting only record_count
will still result in copyWithoutStats
instead of a full copy. (This is actually the same as the current logic, just consolidate the two sets into one)
@@ -289,12 +296,12 @@ static boolean dropStats(Expression rowFilter, Collection<String> columns) { | |||
Sets.intersection(Sets.newHashSet(columns), STATS_COLUMNS).isEmpty(); | |||
} | |||
|
|||
private static Collection<String> withStatsColumns(Collection<String> columns) { | |||
static Collection<String> withStatsColumns(Collection<String> columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this returned List
, then we wouldn't need to copy the list in ManifestGroup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In one of the usage columns
directly comes from the field column
which is a collection so I didn't change that, but it seems that from all its current usage this field could be changed to a list. I'll need to update a few tests after this but I think it's doable. Will do!
@@ -60,7 +60,7 @@ public PartitionData copy() { | |||
private String filePath = null; | |||
private FileFormat format = null; | |||
private PartitionData partitionData = null; | |||
private Long recordCount = null; | |||
private long recordCount = -1L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should make this change. I like that the NullPointerException
ensures that an incorrect record count just can't be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly did this to simplify testing as I noticed fileSizeInBytes
which is also a long
has the pattern of defaulting to -1, but I don't have strong opinion either way. Will revert!
} | ||
|
||
private void assertNoStats(DataFile dataFile) { | ||
Assert.assertEquals(-1L, dataFile.recordCount()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should always contain the record count, even after copyWithoutStats
. That's primarily to drop the stats maps, which can be really large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code actually is not testing results from copyWithoutStats
, copyWithoutStats
results are actually tested by another method assertStatsDropped
which still keeps recordCount
. In the test case that uses this assertNoStats
, only select
but no filter
operation is applied to the manifest reader, so the reader doesn't project stats columns when reading (since manifest entries don't have to go through evaluators when no filter is applied), and thus only return the field being selected (in this case file_path
). But I can see that this name is confusing so I updated it a bit to hopefully makes things clearer.
bb09e68
to
76d6056
Compare
@@ -136,7 +137,7 @@ public PartitionSpec spec() { | |||
return spec; | |||
} | |||
|
|||
public ManifestReader<F> select(Collection<String> newColumns) { | |||
public ManifestReader<F> select(List<String> newColumns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was it necessary to change this to List
instead of Collection
? That's not a binary-compatible change. Couldn't we just make a copy of the collection if we need the field to be a list? I'm also not sure why columns
can't be a Collection
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I modified this when trying to address this comment, as I noticed that all usage of columns
could be done via list so I directly changed the type to avoid the list copying, without thinking about backward compatibility of this method. I guess your original suggestion was to move the copying from ManifestGroup
to ManifestReader
, but I misinterpreted it to get rid of the list copy completely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few small details, otherwise it looks good to me.
static final Set<String> STATS_COLUMNS = Sets.newHashSet( | ||
"value_counts", "null_value_counts", "nan_value_counts", "lower_bounds", "upper_bounds"); | ||
|
||
private static final Set<String> STATS_COLUMNS = Sets.newHashSet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use immutable set
Assert.assertNull(dataFile.lowerBounds()); | ||
Assert.assertNull(dataFile.upperBounds()); | ||
Assert.assertNull(dataFile.nanValueCounts()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: to minimize changes, no need to move line for this. Same comment for L206
Looks great, thanks @yyanyy! |
Please note that this changes the behavior ofrecordCount
inBaseFile
; originally ifBaseFile
was created by avro schema reflection without populatingrecordCount
, callingrecordCount()
will throw NPE because its return type is primitive. I'm currently following the same style asfileSizeInBytes
to return -1 when it is not populated.One implication of this is that the NPE problem described in the original comment will no longer exist, instead metrics evaluators will not filter out anything.Alternatively I can refrain from changing this and accept thatdata.recordCount()
could throw NPE in tests, or change the return type ofrecordCount()
to beLong
; I don't really have a strong preference so suggestions are welcome!