-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify the serialization of ScalarValue::List
#3547
Conversation
1c36f13
to
b30172d
Compare
@@ -714,6 +714,9 @@ message Union{ | |||
} | |||
|
|||
message ScalarListValue{ | |||
// encode null explicitly to distinguish a list with a null value | |||
// from a list with no values) | |||
bool is_null = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe much of the complexity of the code to serialize List values was related to trying to figure out NULL values. Encoding it explicitly makes the code much simpler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is a null even being encoded at this level, and not using the null_value support at the ScalarValue level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The short answer is I don't know -- the longer answer is that the null_value
support at the scalar level requires a strange mirror type (PrimitiveScalarValue
) which doesn't have the same support for types as the normal DataType
(specifically it doesn't have a List
equivalent) -- I am working in the background to remove that structure in #3612 and I will also try to remove this explict null coding as well
@@ -825,17 +828,6 @@ enum PrimitiveScalarType{ | |||
TIME64 = 27; | |||
} | |||
|
|||
message ScalarType{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why but the previous code was encoding ScalarValue::List
using a different pattern than the struct. This PR changes the serialization code to use the same structure as ScalarValue::List
-- the Field and the a list of scalar values, resulting in a drastic simplification
@@ -838,6 +754,23 @@ impl TryFrom<&protobuf::ScalarValue> for ScalarValue { | |||
} | |||
} | |||
|
|||
/// Ensures that all `values` are of type DataType::List and have the | |||
/// same type as field | |||
fn validate_list_values(field: &Field, values: &[ScalarValue]) -> Result<(), Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how valuable it is to do this validation if serialized ScalarValue
always came from DataFusion, but I wanted to keep the same semantics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a weird edge case but if it did happen it would be hard to debug from whatever error happened downstream so I think it's worthwhile to do here.
"The value {:?} should not have been able to serialize. Serialized to :{:?}", | ||
test_case, res | ||
); | ||
let proto: Result<super::protobuf::ScalarValue, super::to_proto::Error> = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the validation is now done during deserialization rather than serialization so I had to change this test slightly
@@ -480,14 +492,11 @@ mod roundtrip_tests { | |||
ScalarValue::Float32(Some(2.0)), | |||
ScalarValue::Float32(Some(1.0)), | |||
]), | |||
DataType::List(new_box_field("level1", DataType::Float32, true)), | |||
DataType::Float32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not valid ScalarValue::List
structures -- specifically, the type of the field on a List
is the type of the element (aka it is not DataType::List(type of element), which is what this test was previously doing).
You can see correct usages in the rest of the codebase https://github.com/apache/arrow-datafusion/search?q=new_list
for example
)), | ||
]; | ||
|
||
let should_fail: Vec<DataType> = vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is for serializing into a list specific protobuf that was removed in this PR
)) | ||
} | ||
} | ||
(scalar::ScalarValue::Boolean(_), DataType::Boolean) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not clear why there was other type specific code to serializing lists - after this PR the existing code for serializing other types are used
Codecov Report
@@ Coverage Diff @@
## master #3547 +/- ##
=========================================
Coverage ? 86.08%
=========================================
Files ? 300
Lines ? 56329
Branches ? 0
=========================================
Hits ? 48489
Misses ? 7840
Partials ? 0
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
b30172d
to
9927216
Compare
@@ -714,6 +714,9 @@ message Union{ | |||
} | |||
|
|||
message ScalarListValue{ | |||
// encode null explicitly to distinguish a list with a null value | |||
// from a list with no values) | |||
bool is_null = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope to remove this when I remove PrimitiveScalarType
and use ArrowType
instead
This PR is now ready for review. It cleans up a lot of this code -- I think I can push one more PR after this that will clean up even more |
Unless there are any comments on this PR I plan to merge it tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much simpler. Very nice!
@@ -838,6 +754,23 @@ impl TryFrom<&protobuf::ScalarValue> for ScalarValue { | |||
} | |||
} | |||
|
|||
/// Ensures that all `values` are of type DataType::List and have the | |||
/// same type as field | |||
fn validate_list_values(field: &Field, values: &[ScalarValue]) -> Result<(), Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a weird edge case but if it did happen it would be hard to debug from whatever error happened downstream so I think it's worthwhile to do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression that I had reviewed this ticket a long time ago but forgot to leave an approval.
LGTM, a very nice simplification -- Very sorry for the delay in pushing this ticket to be merged
Thanks @thinkharderdev and @xudong963 |
Benchmark runs are scheduled for baseline = 488b2ce and contender = e54110f. e54110f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Draft as it builds on #3536 and doesn't pass all testsWhich issue does this PR close?
Re #3531
Rationale for this change
While working on #3531 I found the
ScalarValue::List
serialization code quite challenging to understand as it did not match the structure ofScalarValue::List
. It can be drastically simplified and maintain the same semanticsI also hope a final PR in this series to remove
PrimitiveScalarType
but I didn't quite make it today and ran out of timeWhat changes are included in this PR?
Are there any user-facing changes?
No