Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify the serialization of ScalarValue::List #3547

Merged
merged 1 commit into from
Oct 8, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Sep 20, 2022

Draft as it builds on #3536 and doesn't pass all tests

Which issue does this PR close?

Re #3531

Rationale for this change

While working on #3531 I found the ScalarValue::List serialization code quite challenging to understand as it did not match the structure of ScalarValue::List. It can be drastically simplified and maintain the same semantics

I also hope a final PR in this series to remove PrimitiveScalarType but I didn't quite make it today and ran out of time

What changes are included in this PR?

  1. Remove special case null handling for ScalarValue::List (it is already covered)

Are there any user-facing changes?

No

@alamb alamb force-pushed the alamb/clean_up_list_serialization branch 3 times, most recently from 1c36f13 to b30172d Compare September 25, 2022 13:22
@@ -714,6 +714,9 @@ message Union{
}

message ScalarListValue{
// encode null explicitly to distinguish a list with a null value
// from a list with no values)
bool is_null = 3;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe much of the complexity of the code to serialize List values was related to trying to figure out NULL values. Encoding it explicitly makes the code much simpler

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a null even being encoded at this level, and not using the null_value support at the ScalarValue level?

Copy link
Contributor Author

@alamb alamb Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short answer is I don't know -- the longer answer is that the null_value support at the scalar level requires a strange mirror type (PrimitiveScalarValue) which doesn't have the same support for types as the normal DataType (specifically it doesn't have a List equivalent) -- I am working in the background to remove that structure in #3612 and I will also try to remove this explict null coding as well

@@ -825,17 +828,6 @@ enum PrimitiveScalarType{
TIME64 = 27;
}

message ScalarType{
Copy link
Contributor Author

@alamb alamb Sep 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why but the previous code was encoding ScalarValue::List using a different pattern than the struct. This PR changes the serialization code to use the same structure as ScalarValue::List -- the Field and the a list of scalar values, resulting in a drastic simplification

@@ -838,6 +754,23 @@ impl TryFrom<&protobuf::ScalarValue> for ScalarValue {
}
}

/// Ensures that all `values` are of type DataType::List and have the
/// same type as field
fn validate_list_values(field: &Field, values: &[ScalarValue]) -> Result<(), Error> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how valuable it is to do this validation if serialized ScalarValue always came from DataFusion, but I wanted to keep the same semantics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a weird edge case but if it did happen it would be hard to debug from whatever error happened downstream so I think it's worthwhile to do here.

"The value {:?} should not have been able to serialize. Serialized to :{:?}",
test_case, res
);
let proto: Result<super::protobuf::ScalarValue, super::to_proto::Error> =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the validation is now done during deserialization rather than serialization so I had to change this test slightly

@@ -480,14 +492,11 @@ mod roundtrip_tests {
ScalarValue::Float32(Some(2.0)),
ScalarValue::Float32(Some(1.0)),
]),
DataType::List(new_box_field("level1", DataType::Float32, true)),
DataType::Float32,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not valid ScalarValue::List structures -- specifically, the type of the field on a List is the type of the element (aka it is not DataType::List(type of element), which is what this test was previously doing).

You can see correct usages in the rest of the codebase https://github.com/apache/arrow-datafusion/search?q=new_list

for example

https://github.com/apache/arrow-datafusion/blob/e0a9fa347f7adb1e6d6157b537e44c574e88244a/datafusion/physical-expr/src/aggregate/array_agg.rs#L174-L183

)),
];

let should_fail: Vec<DataType> = vec![
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is for serializing into a list specific protobuf that was removed in this PR

))
}
}
(scalar::ScalarValue::Boolean(_), DataType::Boolean) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not clear why there was other type specific code to serializing lists - after this PR the existing code for serializing other types are used

@codecov-commenter
Copy link

codecov-commenter commented Sep 25, 2022

Codecov Report

❗ No coverage uploaded for pull request base (master@b4c0601). Click here to learn what that means.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##             master    #3547   +/-   ##
=========================================
  Coverage          ?   86.08%           
=========================================
  Files             ?      300           
  Lines             ?    56329           
  Branches          ?        0           
=========================================
  Hits              ?    48489           
  Misses            ?     7840           
  Partials          ?        0           
Impacted Files Coverage Δ
datafusion/proto/src/from_proto.rs 42.29% <100.00%> (ø)
datafusion/proto/src/lib.rs 93.93% <100.00%> (ø)
datafusion/proto/src/to_proto.rs 58.42% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@alamb alamb force-pushed the alamb/clean_up_list_serialization branch from b30172d to 9927216 Compare September 28, 2022 20:19
@@ -714,6 +714,9 @@ message Union{
}

message ScalarListValue{
// encode null explicitly to distinguish a list with a null value
// from a list with no values)
bool is_null = 3;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope to remove this when I remove PrimitiveScalarType and use ArrowType instead

@alamb alamb marked this pull request as ready for review September 29, 2022 11:58
@alamb
Copy link
Contributor Author

alamb commented Sep 29, 2022

This PR is now ready for review. It cleans up a lot of this code -- I think I can push one more PR after this that will clean up even more

@alamb alamb requested a review from tustvold September 30, 2022 20:15
@alamb
Copy link
Contributor Author

alamb commented Oct 7, 2022

Unless there are any comments on this PR I plan to merge it tomorrow

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much simpler. Very nice!

@@ -838,6 +754,23 @@ impl TryFrom<&protobuf::ScalarValue> for ScalarValue {
}
}

/// Ensures that all `values` are of type DataType::List and have the
/// same type as field
fn validate_list_values(field: &Field, values: &[ScalarValue]) -> Result<(), Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a weird edge case but if it did happen it would be hard to debug from whatever error happened downstream so I think it's worthwhile to do here.

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that I had reviewed this ticket a long time ago but forgot to leave an approval.

LGTM, a very nice simplification -- Very sorry for the delay in pushing this ticket to be merged

@alamb
Copy link
Contributor Author

alamb commented Oct 8, 2022

Thanks @thinkharderdev and @xudong963

@alamb alamb merged commit e54110f into apache:master Oct 8, 2022
@alamb alamb deleted the alamb/clean_up_list_serialization branch October 8, 2022 10:01
@ursabot
Copy link

ursabot commented Oct 8, 2022

Benchmark runs are scheduled for baseline = 488b2ce and contender = e54110f. e54110f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants