Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42406][SQL] Fix check for missing required fields of to_protobuf #40080

Closed
wants to merge 3 commits into from

Conversation

rangadi
Copy link
Contributor

@rangadi rangadi commented Feb 19, 2023

What changes were proposed in this pull request?

Protobuf serializer (used in to_protobuf()) should error if non-nullable fields (i.e. protobuf required fields) are present in the schema of the catalyst record being converted to a protobuf.

But isNullable() method used for this check returns opposite (see PR comment in the diff). As a result, Serializer incorrectly requires the fields that are optional. This PR fixes this check (see PR comment in the diff).

This also requires corresponding fix for couple of unit tests. In order use a Protobuf message with a required field, Protobuf version 2 file proto2_messages.proto is added.
Two tests are updated to verify missing required fields results in an error.

Why are the changes needed?

This is need to fix a bug where we were incorrectly enforcing a schema check on optional fields rather than on required fields.

Does this PR introduce any user-facing change?

It fixes a bug, and gives more flexibility for user queries.

How was this patch tested?

  • Updated unit tests

Add a proto2 file to verify required field.
@rangadi
Copy link
Contributor Author

rangadi commented Feb 19, 2023

Copy link
Contributor Author

@rangadi rangadi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added inline comment to help with review.


/** Return true if `fieldDescriptor` is optional. */
private[protobuf] def isNullable(fieldDescriptor: FieldDescriptor): Boolean =
!fieldDescriptor.isOptional
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bug. extra ! returns false for optional field. But optional fields are nullable. This is fixed at line #103 above.

@@ -0,0 +1,8 @@

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This a descriptor file for proto2_messages.proto

* limitations under the License.
*/

syntax = "proto2";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protobuf v2 file is added to test required fields.

@@ -59,17 +59,6 @@ message TypeMiss {
int64 bar = 1;
}

/* Field boo missing from SQL root, but available in Protobuf root*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two are moved to proto2_messages.proto (and renamed)

@SandishKumarHN
Copy link
Contributor

@rangadi with this change, we can conclude that spark-protobuf supports proto2 and proto3 versions right?

Copy link
Contributor

@SandishKumarHN SandishKumarHN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rangadi rangadi changed the title [SPARK-42406] Fix check for missing required fields. [SPARK-42406][3.4] Fix check for missing required fields. Feb 19, 2023
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @rangadi . Please remove [3.4] from the PR title. Apache Spark community uses a branch tag like [3.4] only when the PR is open for the release branch, branch-3.4. This PR is mis-using it as a target version.

@rangadi rangadi changed the title [SPARK-42406][3.4] Fix check for missing required fields. [SPARK-42406]Fix check for missing required fields. Feb 20, 2023
@rangadi
Copy link
Contributor Author

rangadi commented Feb 20, 2023

@dongjoon-hyun, done. Remove 3.4 in the summary.

@rangadi with this change, we can conclude that spark-protobuf supports proto2 and proto3 versions right?

Yes. And we always supported Proto2.

@dongjoon-hyun
Copy link
Member

Thank you for updates, @rangadi .

cc @hvanhovell

@rangadi
Copy link
Contributor Author

rangadi commented Feb 21, 2023

@gengliangwang could you take a quick look and merge in to master and 3.4.
Currently, it looks like you are the only one who can merge Protobuf patches. Please suggest someone else also for next time.

@@ -144,44 +146,52 @@ class ProtobufSerdeSuite extends SharedSparkSession with ProtobufTestBase {
test("Fail to convert with missing Catalyst fields") {
val protoFile = ProtobufUtils.buildDescriptor(testFileDesc, "FieldMissingInSQLRoot")

// serializing with extra fails if extra field is missing in SQL Schema
val foobarSQLType = DataType.fromDDL(
"struct<foo string>" // "bar" is missing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we test catalyst schema which contains bar only and there is a result returned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you describe a bit more?
Do we want to test not having 'foo' field which is optional?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new test at line #69. PTAL.

@gengliangwang
Copy link
Member

@rangadi can you update the PR description to follow the PR template?

@rangadi
Copy link
Contributor Author

rangadi commented Feb 21, 2023

@rangadi can you update the PR description to follow the PR template?

@gengliangwang update the summary to follow the template. Sorry about that.
Also added a new test. PTAL.

@gengliangwang gengliangwang changed the title [SPARK-42406]Fix check for missing required fields. [SPARK-42406][SQL] Fix check for missing required fields of to_protobuf Feb 21, 2023
@gengliangwang
Copy link
Member

Thanks, merging to master/3.4

gengliangwang pushed a commit that referenced this pull request Feb 22, 2023
### What changes were proposed in this pull request?

Protobuf serializer (used in `to_protobuf()`) should error if non-nullable fields (i.e. protobuf `required` fields) are present in the schema of the catalyst record being converted to a protobuf.

But `isNullable()` method used for this check returns opposite (see PR comment in the diff).  As a result, Serializer incorrectly requires the fields that are optional. This PR fixes this check (see PR comment in the diff).

This also requires corresponding fix for couple of unit tests. In order use a Protobuf message with a `required` field, Protobuf version 2 file `proto2_messages.proto` is added.
Two tests are updated to verify missing required fields results in an error.

### Why are the changes needed?

This is need to fix a bug where we were incorrectly enforcing a schema check on optional fields rather than on required fields.

### Does this PR introduce _any_ user-facing change?

It fixes a bug, and gives more flexibility for user queries.

### How was this patch tested?
 - Updated unit tests

Closes #40080 from rangadi/fix-required-field-check.

Authored-by: Raghu Angadi <raghu.angadi@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit fb56477)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
### What changes were proposed in this pull request?

Protobuf serializer (used in `to_protobuf()`) should error if non-nullable fields (i.e. protobuf `required` fields) are present in the schema of the catalyst record being converted to a protobuf.

But `isNullable()` method used for this check returns opposite (see PR comment in the diff).  As a result, Serializer incorrectly requires the fields that are optional. This PR fixes this check (see PR comment in the diff).

This also requires corresponding fix for couple of unit tests. In order use a Protobuf message with a `required` field, Protobuf version 2 file `proto2_messages.proto` is added.
Two tests are updated to verify missing required fields results in an error.

### Why are the changes needed?

This is need to fix a bug where we were incorrectly enforcing a schema check on optional fields rather than on required fields.

### Does this PR introduce _any_ user-facing change?

It fixes a bug, and gives more flexibility for user queries.

### How was this patch tested?
 - Updated unit tests

Closes apache#40080 from rangadi/fix-required-field-check.

Authored-by: Raghu Angadi <raghu.angadi@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit fb56477)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants