
[SparkConnect] Initial Protobuf Definitions #37075

Closed
wants to merge 2 commits

Conversation

grundprinzip
Contributor

WIP

What changes were proposed in this pull request?

This is the first patch for Spark Connect that adds the base protocol buffer definitions.

Why are the changes needed?

These changes are part of the Spark Connect efforts.

Does this PR introduce any user-facing change?

No

How was this patch tested?

No tests at this point in time.

@AmplabJenkins

Can one of the admins verify this patch?

sql/core/pom.xml (outdated)
@grundprinzip
Contributor Author

Added the missing license headers and fixed the M1 build.

@HyukjinKwon HyukjinKwon marked this pull request as draft July 6, 2022 00:27
@HyukjinKwon
Copy link
Member

Let me turn this into a draft since it's WIP.

@@ -116,7 +116,7 @@
     <log4j.version>2.17.2</log4j.version>
     <!-- make sure to update IsolatedClientLoader whenever this version is changed -->
     <hadoop.version>3.3.3</hadoop.version>
-    <protobuf.version>2.5.0</protobuf.version>
+    <protobuf.version>3.21.1</protobuf.version>
Contributor

@LuciferYang Jul 6, 2022

Maybe we should shade and relocate protobuf in Spark to avoid potential conflicts with other third-party libraries, such as Hadoop.

Contributor Author

@grundprinzip Jul 6, 2022

Absolutely; the shading and relocation rules haven't been updated yet. I'm still a bit unclear on the best way to proceed to avoid conflicts with third-party packages or Spark consumers. I've discussed some approaches: one way would be to produce a shaded Spark Connect artifact that is then consumed in its shaded form by Spark itself; another would be to shade and relocate after the build.

However, this PR is mostly for discussing the proto interface.

Contributor

Got it

Contributor

Hadoop HDFS has its own shaded copy now, so it doesn't care (*); a protobuf upgrade is incompatible with any code compiled against the earlier version, so it must be shaded somehow.

(*) More specifically, it has a new problem: how to safely upgrade that shaded hadoop-thirdparty jar with Guava, protobuf, etc.

Contributor

a protobuf upgrade is incompatible with any code compiled against the earlier version, so it must be shaded somehow.

@steveloughran what does this mean? Could you point to more details?
If I shade 'com.google.protobuf' in my jar, I can update it anytime, right?

Contributor

Shaded, yes. If unshaded, then if you update protobuf.jar, all .class files compiled against the older version of protobuf are unlikely to link, let alone work.

Metrics metrics = 3;

// Batch results of metrics.
message ArrowBatch {
Contributor

Curious: is the data in the Arrow format, and is that why we call it ArrowBatch here? Normally we should have the same schema across batches, right? Why do we need to store a schema field per batch?

Contributor Author

Yes, the desired format is actually Arrow IPC streams, which include the schema directly. I will address this in this PR.
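For illustration, a minimal sketch of what an IPC-based batch message could look like; the field names here are assumptions, not part of this PR:

// Hypothetical sketch: with Arrow IPC streams the schema travels inside
// the stream bytes, so no per-batch schema field is needed.
message ArrowBatch {
  // Number of rows contained in this batch.
  int64 row_count = 1;
  // Serialized Arrow IPC stream; the IPC format carries the schema itself.
  bytes data = 2;
}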

JOIN_TYPE_OUTER = 2;
JOIN_TYPE_LEFT_OUTER = 3;
JOIN_TYPE_RIGHT_OUTER = 4;
JOIN_TYPE_ANTI = 5;
Contributor

nit: we can also have JOIN_TYPE_SEMI.

enum JoinType {
JOIN_TYPE_UNSPECIFIED = 0;
JOIN_TYPE_INNER = 1;
JOIN_TYPE_OUTER = 2;
Contributor

nit: JOIN_TYPE_FULL_OUTER sounds clearer.
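Taken together, the two nits above would make the enum look roughly like this sketch (the renamed and added values are the reviewers' suggestions, not what the PR currently contains):

enum JoinType {
  JOIN_TYPE_UNSPECIFIED = 0;
  JOIN_TYPE_INNER = 1;
  // Renamed from JOIN_TYPE_OUTER per the review suggestion above.
  JOIN_TYPE_FULL_OUTER = 2;
  JOIN_TYPE_LEFT_OUTER = 3;
  JOIN_TYPE_RIGHT_OUTER = 4;
  JOIN_TYPE_ANTI = 5;
  // Added per the review suggestion above.
  JOIN_TYPE_SEMI = 6;
}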

/*
Relation of type [[Sort]].
*/
message Sort {
Contributor

we need to differentiate local vs global sort right?

Contributor Author

For now, I don't think we need to differentiate between local and global sort here, as this always applies the sort to the full relation. Pushing down a local sort, from what I recall, is a physical optimization.
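If the logical distinction were ever needed, one purely illustrative option (not part of this PR; the input and ordering fields are assumptions) would be a flag on the message:

message Sort {
  // Input relation to sort (assumed field).
  Relation input = 1;
  // Sort expressions, e.g. column references with direction (assumed field).
  repeated Expression sort_fields = 2;
  // Hypothetical: false would allow a partition-local sort; this PR
  // currently treats every sort as global over the full relation.
  bool is_global = 3;
}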


message AggregateFunction {
string name = 1;
repeated Expression arguments = 2;
Contributor

An aggregate function (UDAF) would be very different from a UDF/expression, right?

Contributor Author

From my experiments, a UDAF registered by name should just work, as the AggregateFunction is "unresolved". I might make this clearer by calling it UnresolvedAggregateFunction.
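The suggested rename, sketched for illustration:

// Hypothetical rename: "Unresolved" signals that the server resolves the
// function by name, which is why UDAFs registered by name should just work.
message UnresolvedAggregateFunction {
  string name = 1;
  repeated Expression arguments = 2;
}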


When adding new relation types, they have to be registered here.
*/
message Relation {
Contributor Author

add ID for incremental plan building.
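A speculative sketch of what such an ID could look like (the message and field names are assumptions, not part of this PR):

// Illustrative only: a client-assigned ID on each relation would let a
// follow-up request reference and extend a previously submitted plan
// instead of resending it in full.
message RelationCommon {
  // Hypothetical unique identifier for this relation node.
  string id = 1;
}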

}

// A request to be executed by the service.
message Request {
Contributor Author

Add an option to pass Spark conf values as part of the request?
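For discussion, a minimal sketch of how that could look (the field name and number are assumptions, and the existing Request fields are omitted):

message Request {
  // Hypothetical: Spark conf values applied for the scope of this request,
  // e.g. "spark.sql.shuffle.partitions" -> "200".
  map<string, string> conf = 99;
}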


service SparkConnectService {

// Executes a request that contains the query and returns a stream of [[Response]].
rpc ExecutePlan(Request) returns (stream Response) {}
Contributor Author

I suggest renaming this to something that avoids having "Plan" in the request, as it can be confusing.
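One illustrative possibility (nothing is decided in this PR; the name Execute is an assumption):

service SparkConnectService {
  // Hypothetical rename that keeps "Plan" out of the RPC name while the
  // payload remains a plain Request.
  rpc Execute(Request) returns (stream Response) {}
}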

@@ -0,0 +1,155 @@
// Protocol Buffers - Google's data interchange format
Member

Shall we change the header here?
