
[SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages #3146

Closed
aarondav wants to merge 3 commits into master from aarondav:free

Conversation

@aarondav (Contributor) commented Nov 7, 2014

This PR eliminates the network package's usage of the Java serializer and replaces it with Encodable, a lightweight binary protocol. Each message is preceded by a type id, which will allow us to change messages (by only adding new ones) or to change the format entirely by switching to a special id (such as -1).

This protocol has the advantage over Java serialization that we can guarantee messages will remain compatible across compiled versions and JVMs, though it does not provide a clean way to do schema migration. In the future, it may be good to use a more heavyweight serialization format like protobuf, thrift, or avro, but these all add several dependencies which are unnecessary at present.

Additionally, this unifies the RPC messages of NettyBlockTransferService and ExternalShuffleClient.
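
To make the framing concrete, here is a minimal sketch of the pattern, assuming Netty's ByteBuf; the names `Encodable`, `Framing`, and `frame` are illustrative, not the exact classes in this PR:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

// Each message knows its exact encoded size and writes itself into a buffer.
interface Encodable {
  int encodedLength();      // exact number of bytes encode() will write
  void encode(ByteBuf buf); // write this message's fields into the buffer
}

final class Framing {
  // Frame a message as [type id byte][body]. Readers switch on the id, so
  // new message types can be added later without breaking existing ones,
  // and a sentinel id such as -1 can introduce an entirely new format.
  static ByteBuf frame(byte typeId, Encodable msg) {
    ByteBuf buf = Unpooled.buffer(1 + msg.encodedLength());
    buf.writeByte(typeId);
    msg.encode(buf);
    return buf;
  }
}
```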

```diff
@@ -15,21 +15,24 @@
  * limitations under the License.
  */

-package org.apache.spark.network.shuffle;
+package org.apache.spark.network.shuffle.protocol;
```
@aarondav (Contributor, Author) commented on the diff:

Everything from here down is the systematic addition of encodedLength(), encode(), and decode() methods, plus updates to the unit tests that verify these methods are implemented correctly.
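
As an illustration of that pattern, here is a hedged sketch of one such message; the fields are invented for the example, it reuses the `Encodable` interface sketched above, and the real classes live under org.apache.spark.network.shuffle.protocol:

```java
import io.netty.buffer.ByteBuf;
import java.nio.charset.StandardCharsets;

// Hypothetical message showing the encodedLength()/encode()/decode() trio.
final class ExecutorRegistration implements Encodable {
  final String appId;
  final String execId;

  ExecutorRegistration(String appId, String execId) {
    this.appId = appId;
    this.execId = execId;
  }

  // Strings are written as a 4-byte length prefix followed by UTF-8 bytes.
  private static int stringLength(String s) {
    return 4 + s.getBytes(StandardCharsets.UTF_8).length;
  }

  private static void writeString(ByteBuf buf, String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    buf.writeInt(bytes.length);
    buf.writeBytes(bytes);
  }

  private static String readString(ByteBuf buf) {
    byte[] bytes = new byte[buf.readInt()];
    buf.readBytes(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  @Override
  public int encodedLength() {
    return stringLength(appId) + stringLength(execId);
  }

  @Override
  public void encode(ByteBuf buf) {
    writeString(buf, appId);
    writeString(buf, execId);
  }

  // decode() must consume exactly the bytes encode() wrote, in the same order;
  // the unit tests mentioned above round-trip each message to check this.
  static ExecutorRegistration decode(ByteBuf buf) {
    return new ExecutorRegistration(readString(buf), readString(buf));
  }
}
```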

@SparkQA commented Nov 7, 2014

Test build #23036 has started for PR 3146 at commit 538f2a3.

  • This patch merges cleanly.

@rxin (Contributor) commented Nov 7, 2014

LGTM.

@SparkQA commented Nov 7, 2014

Test build #23036 has finished for PR 3146 at commit 538f2a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23036/

@SparkQA commented Nov 7, 2014

Test build #23047 has started for PR 3146 at commit b8e2a49.

  • This patch merges cleanly.

@srowen (Member) commented Nov 7, 2014

This sort of seems like it's reinventing what Thrift or protobuf do. Also, why is it necessary to introduce another serialization-related interface just to customize the serialization? Not objecting so much as asking why you can't just override the serialization with a desired compact serialization, or use a library.

@aarondav (Contributor, Author) commented Nov 7, 2014

TL;DR: The goal is to keep the network package small, with minimal dependencies and minimal overhead to verify cross-version compatibility moving forward. It is my feeling that protobuf and thrift are expensive dependencies to have, and that Java serialization is harder to reason about.

The problem with using thrift or protobuf is inherently about dependencies. Protobuf dependencies are already a mess in Spark due to different, backwards-incompatible versions being used in Hadoop, Mesos, Akka, etc., and adding a real dependency in Spark just complicates the issue. Thrift is another relatively common dependency and has a few extra dependencies of its own, but I haven't explored that route as far. Since the code here is intended to work while running within other JVMs (e.g., YARN Node Manager), we want to keep dependencies down.

Other parts of the network package use the "Encodable" interface because they write directly to Netty, so this API is natural (decoding ByteBufs from an IO buffer, for instance). The choice of using Encodable here rather than implementing Externalizable/Serializable objects is for two reasons: simplicity and flexibility. First, the Java serialization framework brings a lot of baggage and has some non-obvious pitfalls, and accidental misuse may go unnoticed until the serialVersionUID mismatch errors arrive. Second, it is less obvious how to explicitly handle changes in classes between versions. Since we expect the shuffle service to be long-lived, we must be able to simply and straightforwardly verify that code will work in a cross-version manner, and I feel that is harder to prove when relying on Java serialization.

Finally, the thing that makes this problem tractable, in my opinion, is that we should never be serializing complex object graphs at this level of the API. Everything should ultimately be simple, primitive values with minimal to no abstract types. We're not trying to solve serialization of general objects, just serialization of small, mostly static messages. Arrays of Strings should be the most complicated things we have to serialize.
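
For a sense of that upper bound, a hedged sketch of the most complex case mentioned, an array of Strings, which needs nothing beyond nested length prefixes (assuming the same Netty ByteBuf as the earlier sketches; the codec class is invented for the example):

```java
import io.netty.buffer.ByteBuf;
import java.nio.charset.StandardCharsets;

final class StringArrayCodec {
  // A String[] is just a count followed by each length-prefixed UTF-8
  // string; no object graphs, no reflection, no versioned class metadata.
  static void encode(ByteBuf buf, String[] strings) {
    buf.writeInt(strings.length);
    for (String s : strings) {
      byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
      buf.writeInt(bytes.length);
      buf.writeBytes(bytes);
    }
  }

  // Reads back exactly what encode() wrote, in the same order.
  static String[] decode(ByteBuf buf) {
    String[] out = new String[buf.readInt()];
    for (int i = 0; i < out.length; i++) {
      byte[] bytes = new byte[buf.readInt()];
      buf.readBytes(bytes);
      out[i] = new String(bytes, StandardCharsets.UTF_8);
    }
    return out;
  }
}
```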

@srowen (Member) commented Nov 7, 2014

Thanks! Good to hear the reasoning. It is indeed light, and the use case is not quite the same as the usual general serialization use cases.

@rxin (Contributor) commented Nov 7, 2014

@srowen I was actually initially for protobuf or avro, but looking at the dependency list, it'd be really hard to guarantee compatibility in the future. Given that the number of messages we are actually serializing is very small, the work to do a custom serialization protocol is very contained.

@SparkQA commented Nov 7, 2014

Test build #23047 has finished for PR 3146 at commit b8e2a49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23047/

@SparkQA commented Nov 7, 2014

Test build #23051 has started for PR 3146 at commit ed1102a.

  • This patch merges cleanly.

@SparkQA commented Nov 7, 2014

Test build #23051 has finished for PR 3146 at commit ed1102a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23051/

@rxin (Contributor) commented Nov 7, 2014

Merging in master & branch-1.2. Thanks.

asfgit pushed a commit that referenced this pull request Nov 7, 2014
[SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages

This PR eliminates the network package's usage of the Java serializer and replaces it with Encodable, a lightweight binary protocol. Each message is preceded by a type id, which will allow us to change messages (by only adding new ones) or to change the format entirely by switching to a special id (such as -1).

This protocol has the advantage over Java serialization that we can guarantee messages will remain compatible across compiled versions and JVMs, though it does not provide a clean way to do schema migration. In the future, it may be good to use a more heavyweight serialization format like protobuf, thrift, or avro, but these all add several dependencies which are unnecessary at present.

Additionally, this unifies the RPC messages of NettyBlockTransferService and ExternalShuffleClient.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3146 from aarondav/free and squashes the following commits:

ed1102a [Aaron Davidson] Remove some unused imports
b8e2a49 [Aaron Davidson] Add appId to test
538f2a3 [Aaron Davidson] [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages

(cherry picked from commit d4fa04e)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit closed this in d4fa04e Nov 7, 2014
dongjoon-hyun pushed a commit that referenced this pull request Apr 14, 2024
### What changes were proposed in this pull request?
The pr aims to upgrade `postgresql` from `42.7.2` to `42.7.3`.

### Why are the changes needed?
The version `42.7.3` full release notes:
https://jdbc.postgresql.org/changelogs/2024-03-14-42.7.3-release/
- fix: boolean types not handled in SimpleQuery mode [PR #3146](pgjdbc/pgjdbc#3146)
  - make sure we handle boolean types in simple query mode
  - support uuid as well
  - handle all well-known types in text mode and change else-if to switch
- fix: released new versions of 42.2.29, 42.3.10, 42.4.5, 42.5.6, and 42.6.2 to deal with NoSuchMethodError on ByteBuffer#position when running on Java 8

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46038 from panbingkun/postgresql_upgrade.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>