[SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages #3146
Conversation
…rvice messages This PR eliminates the network package's usage of the Java serializer and replaces it with Encodable, a lightweight binary protocol. Each message is preceded by a type id, which will allow us to change messages (by only adding new ones), or to change the format entirely by switching to a special id (such as -1). This protocol has the advantage over Java serialization that we can guarantee messages will remain compatible across compiled versions and JVMs, though it does not provide a clean way to do schema migration. In the future, it may be good to use a more heavyweight serialization format like protobuf, thrift, or avro, but these all add several dependencies which are unnecessary at the present time.
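As a rough sketch of the pattern described above (a minimal `Encodable` interface plus type-id framing), assuming hypothetical names and using `java.nio.ByteBuffer` to stay dependency-free, whereas the real Spark code works on Netty ByteBufs:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the Encodable idea: each message knows its exact
// encoded size and how to write/read itself. Names here are illustrative,
// not the actual Spark classes.
interface Encodable {
    int encodedLength();          // exact size of the encoded body in bytes
    void encode(ByteBuffer buf);  // write the body into buf
}

// Example message carrying an application id.
class OpenBlocks implements Encodable {
    static final byte TYPE_ID = 0;  // assumed id, for illustration only
    final String appId;

    OpenBlocks(String appId) { this.appId = appId; }

    public int encodedLength() {
        // 4-byte length prefix + UTF-8 bytes of appId
        return 4 + appId.getBytes(StandardCharsets.UTF_8).length;
    }

    public void encode(ByteBuffer buf) {
        byte[] bytes = appId.getBytes(StandardCharsets.UTF_8);
        buf.putInt(bytes.length);
        buf.put(bytes);
    }

    static OpenBlocks decode(ByteBuffer buf) {
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return new OpenBlocks(new String(bytes, StandardCharsets.UTF_8));
    }
}

public class EncodableDemo {
    // Frame a message as [type id][body]; a reader dispatches on the id,
    // so new message types can be added without breaking old readers.
    static ByteBuffer toFrame(byte typeId, Encodable msg) {
        ByteBuffer buf = ByteBuffer.allocate(1 + msg.encodedLength());
        buf.put(typeId);
        msg.encode(buf);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer frame = toFrame(OpenBlocks.TYPE_ID, new OpenBlocks("app-1"));
        byte id = frame.get();               // read the type id back
        OpenBlocks decoded = OpenBlocks.decode(frame);
        System.out.println(id + " " + decoded.appId);
    }
}
```

The one-byte type id is what makes the "change the format entirely by switching to a special id" escape hatch possible.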
@@ -15,21 +15,24 @@
 * limitations under the License.
 */

-package org.apache.spark.network.shuffle;
+package org.apache.spark.network.shuffle.protocol;
Everything from here down is the systematic addition of encodedLength(), encode(), and decode() methods, plus updates to the unit tests, which verify that those methods are implemented correctly.
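A minimal sketch of the round-trip check those unit tests perform, under assumed message names (not the actual Spark classes): encode into a buffer sized exactly by encodedLength(), then decode and compare.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical message with a string and an int field, used to illustrate
// the encode/decode round-trip the tests verify.
class SampleMessage {
    final String appId;
    final int port;

    SampleMessage(String appId, int port) { this.appId = appId; this.port = port; }

    int encodedLength() {
        // 4-byte string length prefix + UTF-8 bytes + 4-byte int
        return 4 + appId.getBytes(StandardCharsets.UTF_8).length + 4;
    }

    void encode(ByteBuffer buf) {
        byte[] b = appId.getBytes(StandardCharsets.UTF_8);
        buf.putInt(b.length);
        buf.put(b);
        buf.putInt(port);
    }

    static SampleMessage decode(ByteBuffer buf) {
        byte[] b = new byte[buf.getInt()];
        buf.get(b);
        String appId = new String(b, StandardCharsets.UTF_8);
        return new SampleMessage(appId, buf.getInt());
    }
}

public class RoundTrip {
    public static void main(String[] args) {
        SampleMessage msg = new SampleMessage("app-42", 7337);
        ByteBuffer buf = ByteBuffer.allocate(msg.encodedLength());
        msg.encode(buf);
        // encodedLength() must match exactly what encode() wrote
        if (buf.remaining() != 0) throw new AssertionError("length mismatch");
        buf.flip();
        SampleMessage out = SampleMessage.decode(buf);
        System.out.println(out.appId + ":" + out.port);
    }
}
```

Sizing the buffer with encodedLength() and then asserting it is exactly full is what catches a mismatch between the length and encode methods.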
Test build #23036 has started for PR 3146 at commit
LGTM.
Test build #23036 has finished for PR 3146 at commit
Test FAILed.
Test build #23047 has started for PR 3146 at commit
This sort of seems like it's reinventing what Thrift or protobuf do. Also, why is it necessary to introduce another serialization-related interface just to customize the serialization? Not objecting so much as asking why you can't just override the serialization with a desired compact serialization, or use a library.
TL;DR: The goal is to keep the network package small, with minimal dependencies and minimal overhead to verify cross-version compatibility moving forward. My feeling is that protobuf and thrift are expensive dependencies to take on, and that Java serialization is harder to reason about.

The problem with using thrift or protobuf is inherently about dependencies. Protobuf dependencies are already a mess in Spark due to different, backwards-incompatible versions being used in Hadoop, Mesos, Akka, etc., and adding a real dependency in Spark would just complicate the issue. Thrift is another relatively common dependency and has a few extra dependencies of its own, but I haven't explored that route as far. Since the code here is intended to run inside other JVMs (e.g., the YARN NodeManager), we want to keep dependencies down.

Other parts of the network package use the Encodable interface because they write directly to Netty, so this API is natural there (decoding ByteBufs from an IO buffer, for instance). The choice of Encodable here, rather than implementing Externalizable/Serializable objects, comes down to two reasons: simplicity and flexibility. First, the Java serialization framework brings a lot of baggage and has some non-obvious pitfalls, and accidental misuse may go unnoticed until serial version id mismatch errors arrive. Second, it is less obvious how to explicitly handle changes in classes between versions. Since we expect the shuffle service to be long-lived, we must be able to simply and straightforwardly verify that code will work in a cross-version manner, and I feel that is harder to prove when relying on Java serialization.

Finally, the thing that makes this problem tractable, in my opinion, is that we should never be serializing complex object graphs at this level of the API. Everything should ultimately be simple, primitive values with minimal to no abstract types. We're not trying to solve serialization of general objects, just serialization of small, mostly static messages. Arrays of Strings should be the most complicated things we have to serialize.
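A sketch of what encoding a String array looks like under such a scheme, the most complex shape the discussion above expects, using an assumed layout (element count, then length-prefixed UTF-8 strings) and plain `java.nio.ByteBuffer` rather than the Netty ByteBufs the real code uses:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical String-array codec: a 4-byte count, then each string as a
// 4-byte length prefix followed by its UTF-8 bytes.
public class StringArrayCodec {
    static int encodedLength(String[] strings) {
        int len = 4; // element count
        for (String s : strings) {
            len += 4 + s.getBytes(StandardCharsets.UTF_8).length;
        }
        return len;
    }

    static void encode(String[] strings, ByteBuffer buf) {
        buf.putInt(strings.length);
        for (String s : strings) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            buf.putInt(b.length);
            buf.put(b);
        }
    }

    static String[] decode(ByteBuffer buf) {
        String[] out = new String[buf.getInt()];
        for (int i = 0; i < out.length; i++) {
            byte[] b = new byte[buf.getInt()];
            buf.get(b);
            out[i] = new String(b, StandardCharsets.UTF_8);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] blockIds = {"shuffle_0_0_0", "shuffle_0_1_0"};
        ByteBuffer buf = ByteBuffer.allocate(encodedLength(blockIds));
        encode(blockIds, buf);
        buf.flip();
        System.out.println(Arrays.equals(blockIds, decode(buf)));
    }
}
```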
Thanks! Good to hear the reasoning. It is indeed lightweight, and the use case is not quite the same as the usual general serialization use cases.
@srowen I was initially actually for protobuf or avro, but looking at the dependency list, it'd be really hard to guarantee compatibility in the future. Given that the number of messages we are actually serializing is very small, the work to do a custom serialization protocol is very contained.
Test build #23047 has finished for PR 3146 at commit
Test PASSed.
Test build #23051 has started for PR 3146 at commit
Test build #23051 has finished for PR 3146 at commit
Test PASSed.
Merging in master & branch-1.2. Thanks.
Additionally this unifies the RPC messages of NettyBlockTransferService and ExternalShuffleClient.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3146 from aarondav/free and squashes the following commits:
ed1102a [Aaron Davidson] Remove some unused imports
b8e2a49 [Aaron Davidson] Add appId to test
538f2a3 [Aaron Davidson] [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages

(cherry picked from commit d4fa04e)
Signed-off-by: Reynold Xin <rxin@databricks.com>