
[SPARK-49960][SQL] Provide extension point for custom AgnosticEncoder serde #48477

Status: Open. Wants to merge 2 commits into base: master.

Conversation

chris-twiner commented Oct 15, 2024

What changes were proposed in this pull request?

4.0.0-preview2 introduced, as part of SPARK-49025 (PR #47785), changes that drive ExpressionEncoder derivation purely from AgnosticEncoders. This PR adds a trait:

@DeveloperApi
trait AgnosticExpressionPathEncoder[T]
  extends AgnosticEncoder[T] {
  def toCatalyst(input: Expression): Expression
  def fromCatalyst(inputPath: Expression): Expression
}

and hooks into the De/SerializationBuildHelper match statements to allow seamless extension with non-Connect custom encoders (such as frameless or sparksql-scalapb).
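As an illustration (not part of the PR itself), a library-side implementation of the trait might look like the sketch below, encoding java.util.Calendar as epoch milliseconds. CalendarEncoder, CalendarUtil and fromMillis are hypothetical names, and the AgnosticEncoder members shown are from memory of the 4.0 preview API, so details may differ:

```scala
// Hypothetical sketch: a custom path encoder for java.util.Calendar.
// CalendarUtil.fromMillis is an assumed static helper, not a real Spark API.
import java.util.Calendar
import scala.reflect.{classTag, ClassTag}

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{DataType, LongType, ObjectType}

case object CalendarEncoder extends AgnosticExpressionPathEncoder[Calendar] {
  override def isPrimitive: Boolean = false
  override def dataType: DataType = LongType
  override def clsTag: ClassTag[Calendar] = classTag[Calendar]

  // serializer side: Calendar -> Long (epoch millis)
  override def toCatalyst(input: Expression): Expression =
    Invoke(input, "getTimeInMillis", LongType)

  // deserializer side: Long -> Calendar via an assumed static helper,
  // e.g. CalendarUtil.fromMillis(millis: Long): Calendar
  override def fromCatalyst(inputPath: Expression): Expression =
    StaticInvoke(
      classOf[CalendarUtil],
      ObjectType(classOf[Calendar]),
      "fromMillis",
      inputPath :: Nil)
}
```

With the PR's hooks in place, De/SerializationBuildHelper would dispatch to toCatalyst/fromCatalyst for such an encoder instead of failing with a MatchError.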

SPARK-49960 provides the same information.

Why are the changes needed?

Without this change (or something similar) there is no way for custom encoders to integrate with the 4.0.0-preview2 encoder derivation, something that has worked, and that developers have benefited from, since pre-2.4 days. This stops code such as Dataset.joinWith from deriving a working tuple encoder (the provided ExpressionEncoder is now discarded under preview2). Supplying a custom AgnosticEncoder under preview2 also fails, because only the built-in preview2 AgnosticEncoders are matched in De/SerializationBuildHelper, triggering a MatchError.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A test was added using a "custom" string encoder and a joinWith call based on an existing joinWith test. Removing the new case statements in either BuildHelper triggers the MatchError.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label Oct 15, 2024
chris-twiner (Author) commented:

@hvanhovell fyi

Review thread on the diff:

 * @tparam T over T
 */
@DeveloperApi
trait AgnosticExpressionPathEncoder[T]
hvanhovell (Contributor) commented:

@chris-twiner can you give me an example of what exactly you are missing from the agnostic encoder framework. I'd rather solve this problem at that level than create an escape hatch to raw catalyst expressions. I am not saying that we should not do this, but I'd like to have a (small) discussion first.

My rationale for pushing for agnostic encoders is that I want to create a situation where the Classic and Connect Spark SQL interfaces are on par. Catalyst-bespoke encoders sort of defeat that.

chris-twiner (Author) replied Oct 16, 2024:

@hvanhovell thanks for getting back to me. Per the JIRA, this is existing pre-4 functionality that no longer fully works.
Frameless, for example, uses extensible encoder derivation with injection/ADT support to provide type-safe usage at compile time. Quality, for example, uses injections to store a result ADT efficiently; this SO question has a similar, often-occurring example that can be solved the same way. Lastly, as the inbuilt encoders are not extensible, you can bump into their derivation limitations (java.util.Calendar, for example).

Regarding a fully unified interface implementation: that's understood, but this change is the minimal requirement to re-enable frameless-style usage. I don't have any direct way to provide parity for Connect yet (although your unification work provides a clear basis); I track it under frameless #701. To go further down that route I'd also need custom-expression support in Connect (but that's off topic, and I know it's there to be used).
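To make the injection idea concrete, here is a minimal frameless-style Injection sketch; the trait shape mirrors frameless' Injection, while the Calendar instance is an illustrative example of encoding an otherwise unsupported type via a supported one:

```scala
// Frameless-style invertible mapping (sketch of the frameless Injection shape).
trait Injection[A, B] extends Serializable {
  def apply(a: A): B
  def invert(b: B): A
}

import java.util.Calendar

// Illustrative: encode java.util.Calendar as epoch milliseconds.
val calendarInjection: Injection[Calendar, Long] = new Injection[Calendar, Long] {
  def apply(c: Calendar): Long = c.getTimeInMillis
  def invert(millis: Long): Calendar = {
    val c = Calendar.getInstance()
    c.setTimeInMillis(millis)
    c
  }
}
```

A library can derive an encoder for the target type (Long here) and compose it with the injection, which is exactly the kind of extension point the discarded ExpressionEncoder path used to allow.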

hvanhovell (Contributor) replied:

Yeah, that makes sense. However, I do want to call out that this is mostly internal API; we do not guarantee any compatibility between (minor) releases. For that reason, historically, most Spark libraries have had to create per-Spark-version releases. The issue here, IMO, falls into that category.

I understand that this is a somewhat frustrating and impractical stance. I am open to having this interface for now, provided that in the future we can migrate towards AgnosticEncoders. The latter probably requires us to add additional encoders to the agnostic framework (e.g. an encoder for union types...).

chris-twiner (Author) replied:

Regarding "internal": very much understood; it's the price paid for the functionality and performance gains. As I target Databricks as well, there is yet more fun, hence shim's complicated version support.

hvanhovell (Contributor) replied Oct 16, 2024:

(deleted my previous comment) I thought GH had lost it....

hvanhovell (Contributor) replied:

If Databricks compatibility is something you want, then Agnostic Encoders are your friend.

Review thread on the test code:

val realClassDataEnc: ProductEncoder[ClassData] =
  Encoders.product[ClassData].asInstanceOf[ProductEncoder[ClassData]]

val custStringEnc: AgnosticExpressionPathEncoder[String] =
hvanhovell (Contributor) commented:

I am assuming this is not the actual use case you are solving :)...

If it is, then a TransformingEncoder would do the job :)
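For context, the TransformingEncoder alternative mentioned here could look roughly like the following sketch: wrap an existing AgnosticEncoder with a Codec that converts to and from the custom type. The signatures are from memory of the Spark 4 preview API and may differ in detail; Wrapped is an illustrative stand-in for the test's "custom" string type:

```scala
// Sketch, assuming the Spark 4 preview shapes of TransformingEncoder and Codec.
import scala.reflect.classTag

import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.{StringEncoder, TransformingEncoder}
import org.apache.spark.sql.catalyst.encoders.Codec

// Hypothetical custom type stored as a plain string.
case class Wrapped(value: String)

val wrappedCodec: Codec[Wrapped, String] = new Codec[Wrapped, String] {
  def encode(in: Wrapped): String = in.value   // custom type -> storable type
  def decode(out: String): Wrapped = Wrapped(out) // storable type -> custom type
}

// Reuse the built-in StringEncoder for the underlying representation.
val wrappedEnc = TransformingEncoder(classTag[Wrapped], StringEncoder, () => wrappedCodec)
```

This covers simple invertible mappings without touching Catalyst expressions, which is why it suffices for the illustrative test case but not for frameless-style expression-level derivation.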

chris-twiner (Author) replied:

@hvanhovell yeah, it's purely illustrative to test the code works - on Classic at least.
