Spark Operator Roadmap 2024 #2193

Open · 1 of 8 tasks
ChenYi015 opened this issue Sep 26, 2024 · 6 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

ChenYi015 (Contributor) commented Sep 26, 2024

Roadmap

Creating this roadmap issue to track planned work items. If you have any ideas, please leave a comment.

Features

Chores

  • Documentation improvements
  • Improve test coverage, particularly with e2e tests, to increase confidence in releases
ChenYi015 pinned this issue Sep 26, 2024
jacobsalway (Member) commented Sep 29, 2024

Some ideas:

  • A new CR to support Spark Connect
  • An HTTP API for job submission
  • A web UI for visibility into currently running applications
  • Deprecate the need for a mutating webhook by moving all functionality into the pod template (see the sketch after this list)
  • Controller performance improvements and recommendations for large scale clusters
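
On the webhook-deprecation item, a minimal sketch of one way it could work, assuming the operator renders user customizations into a pod template file and hands it to Spark through its native `spark.kubernetes.driver.podTemplateFile` option. The toleration values and file path are illustrative, not the operator's actual behavior:

```go
// Sketch: serialize the customizations a mutating webhook would normally
// inject into a pod template file that Spark itself applies to the driver.
package main

import (
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Driver customizations that the webhook would otherwise mutate in.
	tmpl := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Labels: map[string]string{"spark-role": "driver"},
		},
		Spec: corev1.PodSpec{
			Tolerations: []corev1.Toleration{{
				Key:      "dedicated",
				Operator: corev1.TolerationOpEqual,
				Value:    "spark",
				Effect:   corev1.TaintEffectNoSchedule,
			}},
		},
	}

	out, err := yaml.Marshal(&tmpl)
	if err != nil {
		panic(err)
	}
	path := "/tmp/driver-pod-template.yaml" // illustrative path
	if err := os.WriteFile(path, out, 0o644); err != nil {
		panic(err)
	}

	// Spark consumes the template natively; no webhook mutation required.
	fmt.Printf("--conf spark.kubernetes.driver.podTemplateFile=%s\n", path)
}
```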

Chores:

  • Improve test coverage, particularly with e2e tests, to increase confidence in releases
  • Doc improvements

ChenYi015 added the help wanted, good first issue, and enhancement labels Sep 29, 2024
cccsss01 commented Oct 5, 2024

Upgrade the default security posture: remove the reliance on user ID 185 (it seems tied to the krb5.conf file, which references domains and realms of institutions that may not need it). See the sketch below.
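
A minimal sketch of sidestepping the baked-in user: set an explicit pod securityContext so nothing depends on the image's default UID (185 in the Apache Spark images). The 1000 values are arbitrary assumptions, not an operator default:

```go
// Sketch: pin an explicit non-root securityContext on the driver pod so
// the image's built-in UID 185 is never relied on.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func i64(v int64) *int64 { return &v }
func bp(v bool) *bool    { return &v }

func main() {
	spec := corev1.PodSpec{
		SecurityContext: &corev1.PodSecurityContext{
			RunAsUser:    i64(1000), // overrides the image default (185); placeholder UID
			RunAsGroup:   i64(1000),
			FSGroup:      i64(1000),
			RunAsNonRoot: bp(true),
		},
	}
	fmt.Printf("driver securityContext: %+v\n", *spec.SecurityContext)
}
```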

josecsotomorales (Contributor) commented

@jacobsalway @ChenYi015 I think that "Deprecate the need for a mutating webhook by moving all functionality into the pod template" should be a top priority, especially with the upcoming release of Spark v4.

gangahiremath commented Oct 11, 2024

@bnetzi, @vara-bonthu, regarding the point "referring you to the discussion here, I think we just need to provide in general more options to configure the controller runtime, and that my PR is irrelevant":

Does it mean that "one queue per app and one go routine per app" (#1990) is not a solution for the performance issue faced? (See the sketch below.)

Is #2186 a solution for the same?
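
For context, a toy sketch of the "one queue per app and one goroutine per app" idea from #1990, assuming a dispatcher that serializes events per application so a slow submission for one app cannot block reconciliation of others. Names and structure are illustrative, not the actual PR:

```go
// Sketch: each app gets its own buffered queue and worker goroutine;
// events for one app are handled strictly in order, apps run in parallel.
package main

import (
	"fmt"
	"sync"
	"time"
)

type event struct{ app, kind string }

type dispatcher struct {
	mu     sync.Mutex
	queues map[string]chan event
	wg     sync.WaitGroup
}

func newDispatcher() *dispatcher {
	return &dispatcher{queues: map[string]chan event{}}
}

// enqueue routes an event to the per-app queue, spawning that app's
// worker goroutine on first use.
func (d *dispatcher) enqueue(e event) {
	d.mu.Lock()
	q, ok := d.queues[e.app]
	if !ok {
		q = make(chan event, 16)
		d.queues[e.app] = q
		d.wg.Add(1)
		go d.worker(e.app, q)
	}
	d.mu.Unlock()
	q <- e
}

func (d *dispatcher) worker(app string, q chan event) {
	defer d.wg.Done()
	for e := range q {
		time.Sleep(10 * time.Millisecond) // stand-in for reconcile/submit work
		fmt.Printf("app=%s handled %s\n", app, e.kind)
	}
}

func main() {
	d := newDispatcher()
	for _, app := range []string{"etl-a", "etl-b"} {
		d.enqueue(event{app, "submitted"})
		d.enqueue(event{app, "running"})
	}
	d.mu.Lock()
	for _, q := range d.queues {
		close(q)
	}
	d.mu.Unlock()
	d.wg.Wait()
}
```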

Do we see a performance improvement opportunity with the approach that we have tried? - #1574 (comment)
Summary of changes:
- port spark-submit to Go
- this removes the JVM invocation, so submission is faster
- no dependency on Apache Spark (the frequency and quantity of changes to the driver pod should be minimal in future Apache Spark releases)
We are happy to contribute our effort in this context to open source. The port of spark-submit to Go is well tested in our setup. Please let us know.
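
A rough sketch of what Go-native submission could look like, assuming the operator builds the driver pod spec directly from SparkApplication fields and creates it with client-go instead of shelling out to the JVM-based spark-submit. The labels, image, and args here are placeholders, and a fake clientset keeps the example runnable without a cluster:

```go
// Sketch: construct the driver pod in Go and create it via the Kubernetes
// API, skipping the spark-submit JVM entirely.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	// Fields that would come from the SparkApplication CR.
	appName, image, mainClass := "pi", "spark:3.5.1", "org.apache.spark.examples.SparkPi"

	driver := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: appName + "-driver",
			Labels: map[string]string{
				"spark-role": "driver",
			},
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "spark-kubernetes-driver",
				Image: image,
				Args:  []string{"driver", "--class", mainClass},
			}},
		},
	}

	// Fake clientset for a runnable example; a real implementation would
	// use an in-cluster client.
	cs := fake.NewSimpleClientset()
	created, err := cs.CoreV1().Pods("default").Create(context.TODO(), driver, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created driver pod:", created.Name)
}
```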

c-h-afzal commented Oct 15, 2024

@gangahiremath - I think the two improvements aren't mutually exclusive. Given the testing done by @bnetzi and captured in this document, the one-mutex-per-queue approach does seem to have performance benefits. Using Go instead of Java-based submission could also reduce job submission latency. However, as @bnetzi pointed out, using Go would require corresponding changes to the Spark operator whenever spark-submit changes, and it may also introduce functionality gaps. We could include both improvements in the roadmap if the performance hit from the JVM is significant enough.

It would be great if other users could share whether JVM spin-up time was indeed a contributor to their job submission latency, and whether anyone has tweaked or optimized the JVM specifically to alleviate this pain point. Thanks.
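
One quick way to gauge the fixed JVM cost is to time a trivial spark-submit invocation end to end; a rough sketch, assuming spark-submit is on the PATH:

```go
// Sketch: --version exits after JVM startup and Spark class loading, so its
// wall time roughly approximates the fixed JVM cost each submission pays.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	start := time.Now()
	out, err := exec.Command("spark-submit", "--version").CombinedOutput()
	elapsed := time.Since(start)
	if err != nil {
		fmt.Println("spark-submit not found or failed:", err)
	}
	fmt.Printf("JVM startup + version check took %v\n%s", elapsed, out)
}
```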

gangahiremath commented, replying to @c-h-afzal above:

@c-h-afzal, FYI: bnetzi's point "So the way I see it - work queue per app might no longer be the solution" in thread #1990 (comment).
