
[BUG] SparkApplication fails to resubmit after entering the PENDING_RERUN state #2283

Closed
josecsotomorales opened this issue Oct 24, 2024 · 4 comments

Comments

@josecsotomorales
Contributor

Description

I’m encountering an issue with the Spark Operator where the SparkApplication fails to resubmit after entering the PENDING_RERUN state. The operator logs an error stating “failed to run spark-submit: driver pod already exist”, even though the driver pod was deleted. This issue prevents the application from restarting correctly.

•	✋ I have searched the open/closed issues and my issue is not listed.

Reproduction Code [Required]

Steps to reproduce the behavior:

1.	Submit a SparkApplication to the Kubernetes cluster.
2.	Allow the application to reach the RUNNING state.
3.	Trigger an event that causes the application to enter the INVALIDATING state (e.g., by updating the application or deleting a pod).
4.	Observe that the application transitions to the PENDING_RERUN state.
5.	The operator attempts to resubmit the application but fails with the error “driver pod already exist”.

Expected behavior

The Spark Operator should successfully resubmit the SparkApplication when it is in the PENDING_RERUN state, creating a new driver pod and continuing the execution of the application.

Actual behavior

The Spark Operator fails to resubmit the SparkApplication, logging an error:

Failed to run spark-submit: driver pod already exist

As a result, the application does not restart, and the driver pod remains in a failed state.

Terminal Output Screenshot(s)

2024-10-23T23:03:47.193Z ERROR sparkapplication/controller.go:260 Failed to submit SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "error": "failed to run spark-submit: driver pod already exist"}
...

2024-10-24T00:04:55.662Z ERROR sparkapplication/controller.go:409 Failed to run spark-submit {"name": "sample-app-sample-spark", "namespace": "default", "state": "PENDING_RERUN", "error": "failed to run spark-submit: driver pod already exist"}
...

Full Logs:


2024-10-23T23:03:47.159Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "", "newState": "SUBMITTED"}
2024-10-23T23:03:47.175Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "SUBMITTED", "newState": "SUBMITTED"}
2024-10-23T23:03:47.193Z ERROR sparkapplication/controller.go:260 Failed to submit SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "error": "failed to run spark-submit: driver pod already exist"}
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).reconcileNewSparkApplication.func1
/workspace/internal/controller/sparkapplication/controller.go:260
k8s.io/client-go/util/retry.OnError.func1
/go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:51
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection
/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/wait.go:145
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff
/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:461
k8s.io/client-go/util/retry.OnError
/go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:50
k8s.io/client-go/util/retry.RetryOnConflict
/go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:104
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).reconcileNewSparkApplication
/workspace/internal/controller/sparkapplication/controller.go:247
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).Reconcile
/workspace/internal/controller/sparkapplication/controller.go:179
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:227
2024-10-23T23:03:47.215Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "SUBMITTED"}
2024-10-23T23:03:48.006Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "sample-app-sample-spark-driver", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2024-10-23T23:03:48.012Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "SUBMITTED"}
2024-10-23T23:03:48.021Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "SUBMITTED", "newState": "RUNNING"}
2024-10-23T23:03:48.042Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "RUNNING"}
2024-10-23T23:03:51.900Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "sample-21c07c92bb9f476b-exec-1", "namespace": "default", "phase": "Pending"}
...

Environment & Versions

•	Spark Operator App version: v2.0.2
•	Helm Chart Version: 2.0.2
•	Kubernetes Version: v1.30
•	Apache Spark version: 3.5.3

Additional context

•	The issue occurs consistently under the given reproduction steps.
•	It appears that the operator does not properly clean up or recognize the state of the driver pod during a rerun (a rough sketch of the kind of pre-submit check that seems to be missing follows this list).
•	Manually deleting the driver pod does not resolve the issue; the operator continues to report that the driver pod already exists.
•	This issue impacts our ability to automatically restart Spark applications upon failure.
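
To make the "clean up or recognize the state of the driver pod" point more concrete, here is a minimal, hypothetical Go sketch (not the operator's actual code) of the kind of pre-submit guard a controller-runtime reconciler could apply before calling spark-submit for an app in PENDING_RERUN: confirm the previous driver pod is really gone, request deletion if it still exists, and requeue instead of submitting. The function name, the driverPodName parameter, and the requeue interval are assumptions for illustration only.

```go
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureDriverPodGone checks whether the driver pod from the previous run still
// exists before spark-submit is invoked again. If the pod is still there, it is
// deleted (or simply waited on while terminating) and the caller should requeue
// instead of submitting, which would otherwise fail with
// "driver pod already exist".
func ensureDriverPodGone(ctx context.Context, c client.Client, namespace, driverPodName string) (ctrl.Result, bool, error) {
	var pod corev1.Pod
	err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: driverPodName}, &pod)
	if apierrors.IsNotFound(err) {
		return ctrl.Result{}, true, nil // pod is gone; safe to resubmit
	}
	if err != nil {
		return ctrl.Result{}, false, err
	}
	// Pod still exists: request deletion if it is not already terminating.
	if pod.DeletionTimestamp == nil {
		if err := c.Delete(ctx, &pod); err != nil && !apierrors.IsNotFound(err) {
			return ctrl.Result{}, false, err
		}
	}
	// Check again on the next reconcile rather than calling spark-submit now.
	return ctrl.Result{RequeueAfter: 5 * time.Second}, false, nil
}
```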

This issue might be fixed in #2241

@josecsotomorales
Contributor Author

@ChenYi015 Happy to double-check if #2241 fixes this issue if we can get a new release in place 🚀

josecsotomorales changed the title from "[BUG] Brief description of the issue" to "[BUG] SparkApplication fails to resubmit after entering the PENDING_RERUN state" on Oct 24, 2024
@Tom-Newton
Contributor

This does sound familiar, and I think my retry PR might have helped, because I haven't noticed this since I rolled that out. Regardless of whether it's still a problem, I was contemplating some ways to make deleting extra resources (driver pod, service and ingress) more robust.

I don't really know what I'm talking about but I guess there is no harm in sharing my thoughts:

  1. Delete resources according to the name that the new SparkApplication wants to use, not just what is in app.Status.DriverInfo.
  2. When deleting extra resources, always check some ID in addition to the name. I think this might protect against some race conditions if a SparkApplication name is reused quickly after a previous SparkApplication completed (see the sketch after this list).
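
As a rough illustration of the second point (verifying an identifying label or ID rather than trusting the name alone), here is a hypothetical controller-runtime helper. The label key sparkoperator.k8s.io/submission-id and the helper name are assumptions for the sketch, not a reference to the operator's actual deletion code.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Assumed label key for illustration; the real operator may track the run
// identity differently.
const submissionIDLabel = "sparkoperator.k8s.io/submission-id"

// deleteDriverPodIfOwned deletes the named driver pod only if its submission ID
// label matches the run we are cleaning up after, so a same-named pod from a
// different (e.g. quickly re-created) application is left alone.
func deleteDriverPodIfOwned(ctx context.Context, c client.Client, namespace, name, submissionID string) error {
	var pod corev1.Pod
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &pod); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // nothing to clean up
		}
		return err
	}
	if pod.Labels[submissionIDLabel] != submissionID {
		// Same name, different run: do not delete someone else's pod.
		return nil
	}
	if err := c.Delete(ctx, &pod); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}
```

The idea is simply that a resource with the expected name but the wrong ID is left untouched, which is what would protect against the name-reuse race described above.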

@ChenYi015
Contributor

> @ChenYi015 Happy to double-check if #2241 fixes this issue if we can get a new release in place 🚀

I have released v2.1.0-rc.0.

Actually, I cannot reproduce this issue with version v2.0.2. I can see that the Spark resources (driver pod, service) are deleted as expected when the app is in the INVALIDATING state.

@josecsotomorales
Contributor Author

Hey @Tom-Newton @ChenYi015, I did several tests on v2.1.0-rc.0, and I can confirm that this issue is resolved! Excellent work guys!! 🚀
