[BUG] SparkApplication fails to resubmit after entering the PENDING_RERUN state #2283
Comments
@ChenYi015 Happy to double-check if #2241 fixes this issue if we can get a new release in place 🚀
This does sound familiar, and I think my retry PR might have helped because I haven't noticed this since I rolled that out. Regardless of whether it's still a problem, I was contemplating some ways to make deleting extra resources (driver pod, service and ingress) more robust. I don't really know what I'm talking about, but I guess there is no harm in sharing my thoughts:
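One way such cleanup could be made more robust, sketched below purely for illustration (hypothetical helper and resource names, not code from the operator or from #2241), is to issue foreground deletes for the driver pod, service and ingress and then poll until the driver pod is actually gone before resubmitting:

```go
package cleanup

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteDriverResources is a hypothetical helper: it deletes the leftover
// driver pod, service and ingress, then blocks until the pod is confirmed
// gone so a later spark-submit cannot collide with a terminating pod.
func deleteDriverResources(ctx context.Context, c kubernetes.Interface, ns, driverPod, svc, ing string) error {
	fg := metav1.DeletePropagationForeground
	opts := metav1.DeleteOptions{PropagationPolicy: &fg}

	// Ignore NotFound so the cleanup stays idempotent across retries.
	deletes := []func() error{
		func() error { return c.CoreV1().Pods(ns).Delete(ctx, driverPod, opts) },
		func() error { return c.CoreV1().Services(ns).Delete(ctx, svc, opts) },
		func() error { return c.NetworkingV1().Ingresses(ns).Delete(ctx, ing, opts) },
	}
	for _, del := range deletes {
		if err := del(); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}

	// Only consider cleanup done once the driver pod no longer exists;
	// a pod in Terminating state is still returned by the API server.
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			_, err := c.CoreV1().Pods(ns).Get(ctx, driverPod, metav1.GetOptions{})
			if err != nil && !apierrors.IsNotFound(err) {
				return false, err
			}
			return apierrors.IsNotFound(err), nil
		})
}
```

The explicit poll is the point of the sketch: a successful delete call only means deletion has started, not that the pod is gone.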
I have released v2.1.0-rc.0. Actually, I cannot reproduce this issue with version
Hey @Tom-Newton @ChenYi015, I did several tests on v2.1.0-rc.0, and I can confirm that this issue is resolved! Excellent work guys!! 🚀
Description
I’m encountering an issue with the Spark Operator where the SparkApplication fails to resubmit after entering the PENDING_RERUN state. The operator logs an error stating “failed to run spark-submit: driver pod already exist”, even though the driver pod was deleted. This issue prevents the application from restarting correctly.
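For illustration of why the error can appear even after the pod was deleted (hypothetical code, not the operator's actual check): a pre-submit existence check that treats anything other than a NotFound error as "the driver pod exists" will reject the resubmission while the old pod is still terminating or still visible in a stale cache.

```go
package precheck

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// driverPodGone is a hypothetical helper illustrating the failure mode:
// only a NotFound error counts as "gone". A driver pod that was deleted but
// is still terminating is still returned by the API server, so a check like
// this would report that the driver pod already exists and block resubmission.
func driverPodGone(ctx context.Context, c kubernetes.Interface, ns, name string) (bool, error) {
	_, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	switch {
	case apierrors.IsNotFound(err):
		return true, nil // fully deleted, safe to run spark-submit
	case err != nil:
		return false, fmt.Errorf("checking driver pod %s/%s: %w", ns, name, err)
	default:
		return false, nil // pod object still present (possibly Terminating)
	}
}
```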
Reproduction Code [Required]
Steps to reproduce the behavior:
Expected behavior
The Spark Operator should successfully resubmit the SparkApplication when it is in the PENDING_RERUN state, creating a new driver pod and continuing the execution of the application.
Actual behavior
The Spark Operator fails to resubmit the SparkApplication, logging an error:
Failed to run spark-submit: driver pod already exist
As a result, the application does not restart, and the driver pod remains in a failed state.
Terminal Output Screenshot(s)
2024-10-23T23:03:47.193Z ERROR sparkapplication/controller.go:260 Failed to submit SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "error": "failed to run spark-submit: driver pod already exist"}
...
2024-10-24T00:04:55.662Z ERROR sparkapplication/controller.go:409 Failed to run spark-submit {"name": "sample-app-sample-spark", "namespace": "default", "state": "PENDING_RERUN", "error": "failed to run spark-submit: driver pod already exist"}
...
Full Logs:
2024-10-23T23:03:47.159Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "", "newState": "SUBMITTED"}
2024-10-23T23:03:47.175Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "SUBMITTED", "newState": "SUBMITTED"}
2024-10-23T23:03:47.193Z ERROR sparkapplication/controller.go:260 Failed to submit SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "error": "failed to run spark-submit: driver pod already exist"}
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).reconcileNewSparkApplication.func1
/workspace/internal/controller/sparkapplication/controller.go:260
k8s.io/client-go/util/retry.OnError.func1
/go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:51
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection
/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/wait.go:145
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff
/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:461
k8s.io/client-go/util/retry.OnError
/go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:50
k8s.io/client-go/util/retry.RetryOnConflict
/go/pkg/mod/k8s.io/client-go@v0.29.3/util/retry/util.go:104
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).reconcileNewSparkApplication
/workspace/internal/controller/sparkapplication/controller.go:247
github.com/kubeflow/spark-operator/internal/controller/sparkapplication.(*Reconciler).Reconcile
/workspace/internal/controller/sparkapplication/controller.go:179
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.5/pkg/internal/controller/controller.go:227
2024-10-23T23:03:47.215Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "SUBMITTED"}
2024-10-23T23:03:48.006Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "sample-app-sample-spark-driver", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2024-10-23T23:03:48.012Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "SUBMITTED"}
2024-10-23T23:03:48.021Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "sample-app-sample-spark", "namespace": "default", "oldState": "SUBMITTED", "newState": "RUNNING"}
2024-10-23T23:03:48.042Z INFO sparkapplication/controller.go:171 Reconciling SparkApplication {"name": "sample-app-sample-spark", "namespace": "default", "state": "RUNNING"}
2024-10-23T23:03:51.900Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "sample-21c07c92bb9f476b-exec-1", "namespace": "default", "phase": "Pending"}
...
Environment & Versions
Additional context
This issue might be fixed in #2241