From f1920ee2486f0c7d8c1428be6640a47121271a98 Mon Sep 17 00:00:00 2001 From: Daniel Vega-Myhre Date: Wed, 20 Sep 2023 19:30:47 +0000 Subject: [PATCH 1/5] add troubleshooting docs --- README.md | 3 +++ docs/faq/README.md | 44 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 47 insertions(+) create mode 100644 docs/faq/README.md diff --git a/README.md b/README.md index 95fca1abf..b2abac460 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,9 @@ Take a look at the [concepts](/docs/concepts/README.md) page for a brief descrip Read the [installation guide](/docs/setup/install.md) to learn more. +## Troubleshooting common issues +See the [FAQ](/docs/faq/README.md) for help with troubleshooting common issues. + ## Community, discussion, contribution, and support diff --git a/docs/faq/README.md b/docs/faq/README.md new file mode 100644 index 000000000..9e61028d3 --- /dev/null +++ b/docs/faq/README.md @@ -0,0 +1,44 @@ +# Troubleshooting Common Issues + +### "Webhook not available" error when attempting to create a JobSet + +Example error message: + +**Cause**: Usually this means the JobSet controller manager Deployment pods hey are unschedulable for some reason. + +**Solution**: Check if jobset-controller-manager deployment pods are running (`kubectl get pods -n jobset-system`). +If they are in a `Pending` state, describ the pod to see why (`kubectl describe pod -n jobset-system`), you +should see a message in the pod Events indicating why they are unschedulable. The solution will depend on why the pods +are unschedulable. For example, if they unschedulable to due insufficient CPU/memory, the solution is to scale up your CPU node pools or turn on autoscaling. + +### JobSet is created but child jobs and/or pods are not being created + +Check the jobset controller logs to see why the jobs are not being created: + +- `kubectl get pods -n jobset-system` +- `kubectl logs -n jobset-system` + +Inspect the logs to look for one of the following issues: + +1. 
Error message indicating an index does not exist (example: ` "error": "Index with name field:.metadata.controller does not exist"`) + +**Cause**: In older versions of JobSet (older than v0.2.1), if the indexes could not be built for some reason, the JobSet controller would log the error and launch anyway. This resulted in confusing behavior later when trying to create JobSets, where the controller would encounter this "index not found" error and not be able to create any jobs. This bug was fixed +in v0.2.1 so the JobSet controller now fails fast and exits with an error if indexes cannot be built. + +**Solution**: Upgrade to at least JobSet v0.2.1 (ideally, you should use the latest JobSet release). + +2. Validation error creating Jobs and/or Services, indicating the Job/Service name is invalid. + +**Cause**: Generated child job names or headless service names (which are derived from the JobSet name and ReplicatedJob names) are not valid. + +**Solution**: Validation has been added to fail the JobSet creation if the generated job/service names will be invalid, but the fix is not included in a release yet. For now, to resolve this, delete and recreate the JobSet with a name such that: + +* The generated Job names (format: `---.`) will be valid DNS labels as defined in RFC 1035. +* The subdomain name (manually specified in `js.Spec.Network.Subdomain` or defaulted to the JobSet name if unspecified) is both [RFC 1123](https://datatracker.ietf.org/doc/html/rfc1123) compliant and [RFC 1035](https://datatracker.ietf.org/doc/html/rfc1035) compliant. + + +### Using JobSet + Kueue, preempted workloads never resume + +**Cause**: This could be due to a known bug in an older version of JobSet, or a known bug in an older version of Kueue. + +**Solution**: Upgrade to at least JobSet v0.2.3 and Kueue v0.4.1. 
\ No newline at end of file From a11b86edb0eedb2de0d122a7855f894dceda0dfe Mon Sep 17 00:00:00 2001 From: Daniel Vega-Myhre <105610547+danielvegamyhre@users.noreply.github.com> Date: Wed, 20 Sep 2023 13:34:50 -0700 Subject: [PATCH 2/5] Update docs/faq/README.md Co-authored-by: Kevin Hannon --- docs/faq/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/faq/README.md b/docs/faq/README.md index 9e61028d3..7dccd88ce 100644 --- a/docs/faq/README.md +++ b/docs/faq/README.md @@ -7,7 +7,7 @@ Example error message: **Cause**: Usually this means the JobSet controller manager Deployment pods hey are unschedulable for some reason. **Solution**: Check if jobset-controller-manager deployment pods are running (`kubectl get pods -n jobset-system`). -If they are in a `Pending` state, describ the pod to see why (`kubectl describe pod -n jobset-system`), you +If they are in a `Pending` state, describe the pod to see why (`kubectl describe pod -n jobset-system`), you should see a message in the pod Events indicating why they are unschedulable. The solution will depend on why the pods are unschedulable. For example, if they unschedulable to due insufficient CPU/memory, the solution is to scale up your CPU node pools or turn on autoscaling. From cc96dedc6830345609fbe886cf830337e5361361 Mon Sep 17 00:00:00 2001 From: Daniel Vega-Myhre <105610547+danielvegamyhre@users.noreply.github.com> Date: Wed, 20 Sep 2023 13:56:18 -0700 Subject: [PATCH 3/5] Update docs/faq/README.md Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com> --- docs/faq/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/faq/README.md b/docs/faq/README.md index 7dccd88ce..978fdd089 100644 --- a/docs/faq/README.md +++ b/docs/faq/README.md @@ -4,7 +4,7 @@ Example error message: -**Cause**: Usually this means the JobSet controller manager Deployment pods hey are unschedulable for some reason. 
+**Cause**: Usually this means the JobSet controller manager Deployment pods are unschedulable for some reason. **Solution**: Check if jobset-controller-manager deployment pods are running (`kubectl get pods -n jobset-system`). If they are in a `Pending` state, describe the pod to see why (`kubectl describe pod -n jobset-system`), you From 33534c9a5d4b6a9bfd262178c35105cb28d47ef8 Mon Sep 17 00:00:00 2001 From: Daniel Vega-Myhre <105610547+danielvegamyhre@users.noreply.github.com> Date: Wed, 20 Sep 2023 13:56:38 -0700 Subject: [PATCH 4/5] Update docs/faq/README.md Co-authored-by: Kevin Hannon --- docs/faq/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/faq/README.md b/docs/faq/README.md index 978fdd089..3c6c04383 100644 --- a/docs/faq/README.md +++ b/docs/faq/README.md @@ -9,7 +9,7 @@ Example error message: **Solution**: Check if jobset-controller-manager deployment pods are running (`kubectl get pods -n jobset-system`). If they are in a `Pending` state, describe the pod to see why (`kubectl describe pod -n jobset-system`), you should see a message in the pod Events indicating why they are unschedulable. The solution will depend on why the pods -are unschedulable. For example, if they unschedulable to due insufficient CPU/memory, the solution is to scale up your CPU node pools or turn on autoscaling. +are unschedulable. For example, if they are unschedulable due to insufficient CPU/memory, the solution is to scale up your CPU node pools or turn on autoscaling. 
### JobSet is created but child jobs and/or pods are not being created From 6b42acd615ff5e7b0b54ee84d8ef76954815158d Mon Sep 17 00:00:00 2001 From: Daniel Vega-Myhre Date: Wed, 20 Sep 2023 20:58:12 +0000 Subject: [PATCH 5/5] add example error --- docs/faq/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/faq/README.md b/docs/faq/README.md index 3c6c04383..b48593a0c 100644 --- a/docs/faq/README.md +++ b/docs/faq/README.md @@ -2,7 +2,7 @@ ### "Webhook not available" error when attempting to create a JobSet -Example error message: +Example error: `failed calling webhook "mjobset.kb.io": failed to call webhook: Post "https://jobset-webhook-service.jobset-system.svc:443/mutate-jobset-x-k8s-io-v1alpha1-jobset?timeout=10s": no endpoints available for service "jobset-webhook-service"` **Cause**: Usually this means the JobSet controller manager Deployment pods are unschedulable for some reason.
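The FAQ entry on invalid Job/Service names asks for generated names that are valid RFC 1035 DNS labels. As a rough aid, the check can be sketched in Python before recreating a JobSet. This is a minimal sketch, not part of JobSet: the helper name `is_rfc1035_label` is hypothetical, the rule encoded is the lowercase form Kubernetes applies to DNS labels (starts with a letter, ends with a letter or digit, only letters, digits, and hyphens in between, at most 63 characters), and the caller is assumed to supply the full generated name, since the exact name format is not reproduced here.

```python
import re

# Lowercase RFC 1035 label as Kubernetes validates it:
# first character a letter, last character a letter or digit,
# interior characters limited to letters, digits, and hyphens.
_RFC1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LENGTH = 63


def is_rfc1035_label(name: str) -> bool:
    """Return True if `name` is a valid lowercase RFC 1035 DNS label."""
    if len(name) > MAX_LABEL_LENGTH:
        return False
    return _RFC1035_LABEL.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical candidate names, for illustration only.
    for candidate in ["training-workers-0", "0-training", "Training_Workers"]:
        print(candidate, is_rfc1035_label(candidate))
```

A name that fails this check (for example, one starting with a digit or containing uppercase letters or underscores) is a likely source of the validation error described in that entry. Long JobSet or ReplicatedJob names are worth checking too, because the generated name can exceed 63 characters even when each part looks short.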