Add troubleshooting docs #303

Merged
merged 5 commits into from
Sep 20, 2023
Changes from 2 commits
3 changes: 3 additions & 0 deletions README.md
@@ -8,6 +8,9 @@ Take a look at the [concepts](/docs/concepts/README.md) page for a brief descrip

Read the [installation guide](/docs/setup/install.md) to learn more.

## Troubleshooting common issues
See the [FAQ](/docs/faq/README.md) for help with troubleshooting common issues.


## Community, discussion, contribution, and support

44 changes: 44 additions & 0 deletions docs/faq/README.md
@@ -0,0 +1,44 @@
# Troubleshooting Common Issues

### "Webhook not available" error when attempting to create a JobSet

Example error message:

**Cause**: Usually this means the JobSet controller manager Deployment pods are unschedulable for some reason.

**Solution**: Check whether the jobset-controller-manager Deployment pods are running (`kubectl get pods -n jobset-system`).
If they are in a `Pending` state, describe the pod to see why (`kubectl describe pod <pod> -n jobset-system`). You
should see a message in the pod Events indicating why the pods are unschedulable. The solution depends on the reason
reported there. For example, if the pods are unschedulable due to insufficient CPU/memory, scale up your CPU node pools or turn on autoscaling.
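
The two checks above can be combined into one helper. This is a minimal sketch, not part of JobSet: the function name `describe_pending_pods` is hypothetical, and it assumes the controller runs in the `jobset-system` namespace used in the commands above.

```shell
# Describe every Pending pod in a namespace (default: jobset-system)
# so the scheduling failure reason shows up in the Events section.
describe_pending_pods() {
  ns="${1:-jobset-system}"
  # -o name prints entries such as "pod/<name>", which describe accepts as-is.
  for pod in $(kubectl get pods -n "$ns" --field-selector=status.phase=Pending -o name); do
    kubectl describe "$pod" -n "$ns"
  done
}

# Usage: describe_pending_pods
#        describe_pending_pods some-other-namespace
```

If the function prints nothing, no pods in that namespace are in the `Pending` phase.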

### JobSet is created but child jobs and/or pods are not being created

Check the jobset controller logs to see why the jobs are not being created:

- `kubectl get pods -n jobset-system`
- `kubectl logs <pod> -n jobset-system`
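
As a convenience, the two commands above can be wrapped in a loop that prints only the log lines mentioning an error. This is a rough sketch: the function name `jobset_controller_errors` is hypothetical, the `--all-containers=true` flag is added on the assumption that the pod may run more than one container, and the case-insensitive match on "error" is a simple heuristic rather than an exact filter.

```shell
# Print error-looking log lines from every pod in the controller namespace.
jobset_controller_errors() {
  ns="${1:-jobset-system}"
  for pod in $(kubectl get pods -n "$ns" -o name); do
    echo "== $pod =="
    # grep exits non-zero when nothing matches; "|| true" keeps the loop going.
    kubectl logs "$pod" -n "$ns" --all-containers=true | grep -i "error" || true
  done
}

# Usage: jobset_controller_errors
```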

Inspect the logs to look for one of the following issues:

1. Error message indicating an index does not exist (example: ` "error": "Index with name field:.metadata.controller does not exist"`)

**Cause**: In versions of JobSet older than v0.2.1, if the indexes could not be built for some reason, the JobSet controller would log the error and launch anyway. This resulted in confusing behavior later when trying to create JobSets: the controller would encounter this "index not found" error and be unable to create any jobs. This bug was fixed in v0.2.1, so the JobSet controller now fails fast and exits with an error if indexes cannot be built.

**Solution**: Upgrade to at least JobSet v0.2.1 (ideally, you should use the latest JobSet release).

2. Validation error creating Jobs and/or Services, indicating the Job/Service name is invalid.

**Cause**: Generated child job names or headless service names (which are derived from the JobSet name and ReplicatedJob names) are not valid.

**Solution**: Validation has been added to fail JobSet creation if the generated job/service names would be invalid, but the fix is not included in a release yet. For now, delete and recreate the JobSet with a name such that:

* The generated Job names (format: `<jobset-name>-<replicatedJobName>-<jobIndex>-<podIndex>.<subdomain>`) will be valid DNS labels as defined in RFC 1035.
* The subdomain name (manually specified in `js.Spec.Network.Subdomain` or defaulted to the JobSet name if unspecified) is both [RFC 1123](https://datatracker.ietf.org/doc/html/rfc1123) compliant and [RFC 1035](https://datatracker.ietf.org/doc/html/rfc1035) compliant.
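
To check a candidate name before recreating the JobSet, a small local test can help. The sketch below is an assumption-laden illustration, not JobSet's own validation: `is_dns1035_label` is a hypothetical helper that applies the lowercase DNS label rule as Kubernetes commonly interprets RFC 1035 (starts with a letter, ends with a letter or digit, contains only lowercase letters, digits, and `-`, at most 63 characters). It checks one label at a time, so a dotted name has to be split first.

```shell
# Return 0 if $1 is a valid lowercase RFC 1035 DNS label, 1 otherwise.
is_dns1035_label() {
  name=$1
  # A DNS label is limited to 63 characters.
  [ "${#name}" -le 63 ] || return 1
  # First character a letter, last a letter or digit, hyphens only inside.
  printf '%s\n' "$name" | LC_ALL=C grep -Eq '^[a-z]([-a-z0-9]*[a-z0-9])?$'
}

# Usage (the names here are made up for illustration):
#   is_dns1035_label "my-jobset-workers-0" && echo "valid"
#   is_dns1035_label "1-bad-name" || echo "invalid"
```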


### Using JobSet + Kueue, preempted workloads never resume

**Cause**: This could be due to a known bug in an older version of JobSet, or a known bug in an older version of Kueue.

**Solution**: Upgrade to at least JobSet v0.2.3 and Kueue v0.4.1.
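
One way to compare an installed version against these minimums is sketched below. It is only an illustration: `version_at_least` and `jobset_controller_image` are hypothetical helpers, `sort -V` is assumed to be available (it is an extension found in GNU coreutils and recent BSDs, not POSIX), and the Deployment name and namespace are taken from the commands earlier in this document. Reading the version from the image tag is also an assumption about how the controller was installed.

```shell
# Succeed when version $1 is greater than or equal to version $2.
version_at_least() {
  # After a version sort, the smaller of the two comes first.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the container image(s) of the JobSet controller Deployment;
# the image tag normally carries the release version.
jobset_controller_image() {
  kubectl get deployment jobset-controller-manager -n jobset-system \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
}

# Usage:
#   jobset_controller_image
#   version_at_least v0.2.3 v0.2.3 && echo "JobSet is new enough"
```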