Can't manually fail deployment from buggy versions of nomad #4286

Closed
tantra35 opened this issue May 11, 2018 · 3 comments

tantra35 commented May 11, 2018

Nomad version

Nomad v0.8.3 (c85483d)

Issue

We have some deployments left over from the days of Nomad v0.6.0 and its bugs. We have now decided to fail these deployments, because we periodically see the following in our server logs:

2018/05/11 13:12:20.665488 [ERR] nomad.deployments_watcher: failed to track deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453": deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453" references unknown job "S3apiCache"
2018/05/11 13:12:20.665512 [ERR] nomad.deployments_watcher: failed to track deployment "358f9dda-9feb-0f66-05e6-647f9e157747": deployment "358f9dda-9feb-0f66-05e6-647f9e157747" references unknown job "tdagent-local"
2018/05/11 13:12:20.665536 [ERR] nomad.deployments_watcher: failed to track deployment "503ffcb2-ca8e-5978-4316-6ef8d36c38a3": deployment "503ffcb2-ca8e-5978-4316-6ef8d36c38a3" references unknown job "ceph-zabbix"
2018/05/11 13:12:20.665559 [ERR] nomad.deployments_watcher: failed to track deployment "64c03451-f546-18b3-429d-f236b66478cc": deployment "64c03451-f546-18b3-429d-f236b66478cc" references unknown job "tdagent-local"
2018/05/11 13:12:20.665578 [ERR] nomad.deployments_watcher: failed to track deployment "73a0e737-47a2-df97-9899-6754a4697456": deployment "73a0e737-47a2-df97-9899-6754a4697456" references unknown job "webphp"
2018/05/11 13:12:20.665599 [ERR] nomad.deployments_watcher: failed to track deployment "785947d4-045b-0827-8180-eec01f0e0de2": deployment "785947d4-045b-0827-8180-eec01f0e0de2" references unknown job "S3apiCache"
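
When many deployments are affected, the IDs and job names can be pulled out of the server log automatically. The following is a minimal sketch, assuming the log lines keep exactly the format shown above; the function name `orphaned_deployments` and the regular expression are illustrative and not part of Nomad.

```python
import re

# Matches the "references unknown job" part of the deployment watcher error.
# Assumes deployment IDs are 36-character UUIDs, as in the log lines above.
PATTERN = re.compile(
    r'deployment "(?P<deployment>[0-9a-f-]{36})" '
    r'references unknown job "(?P<job>[^"]+)"'
)


def orphaned_deployments(log_text):
    """Return {deployment_id: job_name} for each matching log line."""
    found = {}
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            found[match.group("deployment")] = match.group("job")
    return found


# Two lines copied from the log excerpt above.
sample = (
    '2018/05/11 13:12:20.665488 [ERR] nomad.deployments_watcher: failed to track deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453": deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453" references unknown job "S3apiCache"\n'
    '2018/05/11 13:12:20.665512 [ERR] nomad.deployments_watcher: failed to track deployment "358f9dda-9feb-0f66-05e6-647f9e157747": deployment "358f9dda-9feb-0f66-05e6-647f9e157747" references unknown job "tdagent-local"\n'
)

for deployment_id, job in orphaned_deployments(sample).items():
    # Print the short ID (first 8 characters) next to the missing job name.
    print(deployment_id[:8], job)
```

Because the watcher repeats the same errors periodically, collecting the results in a dictionary keyed by deployment ID removes the duplicates.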

All of these deployments are shown as running. For example, deployment 354218d0-1f40-aa7d-6f9a-841a01e4d453 (short ID 354218d0):

$ nomad deployment list | grep '354218d0'
354218d0  S3apiCache                         53           running     Deployment is running

Since the S3apiCache job doesn't actually exist, we tried to fail this deployment manually and got the same error that we see in the Nomad server logs:

$ nomad deployment fail 354218d0
Error failing deployment: Unexpected response code: 500 (rpc error: deployment "354218d0-1f40-aa7d-6f9a-841a01e4d453" references unknown job "S3apiCache")

Because these deployments were left behind by buggy versions of Nomad, I don't think this is a bug in itself, but it seems strange that Nomad doesn't clean up deployments for nonexistent jobs and doesn't allow manual cleanup either.

@tantra35 tantra35 changed the title Can't manuali fali deployment from buggy versions of nomad Can't manualy fali deployment from buggy versions of nomad May 11, 2018
@tantra35 tantra35 changed the title Can't manualy fali deployment from buggy versions of nomad Can't manualy fail deployment from buggy versions of nomad May 11, 2018
@tantra35
Contributor Author

After some investigation we found a workaround. We created fake jobs with the same names as the jobs referenced by the broken deployments; after that we could fail the deployments and clear them with GC.
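
The workaround can be sketched as a dry-run plan that only builds the commands and executes nothing. Several things here are assumptions rather than details from this thread: the helper name `cleanup_commands`, the `placeholders/<job>.nomad` file layout (the placeholder job files must already exist, and their contents are not described above), and the use of the `PUT /v1/system/gc` HTTP endpoint on the default agent address to force garbage collection instead of waiting for the periodic GC.

```python
import shlex


def cleanup_commands(orphans, jobfile_dir="placeholders",
                     nomad_addr="http://127.0.0.1:4646"):
    """Build the workaround's command sequence without running anything.

    orphans maps deployment ID -> name of the job that no longer exists.
    jobfile_dir and nomad_addr are hypothetical defaults for illustration.
    """
    commands = []
    # 1. Register one placeholder job per missing job name.
    for job in sorted(set(orphans.values())):
        commands.append(["nomad", "job", "run", f"{jobfile_dir}/{job}.nomad"])
    # 2. Once the job name resolves again, fail each stuck deployment.
    for deployment_id in sorted(orphans):
        commands.append(["nomad", "deployment", "fail", deployment_id])
    # 3. Assumption: force a garbage collection through the HTTP API.
    commands.append(["curl", "-X", "PUT", f"{nomad_addr}/v1/system/gc"])
    return commands


orphans = {
    "354218d0-1f40-aa7d-6f9a-841a01e4d453": "S3apiCache",
    "785947d4-045b-0827-8180-eec01f0e0de2": "S3apiCache",
}
for command in cleanup_commands(orphans):
    # Print each command for review before running it by hand.
    print(shlex.join(command))
```

Printing the plan first makes it possible to review every command before touching the cluster, which matters because the placeholder jobs are real jobs as far as the scheduler is concerned.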

@qkate qkate assigned dadgar and unassigned qkate May 21, 2018
dadgar added a commit that referenced this issue May 23, 2018
This PR cancels deployments that are active but do not have a job
associated with them. This is a broken invariant that causes issues in
the deployment watcher since it will not track them. Thus they are
objects that can't be operated on or cleaned up.

Fixes #4286
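
The behaviour described in the commit message can be illustrated with a small sketch. This is not the code from the PR (Nomad itself is written in Go): the `Deployment` class, the `cancel_orphaned` function, and the assumption that "active" means a status of `running` or `paused` are all introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical, simplified stand-in for a deployment record.
    id: str
    job_id: str
    status: str


# Assumption: a deployment counts as active while running or paused.
ACTIVE_STATUSES = {"running", "paused"}


def cancel_orphaned(deployments, known_jobs):
    """Cancel active deployments whose job no longer exists.

    Mutates the matching deployments in place and returns their IDs.
    """
    cancelled = []
    for deployment in deployments:
        is_active = deployment.status in ACTIVE_STATUSES
        if is_active and deployment.job_id not in known_jobs:
            deployment.status = "cancelled"
            cancelled.append(deployment.id)
    return cancelled


deployments = [
    Deployment("354218d0", "S3apiCache", "running"),  # job is gone
    Deployment("aaaaaaaa", "webphp", "running"),      # job still exists
    Deployment("bbbbbbbb", "ceph-zabbix", "failed"),  # already terminal
]
print(cancel_orphaned(deployments, known_jobs={"webphp"}))
```

Only the first deployment is cancelled: the second still has its job, and the third is no longer active, so neither breaks the invariant that an active deployment must reference an existing job.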

dadgar commented May 23, 2018

@tantra35 The PR I just put up should clean them up when upgrading to a newer version of Nomad. I don't want to add an endpoint, since this is a case that should never happen: it arose from a bug that has since been fixed.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022