Allocation Logs Immediately Removed on Allocation Replacement #18034

Closed
thetoothpick opened this issue Jul 21, 2023 · 6 comments · Fixed by #18108
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/client theme/data migration theme/logging type/bug

Comments

@thetoothpick

Nomad version

# nomad -version
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af70885c02ab921dedbdf6bc406a1e886866f80

Operating system and Environment details

# uname -a
Linux ip-10-3-6-38.ec2.internal 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue

When our Nomad service allocations finish (either successfully or unsuccessfully) and are replaced by another allocation from the same task group on the same client, all allocation logs on the filesystem are immediately removed. This makes it difficult to capture things that are only logged to standard out or error, especially if the allocation crashed or failed. It seems like the entire allocation directory is garbage collected immediately when the allocation is replaced, but we haven't been able to find any logs indicating why (or that this is actually what's happening).

These are our GC-related settings:

server {
  job_gc_threshold              = "24h"
  eval_gc_threshold             = "24h"
  deployment_gc_threshold       = "24h"
  csi_volume_claim_gc_threshold = "3m"
}

We do have sticky ephemeral disks enabled, and the alloc/data directory is transferred to the new allocation successfully.
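For reference, the sticky disk is configured at the task group level, roughly like the sketch below (the group name and size are illustrative, not our exact job spec):

group "example" {
  ephemeral_disk {
    # sticky keeps alloc/data on the same node across allocation replacements
    sticky = true
    size   = 300
  }
}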

On some hosts that run fewer allocations, we're sometimes able to see logs for an extended period after the previous allocations stop and are replaced, so the problem isn't universal. We've also checked disk usage on the hosts when this happens, and it's very low (usually 2-5% used).

To work around this and help debug errors in our jobs, we developed a "poststop" task that saves the allocation logs after the allocation finishes. However, sometimes even this poststop task fails to start, with this error:

logmon: rpc error: code = Unknown desc = failed to create stdout logfile for "save-nomad-logs.stdout": open /.../alloc/logs: no such file or directory.

This seems to indicate that the allocation (log) directory is being deleted before the allocation even completes.

We're not sure what might be causing this.
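For context, the save-logs task is wired up roughly like the sketch below (the archive path and environment-variable interpolation are illustrative; our real job spec differs in the details):

task "save-nomad-logs" {
  lifecycle {
    hook = "poststop"
  }

  driver = "raw_exec"

  config {
    command = "tar"
    # NOMAD_ALLOC_DIR is the shared alloc/ directory, which contains logs/
    args    = ["-czvf", "/tmp/alloc-logs-${NOMAD_ALLOC_ID}.tar.gz", "-C", "${NOMAD_ALLOC_DIR}", "logs"]
  }
}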

Reproduction steps

  1. Configure Nomad with the following garbage collection (GC) related settings:
    server {
      job_gc_threshold              = "24h"
      eval_gc_threshold             = "24h"
      deployment_gc_threshold       = "24h"
      csi_volume_claim_gc_threshold = "3m"
    }
    
  2. Enable sticky ephemeral disks.
  3. Start Nomad service and deploy multiple allocations for a job on the same client.
  4. Observe that upon replacement of an allocation, all allocation logs on the filesystem are immediately removed.

Expected Result

When an allocation finishes and is replaced, the allocation logs should not be immediately removed, allowing for the capture of logs that may be useful for debugging purposes.

Actual Result

All allocation logs on the filesystem are immediately removed when the allocation is replaced, hindering the ability to capture critical logs for debugging.

Nomad Server logs (if appropriate)

Not gathered at debug level.

Nomad Client logs (if appropriate)

Not gathered at debug level.


Enabling debug logging requires a full redeploy of some of our infrastructure, but we can do it if required.
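For reference, the agent log level is controlled by the top-level log_level setting, which is why changing it means a redeploy for us; a minimal sketch, with an illustrative log_file path:

log_level = "DEBUG"
log_file  = "/var/log/nomad/nomad.log"

It looks like the nomad monitor -log-level=DEBUG command can also stream debug-level logs from a running agent without a configuration change, which might let us avoid the full redeploy.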

Please let us know the best way to continue investigating, or if anything obvious jumps out that we missed.

Thanks!

Kevin

@thetoothpick
Author

Just want to point out that I think this is distinct from #13632 (if I'm reading through it correctly), because the allocation history is still present in the UI (along with the allocation / task timelines, etc.), but we can't open the logs (because they're missing from the disk).

@mr-karan
Contributor

+1.

Chiming in with my use case: we have a few cron jobs that run at midnight. It'd be nice to have a configurable GC interval for these allocs so that it's still possible to debug the cron jobs from their logs when needed.

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 24, 2023
@tgross
Member

tgross commented Jul 25, 2023

@thetoothpick As I noted in #13632 (comment) the server GC and the client GC are separate operations with separate configuration. So you're right that #13632 is a different problem.

The client will emit info-level logs with text like "garbage collecting allocation", which would be useful to see here. Have you had a look at the client.gc_interval or the other gc_* configuration values on the client?
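For reference, those knobs live in the client block of the agent configuration; the values below are the documented defaults (worth double-checking against the docs for the version you're running):

client {
  # how often the client scans for terminal allocations to GC
  gc_interval              = "1m"
  # disk usage percent that triggers GC of terminal allocations
  gc_disk_usage_threshold  = 80
  # inode usage percent that triggers GC
  gc_inode_usage_threshold = 70
  # number of allocations a client keeps before GC kicks in
  gc_max_allocs            = 50
  # number of allocation directories destroyed in parallel
  gc_parallel_destroys     = 2
}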

@tgross tgross self-assigned this Jul 28, 2023
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jul 28, 2023
@thetoothpick
Author

Hi @tgross - thanks for the pointers. We are hitting allocation garbage collection for certain allocations, usually due to full disks (which is a separate thing we're addressing).

We see a number of logs like this for these situations:

    2023-07-28T21:43:26.939Z [WARN]  client.gc: garbage collection skipped because no terminal allocations: reason="disk usage of 91 is over gc threshold of 80"
...
    2023-07-28T19:29:25.226Z [INFO]  client.gc: garbage collecting allocation: alloc_id=283a28be-b237-af3e-b22c-f20aa9b18100 reason="disk usage of 100 is over gc threshold of 80"

However, I don't think that's what's causing the log deletion we're seeing. The disks on hosts where this happens are usually pretty empty, and while there is a "marking allocation for GC" line, there is no corresponding "garbage collecting allocation" line.

I've identified and gathered logs for an allocation where this behavior happened. I've included the tail end of the log below, but can email the full log (it contains potentially sensitive info).

Here, the allocation stopped and a new allocation replaced it on the same host. The allocation history remained, but the alloc dir was gone save for <id>/save-nomad-logs/secrets.

2023-07-28T17:24:19.441Z [INFO] client.gc: marking allocation for GC: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7
2023-07-28T17:24:19.441Z [INFO] client.alloc_runner: waiting for task to exit: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=c-media-driver
2023-07-28T17:24:19.443Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type="Task Setup" msg="Building Task Directory" failed=false
2023-07-28T17:24:19.483Z [INFO] agent: (runner) rendered "(dynamic)" => "/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7/save-nomad-logs/local/config/application.conf"
2023-07-28T17:24:19.486Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:tar Args:[-czvf /tmp/alloc-logs-667de301.tar.gz -C /nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7/alloc logs]}"
2023-07-28T17:24:19.537Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type=Started msg="Task started by client" failed=false
2023-07-28T17:24:19.586Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type=Terminated msg="Exit Code: 1" failed=false
2023-07-28T17:24:19.593Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs reason="Policy allows no restarts"
2023-07-28T17:24:19.593Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type="Not Restarting" msg="Policy allows no restarts" failed=true
2023-07-28T17:24:19.593Z [INFO] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 driver=raw_exec task_name=save-nomad-logs path=/nomad/nomad_zip/nomad pid=3667330
2023-07-28T17:24:19.602Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs path=/nomad/nomad_zip/nomad pid=3667316
2023-07-28T17:24:19.780Z [ERROR] client.alloc_migrator: error destroying alloc dir: alloc_id=7875fe18-3797-968c-2489-b54d011511ef previous_alloc=667de301-622d-51ba-a54d-23bbc8d407b7 previous_alloc_dir=/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7
  |   * failed to remove alloc dir "/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7": unlinkat /nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7/save-nomad-logs/secrets: device or resource busy

This shows the "poststop" job (and failure) I described in the original issue.

The "client.alloc_migrator: error destroying alloc dir" line seems pretty relevant - is it standard procedure to delete the entire allocation directory when ephemeral_disk.sticky = true?

@tgross
Member

tgross commented Jul 31, 2023

Thanks @thetoothpick, that context and those logs are very helpful.

2023-07-28T17:24:19.780Z [ERROR] client.alloc_migrator: error destroying alloc dir: alloc_id=7875fe18-3797-968c-2489-b54d011511ef previous_alloc=667de301-622d-51ba-a54d-23bbc8d407b7 previous_alloc_dir=/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7

This message "error destroying alloc dir" is coming from alloc_watcher.go#L281-L285. This file is where Nomad does the migration of the ephemeral data from one allocation to another when you have ephemeral_disk.migrate = true set.

We've got two different workflows here: one for when the previous allocation is "remote" (on a different client) and one, "local", for when the previous allocation is on the same node. In the remote case, we create a new "previous alloc" directory to unpack the data we've downloaded in migrateAllocDir. We then delete that directory when we're done, because there's no need for it to exist.

But in the "local" case, which is what you're seeing, we move the old alloc/data and task directories via the Move method, and then, when that returns successfully, we destroy the entire allocation directory, which includes the logs!

As far as I can tell there's no reason to destroy the previous allocation directory at all there, as the usual allocation garbage collection should clean this up later. But it might have made sense back in 2017 when that code was written. 😀

The fix should be small, but I'll need to do some hands-on testing to make sure we're not breaking something I didn't think of. I've got a draft PR up at #18108, but it's getting towards EOD for me, so I'll try to get that tested out tomorrow.

@tgross tgross added the stage/accepted label and removed the stage/waiting-reply label Jul 31, 2023
@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Jul 31, 2023
tgross added a commit that referenced this issue Jul 31, 2023
When `ephemeral_disk.migrate=true`, we make a best-effort attempt to migrate the
allocation data between the old and new allocations. For previous allocations on other
clients (the "remote" workflow), we create a local allocdir and download the
data from the previous client into it. That data is then moved into the new
allocdir and we delete the allocdir of the previous alloc.

For "local" previous allocations we don't need to create an extra directory for
the previous allocation and instead move the files directly from one to the
other. But we still delete the old allocdir _entirely_, which includes all the
logs!

There doesn't seem to be any reason to destroy the local previous allocdir, as
the usual client garbage collection should destroy it later on when needed. By
not deleting it, the previous allocation's logs are still available for the user
to read.

Fixes: #18034
tgross added a commit that referenced this issue Aug 1, 2023
When ephemeral disks are migrated from an allocation on the same node,
allocation logs for the previous allocation are lost.

There are two workflows for the best-effort attempt to migrate the allocation
data between the old and new allocations. For previous allocations on other
clients (the "remote" workflow), we create a local allocdir and download the
data from the previous client into it. That data is then moved into the new
allocdir and we delete the allocdir of the previous alloc.

For "local" previous allocations we don't need to create an extra directory for
the previous allocation and instead move the files directly from one to the
other. But we still delete the old allocdir _entirely_, which includes all the
logs!

There doesn't seem to be any reason to destroy the local previous allocdir, as
the usual client garbage collection should destroy it later on when needed. By
not deleting it, the previous allocation's logs are still available for the user
to read.

Fixes: #18034
@tgross tgross added this to the 1.6.x milestone Aug 1, 2023
Nomad - Community Issues Triage automation moved this from In Progress to Done Aug 2, 2023
tgross added a commit that referenced this issue Aug 2, 2023
@tgross
Member

tgross commented Aug 2, 2023

#18108 has been merged and will ship in the next regular release of Nomad 1.6.x, with backports to 1.5.x and 1.4.x.

tgross added a commit that referenced this issue Aug 2, 2023