Allocation Logs Immediately Removed on Allocation Replacement #18034

Closed
thetoothpick opened this issue Jul 21, 2023 · 6 comments · Fixed by #18108
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/client theme/data migration theme/logging type/bug

Comments

@thetoothpick

Nomad version

# nomad -version
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af70885c02ab921dedbdf6bc406a1e886866f80

Operating system and Environment details

# uname -a
Linux ip-10-3-6-38.ec2.internal 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue

When our Nomad service allocations finish (either successfully or unsuccessfully) and are replaced by another allocation from the same task group on the same client, all allocation logs on the filesystem are immediately removed. This makes it difficult to capture things that are only logged to standard out or error, especially if the allocation crashed or failed. It seems like the entire allocation directory is garbage collected immediately when the allocation is replaced, but we haven't been able to find any logs indicating why (or that this is actually what's happening).

These are our GC-related settings:

server {
  job_gc_threshold              = "24h"
  eval_gc_threshold             = "24h"
  deployment_gc_threshold       = "24h"
  csi_volume_claim_gc_threshold = "3m"
}

We do have sticky ephemeral disks enabled, and the alloc/data directory is transferred to the new allocation successfully.
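For reference, the sticky disk is configured at the task group level, roughly like the sketch below (the group name and size are illustrative, not our exact job spec):

group "example" {
  ephemeral_disk {
    # sticky keeps alloc/data on the same node across allocation replacements
    sticky = true
    size   = 300
  }
}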

On some hosts that run fewer allocations, we're sometimes able to see logs for an extended period after the previous allocations stop and are replaced, so the problem isn't universal. We've also checked disk usage on the hosts when this happens, and it's very low (usually 2-5% used).

To work around this and help debug errors in our jobs, we developed a "poststop" task that saves the allocation logs after the allocation finishes. However, sometimes even this poststop task fails to start, with this error:

logmon: rpc error: code = Unknown desc = failed to create stdout logfile for "save-nomad-logs.stdout": open /.../alloc/logs: no such file or directory.

This seems to indicate that the allocation (log) directory is being deleted before the allocation even completes.

We're not sure what might be causing this.
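For context, the save-logs task is wired up roughly like the sketch below (the archive path and environment-variable interpolation are illustrative; our real job spec differs in the details):

task "save-nomad-logs" {
  lifecycle {
    hook = "poststop"
  }

  driver = "raw_exec"

  config {
    command = "tar"
    # NOMAD_ALLOC_DIR is the shared alloc/ directory, which contains logs/
    args    = ["-czvf", "/tmp/alloc-logs-${NOMAD_ALLOC_ID}.tar.gz", "-C", "${NOMAD_ALLOC_DIR}", "logs"]
  }
}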

Reproduction steps

  1. Configure Nomad with the following garbage collection (GC) related settings:
    server {
      job_gc_threshold              = "24h"
      eval_gc_threshold             = "24h"
      deployment_gc_threshold       = "24h"
      csi_volume_claim_gc_threshold = "3m"
    }
    
  2. Enable sticky ephemeral disks.
  3. Start Nomad service and deploy multiple allocations for a job on the same client.
  4. Observe that upon replacement of an allocation, all allocation logs on the filesystem are immediately removed.

Expected Result

When an allocation finishes and is replaced, the allocation logs should not be immediately removed, allowing for the capture of logs that may be useful for debugging purposes.

Actual Result

All allocation logs on the filesystem are immediately removed when the allocation is replaced, hindering the ability to capture critical logs for debugging.

Nomad Server logs (if appropriate)

Not gathered at debug level.

Nomad Client logs (if appropriate)

Not gathered at debug level.


Enabling debug logging requires a full redeploy of some of our infrastructure, but we can do it if required.
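For reference, the agent log level is controlled by the top-level log_level setting, which is why changing it means a redeploy for us; a minimal sketch, with an illustrative log_file path:

log_level = "DEBUG"
log_file  = "/var/log/nomad/nomad.log"

It looks like the nomad monitor -log-level=DEBUG command can also stream debug-level logs from a running agent without a configuration change, which might let us avoid the full redeploy.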

Please let us know the best way to continue investigating, or if anything obvious jumps out that we missed.

Thanks!

Kevin

@thetoothpick
Author

Just want to point out that I think this is distinct from #13632 (if I'm reading through it correctly), because the allocation history is still present in the UI (along with the allocation / task timelines, etc.), but we can't open the logs (because they're missing from the disk).

@mr-karan
Contributor

+1.

Chiming in with my use case: we have a few cron jobs that run at midnight. It'd be nice to have a configurable GC interval for these allocs so that it's still possible to debug the cron jobs from their logs when needed.

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 24, 2023
@tgross
Member

tgross commented Jul 25, 2023

@thetoothpick As I noted in #13632 (comment) the server GC and the client GC are separate operations with separate configuration. So you're right that #13632 is a different problem.

The client will emit info-level logs with text like "garbage collecting allocation", which would be useful to see here. Have you had a look at the client.gc_interval or the other gc_* configuration values on the client?
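For reference, those knobs live in the client block of the agent configuration; the values below are the documented defaults (worth double-checking against the docs for the version you're running):

client {
  # how often the client scans for terminal allocations to GC
  gc_interval              = "1m"
  # disk usage percent that triggers GC of terminal allocations
  gc_disk_usage_threshold  = 80
  # inode usage percent that triggers GC
  gc_inode_usage_threshold = 70
  # number of allocations a client keeps before GC kicks in
  gc_max_allocs            = 50
  # number of allocation directories destroyed in parallel
  gc_parallel_destroys     = 2
}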

@tgross tgross self-assigned this Jul 28, 2023
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jul 28, 2023
@thetoothpick
Author

Hi @tgross - thanks for the pointers. We are hitting allocation garbage collection for certain allocations, usually due to full disks (which is a separate thing we're addressing).

We see a number of logs like this for these situations:

    2023-07-28T21:43:26.939Z [WARN]  client.gc: garbage collection skipped because no terminal allocations: reason="disk usage of 91 is over gc threshold of 80"
...
    2023-07-28T19:29:25.226Z [INFO]  client.gc: garbage collecting allocation: alloc_id=283a28be-b237-af3e-b22c-f20aa9b18100 reason="disk usage of 100 is over gc threshold of 80"

However, I don't think that's what's causing the log deletion we're seeing. The disks on hosts where this happens are usually pretty empty, and while there is a "marking allocation for GC" line, there is no corresponding "garbage collecting allocation" line.

I've identified and gathered logs for an allocation where this behavior happened. I've included the tail end of the log below, but can email the full log (it contains potentially sensitive info).

Here, the allocation stopped and a new allocation replaced it on the same host. The allocation history remained, but the alloc dir was gone save for <id>/save-nomad-logs/secrets.

2023-07-28T17:24:19.441Z [INFO] client.gc: marking allocation for GC: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7
2023-07-28T17:24:19.441Z [INFO] client.alloc_runner: waiting for task to exit: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=c-media-driver
2023-07-28T17:24:19.443Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type="Task Setup" msg="Building Task Directory" failed=false
2023-07-28T17:24:19.483Z [INFO] agent: (runner) rendered "(dynamic)" => "/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7/save-nomad-logs/local/config/application.conf"
2023-07-28T17:24:19.486Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:tar Args:[-czvf /tmp/alloc-logs-667de301.tar.gz -C /nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7/alloc logs]}"
2023-07-28T17:24:19.537Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type=Started msg="Task started by client" failed=false
2023-07-28T17:24:19.586Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type=Terminated msg="Exit Code: 1" failed=false
2023-07-28T17:24:19.593Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs reason="Policy allows no restarts"
2023-07-28T17:24:19.593Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type="Not Restarting" msg="Policy allows no restarts" failed=true
2023-07-28T17:24:19.593Z [INFO] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 driver=raw_exec task_name=save-nomad-logs path=/nomad/nomad_zip/nomad pid=3667330
2023-07-28T17:24:19.602Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs path=/nomad/nomad_zip/nomad pid=3667316
2023-07-28T17:24:19.780Z [ERROR] client.alloc_migrator: error destroying alloc dir: alloc_id=7875fe18-3797-968c-2489-b54d011511ef previous_alloc=667de301-622d-51ba-a54d-23bbc8d407b7 previous_alloc_dir=/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7
  |   * failed to remove alloc dir "/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7": unlinkat /nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7/save-nomad-logs/secrets: device or resource busy

This shows the "poststop" job (and failure) I described in the original issue.

The "client.alloc_migrator: error destroying alloc dir" line seems pretty relevant - is it standard procedure to delete the entire allocation directory when ephemeral_disk.sticky = true?

@tgross
Member

tgross commented Jul 31, 2023

Thanks @thetoothpick, that context and those logs are very helpful.

2023-07-28T17:24:19.780Z [ERROR] client.alloc_migrator: error destroying alloc dir: alloc_id=7875fe18-3797-968c-2489-b54d011511ef previous_alloc=667de301-622d-51ba-a54d-23bbc8d407b7 previous_alloc_dir=/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7

This message "error destroying alloc dir" is coming from alloc_watcher.go#L281-L285. This file is where Nomad does the migration of the ephemeral data from one allocation to another when you have ephemeral_disk.migrate = true set.

We've got two different workflows here: one for when the previous allocation is "remote" (on a different client) and one, "local", for when the previous allocation is on the same node. In the remote case, we create a new "previous alloc" directory to unpack the data we've downloaded in migrateAllocDir. We then delete that directory when we're done, because there's no need for it to exist.

But in the "local" case, which is what you're seeing, we move the old alloc/data and task directories via the Move method, and then, when that returns successfully, we destroy the entire allocation directory, which includes the logs!

As far as I can tell there's no reason to destroy the previous allocation directory at all there, as the usual allocation garbage collection should clean this up later. But it might have made sense back in 2017 when that code was written. 😀

The fix should be small, but I'll need to do some hands-on testing to make sure we're not breaking something I didn't think of. I've got a draft PR up at #18108, but it's getting towards EOD for me, so I'll try to get that tested out tomorrow.

@tgross tgross added the stage/accepted label and removed the stage/waiting-reply label Jul 31, 2023
@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Jul 31, 2023
tgross added a commit that referenced this issue Jul 31, 2023
When `ephemeral_disk.migrate=true`, we make a best-effort attempt to migrate the
allocation data between the old and new allocations. For previous allocations on other
clients (the "remote" workflow), we create a local allocdir and download the
data from the previous client into it. That data is then moved into the new
allocdir and we delete the allocdir of the previous alloc.

For "local" previous allocations we don't need to create an extra directory for
the previous allocation and instead move the files directly from one to the
other. But we still delete the old allocdir _entirely_, which includes all the
logs!

There doesn't seem to be any reason to destroy the local previous allocdir, as
the usual client garbage collection should destroy it later on when needed. By
not deleting it, the previous allocation's logs are still available for the user
to read.

Fixes: #18034
tgross added a commit that referenced this issue Aug 1, 2023
When ephemeral disks are migrated from an allocation on the same node,
allocation logs for the previous allocation are lost.

There are two workflows for the best-effort attempt to migrate the allocation
data between the old and new allocations. For previous allocations on other
clients (the "remote" workflow), we create a local allocdir and download the
data from the previous client into it. That data is then moved into the new
allocdir and we delete the allocdir of the previous alloc.

For "local" previous allocations we don't need to create an extra directory for
the previous allocation and instead move the files directly from one to the
other. But we still delete the old allocdir _entirely_, which includes all the
logs!

There doesn't seem to be any reason to destroy the local previous allocdir, as
the usual client garbage collection should destroy it later on when needed. By
not deleting it, the previous allocation's logs are still available for the user
to read.

Fixes: #18034
@tgross tgross added this to the 1.6.x milestone Aug 1, 2023
Nomad - Community Issues Triage automation moved this from In Progress to Done Aug 2, 2023
tgross added a commit that referenced this issue Aug 2, 2023
@tgross
Member

tgross commented Aug 2, 2023

#18108 has been merged and will ship in the next regular release of Nomad 1.6.x, with backports to 1.5.x and 1.4.x.

tgross added a commit that referenced this issue Aug 2, 2023