
Alloc.GetAlloc fails on rebooted client #10200

Closed
sgtang opened this issue Mar 19, 2021 · 12 comments · Fixed by #12779

@sgtang

sgtang commented Mar 19, 2021

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian

Issue

Nomad client is not successfully garbage collecting old allocations after a node reboot, client logs reference a UUID index error.

Reproduction steps

This occurred while rebooting a Nomad client without draining jobs first.
Other Nomad clients were rebooted as well; however, this is the only node that had issues, and it's the first time we've seen these logs.

Expected Result

Nomad client reboots, old allocations are gc'd, new allocations are rescheduled successfully.

Actual Result

Old allocations are not completely gc'd, new allocations on this node are stuck on pending.

Nomad Client logs (if appropriate)

2021-03-18T17:57:47.776Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: alloc lookup failed: index error: UUID must be 36 characters" rpc=Alloc.GetAlloc server=:4647
2021-03-18T17:57:47.776Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: alloc lookup failed: index error: UUID must be 36 characters" rpc=Alloc.GetAlloc server=:4647
2021-03-18T17:57:47.776Z [ERROR] client.alloc_watcher: error querying previous alloc: alloc_id=b704b676-44b1-d456-783c-20e9e2285fd5 previous_alloc=48d93821-60db-a17d-9561-be3571a9a776 error="rpc error: alloc lookup failed: index error: UUID must be 36 characters"

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 22, 2021
@tgross tgross self-assigned this Mar 24, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Mar 24, 2021
@tgross
Member

tgross commented Mar 24, 2021

This log entry is interesting:

client.alloc_watcher: error querying previous alloc

That comes from alloc_watcher.go#L394, which is called when the client is waiting for an allocation to be terminated. Was the client out of touch long enough for all its allocs to be marked lost?
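
For reference, here's a minimal sketch of what that waiting loop looks like, assuming a generic `rpcCaller` interface rather than the exact types in `alloc_watcher.go`; it's illustrative only, not the real implementation:

```
// Simplified sketch: block on Alloc.GetAlloc until the previous
// allocation is terminal. rpcCaller is an assumed stand-in for the
// client's RPC plumbing.
package allocwatchsketch

import (
	"fmt"

	"github.com/hashicorp/nomad/nomad/structs"
)

type rpcCaller interface {
	RPC(method string, args interface{}, reply interface{}) error
}

func waitForPrevAlloc(c rpcCaller, prevAllocID string) error {
	req := structs.AllocSpecificRequest{AllocID: prevAllocID}
	for {
		var resp structs.SingleAllocResponse
		// This is the RPC that surfaces "UUID must be 36 characters"
		// when the alloc ID it is handed is empty or malformed.
		if err := c.RPC("Alloc.GetAlloc", &req, &resp); err != nil {
			return fmt.Errorf("error querying previous alloc: %w", err)
		}
		if resp.Alloc == nil || resp.Alloc.TerminalStatus() {
			return nil // previous alloc is gone or terminal
		}
		// Blocking query: only return when the alloc changes again.
		req.MinQueryIndex = resp.Index
	}
}
```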

new allocations on this node are stuck on pending.

Are these allocations replacements for system jobs on this client, or are they newly-scheduled jobs that are being placed but not starting successfully? Are there logs on the client that give any indication what's happening on the client, or task events for those tasks?

@tgross tgross changed the title "UUID must be 36 characters" from rebooted client when performing RPC to server Alloc.GetAlloc fails on rebooted client Mar 24, 2021
@sgtang
Author

sgtang commented Mar 24, 2021

@tgross The logs on the client essentially showed allocs being GC'd successfully, then the client ended up producing the above log every few minutes. I can provide more comprehensive logs via email if needed.

We essentially have one large job running on this server with 20 or so tasks. After the reboot, it seemed like the tasks resumed without being lost; however, the "UUID must be 36 characters" line was triggering our error detection, so we decided to try restarting the job to fix that state. It was at that point that the allocations were stuck in a "pending" state with no task events showing in the UI. We also tried purging and starting the job, to no avail; the same issue occurred.

To resolve this issue, we ended up removing the allocations manually from /var/nomad/alloc, and starting the job afterwards.

@tgross
Member

tgross commented May 4, 2021

To resolve this issue, we ended up removing the allocations manually from /var/nomad/alloc, and starting the job afterwards.

That's interesting... the client doesn't use the presence of those directories to determine if it needs to update the server with the status. The client has its own local state store that it syncs to disk. I was thinking that maybe that state store was somehow corrupt but that wouldn't explain why removing the allocation directories would help at all.

It would definitely be useful to get the logs from both the client and the servers around the time of the reboot. That might give more clues as to what happened here.

@sgtang
Author

sgtang commented May 6, 2021

Thanks Tim, will try to gather the logs from the relevant time period and post them here.

One thing we've been experimenting with that might be slightly different than most deployments is that we have Nomad clients with IPs that change frequently, and we're running local Consul servers on the same nodes as some of the Nomad clients due to limited hardware availability.

This means that when the nodes rebooted, some of the Consul servers would've temporarily lost quorum + the IP addresses of the Nomad clients (and Consul service registration IDs) could've changed after reboot. Wondering if these factors could potentially contribute to corrupted local state?

Will upload the logs here once I get a chance to gather them.

@jlambertjc

We have had this issue a few times, most recently today. We're on Nomad 1.0.4, and a job with a single group containing a single task caused this. It has a template pulling values from Vault (1.7.1) as well. The job file submitted was identical to the previously running job file other than a meta tag we update for when it was submitted. The job was also allocated on a different host than the one previously running it, FWIW.

{"@Level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2021-05-13T20:52:20.946081Z","error":"rpc error: ACL token not found","rpc":"Alloc.GetAlloc","server":{"IP":"10.20.2.132","Port":4647,"Zone":""}}
{"@Level":"error","@message":"error performing RPC to server which is not safe to automatically retry","@module":"client.rpc","@timestamp":"2021-05-13T20:52:20.946162Z","error":"rpc error: ACL token not found","rpc":"Alloc.GetAlloc","server":{"IP":"10.20.2.132","Port":4647,"Zone":""}}
{"@Level":"error","@message":"error querying previous alloc","@module":"client.alloc_migrator","@timestamp":"2021-05-13T20:52:20.946186Z","alloc_id":"e37c4285-aebb-ab4b-9e1f-e0ef7276474d","error":"rpc error: ACL token not found","previous_alloc":"42943055-c074-5deb-ce8d-b85d694f0672"}

@tgross
Member

tgross commented Jun 3, 2021

@jlambertjc wrote:

We have had this issue a few times, most recently today. We're on nomad 1.0.4 and had a job with a single group with a single task cause this. It has a template pulling values from vault (1.7.1) as well. The job file submitted was identical to the previously running job file other than a meta tag we update for when it was submitted. The job was allocated on a different host than the previously running the job as well FWIW.

Hi @jlambertjc! This doesn't look like the same issue at all, other than it's happening on the same RPC endpoint. Looks to me here like you've got an ACLs problem. Can you please open a new issue?

@tgross
Member

tgross commented Jun 3, 2021

@sgtang wrote:

One thing we've been experimenting with that might be slightly different than most deployments is that we have Nomad clients with IPs that change frequently,

Ah... the server keeps a mapping of Node IDs to IP addresses. If you're changing the IP addresses of the clients, this is likely to cause some issues with client-to-server communication. We might be able to provide some guidance here if we had a better understanding of why the IPs of the clients need to change.

@sgtang
Author

sgtang commented Jun 15, 2021

@tgross Thanks for that info re: the Node ID to IP address mapping. Our Nomad clients change IP addresses occasionally because of our VPN redundancy setup: our nodes can connect to any one of 3 VPN gateways, and the client IP address might change depending on the subnet of the gateway it's connected to. We have a hook to restart Nomad every time the address on the tunnel interface changes.

Wondering if you guys have experience with similar setups over VPN?

@tgross
Member

tgross commented Jun 16, 2021

The clients are remote from the servers in this case? We've seen folks do that, but I can't say that I've seen quite the setup you've got with the changing gateway subnets. Typically when I've seen that, the VPN gateways share a pool of IPs.

What does the internal topology look like for reaching the servers? Or in other words, when clients connect through the VPN, do they hit a specific server depending on which gateway subnet they're on? Is there a load balancer with a floating server IP? An A-Record that clients are using to reach the address? I think with enough information we can build a little test environment that exercises the behavior to make sure we have a good understanding of what's happening.

In the meantime, have you tuned the min_heartbeat_ttl or heartbeat_grace to account for the restarts? I'm not sure that'll totally prevent this kind of problem, but it could reduce the risk that the client is marked out-of-contact and has its allocations evicted.

@sgtang
Author

sgtang commented Jun 21, 2021

The clients are remote from the servers, and we have multiple clients in various physical locations. We use Consul as a backend, so no A-record/load balancer for the servers, just discovery through service registration, and the clients can talk to all of our cloud Nomad servers in our single DC, regardless of the VPN gateway they connect to.

The VPN gateways are in the same cloud subnet as the servers, each managing a different set of IPs that they assign to clients. The issue is that the Nomad clients bind to the IPs assigned by the VPNs, and the subnet for the IPs differs per gateway to avoid routing conflicts. We are using the HA/failover config described in this article from OpenVPN: https://openvpn.net/community-resources/implementing-a-load-balancing-failover-configuration/ which essentially recommends separate servers with different virtual IP pools.

If it's a strict requirement for the Nomad client IPs to stay the same, we can try looking at ways to keep static client IPs across the gateways.

We have tuned heartbeat_grace to account for the restarts. Thanks for the suggestion and the help!

@tgross tgross moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage Jul 7, 2021
@tgross tgross removed their assignment Jul 7, 2021
schmichael added a commit that referenced this issue Apr 26, 2022
Fixes #10200

**The bug**

A user reported receiving the following error when an alloc was placed
that needed to preempt existing allocs:

```
[ERROR] client.alloc_watcher: error querying previous alloc:
alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup
failed: index error: UUID must be 36 characters"
```

The previous alloc (8e) was already complete on the client. This is
possible if an alloc stops *after* the scheduling decision was made to
preempt it, but *before* the node running both allocations was able to
pull and start the preemptor. While that is hopefully a narrow window of
time, you can expect it to occur in high-throughput, batch-scheduling-heavy systems.

However the RPC error made no sense! `previous_alloc` in the logs was a
valid 36 character UUID!

**The fix**

The fix is:

```
-		prevAllocID:  c.Alloc.PreviousAllocation,
+		prevAllocID:  watchedAllocID,
```

The alloc watcher new func used for preemption improperly referenced
Alloc.PreviousAllocation instead of the passed in watchedAllocID. When
multiple allocs are preempted, a watcher is created for each with
watchedAllocID set properly by the caller. In this case
Alloc.PreviousAllocation="" -- which is where the `UUID must be 36 characters`
error was coming from! Sadly we were properly referencing
watchedAllocID in the log, so it made the error make no sense!
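
To make that concrete, here is a rough sketch of the corrected shape. The `watcherConfig` and `newWatchersForPreemption` names are hypothetical stand-ins, not the actual `client/allocwatcher` API:

```
// Hypothetical sketch, not the real allocwatcher code: one watcher per
// preempted alloc, keyed by the ID the caller asked it to watch.
package allocwatchsketch

import "github.com/hashicorp/nomad/nomad/structs"

type watcherConfig struct {
	prevAllocID string // the alloc this watcher waits on
}

func newWatchersForPreemption(alloc *structs.Allocation) []watcherConfig {
	configs := make([]watcherConfig, 0, len(alloc.PreemptedAllocations))
	for _, watchedAllocID := range alloc.PreemptedAllocations {
		configs = append(configs, watcherConfig{
			// The fix: use the watched alloc's ID. The buggy version used
			// alloc.PreviousAllocation, which is "" for preempted
			// placements and triggered the "UUID must be 36 characters"
			// error from Alloc.GetAlloc.
			prevAllocID: watchedAllocID,
		})
	}
	return configs
}
```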

**The repro**

I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl)
and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl)
for ease of repro.

First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad),
then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad)
that evicts 2 low priority jobs. Everything worked as expected.

However if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147),
it reproduces the bug because the watcher references PreviousAllocation
instead of watchedAllocID.
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Apr 26, 2022
@schmichael
Member

I think #12779 will fix this issue. Even if it doesn't fix it entirely, it will at least fix the strange UUID must be 36 characters error message and give us new information to work with!

Please don't hesitate to reopen this issue (or open a new one) if this issue persists after 1.3.0-rc1, 1.2.7, and 1.1.13 are released.

@github-actions

github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 8, 2022