
Alloc Exec spontaneously disconnects after some time #10579

Closed
the-maldridge opened this issue May 13, 2021 · 13 comments · Fixed by #10710

Comments

@the-maldridge

Nomad version

$ nomad version
Nomad v1.0.4 (a480eed0815c54612856d9115a34bb1d1a773e8c)

Operating system and Environment details

Observed with Ubuntu and Void Linux clients, with macOS clients, and with Alpine, Void, and ResinStack servers. All x64; tested with diversity in network path, underlying provider, and Nomad versions at both ends.

Issue

When using the alloc exec command the shell connects as expected and interactive commands work great. After some time the session spontaneously disconnects. Sometimes it disconnects with an error such as failed to exec into task: disconnected without receiving the exit code. Sometimes it disconnects with no error printed at all.

Reproduction steps

  1. Launch a container of your choosing that contains an interactive shell. Busybox with the entrypoint set to ["/bin/sleep", "3600"] works well for this.
  2. Exec into the alloc with -i -t options.
  3. Wait.

After some time you will be disconnected.

Expected Result

I expect that unless an external and intentional factor is at play, the session should remain connected. Examples of external factors include but are not limited to: network path disruption, token expiry, user explicitly disconnecting the session.

Actual Result

The interactive exec is disconnected without warning and with no discernible cause.

So far I have not found anything in the logs and the job being exec'd into seems to have no effect on how long the session survives. I think this may be some context internal to Nomad not being renewed, but I have not traced the execution yet.

@ld100

ld100 commented May 13, 2021

I'm having the same issue on Nomad 1.0.2. Each time the session lasts between 2 and 15 minutes before the disconnect. Every time I get the "failed to exec into task: disconnected without receiving the exit code" error on disconnect.

@johnalotoski

Also experiencing this intermittently for the past several months on various versions of Nomad, currently 1.0.4.
Today I'm trying to time a long-running command and this is interfering with that. Thanks for filing!

@apollo13
Contributor

Ha, I also noticed this and thought my haproxy configuration was at fault. But it seems Nomad alone also has the problem -- knowing that would have saved me a few hours of debugging :D

@galeep galeep changed the title Alloc Exec spontanesouly disconnects after some time Alloc Exec spontaneously disconnects after some time May 19, 2021
@tgross
Member

tgross commented May 19, 2021

Hi folks! Looks like that error is bubbling up from (*Allocations).Exec in the api package. The channel we're polling on is getting closed, so we'll need to hunt down why. Probably either there's an undocumented timeout or there's a keep-alive missing for the websocket (I know we had to do something like that for nomad monitor, but I'm not sure how generalized that was).
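
For context, the kind of websocket keep-alive in question looks roughly like this. This is only a sketch assuming the gorilla/websocket package; the names and intervals are hypothetical, not Nomad's actual code. The idea is to ping on an interval and extend the read deadline whenever a pong comes back, so an idle-but-healthy session is not torn down.

```go
// Minimal keep-alive sketch (hypothetical names, assumes gorilla/websocket).
package execstream

import (
	"time"

	"github.com/gorilla/websocket"
)

const (
	pingInterval = 10 * time.Second // hypothetical; mirrors the CLI's 10s heartbeat
	pongWait     = 30 * time.Second // hypothetical grace period before giving up on the peer
)

// keepAlive pings conn until stop is closed and reports the first write
// failure on errCh instead of silently dropping it.
func keepAlive(conn *websocket.Conn, stop <-chan struct{}, errCh chan<- error) {
	conn.SetReadDeadline(time.Now().Add(pongWait))
	conn.SetPongHandler(func(string) error {
		// Each pong proves the peer is alive; push the read deadline out.
		return conn.SetReadDeadline(time.Now().Add(pongWait))
	})

	ticker := time.NewTicker(pingInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			deadline := time.Now().Add(5 * time.Second)
			if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
				errCh <- err // surface the failure rather than discarding it
				return
			}
		case <-stop:
			return
		}
	}
}
```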

I'm going to dig into this with my colleagues @isabeldepapel and @jsosulska tomorrow.

@tgross tgross self-assigned this May 19, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage May 19, 2021
@the-maldridge
Author

Fantastic news, it's great to know that this isn't some heisenbug as it originally appeared to be.

@tgross
Member

tgross commented May 20, 2021

Ok, so we spent a little time pairing on this and there are a few things we've come up with:

  • That disconnected without receiving the exit code error message path can only be hit if we get a "is closed" error from the websocket library api/allocations.go#L257-L258. That can be any of a large number of websocket messages, but importantly they're all protocol messages. Unfortunately we throw away the error rather than letting the CLI user know what happened, which is lousy behavior and we should fix that. So that's bug 1 to fix.
  • The CLI side does heartbeat every 10 sec. But unfortunately the sender function returns an error that we always throw away rather than reporting to the user (see the sketch after this list). So it's entirely possible for the heartbeats to fail, and that would lead to the server eventually disconnecting you. From a design standpoint, it's a little messy -- nomad alloc exec is supposed to be streaming the output from the remote process, so if we're interleaving that with control errors, that's going to mess up the usefulness of the command (and would break backwards compat). It'd probably be better to log missed heartbeats on the server side instead.
  • The server-side decoding of messages in alloc_endpoint.go#L525-L531 is what could be sending the error. The isClosedError function it calls is a little interesting in that it tries to parse the error on the msgpack response and detects the string literals "closed" and "EOF". Next we want to dig into where those msgpack errors are being set so we can make sure it's not possible for the process output to accidentally write control messages.
  • I also just want to set up a quickie tcpdump and leave it running so we can see if we can capture one of these events on the network. That'd help narrow down the source.
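
As a sketch of what surfacing the heartbeat error from the second bullet could look like: the names below are assumptions (sendFrame stands in for whatever function writes a frame to the websocket), not the actual api-package code. The point is simply to hand the failure back to the caller so it can be logged or turned into a clean teardown instead of being thrown away.

```go
// Heartbeat loop sketch with hypothetical names; not Nomad's actual code.
package execstream

import "time"

// sendFrame is a placeholder for the real frame-writing function.
type sendFrame func(payload []byte) error

// heartbeat writes an empty {} frame every interval. The first failure is
// reported on the returned channel and the loop stops, letting the caller
// decide whether to log it, retry, or end the session deliberately.
func heartbeat(send sendFrame, interval time.Duration, stop <-chan struct{}) <-chan error {
	errCh := make(chan error, 1)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				if err := send([]byte("{}")); err != nil {
					errCh <- err
					return
				}
			case <-stop:
				return
			}
		}
	}()
	return errCh
}
```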

@tgross
Member

tgross commented May 21, 2021

I've opened #10638 to make the problem a bit more debuggable. I ran tcpdump for an hour or so and I see the {}\n heartbeat message every 10s as expected. Getting to the end of my week here but I'll see about trying to set up a longer-running test of it (and across a more realistic network) to see if that gets us any more info.

@tgross tgross assigned notnoop and unassigned tgross May 24, 2021
@ld100

ld100 commented Jun 3, 2021

Any updates on the issue? Can't understand it from #10638

@tgross
Member

tgross commented Jun 3, 2021

I believe this was fixed by @notnoop in #10657 and will ship in Nomad 1.1.1. @notnoop can we get the changelog entry added to 1.1.1 and the 1.0.7 backport?

@tgross tgross closed this as completed Jun 3, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 3, 2021
@tgross tgross added this to the 1.1.1 milestone Jun 3, 2021
@notnoop notnoop reopened this Jun 3, 2021
Nomad - Community Issues Triage automation moved this from Done to Needs Triage Jun 3, 2021
@notnoop
Contributor

notnoop commented Jun 3, 2021

I'm afraid we still need to dig into this issue more. #10657 ensures that the CLI reports the last few events (e.g. final output, exit code); it addresses the case where a completed command results in an "unexpected EOF" error.

We aren't certain what the cause of the unexpected disconnects is here. #10657 may highlight the cause more, though the issue might be a bug in the RPC forwarding logic, or the 10s heartbeat period might be too long in some environments. I haven't fully reproduced it, but I'll test it more this week.

For added context, I have a WIP branch that logs all exec messages and events at https://github.com/hashicorp/nomad/compare/b-debug-exec-2. Hopefully that will highlight the issue, but I haven't caught it yet.

@notnoop
Contributor

notnoop commented Jun 7, 2021

So I finally have a fix for this in #10710! Thank you @the-maldridge for reporting this issue. It's very unfortunate that we did not notice this issue internally. Without your GitHub reports, this issue would have kept haunting more users unnecessarily!

@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 7, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 7, 2021
notnoop pushed a commit that referenced this issue Jun 7, 2021
Track usage of incoming streams on a connection. Connections without
reference counts get marked as unused and reaped in a periodic job.

This fixes a bug where `alloc exec` and `alloc fs` sessions get terminated
unexpectedly. Previously, when a client's heartbeat switches between
servers, the pool connection reaper eventually identifies the connection
as unused and closes it even if it has active exec/fs sessions.

Fixes #10579
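
A rough sketch of the reference-counting idea in that commit message follows; the types and names are assumptions for illustration, not the actual change in #10710. Each incoming exec/fs stream bumps a counter on its pooled connection, and the periodic reaper only closes connections whose counter is zero and which have been idle for a while.

```go
// Reference-counted pooled connection sketch (hypothetical types and names).
package connpool

import (
	"sync"
	"time"
)

type trackedConn struct {
	mu       sync.Mutex
	streams  int       // live exec/fs streams multiplexed over this connection
	lastUsed time.Time // when the last stream was released
}

func (c *trackedConn) acquireStream() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.streams++
}

func (c *trackedConn) releaseStream() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.streams--
	c.lastUsed = time.Now()
}

// reapable reports whether the reaper may close this connection: only when
// no streams reference it and it has been idle longer than maxIdle. Per the
// commit message, the pre-fix reaper ignored active streams, so when a
// client's heartbeat moved to another server, a connection still carrying a
// live exec/fs session could be closed mid-session.
func (c *trackedConn) reapable(maxIdle time.Duration) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.streams == 0 && time.Since(c.lastUsed) > maxIdle
}
```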
@the-maldridge
Author

Clever fix, thanks for taking a look!

notnoop pushed a commit that referenced this issue Jun 9, 2021
notnoop pushed a commit that referenced this issue Jun 9, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022