
Alloc Exec spontaneously disconnects after some time #10579

Closed
the-maldridge opened this issue May 13, 2021 · 13 comments · Fixed by #10710

Comments

@the-maldridge

Nomad version

$ nomad version
Nomad v1.0.4 (a480eed0815c54612856d9115a34bb1d1a773e8c)

Operating system and Environment details

Observed with Ubuntu and Void Linux clients, with macOS clients, and with Alpine, Void, and ResinStack servers. All x64; tested with diversity in network path, underlying provider, and Nomad versions at both ends.

Issue

When using the alloc exec command the shell connects as expected and interactive commands work great. After some time the session spontaneously disconnects. Sometimes it disconnects with an error such as failed to exec into task: disconnected without receiving the exit code. Sometimes it disconnects with no error printed at all.

Reproduction steps

  1. Launch a container of your choosing that contains an interactive shell. Busybox with the entrypoint set to ["/bin/sleep", "3600"] works well for this.
  2. Exec into the alloc with -i -t options.
  3. Wait.

After some time you will be disconnected.

Expected Result

I expect that unless an external and intentional factor is at play, the session should remain connected. Examples of external factors include but are not limited to: network path disruption, token expiry, user explicitly disconnecting the session.

Actual Result

The interactive exec is disconnected without warning and with no discernible cause.

So far I have not found anything in the logs and the job being exec'd into seems to have no effect on how long the session survives. I think this may be some context internal to Nomad not being renewed, but I have not traced the execution yet.

@ld100

ld100 commented May 13, 2021

I'm having the same issue on Nomad 1.0.2. Each time the session lasts between 2 and 15 minutes before the disconnect. Every time I get the "failed to exec into task: disconnected without receiving the exit code" error on disconnect.

@johnalotoski

Also experiencing this intermittently for the past several months on various versions of Nomad, currently 1.0.4.
Today I'm trying to time a long-running command and this is interfering with that. Thanks for filing!

@apollo13
Contributor

Ha, I also noticed this and thought my haproxy configuration was at fault. But it seems Nomad alone also has the problem -- knowing that would have saved me a few hours of debugging :D

@galeep galeep changed the title Alloc Exec spontanesouly disconnects after some time Alloc Exec spontaneously disconnects after some time May 19, 2021
@tgross
Member

tgross commented May 19, 2021

Hi folks! Looks like that error is bubbling up from (*Allocations).Exec in the api package. The channel we're polling on is getting closed, so we'll need to hunt down why. Probably either there's an undocumented timeout or there's a keep-alive missing for the websocket (I know we had to do something like that for nomad monitor, but I'm not sure how generalized that was).
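
For context, the kind of websocket keep-alive in question looks roughly like this. This is only a sketch assuming the gorilla/websocket package; the names and intervals are hypothetical, not Nomad's actual code. The idea is to ping on an interval and extend the read deadline whenever a pong comes back, so an idle-but-healthy session is not torn down.

```go
// Minimal keep-alive sketch (hypothetical names, assumes gorilla/websocket).
package execstream

import (
	"time"

	"github.com/gorilla/websocket"
)

const (
	pingInterval = 10 * time.Second // hypothetical; mirrors the CLI's 10s heartbeat
	pongWait     = 30 * time.Second // hypothetical grace period before giving up on the peer
)

// keepAlive pings conn until stop is closed and reports the first write
// failure on errCh instead of silently dropping it.
func keepAlive(conn *websocket.Conn, stop <-chan struct{}, errCh chan<- error) {
	conn.SetReadDeadline(time.Now().Add(pongWait))
	conn.SetPongHandler(func(string) error {
		// Each pong proves the peer is alive; push the read deadline out.
		return conn.SetReadDeadline(time.Now().Add(pongWait))
	})

	ticker := time.NewTicker(pingInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			deadline := time.Now().Add(5 * time.Second)
			if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
				errCh <- err // surface the failure rather than discarding it
				return
			}
		case <-stop:
			return
		}
	}
}
```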

I'm going to dig into this with my colleagues @isabeldepapel and @jsosulska tomorrow.

@tgross tgross self-assigned this May 19, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage May 19, 2021
@the-maldridge
Author

Fantastic news, it's great to know that this isn't some heisenbug as it originally appeared to be.

@tgross
Member

tgross commented May 20, 2021

Ok, so we spent a little time pairing on this and there are a few things we've come up with:

  • That disconnected without receiving the exit code error message path can only be hit if we get a "is closed" error from the websocket library api/allocations.go#L257-L258. That can be any of a large number of websocket messages, but importantly they're all protocol messages. Unfortunately we throw away the error rather than letting the CLI user know what happened, which is lousy behavior and we should fix that. So that's bug 1 to fix.
  • The CLI side does heartbeat every 10 sec. But unfortunately the sender function returns an error that we always throw away rather than reporting to the user (see the sketch after this list). So it's entirely possible for the heartbeats to fail, and that would lead to the server eventually disconnecting you. From a design standpoint, it's a little messy -- nomad alloc exec is supposed to be streaming the output from the remote process, so if we're interleaving that with control errors, that's going to mess up the usefulness of the command (and would break backwards compat). It'd probably be better to log missed heartbeats on the server side instead.
  • The server-side decoding of messages in alloc_endpoint.go#L525-L531 is what could be sending the error. The isClosedError function it calls is a little interesting in that it tries to parse the error on the msgpack response and detects the string literals "closed" and "EOF". Next we want to dig into where those msgpack errors are being set so we can make sure it's not possible for the process output to accidentally write control messages.
  • I also just want to set up a quickie tcpdump and leave it running so we can see if we can capture one of these events on the network. That'd help narrow down the source.
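
As a sketch of what surfacing the heartbeat error from the second bullet could look like: the names below are assumptions (sendFrame stands in for whatever function writes a frame to the websocket), not the actual api-package code. The point is simply to hand the failure back to the caller so it can be logged or turned into a clean teardown instead of being thrown away.

```go
// Heartbeat loop sketch with hypothetical names; not Nomad's actual code.
package execstream

import "time"

// sendFrame is a placeholder for the real frame-writing function.
type sendFrame func(payload []byte) error

// heartbeat writes an empty {} frame every interval. The first failure is
// reported on the returned channel and the loop stops, letting the caller
// decide whether to log it, retry, or end the session deliberately.
func heartbeat(send sendFrame, interval time.Duration, stop <-chan struct{}) <-chan error {
	errCh := make(chan error, 1)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				if err := send([]byte("{}")); err != nil {
					errCh <- err
					return
				}
			case <-stop:
				return
			}
		}
	}()
	return errCh
}
```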

@tgross
Member

tgross commented May 21, 2021

I've opened #10638 to make the problem a bit more debuggable. I ran tcpdump for an hour or so and I see the {}\n heartbeat message every 10s as expected. Getting to the end of my week here but I'll see about trying to set up a longer-running test of it (and across a more realistic network) to see if that gets us any more info.

@tgross tgross assigned notnoop and unassigned tgross May 24, 2021
@ld100

ld100 commented Jun 3, 2021

Any updates on the issue? Can't understand it from #10638

@tgross
Member

tgross commented Jun 3, 2021

I believe this was fixed by @notnoop in #10657 and will ship in Nomad 1.1.1. @notnoop can we get the changelog entry added to 1.1.1 and the 1.0.7 backport?

@tgross tgross closed this as completed Jun 3, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 3, 2021
@tgross tgross added this to the 1.1.1 milestone Jun 3, 2021
@notnoop notnoop reopened this Jun 3, 2021
Nomad - Community Issues Triage automation moved this from Done to Needs Triage Jun 3, 2021
@notnoop
Contributor

notnoop commented Jun 3, 2021

I'm afraid we still need to dig into this issue more. #10657 ensures that the CLI reports the last few events (e.g. final output, exit code); it addresses the case where a completed command results in an "unexpected EOF" error.

We aren't certain what the cause of the unexpected disconnects is here. #10657 may highlight the cause more, though the issue might be a bug in the RPC forwarding logic, or the 10s heartbeat period might be too long in some environments. I haven't fully reproduced it, but I'll test it more this week.

For added context, I have a WIP branch that logs all exec messages and events at https://github.com/hashicorp/nomad/compare/b-debug-exec-2. Hopefully that will highlight the issue, but I haven't caught it yet.

@notnoop
Contributor

notnoop commented Jun 7, 2021

So I finally have a fix for this in #10710! Thank you @the-maldridge for reporting this issue. It's very unfortunate that we did not notice this issue internally. Without your GitHub reports, this issue would have kept haunting more users unnecessarily!

@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 7, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 7, 2021
notnoop pushed a commit that referenced this issue Jun 7, 2021
Track usage of incoming streams on a connection. Connections without
reference counts get marked as unused and reaped in a periodic job.

This fixes a bug where `alloc exec` and `alloc fs` sessions get terminated
unexpectedly. Previously, when a client's heartbeat switches between
servers, the pool connection reaper eventually identifies the connection
as unused and closes it even if it has active exec/fs sessions.

Fixes #10579
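
A rough sketch of the reference-counting idea in that commit message follows; the types and names are assumptions for illustration, not the actual change in #10710. Each incoming exec/fs stream bumps a counter on its pooled connection, and the periodic reaper only closes connections whose counter is zero and which have been idle for a while.

```go
// Reference-counted pooled connection sketch (hypothetical types and names).
package connpool

import (
	"sync"
	"time"
)

type trackedConn struct {
	mu       sync.Mutex
	streams  int       // live exec/fs streams multiplexed over this connection
	lastUsed time.Time // when the last stream was released
}

func (c *trackedConn) acquireStream() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.streams++
}

func (c *trackedConn) releaseStream() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.streams--
	c.lastUsed = time.Now()
}

// reapable reports whether the reaper may close this connection: only when
// no streams reference it and it has been idle longer than maxIdle. Per the
// commit message, the pre-fix reaper ignored active streams, so when a
// client's heartbeat moved to another server, a connection still carrying a
// live exec/fs session could be closed mid-session.
func (c *trackedConn) reapable(maxIdle time.Duration) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.streams == 0 && time.Since(c.lastUsed) > maxIdle
}
```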
@the-maldridge
Author

Clever fix, thanks for taking a look!

notnoop pushed a commit that referenced this issue Jun 9, 2021
notnoop pushed a commit that referenced this issue Jun 9, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022