
Nomad API "Unexpected EOF" Errors #12273

Open
evandam opened this issue Mar 14, 2022 · 16 comments

@evandam commented Mar 14, 2022

Nomad version

Nomad v1.2.6 (a6c6b475db5073e33885377b4a5c733e1161020c)

I've noticed this issue going back a few versions; I'm unsure of when it began, but possibly before 1.0.0.

Operating system and Environment details

Ubuntu 18.04

Issue

We sporadically see API calls on Nomad clients start to fail with "Unexpected EOF" errors.

Reproduction steps

Unclear, seemingly random on Nomad clients.

Expected Result

Nomad API handles requests correctly

Actual Result

$ nomad server members
Error querying servers: Get "http://127.0.0.1:4646/v1/agent/members": EOF
$ curl localhost:4646/v1/status/leader
curl: (56) Recv failure: Connection reset by peer

Job file (if appropriate)

N/A

Nomad Server logs (if appropriate)

There's nothing that jumps out in the server logs, but here is the last hour of logs before an instance of this error. Note that the "permission denied" errors can be safely ignored; they come from checks we run to verify that ACL permissions are in place.

https://gist.github.com/evandam/1d1f2dd2a5032426736107f414959e2d

Nomad Client logs (if appropriate)

N/A

I know it's not much to go on, but let me know if there's any additional detail I can provide. I'll typically see this occurring on a host until I restart the Nomad service.

@DerekStrickland DerekStrickland self-assigned this Mar 14, 2022
@DerekStrickland DerekStrickland added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 14, 2022
@DerekStrickland DerekStrickland moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Mar 14, 2022
@DerekStrickland (Contributor) commented

Hi @evandam

I'm sorry you are having issues. I'll take a look at the log and see if anything stands out. I'm also curious about something: when this happens, have you ever looked at the servers/clients to see how they are doing at the OS level? Specifically, I am wondering if they are experiencing any memory, CPU, or network pressure.

@DerekStrickland (Contributor) commented

Also, do you have any client logs showing the error, or are you only experiencing it from the command line?

@evandam (Author) commented Mar 14, 2022

Hey @DerekStrickland, I'm adding a couple of screenshots from our monitoring; nothing looks unusual in terms of system activity. There was a small spike from ~20% to ~45% CPU, but I believe that was a minute or so after the first EOF error, so it's probably not related. Memory and network pressure seem fine too.

There aren't any client logs; we're really just seeing it on the command line/curl. For what it's worth, I've seen this happening on our Nomad client nodes but never on a server (it could be coincidence, or related to different system loads, etc.), but it seemed worth pointing out.

(Screenshots attached: Screen Shot 2022-03-14 at 10 23 30 AM, 10 24 04 AM, and 10 24 25 AM.)

@evandam (Author) commented Mar 14, 2022

Not sure if it helps, but here are the logs after seeing this issue and running kill -s SIGABRT <nomad pid>:

https://gist.github.com/evandam/69e7dbfa467932831c5cc3460fbe5a79

It helped with a vaguely similar issue in hashicorp/nomad-autoscaler#514 (comment).
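
For anyone else gathering the same data: Nomad is a Go binary, and the Go runtime responds to SIGABRT by printing a stack trace for every goroutine before the process exits, which is what the gists in this thread contain. A minimal sketch for capturing one, assuming the agent runs as a systemd unit named "nomad":

# Send SIGABRT to the running agent; the Go runtime dumps all goroutine
# stacks to stderr before the process exits.
sudo kill -s SIGABRT "$(pidof nomad)"

# Under systemd, stderr goes to the journal; save the recent output to a
# file that can be attached to the issue.
journalctl -u nomad --since "5 minutes ago" > nomad-sigabrt.txt

Note that systemd will only restart the unit afterwards if Restart= is configured for it.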

@DerekStrickland (Contributor) commented

Hi @evandam

This feels suspiciously like the same problem as Autoscaler issue #519. Thanks to @lgfa29 for spotting this and bringing it to my attention. We'll keep debugging on our end.

@evandam (Author) commented Mar 15, 2022

Sounds good! Just let me know if there's any more info I can provide that would be useful.

@DerekStrickland (Contributor) commented

Thanks for the offer. I think at this point I just need to find time to spin up a reproduction test. The working theory is that it can be caused either by a rapid succession of blocking query requests or by a large number of blocking queries arriving at the same time. If you have a fast way to repro that situation that you can share, that would be awesome 😄 Otherwise, it's just on me to get it into the work queue.

🤞 that the theory holds.
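
For context on what a blocking query is here: Nomad's HTTP API supports long-polling via the index and wait query parameters, and the server holds the connection open until the given raft index changes or the wait time elapses. A single one can be issued with curl (a minimal sketch; the index value is illustrative):

$ curl 'http://127.0.0.1:4646/v1/jobs?index=23&wait=5m'

The theory above concerns many of these held-open requests hitting the agent at once.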

@DerekStrickland (Contributor) commented

No solution yet, but I did want to share that I have a reproducer. I can't promise it is 100% the same as your issue, but the symptom at least seems the same.

@ahmedwonolo commented

We experienced the same issue again on another node, and this time curl to the agent API fails consistently:

$ curl 127.0.0.1:4646/v1/agent/self
curl: (56) Recv failure: Connection reset by peer

Here's the SIGABRT output: https://gist.github.com/ahmedwonolo/63ffe1654d2be5024a0accbf25aa08a9
(I work with @evandam.)

@spegoraro commented Apr 15, 2022

Not sure if this adds any value, but I'm seeing the same thing with requests from the Nomad Autoscaler in #519 as well. It's a mix of EOF and connection reset by peer errors:

(I've removed the IP addresses and node IDs.)

failed to drain node: Put "https://[nomad-server]:4646/v1/node/[node-id]/drain?namespace=default&region=global": EOF

or

received error while draining node: Error monitoring node: Get "https://[nomad-server]:4646/v1/node/[node-id]?index=2569292&namespace=default&region=global&stale=": read tcp 172.17.0.2:58740->[host-ip]:4646: read: connection reset by peer

Interestingly, I get the little ® character in all of my logs as well, although I suspect it's really the "&reg" in "&region=global" being rendered as an HTML entity.

Seems related to #8718

@DerekStrickland (Contributor) commented

Thanks for the added information @spegoraro!

@gbolo commented Apr 22, 2022

I can confirm seeing this behaviour quite often as well. @DerekStrickland, has anything come up from the investigation thus far?

@DerekStrickland (Contributor) commented

@gbolo I'm just wrapping up the work on the Nomad Disconnected Clients feature for 1.3. I'm hoping that by late next week I'll be freed up to start looking at backlog issues, and I've got this at the top of the list of things I'd like to investigate further. I can recreate it easily, so that's a good start. I apologize for the delay, but I will definitely update the issue as I progress or if I have any questions.

@lgfa29 lgfa29 added theme/api HTTP API and SDK issues stage/needs-investigation labels Apr 27, 2022
@gbolo commented May 17, 2022

> I can recreate it easily, so that's a good start.

Hi @DerekStrickland, can you share exactly how you are reproducing this? Thanks for looking into it.

@DerekStrickland (Contributor) commented

Hi @gbolo. For sure, here's what I did.

Using the k6 tool, I can trigger an EOF error with the following. Note that I have a hard-coded IP address; your server IP is likely different.

• Create eof.js with the following content:

import { check } from "k6";
import http from "k6/http";

export default function () {
  let res = http.get("http://192.168.56.11:4646/v1/jobs?index=23&wait=5m");
  check(res, {
    "is status 200": (r) => r.status === 200,
  });
}

• Then, from the same directory where the file exists, run this command:

k6 run -u 200 -d 20s eof.js

• This very quickly results in the following error, repeatedly:

WARN[0005] Request Failed    error="Get \"http://192.168.56.11:4646/v1/jobs?index=23&wait=5m\": EOF"
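
For anyone who wants to try this without installing k6, a rough shell equivalent of the same load shape (an untested sketch; it just opens ~200 concurrent blocking queries against the same endpoint, so adjust the address to your setup):

# Open ~200 concurrent blocking queries, then wait for them all to return.
for i in $(seq 1 200); do
  curl -s -o /dev/null 'http://192.168.56.11:4646/v1/jobs?index=23&wait=5m' &
done
wait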

@vincenthuynh commented

Hello,
We've also encountered this issue, as described in the community forum.

I was also able to reproduce the error by running Nomad locally and using the k6 tool:

time="2022-05-31T13:32:15Z" level=warning msg="Request Failed" error="Get \"http://127.0.0.1:4646/ui/clients\": EOF"
time="2022-05-31T13:32:15Z" level=warning msg="Request Failed" error="Get \"http://127.0.0.1:4646/ui/clients\": read tcp 127.0.0.1:43642->127.0.0.1:4646: read: connection reset by peer"
time="2022-05-31T13:32:15Z" level=warning msg="Request Failed" error="Get \"http://127.0.0.1:4646/ui/clients\": read tcp 127.0.0.1:43640->127.0.0.1:4646: read: connection reset by peer"
