
Attempt to reconnect long-running CLI commands in case of network timeout #17320

Open · josh-m-sharpe opened this issue May 25, 2023 · 8 comments

Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/cli, type/enhancement

Comments

josh-m-sharpe (Author):

This Feature Request makes "Error fetching deployment" seem like a minor nuisance.

After updating our cluster from 1.3.x to 1.5.6, I see this error every single time I run nomad run job ...; it seems like a regression at this point.

josh-m-sharpe (Author) commented May 25, 2023

Not sure if this is related, but I found these in the logs:

May 25 14:22:26 ip-10-20-21-164.us-west-2.compute.internal nomad[2608]: 2023-05-25T14:22:26.701Z [ERROR] worker: failed to dequeue evaluation: worker_id=1fe7b569-f86f-8524-f4cd-ee111b9e87b6 error="rpc error: No cluster leader"
May 25 14:22:26 ip-10-20-21-164.us-west-2.compute.internal nomad[2608]: 2023-05-25T14:22:26.701Z [ERROR] worker: failed to dequeue evaluation: worker_id=66802c36-9b61-59dc-19ec-cfd3e1c49a01 error="rpc error: No cluster leader"

which feels untrue, because every time I look at the 'Servers' web UI or run nomad server members I see that a leader has been elected

lgfa29 (Contributor) commented May 29, 2023

Hi @josh-m-sharpe 👋

From what I can tell there hasn't been any significant change in this part of the code between 1.3.x and 1.5.6, so if you're seeing an increase in this class of error I suspect that something else may be happening.

Unfortunately the CLI was omitting the actual error received, so I opened #17348 to output more information.

The No cluster leader error you reported may be related to flappy leadership, and if a deployment is being monitored while leadership changes, the Error fetching deployment error is expected. This page details some metrics you may want to look into to determine any leadership problems.

A retry mechanism for deployment monitoring would definitely be handy, and that is covered in #12062, so I'm going to close this as a duplicate. I recommend giving that issue a 👍 to help us with roadmapping, and following it for further updates.

Feel free to open a new issue if you detect any further problem regarding unstable leadership.

Thank you for the report!

josh-m-sharpe (Author) commented:

Hey @lgfa29, thanks for the response. I have a bit more to add here, but it's somewhat anecdotal.

I opened this issue when I encountered problems with nomad run job, but I was also fiddling with restart -reschedule for other use cases, and occasionally those executions failed with a 504 Gateway Timeout error (I don't have a screenshot or output). I'd run restart -reschedule, it would run for a bit, then die and output 5-6 lines of error messages showing that response code.

(To be clear, this is NOT the same thing I reported in #17329, even though I opened all these things around the same time. I've been doing a lot of Nomad hacking 😄)

At no point did I see any evidence of that 504 error in my Nomad server logs, which makes sense since it was a gateway timeout. That pointed me to the AWS Application Load Balancer I had deployed in front of my Nomad servers. The ALB had a (default) idle timeout of 60 seconds.

I replaced it with an AWS Network Load Balancer, which has a default timeout of 350 seconds. After I made this change, the issue appears to have gone away.

This does mean my issue is largely resolved. However, it does signal to me that something between 1.3.x and 1.5.6 started taking longer than 60 seconds to respond, which is a heck of a lot of time.

Anyways, sorry I don't have any more hard evidence, just wanted to convey what I figured out. Cheers!

josh-m-sharpe (Author) commented:

Now that I think about it more, I wish I knew if the restart -reschedule died right around 60 seconds. I want to say it didn't take that long but maybe it did. Is it possible the CLI is/was opening a connection and holding it open while it polls?

lgfa29 (Contributor) commented May 29, 2023

This does mean my issue is largely resolved. However, it does signal to me that something between 1.3.x and 1.5.6 started taking longer than 60 seconds to respond, which is a heck of a lot of time.

Hmm... that's interesting. I can't think of any change in this regard, and the Nomad API should be using a keep-alive timeout of 30s to keep the connection open.

daa9824 switched the api client (which the Nomad CLI uses) to use pooled connections, but I think this was also the case in 1.3.x.
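
For illustration, this is roughly what pooled, keep-alived transport settings look like with Go's net/http; the specific values and the health-check URL are placeholders, not necessarily what the api client actually configures:

package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// Pooled transport: idle connections are kept alive and reused across
	// requests instead of being re-dialed for every poll. The 30s value
	// mirrors the keep-alive interval mentioned above; the rest are
	// illustrative defaults.
	transport := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: 30 * time.Second, // TCP keep-alive probes on idle pooled connections
		}).DialContext,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	}
	client := &http.Client{Transport: transport}

	// Reusing this one client means repeated requests ride the same TCP
	// connection, which is exactly what an idle timeout on a load balancer
	// in between can cut short.
	resp, err := client.Get("http://127.0.0.1:4646/v1/agent/health") // assumed local Nomad address
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}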

Is it possible the CLI is/was opening a connection and holding it open while it polls?

Yes, the CLI reuses the same connection; there's a bit more info here:

nomad/api/api.go

Lines 489 to 496 in 087ac3a

// Close closes the client's idle keep-alived connections. The default
// client configuration uses keep-alive to maintain connections and
// you should instantiate a single Client and reuse it for all
// requests from the same host. Connections will be closed
// automatically once the client is garbage collected. If you are
// creating multiple clients on the same host (for example, for
// testing), it may be useful to call Close() to avoid hitting
// connection limits.

Maybe we could try to create a new connection in case of a network timeout?
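
Something along these lines, as a rough sketch rather than actual Nomad code (fetchDeploymentWithRetry and its retry/backoff choices are made up for illustration):

package retrysketch

import (
	"errors"
	"net"
	"strings"
	"time"

	"github.com/hashicorp/nomad/api"
)

// fetchDeploymentWithRetry builds a fresh client (and therefore a fresh
// connection) and tries again when a poll fails with a timeout-shaped error.
func fetchDeploymentWithRetry(cfg *api.Config, deploymentID string, attempts int) (*api.Deployment, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		client, err := api.NewClient(cfg)
		if err != nil {
			return nil, err
		}
		dep, _, err := client.Deployments().Info(deploymentID, nil)
		client.Close() // drop the keep-alived connection before any retry
		if err == nil {
			return dep, nil
		}
		lastErr = err

		// Only retry failures that look like timeouts: a net.Error that timed
		// out, or a gateway timeout bounced back by a load balancer in front
		// of the servers.
		var netErr net.Error
		timedOut := errors.As(err, &netErr) && netErr.Timeout()
		if !timedOut && !strings.Contains(err.Error(), "504") {
			return nil, err
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return nil, lastErr
}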

I think I will reword the title for this issue and keep it open for us to further investigate this possibility, thanks for the extra info!

lgfa29 reopened this on May 29, 2023
lgfa29 changed the title from '"Error fetching deployment" no longer a minor issue' to 'Attempt to reconnect long-running CLI commands in case of network timeout' on May 29, 2023
lgfa29 (Contributor) commented May 29, 2023

I've been doing a lot of Nomad hacking

I forgot to mention in the previous message, but I'm also curious about this 😄

lgfa29 added the stage/accepted label (Confirmed, and intend to work on. No timeline commitment though.) and removed the stage/duplicate label on May 30, 2023
lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation on May 30, 2023
lgfa29 moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage on May 30, 2023
kaspergrubbe commented:

We've recently upgraded from an old 1.0.18 deployment to a much newer (and upgraded) 1.7.2 cluster, and now we're seeing 504 issues too.

We're behind two load balancers: an AWS ELB and an HAProxy running within the Nomad cluster, in front of the Nomad APIs.

This is what the CLI spews out in our CI/CD pipeline:

2024-01-03T13:14:08+01:00
ID          = ******
Job ID      = ourapp-production-web
Job Version = ****
Status      = running
Description = Deployment is running pending automatic promotion
Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
puma        false     10       4         4       0        0          2024-01-03T12:29:08Z
==> 2024-01-03T13:14:58+01:00: Error fetching deployment: Unexpected response code: 504 (<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>)

I think ELBs have a timeout of 60 seconds, while our HAProxy has a default of 50 seconds. Maybe we should use the HTTP API directly instead of the Nomad CLI in these cases?
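
For anyone wanting to try that, a minimal sketch of polling the deployment HTTP API directly with short, independent requests (the address and deployment ID below are placeholders):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Assumed values: the Nomad API address behind the load balancers and a
	// placeholder deployment ID (e.g. taken from the job submission output).
	addr := "http://nomad.example.internal:4646"
	deploymentID := "00000000-0000-0000-0000-000000000000"

	for {
		// Each poll is a short, independent request, so it stays well under
		// the 50-60 second idle timeouts of the ELB/HAProxy in front of the API.
		resp, err := http.Get(addr + "/v1/deployment/" + deploymentID)
		if err != nil {
			log.Fatal(err)
		}
		var dep struct {
			Status            string
			StatusDescription string
		}
		err = json.NewDecoder(resp.Body).Decode(&dep)
		resp.Body.Close()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(dep.Status, "-", dep.StatusDescription)
		if dep.Status != "running" {
			return
		}
		time.Sleep(5 * time.Second)
	}
}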

Blefish commented Jan 3, 2024

I'm also running into this while exploring using the Nomad CLI to perform some application deployments via CI.

Previously I was monitoring deployment status using Ansible and hitting the HTTP API every x seconds, but the Nomad CLI is much more useful in terms of deployment status and visibility.
