Envoy upstream connection failure due to 503 upstream_reset_before_response_started{local reset} LR #15968

Closed
madhu-shankar opened this issue Apr 14, 2021 · 7 comments
Labels: area/http, question, stale

Comments

@madhu-shankar

Title: Envoy is failing to connect to upstream cluster

Description:

We use Original Destination routing to route requests to the upstream cluster. We intermittently see 503s returned by Envoy. Checking the response flags and response details in the access logs, we found that the 503s are sent because Envoy fails to connect to the upstream host. This is a sample access log entry: 503 upstream_reset_before_response_started{local reset} LR

To give some context on the upstream cluster: it periodically closes connections and sends a connection: close header.

  1. What is the underlying cause of the 503 response from Envoy? Is Envoy failing to handle connection: close, leaving it with stale connections to the upstream?

  2. We tried fixing it by configuring retry on connection reset using the below config,

retry_policy:
   retry_on: reset

With one retry configured, the number of 503 responses has dropped, but some upstream connections still fail with the URX flag. Do you suggest any other configuration, or any tuning of this one, to completely eliminate the 503 responses?

Also, we saw different response details for the same scenario in Envoy-1.13.1 and Envoy-1.15.2.

  • Envoy-1.13.1: 503 upstream_reset_before_response_started{connection failure} UF
  • Envoy-1.15.2: 503 upstream_reset_before_response_started{local reset} LR

Can you confirm whether this behavior changed as of 1.15.0? We found an issue that mentions this change.

Relevant Links:

#14394.
Response flags in access logging

madhu-shankar added the triage label Apr 14, 2021
madhu-shankar changed the title from "Envoy upstream connection failure" to "Envoy upstream connection failure due to 503 upstream_reset_before_response_started{local reset} LR" Apr 14, 2021
@yanavlasov
Contributor

Is it possible that the upstream server sends "connection: close" on connections that Envoy has already used to start a request?

@madhu-shankar
Author

The upstream sends connection: close in the response after serving a fixed number of requests on each connection. If your question is whether the upstream sends connection: close on a new request without giving a normal response, then no.

@dozer47528

dozer47528 commented Apr 16, 2021

Same problem.
I have an unreachable IP in a cluster, and Envoy 1.15 won't retry.

Here is the envoy log:

2021-04-16T08:42:13.409647Z     debug   envoy pool      queueing request due to no available connections
2021-04-16T08:42:13.409654Z     debug   envoy pool      creating a new connection
2021-04-16T08:42:13.409696Z     debug   envoy client    [C4622] connecting
2021-04-16T08:42:13.409706Z     debug   envoy connection        [C4622] connecting to 172.17.0.10:8080
2021-04-16T08:42:13.409822Z     debug   envoy connection        [C4622] connection in progress
2021-04-16T08:42:13.417058Z     debug   envoy pool      [C4622] connect timeout
2021-04-16T08:42:13.417077Z     debug   envoy connection        [C4622] closing data_to_write=0 type=1
2021-04-16T08:42:13.417081Z     debug   envoy connection        [C4622] closing socket: 1
2021-04-16T08:42:13.417102Z     debug   envoy client    [C4622] disconnect. resetting 0 pending requests
2021-04-16T08:42:13.417110Z     debug   envoy pool      [C4622] client disconnected, failure reason:
2021-04-16T08:42:13.417122Z     debug   envoy router    [C4621][S6415775258818097972] upstream reset: reset reason local reset
2021-04-16T08:42:13.417169Z     debug   envoy http      [C4621][S6415775258818097972] Sending local reply with details upstream_reset_before_response_started{local reset}
2021-04-16T08:42:13.417224Z     debug   envoy http      [C4621][S6415775258818097972] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '84'
'content-type', 'text/plain'
'date', 'Fri, 16 Apr 2021 08:42:13 GMT'
'server', 'envoy'

The IP 172.17.0.10 used to belong to a pod that has already terminated.

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label May 16, 2021
@yanavlasov
Contributor

Sorry for not responding earlier. I got confused by your mention of the server sending connection: close. I do not think it is related. It looks like your backend takes time to become ready and start accepting connections, which leaves Envoy unable to connect to it.
You can increase the number of retries. Adding health checks may also address this issue if you have multiple backends.
Adding @snowp for any additional suggestions.
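
For reference, a retry policy with more attempts might look like the sketch below. It extends the retry_policy from the original post. The num_retries and per_try_timeout values are illustrative assumptions, not recommendations made in this thread, and connect-failure is added on the assumption that connection failures such as the connect timeout in the log above should be retried as well.

```yaml
# Sketch only: route-level retry policy; values are assumptions to tune per deployment.
retry_policy:
  # "reset" retries when the upstream does not respond at all (disconnect/reset);
  # "connect-failure" retries when the connection to the upstream cannot be established.
  retry_on: "connect-failure,reset"
  # Envoy defaults to a single retry; 3 is an arbitrary illustrative value.
  num_retries: 3
  # Optional upper bound on each individual attempt, including the first one.
  per_try_timeout: 2s
```

Retries only help if a later attempt can succeed. If every attempt targets the same unreachable address, as can happen with original destination routing or a stale endpoint, more retries will not remove the 503s.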

github-actions bot removed the stale label May 21, 2021
alyssawilk added the area/http and question labels and removed the triage label May 24, 2021
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Jun 23, 2021
@github-actions

github-actions bot commented Jul 1, 2021

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions bot closed this as completed Jul 1, 2021