refactor: check if retries are required on some errors #3471

Merged: 2 commits merged into main from actions-updates on Jul 17, 2023

@shreddedbacon shreddedbacon commented Jun 28, 2023

Checklist

  • Affected Issues have been mentioned in the Closing issues section
  • Documentation has been written/updated
  • PR title is ready for inclusion in changelog

If the lagoon-api is unresponsive, the actions-handler is sometimes unable to apply updates. This PR adds error checking so that when a connection reset is hit (API down), the handler flags an error and the message remains in the queue to be retried.

An example of a connection reset is below. Before this change, the message was dropped when this error was hit, so the status of this deployment may not have been updated correctly.

(messageid:pqt3csxj) solr-service-qa/lagoon-build-do90w: ERROR: unable to get deployment - Post "http://lagoon-core-api:80/graphql": read tcp 10.204.12.70:49118->10.208.4.225:80: read: connection reset by peer

@shreddedbacon shreddedbacon marked this pull request as ready for review June 28, 2023 04:33
@shreddedbacon shreddedbacon added this to the 2.16.x milestone Jun 28, 2023
@shreddedbacon shreddedbacon modified the milestones: 2.16.x, 2.15.3 Jul 10, 2023
@shreddedbacon shreddedbacon requested a review from bomoko July 10, 2023 22:28

@bomoko bomoko left a comment

This makes sense and looks good.

Just one thing, I wonder if there could be a case where this becomes a choke point?

@shreddedbacon

Just one thing, I wonder if there could be a case where this becomes a choke point?

It is possible, but only if the errors returned are consistently "connection reset by peer" errors. From analysing logs, that only happens during scaling events where the API pod becomes unavailable mid-request, which is basically the case this protection covers.

bomoko commented Jul 11, 2023

sweet, happy happy - LGTM and all that jazz.

@tobybellwood tobybellwood merged commit c556318 into main Jul 17, 2023
@tobybellwood tobybellwood deleted the actions-updates branch July 17, 2023 05:07