intermittent 404 Not Found on /pulls github calls #1019

Closed
tedder opened this issue Apr 30, 2020 · 13 comments · Fixed by #1131
Labels
bug Something isn't working

Comments

@tedder

tedder commented Apr 30, 2020

Hi, sometimes when Atlantis is triggered on a PR in github, Atlantis posts the following error onto the PR:

Plan Error

GET https://api.github.com/repos/myorg/myrepo/pulls/1163/files?per_page=300: 404 Not Found []

Looking at github's API docs, that per_page=300 seems okay:

Note: Responses include a maximum of 3000 files. The paginated response returns 30 files per page by default.

We can replan and it works, i.e., it appears to be intermittent. Looking in the Atlantis logs, I see the following (I've removed the timestamps and redacted IPs/private info):

[INFO] server: POST /events – from xxx:51544
[INFO] server: Identified event as type "other"
[INFO] server: POST /events – respond HTTP 200
[EROR] myorg/myrepo#1163: GET https://api.github.com/repos/myorg/myrepo/pulls/1163/files?per_page=300: 404 Not Found []
[EROR] myorg/myrepo#1163: Unable to hide old comments: GET https://api.github.com/repos/myorg/myrepo/issues/1163/comments?direction=asc&sort=created: 404 Not Found []

We have a couple of theories but haven't been able to reproduce it. First, it has only happened since we updated to v12.0, the current release (we also added --hide-prev-plan-comments --disable-markdown-folding at that time).

Second, it may happen with a largeish number of directories, though generally our changed dirs are under 50 and changed files are under 100.

The third theory is that it might happen when two unrelated repos are processing at the same time. That can be seen here; I've left the timestamps so you can see the overlap. "myrepo" is the same as above, and "REPO2" is the other repo that is planning.

2020-04-30T10:29:45.155000 [INFO] myorg/myrepo#1163: Creating dir "/home/atlantis/.atlantis/repos/myorg/myrepo/1163/default"
2020-04-30T10:29:45.252000 [INFO] myorg/REPO2#276: Creating dir "/home/atlantis/.atlantis/repos/myorg/REPO2/276/default"
2020-04-30T10:29:45.961000 [INFO] myorg/REPO2#276: Successfully parsed atlantis.yaml file
2020-04-30T10:29:45.963000 [INFO] myorg/REPO2#276: 1 projects are to be planned based on their when_modified config
2020-04-30T10:29:46.152000 [INFO] myorg/myrepo#1163: Successfully parsed atlantis.yaml file
2020-04-30T10:29:46.156000 [INFO] myorg/myrepo#1163: 13 projects are to be planned based on their when_modified config
2020-04-30T10:29:46.157000 [INFO] myorg/myrepo#1163: Acquired lock with id "myorg/myrepo/xxx"
2020-04-30T10:29:46.457000 [INFO] myorg/REPO2#276: Acquired lock with id "myorg/REPO2/./yyy"
2020-04-30T10:29:46.457000 [INFO] myorg/REPO2#276: Creating dir "/home/atlantis/.atlantis/repos/myorg/REPO2/276/yyyy"
@kenske

kenske commented Jul 22, 2020

This is happening to us too. It always happens when opening a PR, but then we comment atlantis plan and it works. 🤷‍♂️

@waltervargas

We are having this issue as well:

  • no multiple repos, just a single repo
  • small number of files.

@waltervargas

Are you guys implementing a retry/backoff mechanism to handle eventual consistency?

--- PR CREATION
Jul 24 11:31:25 ip-42-42-42-42 bash[5186]: 2020/07/24 11:31:25+0000 [INFO] server: Identified event as type "opened"
Jul 24 11:31:25 ip-42-42-42-42 bash[5186]: 2020/07/24 11:31:25+0000 [INFO] server: Executing autoplan
Jul 24 11:31:25 ip-42-42-42-42 bash[5186]: 2020/07/24 11:31:25+0000 [INFO] server: POST /events – respond HTTP 200
--- ERROR
Jul 24 11:31:25 ip-42-42-42-42 bash[5186]: 2020/07/24 11:31:25+0000 [EROR] owner/atlantis-repo-name#1: GET https://api.github.com/repos/owner/atlantis-repo-name/pulls/1/files?per_page=300: 404 Not Found []
--- END
Jul 24 11:31:27 ip-42-42-42-42 bash[5186]: 2020/07/24 11:31:27+0000 [INFO] server: POST /events – from 127.42.42.42:4242
Jul 24 11:31:27 ip-42-42-42-42 bash[5186]: 2020/07/24 11:31:27+0000 [INFO] server: POST /events – respond HTTP 200

lkysow added the bug label Jul 24, 2020
@lkysow
Member

lkysow commented Jul 24, 2020

Looks like we need to add a retry.

@tedder
Author

tedder commented Jul 24, 2020

FWIW we hadn't seen this for a while and started seeing it again yesterday (or the day before?). I'm sure it's a github problem, not an Atlantis problem, but Atlantis probably needs to work around it.

@rtb-recursion

We have been using Atlantis for ~1 week and we just saw this problem for the first time, in a project with ~30 terraform files and on a PR with only 1 changed file. We are using Atlantis v0.14.0.

@lkysow
Member

lkysow commented Jul 24, 2020

> FWIW we hadn't seen this for a while and started seeing it again yesterday (or the day before?). I'm sure it's a github problem, not an Atlantis problem, but Atlantis probably needs to work around it.

Yeah totally, and it shouldn't be too hard to throw some retries in there.

@sparky005
Contributor

We just started seeing messages like this on PRs. I wonder if there need to be more retries, or if exponential backoff should be implemented? Mostly commenting just to see if others who happen to stop by here are having the same issue.

@jayceebernardino

> We just started seeing messages like this on PRs. I wonder if there need to be more retries, or if exponential backoff should be implemented? Mostly commenting just to see if others who happen to stop by here are having the same issue.

We have been getting this more often as well.

@kenske

kenske commented Dec 8, 2020

After trying the version with the fix, we stopped seeing this

@benwh

benwh commented Apr 6, 2021

We're running with the fix implemented in #1131 and have still seen this issue occur, relatively often within the past week (presumably due to GitHub performance), so it seems like it might be worth implementing a different retry strategy such as exponential backoff, as suggested in that PR.

jamengual pushed a commit that referenced this issue Oct 4, 2021
* Improve github pull request call retries

Retry with fixed 1 second backoff up to 3 retries was added by #1131 to
address #1019, but the issue continued to show up (#1453).

Increase max attempts to 5 and use exponential backoff for a maximum
total retry time of (2^n - n - 1) seconds, which is roughly 30 seconds
for current max attempts n = 5.

Also move the sleep to the top of the loop so that we never sleep
without sending the request again on the last iteration.

* Fix style with gofmt -s
@virgofx

virgofx commented Nov 22, 2021

We are still observing this issue fairly consistently in new pull requests with autoplan enabled in 0.17.5.

elementalvoid added a commit to elementalvoid/atlantis that referenced this issue Jan 21, 2022
This is a follow on to resolve similar issues to runatlantis#1019.

In runatlantis#1131 retries were added to GetPullRequest. And in runatlantis#1810 a backoff
was included.

However, those only resolve one potential request at the very
beginning of a PR creation. The other request that happens early on
during auto-plan is one to ListFiles to detect the modified files. This
too can sometimes result in a 404 due to async updates on the GitHub
side.
krrrr38 pushed a commit to krrrr38/atlantis that referenced this issue Dec 16, 2022
* Improve github pull request call retries

Retry with fixed 1 second backoff up to 3 retries was added by runatlantis#1131 to
address runatlantis#1019, but the issue continued to show up (runatlantis#1453).

Increase max attempts to 5 and use exponential backoff for a maximum
total retry time of (2^n - n - 1) seconds, which is roughly 30 seconds
for current max attempts n = 5.

Also move the sleep to the top of the loop so that we never sleep
without sending the request again on the last iteration.

* Fix style with gofmt -s
@pgold30

pgold30 commented Apr 6, 2023

Same on version 0.23

10 participants