Very infrequent "failed to load image: exit status 1" errors #921
Retries is a good idea to explore, though ideally this doesn't fail :/ Logging is substantially more powerful in HEAD: -v 1 or greater will result in a stack trace being logged on failure, along with the command output if the failure's cause is executing a command. Have we experimented yet with combining the 10 images with docker save ... and then using kind load image-archive instead of kind load docker-image?
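For reference, a minimal sketch of that suggestion, with placeholder image and cluster names standing in for whatever the CI actually loads:

```bash
# Bundle several images into a single archive, then load the archive once,
# instead of running `kind load docker-image` once per image.
docker save -o images.tar app:dev sidecar:dev proxy:dev   # placeholder image names
kind load image-archive images.tar --name "${CLUSTER_NAME:-kind}"
```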
I will try the save-everything-to-one-tar approach this week and see if things improve. Thanks!
The containerd log contains several errors about failing to load the images. Maybe parallelizing 10 image loads into containerd is too much and we should cap the maximum?
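A rough sketch of what capping the parallelism could look like from the CI side, assuming the loads are driven by a shell script; the image list, cluster name, and -P value are placeholders:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder list standing in for the ~10 images the CI loads.
IMAGES=(app:dev sidecar:dev proxy:dev)
CLUSTER_NAME="${CLUSTER_NAME:-kind}"

# Cap concurrency at 2 simultaneous `kind load docker-image` invocations
# instead of loading every image at once.
printf '%s\n' "${IMAGES[@]}" |
  xargs -I {} -P 2 kind load docker-image {} --name "${CLUSTER_NAME}"
```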
Seems plausible, I'll drop the parallelism a bit and see what happens.
It's also possible we may have picked up a containerd fix with the new containerd infra; we should have much more recent containerd builds going forward (e.g. in HEAD of kind we're on the latest stable release + backports).
I switched to loading images one at a time instead of in parallel, and at the same time our testing load roughly doubled due to an incoming release. Load failures seem about the same, if not a little worse. Sounds like the next step would be to try a newer version of kind. I was planning to wait for v0.6.0; would you suggest we just switch to master now?
If you pin a particular commit, master as of this moment is probably a good choice. I'm intending to get v0.6.0 out soon-ish, but possibly not fast enough to resolve this. How horrendous would it be to add a retry? It would flake to being slower, but possibly succeed instead of totally failing?
Considering a similar trade-off for #949.
Retry is a good option, seems like a worthwhile tradeoff. I'll try that out and, if I see it again, update to some commit on master. Thanks!
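For what it's worth, a minimal sketch of a retry wrapper of the kind being discussed, assuming the CI invokes kind load from a shell script; the attempt count and sleep are arbitrary:

```bash
# Retry `kind load docker-image` a few times before giving up.
load_with_retry() {
  local image="$1" attempts=3
  for attempt in $(seq 1 "${attempts}"); do
    if kind load docker-image "${image}" --name "${CLUSTER_NAME:-kind}"; then
      return 0
    fi
    echo "kind load failed for ${image} (attempt ${attempt}/${attempts}); retrying..." >&2
    sleep 5
  done
  return 1
}

load_with_retry app:dev   # placeholder image name
```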
Quick update: 6 days ago we added retries. I think the logic in my change is wrong and it doesn't always retry, but anecdotally things seemed to improve. We still have not updated kind, so I think for now this is mostly mitigated.
Actually, I just realized that in the one test where I have seen an error, the retry was broken. So I think I have never seen it fail with 3 retries.
ACK, thanks!
Forgot to ping, with this coming out in the KubeCon chaos :/ Any sign of this with the 0.6 images?
I have not seen any issues loading images in at least a month, but we do have retries now that may be masking errors.
Going to close this for now, but keeping an eye out for signs of this.
What happened:
We occasionally are seeing issues loading kind images. From a rough grep, I think this is impacting roughly 3% of our PRs. Note that each PR runs ~20 tests and loads ~10 images, and may be rerun many times due to test failures or new commits, etc. So this number actually means kind load is likely only failing 0.003% of the time, I guess?

What you expected to happen:
Ideally, kind load is more robust and doesn't experience errors. But if that is not feasible, it may be nice to have some better logging/error messages, and maybe retries? I'm not too sure, as I don't yet understand the root cause.

How to reproduce it (as minimally and precisely as possible):
These are very intermittent failures, so I am not sure we can reproduce it easily. I can, however, point you to a bunch of logs:
We do everything with loglevel=debug and dump the kind logs into Artifacts, so hopefully we have everything there. I didn't really look through the logs much, as I don't know what to look for, but I'm happy to look deeper if I'm pointed in the right direction.
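As an illustration, this is roughly how those debug logs can be produced and collected with kind v0.5.x (the --loglevel flag matches the loglevel=debug setting mentioned above; newer releases use -v for verbosity). The image, cluster name, and artifacts path are placeholders:

```bash
# Load with debug logging enabled.
kind load docker-image app:dev --name "${CLUSTER_NAME:-kind}" --loglevel debug

# Dump the node logs (including containerd.log) into the CI artifacts directory.
kind export logs ./artifacts --name "${CLUSTER_NAME:-kind}"
```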
Anything else we need to know?:
As mentioned, this is pretty rare. A 99.999% pass rate is pretty solid, so I wouldn't be too disappointed if nothing could be done here.
Environment:
- kind version (use `kind version`): 0.5.1
- Kubernetes version (use `kubectl version`): Running Kind on GKE 1.13. I think all of these are spinning up 1.15 clusters.
- Docker version (use `docker info`): 18.06.1
- OS (e.g. from `/etc/os-release`): cOS