
Very infrequent "failed to load image: exit status 1" errors #921

Closed
howardjohn opened this issue Oct 4, 2019 · 15 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@howardjohn
Contributor

What happened:
We are occasionally seeing failures loading kind images. From a rough grep, I think this is impacting roughly 3% of our PRs. Note that each PR runs ~20 tests and loads ~10 images, and may be rerun many times due to test failures or new commits, etc. So this number actually means kind load itself is likely only failing .003% of the time, I guess?

What you expected to happen:

Ideally, kind load is more robust and doesn't experience errors. But if that is not feasible, it may be nice to have better logging/error messages, or maybe retries. I'm not too sure, as I don't yet understand the root cause.

How to reproduce it (as minimally and precisely as possible):

These are very intermittent failures, so I am not sure we can reproduce it easily. I can, however, point you to a bunch of logs:

We do everything with loglevel=debug and dump kind logs in Artifacts so hopefully we have everything there. I didn't really look through the logs much as I don't know what to look for, but happy to look deeper if I am pointed in the right direction.

Anything else we need to know?:

As mentioned, this is pretty rare. A 99.999% pass rate is pretty solid, so I wouldn't be too disappointed if nothing could be done here.

Environment:

  • kind version: (use kind version): 0.5.1
  • Kubernetes version: (use kubectl version): Running Kind on GKE 1.13. I think all of these are spinning up 1.15 clusters
  • Docker version: (use docker info): 18.06.1
  • OS (e.g. from /etc/os-release): COS
@howardjohn howardjohn added the kind/bug Categorizes issue or PR as related to a bug. label Oct 4, 2019
@BenTheElder BenTheElder self-assigned this Oct 4, 2019
@howardjohn howardjohn changed the title Occasional "failed to load image: exit status 1" errors Very infrequent "failed to load image: exit status 1" errors Oct 4, 2019
@BenTheElder
Member

But if that is not feasible, it may be nice to have better logging/error messages, or maybe retries. I'm not too sure, as I don't yet understand the root cause.

Retries are a good idea to explore, though ideally this doesn't fail :/

Logging is substantially more powerful in HEAD: -v 1 or greater will result in a stack trace being logged on failure, along with the command output if the failure was caused by executing a command.
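
For example, a load invocation with that verbosity might look like this (the image and cluster names here are placeholders, not from your setup):

    kind load docker-image example/app:dev --name istio-testing -v 1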

Have we experimented yet with combining the 10 images with docker save ... and then using kind load image-archive instead of kind load docker-image?
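
A rough sketch of that approach, assuming a bash-based CI script (the image names and the istio-testing cluster name are placeholders):

    # bundle several images into a single tarball
    docker save -o images.tar example/app-a:dev example/app-b:dev
    # load the whole archive into the kind cluster in one step
    kind load image-archive images.tar --name istio-testing

That would replace ~10 separate per-image copies with a single archive transfer, which might also reduce concurrent pressure on containerd.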

@howardjohn
Contributor Author

howardjohn commented Oct 5, 2019 via email

@aojea
Contributor

aojea commented Oct 12, 2019

The containerd log contains several errors about failing to load the images:

https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/17569/e2e-simpleTests_istio/1326/artifacts/kind/istio-testing-control-plane/containerd.log

Maybe loading 10 images into containerd in parallel is too much, and we should cap the maximum?
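
One way to cap it on the caller side, as a minimal sketch (the IMAGES array and the istio-testing cluster name are assumptions about the CI script, not something from this thread):

    # load images at most 4 at a time instead of all at once
    printf '%s\n' "${IMAGES[@]}" | xargs -P 4 -I{} kind load docker-image {} --name istio-testing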

@howardjohn
Contributor Author

Seems plausible, I'll drop the parallelism a bit and see what happens.

@BenTheElder
Member

It's also possible we may have picked up a containerd fix with the new containerd infra; we should have much more recent containerd builds going forward (e.g. in HEAD of kind we're on the latest stable release + backports).

@howardjohn
Contributor Author

I switched to loading images 1 at a time instead of in parallel, and at the same time our testing load roughly doubled due to an incoming release. Load failures seem about the same if not a little bit worse.

Sounds like the next step would be to try the newer versions of kind. I was planning to wait for v0.6.0, would you suggest we just switch to master now?

@BenTheElder
Member

If you pin a particular commit, master as of this moment is probably a good choice. I'm intending to get v0.6.0 out soon-ish, but possibly not fast enough to resolve this.

How horrendous would it be to add a retry? It would flake to being slower, but possibly succeed instead of totally failing.

@BenTheElder
Member

considering a similar trade-off for #949

@howardjohn
Contributor Author

Retry is a good option; it seems a worthwhile tradeoff. I'll try that out, and if I see it again I'll update to some commit on master. Thanks!
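
A minimal sketch of such a retry wrapper (the attempt count, delay, and istio-testing cluster name are all assumptions for illustration):

    # retry kind load a few times before giving up
    load_with_retry() {
      local image="$1"
      for attempt in 1 2 3; do
        kind load docker-image "${image}" --name istio-testing && return 0
        echo "kind load failed (attempt ${attempt}), retrying..." >&2
        sleep 5
      done
      return 1   # still propagate a failure if every attempt failed
    }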

@howardjohn
Contributor Author

Quick update: we added retries 6 days ago. I think the logic in my change is wrong and it doesn't always retry, but anecdotally things seem to have improved. We still have not updated past v0.5.1.

So I think for now this is mostly mitigated.

@howardjohn
Contributor Author

Actually, I just realized that in the one test where I have seen an error, the retry was broken. So I think I have never seen it fail with 3 retries.

@BenTheElder
Member

ACK, thanks!
I'll send you a ping when 0.6 is ready and out; I'm continuing to focus on identifying any stability weak points and eliminating them. I think the last known one now is log export issues, which I'll tackle shortly. We have a lot more CI signal and better ability to ship the latest containerd improvements now, which should help.

@BenTheElder
Member

Forgot to ping when this came out in the KubeCon chaos :/

Any sign of this with 0.6 images?
We're keeping up to date with containerd's latest 1.3.X versions now as we develop kind.

@howardjohn
Contributor Author

I have not seen any issues loading images in at least a month, but we do have retries now that may be masking errors.

@BenTheElder
Member

Going to close this for now, but keeping an eye out for signs of this.
