Single node installation (SNO) sometimes fails because of etcd pods unable to start #8225
Labels
lifecycle/stale
Version
Platform:
Azure
AWS
IPI
What happened?
A single-node OpenShift deployment sometimes fails with a timeout waiting on the API. After investigating on the master node, I found the API pod is not running because it cannot contact etcd. The etcd pod restarts itself in a never-ending loop because it is waiting for a response from the bootstrap node (which it never gets, because the bootstrap node has already been removed by the installer).
The etcd log contains these repeating messages:
The IP address 10.242.20.6 in the message above is (or rather was) the bootstrap node, which no longer exists (the installer removed it). So it looks like a race condition where the bootstrap node is sometimes removed too soon.
I'll attach the full etcd log here. I also have a sosreport from the master, but it is too big to upload; I can provide it if needed.
etcd.zip
What you expected to happen?
The build should always finish successfully.
How to reproduce it (as minimally and precisely as possible)?
Run several SNO IPI builds; the issue usually reproduces within about five builds.
A minimal install config is enough. I was able to reproduce it with both private and public SNO clusters in Azure, and with a private cluster in AWS (I do most builds in Azure). An example of my install config is here:
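For illustration only (the attached config is not reproduced above): a minimal SNO install-config.yaml for Azure IPI might look like the sketch below. All names, the domain, the resource group, and the region are placeholders I made up, not values from this report.

```yaml
apiVersion: v1
baseDomain: example.com            # placeholder domain
metadata:
  name: sno-test                   # placeholder cluster name
compute:
- name: worker
  replicas: 0                      # no worker nodes for SNO
controlPlane:
  name: master
  replicas: 1                      # a single control-plane node makes this SNO
platform:
  azure:
    baseDomainResourceGroupName: my-rg   # placeholder resource group
    region: westeurope                   # placeholder region
pullSecret: '...'                  # redacted
sshKey: '...'                      # redacted
```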
Anything else we need to know?
If more logs or tests are needed, just let me know what I should collect. I also opened a ticket with Red Hat support, but since this occurs only occasionally, they are asking for more evidence...
References
These two may be related, but it is hard to say as they have no logs at all:
#8049
#7982