Single node installation (SNO) sometimes fails because of etcd pods unable to start #8225
Labels
lifecycle/stale
Version
Platform:
Azure
AWS
IPI
What happened?
A single-node OpenShift deployment sometimes fails with a timeout waiting on the API. After investigating on the master node, I found the API pod is not running because it cannot contact etcd. The etcd pod restarts itself in a never-ending loop because it is waiting for a response from the bootstrap node (which it never gets, because the bootstrap node has already been removed by the installer).
The etcd log contains these repeating messages:
The IP address 10.242.20.6 in the message above is (or rather was) the bootstrap node, which no longer exists (the installer removed it). So it looks like a race condition where the bootstrap node is sometimes removed too soon.
I'll attach the full etcd log here. I also have a sosreport from the master, but it is too big to upload; I can provide it if needed.
etcd.zip
What you expected to happen?
The build should always finish successfully.
How to reproduce it (as minimally and precisely as possible)?
Run several SNO IPI builds; the issue usually reproduces within about five builds.
A minimal install config is enough. I was able to reproduce it with both private and public SNO clusters in Azure, and with a private cluster in AWS (I do most builds in Azure). An example of my install config is here:
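For illustration only (the attached config is not reproduced above): a minimal SNO install-config.yaml for Azure IPI might look like the sketch below. All names, the domain, the resource group, and the region are placeholders I made up, not values from this report.

```yaml
apiVersion: v1
baseDomain: example.com            # placeholder domain
metadata:
  name: sno-test                   # placeholder cluster name
compute:
- name: worker
  replicas: 0                      # no worker nodes for SNO
controlPlane:
  name: master
  replicas: 1                      # a single control-plane node makes this SNO
platform:
  azure:
    baseDomainResourceGroupName: my-rg   # placeholder resource group
    region: westeurope                   # placeholder region
pullSecret: '...'                  # redacted
sshKey: '...'                      # redacted
```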
Anything else we need to know?
If more logs or tests are needed, just let me know what I should collect. I also opened a ticket with Red Hat support, but since this occurs only occasionally, they are asking for more evidence...
References
These two may be related, but it is hard to say as they have no logs at all:
#8049
#7982