
libvirt: fails on "Some cluster operators are still updating: authentication, console: timed out waiting for the condition" #1648

Closed
mnovak1 opened this issue Apr 18, 2019 · 14 comments

mnovak1 commented Apr 18, 2019

$ openshift-install version
bin/openshift-install unreleased-master-832-g7aea0d5d115ed2a31e756ae778ed416e744ce2d7
built from commit 7aea0d5d115ed2a31e756ae778ed416e744ce2d7
release image registry.svc.ci.openshift.org/origin/release:v4.1

# Platform (aws|libvirt|openstack):
libvirt - Fedora 30

# What happened?
Installation failed with:
env TF_VAR_libvirt_master_memory=16168 TF_VAR_libvirt_master_vcpu=8 ./bin/openshift-install create cluster --dir /root/ocp4/
...
time="2019-04-18T05:58:50-04:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-04-18-062748: 99% complete, waiting on authentication, console"
time="2019-04-18T06:02:35-04:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console"
time="2019-04-18T06:08:40-04:00" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition"

# What you expected to happen?
OCP Cluster is created.

# How to reproduce it (as minimally and precisely as possible)?
Follow the steps in https://github.com/openshift/installer/blob/release-4.1/docs/dev/libvirt/README.md on Fedora 30.

mnovak1 commented Apr 18, 2019

I can provide more details; just let me know what you need.


mnovak1 commented Apr 18, 2019

Looks like the authentication operator failed:

[root@dev137 installer]# oc --config=/root/ocp4/auth/kubeconfig get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: 2019-04-18T09:33:51Z
    generation: 1
    name: version
    namespace: ""
    resourceVersion: "18010"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 133009aa-61bd-11e9-8a3a-664f163f5f0f
  spec:
    channel: stable-4.0
    clusterID: 06e1ed5f-25d5-431c-8f2f-4490e9d422cd
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: 2019-04-18T09:33:55Z
      status: "False"
      type: Available
    - lastTransitionTime: 2019-04-18T10:02:35Z
      message: 'Some cluster operators are still updating: authentication, console'
      reason: ClusterOperatorsNotAvailable
      status: "True"
      type: Failing

because of:

>  oc --config=/root/ocp4/auth/kubeconfig get clusteroperator authentication -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
> Failing False Failing: error checking current version: unable to check route health: failed to GET route: dial tcp: lookup openshift-authentication-openshift-authentication.apps.test1.tt.testing on 172.30.0.10:53: no such host
> Progressing Unknown 
> Available Unknown 
> Upgradeable Unknown 
> 

I'm not able to SSH to the master node because it prompts for a password; for some reason key-based SSH does not work. Where is 172.30.0.10:53 coming from?
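
For context, 172.30.0.10 is the ClusterIP of the in-cluster DNS service on the 172.30.0.0/16 service network, so the authentication operator is trying to resolve the route hostname through cluster DNS. A quick way to confirm, assuming the default DNS operator naming (service dns-default in the openshift-dns namespace):

oc --config=/root/ocp4/auth/kubeconfig -n openshift-dns get service dns-default
# the CLUSTER-IP column should show 172.30.0.10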


ghost commented Apr 19, 2019

@mnovak1,

  • did you generate the SSH key at install time? (during the installation process you need to provide your public key so that it gets injected into the CoreOS nodes)

  • there is a name resolution problem with dnsmasq

Regards,
Fábio Sbano


mnovak1 commented Apr 25, 2019

@ssbano hi, thanks for the tips. I don't think this is a dnsmasq issue, because I hit a dnsmasq problem before and already fixed it. I've gotten further in the installation process and am now hitting this issue.

I was not asked to provide SSH keys during the installation process:

[root@dev207 installer]# env TF_VAR_libvirt_master_memory=160168 TF_VAR_libvirt_master_vcpu=32 ./bin/openshift-install create cluster --dir /root/ocp4/
? Platform libvirt
? Libvirt Connection URI qemu+tcp://192.168.122.1/system
? Base Domain tt.testing
? Cluster Name test1
? Pull Secret [? for help] ********

INFO Fetching OS image: rhcos-410.8.20190412.1-qemu.qcow2 
INFO Creating infrastructure resources...         
INFO Waiting up to 30m0s for the Kubernetes API at https://api.test1.tt.testing:6443... 
INFO API v1.13.4+f716ef3 up                       
INFO Waiting up to 30m0s for bootstrapping to complete... 
INFO Destroying the bootstrap resources...        
INFO Waiting up to 30m0s for the cluster at https://api.test1.tt.testing:6443 to initialize... 
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition 

The root user under which openshift-install runs has SSH keys generated:
[root@dev209 ~]# ls ~/.ssh/
authorized_keys id_rsa id_rsa.pub known_hosts

I would expect them to be picked up automatically.
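
For what it's worth, as far as I can tell openshift-install does not pick keys up from ~/.ssh on its own in this flow; the public key is carried in install-config.yaml. A minimal sketch of the relevant field (the value below is a placeholder):

# placeholder; paste the contents of ~/.ssh/id_rsa.pub here
sshKey: ssh-rsa AAAAB3NzaC1yc2E...placeholder... root@dev209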


rbo commented Apr 29, 2019

I have the same problem:

oc --config=/root/ocp4-libvirt/auth/kubeconfig get clusteroperator authentication -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
Failing True Failing: error checking payload readiness: unable to check route health: failed to GET route: dial tcp: lookup openshift-authentication-openshift-authentication.apps.ocp4.bohne.io on 172.30.0.10:53: no such host
Progressing False
Available False
Upgradeable Unknown

It looks like there is no wildcard DNS entry for *.apps.${CLUSTER}.${DOMAIN}.

I cannot find any such entries in https://github.com/openshift/installer/blob/master/data/data/libvirt/main.tf.

dnsmasq 2.77 and later supports wildcard DNS entries, but it looks like libvirt doesn't expose that: https://bugzilla.redhat.com/show_bug.cgi?id=1532856

Not tested yet, but this might be helpful: https://gist.github.com/praveenkumar/4f0b929593563c087bc724b75c83ee40#gistcomment-2854047
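
For anyone trying that approach: the idea is to pass a wildcard address= option through to the dnsmasq instance libvirt runs for the cluster network. A rough sketch, assuming a libvirt version that supports the dnsmasq options XML namespace; the network name, domain, and worker IP below are placeholders, so adjust them to your cluster:

virsh net-edit test1        # placeholder name; find the real one with `virsh net-list`

then add, inside the <network> element (the xmlns attribute goes on <network> itself):

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  ...existing network definition...
  <dnsmasq:options>
    <!-- wildcard record for *.apps.test1.tt.testing -->
    <dnsmasq:option value='address=/.apps.test1.tt.testing/192.168.126.51'/>
  </dnsmasq:options>
</network>

Destroying and restarting the network (or rebooting the VMs) should pick up the new dnsmasq option.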


aenertia commented May 1, 2019

I updated the Bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=1701209 with some additional info. For me, it appears the fetch of the openshift-api container image for 4.1 is failing.


nehaberry commented May 10, 2019

I am also seeing similar failure messages when trying to install OCP 4.1 on a CentOS-based libvirt setup.

Here are some of the details:

# bin/openshift-install version
bin/openshift-install unreleased-master-972-g7d1959b1b2a01c36e01d24b646be1ba01a008e49
built from commit 7d1959b
release image registry.svc.ci.openshift.org/origin/release:v4.1

# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0-201905062232+ef90487-dirty", GitCommit:"ef90487", GitTreeState:"dirty", BuildDate:"2019-05-07T03:33:12Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+23d4f6f", GitCommit:"23d4f6f", GitTreeState:"clean", BuildDate:"2019-05-09T11:10:35Z", GoVersion:"go1.11.8", Compiler:"gc", Platform:"linux/amd64"}

Logs



time="2019-05-09T19:05:15+05:30" level=info msg="Waiting up to 30m0s for the cluster at https://api.node2.tt.testing:6443 to initialize..."
time="2019-05-09T19:20:47+05:30" level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Cluster operator authentication has not yet reported success\n* Cluster operator image-registry has not yet reported success\n* Cluster operator ingress has not yet reported success\n* Cluster operator kube-apiserver is reporting a failure: StaticPodsDegraded: nodes/node2-dj2n8-master-1 pods/kube-apiserver-node2-dj2n8-master-1 container=\"kube-apiserver-2\" is not ready\nStaticPodsDegraded: pods \"kube-apiserver-node2-dj2n8-master-2\" not found\nStaticPodsDegraded: pods \"kube-apiserver-node2-dj2n8-master-0\" not found\n* Cluster operator kube-controller-manager has not yet reported success\n* Cluster operator kube-scheduler is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: \nStaticPodsDegraded: nodes/node2-dj2n8-master-1 pods/openshift-kube-scheduler-node2-dj2n8-master-1 container=\"scheduler\" is not ready\nStaticPodsDegraded: pods \"openshift-kube-scheduler-node2-dj2n8-master-2\" not found\nStaticPodsDegraded: pods \"openshift-kube-scheduler-node2-dj2n8-master-0\" not found\n* Cluster operator marketplace has not yet reported success\n* Cluster operator monitoring has not yet reported success\n* Cluster operator node-tuning has not yet reported success\n* Cluster operator openshift-apiserver is reporting a failure: ResourceSyncControllerDegraded: namespaces \"openshift-apiserver\" not found\n* Cluster operator service-catalog-apiserver has not yet reported success\n* Cluster operator service-catalog-controller-manager has not yet reported success\n* Cluster operator storage has not yet reported success\n* Could not update oauthclient \"console\" (221 of 351): the server does not recognize this resource, check extension API servers\n* Could not update rolebinding \"openshift/cluster-samples-operator-openshift-edit\" (183 of 351): resource may have been deleted\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (347 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (322 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (350 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (328 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (338 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (341 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (344 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (268 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor 
\"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (331 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (334 of 351): the server does not recognize this resource, check extension API servers"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: downloading update"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 1% complete"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 10% complete"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 22% complete"
time="2019-05-09T19:22:06+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 31% complete"
time="2019-05-09T19:22:20+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 82% complete"
time="2019-05-09T19:22:35+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 88% complete"
time="2019-05-09T19:22:50+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 89% complete"
time="2019-05-09T19:23:35+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 90% complete"
time="2019-05-09T19:23:50+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 91% complete"
time="2019-05-09T19:24:20+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 96% complete"
time="2019-05-09T19:25:35+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 97% complete, waiting on authentication, console, monitoring, openshift-samples"
time="2019-05-09T19:26:50+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 97% complete"
time="2019-05-09T19:27:50+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 99% complete"
time="2019-05-09T19:29:20+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 99% complete, waiting on authentication, console"
time="2019-05-09T19:33:35+05:30" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console"
time="2019-05-09T19:35:15+05:30" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition"

install-config

 **cat /root/install-config.yaml** 
apiVersion: v1
baseDomain: tt.testing
compute:
- hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: node2
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 192.168.126.0/24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  libvirt:
    URI: qemu+tcp://192.168.122.1/system
    network:
      if: tt0

oc get co
NAME                                  VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                        Unknown     Unknown       True       15h
cloud-credential                      4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
cluster-autoscaler                    4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
console                               4.1.0-0.okd-2019-05-09-131310   False       True          False      15h
dns                                   4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
image-registry                        4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
ingress                               4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
kube-apiserver                        4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
kube-controller-manager               4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
kube-scheduler                        4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
machine-api                           4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
machine-config                        4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
marketplace                           4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
monitoring                            4.1.0-0.okd-2019-05-09-131310   True        False         False      44m
network                               4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
node-tuning                           4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
openshift-apiserver                   4.1.0-0.okd-2019-05-09-131310   True        False         False      45m
openshift-controller-manager          4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
openshift-samples                     4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
operator-lifecycle-manager            4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
operator-lifecycle-manager-catalog    4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
service-ca                            4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
service-catalog-apiserver             4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
service-catalog-controller-manager    4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
storage                               4.1.0-0.okd-2019-05-09-131310   True        False         False      15h

oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.okd-2019-05-09-131310   False       True          15h     Unable to apply 4.1.0-0.okd-2019-05-09-131310: some cluster operators have not yet rolled out

Logs from install are attached here:
initial-ocs.zip

Thanks,
neha


zeenix commented Jun 12, 2019

/label platform/libvirt


zeenix commented Jun 17, 2019

AFAICT the remaining problem here is the one tracked by #1007. Closing this in favour of that.

/close

@openshift-ci-robot

@zeenix: Closing this issue.

In response to this:

AFAICT the remaining problem here is the one tracked by #1007. Closing this in favour of that.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rthallisey

Saw this on 4.3.1 libvirt install.

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED    
authentication                                       Unknown     Unknown       True
console                                    4.3.1     False       True          False                                                                                                                                                                                              

The console and oauth operators were not reaching the worker because the pods were using the K8s DNS name server to locate oauth-openshift.apps.<cluster_name>.tt.testing. You can confirm this with the following error in the authentication operator logs: failed to GET route: dial tcp: lookup oauth-openshift.apps.<cluster_name>.tt.testing on 172.30.0.10:53: no such host. It makes sense that the K8s DNS name server would be searched; however, K8s should not be responsible for name resolution on the control plane network. That should be handled by libvirt.

$ WORKER_IP=192.168.126.51
$ virsh net-update <cluster_libvirt_network> add dns-host "<host ip='$WORKER_IP'><hostname>oauth-openshift.apps.<cluster_name>.tt.testing</hostname></host>"

Problem solved.

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED
authentication                             4.3.1     True        False         False
console                                    4.3.1     True        False         False
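
One follow-up on that net-update call, in case the record disappears after a network restart: virsh net-update also accepts persistence flags, so a sketch with the same placeholder values would be:

$ virsh net-update <cluster_libvirt_network> add dns-host "<host ip='$WORKER_IP'><hostname>oauth-openshift.apps.<cluster_name>.tt.testing</hostname></host>" --live --config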

@fenggolang

Regarding @rthallisey's workaround above:

There are two workers, so should WORKER_IP be the worker-0 IP or the worker-1 IP?


cfergeau commented Jul 7, 2022

> There are two workers, so should WORKER_IP be the worker-0 IP or the worker-1 IP?

If you have multiple workers, you'll need to set up something more sophisticated; I'm not sure how, though :(
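
A rough approach, assuming the default ingress setup on libvirt (router pods bound to host ports on the workers): point the record at whichever worker is actually running the router pods, or (untested) repeat the net-update once per worker IP so DNS returns both addresses. The names below are the same placeholders as in the earlier comment:

$ oc -n openshift-ingress get pods -o wide
# the NODE column shows which worker(s) run the router-default pods
$ virsh net-update <cluster_libvirt_network> add dns-host "<host ip='<router worker IP>'><hostname>oauth-openshift.apps.<cluster_name>.tt.testing</hostname></host>"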
