
libvirt: fails on "Some cluster operators are still updating: authentication, console: timed out waiting for the condition" #1648

Closed
mnovak1 opened this issue Apr 18, 2019 · 14 comments

mnovak1 commented Apr 18, 2019

$ openshift-install version
bin/openshift-install unreleased-master-832-g7aea0d5d115ed2a31e756ae778ed416e744ce2d7
built from commit 7aea0d5d115ed2a31e756ae778ed416e744ce2d7
release image registry.svc.ci.openshift.org/origin/release:v4.1

# Platform (aws|libvirt|openstack):
libvirt - Fedora 30

# What happened?
Installation failed with:
env TF_VAR_libvirt_master_memory=16168 TF_VAR_libvirt_master_vcpu=8 ./bin/openshift-install create cluster --dir /root/ocp4/
...
time="2019-04-18T05:58:50-04:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-04-18-062748: 99% complete, waiting on authentication, console"
time="2019-04-18T06:02:35-04:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console"
time="2019-04-18T06:08:40-04:00" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition"

# What you expected to happen?
OCP Cluster is created.

# How to reproduce it (as minimally and precisely as possible)?
Follow the steps in https://github.com/openshift/installer/blob/release-4.1/docs/dev/libvirt/README.md on Fedora 30.

mnovak1 commented Apr 18, 2019

I can provide more details; just let me know what you need.


mnovak1 commented Apr 18, 2019

Looks like the authentication operator failed:

[root@dev137 installer]# oc --config=/root/ocp4/auth/kubeconfig get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: 2019-04-18T09:33:51Z
    generation: 1
    name: version
    namespace: ""
    resourceVersion: "18010"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 133009aa-61bd-11e9-8a3a-664f163f5f0f
  spec:
    channel: stable-4.0
    clusterID: 06e1ed5f-25d5-431c-8f2f-4490e9d422cd
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: 2019-04-18T09:33:55Z
      status: "False"
      type: Available
    - lastTransitionTime: 2019-04-18T10:02:35Z
      message: 'Some cluster operators are still updating: authentication, console'
      reason: ClusterOperatorsNotAvailable
      status: "True"
      type: Failing

because of:

>  oc --config=/root/ocp4/auth/kubeconfig get clusteroperator authentication -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
> Failing False Failing: error checking current version: unable to check route health: failed to GET route: dial tcp: lookup openshift-authentication-openshift-authentication.apps.test1.tt.testing on 172.30.0.10:53: no such host
> Progressing Unknown 
> Available Unknown 
> Upgradeable Unknown 
> 

I'm not able to SSH to the master node because it prompts for a password; for some reason key-based SSH does not work. Where is 172.30.0.10:53 coming from?
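
For context, 172.30.0.10 is the ClusterIP of the in-cluster DNS service on the 172.30.0.0/16 service network, so the authentication operator is trying to resolve the route hostname through cluster DNS. A quick way to confirm, assuming the default DNS operator naming (service dns-default in the openshift-dns namespace):

oc --config=/root/ocp4/auth/kubeconfig -n openshift-dns get service dns-default
# the CLUSTER-IP column should show 172.30.0.10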


ghost commented Apr 19, 2019

@mnovak1,

  • did you generate the SSH key at install time? (during the installation process you need to provide your public key so that it gets injected into the CoreOS nodes)

  • there is a name resolution problem with dnsmasq

Regards,
Fábio Sbano


mnovak1 commented Apr 25, 2019

@ssbano hi, thanks for the tips. I don't think this is a dnsmasq issue, because I hit a dnsmasq problem before and already fixed it. I've gotten further in the installation process and am now hitting this issue.

I was not asked to provide SSH keys during the installation process:

[root@dev207 installer]# env TF_VAR_libvirt_master_memory=160168 TF_VAR_libvirt_master_vcpu=32 ./bin/openshift-install create cluster --dir /root/ocp4/
? Platform libvirt
? Libvirt Connection URI qemu+tcp://192.168.122.1/system
? Base Domain tt.testing
? Cluster Name test1
? Pull Secret [? for help] ********

INFO Fetching OS image: rhcos-410.8.20190412.1-qemu.qcow2 
INFO Creating infrastructure resources...         
INFO Waiting up to 30m0s for the Kubernetes API at https://api.test1.tt.testing:6443... 
INFO API v1.13.4+f716ef3 up                       
INFO Waiting up to 30m0s for bootstrapping to complete... 
INFO Destroying the bootstrap resources...        
INFO Waiting up to 30m0s for the cluster at https://api.test1.tt.testing:6443 to initialize... 
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition 

The root user under which openshift-install runs has SSH keys generated:
[root@dev209 ~]# ls ~/.ssh/
authorized_keys id_rsa id_rsa.pub known_hosts

I would expect them to be picked up automatically.
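
For what it's worth, as far as I can tell openshift-install does not pick keys up from ~/.ssh on its own in this flow; the public key is carried in install-config.yaml. A minimal sketch of the relevant field (the value below is a placeholder):

# placeholder; paste the contents of ~/.ssh/id_rsa.pub here
sshKey: ssh-rsa AAAAB3NzaC1yc2E...placeholder... root@dev209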


rbo commented Apr 29, 2019

I have the same problem:

oc --config=/root/ocp4-libvirt/auth/kubeconfig get clusteroperator authentication -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
Failing True Failing: error checking payload readiness: unable to check route health: failed to GET route: dial tcp: lookup openshift-authentication-openshift-authentication.apps.ocp4.bohne.io on 172.30.0.10:53: no such host
Progressing False
Available False
Upgradeable Unknown

It looks like there is no wildcard DNS entry for *.apps.${CLUSTER}.${DOMAIN}.

I cannot find any such entries in https://github.com/openshift/installer/blob/master/data/data/libvirt/main.tf.

dnsmasq 2.77 and later supports wildcard DNS entries, but it looks like libvirt doesn't expose that: https://bugzilla.redhat.com/show_bug.cgi?id=1532856

Not tested yet, but this might be helpful: https://gist.github.com/praveenkumar/4f0b929593563c087bc724b75c83ee40#gistcomment-2854047
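
For anyone trying that approach: the idea is to pass a wildcard address= option through to the dnsmasq instance libvirt runs for the cluster network. A rough sketch, assuming a libvirt version that supports the dnsmasq options XML namespace; the network name, domain, and worker IP below are placeholders, so adjust them to your cluster:

virsh net-edit test1        # placeholder name; find the real one with `virsh net-list`

then add, inside the <network> element (the xmlns attribute goes on <network> itself):

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  ...existing network definition...
  <dnsmasq:options>
    <!-- wildcard record for *.apps.test1.tt.testing -->
    <dnsmasq:option value='address=/.apps.test1.tt.testing/192.168.126.51'/>
  </dnsmasq:options>
</network>

Destroying and restarting the network (or rebooting the VMs) should pick up the new dnsmasq option.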


aenertia commented May 1, 2019

I updated the Bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=1701209 with some additional info. For me, it appears the fetch of the openshift-api container image for 4.1 is failing.


nehaberry commented May 10, 2019

I am also seeing similar failure messages when trying to install OCP 4.1 on a CentOS-based libvirt setup.

Here are some of the details:

# bin/openshift-install version
bin/openshift-install unreleased-master-972-g7d1959b1b2a01c36e01d24b646be1ba01a008e49
built from commit 7d1959b
release image registry.svc.ci.openshift.org/origin/release:v4.1

# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0-201905062232+ef90487-dirty", GitCommit:"ef90487", GitTreeState:"dirty", BuildDate:"2019-05-07T03:33:12Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+23d4f6f", GitCommit:"23d4f6f", GitTreeState:"clean", BuildDate:"2019-05-09T11:10:35Z", GoVersion:"go1.11.8", Compiler:"gc", Platform:"linux/amd64"}

Logs



time="2019-05-09T19:05:15+05:30" level=info msg="Waiting up to 30m0s for the cluster at https://api.node2.tt.testing:6443 to initialize..."
time="2019-05-09T19:20:47+05:30" level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Cluster operator authentication has not yet reported success\n* Cluster operator image-registry has not yet reported success\n* Cluster operator ingress has not yet reported success\n* Cluster operator kube-apiserver is reporting a failure: StaticPodsDegraded: nodes/node2-dj2n8-master-1 pods/kube-apiserver-node2-dj2n8-master-1 container=\"kube-apiserver-2\" is not ready\nStaticPodsDegraded: pods \"kube-apiserver-node2-dj2n8-master-2\" not found\nStaticPodsDegraded: pods \"kube-apiserver-node2-dj2n8-master-0\" not found\n* Cluster operator kube-controller-manager has not yet reported success\n* Cluster operator kube-scheduler is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: \nStaticPodsDegraded: nodes/node2-dj2n8-master-1 pods/openshift-kube-scheduler-node2-dj2n8-master-1 container=\"scheduler\" is not ready\nStaticPodsDegraded: pods \"openshift-kube-scheduler-node2-dj2n8-master-2\" not found\nStaticPodsDegraded: pods \"openshift-kube-scheduler-node2-dj2n8-master-0\" not found\n* Cluster operator marketplace has not yet reported success\n* Cluster operator monitoring has not yet reported success\n* Cluster operator node-tuning has not yet reported success\n* Cluster operator openshift-apiserver is reporting a failure: ResourceSyncControllerDegraded: namespaces \"openshift-apiserver\" not found\n* Cluster operator service-catalog-apiserver has not yet reported success\n* Cluster operator service-catalog-controller-manager has not yet reported success\n* Cluster operator storage has not yet reported success\n* Could not update oauthclient \"console\" (221 of 351): the server does not recognize this resource, check extension API servers\n* Could not update rolebinding \"openshift/cluster-samples-operator-openshift-edit\" (183 of 351): resource may have been deleted\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (347 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (322 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (350 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (328 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (338 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (341 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (344 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (268 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor 
\"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (331 of 351): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (334 of 351): the server does not recognize this resource, check extension API servers"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: downloading update"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 1% complete"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 10% complete"
time="2019-05-09T19:22:05+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 22% complete"
time="2019-05-09T19:22:06+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 31% complete"
time="2019-05-09T19:22:20+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 82% complete"
time="2019-05-09T19:22:35+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 88% complete"
time="2019-05-09T19:22:50+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 89% complete"
time="2019-05-09T19:23:35+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 90% complete"
time="2019-05-09T19:23:50+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 91% complete"
time="2019-05-09T19:24:20+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 96% complete"
time="2019-05-09T19:25:35+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 97% complete, waiting on authentication, console, monitoring, openshift-samples"
time="2019-05-09T19:26:50+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 97% complete"
time="2019-05-09T19:27:50+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 99% complete"
time="2019-05-09T19:29:20+05:30" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-0.okd-2019-05-09-131310: 99% complete, waiting on authentication, console"
time="2019-05-09T19:33:35+05:30" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console"
time="2019-05-09T19:35:15+05:30" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition"

install-config

 **cat /root/install-config.yaml** 
apiVersion: v1
baseDomain: tt.testing
compute:
- hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: node2
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 192.168.126.0/24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  libvirt:
    URI: qemu+tcp://192.168.122.1/system
    network:
      if: tt0

oc get co
NAME                                  VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                        Unknown     Unknown       True       15h
cloud-credential                      4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
cluster-autoscaler                    4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
console                               4.1.0-0.okd-2019-05-09-131310   False       True          False      15h
dns                                   4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
image-registry                        4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
ingress                               4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
kube-apiserver                        4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
kube-controller-manager               4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
kube-scheduler                        4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
machine-api                           4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
machine-config                        4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
marketplace                           4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
monitoring                            4.1.0-0.okd-2019-05-09-131310   True        False         False      44m
network                               4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
node-tuning                           4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
openshift-apiserver                   4.1.0-0.okd-2019-05-09-131310   True        False         False      45m
openshift-controller-manager          4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
openshift-samples                     4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
operator-lifecycle-manager            4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
operator-lifecycle-manager-catalog    4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
service-ca                            4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
service-catalog-apiserver             4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
service-catalog-controller-manager    4.1.0-0.okd-2019-05-09-131310   True        False         False      15h
storage                               4.1.0-0.okd-2019-05-09-131310   True        False         False      15h

oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.okd-2019-05-09-131310   False       True          15h     Unable to apply 4.1.0-0.okd-2019-05-09-131310: some cluster operators have not yet rolled out

Logs from install are attached here:
initial-ocs.zip

Thanks,
neha


zeenix commented Jun 12, 2019

/label platform/libvirt


zeenix commented Jun 17, 2019

AFAICT the remaining problem here is the one tracked by #1007. Closing this in favour of that.

/close

@openshift-ci-robot

@zeenix: Closing this issue.

In response to this:

AFAICT the remaining problem here is the one tracked by #1007. Closing this in favour of that.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rthallisey

Saw this on 4.3.1 libvirt install.

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED    
authentication                                       Unknown     Unknown       True
console                                    4.3.1     False       True          False                                                                                                                                                                                              

The console and oauth operators were not reaching the worker because the pods were using the K8s DNS name server to locate oauth-openshift.apps.<cluster_name>.tt.testing. You can confirm this with the following error in the authentication operator logs: failed to GET route: dial tcp: lookup oauth-openshift.apps.<cluster_name>.tt.testing on 172.30.0.10:53: no such host. It makes sense that the K8s DNS name server would be searched; however, K8s should not be responsible for name resolution on the control plane network. That should be handled by libvirt.

$ WORKER_IP=192.168.126.51
$ virsh net-update <cluster_libvirt_network> add dns-host "<host ip='$WORKER_IP'><hostname>oauth-openshift.apps.<cluster_name>.tt.testing</hostname></host>"

Problem solved.

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED
authentication                             4.3.1     True        False         False
console                                    4.3.1     True        False         False
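
One follow-up on that net-update call, in case the record disappears after a network restart: virsh net-update also accepts persistence flags, so a sketch with the same placeholder values would be:

$ virsh net-update <cluster_libvirt_network> add dns-host "<host ip='$WORKER_IP'><hostname>oauth-openshift.apps.<cluster_name>.tt.testing</hostname></host>" --live --config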

@fenggolang

Regarding @rthallisey's workaround above:

There are two workers, so should WORKER_IP be the worker-0 IP or the worker-1 IP?


cfergeau commented Jul 7, 2022

> There are two workers, so should WORKER_IP be the worker-0 IP or the worker-1 IP?

If you have multiple workers, you'll need to set up something more sophisticated; I'm not sure how, though :(
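
A rough approach, assuming the default ingress setup on libvirt (router pods bound to host ports on the workers): point the record at whichever worker is actually running the router pods, or (untested) repeat the net-update once per worker IP so DNS returns both addresses. The names below are the same placeholders as in the earlier comment:

$ oc -n openshift-ingress get pods -o wide
# the NODE column shows which worker(s) run the router-default pods
$ virsh net-update <cluster_libvirt_network> add dns-host "<host ip='<router worker IP>'><hostname>oauth-openshift.apps.<cluster_name>.tt.testing</hostname></host>"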
