CCM deletes cluster nodes when dynamic groups and policies are not set properly #434

Open
adriengentil opened this issue Jul 17, 2023 · 0 comments

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

CCM Version: 1.25+

Environment:

  • Kubernetes version (use kubectl version): OpenShift 4.14
# oc version
Client Version: 4.12.23
Kustomize Version: v4.5.7
Kubernetes Version: v1.27.3+4aaeaec
  • OS (e.g. from /etc/os-release):
$ cat /etc/redhat-release 
Red Hat Enterprise Linux CoreOS release 4.14
  • Kernel (e.g. uname -a):
$ uname -a
Linux localhost.localdomain 5.14.0-284.13.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 27 13:35:10 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Others:

What happened?

We deployed the CCM (with useInstancePrincipals: true) into our cluster without setting up dynamic groups and policies in our compartment. As a consequence, the cluster nodes were deleted (kubectl get nodes returned no nodes).

Because the CCM pods were evicted along with the nodes, investigating the problem and accessing the logs was difficult.

We suspect this behavior is not limited to OpenShift.

What you expected to happen?

Nodes are left uninitialized, and the CCM logs a meaningful message and retries until the user creates the required policies in OCI.

How to reproduce it (as minimally and precisely as possible)?

Provision a cluster and ensure:

  • policies are missing or misconfigured
  • the nodes are tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

then deploy the CCM with the useInstancePrincipals: true config flag. At this point, the CCM deletes the nodes.

Anything else we need to know?

Here are the logs of the CCM pod before it deletes a node:

I0717 15:13:56.437876       1 node_controller.go:415] Initializing node test-infra-cluster-4107b8b3-master-2 with cloud provider
E0717 15:13:56.437954       1 node_controller.go:229] error syncing 'test-infra-cluster-4107b8b3-master-2': failed to get instance metadata for node test-infra-cluster-4107b8b3-master-2: error fetching node by provider ID: compartmentID annotation missing in the node. Would retry, and error by node name: error getting CompartmentID from Node Name: compartmentID annotation missing in the node. Would retry, requeuing
2023-07-17T15:13:56.969Z	ERROR	oci/node_info_controller.go:244	Failed to get instance from instance ID	{"component": "cloud-controller-manager", "node": "test-infra-cluster-4107b8b3-master-2", "error": "Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found\nOperation Name: GetInstance\nTimestamp: 2023-07-17 15:13:54 +0000 GMT\nClient Version: Oracle-GoSDK/65.2.0\nRequest Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq\nTroubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.\nAlso see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.\nTo get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.\nIf you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message.", "errorVerbose": "Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found\nOperation Name: GetInstance\nTimestamp: 2023-07-17 15:13:54 +0000 GMT\nClient Version: Oracle-GoSDK/65.2.0\nRequest Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq\nTroubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.\nAlso see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.\nTo get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.\nIf you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message.\ngithub.com/oracle/oci-cloud-controller-manager/pkg/oci/client.(*client).GetInstance\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/oci/client/compute.go:50\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.getInstanceByNode\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:242\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:168\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processNextItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:139\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).runWorker\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:124\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).Run\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"}
2023-07-17T15:13:56.969Z	ERROR	oci/node_info_controller.go:142	Error processing node test-infra-cluster-4107b8b3-master-2 (will retry): Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found
Operation Name: GetInstance
Timestamp: 2023-07-17 15:13:54 +0000 GMT
Client Version: Oracle-GoSDK/65.2.0
Request Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq
Troubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.
Also see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.
To get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.
If you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message.	{"component": "cloud-controller-manager"}
I0717 15:13:58.998504       1 node_controller.go:415] Initializing node test-infra-cluster-4107b8b3-master-2 with cloud provider
E0717 15:13:58.998590       1 node_controller.go:229] error syncing 'test-infra-cluster-4107b8b3-master-2': failed to get instance metadata for node test-infra-cluster-4107b8b3-master-2: error fetching node by provider ID: compartmentID annotation missing in the node. Would retry, and error by node name: error getting CompartmentID from Node Name: compartmentID annotation missing in the node. Would retry, requeuing
I0717 15:13:59.329292       1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: test-infra-cluster-4107b8b3-master-2
I0717 15:13:59.329476       1 event.go:294] "Event occurred" object="test-infra-cluster-4107b8b3-master-2" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node test-infra-cluster-4107b8b3-master-2 because it does not exist in the cloud provider"
2023-07-17T15:14:03.394Z	ERROR	oci/node_info_controller.go:142	Error processing node test-infra-cluster-4107b8b3-master-0 (will retry): node "test-infra-cluster-4107b8b3-master-0" not found	{"component": "cloud-controller-manager"}