✨ Emit Kubernetes Events when Cluster Phase, ControlPlaneReady, or InfrastructureReady change #7786

Conversation

mtougeron
Member

What this PR does / why we need it:

Emit Kubernetes Events when Cluster status.Phase, status.ControlPlaneReady, or status.InfrastructureReady change and when the Cluster is deleted.
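
For illustration, a minimal sketch of the idea (a hedged sketch, not the exact PR diff; the helper name and placement are assumptions): emit a Normal Event whenever the Cluster phase changes, using the new phase as the Event reason so it roughly matches the kubectl output shown below.

```go
// Minimal sketch (illustrative, not the exact PR diff) of emitting a phase-change
// Event from the cluster controller using the standard EventRecorder.
package cluster

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// emitPhaseChangeEvent records a Normal Event when the Cluster phase changes,
// using the new phase (e.g. "Provisioning", "Provisioned") as the Event reason.
func emitPhaseChangeEvent(recorder record.EventRecorder, cluster *clusterv1.Cluster, oldPhase clusterv1.ClusterPhase) {
	newPhase := cluster.Status.GetTypedPhase()
	if newPhase == oldPhase {
		return
	}
	recorder.Eventf(cluster, corev1.EventTypeNormal, string(newPhase),
		"Cluster %s is %s", cluster.Name, string(newPhase))
}
```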

Which issue(s) this PR fixes:
relates to #7424 (but does not resolve it)

Output of events:

$> kubectl get cluster
NAME      PHASE          AGE   VERSION
tge-aks   Provisioning   13s   

$> kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN   TYPE     REASON         OBJECT            MESSAGE
19s         Normal   Provisioning   cluster/tge-aks   Cluster is Provisioning

$> kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN   TYPE     REASON                OBJECT            MESSAGE
52s         Normal   Provisioning          cluster/tge-aks   Cluster is Provisioning
22s         Normal   InfrastructureReady   cluster/tge-aks   Cluster InfrastructureReady is now true

$> kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN   TYPE     REASON                OBJECT            MESSAGE
4m16s       Normal   Provisioning          cluster/tge-aks   Cluster is Provisioning
3m46s       Normal   InfrastructureReady   cluster/tge-aks   Cluster InfrastructureReady is now true
6s          Normal   ControlPlaneReady     cluster/tge-aks   Cluster ControlPlaneReady is now true

$> kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN   TYPE     REASON                  OBJECT               MESSAGE
4m33s       Normal   Provisioning            cluster/tge-aks      Cluster is Provisioning
4m3s        Normal   InfrastructureReady     cluster/tge-aks      Cluster InfrastructureReady is now true
23s         Normal   ControlPlaneReady       cluster/tge-aks      Cluster ControlPlaneReady is now true
13s         Normal   Provisioned             cluster/tge-aks      Cluster is Provisioned

$> kubectl get cluster tge-aks                             
NAME      PHASE         AGE     VERSION
tge-aks   Provisioned   4m49s   

$> kubectl delete cluster tge-aks
cluster.cluster.x-k8s.io "tge-aks" deleted

$> kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN   TYPE     REASON                  OBJECT               MESSAGE
11m         Normal   Provisioning            cluster/tge-aks      Cluster is Provisioning
10m         Normal   InfrastructureReady     cluster/tge-aks      Cluster InfrastructureReady is now true
7m          Normal   ControlPlaneReady       cluster/tge-aks      Cluster ControlPlaneReady is now true
6m50s       Normal   Provisioned             cluster/tge-aks      Cluster is Provisioned
6m47s       Normal   SuccessfulSetNodeRefs   machinepool/system   [{Kind: Namespace: Name:aks-system-21404556-vmss000000 UID:1e467b71-25e5-4a17-a6a6-6ff4f0533683 APIVersion: ResourceVersion: FieldPath:}]
6m47s       Normal   SuccessfulSetNodeRefs   machinepool/worker   [{Kind: Namespace: Name:aks-worker-21404556-vmss000000 UID:469773bb-8a74-42b4-93ad-281fdd3589d3 APIVersion: ResourceVersion: FieldPath:}]
15s         Normal   Deleting                cluster/tge-aks      Cluster is Deleting

$> kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN   TYPE     REASON                  OBJECT               MESSAGE
17m         Normal   Provisioning            cluster/tge-aks      Cluster is Provisioning
16m         Normal   InfrastructureReady     cluster/tge-aks      Cluster InfrastructureReady is now true
13m         Normal   ControlPlaneReady       cluster/tge-aks      Cluster ControlPlaneReady is now true
12m         Normal   Provisioned             cluster/tge-aks      Cluster is Provisioned
12m         Normal   SuccessfulSetNodeRefs   machinepool/system   [{Kind: Namespace: Name:aks-system-21404556-vmss000000 UID:1e467b71-25e5-4a17-a6a6-6ff4f0533683 APIVersion: ResourceVersion: FieldPath:}]
12m         Normal   SuccessfulSetNodeRefs   machinepool/worker   [{Kind: Namespace: Name:aks-worker-21404556-vmss000000 UID:469773bb-8a74-42b4-93ad-281fdd3589d3 APIVersion: ResourceVersion: FieldPath:}]
6m23s       Normal   Deleting                cluster/tge-aks      Cluster is Deleting
11s         Normal   Deleted                 cluster/tge-aks      Cluster has been deleted

Note: In this example, the SuccessfulSetNodeRefs events were already being emitted by Cluster API.

I will be doing a follow-up PR for the remaining aspects of #7424 as it will involve more work and potential coordination with the Cluster API Providers.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 20, 2022
@enxebre
Member

enxebre commented Jan 2, 2023

This is valuable information and I'm +1 to events. Some questions:

Is there a reason not to infer this from the conditions ControlPlaneInitializedCondition / ControlPlaneReadyCondition / BootstrapReadyCondition / InfrastructureReadyCondition?

Also note that events do not follow contractual policies; they might change at any time. I'm curious: for the automation workflows described in the issue, have you considered consuming the conditions directly, or metrics/alerting instead?

Member

@fabriziopandini left a comment

@mtougeron thanks for working on this PR!
For the sake of keeping clean code organization and a simple main reconcile loop, what do you think about moving this change into reconcilePhase?

internal/controllers/cluster/cluster_controller.go (outdated review thread, resolved)
@mtougeron
Member Author

> Also note that events do not follow contractual policies; they might change at any time. I'm curious: for the automation workflows described in the issue, have you considered consuming the conditions directly, or metrics/alerting instead?

@enxebre The informer pattern only works with events, not with object status conditions. There isn't a way to follow changes to the status conditions of a resource that I'm aware of. This change will allow for being informed of events and then inspecting the Cluster resource to determine what action(s) need to be taken based on its status at that time.
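
To make that workflow concrete, here is a hedged consumer-side sketch (illustrative only, not part of this PR; the kubeconfig loading and client setup are assumptions): watch Events whose involvedObject is a CAPI Cluster, then go inspect the Cluster itself to decide what to do.

```go
// Hypothetical consumer of the new Cluster Events: an informer on corev1 Events
// filtered to those attached to cluster.x-k8s.io Cluster objects.
package main

import (
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Watch Events and keep only those whose involvedObject is a CAPI Cluster.
	factory := informers.NewSharedInformerFactory(clientset, 0)
	eventInformer := factory.Core().V1().Events().Informer()
	eventInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ev := obj.(*corev1.Event)
			if ev.InvolvedObject.Kind != "Cluster" || ev.InvolvedObject.APIVersion != "cluster.x-k8s.io/v1beta1" {
				return
			}
			// A real workflow would now fetch the Cluster (e.g. via a CAPI or
			// dynamic client) and decide what action to take from its status.
			log.Printf("cluster %s/%s: %s - %s",
				ev.InvolvedObject.Namespace, ev.InvolvedObject.Name, ev.Reason, ev.Message)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {}
}
```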

@mtougeron
Member Author

Sorry for the delays on this. I'm hoping to pick it back up next week.

@mtougeron mtougeron force-pushed the emit-events-on-status-conditions branch 3 times, most recently from 9b7a1c9 to 07dd857 Compare January 30, 2023 00:23
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 30, 2023
@mtougeron
Member Author

Updated details about the events emitted:

kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN   TYPE     REASON                  OBJECT               MESSAGE
5m25s       Normal   Provisioning            cluster/tge-aks      Cluster tge-aks is Provisioning
4m55s       Normal   InfrastructureReady     cluster/tge-aks      Cluster tge-aks InfrastructureReady is now true
21s         Normal   ControlPlaneReady       cluster/tge-aks      Cluster tge-aks ControlPlaneReady is now true
11s         Normal   Provisioned             cluster/tge-aks      Cluster tge-aks is Provisioned
9s          Normal   SuccessfulSetNodeRefs   machinepool/system   [{Kind: Namespace: Name:aks-system-14953984-vmss000000 UID:badae934-1ac8-4e77-b6c0-bfba720798ca APIVersion: ResourceVersion: FieldPath:}]
9s          Normal   SuccessfulSetNodeRefs   machinepool/worker   [{Kind: Namespace: Name:aks-worker-14953984-vmss000000 UID:985726db-3cb3-44bc-b114-4865ebb89767 APIVersion: ResourceVersion: FieldPath:}]
kubectl get events --sort-by=.metadata.creationTimestamp -o json
{
  "apiVersion": "v1",
  "items": [
    {
      "apiVersion": "v1",
      "count": 1,
      "eventTime": null,
      "firstTimestamp": "2023-01-30T00:14:42Z",
      "involvedObject": {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "name": "tge-aks",
        "namespace": "azure",
        "resourceVersion": "52456428",
        "uid": "785d5528-70b5-41f9-960e-400e78778b87"
      },
      "kind": "Event",
      "lastTimestamp": "2023-01-30T00:14:42Z",
      "message": "Cluster tge-aks is Provisioning",
      "metadata": {
        "creationTimestamp": "2023-01-30T00:14:42Z",
        "name": "tge-aks.173eede5b71d5715",
        "namespace": "azure",
        "resourceVersion": "52456429",
        "uid": "674a07a6-7685-4878-b75c-9d870d004240"
      },
      "reason": "Provisioning",
      "reportingComponent": "",
      "reportingInstance": "",
      "source": {
        "component": "cluster-controller"
      },
      "type": "Normal"
    },
    {
      "apiVersion": "v1",
      "count": 1,
      "eventTime": null,
      "firstTimestamp": "2023-01-30T00:15:12Z",
      "involvedObject": {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "name": "tge-aks",
        "namespace": "azure",
        "resourceVersion": "52456745",
        "uid": "785d5528-70b5-41f9-960e-400e78778b87"
      },
      "kind": "Event",
      "lastTimestamp": "2023-01-30T00:15:12Z",
      "message": "Cluster tge-aks InfrastructureReady is now true",
      "metadata": {
        "creationTimestamp": "2023-01-30T00:15:12Z",
        "name": "tge-aks.173eedecbe51234f",
        "namespace": "azure",
        "resourceVersion": "52456747",
        "uid": "60e501e3-1933-4860-bad3-84f7f7c3b5d1"
      },
      "reason": "InfrastructureReady",
      "reportingComponent": "",
      "reportingInstance": "",
      "source": {
        "component": "cluster-controller"
      },
      "type": "Normal"
    },
    {
      "apiVersion": "v1",
      "count": 1,
      "eventTime": null,
      "firstTimestamp": "2023-01-30T00:19:46Z",
      "involvedObject": {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "name": "tge-aks",
        "namespace": "azure",
        "resourceVersion": "52459269",
        "uid": "785d5528-70b5-41f9-960e-400e78778b87"
      },
      "kind": "Event",
      "lastTimestamp": "2023-01-30T00:19:46Z",
      "message": "Cluster tge-aks ControlPlaneReady is now true",
      "metadata": {
        "creationTimestamp": "2023-01-30T00:19:46Z",
        "name": "tge-aks.173eee2c7d2563e8",
        "namespace": "azure",
        "resourceVersion": "52459273",
        "uid": "e973ad54-7588-4e85-8196-3dec9dbcb549"
      },
      "reason": "ControlPlaneReady",
      "reportingComponent": "",
      "reportingInstance": "",
      "source": {
        "component": "cluster-controller"
      },
      "type": "Normal"
    },
    {
      "apiVersion": "v1",
      "count": 2,
      "eventTime": null,
      "firstTimestamp": "2023-01-30T00:19:56Z",
      "involvedObject": {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "name": "tge-aks",
        "namespace": "azure",
        "resourceVersion": "52459274",
        "uid": "785d5528-70b5-41f9-960e-400e78778b87"
      },
      "kind": "Event",
      "lastTimestamp": "2023-01-30T00:19:56Z",
      "message": "Cluster tge-aks is Provisioned",
      "metadata": {
        "creationTimestamp": "2023-01-30T00:19:56Z",
        "name": "tge-aks.173eee2ed0aee6b0",
        "namespace": "azure",
        "resourceVersion": "52459363",
        "uid": "58b42043-e2fc-4c83-910f-202ab6927e3c"
      },
      "reason": "Provisioned",
      "reportingComponent": "",
      "reportingInstance": "",
      "source": {
        "component": "cluster-controller"
      },
      "type": "Normal"
    },
    {
      "apiVersion": "v1",
      "count": 1,
      "eventTime": null,
      "firstTimestamp": "2023-01-30T00:19:58Z",
      "involvedObject": {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "MachinePool",
        "name": "system",
        "namespace": "azure",
        "resourceVersion": "52459377",
        "uid": "8d33a077-67b8-4a2e-8d84-957a9291be5a"
      },
      "kind": "Event",
      "lastTimestamp": "2023-01-30T00:19:58Z",
      "message": "[{Kind: Namespace: Name:aks-system-14953984-vmss000000 UID:badae934-1ac8-4e77-b6c0-bfba720798ca APIVersion: ResourceVersion: FieldPath:}]",
      "metadata": {
        "creationTimestamp": "2023-01-30T00:19:58Z",
        "name": "system.173eee2f5d7c4d4a",
        "namespace": "azure",
        "resourceVersion": "52459395",
        "uid": "379442e4-4365-40f1-bafc-00925f16d5ee"
      },
      "reason": "SuccessfulSetNodeRefs",
      "reportingComponent": "",
      "reportingInstance": "",
      "source": {
        "component": "machinepool-controller"
      },
      "type": "Normal"
    },
    {
      "apiVersion": "v1",
      "count": 1,
      "eventTime": null,
      "firstTimestamp": "2023-01-30T00:19:58Z",
      "involvedObject": {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "MachinePool",
        "name": "worker",
        "namespace": "azure",
        "resourceVersion": "52459386",
        "uid": "8e221181-e9cb-416d-be7a-b5a2a6e9c2bb"
      },
      "kind": "Event",
      "lastTimestamp": "2023-01-30T00:19:58Z",
      "message": "[{Kind: Namespace: Name:aks-worker-14953984-vmss000000 UID:985726db-3cb3-44bc-b114-4865ebb89767 APIVersion: ResourceVersion: FieldPath:}]",
      "metadata": {
        "creationTimestamp": "2023-01-30T00:19:58Z",
        "name": "worker.173eee2f5d7d6658",
        "namespace": "azure",
        "resourceVersion": "52459396",
        "uid": "bb64d6ec-e6e1-4158-a1c1-1e4ca743c85a"
      },
      "reason": "SuccessfulSetNodeRefs",
      "reportingComponent": "",
      "reportingInstance": "",
      "source": {
        "component": "machinepool-controller"
      },
      "type": "Normal"
    }
  ],
  "kind": "List",
  "metadata": {
    "resourceVersion": ""
  }
}

@braghettos

@vincepri @CecileRobertMichon sorry for pinging, but I would really like to have this feature in CAPI - could you review the PR? Thank you so much!

@@ -331,6 +332,7 @@ func (r *Reconciler) reconcileDelete(ctx context.Context, cluster *clusterv1.Clu
}

controllerutil.RemoveFinalizer(cluster, clusterv1.ClusterFinalizer)
r.recorder.Eventf(cluster, corev1.EventTypeNormal, "Deleted", "Cluster %s has been deleted", cluster.Name)
Contributor

Is this a common pattern? Wondering if pods do this too, and/or if there isn't some kind of general "object deleted" event that gets emitted.

Member Author

For a pod, there's the Killing Event that's emitted when it is deleted. Here are the lifecycle events from creating a StatefulSet and then running kubectl delete against the pod.

LAST SEEN   TYPE      REASON                            OBJECT                                            MESSAGE
60s         Normal    Scheduled                         pod/touge-debug-0                                 Successfully assigned default/touge-debug-0 to ip-10-158-59-46.us-west-2.compute.internal
4s          Normal    SuccessfulCreate                  statefulset/touge-debug                           create Pod touge-debug-0 in StatefulSet touge-debug successful
59s         Normal    Pulling                           pod/touge-debug-0                                 Pulling image "mtougeron/touge-debug"
46s         Normal    Pulled                            pod/touge-debug-0                                 Successfully pulled image "mtougeron/touge-debug" in 13.117780542s
46s         Normal    Created                           pod/touge-debug-0                                 Created container ubuntu
46s         Normal    Started                           pod/touge-debug-0                                 Started container ubuntu
3s          Normal    Killing                           pod/touge-debug-0                                 Stopping container ubuntu
4s          Normal    Pulled                            pod/touge-debug-0                                 Container image "mtougeron/touge-debug" already present on machine
4s          Normal    Scheduled                         pod/touge-debug-0                                 Successfully assigned default/touge-debug-0 to ip-10-158-59-46.us-west-2.compute.internal
3s          Normal    Created                           pod/touge-debug-0                                 Created container ubuntu
3s          Normal    Started                           pod/touge-debug-0                                 Started container ubuntu

Contributor

@CecileRobertMichon left a comment

This LGTM overall; just a question about what's out there for delete events, but I'll defer to @fabriziopandini since he was the one who suggested a different approach.

Contributor

@CecileRobertMichon left a comment

/lgtm
/assign @fabriziopandini

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 550b5ac5296d8f9cfc560f319c9faa7de87af55e

@braghettos

I'm writing again to get an update on the review of this PR, @fabriziopandini @vincepri @mtougeron. Do you have any updates or topics to discuss?

@vincepri
Member

/retest

@mtougeron mtougeron force-pushed the emit-events-on-status-conditions branch from 07dd857 to 8aa4797 Compare February 23, 2023 22:32
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 23, 2023
@@ -331,6 +332,7 @@ func (r *Reconciler) reconcileDelete(ctx context.Context, cluster *clusterv1.Clu
}

controllerutil.RemoveFinalizer(cluster, clusterv1.ClusterFinalizer)
r.recorder.Eventf(cluster, corev1.EventTypeNormal, "Deleted", "Cluster %s has been deleted", cluster.Name)
Member

I think this event could be emitted multiple times as it depends on how quickly the garbage collector in kube-controller-manager actually removes the object.

But I assume we don't care about how often this event is emitted?

Member Author

Personally I think that's acceptable. From the CAPI perspective, the action happened, so it's appropriate for us to emit the event.

Member

I think this is acceptable too

@sbueringer
Member

sbueringer commented Mar 8, 2023

> Also note that events do not follow contractual policies; they might change at any time. I'm curious: for the automation workflows described in the issue, have you considered consuming the conditions directly, or metrics/alerting instead?

> @enxebre The informer pattern only works with events, not with object status conditions. There isn't a way to follow changes to the status conditions of a resource that I'm aware of. This change will allow for being informed of events and then inspecting the Cluster resource to determine what action(s) need to be taken based on its status at that time.

From a client-go perspective it should be possible: an informer gives you the old and new objects, and you can compare them to see if conditions have changed. The same is possible by adding predicates to a Reconciler in controller-runtime.

We do something like this e.g. here in predicates.ClusterUpdateInfraReady:

func ClusterUpdateInfraReady(logger logr.Logger) predicate.Funcs {
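
For reference, a simplified paraphrase of what that predicate does (a sketch with logging elided; see the CAPI source for the full version):

```go
// Simplified paraphrase of predicates.ClusterUpdateInfraReady: trigger
// reconciliation only when a Cluster's InfrastructureReady flips from false to true.
package predicates

import (
	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func ClusterUpdateInfraReady(logger logr.Logger) predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldCluster, ok := e.ObjectOld.(*clusterv1.Cluster)
			if !ok {
				return false
			}
			newCluster, ok := e.ObjectNew.(*clusterv1.Cluster)
			if !ok {
				return false
			}
			// Only the false -> true transition is interesting here.
			return !oldCluster.Status.InfrastructureReady && newCluster.Status.InfrastructureReady
		},
	}
}
```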

I think what Alberto wanted to make you aware of is that building on top of events can be very brittle, as we don't consider events part of our API (similar to logs), which means that any automation built on events might break at any time (~ CAPI minor releases).

@mtougeron
Member Author

> I think what Alberto wanted to make you aware of is that building on top of events can be very brittle, as we don't consider events part of our API

I don't consider the events part of the API contract either. IMO, these new events are no different from the ones CAPI already emits for the kubeadm provider; the same limitations and conditions apply.

@richardcase
Member

> I think what Alberto wanted to make you aware of is that building on top of events can be very brittle, as we don't consider events part of our API (similar to logs), which means that any automation built on events might break at any time (~ CAPI minor releases).

Agree with this statement.

What if we started to define exported event names so that any changes to them are picked up by apidiff (like the Killing example mentioned above)? This would at least give some stability for people who want to rely on them and make things less brittle.

@mtougeron
Member Author

> What if we started to define exported event names so that any changes to them are picked up by apidiff

That's a good idea. What about the ones that are based on the ClusterPhase type? I'd hate to redeclare those. Forgive my Go inexperience: is there a way to re-use those in this case?
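
One possible shape, as a hedged sketch (the constant names below are hypothetical, not merged code): export only the reasons that don't correspond to a phase, and convert the existing ClusterPhase constants at the call site rather than redeclaring them.

```go
// Hypothetical exported event reasons (names are illustrative, not merged API).
package v1beta1

const (
	ClusterEventReasonInfrastructureReady = "InfrastructureReady"
	ClusterEventReasonControlPlaneReady   = "ControlPlaneReady"
	ClusterEventReasonDeleted             = "Deleted"
)

// Phase-based reasons need no redeclaration: the existing ClusterPhase
// constants convert directly at the call site, e.g.
//
//	recorder.Eventf(cluster, corev1.EventTypeNormal,
//		string(clusterv1.ClusterPhaseProvisioned),
//		"Cluster %s is Provisioned", cluster.Name)
```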

@sbueringer
Member

My comment was only regarding what folks are building on top of this PR, not a blocker in any way to get this PR merged.

It's probably better to move the discussion as a follow-up to a separate issue to not block this PR.

I think the discussion is fundamentally "do we want to guarantee API stability for events" (and what does API stability for events exactly mean) and not just "how do we get the apidiff to fail" or "how do we define the events in our API package so they are technically a Go API".

@braghettos

Since I have some use cases, I agree with opening a discussion on this topic, but I think it's not only related to CAPI controllers; it applies more generally to any controller implementation.

@richardcase
Member

> I think the discussion is fundamentally "do we want to guarantee API stability for events" (and what does API stability for events exactly mean) and not just "how do we get the apidiff to fail" or "how do we define the events in our API package so they are technically a Go API".

I agree with these statements; I mentioned it originally to start a discussion like this, not just "how do we get the apidiff to fail".

Created discussion #8254 to carry this on.

@sbueringer
Member

Sounds good.

Coming back to this PR
/lgtm
from my side

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 9beb937cd41da71178da75ce49b799b40c326f8b

@richardcase
Member

I also think:

/lgtm

@fabriziopandini
Member

/lgtm
/approve

Also, +1 to opening a discussion thread on event stability; it would be nice to be explicit about this.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 9, 2023
@mtougeron
Member Author

Looks like a flake (read tcp 10.60.164.185:47198->185.199.108.133:443: read: connection reset by peer); going to retest.

/test pull-cluster-api-e2e-main

@k8s-ci-robot k8s-ci-robot merged commit 3529fda into kubernetes-sigs:main Mar 9, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.4 milestone Mar 9, 2023
@mtougeron mtougeron deleted the emit-events-on-status-conditions branch March 9, 2023 22:52