Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apps: record cause of rollout and deployer pods timestamps back to rc #16347

Merged

Conversation

mfojtik
Copy link
Contributor

@mfojtik mfojtik commented Sep 14, 2017

This helps provide better metrics for deployment configs where we can report how much time the rollout took so in theory we can record times for last failed rollouts.

@Kargakis @tnozicka relatively simple change.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 14, 2017
@openshift-merge-robot openshift-merge-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-api-review labels Sep 14, 2017
@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 14, 2017

not sure if time.Now() is the best format or we want just report UNIX timestamps (actually I might be more happy with UNIX timestamps as you can do math with them without conversion)

@smarterclayton
Copy link
Contributor

smarterclayton commented Sep 14, 2017 via email

DeployerPodCreatedAtAnnotation = "openshift.io/deployer-pod.createdAt"
// DeployerPodFinishedAnnotation is an annotation on deployment that records
// the time of when the deploymer pod finished.
DeployerPodCompletedAtAnnotation = "openshift.io/deployer-pod.completedAt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

completed-at maybe. We're kind of inconsistent but no reason not to use dash

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 14, 2017

@smarterclayton if we encode RFC3339 to string, we need to decode, why we hate ourself?

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 14, 2017

@smarterclayton fixed. ptal

// records the time of when the deployer pod for this particular
// deployment was created.
DeployerPodCreatedAtAnnotation = "openshift.io/deployer-pod.created-at"
// DeployerPodFinishedAnnotation is an annotation on deployment that records
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/DeployerPodFinishedAnnotation/DeployerPodCompletedAtAnnotation

@@ -37,6 +37,13 @@ const (
// annotation value is the name of the deployer Pod which will act upon the ReplicationController
// to implement the deployment behavior.
DeploymentPodAnnotation = "openshift.io/deployer-pod.name"
// DeployerPodCreatedAnnotation is an annotation on a deployment that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/DeployerPodCreatedAnnotation/DeployerPodCreatedAtAnnotation

@@ -37,6 +37,13 @@ const (
// annotation value is the name of the deployer Pod which will act upon the ReplicationController
// to implement the deployment behavior.
DeploymentPodAnnotation = "openshift.io/deployer-pod.name"
// DeployerPodCreatedAnnotation is an annotation on a deployment that
// records the time of when the deployer pod for this particular
Copy link
Contributor

@tnozicka tnozicka Sep 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we mention the time format here or is RFC3339 implied in our API in general?

@tnozicka tnozicka self-requested a review September 15, 2017 08:45
@@ -149,6 +150,7 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
return actionableError(fmt.Sprintf("couldn't create deployer pod for %q: %v", deployutil.LabelForDeploymentV1(deployment), err))
}
updatedAnnotations[deployapi.DeploymentPodAnnotation] = deploymentPod.Name
updatedAnnotations[deployapi.DeployerPodCreatedAtAnnotation] = time.Now().Format(time.RFC3339)
Copy link
Contributor

@tnozicka tnozicka Sep 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will get thrown away in case the update at the end fails on 409

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not lost, on next reconcile we set this back

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not if the deployer pod is missing (e.g. deleted manually) I think - you set it back when the pod is present only on L182

@@ -294,10 +297,12 @@ func (c *DeploymentController) nextStatus(pod *v1.Pod, deployment *v1.Replicatio
// Sync the internal replica annotation with the target so that we can
// distinguish deployer updates from other scaling events.
updatedAnnotations[deployapi.DeploymentReplicasAnnotation] = updatedAnnotations[deployapi.DesiredReplicasAnnotation]
updatedAnnotations[deployapi.DeployerPodCompletedAtAnnotation] = time.Now().Format(time.RFC3339)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thrown away at 409, and since the pod will be deleted it won't get reconciled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will be reconciled until we succeed updating the RC status, so eventually this will be set... there might be time skew if the reconciliation takes a long time, but the over deployment status won't be updated as well, so from the user PoC the deployment will seems like still progressing...

Copy link
Contributor

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is all edge driven and won't get reconciled. there will be states when one or both of those annotations are missing causing metrics weirdness and bugs to be filled

@tnozicka
Copy link
Contributor

so the created state will be partially reconciled but overwritten at the same time at L181. (But I don't think this will reconcile a state when someone deletes the pod manually.)

The deletion won't be reconciled I think.

But the bigger issue arising seems to be the time that due to reconciliation might be off because you are using time.Now(). In the reconciliation cycle that might be even added after rate-limiting that might be significantly off.

One thing is that we might choose to not to write that time rather than be off and providing misleading data. The second is that we could use the real times in the informer we use to feed the deployer controller with pod events and use real times from pod filled in by kubelet. Yes, on a relist the informer might loose probably the delete event and we don't get the deleting timestamp but we could reconcile it in deployer controller and set value for it as it should be fairly rare and this way it won't create false positives in metrics.

@tnozicka
Copy link
Contributor

I guess we should have 3 values to fully reason about it:

  • creationTimestamp (API object)
  • startedAt (pod)
  • finishedAt (pod)

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 15, 2017

/retest

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 15, 2017

@tnozicka having three timestamps seems unreasonable to me as we basically just want to measure how long the rollout takes, which is easy with 2 timestamps ;-) I don't think the time between the pod was created and container started matters that much.

@tnozicka
Copy link
Contributor

@mfojtik in that case you need to use startedAt, not creation timestamp

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 15, 2017

i switched this to use creationtimestamp of the pod as created-at annotation and finishedTimestamp of the container that runs the deployer as completed-at.

So there is no time.Now() anymore. I really hate to have special routine in deployer controller that will be updating RC in parallel with the main routine that handle the rollout. That is basically asking for troubles.

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 15, 2017

@tnozicka the deployment process will fail when the container cannot be created right? in that case that RC will basically end up with no timestamps.

@openshift-ci-robot openshift-ci-robot removed the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 15, 2017
@smarterclayton
Copy link
Contributor

smarterclayton commented Sep 15, 2017 via email

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 15, 2017

@smarterclayton yeah, i'm using it already ;-)

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 15, 2017

/retest

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 17, 2017

@tnozicka we are green, but I think i'm going to add some tests for this tomorrow

@mfojtik
Copy link
Contributor Author

mfojtik commented Sep 20, 2017

@smarterclayton @tnozicka is out this week, can you take a second look?

// deployment was started.
DeployerPodStartedAtAnnotation = "openshift.io/deployer-pod.started-at"
// DeployerPodCompletedAtAnnotation is an annotation on deployment that records
// the time in RFC3339 format of when the deploymer pod finished.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/deploymer/deployer/

@@ -149,6 +149,10 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
return actionableError(fmt.Sprintf("couldn't create deployer pod for %q: %v", deployutil.LabelForDeploymentV1(deployment), err))
}
updatedAnnotations[deployapi.DeploymentPodAnnotation] = deploymentPod.Name
updatedAnnotations[deployapi.DeployerPodCreatedAtAnnotation] = deploymentPod.CreationTimestamp.String()
if deploymentPod.Status.StartTime != nil {
updatedAnnotations[deployapi.DeployerPodCreatedAtAnnotation] = deploymentPod.Status.StartTime.String()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the annotation should be DeployerPodStartedAtAnnotation not DeployerPodCreatedAtAnnotation

@@ -176,6 +180,10 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
} else {
// Update to pending or to the appropriate status relative to the existing validated deployer pod.
updatedAnnotations[deployapi.DeploymentPodAnnotation] = deployer.Name
updatedAnnotations[deployapi.DeployerPodCreatedAtAnnotation] = deployer.CreationTimestamp.String()
if deployer.Status.StartTime != nil {
updatedAnnotations[deployapi.DeployerPodCreatedAtAnnotation] = deployer.Status.StartTime.String()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments, lgtm after they are addressed

@tnozicka
Copy link
Contributor

Also let's state that those annotations are filled at best (or at least reasonable) effort and might not be present (or be incomplete) at rare cases which should not be considered bugs.

@openshift-merge-robot openshift-merge-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 28, 2017
@mfojtik
Copy link
Contributor Author

mfojtik commented Oct 2, 2017

@tnozicka comments addressed, thanks!

@@ -149,6 +149,10 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
return actionableError(fmt.Sprintf("couldn't create deployer pod for %q: %v", deployutil.LabelForDeploymentV1(deployment), err))
}
updatedAnnotations[deployapi.DeploymentPodAnnotation] = deploymentPod.Name
updatedAnnotations[deployapi.DeployerPodStartedAtAnnotation] = deploymentPod.CreationTimestamp.String()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one should be DeployerPodCreatedAtAnnotation, it was ok before

@@ -176,6 +180,10 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
} else {
// Update to pending or to the appropriate status relative to the existing validated deployer pod.
updatedAnnotations[deployapi.DeploymentPodAnnotation] = deployer.Name
updatedAnnotations[deployapi.DeployerPodStartedAtAnnotation] = deployer.CreationTimestamp.String()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one should be DeployerPodCreatedAtAnnotation, it was ok before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

brain fart, sorry. fixed.

@tnozicka
Copy link
Contributor

tnozicka commented Oct 2, 2017

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 2, 2017
@openshift-merge-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mfojtik, tnozicka

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@tnozicka
Copy link
Contributor

tnozicka commented Oct 2, 2017

/retest

@mfojtik
Copy link
Contributor Author

mfojtik commented Oct 3, 2017

/kind bug

not having this in 3.7 means we can't build up better metrics in 3.8

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 3, 2017
@openshift-merge-robot
Copy link
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 3, 2017

@mfojtik: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/extended_conformance_install_update fb1679e link /test extended_conformance_install_update

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue (batch tested with PRs 16347, 16652).

@openshift-merge-robot openshift-merge-robot merged commit 813f39e into openshift:master Oct 3, 2017
@mfojtik mfojtik deleted the record-dc-cause-to-rc branch September 5, 2018 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. needs-api-review size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants