
apps: add prometheus metrics for rollout #14796

Merged

Conversation

@mfojtik (Contributor) commented Jun 21, 2017

Add basic Prometheus metrics for deployment config rollouts. The implementation should be consistent with the build metrics: #15440

@mfojtik (Contributor Author) commented Jun 21, 2017

@Kargakis @tnozicka @smarterclayton open for suggestions.

@mfojtik force-pushed the deployment-condition-metrics branch from 04fa119 to 6ce74a3 on June 21, 2017 11:31
@mfojtik (Contributor Author) commented Jun 21, 2017

@smarterclayton the fatal error count on deployer pods indicates that deployer pods went missing or that something terrible happened to them.

prometheus.CounterOpts{
	Subsystem: DeployerControllerSubsystem,
	Name:      "failure_count",
	Help:      "Counter that counts total number of deployer pod errors per error type",
Contributor

Does it make more sense to have a ratio as opposed to an absolute number? Or keep the number of errors that occurred in the last hour, for example?

Contributor Author

@Kargakis I don't think Prometheus has that; you can query it via the Prometheus console
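(An errors-in-the-last-hour view would indeed be a console query, something along the lines of increase(deployer_controller_failure_count[1h]), assuming that exposed metric name; the counter itself stays raw.)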

Contributor

Generally you would count errors separate from total.
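A minimal sketch of that pattern with client_golang, assuming hypothetical metric names (rollouts_total, rollout_errors_total); the ratio and per-hour views then come from PromQL rather than from the instrumented code:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Total number of rollouts observed, regardless of outcome.
	rolloutsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Subsystem: "deployer_controller",
		Name:      "rollouts_total",
		Help:      "Total number of rollouts observed.",
	})
	// Errors counted separately from the total, labeled by error type.
	rolloutErrorsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Subsystem: "deployer_controller",
		Name:      "rollout_errors_total",
		Help:      "Total number of rollout errors, partitioned by error type.",
	}, []string{"type"})
)

func init() {
	prometheus.MustRegister(rolloutsTotal, rolloutErrorsTotal)
}

// The error ratio and "errors in the last hour" are then queries, e.g.:
//   rate(deployer_controller_rollout_errors_total[5m]) / rate(deployer_controller_rollouts_total[5m])
//   increase(deployer_controller_rollout_errors_total[1h])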

prometheus.GaugeOpts{
	Subsystem: DeploymentConfigControllerSubsystem,
	Name:      "set_condition",
	Help:      "Gauge measuring changes to deployment conditions per condition, reason and the status",
Contributor

This metric is pretty vague. I would be interested in the rate at which a deployment flips between True/False for Available, how many deployments are unavailable, and how many deployments have failed over a period of time.

Contributor Author

@Kargakis you can query unavailable deployments via the Prometheus console (just query the condition counter and specify a time range).

Contributor

Yeah, these metrics should be as raw as possible.

@mfojtik (Contributor Author) commented Jun 21, 2017

btw. I think we should always set a reason for the condition; empty reasons look bad in metrics.

const DeploymentConfigControllerSubsystem = "deploymentconfig_controller"

var (
	deploymentConditionCounter = prometheus.NewGaugeVec(
Contributor Author

@smarterclayton not sure if this should be a gauge or a counter (I decrease the number when we remove the condition)

Contributor

Gauge. Counters must be monotonically increasing.
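For reference, a small sketch of the distinction in client_golang (the metric names here are illustrative): a counter only goes up, while a gauge may move both ways, which is what removing a condition requires.

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	failedRollouts   = prometheus.NewCounter(prometheus.CounterOpts{Name: "failed_rollouts_total"})
	activeConditions = prometheus.NewGauge(prometheus.GaugeOpts{Name: "active_conditions"})
)

func observe() {
	failedRollouts.Inc()   // fine: counters are monotonic (Add(-1) would panic)
	activeConditions.Inc() // condition set
	activeConditions.Dec() // condition removed: only a gauge may decrease
}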

deploymentConditionCounter = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Subsystem: DeploymentConfigControllerSubsystem,
		Name:      "set_condition",
Contributor

Is this consistent with other naming of prometheus metrics?

Contributor Author

@smarterclayton it looks OK with the subsystem set properly; it probably should be plural (set_conditions)

Contributor Author

(upstream uses plurals for counters)

status.Conditions = filterOutCondition(status.Conditions, condType)
// Decrease counters for conditions we are going to remove
Contributor

hrm.

Contributor

fiddling with counters like this is dangerous - it's too easy to get them wrong. I would recommend instead simply counting the number of conditions.

Contributor

Or doing this at a higher level. Since this is observational from the cache, you might just want to add a ResourceEventHandler on the cache and count those.
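A minimal sketch of that approach with client-go, assuming an existing shared informer for deployment configs (dcInformer); the handler wiring and metric name are illustrative, not the PR's code:

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/tools/cache"
)

var conditionEvents = prometheus.NewCounterVec(prometheus.CounterOpts{
	Subsystem: "deploymentconfig_controller",
	Name:      "condition_events_total",
	Help:      "Observed deployment config cache events, by event type.",
}, []string{"event"})

func registerMetricsHandler(dcInformer cache.SharedIndexInformer) {
	// Count observationally from the cache instead of incrementing and
	// decrementing gauges inside the condition-update code path.
	dcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { conditionEvents.WithLabelValues("add").Inc() },
		UpdateFunc: func(old, new interface{}) { conditionEvents.WithLabelValues("update").Inc() },
		DeleteFunc: func(obj interface{}) { conditionEvents.WithLabelValues("delete").Inc() },
	})
}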

Contributor Author

or have remove_condition counters as well. I think that would be cleaner.

Contributor Author

I just realized why ResourceEventHandler is a much better solution ;-) thx.

@mfojtik force-pushed the deployment-condition-metrics branch from 27f7b40 to f96564c on June 22, 2017 10:31
@mfojtik (Contributor Author) commented Jun 22, 2017

@mfojtik force-pushed the deployment-condition-metrics branch from f96564c to d458bf8 on June 22, 2017 10:33
@mfojtik (Contributor Author) commented Jun 22, 2017

[test]

@mfojtik force-pushed the deployment-condition-metrics branch 2 times, most recently from 28804b6 to 4d9a804 on June 22, 2017 10:43
@mfojtik changed the title from "WIP: deploy: add prometheus metrics for deployment conditions" to "deploy: add prometheus metrics for deployment conditions" on Jun 22, 2017
for _, n := range new {
	found := false
	for _, o := range old {
		if found = n == o; found {
Contributor

This hurts... Can you simply do

if n == o {
	found = true
	break
}

Contributor Author

sure, fixed.

@0xmichalis (Contributor)

just a nit, LGTM in general

@smarterclayton do we treat metrics as API?

@mfojtik force-pushed the deployment-condition-metrics branch from 4d9a804 to fdbd630 on June 22, 2017 11:30
@mfojtik (Contributor Author) commented Jun 22, 2017

@Kargakis @smarterclayton added a deployer_controller_failed_rollouts metric that counts all failed rollouts ;-) so if something goes wrong during an upgrade, we will know.

@mfojtik (Contributor Author) commented Jun 22, 2017

any objections to merging this now so we can have these for the next upgrade? we can follow up with kube-state-metrics in 3.7

}

func updateFailedRolloutsMetrics(oldRC, newRC *v1.ReplicationController) {
	if oldRC != nil && deployutil.IsFailedDeployment(oldRC) {
Contributor Author

@Kargakis is this ok?

Contributor Author

(this is a counter so it will just go up, which is fine)

Contributor

Yes, it's fine.

@mfojtik (Contributor Author) commented Jun 22, 2017

flake: #14689

[test]

for _, n := range new {
	found := false
	for _, o := range old {
		if n == o {
Contributor

This will fail in the case of a Progressing condition where the only difference between the new and the old condition is the lastTransitionTime, right?

Contributor Author

@Kargakis guess that is fine? it is not a new condition, just a timestamp update

Contributor

If there is a timestamp update, found ends up being false, so you increment the metric here and decrement it below. Also, we don't really use == for comparing structs. I would prefer this to be more explicit about the fields we care about.
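One way to make the comparison explicit, as a sketch; the fields assumed here (Type, Status, Reason) follow the usual condition shape, appsapi stands in for the deployment API package, and LastTransitionTime is deliberately left out so a timestamp refresh does not count as a change:

// conditionChanged compares only the fields the metric cares about;
// LastTransitionTime is ignored so timestamp-only updates are no-ops.
func conditionChanged(a, b appsapi.DeploymentCondition) bool {
	return a.Type != b.Type ||
		a.Status != b.Status ||
		a.Reason != b.Reason
}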

Contributor Author

gotcha, i will update this, thanks

@mfojtik force-pushed the deployment-condition-metrics branch from b3cdbc6 to 1db906c on June 23, 2017 08:44
@mfojtik (Contributor Author) commented Jun 23, 2017

@Kargakis comparison fixed, guess this is ready to go

@mfojtik (Contributor Author) commented Jun 23, 2017

[severity:blocker]

@mfojtik force-pushed the deployment-condition-metrics branch 2 times, most recently from 3084879 to 4e9f491 on June 27, 2017 13:22
@smarterclayton (Contributor)

We shouldn't merge this for 3.6 because we aren't gathering it yet.

@mfojtik changed the title from "deploy: add prometheus metrics for deployment conditions" to "apps: add prometheus metrics for rollout" on Sep 8, 2017
@mfojtik (Contributor Author) commented Sep 8, 2017

@smarterclayton updated, it follows the way the build metrics are done (lister, simple counters)...

}

generation := d.Status.ObservedGeneration
ch <- prometheus.MustNewConstMetric(activeRolloutCountDesc, prometheus.GaugeValue, float64(generation), []string{
Contributor Author

@smarterclayton I'm not sure if generation is the right value here; the builds are using the start timestamp (unix)
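If the builds convention were mirrored, the const metric's value would be the start timestamp rather than the generation; a hedged sketch (activeRolloutCountDesc, ch, and d are taken as given from the surrounding PR code):

// Export the rollout's start time (unix seconds), as the build
// metrics do, instead of d.Status.ObservedGeneration.
ch <- prometheus.MustNewConstMetric(
	activeRolloutCountDesc,
	prometheus.GaugeValue,
	float64(d.CreationTimestamp.Unix()),
	d.Namespace, d.Name, // label values are passed variadically
)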

var available, failed, cancelled float64

for _, d := range result {
	if util.IsTerminatedDeployment(d) {
Contributor

result is DCs but these helpers are for RCs?

Contributor Author

@Kargakis yeah, brainfart... it is all fixed :)

@mfojtik force-pushed the deployment-condition-metrics branch 4 times, most recently from e8c06b5 to 1f771da on September 8, 2017 15:15

// Collect implements the prometheus.Collector interface.
func (c *appsCollector) Collect(ch chan<- prometheus.Metric) {
	result, err := c.lister.List(labels.Everything())
Contributor

Why do we need the extra list here? We already do one inside the main sync loop of the controller. Can't you use those results?

Contributor Author (@mfojtik, Sep 8, 2017)

i guess the Prometheus collector schedules the list; if I pass the main controller loop's rcList there might be a race? or I will have to read-lock it? not sure if it is worth it, since this list is served from the cache imho.

Contributor

Discussed IRL, this is fine.
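For context, a minimal sketch of the collector shape under discussion, listing straight from the cache-backed lister on every scrape; the names approximate the PR's code rather than quote it:

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/apimachinery/pkg/labels"
	kcorelisters "k8s.io/client-go/listers/core/v1"
)

type appsCollector struct {
	lister kcorelisters.ReplicationControllerLister
	desc   *prometheus.Desc
}

// Describe implements the prometheus.Collector interface.
func (c *appsCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

// Collect implements the prometheus.Collector interface. Each scrape
// lists from the shared informer cache, so no locking against the
// controller's own sync loop is needed.
func (c *appsCollector) Collect(ch chan<- prometheus.Metric) {
	result, err := c.lister.List(labels.Everything())
	if err != nil {
		return // sketch: real code would surface the scrape error
	}
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, float64(len(result)))
}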

@mfojtik force-pushed the deployment-condition-metrics branch 4 times, most recently from 96cad7b to 383fe38 on September 8, 2017 16:15
}

// TODO: possible time screw?
durationSeconds := time.Now().Unix() - d.CreationTimestamp.Unix()
Contributor

Isn't the time the deployer pod started running more precise than the creation of the RC for when the new rollout started?

Contributor Author

@Kargakis i think there is some time skew here, but the deployer pod might already be gone and I would need to fetch that pod, which would slow down the collection?
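As an aside, the same computation reads slightly more idiomatically with time.Since, with the same clock-skew caveat:

// Seconds since the RC was created; equivalent to the Unix()
// subtraction above.
durationSeconds := time.Since(d.CreationTimestamp.Time).Seconds()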

@mfojtik (Contributor Author) commented Sep 12, 2017

@mfojtik (Contributor Author) commented Sep 12, 2017

@smarterclayton tests are passing now, PTAL

@mfojtik (Contributor Author) commented Sep 14, 2017

I might use #16347 for started/completed after that lands; it should not block this.

@mfojtik (Contributor Author) commented Sep 17, 2017

@smarterclayton bump

}

// Record duration in seconds for active rollouts
// TODO: possible time screw?
Contributor

time "skew". Time "screw" would probably be very different.

Contributor

classic @mfojtik comment

@smarterclayton (Contributor)

/lgtm

@openshift-ci-robot added the lgtm label on Sep 19, 2017
@openshift-merge-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mfojtik, smarterclayton

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot (Contributor)

Automatic merge from submit-queue (batch tested with PRs 14784, 16418, 16406, 16431, 14796)

@openshift-merge-robot merged commit 1c8041b into openshift:master on Sep 19, 2017
@mfojtik deleted the deployment-condition-metrics branch on September 5, 2018 21:07
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.