
Expose controller-runtime metrics #786

Merged: lilic merged 35 commits into operator-framework:master from lili/metrics-helpers on Jan 25, 2019

Conversation

@lilic (Member) commented Nov 29, 2018

Description of the change:

Bring in the controller-runtime metrics by exposing them and creating a Service object. By default the metrics are served on port 8080, the same as in controller-runtime; we just pass the default value through, and if the user changes it, that change is reflected in the Service creation as well.
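
For illustration, a minimal sketch of how a scaffolded main.go might call the helper introduced by this PR. The ExposeMetricsPort signature is taken from the diff discussed further down; the import path and the port constant are assumptions, not the exact scaffold.

```go
package main

import (
	"context"
	"log"

	// Assumed import path for the metrics helper added in this PR.
	"github.com/operator-framework/operator-sdk/pkg/metrics"
)

// metricsPort is the port the controller-runtime metrics are served on.
// This first draft used 8080; the merged scaffold defaults to 8383.
const metricsPort int32 = 8383

func main() {
	ctx := context.TODO()

	// Create the Service that exposes the controller-runtime metrics endpoint.
	// The returned Service object is ignored here for brevity.
	if _, err := metrics.ExposeMetricsPort(ctx, metricsPort); err != nil {
		log.Printf("failed to expose metrics port: %v", err)
	}

	// ... set up the manager, add controllers, and call mgr.Start() as usual ...
}
```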

Motivation for the change:

As decided offline: since controller-runtime currently uses a global registry, we should use that instead of creating a new one and serving the metrics ourselves.

Closes #222

@openshift-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Nov 29, 2018
@lilic mentioned this pull request on Nov 29, 2018
@lilic (Member, Author) commented Nov 29, 2018

I would personally not merge this until controller-runtime cuts a new release, as merging now would mean all new operators pin against master instead of a release. But, as we agreed, we should open this for review and pin against master for now.

@lilic (Member, Author) commented Nov 29, 2018

Tested this locally; these are example metrics as served on port 8080:

# HELP controller_runtime_reconcile_queue_length Length of reconcile queue per controller
# TYPE controller_runtime_reconcile_queue_length gauge
controller_runtime_reconcile_queue_length{controller="appservice-controller"} 0
# HELP controller_runtime_reconcile_time_seconds Length of time per reconcile per controller
# TYPE controller_runtime_reconcile_time_seconds histogram
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="0.005"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="0.01"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="0.025"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="0.05"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="0.1"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="0.25"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="0.5"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="1"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="2.5"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="5"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="10"} 5
controller_runtime_reconcile_time_seconds_bucket{controller="appservice-controller",le="+Inf"} 5
controller_runtime_reconcile_time_seconds_sum{controller="appservice-controller"} 6.119999999999999e-07
controller_runtime_reconcile_time_seconds_count{controller="appservice-controller"} 5
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 6.0225e-05
go_gc_duration_seconds{quantile="0.25"} 6.1858e-05
go_gc_duration_seconds{quantile="0.5"} 0.000228719
go_gc_duration_seconds{quantile="0.75"} 0.000725184
go_gc_duration_seconds{quantile="1"} 0.003963635
go_gc_duration_seconds_sum 0.005039621
go_gc_duration_seconds_count 5
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 40
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.11.2"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 5.679208e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.3982888e+07
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.445848e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 89562
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0.05740754063652995
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.379776e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 5.679208e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 5.9236352e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 7.118848e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 46716
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.63552e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5435088556178067e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 136278
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 3456
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 94088
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 98304
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 8.23344e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 710944
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 753664
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 753664
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.176012e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 10
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.13
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 8
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.4887296e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.54350885539e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.33582848e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes -1

@estroz (Member) left a comment

LGTM other than 2 nits.

Review threads on pkg/scaffold/cmd.go and pkg/metrics/metrics.go (outdated, resolved)
@lilic force-pushed the lili/metrics-helpers branch 2 times, most recently from 1e27f3f to 7c0f0ce on November 30, 2018 13:23
@shawn-hurley (Member) left a comment

Overall LGTM; just a personal preference around the constants file.

Review thread on pkg/metrics/constants.go (outdated, resolved)
@lilic force-pushed the lili/metrics-helpers branch 3 times, most recently from 59682b1 to 9691354 on December 3, 2018 10:29
@openshift-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Dec 3, 2018
Review threads on pkg/k8sutil/k8sutil.go and pkg/scaffold/cmd.go (outdated, resolved)
@lilic (Member, Author) commented Dec 5, 2018

As agreed we will put this on hold and wait until controller-runtime issues a new release.

@openshift-ci-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Dec 10, 2018
@lilic force-pushed the lili/metrics-helpers branch 2 times, most recently from 8bc16b3 to 93be61a on December 10, 2018 13:35
@lilic force-pushed the lili/metrics-helpers branch 3 times, most recently from 51cb32f to f255a39 on December 19, 2018 13:31
@lilic (Member, Author) commented Jan 21, 2019

@joelanford @hasbro17 Can you please have another look, thanks! Adjusted per your suggestions.

kubeconfig, err := config.GetConfig()
err = createService(mgr, s)

return s, nil
A Member commented:

If the service already exists, we return the one returned by initOperatorService, not the existing one. Just making sure that's what we want to happen.

@lilic (Member, Author) commented:

Actually, there is a missing error check there in general 🤦‍♀️

> Just making sure that's what we want to happen.

Yes, I'm not sure there. Do you think it would be best to request the Service and return that, or to always error out and not return a Service, thereby removing the existence check?

A Member commented:

There is no straightforward answer here, because an operator might be running in a deployment with leader election enabled. I think I've opened up a can of worms.

So there are two related concerns here:

  1. Which pods are selected by the Service's selector? Right now it looks like all pods in the deployment? I'm not totally familiar with how the metrics work, but I'd imagine that we want Prometheus to scrape only the leader's metrics, right?
  2. What's the lifecycle of the Service? Does it live for the duration of the leader pod or of the operator deployment?

I think we need to answer those questions to make sure we get the logic correct in ExposeMetricsPort.

A Contributor commented:

> Which pods are selected by the Service's selector? Right now it looks like all pods in the deployment?

Yes, since the Service uses the same selector as the Deployment, name=<operator-name>, it will expose the ports for all Deployment pods.
And it's alright if we scrape the metrics for all non-leader pods as well. If a leader pod steps down for some reason, it can have different metrics from the new leader (e.g. number of reconciles), which is worth exporting.

> What's the lifecycle of the Service? Does it live for the duration of the leader pod or of the operator deployment?

I think it should live for the duration of the Deployment. Recreating the Service every time there's a new leader doesn't make much sense if it's going to be selecting all replicas. So if a Service does not exist, the leader pod should create it with the ownerRef set to the Deployment that owns it.

> Do you think it would be best to request the Service and return that, or to always error out and not return a Service, thereby removing the existence check?

I think ExposeMetricsPort() should return the actual Service that exists: either it gets created by the pod if it doesn't exist, or the function gets and returns the Service created by a previous pod.

Although it's worth considering whether there are any drawbacks to tying the Service lifecycle to the operator Deployment; I can't think of any right now.

/cc @shawn-hurley
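
For illustration, a minimal sketch of the Service described in this thread: it reuses the Deployment's name=<operator-name> selector so all replicas are scraped, and sets an owner reference to the Deployment so the Service shares its lifetime. The helper name and label key are assumptions, not the SDK's exact implementation.

```go
package metrics

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newMetricsService is a hypothetical helper building the metrics Service.
func newMetricsService(operatorName, namespace string, port int32, owner metav1.OwnerReference) *v1.Service {
	labels := map[string]string{"name": operatorName}
	return &v1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      operatorName,
			Namespace: namespace,
			Labels:    labels,
			// Owned by the Deployment, so the Service lives for the Deployment's lifetime.
			OwnerReferences: []metav1.OwnerReference{owner},
		},
		Spec: v1.ServiceSpec{
			// Same selector as the Deployment: all replicas (leader or not) are exposed.
			Selector: labels,
			Ports: []v1.ServicePort{{
				Name:       "metrics",
				Port:       port,
				TargetPort: intstr.FromInt(int(port)),
			}},
		},
	}
}
```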

A Member commented:

> Yes, since the Service uses the same selector as the Deployment, name=<operator-name>, it will expose the ports for all Deployment pods.
> And it's alright if we scrape the metrics for all non-leader pods as well. If a leader pod steps down for some reason, it can have different metrics from the new leader (e.g. number of reconciles), which is worth exporting.

Will Prometheus scrape each Service endpoint individually, or will it scrape the Service and get round-robined among the endpoints? If the former, that sounds good. If the latter, won't that cause problems for Prometheus (e.g. counters jumping up and down depending on which pod serves the request)?

A Member commented:

I agree with option 1.

@lilic (Member, Author) commented:

Did some testing, and option 1 doesn't work because the metrics only get served once Start is called. I would like to suggest upstream that the metrics be served before that, as they should be independent of Start. I am assuming that we should hold the lock before that, right?

They do handle leader election and serve metrics once there is a lock, but we use our own leader logic, so we can't rely on that, IIUC.

I guess for now, so that we have some metrics, I would suggest we go with option 2 and contribute upstream if they agree on serving metrics independently. SGTY @shawn-hurley @hasbro17 @joelanford?
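
For context, a rough sketch against the controller-runtime API of that era showing why option 1 fails: the manager is configured with a metrics bind address, but the metrics HTTP server is only started from Start(), so nothing is reachable beforehand. The helper name and exact option shape here are assumptions.

```go
package main

import (
	"log"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func runManager(stop <-chan struct{}) {
	cfg, err := config.GetConfig()
	if err != nil {
		log.Fatal(err)
	}

	// The metrics listener address is configured here ...
	mgr, err := manager.New(cfg, manager.Options{
		MetricsBindAddress: ":8080",
	})
	if err != nil {
		log.Fatal(err)
	}

	// ... but the metrics endpoint is only served once Start() runs,
	// which is why metrics are not available before the manager starts.
	if err := mgr.Start(stop); err != nil {
		log.Fatal(err)
	}
}
```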

@lilic (Member, Author) commented:

Summary of the discussion with @shawn-hurley:

  1. Will look into disabling serving of metrics in controller-runtime, so we can serve the metrics they expose in operator-sdk.
  2. Open an issue to discuss serving metrics independently of Start() being called.

A Contributor commented:

Agreed. Option 2 sounds fine until we can work upstream or find a workaround to expose the metrics before getting the lock and calling start.

@lilic (Member, Author) commented:

> Agreed. Option 2 sounds fine until we can work upstream or find a workaround to expose the metrics before getting the lock and calling start.

Okay, I will open an issue to make it work for all pods, not just the leader, once this is merged 👍 Until then we will at least have some metrics.

Three review threads on pkg/metrics/metrics.go (outdated, resolved)
@hasbro17 (Contributor) commented:

@lilic Looks good but just a few more nits.

Can you also please update the CHANGELOG to mention that the SDK scaffold for main.go now exposes the controller-runtime metrics on port 8383 by default?
Since we never exposed metrics after the 0.1.0 refactor, I think this will just go in the Added section.
We'll update that further with a link to more docs later on.

hasbro17 and others added 4 commits January 24, 2019 09:44
Co-Authored-By: LiliC <cosiclili@gmail.com>
Co-Authored-By: LiliC <cosiclili@gmail.com>
Co-Authored-By: LiliC <cosiclili@gmail.com>
@shawn-hurley (Member) left a comment

LGTM

Due to the cache not being started at the time we attempt to query
for the Service, we instead create a new client in a similar way as we
do for leader election.
@lilic (Member, Author) commented Jan 24, 2019

We couldn't get the Service using the manager's client because the cache was not started, so I did the same thing we do when handling the leader lock and created a new client.

@hasbro17 @joelanford Tested locally and added an entry to the CHANGELOG. PTAL again.
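
As a rough sketch of the workaround described above (the helper name and error handling are illustrative, not the exact merged code): build a client directly from the rest config, since the manager's cache-backed client cannot be used before mgr.Start(), then create the Service or fetch the existing one.

```go
package metrics

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

// getOrCreateService is a hypothetical helper mirroring the approach above.
func getOrCreateService(ctx context.Context, desired *v1.Service) (*v1.Service, error) {
	cfg, err := config.GetConfig()
	if err != nil {
		return nil, err
	}

	// client.New talks to the API server directly and does not depend on the
	// manager's cache having been started (same trick as the leader election code).
	c, err := client.New(cfg, client.Options{})
	if err != nil {
		return nil, err
	}

	if err := c.Create(ctx, desired); err != nil {
		if !apierrors.IsAlreadyExists(err) {
			return nil, err
		}
		// The Service already exists: return the existing object so callers
		// always get the Service that is actually in the cluster.
		existing := &v1.Service{}
		key := types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}
		if err := c.Get(ctx, key, existing); err != nil {
			return nil, err
		}
		return existing, nil
	}
	return desired, nil
}
```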


service, err := k8sutil.InitOperatorService()
// ExposeMetricsPort creates a Kubernetes Service to expose the passed metrics port.
func ExposeMetricsPort(ctx context.Context, port int32) (*v1.Service, error) {
A Member commented:

👍 On passing in the context.

Will we want to change this function signature at all if kubernetes-sigs/controller-runtime#273 gets merged? Would we go back to passing in the manager (or maybe the client directly)?

If so, I'm wondering if it would be worth anticipating that now to avoid an API change, or if we should just wait since we don't know exactly how it'll look.

Thoughts?

@lilic (Member, Author) commented:

> If so, I'm wondering if it would be worth anticipating that now to avoid an API change, or if we should just wait since we don't know exactly how it'll look.

Yes, I was thinking about that as well, but right now we have no idea whether that will get merged or what it will look like in the end, so I'm not sure we can fully predict it and plan around not breaking the API. We would also have to change the Leader function's signature if we use the approach from the above PR, so I'm not sure it makes a difference here. So yes, most likely if that gets merged we will break the API, or we just decide to leave it as is; we always have that choice.

A Member commented:

Good point about needing to change the leader election API as well.

In that case, I agree with waiting and breaking the API for both if necessary.

@joelanford (Member) left a comment

LGTM. Just one more question.

@hasbro17 (Contributor) left a comment

LGTM

@lilic merged commit 6b070cd into operator-framework:master on Jan 25, 2019
@lilic deleted the lili/metrics-helpers branch on January 25, 2019 09:15
@stepin commented Jan 25, 2019

Maybe this is the wrong place to ask, but how do I enable metrics? When I just start it in a container (built from the latest sources):

/usr/local/bin/ansible-operator run ansible --watches-file=/opt/ansible/watches.yaml

there is nothing on port 8383:

bash-4.2$ curl http://127.0.0.1:8383/metrics
curl: (7) Failed connect to 127.0.0.1:8383; Connection refused

@shawn-hurley (Member) commented:

@stepin I believe that we need to turn on metrics in the Ansible operator's main file.

Labels: size/L (denotes a PR that changes 100-499 lines, ignoring generated files)

Successfully merging this pull request may close these issues: Prometheus Integration

7 participants