
Add prometheus capture context #71

Merged
merged 3 commits on Aug 24, 2017

Conversation

@yaacov (Member) commented Jul 23, 2017

Description

Add support for reading Prometheus endpoint

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1461969

@yaacov (Member Author) commented Jul 23, 2017

@miq-bot add_label compute/containers

@simon3z @cben @moolitayer @zeari please review

@miq-bot (Member) commented Jul 23, 2017

@yaacov Cannot apply the following label because they are not recognized: compute/containers

@cben (Contributor) left a comment

To add some context on Prometheus capture:
[image: chained "Prometheus" statue, at Minho's University]

PR looks pretty good. Up to you whether my comments on resids belong in this PR or future fixes.

end

def collect_container_metrics
# FIXME: container_name => @target.name is a uniqe id ?
Contributor

I'm pretty sure it isn't unique. @target is a Container model instance, right? Can you add pod_name ?
I think even with pod_name it's not unique, you also need the project (namespace)...
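
For illustration, constraining by pod and namespace might look something like the sketch below; the container_group / container_project associations used here are assumptions for the example, not something this PR defines:

# Sketch only: pod and namespace labels added as suggested above.
# @target.container_group and container_project are assumed associations.
pod       = @target.container_group
namespace = pod.container_project.name
cpu_resid = "sum(container_cpu_usage_seconds_total{" \
            "container_name=\"#{@target.name}\"," \
            "pod_name=\"#{pod.name}\"," \
            "namespace=\"#{namespace}\"," \
            "job=\"kubernetes-nodes\"}) * 1e9"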


def collect_group_metrics
cpu_counters = @target.containers.collect do |c|
cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"#{c.name}\",job=\"kubernetes-nodes\"}) * 1e9"
Contributor

I think you also want pod_name (also in mem_resid). And all of these need some way to constrain by project (namespace).

@moolitayer commented Jul 31, 2017

Take a look at [1]: you should have the aggregation already done for you if you use container_name="POD", pod_name="#{@target.name}". 👍 to adding the namespace label.

[1] prometheus (time series collection and processing server)
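
A sketch of that suggestion for the group (pod) query; the container_project association used for the namespace label is an assumption:

# Sketch of the reviewer's suggestion: query the pod-level series directly
# instead of summing the per-container ones. @target here is the ContainerGroup.
cpu_resid = "sum(container_cpu_usage_seconds_total{" \
            "container_name=\"POD\"," \
            "pod_name=\"#{@target.name}\"," \
            "namespace=\"#{@target.container_project.name}\"," \
            "job=\"kubernetes-nodes\"}) * 1e9"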

mem_resid = "sum(container_memory_usage_bytes{container_name=\"#{c.name}\",job=\"kubernetes-nodes\"})"
fetch_counters_data(mem_resid)
end
process_mem_gauges_data(compute_summation(mem_gauges))
Contributor

Don't the above 2 loops duplicate collect_container_metrics ?
Could we fetch the Containers data only once and then compute sums for ContainerGroups from that? (I'm probably being naive about how manageiq collects metrics ;-))

Member Author

Could we fetch the Containers data only once and then compute sums for ContainerGroups from that?

no :-( we call fetch_counters_data for each container in the pod because we do not have one endpoint with all the pods' data.

👍 We are looking for ways to do this collection better; we (e.g. @simon3z) are planning a big refactor of this code to use new and better endpoints for both the Hawkular and Prometheus collectors.


def fetch_counters_data(resource)
start_sec = (@starts / 1_000) - @interval
end_sec = @ends ? (@ends / 1_000).to_i : Time.now.utc.to_i
Contributor

❤️ names with units.

"query_range",
:query => resource,
:start => start_sec.to_i,
:end => end_sec,
Contributor

is start_sec.to_i but no to_i here deliberate?

Member Author

end_sec already has to_i in line 57

Contributor

ah, right. consider start_sec = (@starts / 1_000).to_i - @interval for symmetry.

Member Author

👍 leaving it at line 57 :-)
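
For reference, the symmetric form suggested above would be (behaviour unchanged, assuming @starts and @ends are in milliseconds as the / 1_000 implies):

start_sec = (@starts / 1_000).to_i - @interval
end_sec   = @ends ? (@ends / 1_000).to_i : Time.now.utc.to_i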

end_sec = @ends ? (@ends / 1_000).to_i : Time.now.utc.to_i

sort_and_normalize(
prometheus_client.get(
Contributor

consider caching the client in the future.

Member Author

👍 I do not completely understand why we do that :-(

This code mirrors the way we do this in the Hawkular collector; checking this behaviour and fixing it needs more research and is out of scope for this PR.
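
A minimal sketch of the caching being suggested, memoizing the client built by the prometheus_client_new call quoted later in this review:

def prometheus_client
  # Reuse one client per capture context instead of rebuilding it for every query.
  @prometheus_client ||= prometheus_client_new(@prometheus_uri,
                                               @prometheus_credentials,
                                               @prometheus_options)
end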

end

def prometheus_credentials
{:token => @ext_management_system.authentication_token(prometheus_endpoint)}
Contributor

I think authentication_token takes an authtype string, not an Endpoint.
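
If that is the case, the fix might look like the sketch below; the "prometheus" authtype string is a guess used only for illustration:

def prometheus_credentials
  # Assumption: authentication_token takes an authtype string, and the
  # Prometheus token is stored under a "prometheus" authtype.
  {:token => @ext_management_system.authentication_token("prometheus")}
end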

end

def prometheus_endpoint
@ext_management_system.connection_configurations.prometheus.try(:endpoint)
Contributor

Can you drop the .try?

@yaacov force-pushed the add-prometheus-capture-context branch 3 times, most recently from 034c7f2 to d447a05 on July 25, 2017 07:53
@simon3z simon3z requested a review from moolitayer July 31, 2017 12:43
@simon3z simon3z self-assigned this Jul 31, 2017
@simon3z simon3z requested a review from zgalor July 31, 2017 12:44
def collect_node_metrics
# prometheus field is in sec, multiply by 1e9, sec to ns
cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"\",id=\"/\",instance=\"#{@target.name}\",job=\"kubernetes-nodes\"}) * 1e9"
process_cpu_counters_rate(fetch_counters_rate(cpu_resid))


@yaacov you plan to switch this to the rate functions right?

Member Author

We plan to rewrite the fetch_[ENTITY]_rate functions... but we have nothing concrete at this moment, so this code has to be correct in case we finally find that we cannot refactor it.
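
For context, a rate()-based variant of the node CPU query alluded to here could look roughly like the sketch below; the [5m] window and the exact label set are illustrative, not part of this PR:

# Hypothetical rate-based query; the window size is chosen only for the example.
cpu_rate_resid = "sum(rate(container_cpu_usage_seconds_total{" \
                 "container_name=\"\",id=\"/\"," \
                 "instance=\"#{@target.name}\",job=\"kubernetes-nodes\"}[5m])) * 1e9"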

Contributor

@yaacov can you add a TODO? Thanks.

Member Author

#TODO: do not forget the refactor PR about using the new endpoints in Hawkular and Prometheus.

process_cpu_counters_rate(fetch_counters_rate(cpu_resid))

# prometheus field is in bytes
mem_resid = "sum(container_memory_usage_bytes{container_name=\"\",id=\"/\",instance=\"#{@target.name}\",job=\"kubernetes-nodes\"})"


container_memory_usage_bytes with these labels is already a scalar and not a vector. Are you always using sum as a guard against vectors?

Member Author

yes


def collect_container_metrics
# FIXME: container_name => @target.name is a uniqe id ?
cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"#{@target.name}\",job=\"kubernetes-nodes\"}) * 1e9"


If I understand your comment, it is not distinct: you get one value for each cpu [1], so I think the result of sum is exactly what you expect here. Please remove the comment if that satisfies you.
[1] prometheus (time series collection and processing server)

cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"#{@target.name}\",job=\"kubernetes-nodes\"}) * 1e9"
process_cpu_counters_rate(fetch_counters_rate(cpu_resid))

mem_resid = "sum(container_memory_usage_bytes{container_name=\"#{@target.name}\",job=\"kubernetes-nodes\"})"


in this case there should be one value per container

prometheus_client_new(@prometheus_uri, @prometheus_credentials, @prometheus_options)
end

def prometheus_client_new(uri, credentials, options)


Please separate out what is relevant to the client (prometheus_client_new, prometheus_options, ...), and generally the things that concern Faraday, from the things that are related to ManageIQ (endpoints, prometheus_client, ...).

The client class can be named api_client. (I put mine in providers/kubernetes/prometheus/alert_buffer_client.rb; it makes sense to me since that stuff isn't strictly container_manager related.)
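
A rough sketch of the split being suggested: a thin API client that only knows about Faraday and the Prometheus HTTP API, kept apart from the ManageIQ endpoint and credential plumbing. The class and file names here are illustrative, not what this PR ends up with:

require "faraday"
require "json"

# Hypothetical api_client: speaks only Faraday and the Prometheus HTTP API,
# and knows nothing about ManageIQ endpoints or connection_configurations.
class PrometheusApiClient
  def initialize(uri, token, verify_ssl: true)
    @token = token
    @conn  = Faraday.new(:url => uri, :ssl => {:verify => verify_ssl})
  end

  def get(path, params = {})
    response = @conn.get("api/v1/#{path}", params) do |req|
      req.headers["Authorization"] = "Bearer #{@token}"
    end
    JSON.parse(response.body)
  end
end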

end

def prometheus_try_connect
prometheus_client.get("query").kind_of?(Hash)
Contributor

For connection validation, I prefer checking the http status return code (== 200) rather than relying on content being a Hash

Member Author

We are using the oauth2 proxy; its default behaviour when authentication fails is to send a login page with return code 200. We cannot use just the return code for authentication.
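
To make that concrete: behind the oauth2 proxy a failed login still comes back as an HTML page with HTTP 200, so validation has to look at the body. A sketch, assuming prometheus_client.get returns the parsed JSON body as a Hash:

def prometheus_try_connect
  # An authentication failure returns the proxy's HTML login page with 200,
  # so only a parsed Prometheus envelope ("status" => "success") counts.
  response = prometheus_client.get("query", :query => "up")
  response.kind_of?(Hash) && response["status"] == "success"
end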

@yaacov force-pushed the add-prometheus-capture-context branch from d447a05 to 00f3e6d on August 3, 2017 14:16
@yaacov (Member Author) commented Aug 3, 2017

@simon3z @moolitayer added the FIXME and TODO comments as agreed in today's meeting

@yaacov force-pushed the add-prometheus-capture-context branch from 00f3e6d to a98199b on August 6, 2017 10:34
@miq-bot (Member) commented Aug 6, 2017

Checked commits yaacov/manageiq-providers-kubernetes@d95eb1d~...a98199b with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
5 files checked, 0 offenses detected
Everything looks fine. 🍪

@yaacov (Member Author) commented Aug 6, 2017

@simon3z @moolitayer hi, I updated the specs to work with the new code from the OpenShift provider [1]; all is green. With @cben's help :-)

[1] #76

@simon3z (Contributor) commented Aug 24, 2017

Given that Prometheus on OpenShift 3.7 is unstable and this PR didn't receive any update for more than 2 weeks (because of that instability), I think that at the moment this represents the current state of the art.
We'll improve this (especially in correctness, handling the TODOs) as soon as we have something more stable to work with.

Merging for now.

@simon3z simon3z merged commit f816dbb into ManageIQ:master Aug 24, 2017
@moolitayer moolitayer added this to the Sprint 68 Ending Sep 4, 2017 milestone Aug 24, 2017
@mdshuai commented Sep 1, 2017

@simon3z As the card said, CFME 4.6 will be integrated with OCP 3.7. Will CFME 4.6 contain this feature?
Is there any doc I can follow to configure and try this feature? Thanks

@yaacov (Member Author) commented Sep 1, 2017

Is there any doc I can follow to configure and try this feature?

@mdshuai hi,

a. I do not know which version will support this feature; you will have to wait for @simon3z's response on that :-)

b. These additional features are merged into the master branch; they allow you to add a container provider that uses Prometheus metrics and view the metrics as an external metrics data source:

Container provider add/edit supports Prometheus:
ManageIQ/manageiq-ui-classic#1501

Ad hoc metrics UI interface supports Prometheus:
ManageIQ/manageiq-ui-classic#1677

c. Caveats, for C&U use:

This PR supports Prometheus's kubernetes-nodes metrics endpoint, which will be removed in OCP 3.7; we plan to start using the cadvisor endpoint as soon as it is available ( @smarterclayton ? )

The standard for deploying OCP 3.6 metrics is Hawkular; I do not know of a supported way to install Prometheus with OCP 3.6.

@mdshuai commented Sep 1, 2017

@yaacov Thanks a lot for your info
