Add prometheus capture context #71
Conversation
@miq-bot add_label compute/containers @simon3z @cben @moolitayer @zeari please review

@yaacov Cannot apply the following label because they are not recognized: compute/containers
end

def collect_container_metrics
  # FIXME: container_name => @target.name is a unique id ?
I'm pretty sure it isn't unique. `@target` is a Container model instance, right? Can you add `pod_name`? I think even with `pod_name` it's not unique, you also need the project (namespace)...
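A more selective query could constrain on all three labels, as the reviewer suggests. This is a sketch of a hypothetical helper (the `pod_name` and `namespace` labels follow cAdvisor/Kubernetes conventions; whether they are present depends on the Prometheus scrape configuration):

```ruby
# Hypothetical helper: build a CPU-usage query constrained by container,
# pod and namespace so the selector identifies one container instance.
def container_cpu_query(container_name, pod_name, namespace)
  labels = {
    "container_name" => container_name,
    "pod_name"       => pod_name,
    "namespace"      => namespace,
    "job"            => "kubernetes-nodes"
  }
  selector = labels.map { |k, v| "#{k}=\"#{v}\"" }.join(",")
  "sum(container_cpu_usage_seconds_total{#{selector}}) * 1e9"
end
```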
def collect_group_metrics
  cpu_counters = @target.containers.collect do |c|
    cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"#{c.name}\",job=\"kubernetes-nodes\"}) * 1e9"
I think you also want `pod_name` (also in `mem_resid`). And all of these need some way to constrain by project (namespace).
    mem_resid = "sum(container_memory_usage_bytes{container_name=\"#{c.name}\",job=\"kubernetes-nodes\"})"
    fetch_counters_data(mem_resid)
  end
  process_mem_gauges_data(compute_summation(mem_gauges))
Don't the above 2 loops duplicate `collect_container_metrics`? Could we fetch the Containers data only once and then compute sums for ContainerGroups from that? (I'm probably being naive about how ManageIQ collects metrics ;-))
> Could we fetch the Containers data only once and then compute sums for ContainerGroups from that?

No :-( We call `fetch_counters_data` for each container in the pod because we do not have one endpoint with all the pods' data.

👍 We are looking for ways to do this collection better; we (e.g. @simon3z) are planning a big refactor of this code to use new and better endpoints for both the Hawkular and Prometheus collectors.
def fetch_counters_data(resource)
  start_sec = (@starts / 1_000) - @interval
  end_sec = @ends ? (@ends / 1_000).to_i : Time.now.utc.to_i
❤️ names with units.
  "query_range",
  :query => resource,
  :start => start_sec.to_i,
  :end => end_sec,
Is `start_sec.to_i` but no `to_i` here deliberate?
`end_sec` already has `to_i` in line 57.
Ah, right. Consider `start_sec = (@starts / 1_000).to_i - @interval` for symmetry.
👍 leaving it at line 57 :-)
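For illustration, the symmetric form suggested above would convert both millisecond endpoints the same way. This is a hypothetical standalone helper, not the PR's actual method:

```ruby
# Sketch: both endpoints are millisecond timestamps converted to whole
# seconds symmetrically; a nil end means "now".
def range_seconds(starts_ms, ends_ms, interval_sec, now_sec = Time.now.utc.to_i)
  start_sec = (starts_ms / 1_000).to_i - interval_sec
  end_sec   = ends_ms ? (ends_ms / 1_000).to_i : now_sec
  [start_sec, end_sec]
end
```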
end_sec = @ends ? (@ends / 1_000).to_i : Time.now.utc.to_i
sort_and_normalize(
prometheus_client.get( |
consider caching the client in the future.
👍 I do not understand completely why we do that :-( This code mirrors the way we do it in the Hawkular collector; checking this behaviour and fixing it needs more research and is out of scope for this PR.
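Caching could be as simple as memoizing the client. The sketch below mirrors the PR's method names, but the class and the stub constructor are hypothetical stand-ins (the real constructor is Faraday-based):

```ruby
# Minimal sketch of memoizing the client: the constructor runs once and
# every later call reuses the same instance.
class MetricsCaptureSketch
  def initialize(uri)
    @prometheus_uri = uri
  end

  def prometheus_client
    @prometheus_client ||= prometheus_client_new(@prometheus_uri)
  end

  private

  # Hypothetical stand-in for the real Faraday-based client constructor.
  def prometheus_client_new(uri)
    {:uri => uri}
  end
end
```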
end
def prometheus_credentials
  {:token => @ext_management_system.authentication_token(prometheus_endpoint)}
I think `authentication_token` takes an `authtype` string, not an Endpoint.
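To illustrate the point, here is a hypothetical stub (not actual ManageIQ code): the lookup key passed to `authentication_token` is an authtype string such as `"prometheus"`, not an Endpoint object.

```ruby
# Hypothetical stub of an ExtManagementSystem's token lookup.
class EmsStub
  def initialize(tokens)
    @tokens = tokens # maps authtype string => token
  end

  def authentication_token(authtype)
    @tokens.fetch(authtype.to_s)
  end
end

# Credentials built from the authtype string, per the review comment.
def prometheus_credentials(ems)
  {:token => ems.authentication_token("prometheus")}
end
```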
end
def prometheus_endpoint
  @ext_management_system.connection_configurations.prometheus.try(:endpoint)
Can you drop the `.try`?
Force-pushed 034c7f2 to d447a05.
def collect_node_metrics
  # prometheus field is in sec, multiply by 1e9, sec to ns
  cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"\",id=\"/\",instance=\"#{@target.name}\",job=\"kubernetes-nodes\"}) * 1e9"
  process_cpu_counters_rate(fetch_counters_rate(cpu_resid))
@yaacov you plan to switch this to the rate functions right?
We plan to rewrite the `fetch_[ENTITY]_rate` functions ... But we have nothing concrete at this moment, so this code has to be correct in case we finally find that we cannot refactor it.
@yaacov can you add a TODO? Thanks.
# TODO: do not forget to do the refactor PR about using the new endpoints in Hawkular and Prometheus.
  process_cpu_counters_rate(fetch_counters_rate(cpu_resid))
  # prometheus field is in bytes
  mem_resid = "sum(container_memory_usage_bytes{container_name=\"\",id=\"/\",instance=\"#{@target.name}\",job=\"kubernetes-nodes\"})"
`container_memory_usage_bytes` with these labels should already be a scalar and not a vector. Are you always using `sum` as a guard against vectors?
yes
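The guard the thread describes can be illustrated in miniature: if the label selector ever matches more than one time series, summing collapses the vector to a single value, so downstream code can keep assuming one sample per timestamp. (The `samples` data below is made up for illustration.)

```ruby
# Two hypothetical series matched by the same selector; sum() semantics
# collapse them to one value, which is what the query-side sum() guards for.
samples = [
  {"metric" => {"id" => "/a"}, "value" => 100.0},
  {"metric" => {"id" => "/b"}, "value" => 50.0}
]
total = samples.sum { |s| s["value"] }
```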
def collect_container_metrics
  # FIXME: container_name => @target.name is a unique id ?
  cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"#{@target.name}\",job=\"kubernetes-nodes\"}) * 1e9"
  cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"#{@target.name}\",job=\"kubernetes-nodes\"}) * 1e9"
  process_cpu_counters_rate(fetch_counters_rate(cpu_resid))

  mem_resid = "sum(container_memory_usage_bytes{container_name=\"#{@target.name}\",job=\"kubernetes-nodes\"})"
in this case there should be one value per container
  prometheus_client_new(@prometheus_uri, @prometheus_credentials, @prometheus_options)
end

def prometheus_client_new(uri, credentials, options)
Please separate out what is relevant for the client (`prometheus_client_new`, `prometheus_options`, ...) and generally things that are relevant to Faraday from things that are related to ManageIQ (`endpoints`, `prometheus_client` ...). The client class can be named `api_client`. (I put mine in `providers/kubernetes/prometheus/alert_buffer_client.rb`. It makes sense to me since that stuff isn't strictly container_manager related.)
end
def prometheus_try_connect
  prometheus_client.get("query").kind_of?(Hash)
For connection validation, I prefer checking the http status return code (== 200) rather than relying on content being a Hash
We are using an OAuth2 proxy; its default behaviour when authentication fails is to send a login page with return code 200. We cannot use just the return code for authentication.
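A stricter check along the lines both reviewers describe would require a 200 status and a body that parses as Prometheus's JSON envelope. This is a sketch with a hypothetical helper; the real code's response object differs:

```ruby
require "json"

# Sketch: valid only when HTTP status is 200 AND the body is the JSON
# envelope Prometheus returns. An OAuth2 proxy's HTML login page would
# fail the JSON check even though it arrives with status 200.
def valid_prometheus_response?(status, body)
  return false unless status == 200
  parsed = JSON.parse(body)
  parsed.kind_of?(Hash) && parsed["status"] == "success"
rescue JSON::ParserError
  false
end
```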
Force-pushed d447a05 to 00f3e6d.
@simon3z @moolitayer added the FIXME and TODO comments as agreed in today's meeting.
Force-pushed 00f3e6d to a98199b.
Checked commits yaacov/manageiq-providers-kubernetes@d95eb1d~...a98199b with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0.
@simon3z @moolitayer hi, updated the specs to work with the new code from the openshift provider [1]; all is green. With @cben's help :-) [1] #76
Given that Prometheus on OpenShift 3.7 is unstable and this PR didn't receive any update for more than 2 weeks (because of that instability), I think that ATM this represents the current state of the art. Merging for now.
@simon3z As the card said, CFME 4.6 will be integrated with OCP 3.7. Will CFME 4.6 contain this feature?
@mdshuai hi,
a. I do not know about the version that will support this feature; you will have to wait for @simon3z's response on that :-)
b. These additional features are merged into the master branch; they allow you to add a container provider that uses Prometheus metrics and view the metrics as an external metrics data source:
- Container provider add/edit support for Prometheus
- Ad hoc metrics UI interface support for Prometheus
c. Caveats, for C&U use: this PR supports Prometheus's ... The standard for deploying OCP 3.6 metrics is Hawkular; I do not know of a supported way to install Prometheus with OCP 3.6.
@yaacov Really thanks for your info |
Description
Add support for reading Prometheus endpoint
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1461969