
Add prometheus capture context #71

Merged
merged 3 commits on Aug 24, 2017

Conversation

@yaacov (Member) commented Jul 23, 2017

Description

Add support for reading Prometheus endpoint

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1461969

@yaacov (Member Author) commented Jul 23, 2017

@miq-bot add_label compute/containers

@simon3z @cben @moolitayer @zeari please review

@miq-bot (Member) commented Jul 23, 2017

@yaacov Cannot apply the following label because they are not recognized: compute/containers

@cben (Contributor) left a comment

To add some context on Prometheus capture:
[image: chained "Prometheus" statue, at Minho's University]

PR looks pretty good. Up to you whether my comments on resids belong in this PR or future fixes.

end

def collect_container_metrics
# FIXME: container_name => @target.name is a uniqe id ?
Contributor

I'm pretty sure it isn't unique. @target is a Container model instance, right? Can you add pod_name ?
I think even with pod_name it's not unique, you also need the project (namespace)...
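
For illustration, constraining by pod and namespace might look something like the sketch below; the container_group / container_project associations used here are assumptions for the example, not something this PR defines:

# Sketch only: pod and namespace labels added as suggested above.
# @target.container_group and container_project are assumed associations.
pod       = @target.container_group
namespace = pod.container_project.name
cpu_resid = "sum(container_cpu_usage_seconds_total{" \
            "container_name=\"#{@target.name}\"," \
            "pod_name=\"#{pod.name}\"," \
            "namespace=\"#{namespace}\"," \
            "job=\"kubernetes-nodes\"}) * 1e9"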


def collect_group_metrics
cpu_counters = @target.containers.collect do |c|
cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"#{c.name}\",job=\"kubernetes-nodes\"}) * 1e9"
Contributor

I think you also want pod_name (also in mem_resid). And all of these need some way to constrain by project (namespace).

@moolitayer commented Jul 31, 2017

Take a look at [1]: you should have the aggregation already done for you if you use container_name="POD", pod_name="#{@target.name}". 👍 to adding the namespace label.

[1] prometheus (time series collection and processing server)
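
A sketch of that suggestion for the group (pod) query; the container_project association used for the namespace label is an assumption:

# Sketch of the reviewer's suggestion: query the pod-level series directly
# instead of summing the per-container ones. @target here is the ContainerGroup.
cpu_resid = "sum(container_cpu_usage_seconds_total{" \
            "container_name=\"POD\"," \
            "pod_name=\"#{@target.name}\"," \
            "namespace=\"#{@target.container_project.name}\"," \
            "job=\"kubernetes-nodes\"}) * 1e9"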

mem_resid = "sum(container_memory_usage_bytes{container_name=\"#{c.name}\",job=\"kubernetes-nodes\"})"
fetch_counters_data(mem_resid)
end
process_mem_gauges_data(compute_summation(mem_gauges))
Contributor

Don't the above 2 loops duplicate collect_container_metrics ?
Could we fetch the Containers data only once and then compute sums for ContainerGroups from that? (I'm probably being naive about how manageiq collects metrics ;-))

Member Author

Could we fetch the Containers data only once and then compute sums for ContainerGroups from that?

no :-( we call fetch_counters_data for each container in the pod because we do not have one endpoint with all the pods' data.

👍 We are looking for ways to do this collection better; we (e.g. @simon3z) are planning a big refactor of this code to use new and better endpoints for both the Hawkular and Prometheus collectors.


def fetch_counters_data(resource)
start_sec = (@starts / 1_000) - @interval
end_sec = @ends ? (@ends / 1_000).to_i : Time.now.utc.to_i
Contributor

❤️ names with units.

"query_range",
:query => resource,
:start => start_sec.to_i,
:end => end_sec,
Contributor

is start_sec.to_i but no to_i here deliberate?

Member Author

end_sec already has to_i in line 57

Contributor

ah, right. consider start_sec = (@starts / 1_000).to_i - @interval for symmetry.

Member Author

👍 leaving it at line 57 :-)
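
For reference, the symmetric form suggested above would be (behaviour unchanged, assuming @starts and @ends are in milliseconds as the / 1_000 implies):

start_sec = (@starts / 1_000).to_i - @interval
end_sec   = @ends ? (@ends / 1_000).to_i : Time.now.utc.to_i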

end_sec = @ends ? (@ends / 1_000).to_i : Time.now.utc.to_i

sort_and_normalize(
prometheus_client.get(
Contributor

consider caching the client in the future.

Member Author

👍 I do not completely understand why we do that :-(

This code mirrors the way we do this in the Hawkular collector; checking this behaviour and fixing it needs more research and is out of scope for this PR.
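
A minimal sketch of the caching being suggested, memoizing the client built by the prometheus_client_new call quoted later in this review:

def prometheus_client
  # Reuse one client per capture context instead of rebuilding it for every query.
  @prometheus_client ||= prometheus_client_new(@prometheus_uri,
                                               @prometheus_credentials,
                                               @prometheus_options)
end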

end

def prometheus_credentials
{:token => @ext_management_system.authentication_token(prometheus_endpoint)}
Contributor

I think authentication_token takes an authtype string, not an Endpoint.
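
If that is the case, the fix might look like the sketch below; the "prometheus" authtype string is a guess used only for illustration:

def prometheus_credentials
  # Assumption: authentication_token takes an authtype string, and the
  # Prometheus token is stored under a "prometheus" authtype.
  {:token => @ext_management_system.authentication_token("prometheus")}
end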

end

def prometheus_endpoint
@ext_management_system.connection_configurations.prometheus.try(:endpoint)
Contributor

Can you drop the .try?

@yaacov force-pushed the add-prometheus-capture-context branch 3 times, most recently from 034c7f2 to d447a05 on July 25, 2017 07:53
@simon3z simon3z requested a review from moolitayer July 31, 2017 12:43
@simon3z simon3z self-assigned this Jul 31, 2017
@simon3z simon3z requested a review from zgalor July 31, 2017 12:44
def collect_node_metrics
# prometheus field is in sec, multiply by 1e9, sec to ns
cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"\",id=\"/\",instance=\"#{@target.name}\",job=\"kubernetes-nodes\"}) * 1e9"
process_cpu_counters_rate(fetch_counters_rate(cpu_resid))


@yaacov you plan to switch this to the rate functions right?

Member Author

We plan to rewrite the fetch_[ENTITY]_rate functions... but we have nothing concrete at this moment, so this code has to be correct in case we finally find that we cannot refactor it.
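
For context, a rate()-based variant of the node CPU query alluded to here could look roughly like the sketch below; the [5m] window and the exact label set are illustrative, not part of this PR:

# Hypothetical rate-based query; the window size is chosen only for the example.
cpu_rate_resid = "sum(rate(container_cpu_usage_seconds_total{" \
                 "container_name=\"\",id=\"/\"," \
                 "instance=\"#{@target.name}\",job=\"kubernetes-nodes\"}[5m])) * 1e9"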

Contributor

@yaacov can you add a TODO? Thanks.

Member Author

#TODO: do not forget the refactor PR about using the new endpoints in Hawkular and Prometheus.

process_cpu_counters_rate(fetch_counters_rate(cpu_resid))

# prometheus field is in bytes
mem_resid = "sum(container_memory_usage_bytes{container_name=\"\",id=\"/\",instance=\"#{@target.name}\",job=\"kubernetes-nodes\"})"


container_memory_usage_bytes with these labels is already a scalar and not a vector. Are you always using sum as a guard against vectors?

Member Author

yes


def collect_container_metrics
# FIXME: container_name => @target.name is a uniqe id ?
cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"#{@target.name}\",job=\"kubernetes-nodes\"}) * 1e9"


If I understand your comment, it is not distinct: you get one value for each cpu [1], so I think the result of sum is exactly what you expect here. Please remove the comment if that satisfies you.
[1] prometheus (time series collection and processing server)

cpu_resid = "sum(container_cpu_usage_seconds_total{container_name=\"#{@target.name}\",job=\"kubernetes-nodes\"}) * 1e9"
process_cpu_counters_rate(fetch_counters_rate(cpu_resid))

mem_resid = "sum(container_memory_usage_bytes{container_name=\"#{@target.name}\",job=\"kubernetes-nodes\"})"


in this case there should be one value per container

prometheus_client_new(@prometheus_uri, @prometheus_credentials, @prometheus_options)
end

def prometheus_client_new(uri, credentials, options)


Please separate out what is relevant to the client (prometheus_client_new, prometheus_options, ...), and generally the things that concern Faraday, from the things that are related to ManageIQ (endpoints, prometheus_client, ...).

The client class can be named api_client. (I put mine in providers/kubernetes/prometheus/alert_buffer_client.rb; it makes sense to me since that stuff isn't strictly container_manager related.)
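
A rough sketch of the split being suggested: a thin API client that only knows about Faraday and the Prometheus HTTP API, kept apart from the ManageIQ endpoint and credential plumbing. The class and file names here are illustrative, not what this PR ends up with:

require "faraday"
require "json"

# Hypothetical api_client: speaks only Faraday and the Prometheus HTTP API,
# and knows nothing about ManageIQ endpoints or connection_configurations.
class PrometheusApiClient
  def initialize(uri, token, verify_ssl: true)
    @token = token
    @conn  = Faraday.new(:url => uri, :ssl => {:verify => verify_ssl})
  end

  def get(path, params = {})
    response = @conn.get("api/v1/#{path}", params) do |req|
      req.headers["Authorization"] = "Bearer #{@token}"
    end
    JSON.parse(response.body)
  end
end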

end

def prometheus_try_connect
prometheus_client.get("query").kind_of?(Hash)
Contributor

For connection validation, I prefer checking the http status return code (== 200) rather than relying on content being a Hash

Member Author

We are using the oauth2 proxy; its default behaviour when authentication fails is to send a login page with return code 200. We cannot use just the return code for authentication.
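
To make that concrete: behind the oauth2 proxy a failed login still comes back as an HTML page with HTTP 200, so validation has to look at the body. A sketch, assuming prometheus_client.get returns the parsed JSON body as a Hash:

def prometheus_try_connect
  # An authentication failure returns the proxy's HTML login page with 200,
  # so only a parsed Prometheus envelope ("status" => "success") counts.
  response = prometheus_client.get("query", :query => "up")
  response.kind_of?(Hash) && response["status"] == "success"
end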

@yaacov force-pushed the add-prometheus-capture-context branch from d447a05 to 00f3e6d on August 3, 2017 14:16
@yaacov (Member Author) commented Aug 3, 2017

@simon3z @moolitayer added the FIXME and TODO comments as agreed in today's meeting

@yaacov force-pushed the add-prometheus-capture-context branch from 00f3e6d to a98199b on August 6, 2017 10:34
@miq-bot (Member) commented Aug 6, 2017

Checked commits yaacov/manageiq-providers-kubernetes@d95eb1d~...a98199b with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
5 files checked, 0 offenses detected
Everything looks fine. 🍪

@yaacov (Member Author) commented Aug 6, 2017

@simon3z @moolitayer hi, I updated the specs to work with the new code from the OpenShift provider [1]; all is green. With @cben's help :-)

[1] #76

@simon3z (Contributor) commented Aug 24, 2017

Given that Prometheus on OpenShift 3.7 is unstable and this PR didn't receive any update for more than 2 weeks (because of that instability), I think that at the moment this represents the current state of the art.
We'll improve this (especially in correctness, handling the TODOs) as soon as we have something more stable to work with.

Merging for now.

@simon3z simon3z merged commit f816dbb into ManageIQ:master Aug 24, 2017
@moolitayer moolitayer added this to the Sprint 68 Ending Sep 4, 2017 milestone Aug 24, 2017
@mdshuai commented Sep 1, 2017

@simon3z As the card said, CFME 4.6 will be integrated with OCP 3.7. Will CFME 4.6 contain this feature?
Is there any doc I can follow to configure and try this feature? Thanks

@yaacov (Member Author) commented Sep 1, 2017

Is there any doc I can follow to configure and try this feature?

@mdshuai hi,

a. I do not know which version will support this feature; you will have to wait for @simon3z's response on that :-)

b. These additional features are merged into the master branch; they allow you to add a container provider that uses Prometheus metrics and view the metrics as an external metrics data source:

Container provider add/edit supports Prometheus:
ManageIQ/manageiq-ui-classic#1501

Ad hoc metrics UI interface supports Prometheus:
ManageIQ/manageiq-ui-classic#1677

c. Caveats, for C&U use:

This PR supports Prometheus's kubernetes-nodes metrics endpoint, which will be removed in OCP 3.7; we plan to start using the cadvisor endpoint as soon as it is available ( @smarterclayton ? )

The standard for deploying OCP 3.6 metrics is Hawkular; I do not know of a supported way to install Prometheus with OCP 3.6.

@mdshuai commented Sep 1, 2017

@yaacov Thanks a lot for your info
