Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check metrics authentication validity before perf_capture_queue #493

Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,17 @@ def log_severity
}
}

def capture_ems_targets(_options = {})
agrare marked this conversation as resolved.
Show resolved Hide resolved
begin
verify_metrics_connection!(ems)
rescue TargetValidationError, TargetValidationWarning => e
_log.send(e.log_severity, e.message)
return []
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all providers implement similar logic? Can we extract this so other providers will also bypass metrics capture with invalid auth?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a "future" PR question... I'm fine with fixing kubernetes in this way now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a little tricky because not all providers have different endpoints and/or authentications so what you need to check varies.

What I've considered doing is consolidating this in the supports :metrics {} block in ext_management_system so that the logic could be limited to the provider but that is going to be more of a cross-provider change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updating supports to handle all the actual logic totally got me in a few PRs. they are still outstanding :( ✨


super
end

def prometheus_capture_context(target, start_time, end_time)
PrometheusCaptureContext.new(target, start_time, end_time, INTERVAL)
end
Expand All @@ -57,11 +68,28 @@ def metrics_connection(ems)
ems.connection_configurations.prometheus
end

def capture_context(_ems, target, start_time, end_time)
def metrics_connection_valid?(ems)
metrics_connection(ems)&.authentication&.status == "Valid"
end

def verify_metrics_connection!(ems)
t = target || ems
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kbrock do we have a general way of handling this? When called from perf_capture_timer as ems.perf_capture_object.perf_capture_all_queue target is nil and only ems is set

Copy link
Member

@kbrock kbrock Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like how @target is set behind the scenes.
perf_capture_gap, perf_capture_all_queue, perf_capture_realtime_queue ( /via ui)

if you are running capture for a whole ems, target is null.
If you are running capture for a single object, target is single object.
If you are running capture for a batch/vmware, target/targets is an array

I think I'd just use ems

Note: you need to handle multiple targets case. (I would like everyone to have targets)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to ensure that prometheus exists in here? (aka capture_context)

That way verify does all the raises here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd just use ems

Well when we use this below it is on a single object so this has to work with target or ems

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to ensure that prometheus exists in here? (aka capture_context)

Yeah I was thinking that but we use the return of capture_context and returning something from a verify_* method seems weird.

Also building a prometheus_capture_context object seems like overkill when it isn't used in the perf_capture_timer case here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably could call this from capture_context and raise if that is nil in that method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I pushed a second commit which moves some things around, LMK if you like it better or worse and it is easy to revert out

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know, it doesn't look like it needs to know about target at all beside building a couple of printf strings. Looks good for now.

TANGENT:

Yea, I guess the code is currently coded to reference a single target.

I have on my short list to convert this over to targets.
For some reason, I can't find the initialize method on PrometheusCaptureContext.
But from what I can tell, that code assumes a single target when building the labels.

Not sure how hard it would be to handle a list of targets over there.

target_name = "#{t.class.name.demodulize}(#{t.id})"
raise TargetValidationError, "no provider for #{target_name}" if ems.nil?

raise TargetValidationWarning, "no metrics endpoint found for #{target_name}" if metrics_connection(ems).nil?
raise TargetValidationWarning, "metrics authentication isn't valid for #{target_name}" unless metrics_connection_valid?(ems)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check if the ems is paused?

Copy link
Member Author

@agrare agrare Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should do that up in core perf_capture_timer since that is going to be the same for all providers, good call though. I'm a little wary of that just because it feels like every single queue method would have to implement this check but I don't know that there's a more central place to handle it and this is a real-world problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end

def build_capture_context!(ems, target, start_time, end_time)
verify_metrics_connection!(ems)
# make start_time align to minutes
start_time = start_time.beginning_of_minute

prometheus_capture_context(target, start_time, end_time)
context = prometheus_capture_context(target, start_time, end_time)
raise TargetValidationWarning, "no metrics endpoint found for #{target_name}" if context.nil?

context
end

def perf_collect_metrics(interval_name, start_time = nil, end_time = nil)
Expand All @@ -73,15 +101,7 @@ def perf_collect_metrics(interval_name, start_time = nil, end_time = nil)
"[#{start_time}] [#{end_time}]")

begin
raise TargetValidationError, "no provider for #{target_name}" if ems.nil?

connection = metrics_connection(ems)
raise TargetValidationWarning, "no metrics endpoint found for #{target_name}" if connection.nil?
raise TargetValidationWarning, "metrics authentication isn't valid for #{target_name}" unless connection.authentication&.status == "Valid"

context = capture_context(ems, target, start_time, end_time)

raise TargetValidationWarning, "no metrics endpoint found for #{target_name}" if context.nil?
context = build_capture_context!(ems, target, start_time, end_time)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea. this is nice

rescue TargetValidationError, TargetValidationWarning => e
_log.send(e.log_severity, "[#{target_name}] #{e.message}")
ems.try(:update,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -45,10 +45,10 @@
end
end

context "#capture_context" do
context "#build_capture_context!" do
it "detect prometheus metrics provider" do
metric_capture = described_class.new(@node)
context = metric_capture.capture_context(
context = metric_capture.build_capture_context!(
@ems_kubernetes,
@node,
5.minutes.ago,
Expand All @@ -59,6 +59,33 @@
end
end

context "#perf_capture_all_queue" do
it "returns the objects" do
expect(@ems_kubernetes.perf_capture_object.perf_capture_all_queue).to include("Container" => [@container], "ContainerNode" => [@node])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO why only @container and @node and not @group

end

context "with a missing metrics endpoint" do
before do
@ems_kubernetes.endpoints.find_by(:role => "prometheus").destroy
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE I don't love how this spec sets up these ivars since it makes it difficult to override in sub-contexts. I worked around it but I think I should refactor to use let

end

it "returns no objects" do
expect(@ems_kubernetes.perf_capture_object.perf_capture_all_queue).to be_empty
end
end

context "with invalid authentication on the metrics endpoint" do
before do
@ems_kubernetes.authentications.find_by(:authtype => "prometheus").update!(:status => "Error")
@ems_kubernetes.reload
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE a good example of why the before {} setting up ivars isn't great

end

it "returns no objects" do
expect(@ems_kubernetes.perf_capture_object.perf_capture_all_queue).to be_empty
end
end
end

context "#perf_collect_metrics" do
it "fails when no ems is defined" do
@node.ext_management_system = nil
Expand Down