Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check metrics authentication validity before perf_capture_queue #493

Conversation

agrare
Copy link
Member

@agrare agrare commented Jun 12, 2023

Check that the EMS's metrics authentication is valid before queueing perf_capture calls, potentially flooding the queue with calls that are going to fail later.

requires:

@agrare agrare added the bug label Jun 12, 2023
@agrare agrare requested review from kbrock and jrafanie June 12, 2023 18:34
@agrare agrare requested review from cben and Fryguy as code owners June 12, 2023 18:34
@jrafanie
Copy link
Member

Nice

@agrare agrare force-pushed the check_metrics_connection_validity_before_queueing_perf_captures branch 4 times, most recently from 5d3c22d to d70847b Compare June 12, 2023 19:45
Check that the EMS's metrics authentication is valid before queueing
perf_capture
@agrare agrare force-pushed the check_metrics_connection_validity_before_queueing_perf_captures branch from d70847b to be53bdc Compare June 12, 2023 19:49
end

def verify_metrics_connection!(ems)
t = target || ems
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kbrock do we have a general way of handling this? When called from perf_capture_timer as ems.perf_capture_object.perf_capture_all_queue target is nil and only ems is set

Copy link
Member

@kbrock kbrock Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like how @target is set behind the scenes.
perf_capture_gap, perf_capture_all_queue, perf_capture_realtime_queue ( /via ui)

if you are running capture for a whole ems, target is null.
If you are running capture for a single object, target is single object.
If you are running capture for a batch/vmware, target/targets is an array

I think I'd just use ems

Note: you need to handle multiple targets case. (I would like everyone to have targets)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to ensure that prometheus exists in here? (aka capture_context)

That way verify does all the raises here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd just use ems

Well when we use this below it is on a single object so this has to work with target or ems

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to ensure that prometheus exists in here? (aka capture_context)

Yeah I was thinking that but we use the return of capture_context and returning something from a verify_* method seems weird.

Also building a prometheus_capture_context object seems like overkill when it isn't used in the perf_capture_timer case here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably could call this from capture_context and raise if that is nil in that method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I pushed a second commit which moves some things around, LMK if you like it better or worse and it is easy to revert out

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know, it doesn't look like it needs to know about target at all beside building a couple of printf strings. Looks good for now.

TANGENT:

Yea, I guess the code is currently coded to reference a single target.

I have on my short list to convert this over to targets.
For some reason, I can't find the initialize method on PrometheusCaptureContext.
But from what I can tell, that code assumes a single target when building the labels.

Not sure how hard it would be to handle a list of targets over there.

@@ -59,6 +59,33 @@
end
end

context "#perf_capture_all_queue" do
it "returns the objects" do
expect(@ems_kubernetes.perf_capture_object.perf_capture_all_queue).to include("Container" => [@container], "ContainerNode" => [@node])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO why only @container and @node and not @group


context "with a missing metrics endpoint" do
before do
@ems_kubernetes.endpoints.find_by(:role => "prometheus").destroy
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE I don't love how this spec sets up these ivars since it makes it difficult to override in sub-contexts. I worked around it but I think I should refactor to use let

context "with invalid authentication on the metrics endpoint" do
before do
@ems_kubernetes.authentications.find_by(:authtype => "prometheus").update!(:status => "Error")
@ems_kubernetes.reload
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE a good example of why the before {} setting up ivars isn't great

context = capture_context(ems, target, start_time, end_time)

raise TargetValidationWarning, "no metrics endpoint found for #{target_name}" if context.nil?
context = build_capture_context!(ems, target, start_time, end_time)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea. this is nice

Copy link
Member

@kbrock kbrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@agrare agrare force-pushed the check_metrics_connection_validity_before_queueing_perf_captures branch from 32e316c to b8a12a9 Compare June 13, 2023 14:10
@agrare agrare force-pushed the check_metrics_connection_validity_before_queueing_perf_captures branch from 0aac8bf to 0c258a3 Compare June 13, 2023 14:34
@miq-bot
Copy link
Member

miq-bot commented Jun 13, 2023

Checked commits agrare/manageiq-providers-kubernetes@be53bdc~...0c258a3 with ruby 2.6.10, rubocop 1.28.2, haml-lint 0.35.0, and yamllint
4 files checked, 1 offense detected

spec/factories/ext_management_system.rb

@agrare
Copy link
Member Author

agrare commented Jun 13, 2023

@kbrock can you re-review? I refactored the specs to use let() and pulled out target_name into a method to keep the logic in one place.

rescue TargetValidationError, TargetValidationWarning => e
_log.send(e.log_severity, e.message)
return []
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all providers implement similar logic? Can we extract this so other providers will also bypass metrics capture with invalid auth?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a "future" PR question... I'm fine with fixing kubernetes in this way now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a little tricky because not all providers have different endpoints and/or authentications so what you need to check varies.

What I've considered doing is consolidating this in the supports :metrics {} block in ext_management_system so that the logic could be limited to the provider but that is going to be more of a cross-provider change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updating supports to handle all the actual logic totally got me in a few PRs. they are still outstanding :( ✨

raise TargetValidationError, "no provider for #{target_name}" if ems.nil?

raise TargetValidationWarning, "no metrics endpoint found for #{target_name}" if metrics_connection(ems).nil?
raise TargetValidationWarning, "metrics authentication isn't valid for #{target_name}" unless metrics_connection_valid?(ems)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check if the ems is paused?

Copy link
Member Author

@agrare agrare Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should do that up in core perf_capture_timer since that is going to be the same for all providers, good call though. I'm a little wary of that just because it feels like every single queue method would have to implement this check but I don't know that there's a more central place to handle it and this is a real-world problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +7 to +9
FactoryBot.create(:kubernetes_node, :name => 'node', :ext_management_system => ems, :ems_ref => 'target').tap do |node|
node.computer_system.hardware = FactoryBot.create(:hardware, :cpu_total_cores => 2, :memory_mb => 2_048)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets keep. But in the future if we merge ManageIQ/manageiq#22565 this could be:

Suggested change
FactoryBot.create(:kubernetes_node, :name => 'node', :ext_management_system => ems, :ems_ref => 'target').tap do |node|
node.computer_system.hardware = FactoryBot.create(:hardware, :cpu_total_cores => 2, :memory_mb => 2_048)
end
FactoryBot.create(:kubernetes_node, :name => 'node', :ext_management_system => ems, :ems_ref => 'target', :hardware => FactoryBot.create(:hardware, :cpu_total_cores => 2, :memory_mb => 2_048))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 yeah I wanted to keep this dependent-less for simpler backport but that's a much better way to do it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. keep. reduce dependencies

Copy link
Member

@kbrock kbrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these updates look good.

@kbrock kbrock added the metrics label Jun 13, 2023
@kbrock kbrock self-assigned this Jun 13, 2023
@kbrock kbrock merged commit ba04739 into ManageIQ:master Jun 13, 2023
@agrare agrare deleted the check_metrics_connection_validity_before_queueing_perf_captures branch June 13, 2023 22:21
kbrock added a commit to kbrock/manageiq that referenced this pull request Jun 14, 2023
introduced by:
ManageIQ/manageiq-providers-kubernetes#493

This makes sure the ems is all good and tests pass
@Fryguy
Copy link
Member

Fryguy commented Jun 28, 2023

Backported to petrosian in commit 968fcea.

commit 968fcea19c17ceab63e9272a3df91a85461d58aa
Author: Keenan Brock <keenan@thebrocks.net>
Date:   Tue Jun 13 18:21:24 2023 -0400

    Merge pull request #493 from agrare/check_metrics_connection_validity_before_queueing_perf_captures
    
    Check metrics authentication validity before perf_capture_queue
    
    (cherry picked from commit ba04739fc4ab3eee4625630ec4622dd71d576d67)

Fryguy pushed a commit that referenced this pull request Jun 28, 2023
…_before_queueing_perf_captures

Check metrics authentication validity before perf_capture_queue

(cherry picked from commit ba04739)
GilbertCherrie pushed a commit to GilbertCherrie/manageiq that referenced this pull request Jul 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants