Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP code always 200 for specific check route if check is optional #173

Open
G-Rath opened this issue Sep 6, 2021 · 4 comments
Open

HTTP code always 200 for specific check route if check is optional #173

G-Rath opened this issue Sep 6, 2021 · 4 comments

Comments

@G-Rath
Copy link

G-Rath commented Sep 6, 2021

I'm guessing this is by design, and can imagine without having looked at the code itself that this'd probably make for a more straightforward implementation, but wanted to ask anyway as I couldn't find any other issues mentioning this:

Are checks that are optional meant to return a success HTTP code when accessing them directly?

The situation I'm in is that we have an application using Elasticsearch to power its search - if this goes down (or is not accessible for some reason), the search stops working but the rest of the site is still up. So we've got a few monitors setup based on use-case:

  • Our load balancer is configured to use the standard /healthchecks route, since for autoscaling we only care if the app is "up" or not
  • We have a downtime monitor checking /healthchecks/all to tell us if something "critical" is down (which really for this app is just "is the app up + the database accessible", since that's the only other extra service besides Elasticsearch)
  • We have another downtime monitor checking /healthchecks/elasticsearch to tell us if the service is down, without classing the whole site as being down.
    • However, currently it seems that this returning 200 even if the check fails

My thinking was that marking the elasticsearch healthcheck as optional would mean it wouldn't fail /healthchecks/all, but I still expected it to fail it's specific route, since tbh otherwise I don't see there being much point in that route (ignoring that I could arguably do text matching, as that isn't always supported + can be more brittle).

On the other hand, maybe I'm just using this wrong?

@pbyrne
Copy link
Collaborator

pbyrne commented Sep 6, 2021

That's definitely a reasonable way to interpret the feature, but as you noticed it's not how we've got it implemented. An optional check "passes" even when the health check fails; basically reporting the failure message in its output but still allowing an HTTP 200. And that remains true for the individual check itself; in retrospect letting the individual check fail when checked directly instead of as a group makes a ton of sense.

In practice, the way I've used OK Computer for something like a load balancer health check is setting up a CheckCollection with the minimal number of checks that would make sense (like the default uptime check, maybe a database or other service connection check) and use that; and also perhaps some special-case collections for targeted health checks that can alert on-call staff of a critical problem like you outline. And /all being a true catchall that won't necessarily wake someone up at 3am if one of them fails, but could send a Slack message or an email out for later review.

@G-Rath
Copy link
Author

G-Rath commented Sep 6, 2021

Thanks for the tip about CheckCollection, I didn't know that was a thing! From looking over the code, am I right in thinking the way to use it is something like:

OkComputer::Registry.register "optional_systems", OkComputer::CheckCollectionOkComputer.new({
  elasticsearch: OkComputer::ElasticsearchCheck.new(ENV["ELASTICSEARCH_URL"])
})

in retrospect letting the individual check fail when checked directly instead of as a group makes a ton of sense.

Glad you agree - personally I really don't see a whole lot of extra value in optional checks if they don't fail their individual check from an automation/alerting standpoint as I outlined above.

Would you be open to me making a PR changing this? Ideally I think it should be the default as I can't think of any real downside in changing the behaviour given it's only for individual checks, but that is also arguably a breaking change so can implement it behind a setting if you'd like.

@anfleene
Copy link
Contributor

That's almost right. That won't set the registrant_name on the ElasticsearchCheck

I would do it like this:

optional_checks = OkComputer::CheckCollectionOkComputer.new
optional_checks.register "elasticsearch", OkComputer::ElasticsearchCheck.new(ENV["ELASTICSEARCH_URL"])
OkComputer::Registry.register "optional_systems", optional_checks

I agree with your concerns, changing that API makes me a little nervous but if we mark it as an api breaking change I think it's fine.

@G-Rath
Copy link
Author

G-Rath commented Sep 24, 2021

So I've started looking into implementing this and it looks like it might be harder than I was hoping it'd be in terms of having a clean API, due to the way checks are composed with collections.

Currently the cleanest way I can come up with right now I think is to have OkComputerController#show check if the check is an OkComputer::OptionalCheck and if it is, attempt to call run on the underlying object (which I think is possible to get, but I've not used delegate before) rather than the delegated object, so that the overridden success? method isn't called.

The chief downside is that it would mean the controller would be more dependent on the implementation of optional checks, but that could probably be mitigated at least somewhat by implementing a concrete underlying_check method 🤔

This would also mean that optional checks in collections would continue to always return successful, but while it could be nice to support this on that level, I think it's not as important for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants