Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APM]: Replace error occurrence watchers with Kibana Alerting #46547

Closed
wants to merge 1 commit into from

Conversation

dgieselaar
Copy link
Member

@dgieselaar dgieselaar commented Sep 25, 2019

This is a proof of concept that replaces the APM error occurrence Watcher (an Elasticsearch feature) with the new Kibana alerting and actions plugin. The Slack action succesfully fires, but I haven't bothered with the Email action because the approach is pretty similar, and we decided to timebox this to two days.

Here's the gist:

  • Define an Alert Type for error occurrences. An Alert Type is essentially a function (called an executor) that, given a set of parameters, decides whether (and which) actions need to be triggered. Actions are triggered via an Alert Instance, which captures state of previous executions (for the purpose of this POC, I don't think we need that state).
  • When the user configures a new alert, we create an Alert object on the server. An Alert is a configuration object for an Alert Type, that tells it at what interval it and with what parameters it should execute.
  • An Alert also configures the action groups that can be triggered from the Alert Type executor. For this POC, we create Slack and Email actions based on the user's input when configuring an alert, and add them to the default group.
  • In the executor of the Alert Type, that now runs at the configured interval for the Alert that was created, we run a query for the number of error occurrences and determine whether the threshold was exceeded. If so, we fire the default actions for the Alert, which can be either the Slack or Email action, or both.
  • To make formatting easier, we pre-process some of the output from the query and pass this as context to the actions that are triggered.

To test this, make sure to explicitly enable both required plugins in your kibana config file:

xpack.actions.enabled: true
xpack.alerting.enabled: true

Notes:

  • Currently an Alert only supports interval as a scheduling option. That means that we cannot run the executor at a given time each day, which is something that our current implementation does support. I've been told expanding the scheduling options is on the roadmap.
  • To determine the timeframe for the query we run in the executor, we need the interval parameter. However, it doesn't seem like that's automatically available in the executor, so we pass it as a parameter instead. Maybe there's a nicer way to solve this.
  • We have to create actions in addition to each Alert that we create (AFAICT), because we allow our users to configure the Slack webhook URL and email addresses in the form. This could present some maintenance issues. It might make more sense to expect the user to create specific Slack/Email actions first, and then provide them with a UI to select those predefined actions. This would limit the amount of maintenance/cleanup of orphaned actions that would be needed in our current implementation.
  • The Alerting plugin still uses template rendering, but it's a lot easier now that we can pre-process the output from the query with JavaScript, and then pass it to a Mustache renderer. One thing that could me off guard was that Mustache escapes by default, which broke the formatting of the Slack message initially.

Some questions/suggestions for the Alerting/Actions team:

  • Some of the configuration options for the email and slack actions are now seemingly moved to the secrets param, but this is not reflected in the documentation (https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/actions/README.md)
  • As explained above, would be nice to have the interval available in the executor as well (if it's not already there and I missed it). Seems like it's a common use case.
  • I could not figure out what actionGroups is for in the Alert Type. It's seemingly not documented, but it is required.
  • The typings of registerType could be improved by having typed params (instead of Record<string, any>). Similar to what we did for APM in [APM] migrate to io-ts #42961 (happy to open a PR if welcome).
  • The kbn-action CLI tool has been super useful, thanks for taking the time to build that.

@dgieselaar dgieselaar requested a review from a team as a code owner September 25, 2019 06:50
@dgieselaar
Copy link
Member Author

Woops, this was supposed to be a draft PR. I'll update the description in a bit.

@elasticmachine
Copy link
Contributor

💔 Build Failed

*/

export const ERROR_OCCURRENCE_ALERT_TYPE_ID = 'apm.error_occurrence';
export const NOTIFICATION_EMAIL_ACTION_ID = 'apm.notificationEmail';
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this right now. I was figuring out if we could create the email action we need when registering the alert type, but because I never got around to implementing email it's not being used.

'elasticsearch',
'xpack_main',
'apm_oss',
'alerting',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if these should be required, or if they are optional. If it's the latter, not sure how we can get access to these plugins on startup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good question for platform how to handle optional dependencies. I think it has come up before but can't remember the answer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New platform allows for optional dependencies, but legacy platform does not. If you are depending on something in a legacy plugin that could be disabled, you need to make sure that you implement isEnabled and check that the dependency is enabled, otherwise Kibana will crash.

.map(email => email.trim())
.filter(email => !!email)
: [];
const email = this.state.actions.email ? this.state.emails : '';
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've simplified this because we can just pass a csv to the action, because it uses nodemailer under the hood. Never tested it though.

})
);

return alertsClient.create({
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we're allowed to do this or if we have to use the API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ok to do, all the business logic currently lives within the client and you're getting the client from the request (which is good). You will just have to handle validation until we have some within the client (if we decided to do so).

import { createRoute } from './create_route';
import { createAlert } from '../lib/alerting/error_occurrence/create_alert';

export const errorOccurrenceAlertRoute = createRoute(core => ({
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created a route for this, because we have to execute several concurrent and sequential requests, and the server is a more robust environment for that kind of dependencies.

}
};

const { callWithInternalUser } = elasticsearch.getCluster('admin');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also use services.callCluster(...). It will be in the context of the user who created the alert (security wise). The approach you have is fine as well if you don't want that.

@elasticmachine
Copy link
Contributor

💔 Build Failed

@smith smith added the WIP Work in progress label Oct 22, 2019
@elasticmachine
Copy link
Contributor

💔 Build Failed

this.props.onClose();
const id = 'id' in savedObject ? savedObject.id : NOT_AVAILABLE_LABEL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm outmoded, but isn't the in operator generally discouraged since it's at risk for prototype hijacking?

'<br/>' +
'{errorLogMessage}<br/>' +
'{errorCulprit}<br/>' +
'{docCount} occurrences<br/>',
Copy link
Contributor

@ogupte ogupte Nov 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do want to preserve new lines \n characters in email template like in the slack template?

@dgieselaar
Copy link
Member Author

Closing in favor of #59566.

@dgieselaar dgieselaar closed this Mar 19, 2020
@dgieselaar dgieselaar deleted the alerting-poc branch March 19, 2020 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP Work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants