Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] adds basic observability into Task Manager's runtime operations #77868

Merged
merged 87 commits into from
Oct 27, 2020
Merged
Show file tree
Hide file tree
Changes from 81 commits
Commits
Show all changes
87 commits
Select commit Hold shift + click to select a range
5f850db
moved TM integration tests to TS
gmmorris Sep 18, 2020
a34dd73
Merge branch 'master' into task-manager/health
gmmorris Sep 18, 2020
c0a7038
introduce health endpoint in TM
gmmorris Sep 22, 2020
7e27d7b
added shceduling to health endpoint
gmmorris Sep 23, 2020
0a22c43
fixed tests
gmmorris Sep 23, 2020
7c226e9
fixed typing
gmmorris Sep 23, 2020
e1ee967
fixed typing again
gmmorris Sep 23, 2020
abec231
added task runtime stats to health endpoint
gmmorris Sep 28, 2020
92cf5ad
Merge branch 'master' into task-manager/health
gmmorris Sep 28, 2020
bf0c3b4
removed unused import
gmmorris Sep 28, 2020
acae863
made task run stats reactive
gmmorris Sep 28, 2020
6b81d56
Merge branch 'master' into task-manager/health
gmmorris Sep 28, 2020
4d34dac
fixed unit tests in health route
gmmorris Sep 28, 2020
e3ba8ad
removed unused import
gmmorris Sep 29, 2020
e7c5da2
added task run duration to health endpoint
gmmorris Sep 29, 2020
192782a
removed mode from TM stats
gmmorris Sep 30, 2020
09bf68e
report run result frequency in tm stats
gmmorris Sep 30, 2020
4a5d652
added success and failure ratio
gmmorris Oct 1, 2020
de5a7ac
added schedule density to Task Manager health
gmmorris Oct 2, 2020
62ceba3
added schedule density to test
gmmorris Oct 5, 2020
1cc8260
added upper bound to schedule density
gmmorris Oct 5, 2020
7b14dff
Merge branch 'master' into task-manager/health
gmmorris Oct 5, 2020
2dca67f
use APM agg types
gmmorris Oct 5, 2020
ae15dc6
fixed tests
gmmorris Oct 5, 2020
734cb12
fixed mock import
gmmorris Oct 5, 2020
db86986
added status on health api
gmmorris Oct 5, 2020
4ca65c3
test status in aceptancve tests
gmmorris Oct 5, 2020
44cb578
corrected types
gmmorris Oct 5, 2020
d29c866
change RGY to OK Error and Warn
gmmorris Oct 5, 2020
273d58d
added readme
gmmorris Oct 6, 2020
2efb599
updated json in readme
gmmorris Oct 6, 2020
911c827
spaces -> tabs
gmmorris Oct 6, 2020
2ab5e73
Added health status in each section
gmmorris Oct 6, 2020
37a4041
removed unused import
gmmorris Oct 6, 2020
2f12de5
Merge branch 'master' into task-manager/health
gmmorris Oct 8, 2020
4863391
plug health into service status
gmmorris Oct 8, 2020
7cd7ca6
Merge branch 'master' into task-manager/health
gmmorris Oct 8, 2020
8223188
Merge branch 'master' into task-manager/health
gmmorris Oct 8, 2020
9fc7da6
fixed src import
gmmorris Oct 8, 2020
7403903
estimate recurring tasks in schedule
gmmorris Oct 9, 2020
e47b619
make hot stats refresh rate configurable
gmmorris Oct 12, 2020
6dddf5c
Merge branch 'master' into task-manager/health
gmmorris Oct 12, 2020
7798875
ensure we dont aggregate workload before tm is readyt
gmmorris Oct 12, 2020
a47c7aa
fixed config mocks
gmmorris Oct 12, 2020
689d014
updated docs
gmmorris Oct 12, 2020
e9db32e
fixed bvucketing
gmmorris Oct 12, 2020
5ff008d
renamed schedule density
gmmorris Oct 13, 2020
4f37202
Merge remote-tracking branch 'upstream/master' into task-manager/health
gmmorris Oct 13, 2020
9d92055
changed to p metrics
gmmorris Oct 13, 2020
ac27d4f
removed unused import
gmmorris Oct 13, 2020
311103e
extract task dictionary from TM
gmmorris Oct 13, 2020
8396812
fixed typing
gmmorris Oct 15, 2020
177b6b9
extract middleware and store from taskManager
gmmorris Oct 15, 2020
3d2cde8
extracted store APIs from TM
gmmorris Oct 15, 2020
1c4c266
Merge branch 'master' into task-manager/health
gmmorris Oct 15, 2020
afcdfd6
removed redundant lifecycle checks
gmmorris Oct 15, 2020
bbd445a
extract scheduling from task_manager
gmmorris Oct 15, 2020
02afe23
renamed task_manager to polling_lifecycle
gmmorris Oct 15, 2020
320ea57
last bit og logger
gmmorris Oct 15, 2020
475d49d
Merge branch 'master' into task-manager/refactor-oversized-class
gmmorris Oct 16, 2020
b0275e0
removed usage of deprecated legacy es client
gmmorris Oct 16, 2020
aa669e6
renamed variables to match new names of components
gmmorris Oct 16, 2020
42e041f
made type optional
gmmorris Oct 16, 2020
77d4450
removed type from tasks
gmmorris Oct 16, 2020
4e08766
simplified middleware
gmmorris Oct 16, 2020
1c21c6b
removed types from the tasks in actions and alerts
gmmorris Oct 16, 2020
faefc37
Merge branch 'master' into task-manager/health
gmmorris Oct 16, 2020
c30d71b
Merge branch 'master' into task-manager/refactor-oversized-class
gmmorris Oct 19, 2020
12e52e6
Merge branch 'master' into task-manager/health
gmmorris Oct 19, 2020
6637ab4
Merge branch 'task-manager/health' into task-manager/health-refactored
gmmorris Oct 19, 2020
40fd393
merged refactor into obvs
gmmorris Oct 19, 2020
b310cb7
fixed typing
gmmorris Oct 19, 2020
d3157fd
use camel case in health endpoint as per styleguide
gmmorris Oct 19, 2020
20d7e66
fixed readme
gmmorris Oct 19, 2020
1906572
made execution thresholds configurable
gmmorris Oct 20, 2020
508a726
merge config with dynamic configs in stats
gmmorris Oct 20, 2020
14faf60
expanded test on thresholds
gmmorris Oct 20, 2020
600863f
Merge branch 'master' into task-manager/health-refactored
gmmorris Oct 20, 2020
c8b4879
added config examples to readme
gmmorris Oct 20, 2020
1a5a907
refactored createRunningAveragedStat
gmmorris Oct 22, 2020
12e7866
Merge branch 'master' into task-manager/health
gmmorris Oct 22, 2020
13981c0
Merge branch 'master' into task-manager/health
kibanamachine Oct 26, 2020
5ea830c
use infered type in aggregation
gmmorris Oct 26, 2020
9b48539
added missing configs to docs
gmmorris Oct 27, 2020
314f3bc
Merge branch 'master' into task-manager/health
gmmorris Oct 27, 2020
e59ebc0
removed text in JSOPN parse
gmmorris Oct 27, 2020
5d8f750
Merge branch 'master' into task-manager/health
gmmorris Oct 27, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions x-pack/plugins/apm/typings/elasticsearch/aggregations.ts
Original file line number Diff line number Diff line change
Expand Up @@ -145,6 +145,15 @@ export interface AggregationOptionsByType {
>;
keyed?: boolean;
} & AggregationSourceOptions;
range: {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these changes needed? I'm wondering if we will need to pre-req the apm plugin, if not now, in some future where the compilation depends on plugin deps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are investigating moving these types into the JS client (see #77720.) No timeframe on that, but worth considering.

I'm always in favor of explicitly declaring dependencies, though in this case we would create a circular one since APM has Task Manager as an optional dependency.

As for the type changes here I'd like to defer to @dgieselaar on that one before approving.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wasn't crazy about this, but checked with @dgieselaar and he gave a thumbsup.
He said he'd review on the PR 👍

field: string;
ranges: Array<
| { key?: string; from: string | number }
| { key?: string; to: string | number }
| { key?: string; from: string | number; to: string | number }
>;
keyed?: boolean;
};
auto_date_histogram: {
buckets: number;
} & AggregationSourceOptions;
Expand Down Expand Up @@ -324,6 +333,18 @@ interface AggregationResponsePart<
? Record<string, DateRangeBucket>
: { buckets: DateRangeBucket[] };
};
range: {
buckets: TAggregationOptionsMap extends { range: { keyed: true } }
? Record<
string,
DateRangeBucket &
SubAggregationResponseOf<TAggregationOptionsMap['aggs'], TDocument>
>
: Array<
DateRangeBucket &
SubAggregationResponseOf<TAggregationOptionsMap['aggs'], TDocument>
>;
};
auto_date_histogram: {
buckets: Array<
DateHistogramBucket &
Expand Down
265 changes: 265 additions & 0 deletions x-pack/plugins/task_manager/server/MONITORING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,265 @@
# Task Manager Monitoring
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finally getting around to looking at this, sorry - I'm going to do an initial "comment" review of this doc, before looking at the code ...


Task Manager has an internal monitoring mechanism in which keeps track of a variety of metrics which are exposed via a `health` api endpoint and Kibana Server Log debug messaging.

## Exposed Metrics
There are three different sections to the stats returned by the `health` api.
- `configuration`: Summarizes Task Manager's current configuration.
- `workload`: Summarizes the workload in the current deployment.
- `runtime`: Tracks Task Manager's performance.

### Configuring the Stats
There are four new configurations:

- `xpack.task_manager.monitored_stats_required_freshness` - The _required freshness_ of critical "Hot" stats, which means that if key stats (last polling cycle time, for example) haven't been refreshed within the specified duration, the `_health` endpoint and service will report an `Error` status. By default this is inferred from the configured `poll_interval` and is set to `poll_interval` plus a `1s` buffer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A one second buffer seems too small to me - I'd be a little afraid of a Kibana under load still working "fine", but exceeding that, moving to an Error. I wonder if we should have two values, perhaps computed from one config value - a warning and an error. Where warning was using the config value, error used 2x the config value. Presumably the value can't be less than poll_interval, perhaps should not be lower than whatever "default we use (poll_interval + 1s).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fair - this is just an initial stab.
As mentioned on the issue itself - I'd like some input on where we draw the line between Error and Warning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it does get interesting. Seems it could follow what the query timeout would be. Some queries against Elasticsearch can take a while to run and should be perfectly normal (part of make it slow project).

- `xpack.task_manager.monitored_aggregated_stats_refresh_rate` - Dictates how often we refresh the "Cold" metrics. These metrics require an aggregation against Elasticsearch and add load to the system, hence we want to limit how often we execute these. We also inffer the _required freshness_ of these "Cold" metrics from this configuration, which means that if these stats have not been updated within the required duration then the `_health` endpoint and service will report an `Error` status. This covers the entire `workload` section of the stats. By default this is configured to `60s`, and as a result the _required freshness_ defaults to `61s` (refresh plus a `1s` buffer).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the rate at which we do the queries, and there is also a required freshness for this? Isn't that freshness then just the latency to make the request and get the response? Also seems like we'd have some kind of timeout on the call, and retry, and log a warning when we had to retry.

Copy link
Contributor Author

@gmmorris gmmorris Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's designed to catch the case where the workload hasn't been updated because something in the mechanism broke.
When building a monitoring endpoint you also need to take into account that you want a warning if the monitoring itself has broken... this ensures that if, for some reason, the code that fetches the workload stalls, you don't just have a green light on your monitoring because it isn't refreshing and the last value was OK.

Copy link
Contributor Author

@gmmorris gmmorris Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think of this as two separate components - one that keeps the monitored stats up to date and one that looks at these monitored stats and assesses the health based on them.

- `xpack.task_manager.monitored_stats_running_average_window`- Dictates the size of the window used to calculate the running average of various "Hot" stats, such as the time it takes to run a task, the _drift_ that tasks experience etc. These stats are collected throughout the lifecycle of tasks and this window will dictate how large the queue we keep in memory would be, and how many values we need to calculate the average against. We do not calculate the average on *every* new value, but rather only when the time comes to summarize the stats before logging them or returning them to the API endpoint.
- `xpack.task_manager.monitored_task_execution_thresholds`- Configures the threshold of failed task executions at which point the `warn` or `error` health status will be set either at a default level or a custom level for specific task types. This will allow you to mark the health as `error` when any task type failes 90% of the time, but set it to `error` at 50% of the time for task types that you consider critical. This value can be set to any number between 0 to 100, and a threshold is hit when the value *exceeds* this number. This means that you can avoid setting the status to `error` by setting the threshold at 100, or hit `error` the moment any task failes by setting the threshold to 0 (as it will exceed 0 once a single failer occurs).

For example, in your `Kibana.yml`:
```
xpack.task_manager.monitored_stats_required_freshness: 5000
xpack.task_manager.monitored_aggregated_stats_refresh_rate: 60000
xpack.task_manager.monitored_stats_running_average_window: 50
xpack.task_manager.monitored_task_execution_thresholds:
default:
error_threshold: 70
warn_threshold: 50
custom:
"alerting:always-firing":
error_threshold: 50
warn_threshold: 0
```

## Consuming Health Stats
Task Manager exposes a `/api/task_manager/_health` api which returns the _latest_ stats.
Calling this API is designed to be fast and doesn't actually perform any checks- rather it returns the result of the latest stats in the system, and is design in such a way that you could call it from an external service on a regular basis without worrying that you'll be adding substantial load to the system.

Additionally, the metrics are logged out into Task Manager's `DEBUG` logger at a regular cadence (dictated by the Polling Interval).
If you wish to enable DEBUG logging in your Kibana instance, you will need to add the following to your `Kibana.yml`:
```
logging:
loggers:
- context: plugins.taskManager
appenders: [console]
level: debug
```

Please bear in mind that these stats are logged as often as your `poll_interval` configuration, which means it could add substantial noise to your logs.
We would recommend only enabling this level of logging temporarily.

### Understanding the Exposed Stats

As mentioned above, the `health` api exposes three sections: `configuration`, `workload` and `runtime`.
Each section has a `timestamp` and a `status` which indicates when the last update to this setion took place and whether the health of this section was evaluated as `OK`, `Warning` or `Error`.

The root has its own `status` which indicate the state of the system overall as infered from the `status` of the section.
An `Error` status in any section will cause the whole system to display as `Error`.
A `Warning` status in any section will cause the whole system to display as `Warning`.
An `OK` status will only be displayed when all sections are marked as `OK`.

The root `timestamp` is the time in which the summary was exposed (either to the DEBUG logger or the http api) and the `last_update` is the last time any one of the sections was updated.

#### The Configuration Section
The `configuration` section summarizes Task Manager's current configuration, including dynamic configurations which change over time, such as `poll_interval` and `max_workers` which adjust in reaction to changing load on the system.

These are "Hot" stats which are updated whenever a change happens in the configuration.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps overkill, but would it be useful to list the registered task types here, and their config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's an interesting thought.
Not just the types that are in use but also the ones Kibana knows about?
Could be worth while.... though I've been thinking of this mostly for monitoring things that change over time.

I've been thinking we could have a verbose mode (in the http endpoint) that includes more data and that way we avoid putting too much info into the server log, and we'd only log the stuff that changes. In verbose mode we'd return things like this as well.


#### The Workload Section
The `workload` which summarizes the work load in the current deployment, listing the tasks in the system, their types and what their current status is.

It includes three sub sections:
- The number of tasks scheduled in the system, broken down by type and status.
- The number of idle `overdue` tasks, whose `runAt` has expired.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the # of failed tasks would be good here as well :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already there :) that's the and status....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should something be done about retryAt? I wonder if this or something else should consider the expiry of it as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, you mean overdue as their retryAt has been reached? 🤔
Yeah, I didn't think of that, I'll look into it

- Execution density in the next minute or so (configurable), which shows how many tasks are scheduled to execute in the scope of each polling interval. This can give us an idea of how much load there is on the current Kibana deployment.

These are "Cold" stat which are updated at a regular cadence, configured by the `monitored_aggregated_stats_refresh_rate` config.

#### The Runtime Section
The `runtime` tracks Task Manager's performance as it runs, making note of task execution time, _drift_ etc.
These include:
- The time it takes a task to run (mean and median, using a configurable running average window, `50` by default)
- The average _drift_ that tasks experience (mean and median, using the same configurable running average window as above). Drift tells us how long after a task's scheduled a task typically executes.
gmmorris marked this conversation as resolved.
Show resolved Hide resolved
- The polling rate (the timestamp of the last time a polling cycle completed) and the result [`No tasks | Filled task pool | Unexpectedly ran out of workers`] frequency the past 50 polling cycles (using the same window size as the one used for running averages)
- The `Success | Retry | Failure ratio` by task type. This is different than the workload stats which tell you what's in the queue, but ca't keep track of retries and of non recurring tasks as they're wiped off the index when completed.

These are "Hot" stats which are updated reactively as Tasks are executed and interacted with.

### Example Stats
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm curious (again, without looking at the code yet!) how much of the runtime section is purely from a Kibana's internal bookkeeping. And then how well this will work out in a multi-Kibana story. It seems like trying to aggregate over all the Kibana instances is going to be hard for this section, so I can understand if the first iteration of this just returns the current Kibana instance's stats. I think we should include the kibana server uuid value in here, in that case. At least a human could just keep doing curl's till you know you hit every server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should include the kibana server uuid

Yeah, I actually intended on doing that and forgot :)
I had thought of this for that exact reason.

Copy link
Contributor Author

@gmmorris gmmorris Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much of the runtime section is purely from a Kibana's internal bookkeeping

This isn't about 1s vs. 2s, it's about having visibility into what's going on in the server that might explain why a customer is seeing a certain outcome we didn't expect.
If, for example, average duration is 30s on certain task type, but it's average drift is 1m on a 5m schedule, then we know the issue is in TM itself, not in the task. We then know to look at things like, estimatedScheduleDensity, max_workers etc.
This is all about getting indicators that help us advise customers.

Long term, we also want to shift the Health Status based on these things, but for now, in order to reach GA, this is all about getting some observability into that black box.


For example, if you _curl_ the `/api/task_manager/_health` endpoint, you might get these stats:
gmmorris marked this conversation as resolved.
Show resolved Hide resolved
```
{
/* the time these stats were returned by the api */
"timestamp": "2020-10-05T18:26:11.346Z",
/* the overall status of the system */
"status": "OK",
/* last time any stat was updated in this output */
"last_update": "2020-10-05T17:57:55.411Z",
"stats": {
"configuration": { /* current configuration of TM */
"timestamp": "2020-10-05T17:56:06.507Z",
"status": "OK",
"value": {
"max_workers": 10,
"poll_interval": 3000,
"request_capacity": 1000,
"max_poll_inactivity_cycles": 10,
"monitored_aggregated_stats_refresh_rate": 60000,
"monitored_stats_running_average_window": 50
}
},
"workload": { /* The workload of this deployment */
"timestamp": "2020-10-05T17:57:06.534Z",
"status": "OK",
"value": {
"count": 6, /* count of tasks in the system */
"task_types": { /* what tasks are there and what status are they in */
"actions_telemetry": {
"count": 1,
"status": {
"idle": 1
}
},
"alerting_telemetry": {
"count": 1,
"status": {
"idle": 1
}
},
"apm-telemetry-task": {
"count": 1,
"status": {
"idle": 1
}
},
"endpoint:user-artifact-packager": {
"count": 1,
"status": {
"idle": 1
}
},
"lens_telemetry": {
"count": 1,
"status": {
"idle": 1
}
},
"session_cleanup": {
"count": 1,
"status": {
"idle": 1
}
}
},

/* Frequency of recurring tasks schedules */
"schedule": [
["60s", 1], /* 1 task, every 60s */
["3600s", 3], /* 3 tasks every hour */
["720m", 1]
],
/* There are no overdue tasks in this system at the moment */
"overdue": 0,
/* This is the schedule density, it shows a histogram of all the polling intervals in the next minute (or, if
pollInterval is configured unusually high it will show a min of 2 refresh intervals into the future, and a max of 50 buckets).
Here we see that on the 3rd polling interval from *now* (which is ~9 seconds from now, as pollInterval is `3s`) there is one task due to run.
We also see that there are 5 due two intervals later, which is fine as we have a max workers of `10`
*/
"estimated_schedule_density": [0, 0, 1, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
},
"runtime": {
"timestamp": "2020-10-05T17:57:55.411Z",
"status": "OK",
"value": {
"polling": {
/* When was the last polling cycle? */
"last_successful_poll": "2020-10-05T17:57:55.411Z",
/* What is the frequency of polling cycle result?
Here we see 94% of "NoTasksClaimed" and 6% "PoolFilled" */
"result_frequency_percent_as_number": {
"NoTasksClaimed": 94,
"RanOutOfCapacity": 0, /* This is a legacy result, we might want to rename - it tells us when a polling cycle resulted in claiming more tasks than we had workers for, butt he name doesn't make much sense outside of the context of the code */
"PoolFilled": 6
}
},
/* on average, the tasks in this deployment run 1.7s after their scheduled time */
"drift": {
"mean": 1720,
"median": 2276
},
"execution": {
"duration": {
/* on average, the `endpoint:user-artifact-packager` tasks take 15ms to run */
"endpoint:user-artifact-packager": {
"mean": 15,
"median": 14.5
},
"session_cleanup": {
"mean": 28,
"median": 28
},
"lens_telemetry": {
"mean": 100,
"median": 100
},
"actions_telemetry": {
"mean": 135,
"median": 135
},
"alerting_telemetry": {
"mean": 197,
"median": 197
},
"apm-telemetry-task": {
"mean": 1347,
"median": 1347
}
},
"result_frequency_percent_as_number": {
/* and 100% of `endpoint:user-artifact-packager` have completed in success (within the running average window, so the past 50 runs (by default, configrable by `monitored_stats_running_average_window`) */
"endpoint:user-artifact-packager": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
},
"session_cleanup": {
/* `error` status as 90% of results are `Failed` */
"status": "error",
"Success": 5,
"RetryScheduled": 5,
"Failed": 90
},
"lens_telemetry": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
},
"actions_telemetry": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
},
"alerting_telemetry": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
},
"apm-telemetry-task": {
"status": "OK",
"Success": 100,
"RetryScheduled": 0,
"Failed": 0
}
}
}
}
}
}
}
```
8 changes: 8 additions & 0 deletions x-pack/plugins/task_manager/server/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,8 @@ The task_manager can be configured via `taskManager` config options (e.g. `taskM
- `override_num_workers`: An object of `taskType: number` that overrides the `num_workers` for tasks
- For example: `task_manager.override_num_workers.reporting: 2` would override the number of workers occupied by tasks of type `reporting`
- This allows sysadmins to tweak the operational performance of Kibana, allowing more or fewer tasks of a specific type to run simultaneously
- `monitored_aggregated_stats_refresh_rate` - Dictates how often we refresh the "Cold" metrics. Learn More: [./MONITORING](./MONITORING.MD)
- `monitored_stats_running_average_window`- Dictates the size of the window used to calculate the running average of various "Hot" stats. Learn More: [./MONITORING](./MONITORING.MD)
gmmorris marked this conversation as resolved.
Show resolved Hide resolved

## Task definitions

Expand Down Expand Up @@ -460,3 +462,9 @@ The task manager's public API is create / delete / list. Updates aren't directly
node scripts/functional_tests_server.js --config x-pack/test/plugin_api_integration/config.ts
node scripts/functional_test_runner --config x-pack/test/plugin_api_integration/config.ts
```

## Monitoring

Task Manager exposes runtime statistics which enable basic observability into its inner workings and makes it possible to monitor the system from external services.

Learn More: [./MONITORING](./MONITORING.MD)
Loading