A Kubernetes operator to manage prometheus-anomaly-detector instances, based on Operator-SDK.
Prometheus-Anomaly-Detector (PAD) is a machine learning framework that uses models such as Fourier and LSTM to perform time-series forecasting on metric data collected from a given Prometheus source. PAD collects the metrics of interest from the specified Prometheus data source and trains a model to forecast their future values. The predicted values (with upper and lower bounds derived from a margin of error) are then compared with the actual values of the metrics.
If the actual value of a metric at a given point in time differs significantly from the prediction (i.e., it is either larger than the upper bound or smaller than the lower bound), it is considered an anomaly.
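
The bound check described above can be sketched in a few lines of Python. This is only an illustration of the rule, not PAD's actual code: the `Forecast` class, its field names and the `is_anomaly` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    """Predicted value plus the bounds implied by the margin of error (hypothetical type)."""
    yhat: float
    yhat_lower: float
    yhat_upper: float


def is_anomaly(actual: float, forecast: Forecast) -> bool:
    """Return True when the observed value falls outside the predicted band."""
    return actual > forecast.yhat_upper or actual < forecast.yhat_lower


forecast = Forecast(yhat=1.0, yhat_lower=0.8, yhat_upper=1.2)
print(is_anomaly(1.1, forecast))  # False: inside the band
print(is_anomaly(0.2, forecast))  # True: below the lower bound
```

A value exactly on a bound is treated as normal here; whether PAD does the same is an assumption of this sketch.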
The operator can be installed into a Kubernetes cluster in two ways: directly with `make deploy`, or through OLM.
- Set the appropriate Kubernetes context (in case you have multiple clusters).
- Clone this repository.
- Run `make deploy`.

This should create a `pad-operator` Deployment in a new `pad-operator-system` namespace.
`make undeploy` deletes the operator from the cluster. It is advisable to delete all Pad resources created via the operator first.
This repo also contains bundle files that enable management of the operator via OLM (Operator Lifecycle Manager). Please follow the steps mentioned in enabling OLM if your cluster does not have OLM enabled.
The bundle image for this operator can be found on Docker Hub.
Run `operator-sdk run bundle docker.io/arjunshenoymec/pad-operator-bundle:v0.0.1` to install the operator.
Run `operator-sdk cleanup --delete-all pad-operator` to delete the operator and all related resources.
After getting the pad-operator up and running in your cluster, you can manage PAD instances by using the pad CustomResource.
apiVersion: indicator.padoperator/v1alpha1
kind: Pad
metadata:
  name: pad-sample
spec:
  replicas: 1
  source: "http://demo.robustperception.io:9090/"
  metrics: "up"
  retraining_interval: "10"
The above snippet, if applied, will create a Deployment that accesses demo.robustperception.io, collects the `up` metric(s), and performs forecasting and anomaly detection. The following table lists the currently available PadSpec parameters, what they mean, and their default values.
Parameter | Definition | Default |
---|---|---|
replicas | The number of replicas in the Deployment. | 1 |
source | The URL of the Prometheus data source, including the port. Corresponds to FLT_PROM_URL in the PAD source code. | "http://demo.robustperception.io:9090/" |
metrics | The list of metrics to work on, given as a single string with metrics separated by `;`. Corresponds to FLT_METRICS_LIST in the PAD repo. | "up" |
retraining_interval | Specifies how often, in minutes, the model is retrained. Corresponds to FLT_RETRAINING_INTERVAL_MINUTES in the PAD repo. | "15" |
training_window_size | Limits the size of the data considered for training. Data older than the training window is also deleted during each training iteration. Corresponds to FLT_ROLLING_TRAINING_WINDOW_SIZE in the PAD repo. | "24h" |
image | The PAD container image to be used. Set this if you want to use your own modified or private container image. | "quay.io/aicoe/prometheus-anomaly-detector:latest" |
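
To make the correspondence between the spec fields and PAD's environment variables concrete, here is a minimal Python sketch. The field names, variable names and defaults come from the table above; the functions `spec_to_env` and `split_metrics` are hypothetical helpers, and this README does not show how the operator itself performs the translation. The metric `node_load1` is used only as an illustrative second entry.

```python
# Defaults as listed in the parameter table.
DEFAULTS = {
    "source": "http://demo.robustperception.io:9090/",
    "metrics": "up",
    "retraining_interval": "15",
    "training_window_size": "24h",
}

# Spec field -> PAD environment variable, as listed in the table.
ENV_NAMES = {
    "source": "FLT_PROM_URL",
    "metrics": "FLT_METRICS_LIST",
    "retraining_interval": "FLT_RETRAINING_INTERVAL_MINUTES",
    "training_window_size": "FLT_ROLLING_TRAINING_WINDOW_SIZE",
}


def spec_to_env(spec: dict) -> dict:
    """Fill in defaults for missing fields and rename them to PAD variables (hypothetical helper)."""
    merged = {**DEFAULTS, **{k: v for k, v in spec.items() if k in DEFAULTS}}
    return {ENV_NAMES[k]: str(v) for k, v in merged.items()}


def split_metrics(metrics: str) -> list:
    """Split the ';'-separated metrics string into individual metric names."""
    return [m.strip() for m in metrics.split(";") if m.strip()]


env = spec_to_env({"metrics": "up;node_load1", "retraining_interval": "10"})
print(env["FLT_METRICS_LIST"])                  # up;node_load1
print(env["FLT_RETRAINING_INTERVAL_MINUTES"])   # 10
print(env["FLT_PROM_URL"])                      # http://demo.robustperception.io:9090/
print(split_metrics(env["FLT_METRICS_LIST"]))   # ['up', 'node_load1']
```

Fields without a PAD counterpart in the table (`replicas`, `image`) are deliberately left out of the mapping, since they configure the Deployment rather than PAD itself.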
We plan to introduce other parameters, such as ones corresponding to `FLT_PARALLELISM`, modifying the number of CPUs available to the container, and choosing which model is trained (currently an old version of Prophet is the default in the default image specified above).