Harness the power of open source to efficiently monitor your Cisco ACI environment with the ACI-Monitoring-Stack. This lightweight, yet robust, monitoring solution combines top-tier open source tools, each contributing unique capabilities to ensure comprehensive visibility into your ACI infrastructure.
The ACI-Monitoring-Stack integrates the following key components:
- Grafana: The leading open-source analytics and visualization platform. Grafana allows you to create dynamic dashboards that provide real-time insights into your network's performance, health, and metrics. With its user-friendly interface, you can easily visualize and correlate data across your ACI fabric, enabling quicker diagnostics and informed decision-making.
- Prometheus: A powerful open-source monitoring and alerting toolkit. Prometheus excels in collecting and storing metrics in a time-series database, allowing for flexible queries and real-time alerting. Its seamless integration with Grafana ensures that your monitoring stack provides a detailed and up-to-date view of your ACI environment.
- Loki: Designed for efficiently aggregating and querying logs from your entire ACI ecosystem. Loki complements Prometheus by focusing on log aggregation, providing a unified stack for metrics and logs. Its integration with Grafana enables you to correlate log data with metrics and create a holistic monitoring experience.
- Promtail: The agent responsible for gathering and shipping log files to the Loki server.
- Syslog-ng: An open-source implementation of the syslog protocol. Its role in this stack is to translate syslog messages from RFC 3164 to RFC 5424. This is needed because Promtail only supports syslog RFC 5424 over TCP, and ACI can only send that format starting with release 6.1.
- aci-exporter: A Prometheus exporter that serves as the bridge between your Cisco ACI environment and the Prometheus monitoring ecosystem. The aci-exporter translates ACI-specific metrics into a format that Prometheus can ingest, ensuring that all crucial data points are captured and monitored effectively.
- Pre-configured ACI data collection queries, alerts, and dashboards (Work In Progress): The ACI-Monitoring-Stack provides a solid foundation for monitoring an ACI fabric with its pre-defined queries, dashboards, and alerts. While these are crafted based on best practices to offer immediate insight into network performance, they are not exhaustive. The strength of the ACI-Monitoring-Stack lies in its community-driven approach: users are invited to contribute their expertise by providing feedback, sharing custom solutions, and helping enhance the stack. Your input helps refine and expand the stack's capabilities, ensuring it remains a relevant and powerful tool for network monitoring.
Want to take a look at the current stack? Head to the demo (user: `guest`, password: `guest`).
Here is a high-level diagram of the components used and how they interact:
```mermaid
flowchart-elk
subgraph ACI Monitoring Stack
    G["Grafana"]
    P[("Prometheus")]
    L["Loki"]
    PT["Promtail"]
    SL["Syslog-ng"]
    AM["Alertmanager"]
    A["aci-exporter"]
    G--"PromQL"-->P
    G--"LogQL"-->L
    P-->AM
    PT-->L
    SL-->PT
    P--"Service Discovery"-->A
end
subgraph ACI
    S["Switches"]
    APIC["APIC"]
end
U["User"]
N["Notifications (Mail/Webex etc...)"]
V{"Ver >= 6.1"}
A--"API Queries"-->S
A--"API Queries"-->APIC
U-->G
AM-->N
S--"Syslog"-->V
APIC--"Syslog"-->V
V -->|Yes| PT
V -->|No| SL
```
If you want to contribute to this project, start from Here.
- Familiarity with Kubernetes: This installation guide is intended to assist with the setup of the ACI Monitoring Stack and assumes prior familiarity with Kubernetes; it is not designed to provide instruction on Kubernetes itself.
- A Kubernetes Cluster: Currently the stack has been tested on Upstream Kubernetes 1.30.x and Minikube.
- Persistent Volumes: 10G should be plenty for a small/demo environment. Many storage provisioners support volume expansion, so it should be easy to increase this post-installation.
- Ability to expose services for:
    - Access to the Grafana, Prometheus and Alertmanager dashboards: this is ideally achieved via an Ingress Controller.
    - (Optional) Wildcard DNS entries for the ingress controller domain.
    - Syslog ingestion from ACI: since syslog can be sent via UDP or TCP, it is more flexible to expose these services directly via either a `NodePort` or a `LoadBalancer` service type.
- Cluster Compute Resources: This stack has been tested against a 500-node ACI fabric and consumed roughly 8GB of RAM. CPU did not seem to play a major role; any modern CPU should suffice.
- 1 Dedicated Namespace per instance: One instance can monitor at least 500 switches. A dedicated namespace is not strictly required, but is suggested to keep the Helm configuration simple so the default K8s service names can be re-used; see the Config Preparation section for more details.
- Helm: This stack is distributed as a Helm chart and relies on 3rd-party Helm charts as well.
- Connectivity from your Kubernetes cluster to ACI, either over Out Of Band or In Band management.
If you are installing on Minikube please follow the Minikube Preparation Steps and then come back here.
The ACI Monitoring Stack is a combination of several charts. If you are familiar with Helm, you are aware of the struggle to propagate dynamic values to sub-charts: for example, it is not possible to pass the name of a service to a sub-chart in a dynamic way.
In order to simplify the user experience, the chart comes with a few pre-configured parameters that are populated in the configurations of the various sub-charts. For example, the aci-exporter service name is pre-configured as `aci-exporter-svc`, and this value is then passed to Prometheus as the service discovery URL.
All these values can be customized; if you need to, you can refer to the Values file.
Note: This is the first Helm chart camrossi created, and he is sure it can be improved. If you have suggestions they are extremely welcome! :)
The aci-exporter is the bridge between your Cisco ACI environment and the Prometheus monitoring ecosystem. For it to work, it needs to know:
- `fabrics`: A list of fabrics and how to connect to the APICs. This requires a Read-Only Admin user.
- `service_discovery`: Configures whether devices are reachable via Out Of Band (`oobMgmtAddr`) or In Band (`inbMgmtAddr`) management.
Note: The switches are auto-discovered.
This is done by setting the following Values in Helm:
```yaml
aci_exporter:
  # Profiles for different fabrics
  fabrics:
    fab1:
      username: <username>
      password: <password>
      apic:
        - https://IP1
        - https://IP2
        - https://IP3
      # service_discovery oobMgmtAddr|inbMgmtAddr
      service_discovery: oobMgmtAddr
    fab2:
      username: <username>
      password: <password>
      apic:
        - https://IP1
        - https://IP2
        - https://IP3
      # service_discovery oobMgmtAddr|inbMgmtAddr
      service_discovery: inbMgmtAddr
```
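Under the hood, this is what feeds the "Service Discovery" arrow in the diagram above: Prometheus asks the aci-exporter for the list of targets to scrape. Purely as an illustration (the chart generates the real configuration for you), a Prometheus job using the pre-configured `aci-exporter-svc` service name might look roughly like the sketch below; the port and `/sd` endpoint are assumptions based on aci-exporter defaults:

```yaml
# Illustrative sketch only -- the chart auto-generates the actual Prometheus
# configuration. Service name, port and /sd endpoint are assumed defaults.
scrape_configs:
  - job_name: aci
    http_sd_configs:
      # Ask the aci-exporter which APICs/switches exist in fabric fab1
      - url: http://aci-exporter-svc:9643/sd?target=fab1
```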
Prometheus is installed via its own chart; the options you need to set are:
- The `ingress` config and the `baseURL`: these are most likely the same URLs, used to access `prometheus` and `alertmanager`.
- Persistent Volume capacity.
- (Optional) `retentionSize`: this is only needed if you want to limit the retention by size. Keep in mind that if you run out of disk space Prometheus WILL stop working.
- (Optional) Alertmanager `route`: these are used to send notifications via Mail/Webex etc.; the complete syntax is available Here.
Below an example:
```yaml
prometheus:
  server:
    ingress:
      enabled: true
      ingressClassName: "traefik"
      hosts:
        - aci-exporter-prom.apps.c1.cam.ciscolabs.com
    baseURL: "http://aci-exporter-prom.apps.c1.cam.ciscolabs.com"
    retentionSize: 5GB
    persistentVolume:
      accessModes: ["ReadWriteOnce"]
      size: 5Gi
  alertmanager:
    baseURL: "http://aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com"
    ingress:
      enabled: true
      ingressClassName: "traefik"
      hosts:
        - host: aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com
          paths:
            - path: /
              pathType: ImplementationSpecific
    config:
      route:
        group_by: ['alertname']
        group_interval: 30s
        repeat_interval: 30s
        group_wait: 30s
        receiver: 'webex'
      receivers:
        - name: webex
          webex_configs:
            - send_resolved: false
              api_url: "https://webexapis.com/v1/messages"
              room_id: "<room_id>"
              http_config:
                authorization:
                  credentials: "<credentials>"
```
If you use Webex, here are some config steps for you!
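Custom alert rules can also be injected through the Prometheus sub-chart values. Below is a minimal, hedged sketch: the `serverFiles` path follows the upstream prometheus chart convention, and the rule itself (name, expression, timings) is only an example to adapt:

```yaml
prometheus:
  serverFiles:
    alerting_rules.yml:
      groups:
        - name: aci-monitoring-stack
          rules:
            # Fire when any scrape target has been unreachable for 5 minutes
            - alert: TargetDown
              expr: up == 0
              for: 5m
              labels:
                severity: critical
              annotations:
                summary: "{{ $labels.instance }} has been down for 5 minutes"
```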
Grafana is installed via its own chart; the main options you need to set are:
- The `ingress` config: the external URL which can access Grafana.
- Persistent Volume capacity.
- (Optional) `adminPassword`: if not set, it will be auto-generated and can be found in the `grafana` secret.
- (Optional) `viewers_can_edit`: this allows users with a view-only role to modify the dashboards and access Explore to execute queries against Prometheus and Loki. However, the user will not be able to save any changes.
- (Optional) `deploymentStrategy`: if the Grafana Persistent Volume is of type `ReadWriteOnce`, rolling updates will get stuck as the new pod cannot start before the old one releases the PVC. Setting `deploymentStrategy.type` to `Recreate` destroys the original pod before starting the new one.
Below an example:
```yaml
grafana:
  grafana.ini:
    users:
      viewers_can_edit: "True"
  adminPassword: <adminPassword>
  deploymentStrategy:
    type: Recreate
  ingress:
    ingressClassName: "traefik"
    enabled: true
    hosts:
      - aci-exporter-grafana.apps.c1.cam.ciscolabs.com
  persistence:
    enabled: true
    size: 2Gi
```
The syslog config is the most complicated part, as it relies on three components (`promtail`, `loki` and `syslog-ng`), each with its own individual config. Furthermore, there are two issues we need to overcome:
- The syslog messages don't contain the ACI fabric name: to be able to distinguish the messages of one fabric from another, the only solution is to use dedicated external services with a unique `IP:Port` pair per fabric.
- Until ACI 6.1 we need `syslog-ng` between ACI and Promtail to convert from RFC 3164 to RFC 5424. Note: Promtail 3.1.0 adds support for RFC 3164; however, this DOES NOT work for Cisco switches, so syslog-ng is still required. syslog-ng's `syslog-parser` has extensive logic to handle all the complexities (and inconsistencies) of RFC 3164 messages.
Loki is deployed with the Simple Scalable profile and is composed of `backend`, `read` and `write` deployments, each with a replica count of 3.
The `backend` and `write` deployments require persistent volumes. This chart is pre-configured to allocate 2Gi volumes for each deployment (a total of 6 PVCs will be created):
- 3 x `data-loki-backend-X`
- 3 x `data-loki-write-X`
The PVC size can be easily changed if required.
Loki also requires an Object Store; this chart is pre-configured to deploy minio. Note: currently the Loki chart deploys a very old version of minio, and there is a PR open to address this already.
Loki also supports `chunks-cache` via `memcached`. The default config allocates 8G of memory; I have decreased this to 1G by default.
If you want to change any of these parameters, check the `loki` section in the Values file.
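For example, here is a hedged sketch of bumping the PVC sizes and the chunks-cache memory; the key paths are assumed from the upstream grafana/loki sub-chart, so double-check them against the Values file:

```yaml
loki:
  backend:
    persistence:
      size: 5Gi            # this chart pre-configures 2Gi
  write:
    persistence:
      size: 5Gi            # this chart pre-configures 2Gi
  chunksCache:
    allocatedMemory: 2048  # in MB; this chart lowers the 8G upstream default to 1G
```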
Assuming the default parameters are acceptable, the only required config for Loki is to set `rulerConfig.external_url` to point to the Grafana `ingress` URL:
```yaml
loki:
  loki:
    rulerConfig:
      external_url: http://aci-exporter-grafana.apps.c1.cam.ciscolabs.com
```
These two components are tightly coupled together:
- Syslog-ng translates logs from RFC 3164 to RFC 5424 and forwards them to Promtail.
- Promtail ingests logs in RFC 5424 format and forwards them to Loki.
Promtail is pre-configured with:
- Deployment Mode with 1 replica.
- Loki Push Gateway URL: `loki-gateway`. This is the Loki Gateway K8s service name.
- Auto-generated `scrapeConfigs` that map a fabric to an `IP:Port` pair (sketched below).
These settings can be easily changed if required; check the `promtail` section in the Values file for more details.
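To make the fabric-to-port mapping concrete, here is a hedged sketch of the kind of syslog `scrape_config` the chart generates for one fabric; the exact generated config may differ, and the job name and relabeling shown are assumptions:

```yaml
# Illustrative only: roughly what the auto-generated Promtail config does.
# The chart renders the real scrape_configs from the extraPorts values below.
scrape_configs:
  - job_name: syslog_fab1
    syslog:
      listen_address: 0.0.0.0:1513   # one listener (port) per fabric
      labels:
        fabric: fab1                 # the label used to tell fabrics apart in Loki
    relabel_configs:
      - source_labels: ["__syslog_message_hostname"]
        target_label: host
```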
Syslog-ng is pre-configured with:
- Deployment Mode with 1 replica.
If you are happy with my defaults, the only configs required are setting the `extraPorts` for Promtail and the `services` for Syslog-ng. You will need one entry per fabric, and the ports need to "match"; see the diagram below for a visual representation.
Note: Syslog-ng is only needed for ACI < 6.1.
Below is a diagram of our goal for an ACI 6.1 fabric and an ACI 5.2 one.
```mermaid
flowchart-elk
subgraph K8s Cluster
    subgraph Promtail
        PT1513["TCP:1513 label:fab1"]
        PT1516["TCP:1516 label:fab2"]
    end
    subgraph Syslog-ng
        SL["UDP:1516"]
    end
    F1SVC["LoadBalancerIP TCP:1513"]
    F2SVC["LoadBalancerIP UDP:1516"]
    F1SVC --> PT1513
    F2SVC --> SL
end
ACI61["ACI Fab1 Ver. 6.1"] --> F1SVC
ACI52["ACI Fab2 Ver. 5.2"] --> F2SVC
SL --> PT1516
```
The above architecture can be achieved with the following config:
- `name`: This will set the `fabric` label for the logs received by Loki.
- `containerPort`: The port the container listens on. This maps a log stream to a fabric.
- `service.type`: I would suggest setting this to either `NodePort` or `LoadBalancer`. Regardless, the IP allocated MUST be reachable by all the Fabric Nodes.
- `service.port`: The port the `LoadBalancer` service is listening on; this will be the port you set in the ACI syslog config.
- `service.nodePort`: The port the `NodePort` service is listening on; this will be the port you set in the ACI syslog config (a NodePort variant is sketched after the example below).
```yaml
promtail:
  extraPorts:
    fab1:
      name: fab1
      containerPort: 1513
      service:
        type: LoadBalancer
        port: 1513
    fab2:
      name: fab2
      containerPort: 1516
      service:
        type: ClusterIP

syslog:
  services:
    fab2:
      name: fab2
      containerPort: 1516
      protocol: UDP
      service:
        type: LoadBalancer
        port: 1516
```
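For completeness, here is a hedged sketch of the `NodePort` variant using a hypothetical third fabric `fab3`; the ports are arbitrary, and `nodePort` must fall within the cluster's NodePort range (30000-32767 by default):

```yaml
promtail:
  extraPorts:
    fab3:
      name: fab3            # becomes the fabric label in Loki
      containerPort: 1517   # hypothetical port for this fabric's log stream
      service:
        type: NodePort
        nodePort: 31517     # set this port in the ACI syslog destination
```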
If you need a reminder on how to configure ACI syslog, take a look Here.
Here is an example config for 4 fabrics.
- Create a file containing all your configs, i.e. `aci-mon-stack-config.yaml`
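If it helps as a starting point, here is a minimal sketch of what such a file might look like, stitching together the sections discussed above; hosts and credentials are placeholders, and you should trim it to what you actually need:

```yaml
# aci-mon-stack-config.yaml -- skeleton only, see the sections above for details
aci_exporter:
  fabrics:
    fab1:
      username: <username>
      password: <password>
      apic:
        - https://IP1
      service_discovery: oobMgmtAddr

prometheus:
  server:
    ingress:
      enabled: true
      hosts:
        - <prometheus-url>
    baseURL: "http://<prometheus-url>"

grafana:
  ingress:
    enabled: true
    hosts:
      - <grafana-url>

loki:
  loki:
    rulerConfig:
      external_url: http://<grafana-url>

promtail:
  extraPorts:
    fab1:
      name: fab1
      containerPort: 1513
      service:
        type: LoadBalancer
        port: 1513
```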
Then add the Helm repo and install the stack:

```shell
helm repo add aci-monitoring-stack https://datacenter.github.io/aci-monitoring-stack
helm repo update
helm -n aci-mon-stack upgrade --install --create-namespace aci-mon-stack aci-monitoring-stack/aci-monitoring-stack -f aci-mon-stack-config.yaml
```