Sync meeting on EESSI CVMFS monitoring dashboard (2024 06 14)
- Date: Fri Jun 14th, 14:00 CEST
- Participants
- Bob
- Terje
- Kenneth
- Lara
- Thomas
- Richard
- https://status.eessi.io
- Provide quick overview of current status
- Servers are available
- Servers are in sync (serving the same snapshot)
- Sync server (aws-eu-west-s1-sync.eessi.science) still needs to be added to the status page
- see http://www.eessi.io/docs/filesystem_layer/stratum1
- List issues/maintenance?
- Could point to a page in the docs
- No alerting
- Could be implemented here, or in the internal monitoring
- user-level alerting vs internal alerting?
- API for status page
- can make a JSON blob available which people can query to get an overview of the status, and which they can integrate with their own monitoring (see the sketch after this block)
- Source code:
- Status page: https://github.com/EESSI/status-page
- Scraper: https://github.com/EESSI/cvmfs-server-scraper
- Running on AWS VM in Australia
- Access: Bob, Terje, Alan, Kenneth, Thomas. Use the CVMFS jumphost to get access?
- would using a standard framework like https://status.status.io make sense?
- Only available to some EESSI members
- e.g. via an SSH tunnel
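A sketch of how the JSON status API mentioned above could be consumed; the URL path and JSON schema are assumptions, since the API does not exist yet:

```python
import json
import urllib.request

# Hypothetical endpoint; the actual URL and JSON schema still need to be defined.
STATUS_URL = "https://status.eessi.io/status.json"

with urllib.request.urlopen(STATUS_URL, timeout=10) as response:
    status = json.load(response)

# Assume one entry per server, with its name and the repository revision it serves.
revisions = {srv["name"]: srv["revision"] for srv in status["servers"]}

if len(set(revisions.values())) == 1:
    print("All servers are in sync:", revisions)
else:
    print("Servers are serving different revisions:", revisions)
```

Sites could run something like this from their own monitoring instead of scraping the HTML page.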
- Show more metrics for all CVMFS servers
- Standard ones, e.g. cpu/disk/memory usage
- Outgoing network traffic
- If possible, aggregated by source IP
- Important to find sites that are hammering a Stratum 1
- network from Stratum 1 to Stratum 0
- CVMFS-specific stuff: revision of different CVMFS repos, etc.
- Additional bandwidth monitoring to Stratum 1 servers
- See Alan's script: https://github.com/EESSI/eessi-demo/pull/24
- How do we set this up and integrate it in the dashboard?
- We can set up one or more clients that periodically run Alan's script and ingest the data into Prometheus
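One possible way to wire this up (a sketch, not something decided in the meeting): each client runs the measurement script on a timer and pushes its result to a Prometheus Pushgateway that the Prometheus server scrapes. The gateway address, metric name, and script path below are placeholders.

```python
import subprocess
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "monitoring.example.org:9091"   # placeholder Pushgateway address
CLIENT_SITE = "client-site-a"                 # label identifying this client

registry = CollectorRegistry()
duration = Gauge(
    "eessi_bandwidth_test_duration_seconds",
    "Wall-clock time of the bandwidth/latency test script",
    ["site"],
    registry=registry,
)

start = time.time()
# Placeholder for running Alan's measurement script (see EESSI/eessi-demo PR #24).
subprocess.run(["./bandwidth_test.sh"], check=True)
duration.labels(site=CLIENT_SITE).set(time.time() - start)

# Push the result; Prometheus scrapes the Pushgateway on its normal interval.
push_to_gateway(PUSHGATEWAY, job="eessi_client_bandwidth", registry=registry)
```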
- Available tools
- Grafana + Prometheus?
- Prometheus node exporter: https://github.com/prometheus/node_exporter
- CVMFS Prometheus exporter (https://gitlab.cern.ch/cloud/cvmfs-prometheus-exporter) + Ansible role (https://github.com/terjekv/ansible-cvmfs-prometheus-exporter)
- Use the scraper prometheus exporter: https://github.com/EESSI/cvmfs-server-scraper
- see https://github.com/EESSI/cvmfs-server-scraper/blob/main/scripts/cvmfs_server_scraper_exporter.py (a simplified exporter sketch follows this list)
- Use Grafana or Prometheus Alertmanager for sending alerts
- Does that allow us to detect the same discrepancies as the status page may find?
- Can use the API of the status page to send alerts
- to monitor the monitoring
- heartbeat signal: send a brief report every day, so we know Prometheus/Grafana is still working fine
- Historically, a publicly facing Grafana instance has been a bad idea: there have been a number of CVEs, and it will require follow-up.
- Depending on user count, there might be a need for caching in front of Grafana to ensure that not every hit on the page generates traffic internally towards the backends. Large dashboards can move a lot of data and cause issues with both bandwidth and server load. If there are public graphs, one should use a cache; see https://grafana.com/docs/grafana/latest/dashboards/dashboard-public/. Note that built-in caching and rate limiting in Grafana is only available for Enterprise users.
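As an illustration of what such an exporter boils down to (a simplified sketch, not the actual cvmfs-server-scraper or CERN exporter; the server name, repository list, and port are examples), one can read each repository's .cvmfspublished manifest from a Stratum 1 and expose the revision via the Prometheus Python client:

```python
import time
import urllib.request

from prometheus_client import Gauge, start_http_server

SERVER = "aws-eu-west-s1-sync.eessi.science"    # example server
REPOSITORIES = ["software.eessi.io"]            # example repository list

revision_gauge = Gauge(
    "cvmfs_repo_revision",
    "Revision served for a CVMFS repository",
    ["server", "repo"],
)

def fetch_revision(server: str, repo: str) -> int:
    """Read the revision from the repository manifest (.cvmfspublished)."""
    url = f"http://{server}/cvmfs/{repo}/.cvmfspublished"
    with urllib.request.urlopen(url, timeout=10) as response:
        for line in response.read().decode(errors="replace").splitlines():
            if line.startswith("S"):   # the 'S' field holds the revision number
                return int(line[1:])
    raise ValueError(f"no revision found in manifest of {repo} on {server}")

if __name__ == "__main__":
    start_http_server(9101)            # metrics served on :9101/metrics (example port)
    while True:
        for repo in REPOSITORIES:
            revision_gauge.labels(server=SERVER, repo=repo).set(
                fetch_revision(SERVER, repo)
            )
        time.sleep(60)
```

Grafana/Alertmanager can then alert when the revision of a Stratum 1 lags behind the others.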
- how does this relate to the dashboard effort by SURF?
- that's focused on software testing, performance monitoring
- Grafana has support for hierarchy of services
- Stratum 1 servers not being able to sync may be due to problem with bandwidth to Stratum 0
- we should set up some internal docs
- details on particular services (like status page)
- who has access, who's main contact, etc.
- step-by-step plan for internal monitoring
- set up Prometheus server
- using Ansible playbook
- in the EESSI/cvmfs-servers repo, for now
- make CVMFS servers report data through Prometheus exporters
- standard server metrics
- CVMFS-specific metrics
- can update Ansible playbooks in the EESSI/cvmfs-servers repo
- set up Grafana for nice plots
- set up alerting (Slack, email)
- set up clients that measure latency to the mirror servers (see the sketch after this list)
- collect that info on monitoring server
- set up Prometheus server
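A minimal sketch of such a latency-measuring client (assuming plain HTTP probes against the public CVMFS info endpoint; the server list and port are placeholders), exposing the result as a Prometheus gauge that the monitoring server can scrape:

```python
import time
import urllib.request

from prometheus_client import Gauge, start_http_server

# Example server list; the real list would come from the EESSI/cvmfs-servers inventory.
STRATUM1_SERVERS = [
    "aws-eu-west-s1-sync.eessi.science",
]

latency_gauge = Gauge(
    "eessi_stratum1_latency_seconds",
    "Time to fetch a small file from a Stratum 1 server",
    ["server"],
)

def probe(server: str) -> float:
    """Fetch the CVMFS info endpoint and return the elapsed wall-clock time."""
    start = time.time()
    url = f"http://{server}/cvmfs/info/v1/repositories.json"
    urllib.request.urlopen(url, timeout=10).read()
    return time.time() - start

if __name__ == "__main__":
    start_http_server(9102)   # placeholder port for the client-side exporter
    while True:
        for server in STRATUM1_SERVERS:
            latency_gauge.labels(server=server).set(probe(server))
        time.sleep(300)
```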
- status page update
- to have alerting -> Slack and email (see the sketch below)
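For the Slack part, a minimal sketch using a Slack incoming webhook (the webhook URL is a placeholder; email alerts would go through Alertmanager or a separate SMTP route), which could also double as the daily heartbeat mentioned earlier:

```python
import json
import urllib.request

# Placeholder: an actual Slack incoming-webhook URL would be stored as a secret.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def send_slack_message(text: str) -> None:
    """Post a plain-text message to Slack via an incoming webhook."""
    payload = json.dumps({"text": text}).encode()
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)

if __name__ == "__main__":
    send_slack_message("EESSI monitoring heartbeat: Prometheus/Grafana still alive")
```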
- should we also "monitor" AWS regions in which we have critical stuff running?
- next meeting: Fri 30 Aug'24, 14:00 CEST
Task 1.1 - Providing a stable, optimized, shared scientific software stack with support for established system architectures (20 PM)
Leader: UGent (5 PM); Partners: SURF (7 PM), RIJKSUNIGRON (3 PM), UB (3 PM), UiB (2 PM)
This task will focus on the technical development needed to provide a stable, optimized, shared scientific software stack that supports a broad range of established system architectures. Task 5.1 will define the initial level of support for the shared software stack developed by the EESSI initiative, which will serve as a starting point for this task. Additional architectures to support will be chosen based on the needs of WPs 2, 3 and 4, relevance to domain-specific European scientific communities, the estimated technical complexity, and the overall expected impact, including impact beyond this project. While the shared software stack aims to provide the convenience of portability between systems, there will be a strong focus on performance and scaling, two aspects of paramount importance for exascale systems. Thus, we will benchmark software from the shared software stack and compare its performance against on-premise software stacks to identify potential performance limitations, solicit input from application experts and developers of the scientific software applications, and explore technical solutions to overcome performance issues that can be attributed to how the software was installed. Finally, we will increase the stability of the shared software stack both proactively, by developing monitoring tools, and reactively, based on end-user feedback from WP5.