Sync meeting on EESSI CVMFS monitoring dashboard (2024 06 14)
- Date: Fri Jun 14th, 14:00 CEST
- Participants
- Bob
- Terje
- Kenneth
- Lara
- Thomas
- Richard
- https://status.eessi.io
- Provide quick overview of current status
- Servers are available
- Servers are in sync (serving the same snapshot)
- Sync server (aws-eu-west-s1-sync.eessi.science) still needs to be added to the status page
- see http://www.eessi.io/docs/filesystem_layer/stratum1
- List issues/maintenance?
- Could point to a page in the docs
- No alerting
- Could be implemented here, or in the internal monitoring
- user-level alerting vs internal alerting?
- API for status page
- can make a JSON blob available which people can query to get an overview of the status, and which they can integrate with their own monitoring (see the sketch after this block)
- Source code:
- Status page: https://github.com/EESSI/status-page
- Scraper: https://github.com/EESSI/cvmfs-server-scraper
- Running on AWS VM in Australia
- Access: Bob, Terje, Alan, Kenneth, Thomas. Use the CVMFS jumphost to get access?
- would using a standard framework like https://status.status.io make sense?
- Only available to some EESSI members
- e.g. via an SSH tunnel
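A sketch of how the JSON status API mentioned above could be consumed; the URL path and JSON schema are assumptions, since the API does not exist yet:

```python
import json
import urllib.request

# Hypothetical endpoint; the actual URL and JSON schema still need to be defined.
STATUS_URL = "https://status.eessi.io/status.json"

with urllib.request.urlopen(STATUS_URL, timeout=10) as response:
    status = json.load(response)

# Assume one entry per server, with its name and the repository revision it serves.
revisions = {srv["name"]: srv["revision"] for srv in status["servers"]}

if len(set(revisions.values())) == 1:
    print("All servers are in sync:", revisions)
else:
    print("Servers are serving different revisions:", revisions)
```

Sites could run something like this from their own monitoring instead of scraping the HTML page.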
- Show more metrics for all CVMFS servers
- Standard ones, e.g. cpu/disk/memory usage
- Outgoing network traffic
- If possible, aggregated by source IP
- Important to find sites that are hammering a Stratum 1
- network from Stratum 1 to Stratum 0
- CVMFS-specific stuff: revision of different CVMFS repos, etc.
- Additional bandwidth monitoring to Stratum 1 servers
- See Alan's script: https://github.com/EESSI/eessi-demo/pull/24
- How do we set this up and integrate it in the dashboard?
- We can set up one or more clients that periodically run Alan's script and ingest the data into Prometheus
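One possible way to wire this up (a sketch, not something decided in the meeting): each client runs the measurement script on a timer and pushes its result to a Prometheus Pushgateway that the Prometheus server scrapes. The gateway address, metric name, and script path below are placeholders.

```python
import subprocess
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "monitoring.example.org:9091"   # placeholder Pushgateway address
CLIENT_SITE = "client-site-a"                 # label identifying this client

registry = CollectorRegistry()
duration = Gauge(
    "eessi_bandwidth_test_duration_seconds",
    "Wall-clock time of the bandwidth/latency test script",
    ["site"],
    registry=registry,
)

start = time.time()
# Placeholder for running Alan's measurement script (see EESSI/eessi-demo PR #24).
subprocess.run(["./bandwidth_test.sh"], check=True)
duration.labels(site=CLIENT_SITE).set(time.time() - start)

# Push the result; Prometheus scrapes the Pushgateway on its normal interval.
push_to_gateway(PUSHGATEWAY, job="eessi_client_bandwidth", registry=registry)
```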
- Available tools
- Grafana + Prometheus?
- Prometheus node exporter: https://github.com/prometheus/node_exporter
- CVMFS Prometheus exporter (https://gitlab.cern.ch/cloud/cvmfs-prometheus-exporter) + Ansible role (https://github.com/terjekv/ansible-cvmfs-prometheus-exporter)
- Use the scraper prometheus exporter: https://github.com/EESSI/cvmfs-server-scraper
- see https://github.com/EESSI/cvmfs-server-scraper/blob/main/scripts/cvmfs_server_scraper_exporter.py (a simplified exporter sketch follows this list)
- Use Grafana or Prometheus Alertmanager for sending alerts
- Does that allow us to detect the same discrepancies as the status page may find?
- Can use the API of the status page to send alerts
- to monitor the monitoring
- heartbeat signal: send a brief report every day, so we know Prometheus/Grafana is still working fine
- Historically, a publicly facing Grafana instance has been a bad idea: there have been a number of CVEs, and it will require follow-up.
- Depending on user count, there might be a need for caching in front of Grafana to ensure that not every hit on the page generates traffic internally towards the backends. Large dashboards can move a lot of data and cause issues with both bandwidth and server load. If there are public graphs, one should use a cache; see https://grafana.com/docs/grafana/latest/dashboards/dashboard-public/. Note that built-in caching and rate limiting in Grafana is only available for Enterprise users.
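As an illustration of what such an exporter boils down to (a simplified sketch, not the actual cvmfs-server-scraper or CERN exporter; the server name, repository list, and port are examples), one can read each repository's .cvmfspublished manifest from a Stratum 1 and expose the revision via the Prometheus Python client:

```python
import time
import urllib.request

from prometheus_client import Gauge, start_http_server

SERVER = "aws-eu-west-s1-sync.eessi.science"    # example server
REPOSITORIES = ["software.eessi.io"]            # example repository list

revision_gauge = Gauge(
    "cvmfs_repo_revision",
    "Revision served for a CVMFS repository",
    ["server", "repo"],
)

def fetch_revision(server: str, repo: str) -> int:
    """Read the revision from the repository manifest (.cvmfspublished)."""
    url = f"http://{server}/cvmfs/{repo}/.cvmfspublished"
    with urllib.request.urlopen(url, timeout=10) as response:
        for line in response.read().decode(errors="replace").splitlines():
            if line.startswith("S"):   # the 'S' field holds the revision number
                return int(line[1:])
    raise ValueError(f"no revision found in manifest of {repo} on {server}")

if __name__ == "__main__":
    start_http_server(9101)            # metrics served on :9101/metrics (example port)
    while True:
        for repo in REPOSITORIES:
            revision_gauge.labels(server=SERVER, repo=repo).set(
                fetch_revision(SERVER, repo)
            )
        time.sleep(60)
```

Grafana/Alertmanager can then alert when the revision of a Stratum 1 lags behind the others.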
- how does this relate to the dashboard effort by SURF?
- that's focused on software testing, performance monitoring
- Grafana has support for hierarchy of services
- Stratum 1 servers not being able to sync may be due to problem with bandwidth to Stratum 0
- we should set up some internal docs
- details on particular services (like status page)
- who has access, who's main contact, etc.
- step-by-step plan for internal monitoring
- set up Prometheus server
- using Ansible playbook
- in the EESSI/cvmfs-servers repo, for now
- make CVMFS servers report data through Prometheus exporters
- standard server metrics
- CVMFS-specific metrics
- can update Ansible playbooks in the EESSI/cvmfs-servers repo
- set up Grafana for nice plots
- set up alerting (Slack, email)
- set up clients that measure latency to the mirror servers (see the sketch after this list)
- collect that info on monitoring server
- set up Prometheus server
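A minimal sketch of such a latency-measuring client (assuming plain HTTP probes against the public CVMFS info endpoint; the server list and port are placeholders), exposing the result as a Prometheus gauge that the monitoring server can scrape:

```python
import time
import urllib.request

from prometheus_client import Gauge, start_http_server

# Example server list; the real list would come from the EESSI/cvmfs-servers inventory.
STRATUM1_SERVERS = [
    "aws-eu-west-s1-sync.eessi.science",
]

latency_gauge = Gauge(
    "eessi_stratum1_latency_seconds",
    "Time to fetch a small file from a Stratum 1 server",
    ["server"],
)

def probe(server: str) -> float:
    """Fetch the CVMFS info endpoint and return the elapsed wall-clock time."""
    start = time.time()
    url = f"http://{server}/cvmfs/info/v1/repositories.json"
    urllib.request.urlopen(url, timeout=10).read()
    return time.time() - start

if __name__ == "__main__":
    start_http_server(9102)   # placeholder port for the client-side exporter
    while True:
        for server in STRATUM1_SERVERS:
            latency_gauge.labels(server=server).set(probe(server))
        time.sleep(300)
```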
- status page update
- to have alerting -> Slack and email (see the sketch below)
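For the Slack part, a minimal sketch using a Slack incoming webhook (the webhook URL is a placeholder; email alerts would go through Alertmanager or a separate SMTP route), which could also double as the daily heartbeat mentioned earlier:

```python
import json
import urllib.request

# Placeholder: an actual Slack incoming-webhook URL would be stored as a secret.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def send_slack_message(text: str) -> None:
    """Post a plain-text message to Slack via an incoming webhook."""
    payload = json.dumps({"text": text}).encode()
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)

if __name__ == "__main__":
    send_slack_message("EESSI monitoring heartbeat: Prometheus/Grafana still alive")
```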
- should we also "monitor" AWS regions in which we have critical stuff running?
- next meeting: Fri 30 Aug'24, 14:00 CEST
Task 1.1 - Providing a stable, optimized, shared scientific software stack with support for established system architectures (20 PM)
Leader: UGent (5 PM); Partners: SURF (7 PM), RIJKSUNIGRON (3 PM), UB (3 PM), UiB (2 PM)
This task will focus on the technical development needed to provide a stable, optimized, shared scientific software stack that supports a broad range of established system architectures. Task 5.1 will define the initial level of support for the shared software stack developed by the EESSI initiative, which will serve as a starting point for this task. Additional architectures to support will be chosen based on the needs of WPs 2, 3 and 4, relevance to domain-specific European scientific communities, the estimated technical complexity, and the overall expected impact, including impact beyond this project. While the shared software stack aims to provide the convenience of portability between systems, there will be a strong focus on performance and scaling, two aspects of paramount importance for exascale systems. Thus, we will benchmark software from the shared software stack and compare its performance against on-premise software stacks to identify potential performance limitations, solicit input from application experts and developers of the scientific software applications, and explore technical solutions to overcome performance issues that can be attributed to how the software was installed. Finally, we will increase the stability of the shared software stack both proactively, by developing monitoring tools, and reactively, based on end-user feedback from WP5.