
Sync meeting on EESSI CVMFS monitoring dashboard (2024-06-14)


EESSI CVMFS monitoring

  • Date: Fri Jun 14th, 14:00 CEST
  • Participants
    • Bob
    • Terje
    • Kenneth
    • Lara
    • Thomas
    • Richard

Public-facing dashboard / status page

Internal dashboard

How to present internal data

Considerations for a public-facing Grafana dashboard

  • Historically, a publicly facing Grafana instance has been a bad idea: there have been a number of CVEs, and it will require ongoing follow-up.
  • Depending on the number of users, a cache may be needed in front of Grafana so that not every hit on the page generates traffic towards the internal backends; large dashboards can move a lot of data and cause issues with both bandwidth and server load (see the sketch below).
    • If there are public graphs, use a cache; see https://grafana.com/docs/grafana/latest/dashboards/dashboard-public/.
    • Note that built-in caching and rate limiting in Grafana are only available to Enterprise users.
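
A minimal sketch of the caching idea, for illustration only: in practice this layer would be nginx, Varnish or a CDN rather than Python, and the internal Grafana address, port and TTL below are placeholders, not an agreed configuration. Repeated public hits on the same dashboard path reach Grafana at most once per TTL.

    import time
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    import requests

    GRAFANA_URL = "http://grafana.internal:3000"  # placeholder internal Grafana address
    CACHE_TTL = 60  # seconds a cached response stays valid

    # path -> (timestamp, status code, content type, body)
    _cache: dict[str, tuple[float, int, str, bytes]] = {}


    class CachingProxy(BaseHTTPRequestHandler):
        """Serve GET requests from a TTL cache, forwarding misses to Grafana."""

        def do_GET(self):
            entry = _cache.get(self.path)
            if entry is None or time.monotonic() - entry[0] > CACHE_TTL:
                # Cache miss or stale entry: fetch from the internal Grafana once
                resp = requests.get(GRAFANA_URL + self.path, timeout=30)
                entry = (
                    time.monotonic(),
                    resp.status_code,
                    resp.headers.get("Content-Type", "text/html"),
                    resp.content,
                )
                _cache[self.path] = entry
            _, status, content_type, body = entry
            self.send_response(status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)


    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), CachingProxy).serve_forever()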

Notes

  • how does this relate to the dashboard effort by SURF?
    • that effort is focused on software testing and performance monitoring
  • Grafana has support for a hierarchy of services
    • Stratum 1 servers not being able to sync may be due to a bandwidth problem towards the Stratum 0
  • we should set up some internal docs
    • details on particular services (like status page)
    • who has access, who the main contact is, etc.
  • step-by-step plan for internal monitoring
    • set up Prometheus server
      • using Ansible playbook
      • in EESSI/cvmfs-servers repo, for now
    • make CVMFS servers report data through Prometheus exporters
      • standard server metrics
      • CVMFS-specific metrics
      • can update Ansible playbooks in EESSI/cvmfs-servers repo
    • set up Grafana for nice plots
    • set up alerting (Slack, email); a standalone check sketch follows after this list
    • set up clients that measure latency to the mirror servers (see the probe sketch after this list)
      • collect that info on the monitoring server
  • status page update
    • add alerting -> Slack and email
  • should we also "monitor" AWS regions in which we have critical stuff running?
  • next meeting: Fri 30 Aug'24, 14:00 CEST
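
The plan above includes clients that measure latency to the mirror servers and report it to the monitoring server. Below is a minimal probe sketch, assuming the prometheus_client Python library; the Stratum 1 hostnames, repository, metric names and port are placeholders, not a confirmed EESSI configuration.

    """Latency probe for CVMFS Stratum 1 servers, exposed as Prometheus metrics."""
    import time

    import requests
    from prometheus_client import Gauge, start_http_server

    # Placeholder Stratum 1 endpoints; replace with the real mirror servers.
    STRATUM1_SERVERS = [
        "http://stratum1-eu.example.org",
        "http://stratum1-us.example.org",
    ]
    REPOSITORY = "software.eessi.io"  # repository whose manifest is fetched

    # Metric names are assumptions made for this sketch, not an agreed schema.
    LATENCY = Gauge(
        "cvmfs_stratum1_fetch_latency_seconds",
        "Time to fetch .cvmfspublished from a Stratum 1 server",
        ["server"],
    )
    UP = Gauge(
        "cvmfs_stratum1_reachable",
        "1 if the Stratum 1 server answered, 0 otherwise",
        ["server"],
    )


    def probe(server: str) -> None:
        """Time one fetch of the repository manifest from a Stratum 1 server."""
        url = f"{server}/cvmfs/{REPOSITORY}/.cvmfspublished"
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            LATENCY.labels(server=server).set(time.monotonic() - start)
            UP.labels(server=server).set(1)
        except requests.RequestException:
            UP.labels(server=server).set(0)


    if __name__ == "__main__":
        start_http_server(9101)  # port the Prometheus server would scrape
        while True:
            for server in STRATUM1_SERVERS:
                probe(server)
            time.sleep(60)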
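
For the alerting step, Prometheus Alertmanager would normally route alerts to Slack and email; purely as an illustration, the sketch below polls the Prometheus HTTP API for the hypothetical reachability metric from the previous sketch and posts to a Slack incoming webhook. The Prometheus address and webhook URL are placeholders.

    """Standalone availability check against the Prometheus HTTP API."""
    import requests

    PROMETHEUS_URL = "http://monitoring.example.org:9090"   # placeholder
    SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"  # placeholder
    QUERY = "cvmfs_stratum1_reachable == 0"  # metric name from the probe sketch above


    def unreachable_servers() -> list[str]:
        """Return the Stratum 1 servers currently reported as unreachable."""
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
        )
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return [sample["metric"]["server"] for sample in result]


    def notify(servers: list[str]) -> None:
        """Post a short message to the Slack incoming webhook."""
        text = "Stratum 1 servers unreachable: " + ", ".join(servers)
        requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10)


    if __name__ == "__main__":
        servers = unreachable_servers()
        if servers:
            notify(servers)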

MultiXscale

Task 1.1 - Providing a stable, optimized, shared scientific software stack with support for established system architectures (20 PM)

Leader: UGent (5 PM); Partners: SURF (7 PM), RIJKSUNIGRON (3 PM), UB (3 PM), UiB (2 PM)

This task will focus on the technical development needed to provide a stable, optimized, shared scientific software stack that supports a broad range of established system architectures. Task 5.1 will define the initial level of support for the shared software stack developed by the EESSI initiative, which will serve as a starting point for this task. Additional architectures to support will be chosen based on the needs of WPs 2, 3 and 4, relevance to the domain-specific European scientific communities, the estimated technical complexity, and the overall expected impact – including impact beyond this project. While the shared software stack aims to provide the convenience of portability between systems, there will be a strong focus on performance and scaling – two aspects of paramount importance for exascale systems. Thus, we will benchmark software from the shared software stack and compare the performance against on-premise software stacks to identify potential performance limitations, solicit input from application experts and developers of the scientific software applications, and explore technical solutions to overcome performance issues that can be attributed to how the software was installed. Finally, we will increase the stability of the shared software stack both proactively, by developing monitoring tools, and reactively, based on the end-user feedback from WP5.
