
Sync meeting on EESSI CVMFS monitoring dashboard (2024 08 30)

Bob Dröge edited this page Aug 30, 2024 · 1 revision

Sync meeting (2024-08-30)

  • Date: Fri Aug 30th, 13:30 CEST
  • Participants
    • Bob
    • Pedro
    • Lara
    • Thomas
    • Richard
    • Kenneth
    • Terje

Dashboard

Bob has been working on Ansible playbooks, which live in the cvmfs-servers repo.

  • one playbook installs the monitoring server with Grafana and Prometheus
  • other playbooks install the Prometheus exporters on the other machines so we can collect data from them
  • see also https://github.com/EESSI/cvmfs-servers/pull/12
    • bit outdated, Bob needs to sync recent changes
    • node-exporter.json (24k file) is an export of the project dashboard, so it can easily be restored elsewhere if needed
  • Demo runs in AWS on test machines for the moment
  • Dashboard based on node exporter data; will allow selecting between the different CVMFS servers (s0, s1s, etc.)
  • only the monitoring server can access the Prometheus exporter endpoints on the CVMFS servers
  • Ansible playbooks are run from Bob's laptop or from Stratum 0 as jumphost
  • Add information on access as a wiki page in the support portal documentation.
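
The per-server selection mentioned above relies on Prometheus knowing which exporter belongs to which kind of CVMFS server. A minimal sketch of one way to do that, assuming Prometheus file-based service discovery is used: generate a target list with one group per server role. The hostnames and the `cvmfs_role` label name are hypothetical placeholders, not taken from the actual playbooks; 9100 is node exporter's default port.

```python
import json

NODE_EXPORTER_PORT = 9100  # node exporter's default listen port

# Hypothetical inventory: placeholder hostnames, not the real EESSI machines.
SERVERS = {
    "stratum0": ["s0.example.org"],
    "stratum1": ["s1-a.example.org", "s1-b.example.org"],
}


def build_file_sd(servers, port=NODE_EXPORTER_PORT):
    """Return a Prometheus file_sd target list, one group per server role."""
    groups = []
    for role, hosts in sorted(servers.items()):
        groups.append({
            "targets": [f"{host}:{port}" for host in hosts],
            # Hypothetical label; a dashboard variable could filter on it.
            "labels": {"cvmfs_role": role},
        })
    return groups


if __name__ == "__main__":
    # Write this output to a file referenced from file_sd_configs.
    print(json.dumps(build_file_sd(SERVERS), indent=2))
```

A Grafana dashboard variable could then filter on the role label, though how the real dashboard selects servers is not recorded in these notes.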

Alerting is not yet working, but some alerts can be set up using Prometheus. The plan is to include CVMFS exporters to get CVMFS metrics from the relevant machines, and to build alerts and dashboards on top of those exporters. Would we want to send all alerts from one single location/source? Yes. We could also have another stream of alerts, separate from the CVMFS ones (for example, on select tests). The status page also creates a JSON file that we can fetch with Prometheus and define some alerts on.

  • can we set up client systems that can report back to monitoring server?
    • could run a daily Slurm job from the HPC clusters we have access to
  • can also monitor S3 bucket used by CVMFS sync server via AWS CloudFront
  • we need to create overview of EESSI infrastructure
    • who can access what
    • who's responsible (+ backup) for each component
    • should cover:
      • CVMFS servers (incl. jumphost)
      • EESSI status page
      • build clusters in AWS/Azure
      • GitHub org + repos
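
The client-side check discussed above (for example, run from a daily Slurm job) could look roughly like the sketch below. It only checks that a repository directory under `/cvmfs` can be listed, and prints the result in the Prometheus text exposition format. The metric name `eessi_client_repo_accessible`, the repository name used in the example, and the way the output reaches the monitoring server (Pushgateway, node exporter textfile collector, or something else) are all assumptions; none of this was decided in the meeting.

```python
import os


def repo_accessible(mount_point):
    """Return True if the mount point is a listable, non-empty directory."""
    try:
        return len(os.listdir(mount_point)) > 0
    except OSError:
        # Covers missing paths, permission problems and failed mounts.
        return False


def render_metrics(results):
    """Render {repo: bool} check results in Prometheus text exposition format."""
    lines = [
        "# HELP eessi_client_repo_accessible 1 if the repository is readable on this client.",
        "# TYPE eessi_client_repo_accessible gauge",
    ]
    for repo, ok in sorted(results.items()):
        lines.append(f'eessi_client_repo_accessible{{repo="{repo}"}} {int(ok)}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    repos = ["software.eessi.io"]  # assumed repository name, adjust as needed
    results = {r: repo_accessible(os.path.join("/cvmfs", r)) for r in repos}
    print(render_metrics(results), end="")
```

Listing the directory (rather than only testing that the path exists) is deliberate: on autofs-based setups the listing is what triggers the mount.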

Status page

https://status.eessi.io/test/

  • should the monitoring server be responsible for all alerting to Slack/email? => YES
    • status page provides a JSON file that can be pulled in by Prometheus, see https://status.eessi.io/test/status.json

https://status.eessi.io/test/config.json defines the rules for each status. We can change these rules without having to change the code, only the config. We should add documentation on this setup/access to the wiki. The dashboard (through Prometheus) can get this info from the JSON files to visualise it and send alerts. Bob will look into this. Terje will move the current test version into production (https://status.eessi.io).
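
To alert on the status JSON, its status strings first have to become numeric values that Prometheus can compare against. The sketch below shows one possible conversion. The JSON layout (a flat mapping of component name to status string), the status names and the metric name `eessi_status` are all hypothetical; the real schema of status.json and config.json is not described in these notes.

```python
import json

# Hypothetical status names and values; the real ones are defined in config.json.
STATUS_VALUES = {"OK": 0, "DEGRADED": 1, "FAILED": 2}
UNKNOWN = -1  # used for any status string not listed above


def status_to_metrics(status_doc):
    """Turn {"component": "STATUS", ...} into Prometheus exposition lines."""
    lines = ["# TYPE eessi_status gauge"]
    for component, status in sorted(status_doc.items()):
        value = STATUS_VALUES.get(str(status).upper(), UNKNOWN)
        lines.append(f'eessi_status{{component="{component}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed example document, not the actual content of status.json.
    sample = json.loads('{"stratum0": "OK", "stratum1": "DEGRADED"}')
    print(status_to_metrics(sample), end="")
```

With a gauge like this, an alert rule could fire whenever the value is non-zero for a given component. Unknown strings map to -1 so that a schema change shows up as an alert rather than passing silently.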

Other topics

Where do we keep secrets like passwords and keys?

  • shared secrets via a KeePass database file, with a master password known by the people who need it
  • Can do something similar with Ansible Vault(?).
  • Looking at a production setup for this: will probably start on that around November so we can have the deliverable ready on time.
    • Basically same deliverable as before, but now explaining how we went from the test environment to production. Write something about disaster recovery, not necessarily something that needs to be tested right now but something that explains the implications of this.
  • Next meeting Friday 18th October 13:30 CEST