Improve and increase coverage of alerting about issues in our system #1261

arealmaas · 2024-10-10T07:51:58Z

Continuation of issue #75.

Following up on the mentioned issue, we need to improve alerts and insight into errors/issues in our system.

Considerations:

Errors and outages in external dependencies (Maskinporten, Autorisasjon, Events)
- Health check that verifies that the well-known endpoint is reachable.
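
A minimal sketch of such a reachability check, assuming a plain HTTP GET against the well-known endpoint is enough. The URL below is a placeholder and the function names are hypothetical; nothing here is taken from the actual health-check implementation.

```python
import urllib.error
import urllib.request
from typing import Callable

# Placeholder URL (assumption); substitute the real well-known endpoint.
WELL_KNOWN_URL = "https://example.invalid/.well-known/oauth-authorization-server"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def is_reachable(url: str, fetch_status: Callable[[str], int] = http_status) -> bool:
    """Treat the endpoint as healthy only when it answers with a 2xx status."""
    return 200 <= fetch_status(url) < 300
```

`fetch_status` is injectable so the check can be exercised without network access, e.g. `is_reachable(WELL_KNOWN_URL, lambda _: 503)` evaluates to `False`.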

Here is a list of typical failure situations that we need to alert on:

Degradation or loss of components (liveness/readiness)
Errors from, or loss of, external dependencies (Maskinporten, Autorisasjon, Events)

Other situations that indicate incidents which may negatively affect the service level:

Spikes in the number of 4xx errors
Spikes in traffic from individual parties
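
One possible way to define a "spike", sketched here under the assumption that we have request counts per fixed interval: compare the latest count against the mean plus a few standard deviations of the preceding intervals. The function name, the `sigmas` factor and the `min_count` floor are illustrative choices, not agreed values.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_spike(
    history: Sequence[int],
    current: int,
    sigmas: float = 3.0,
    min_count: int = 10,
) -> bool:
    """Flag `current` when it exceeds the baseline of the preceding intervals.

    history:   counts (e.g. 4xx responses) for the preceding intervals
    sigmas:    how many standard deviations above the mean counts as a spike
    min_count: absolute floor, so very low traffic does not trigger alerts
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    threshold = max(mean(history) + sigmas * pstdev(history), min_count)
    return current > threshold
```

The same function could be applied per party by keeping one `history` per party identifier, which would cover the traffic-from-individual-parties case as well.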

Infrastructure:

PostgreSQL (slow queries, failed queries, high CPU/memory, database size)
Redis (slowness, high CPU/memory)
Service Bus (dead-letter queue, slow consumption time, CPU/memory)
Container Apps (restart count, crash-looping containers, high CPU/memory)
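
The infrastructure items above could be expressed as simple threshold rules. This is only a sketch: the metric names and threshold values are placeholders for illustration, and in practice the rules would more likely live in the monitoring tool itself.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    resource: str     # e.g. "postgresql"
    metric: str       # e.g. "cpu_percent"
    max_value: float  # alert when the observed value is above this


# Placeholder thresholds (assumptions), not agreed values.
RULES = [
    Rule("postgresql", "cpu_percent", 80.0),
    Rule("redis", "memory_percent", 80.0),
    Rule("servicebus", "dead_letter_messages", 0.0),
    Rule("container_apps", "restart_count", 3.0),
]


def breached(
    rules: Sequence[Rule],
    observations: Mapping[Tuple[str, str], float],
) -> list:
    """Return the rules whose observed value is above the threshold.

    Metrics that are missing from `observations` are ignored.
    """
    return [
        rule
        for rule in rules
        if observations.get((rule.resource, rule.metric), float("-inf")) > rule.max_value
    ]
```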

Create Slack channels for each environment, to make the severity of alerts clear?
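
If we go for one channel per environment, the routing could be as simple as a lookup table. The environment and channel names below are hypothetical and only illustrate the idea.

```python
# Hypothetical environment and channel names (assumptions).
CHANNEL_BY_ENVIRONMENT = {
    "prod": "#alerts-prod",
    "staging": "#alerts-staging",
    "test": "#alerts-test",
}

# Alerts from unrecognised environments end up here instead of being dropped.
DEFAULT_CHANNEL = "#alerts-unknown"


def channel_for(environment: str) -> str:
    """Pick the Slack channel for an alert based on its environment name."""
    return CHANNEL_BY_ENVIRONMENT.get(environment.strip().lower(), DEFAULT_CHANNEL)
```

Falling back to a default channel rather than raising means a misconfigured environment name still produces a visible alert.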


arealmaas · 2024-10-10T13:28:02Z

Discussions:

https://digdir.slack.com/archives/C079D6PAGDS/p1728552564466459
https://digdir.slack.com/archives/C079XRW5G5A/p1728559035522959

## Description

We are getting spammed by health-check alerts in Slack. For now these are just probes whose errors Kubernetes should handle internally.

## Related Issue(s)

- #1261

## Verification

- [ ] **Your** code builds clean without any errors or warnings
- [ ] Manual testing done (required)
- [ ] Relevant automated test added (if you find this hard, leave it and we'll help out)

## Documentation

- [ ] Documentation is updated (either in `docs`-directory, Altinnpedia or a separate linked PR in [altinn-studio-docs.](https://github.com/Altinn/altinn-studio-docs), if applicable)

## Summary by CodeRabbit

- **New Features**
  - Introduced a new parameter for improved configuration of the Slack Notifier function app.
  - Enhanced security with a system-assigned identity for the function app.
  - Added a new module for managing application settings seamlessly.
  - Implemented an exception alert rule to monitor and notify the development team of issues.
- **Improvements**
  - Enhanced resource management and monitoring capabilities for better operational efficiency.

arealmaas · 2024-11-18T11:29:25Z

Alerting on metrics-related issues could be done in the Grafana dashboard if we decide to go that route. I suggest holding off on this task until we have looked into #1456.

arealmaas · 2025-01-14T15:39:03Z

Grafana instance now available, so let's create some alerts!

arealmaas mentioned this issue Oct 10, 2024

Etablere logging og monitorering #74

Open

github-project-automation bot added this to ⚠️ Dialogporten / Arbeidsflate - GAMMEL - se https://github.com/orgs/Altinn/projects/146 ⚠️ Oct 10, 2024

github-project-automation bot moved this to New issues in ⚠️ Dialogporten / Arbeidsflate - GAMMEL - se https://github.com/orgs/Altinn/projects/146 ⚠️ Oct 10, 2024

arealmaas assigned arealmaas and knuhau Oct 10, 2024

arealmaas mentioned this issue Oct 10, 2024

chore(slacknotifier): remove health checks from slack alerts #1269

Merged

4 tasks

arealmaas mentioned this issue Oct 22, 2024

Health checks mot PostgreSQL, Azure Service Bus, og Altinn #292

Closed

elsand moved this from New issues to Backlog in ⚠️ Dialogporten / Arbeidsflate - GAMMEL - se https://github.com/orgs/Altinn/projects/146 ⚠️ Nov 5, 2024

elsand moved this from Backlog to Ready in ⚠️ Dialogporten / Arbeidsflate - GAMMEL - se https://github.com/orgs/Altinn/projects/146 ⚠️ Nov 5, 2024

arealmaas added the monitoring label (Issue related to logging and monitoring) Nov 19, 2024

elsand added this to Dialogporten / Arbeidsflate - NY Jan 9, 2025

elsand moved this to Ready in Dialogporten / Arbeidsflate - NY Jan 9, 2025
