
Improve and increase coverage of alerting about issues in our system #1261

Open
1 task
Tracked by #74
arealmaas opened this issue Oct 10, 2024 · 3 comments
Labels
monitoring Issue related to logging and monitoring

Comments

@arealmaas
Collaborator

arealmaas commented Oct 10, 2024

Continuation of issue: #75

Following up on the mentioned issue, we need to improve alerts and insight into errors/issues in our system.

Considerations:

  • Errors and outages in external dependencies (Maskinporten, Autorisasjon, Events)
    • Health check that verifies that the well-known endpoint is reachable.
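A minimal sketch of what such a health check could look like, assuming the well-known endpoint serves an OAuth/OpenID metadata document in JSON with an `issuer` field. The function names, the `HealthResult` type and the URL in the usage note are hypothetical and not taken from our codebase.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class HealthResult:
    healthy: bool
    detail: str


def http_get(url: str, timeout: float = 5.0) -> Tuple[int, bytes]:
    """Fetch a URL and return (status code, body). Raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read()


def check_well_known(
    url: str,
    fetch: Callable[[str], Tuple[int, bytes]] = http_get,
) -> HealthResult:
    """Report whether the metadata document at `url` is reachable and plausible."""
    try:
        status, body = fetch(url)
    except Exception as exc:  # timeout, DNS failure, connection refused, HTTP error
        return HealthResult(False, f"unreachable: {exc}")
    if status != 200:
        return HealthResult(False, f"unexpected status {status}")
    try:
        document = json.loads(body)
    except ValueError:
        return HealthResult(False, "response is not valid JSON")
    # Assumption: a valid metadata document is a JSON object with an "issuer" field.
    if not isinstance(document, dict) or "issuer" not in document:
        return HealthResult(False, "metadata document has no 'issuer' field")
    return HealthResult(True, "ok")
```

Calling `check_well_known("https://example.org/.well-known/openid-configuration")` (a placeholder URL) would return a `HealthResult` that a health endpoint or alert rule could act on. The `fetch` parameter is only there so the check can be tested without network access.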

Here is a list of typical failure situations that we need to alert on:

  • Degradation/loss of components (liveness/readiness)
  • Errors from/loss of external dependencies (Maskinporten, Autorisasjon, Events)

Other situations that indicate incidents that could negatively affect the service level:

  • Spikes in the number of 4xx errors
  • Spikes in traffic from individual parties
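One possible way to define a "spike", sketched here as an assumption rather than a decided approach: compare the count in the current time window against the mean and standard deviation of recent windows. The function name, the default of three standard deviations and the minimum count are all hypothetical tuning choices.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_spike(
    history: Sequence[int],
    current: int,
    sigmas: float = 3.0,
    min_count: int = 20,
) -> bool:
    """Return True if `current` is unusually high compared to `history`.

    `history` holds counts from previous windows (for example 4xx responses
    per five minutes); `current` is the count in the latest window.
    """
    # Too little data or too little traffic to call anything a spike.
    if len(history) < 2 or current < min_count:
        return False
    baseline = mean(history)
    # Floor the spread at 1.0 so a perfectly flat history does not make
    # every small increase look like a spike.
    spread = max(pstdev(history), 1.0)
    return current > baseline + sigmas * spread
```

The same check could be run per party by keeping one history per party identifier, which would cover the traffic-spike case as well.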

Infrastructure:

  • PostgreSQL (slow queries, failed queries, high cpu/mem, database size)
  • Redis (slowness, high cpu/mem)
  • Service Bus (dead-letter queue, slow consumption time, cpu/mem)
  • Container apps (restart count, crashlooping containers, high cpu/mem)
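As a rough illustration of how the infrastructure items could be expressed as threshold rules, here is a small sketch. The metric names and limits are made up for the example and would need to be replaced with whatever our monitoring backend actually exposes.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ThresholdRule:
    metric: str       # hypothetical metric name
    limit: float      # alert when the sampled value is above this
    description: str


# Assumed example rules; names and limits are placeholders.
RULES = [
    ThresholdRule("postgres.cpu_percent", 80.0, "PostgreSQL CPU is high"),
    ThresholdRule("redis.memory_percent", 80.0, "Redis memory is high"),
    ThresholdRule("servicebus.dead_letter_count", 0.0, "Messages in dead-letter queue"),
    ThresholdRule("containerapp.restart_count", 3.0, "Container app is restarting"),
]


def evaluate(samples: Dict[str, float], rules: Sequence[ThresholdRule] = RULES) -> List[str]:
    """Return one alert message per rule whose metric exceeds its limit."""
    alerts = []
    for rule in rules:
        value = samples.get(rule.metric)
        # Metrics missing from the sample are skipped, not treated as failures.
        if value is not None and value > rule.limit:
            alerts.append(f"{rule.description}: {rule.metric}={value} (limit {rule.limit})")
    return alerts
```

For example, `evaluate({"postgres.cpu_percent": 95.0})` would yield a single alert for PostgreSQL CPU, while an empty sample yields no alerts.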

Create Slack channels for each environment, to make the severity of alerts clear?
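If we go for per-environment channels, the routing could be as simple as a lookup table. The sketch below assumes three environments and invents the channel names and severity labels; it only builds the message and does not send anything to Slack.

```python
from typing import Dict, Tuple

# Hypothetical environment names, channel names and severities.
CHANNELS = {"prod": "#alerts-prod", "staging": "#alerts-staging", "test": "#alerts-test"}
SEVERITY = {"prod": "critical", "staging": "warning", "test": "info"}


def route_alert(environment: str, summary: str) -> Tuple[str, Dict[str, str]]:
    """Return (channel, payload) for an alert raised in `environment`.

    The payload uses the plain {"text": ...} shape that Slack incoming
    webhooks accept.
    """
    channel = CHANNELS.get(environment, "#alerts-unknown")
    severity = SEVERITY.get(environment, "info")
    text = f"[{severity.upper()}] ({environment}) {summary}"
    return channel, {"text": text}
```

In practice each channel would likely have its own webhook URL, so the returned channel name would be used to pick the webhook to post the payload to.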

Tasks

@arealmaas
Collaborator Author

arealmaas added a commit that referenced this issue Oct 11, 2024

## Description


We are getting spammed by health-check alerts in Slack. For now these
are just probes whose failures Kubernetes should handle internally.

## Related Issue(s)

- #1261

## Verification

- [ ] **Your** code builds clean without any errors or warnings
- [ ] Manual testing done (required)
- [ ] Relevant automated test added (if you find this hard, leave it and
we'll help out)

## Documentation

- [ ] Documentation is updated (either in `docs`-directory, Altinnpedia
or a separate linked PR in
[altinn-studio-docs.](https://github.com/Altinn/altinn-studio-docs), if
applicable)



## Summary by CodeRabbit

- **New Features**
  - Introduced a new parameter for improved configuration of the Slack
    Notifier function app.
  - Enhanced security with a system-assigned identity for the function
    app.
  - Added a new module for managing application settings seamlessly.
  - Implemented an exception alert rule to monitor and notify the
    development team of issues.

- **Improvements**
  - Enhanced resource management and monitoring capabilities for better
    operational efficiency.

@arealmaas
Collaborator Author

Metric-based alerting could be set up in the Grafana dashboard if we decide to go with that. I suggest putting this task on hold until we have looked into #1456.

@arealmaas arealmaas added the monitoring Issue related to logging and monitoring label Nov 19, 2024
@arealmaas
Collaborator Author

Grafana instance now available, so let's create some alerts!


2 participants