Health checks against PostgreSQL, Azure Service Bus, and Altinn #292

Closed
4 tasks done
Tracked by #74
oskogstad opened this issue Dec 14, 2023 · 4 comments
Labels
monitoring Issue related to logging and monitoring

Comments

@oskogstad
Collaborator

oskogstad commented Dec 14, 2023

Introduction

Our ASP.NET projects have endpoints for the health probes in ContainerApps (Kubernetes),
and these currently use the default .NET implementation.
They return 200 OK as long as the app is alive.

Implementation

Implement our own checks that verify connectivity and connection time against:

  • PostgreSQL ✅
  • Azure Service Bus (ASB) ✅
  • Altinn authentication ✅
  • Maskinporten well-known ✅
  • id-porten well-known ✅

Some checks we might want to look at later:

  • Altinn events
  • Altinn authorization: /authorization (might not be needed, since we already check authentication?)
  • Altinn access management: /accessmanagement (might not be needed, since we already check authentication?)
  • Altinn resource registry: "resourceregistry/api/v1/resource/" (is health-checking authentication enough here? A direct check would give us more insight, since we would know when the resource registry has an outage.)
  • Altinn organization registry: "orgs/altinn-orgs.json" (served from the Altinn CDN? https://altinncdn.no/orgs/altinn-orgs.json)
  • Altinn name registry: "register/api/v1/parties/nameslookup"
  • Altinn CDN: use the organization registry check here instead?

Failing to connect to any of these should report unhealthy, and high response times should report degraded (look up the exact terms/HTTP response codes).
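The rule above can be sketched as a small classifier. This is Python for illustration only (the services themselves are ASP.NET), and `HealthStatus`, `classify`, `aggregate` and the 500 ms threshold are hypothetical names and values, since the issue does not quantify "high response time". The aggregation step mirrors how .NET's `HealthReport` reports the most severe status among its entries.

```python
from enum import Enum
from typing import Optional


class HealthStatus(Enum):
    # Same three states as .NET health checks.
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


# Hypothetical threshold: the issue does not define "high response time".
DEGRADED_THRESHOLD_MS = 500.0


def classify(connected: bool, latency_ms: Optional[float],
             threshold_ms: float = DEGRADED_THRESHOLD_MS) -> HealthStatus:
    """Map one dependency probe (connected?, latency) to a health status."""
    if not connected or latency_ms is None:
        # No connection at all: the dependency is unreachable.
        return HealthStatus.UNHEALTHY
    if latency_ms > threshold_ms:
        # Reachable, but slower than we accept.
        return HealthStatus.DEGRADED
    return HealthStatus.HEALTHY


def aggregate(statuses: list[HealthStatus]) -> HealthStatus:
    """Overall status is the worst individual one (Healthy if no checks ran)."""
    order = [HealthStatus.HEALTHY, HealthStatus.DEGRADED, HealthStatus.UNHEALTHY]
    return max(statuses, key=order.index, default=HealthStatus.HEALTHY)
```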

Tasks

@oskogstad
Collaborator Author

Alerting could perhaps be split out into a separate task, to keep the tasks small.

@elsand elsand added this to the Pilotproduksjon milestone Dec 14, 2023
@elsand elsand added the monitoring Issue related to logging and monitoring label Dec 14, 2023
@elsand elsand changed the title from "Health checks against PostgreSQL, RabbitMQ/Messagebroker, and Altinn" to "Health checks against PostgreSQL, Azure Service Bus, and Altinn" Jan 3, 2024
@arealmaas
Collaborator

I think we should be careful about adding requests to external services as part of the Container Apps health check. If we struggle to reach PostgreSQL, we don't necessarily want to degrade the service to "unhealthy" in Kubernetes, since it would then restart continuously because of the failing health checks.

Should we instead expose a separate health endpoint that we could ping from, e.g., https://learn.microsoft.com/en-us/azure/azure-monitor/app/availability-overview, https://www.runscope.com/ or https://www.atlassian.com/software/statuspage? There we could also, for example, degrade the service if the latency of a third-party service exceeds X.

Then we could return 200 OK on liveness and return something meaningful on readiness (i.e., for when we don't want the service/replica to receive more traffic until it is healthy).
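A minimal sketch of that split, again in Python with hypothetical function names: liveness answers 200 as long as the process responds, while readiness derives its code from the dependency results. The status-to-code table matches ASP.NET Core's default `HealthCheckOptions.ResultStatusCodes` (Healthy and Degraded give 200, Unhealthy gives 503). A failing readiness probe takes the replica out of load balancing without restarting it, which avoids the restart loop described above.

```python
from http import HTTPStatus

# Default ASP.NET Core mapping: Healthy and Degraded answer 200,
# Unhealthy answers 503.
STATUS_CODES = {
    "Healthy": HTTPStatus.OK,
    "Degraded": HTTPStatus.OK,
    "Unhealthy": HTTPStatus.SERVICE_UNAVAILABLE,
}


def liveness_response() -> int:
    """Liveness only proves the process can answer HTTP.

    Dependencies such as PostgreSQL are deliberately ignored, so an outage
    there never makes the orchestrator restart the replica.
    """
    return HTTPStatus.OK


def readiness_response(dependency_statuses: dict[str, str]) -> int:
    """Readiness reflects dependencies: an Unhealthy one stops new traffic
    to this replica (without a restart) until it recovers."""
    statuses = set(dependency_statuses.values())
    if "Unhealthy" in statuses:
        overall = "Unhealthy"
    elif "Degraded" in statuses:
        overall = "Degraded"
    else:
        overall = "Healthy"
    return STATUS_CODES[overall]
```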

@arealmaas arealmaas self-assigned this Sep 18, 2024
arealmaas added a commit that referenced this issue Oct 8, 2024
- Adds health checks for Redis, PostgreSQL and the well-known endpoints.
- Ensures that we have separate endpoints for readiness/liveness/startup/health

Related to #292 

<img width="542" alt="image"
src="https://github.com/user-attachments/assets/5b71bfbc-1e83-427c-8042-e363ffbf8faa">


## Summary by CodeRabbit

- **New Features**
  - Added health check capabilities for Redis, PostgreSQL, and well-known endpoints.
  - Introduced multiple health check endpoints: `/startup`, `/liveness`, `/readiness`, and `/health`.
  - Integrated health checks into the service collection for better monitoring.
  - Added a new project for utility functions related to health checks.

- **Enhancements**
  - Improved health monitoring with a new HTTP client and health check configurations, including a self-check feature.
  - Added support for dynamic configuration of health check probes in deployment templates.
  - Updated API specifications to reflect new health check schemas and structures.

- **Bug Fixes**
  - Enhanced error handling for health checks to provide clearer feedback on endpoint status.

---------

Co-authored-by: Are Almaas <arealmaas@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Dialogporten Automation Bot <164321870+dialogporten-bot@users.noreply.github.com>
Co-authored-by: Magnus Sandgren <5285192+MagnusSandgren@users.noreply.github.com>
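One way to serve `/startup`, `/liveness`, `/readiness` and `/health` from a single set of registered checks is to tag each check and filter by tag per endpoint, which is what the `Predicate` option on ASP.NET Core's `HealthCheckOptions` allows. The Python sketch below assumes a hypothetical tag assignment (the `Check` type, `CHECKS` and `checks_for` are illustrative names); the actual mapping in this commit may differ. Keeping external dependencies off `/liveness` follows the restart-loop concern raised earlier in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    name: str
    tags: set[str] = field(default_factory=set)


# Hypothetical registrations; the real tag assignment in the commit may differ.
CHECKS = [
    Check("self", {"startup", "liveness", "readiness", "health"}),
    Check("postgres", {"readiness", "health"}),
    Check("redis", {"readiness", "health"}),
    Check("maskinporten-wellknown", {"health"}),  # third party: deep check only
]


def checks_for(endpoint: str) -> list[str]:
    """Names of the checks an endpoint such as "/liveness" runs, selected by tag."""
    tag = endpoint.strip("/")
    return [check.name for check in CHECKS if tag in check.tags]
```

With this assignment, only `/health` reaches third-party URLs, so an outage at Maskinporten can never fail liveness or readiness.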
arealmaas added a commit that referenced this issue Oct 8, 2024

## Description

Changed the paths of the health checks, so we have to make sure the probes use the same endpoints.
## Related Issue(s)

- #292 

## Verification

- [ ] **Your** code builds clean without any errors or warnings
- [ ] Manual testing done (required)
- [ ] Relevant automated test added (if you find this hard, leave it and
we'll help out)

## Documentation

- [ ] Documentation is updated (either in `docs`-directory, Altinnpedia
or a separate linked PR in
[altinn-studio-docs.](https://github.com/Altinn/altinn-studio-docs), if
applicable)



## Summary by CodeRabbit

- **New Features**
  - Updated health probe paths for container apps to include a new `/health` prefix, enhancing health check organization.
- **Bug Fixes**
  - Improved accuracy of health status checks by modifying probe endpoints to ensure proper monitoring.

arealmaas added a commit that referenced this issue Oct 9, 2024

## Description


<img width="1091" alt="image"
src="https://github.com/user-attachments/assets/6f3f9095-ccc7-4342-8f47-fdacb733f9be">

It seems CPU is what we are struggling with the most, so this upgrades to the next profile in the Burstable tier, which has 2 cores (B2s).

## Related Issue(s)

- #292 

## Verification

- [ ] **Your** code builds clean without any errors or warnings
- [ ] Manual testing done (required)
- [ ] Relevant automated test added (if you find this hard, leave it and
we'll help out)

## Documentation

- [ ] Documentation is updated (either in `docs`-directory, Altinnpedia
or a separate linked PR in
[altinn-studio-docs.](https://github.com/Altinn/altinn-studio-docs), if
applicable)



## Summary by CodeRabbit

- **New Features**
  - Enhanced flexibility in PostgreSQL SKU selection with additional options available.
  - Updated default SKU from 'Standard_B1ms' to 'Standard_B2s' for improved resource allocation.

arealmaas added a commit that referenced this issue Oct 11, 2024

## Description


## Related Issue(s)

- #292 

## Verification

- [ ] **Your** code builds clean without any errors or warnings
- [ ] Manual testing done (required)
- [ ] Relevant automated test added (if you find this hard, leave it and
we'll help out)

## Documentation

- [ ] Documentation is updated (either in `docs`-directory, Altinnpedia
or a separate linked PR in
[altinn-studio-docs.](https://github.com/Altinn/altinn-studio-docs), if
applicable)


## Summary by CodeRabbit

- **New Features**
  - Introduced health check configurations for container applications, enhancing monitoring capabilities.
  - Added a new launch configuration for debugging the GraphQL application alongside the WebApi.

- **Bug Fixes**
  - Updated health check mappings to ensure proper functionality and configuration.

- **Documentation**
  - Improved project references and service configurations for clarity and maintainability.
@arealmaas
Collaborator

Moving this to #1261:

"Implement alerting if a container is unhealthy or degraded over a certain period"

Not that relevant for health checks.

@arealmaas
Collaborator

The only thing remaining is the health check against Service Bus.

arealmaas added a commit that referenced this issue Oct 23, 2024

## Description


## Related Issue(s)

- #292 

## Verification

- [ ] **Your** code builds clean without any errors or warnings
- [ ] Manual testing done (required)
- [ ] Relevant automated test added (if you find this hard, leave it and
we'll help out)

## Documentation

- [ ] Documentation is updated (either in `docs`-directory, Altinnpedia
or a separate linked PR in
[altinn-studio-docs.](https://github.com/Altinn/altinn-studio-docs), if
applicable)



## Summary by CodeRabbit

- **New Features**
  - Enhanced health check reporting to focus on dependency tracking.
  - Optimized caching strategies for improved performance and reliability.
  - Updated configuration for HTTP clients to improve error handling and service integration.

- **Bug Fixes**
  - Adjusted health check options and caching parameters to ensure accurate functionality.

arealmaas added a commit that referenced this issue Oct 24, 2024

## Description


An availability test for the backend. For now it sends a health-check request to web-api-so to verify that the service is up and running with all its dependencies. The health endpoint in APIM sends requests to web-api-so by default; the other services are not exposed yet.

- Adds an availability test against our APIM. It probes the deep version of the health checks, which checks third-party URLs together with Redis and Postgres.
- For now it only targets web-api-so, since that is the default backend. All services should be exposed like this.

<img width="612" alt="image"
src="https://github.com/user-attachments/assets/a368ed4d-78c5-4966-b363-493c85bd4568">

The frontend availability test:


![image](https://github.com/user-attachments/assets/55cbe387-d246-4b45-bbd4-17722f4117ab)
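The availability test itself is configured in Azure, but the shape of such an external probe can be sketched in Python: time one GET against the deep health endpoint and turn the result into a verdict. The `probe`/`evaluate` names, the 2-second threshold and the Pass/Degraded/Fail labels are hypothetical; the threshold stands in for the "latency over X" idea from earlier in the thread.

```python
import time
import urllib.error
import urllib.request


def evaluate(status_code: int, elapsed_s: float, max_latency_s: float = 2.0) -> str:
    """Turn one probe result into a verdict (threshold is hypothetical)."""
    if not 200 <= status_code < 300:
        return "Fail"
    if elapsed_s > max_latency_s:
        # Up, but slower than we accept: report it without calling it down.
        return "Degraded"
    return "Pass"


def probe(url: str, timeout_s: float = 10.0) -> str:
    """Send one GET to the deep health endpoint and time it."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (e.g. 503 from an Unhealthy report) are raised here.
        status = err.code
    except OSError:
        # DNS failures, refused connections and timeouts.
        return "Fail"
    return evaluate(status, time.monotonic() - start)
```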

## Related Issue(s)

- #292 

## Verification

- [ ] **Your** code builds clean without any errors or warnings
- [ ] Manual testing done (required)
- [ ] Relevant automated test added (if you find this hard, leave it and
we'll help out)

## Documentation

- [ ] Documentation is updated (either in `docs`-directory, Altinnpedia
or a separate linked PR in
[altinn-studio-docs.](https://github.com/Altinn/altinn-studio-docs), if
applicable)



## Summary by CodeRabbit

- **New Features**
  - Introduced a new parameter `apimUrl` for capturing the APIM instance URL across various environments (production, staging, test, yt01).
  - Added a new module for creating an availability test for the APIM instance, enhancing monitoring capabilities.

- **Enhancements**
  - New output declaration for the Application Insights resource ID, allowing easier access to the resource identifier post-deployment.


3 participants