
Tinkerers-ci test suites are failing due to observability-platform #3827

QuantumEnigmaa opened this issue Jan 13, 2025 · 6 comments

@QuantumEnigmaa

The Tinkerers-ci test suites are failing on CAPV and CAPVCD because of the observability-platform, most likely due to the object storage issues we are currently facing following the recent migration of all CAPV and CAPVCD bucket CRs from glippy to gaggle.

Since the migration, the object-storage-operator has been struggling to work correctly on gaggle, so fixing that would most likely help with, if not outright solve, the tinkerers-ci issue as well.

See the related incident for more details: https://gigantic.slack.com/archives/C087RGMRPDZ

@QuantumEnigmaa

After updating the observability secrets on grouse so that they match the ones on gaggle, which is now the MC managing the object storage for CAPV & CAPVCD clusters, we got rid of the authentication errors.

Currently, there are two remaining issues on the MC:

  • A lack of resources on the cluster, which leaves several pods (mimir, loki, and alloy ones) stuck in the Pending state. This should be solved tomorrow when rocket adds an additional worker node.
  • The prometheus-to-grafana-cloud pod still has authentication issues, so we need to go through the process of rolling the secret.

@QuantumEnigmaa commented Jan 16, 2025

So @giantswarm/team-rocket added an additional worker node today, which allowed some pods to schedule, but unfortunately most of the alloy-logs pods are still stuck in the Pending state as the new node is already at maximum memory capacity.

While trying to free some memory by lowering the memory requests on loki-write, I also accidentally put one of the replicas in a Pending state.
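
For context, that kind of tweak is usually a Helm values override along these lines. This is only a rough sketch: the exact values path depends on how loki is deployed here, and the numbers are purely illustrative rather than what was actually applied.

```yaml
# Illustrative loki values override (value path and numbers are assumptions).
# Note that changing the request rolls the loki-write StatefulSet, so the
# restarted replica still has to find a node with enough free memory.
write:
  resources:
    requests:
      memory: 2Gi   # lowered request to free schedulable memory on the node
    limits:
      memory: 4Gi
```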

@QuantumEnigmaa

@QuentinBisson did some additional investigation and found out that alloy-metrics was requesting far more memory than what the VPA recommender actually suggested, taking up almost an entire node's memory.
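
One way to keep a workload's requests closer to the recommendation is to cap them in the VPA's resource policy. The sketch below only illustrates that idea; the object name, target, and cap are assumptions, not what is actually configured for alloy-metrics.

```yaml
# Illustrative VPA with a hard cap on how much memory the updater may
# request for the alloy-metrics container (names and values are assumptions).
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: alloy-metrics
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: alloy-metrics
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: alloy-metrics
        controlledResources: ["memory"]
        maxAllowed:
          memory: 4Gi   # cap so requests cannot grow to a whole node's memory
```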

As a temporary hotfix, Quentin created a PR to allow the mimir-querier and mimir-distributor pods to schedule on control-plane nodes: https://github.com/giantswarm/shared-configs/pull/194
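
For reference, letting those components onto control-plane nodes generally comes down to adding a toleration for the control-plane taint in the per-component mimir values. The actual change is in the linked PR; the snippet below is just a generic sketch with assumed value paths.

```yaml
# Generic sketch of per-component scheduling overrides (value paths assumed;
# see the linked PR for the real change). A toleration only permits, and does
# not force, scheduling onto control-plane nodes.
querier:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
distributor:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
```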

@QuentinBisson

This workaround is sadly not accepted by the majority of team atlas

@QuantumEnigmaa

Herve said he was ok with it (even though reluctant) in today's standup :)

@QuentinBisson

@AverageMarcus the observability platform is running on grouse, so closing for now; let's reopen if we still have issues.

@github-project-automation github-project-automation bot moved this from Inbox 📥 to Validation ☑️ in Roadmap Jan 21, 2025