
Tinkerers-ci test suites are failing due to observability-platform #3827

QuantumEnigmaa opened this issue Jan 13, 2025 · 6 comments

@QuantumEnigmaa

The Tinkerers-ci test suites are failing on CAPV and CAPVCD because of the observability-platform, most likely due to the object storage issues we are currently facing following the recent migration of all CAPV and CAPVCD bucket CRs from glippy to gaggle.

Since the migration, the object-storage-operator has been struggling to work correctly on gaggle, so fixing that would most likely help with, if not outright solve, the tinkerers-ci issue as well.

See the related incident for more details: https://gigantic.slack.com/archives/C087RGMRPDZ

@QuantumEnigmaa

After updating the observability secrets on grouse so that they match the ones on gaggle, which is now the MC managing the object storage for CAPV & CAPVCD clusters, we got rid of the authentication errors.

Currently, there are two remaining issues on the MC:

  • A lack of resources on the cluster, which leaves several pods (mimir, loki, and alloy ones) stuck in the Pending state. This should be solved tomorrow when rocket adds an additional worker node.
  • The prometheus-to-grafana-cloud pod still has authentication issues, so we need to go through the process of rolling the secret.

@QuantumEnigmaa commented Jan 16, 2025

So @giantswarm/team-rocket added an additional worker node today, which allowed some pods to schedule, but unfortunately most of the alloy-logs pods are still stuck in the Pending state as the new node is already at maximum memory capacity.

While trying to free some memory by lowering the memory requests on loki-write, I also accidentally put one of the replicas in a Pending state.
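
For context, that kind of tweak is usually a Helm values override along these lines. This is only a rough sketch: the exact values path depends on how loki is deployed here, and the numbers are purely illustrative rather than what was actually applied.

```yaml
# Illustrative loki values override (value path and numbers are assumptions).
# Note that changing the request rolls the loki-write StatefulSet, so the
# restarted replica still has to find a node with enough free memory.
write:
  resources:
    requests:
      memory: 2Gi   # lowered request to free schedulable memory on the node
    limits:
      memory: 4Gi
```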

@QuantumEnigmaa

@QuentinBisson did some additional investigation and found out that alloy-metrics was requesting far more memory than what the VPA recommender actually suggested, taking up almost an entire node's memory.
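
One way to keep a workload's requests closer to the recommendation is to cap them in the VPA's resource policy. The sketch below only illustrates that idea; the object name, target, and cap are assumptions, not what is actually configured for alloy-metrics.

```yaml
# Illustrative VPA with a hard cap on how much memory the updater may
# request for the alloy-metrics container (names and values are assumptions).
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: alloy-metrics
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: alloy-metrics
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: alloy-metrics
        controlledResources: ["memory"]
        maxAllowed:
          memory: 4Gi   # cap so requests cannot grow to a whole node's memory
```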

As a temporary hotfix, Quentin created a PR to allow the mimir-querier and mimir-distributor pods to schedule on control-plane nodes: https://github.com/giantswarm/shared-configs/pull/194
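
For reference, letting those components onto control-plane nodes generally comes down to adding a toleration for the control-plane taint in the per-component mimir values. The actual change is in the linked PR; the snippet below is just a generic sketch with assumed value paths.

```yaml
# Generic sketch of per-component scheduling overrides (value paths assumed;
# see the linked PR for the real change). A toleration only permits, and does
# not force, scheduling onto control-plane nodes.
querier:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
distributor:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
```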

@QuentinBisson

This workaround is sadly not accepted by the majority of team atlas

@QuantumEnigmaa

Herve said he was ok with it (even though reluctant) in today's standup :)

@QuentinBisson

@AverageMarcus the observability platform is running on grouse, so closing for now; let's reopen if we still have issues.

@github-project-automation github-project-automation bot moved this from Inbox 📥 to Validation ☑️ in Roadmap Jan 21, 2025