Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

k8s-infra-prow-build: Add monitoring #6468

Merged
merged 3 commits into from
Feb 27, 2024

Conversation

koksay
Copy link
Contributor

@koksay koksay commented Feb 26, 2024

Adds monitoring components for the k8s-infra-prow-build cluster.
Mainly copied from eks-prow-build-cluster configuration

Please note that, the alertmanager-slack-token.yam file has invalid data for this cluster. It will be updated later when the external secrets are in place.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 26, 2024
@k8s-ci-robot k8s-ci-robot added area/infra Infrastructure management, infrastructure design, code in infra/ area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters area/terraform Terraform modules, testing them, writing more of them, code in infra/gcp/clusters/ sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 26, 2024
- apiURL:
name: alertmanager-k8c-slack-token
key: url
channel: '#alerting-cncf-prod'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably drop the entire alerting config for the moment. we want metrics visualization first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ameukam I removed alertmanager folder and disabled it's configuration on prometheus.yaml

@xmudrii
Copy link
Member

xmudrii commented Feb 26, 2024

/assign

Copy link
Member

@xmudrii xmudrii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking care of this, @koksay!
/lgtm
/approve
/hold
for other folks to take a look

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 27, 2024
@upodroid
Copy link
Member

upodroid commented Feb 27, 2024

Can we look into using Thanos to aggregate and store metrics for a period of time?

Also, I would prefer if prometheus operator was deployed via helm as the diff in Git would be very tiny.

@xmudrii
Copy link
Member

xmudrii commented Feb 27, 2024

We plan to do both, however:

  • This PR replicates the EKS monitoring setup pretty much 1:1 and given that we're sure that this setup works well, we don't want to change it at this point
  • Integrating with Thanos will take some time and we need the monitoring stack ASAP-ish to prepare the workshop that we have at the Contributor Summit
  • The way we apply resources to the GKE cluster is not really compatible with Helm (e.g. I don't know if we even have Helm in the image that's used by post-k8sio-deploy-prow-build-resources job), so using Helm would require additional effort that's probably not worth it as we'll have ArgoCD pipeline for the GKE cluster at some point soon-ish

That said, I have preference to proceed with this PR as it is and iterating (for both GKE and EKS clusters) in the near future.

Copy link
Member

@upodroid upodroid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add Thanos after Kubecon
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: koksay, upodroid, xmudrii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 27, 2024
@k8s-ci-robot k8s-ci-robot merged commit 9aa4bab into kubernetes:main Feb 27, 2024
3 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 27, 2024
@BenTheElder
Copy link
Member

Pretty sure this change is what is causing scheduling issues for some PR/release blocking jobs:

We have to be careful about increasing per-node resource utilization in the build clusters (as opposed to say, a deployment, which we can scale around just fine).
We have jobs that are best run as the sole tenant and request ~100% schedulable CPU to ensure this. See more in the linked issue.

I think we should consider reverting and then revisit shortly after, given kubernetes/kubernetes has code freeze pending and we're having trouble testing and merging PRs. kubernetes/test-infra#32157 (comment)

For revisit we can discuss in kubernetes/test-infra#32157 after confirming that CI is running smoothly again ...

@koksay
Copy link
Contributor Author

koksay commented Mar 5, 2024

@BenTheElder What if we remove the requests but keep limits for the node-exporter ds?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/infra Infrastructure management, infrastructure design, code in infra/ area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters area/terraform Terraform modules, testing them, writing more of them, code in infra/gcp/clusters/ cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants