
Added guide for monitoring CI #5244

Closed
wants to merge 1 commit

Conversation

alejandrox1
Contributor

This should help provide context on how CI works and what processes to
follow to maintain our tests.

Signed-off-by: alejandrox1 <alarcj137@gmail.com>

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 16, 2020
@k8s-ci-robot k8s-ci-robot added area/developer-guide Issues or PRs related to the developer guide sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Oct 16, 2020
@alejandrox1
Contributor Author

alejandrox1 commented Oct 16, 2020

For a better read check out the version from my fork directly: https://github.com/alejandrox1/community/blob/triage-tests/contributors/devel/sig-testing/monitoring.md#fill-out-the-issue

Pinging a couple of people I mentioned this to for review
/cc @hasheddan @SergeyKanzhelev @MHBauer @Merkes @kscharm @knabben @RobertKielty
ptal

@k8s-ci-robot
Contributor

@alejandrox1: GitHub didn't allow me to request PR reviews from the following users: merkes, kscharm.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

For a better read check out the version from my fork directly: https://github.com/alejandrox1/community/blob/triage-tests/contributors/devel/sig-testing/monitoring.md#fill-out-the-issue

Pinging a couple of people I mentioned this to for review
/cc @hasheddan @SergeyKanzhelev @MHBauer @Merkes @kscharm @knabben @RobertKielty
ptal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@thejoycekung thejoycekung left a comment


Thank you so much for taking the time to take that doc and really write it out!!! Most of these comments are grammar/spelling nits (because I am kinda nitpicky I am very sorry) but some are also questions/suggestions!

(also yes I am aware it is 1AM on a Friday night I am so sorry please enjoy your weekend)

## Monitoring the health of Kubernetes with TestGrid

TestGrid is a highly-configurable, interactive dashboard for viewing your test
results in a grid, see https://github.com/GoogleCloudPlatform/testgrid.
Contributor

nit: A better transition would be nice here, like 'it is partially open-sourced so you can view the source code here' or something to that effect?

To disambiguate even more "the back end is open-sourced"

Member

+1 to clarifying the repo contains the back-end components of testgrid, not the dashboard code itself

contributors/devel/sig-testing/monitoring.md (outdated, resolved)
Each SIG has its own set of dashboards and each dashboard is composed of
different end-to-end (e2e) jobs.
E2E jobs are in turn made up of test stages (i.e., bootstrapping a Kubernetes
cluster, tearing down a Kubernetes cluster) and e2e tests (i.e., Kubectl client
Contributor

grammar nit:

Suggested change
cluster, tearing down a Kubernetes cluster) and e2e tests (i.e., Kubectl client
cluster, tearing down a Kubernetes cluster) and e2e tests (e.g., Kubectl client

Contributor

suggestion: Picking a different test as an example might be more readable here? For this particular test it feels a little clumsy to read - I wasn't sure whether the test was "Kubectl logs should be able to retrieve and filter logs" and it was under "Kubectl client" (but uh, no turns out "Kubectl client Kubectl logs ..." is the real name)

contributors/devel/sig-testing/monitoring.md (outdated, resolved)
contributors/devel/sig-testing/monitoring.md (outdated, resolved)
Comment on lines 204 to 206
Once you have filled out the issue, please mention it in the appropriate mailing
list thread (if you see an email from testgrid mentioning a job or test
failure) and share it with the appropriate SIG in the Kubernetes slack.
Contributor

grammar nit:

Suggested change
Once you have filled out the issue, please mention it in the appropriate mailing
list thread (if you see an email from testgrid mentioning a job or test
failure) and share it with the appropriate SIG in the Kubernetes slack.
Once you have filled out the issue, please mention it in the appropriate mailing
list thread (if you see an email from TestGrid mentioning a job or test
failure) and share it with the appropriate SIG in the Kubernetes Slack.

Contributor

question: should we mention it to a mailing list also if we did not see an email from testgrid (e.g. it was a flake)?

Contributor Author

I think it'd be good

contributors/devel/sig-testing/monitoring.md (outdated, resolved)
list thread (if you see an email from testgrid mentioning a job or test
failure) and share it with the appropriate SIG in the Kubernetes slack.

Don't worry if you are not sure how to debug further or how to resolve the
Contributor

As a person who is very new, this line is deeply comforting to read. 😅

https://github.com/kubernetes/kubernetes/issues/new/choose, choose the appropriate issue
template.

### Fill out the issue
Contributor

Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? -> We should talk a little bit about how to title it, e.g. prefix with [Failing Test] or [Flaky Test] (depending on what's happening)
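For illustration only (the job and dashboard names below are invented), that could mean titles along the lines of:

  • [Failing Test] ci-kubernetes-e2e-example-job failing on master-blocking
  • [Flaky Test] ci-kubernetes-e2e-example-job flaking on master-blocking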

Member

I said something similar in a review of a different PR that added instructions on how to file a flake issue: #5205 (comment)

Instructions on correctly filling out an issue are most likely to be read if they are part of the issue template itself. Alternatively, make a page dedicated just to how to file kubernetes/kubernetes issues, and link to that page from the issue template.

I opened kubernetes/kubernetes#95528 to cover updating the flake template; maybe it should be expanded to cover both.

Sometimes Triage will help you find patterns to figure out what's wrong.
In this instance we can also see that this job has been failing rather
frequently (about 2 times per hour).

Contributor

Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? -> We should also mention adding a relevant SIG through /sig <name> and/or cc'ing relevant people like /cc @kubernetes/sig-<foo>-test-failures or @kubernetes/ci-signal
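For example (SIG Node is used purely as a placeholder here; the handles just follow the patterns mentioned above), such a routing comment on the issue might look like:

/sig node
/cc @kubernetes/sig-node-test-failures
/cc @kubernetes/ci-signal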


failed as well.

If one or both of these jobs continue failing, or if they failed frequently
Member

@knabben knabben Oct 17, 2020

Could we be more specific about the frequency of the failure? Having an agreement would make it less subjective.


Further down the page you will see all the logs for the entire test run.
Please copy any information you think may be useful from here into the issue.

Member

@knabben knabben Oct 17, 2020

Maybe it is worth mentioning the artifacts, which contain more logs from other components.

Member

@justaugustus justaugustus left a comment

Two filename misspellings:

  • failes-tests.png --> failed-tests.png
  • spyglass-sumary.png --> spyglass-summary.png

Ensure you update any references in the document as well.

I'll hold my content review for now, as @thejoycekung has already done an excellent review to start you off.

@qiutongs

I am new to the CI infra. This doc is really informative and provides a great starting point. Thanks!

* `gci-gce-ingress` https://testgrid.k8s.io/sig-release-master-blocking#gci-gce-ingress,
* `kind-master-parallel` https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel

are flaky (we should have some issues opened up for these to investigate why
Member

it would be useful to include links to actual example issues

Contributor

@hasheddan hasheddan left a comment

@alejandrox1 thank you so much for your great work here 🙂 and thanks to @thejoycekung for the thorough and helpful review! Once some of the suggestions are incorporated, I think this will be a great document to continue to iterate on. One minor nitpick on the document name: it kind of makes me think of monitoring in the more traditional software engineering sense of the word, rather than monitoring testgrid and test results. Maybe something like addressing-test-failures.md or something of that nature?

Comment on lines +39 to +40
Furthermore, if jobs or tests are failing or flaking, then pull requests will
take a lot longer to be merged.
Contributor

Might be an opportune time to mention the difference between periodic / presubmit / postsubmit


Also possibly mentioning metrics

contributors/devel/sig-testing/monitoring.md (outdated, resolved)
## Monitoring the health of Kubernetes with TestGrid

TestGrid is a highly-configurable, interactive dashboard for viewing your test
results in a grid, see https://github.com/GoogleCloudPlatform/testgrid.

To disambiguate even more "the back end is open-sourced"

are doing.

We highly encourage anyone to periodically monitor these dashboards.
If you see that a job or test has been failing please raise an issue with the

Is it worth mentioning that clicking a failing cell will take you to the job execution in Deck?
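As a rough sketch (every segment below is a placeholder; the exact bucket and path depend on the job), such a Deck/spyglass link looks something like:

https://prow.k8s.io/view/gs/<bucket>/logs/<job-name>/<build-id>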

Comment on lines +39 to +40
Furthermore, if jobs or tests are failing or flaking, then pull requests will
take a lot longer to be merged.

Also possibly mentioning metrics

contributors/devel/sig-testing/monitoring.md (outdated, resolved)
@alejandrox1 alejandrox1 force-pushed the triage-tests branch 6 times, most recently from 70db858 to 48bae65 on November 9, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2020
This should help provide context on how CI works and what processes to
follow to maintain our tests.

Signed-off-by: alejandrox1 <alarcj137@gmail.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 7, 2020
@alejandrox1
Contributor Author

Got through the syntax comments 😅 (sorry for taking so long).
Still pondering the open questions.

@SergeyKanzhelev
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 7, 2020
Contributor

@hasheddan hasheddan left a comment

I think this PR is in a mergeable state and further improvements can be made iteratively. Thanks for this great write-up @alejandrox1!

/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alejandrox1, hasheddan, qiutongs
To complete the pull request process, please assign bentheelder after the PR has been reviewed.
You can assign the PR to them by writing /assign @bentheelder in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hasheddan
Contributor

@BenTheElder @spiffxp can we get some eyes on this?

Member

@spiffxp spiffxp left a comment

I feel like @thejoycekung raised some valid unaddressed concerns

which we use to monitor and observe the health of the project.

Each SIG has its own set of dashboards, and each dashboard is composed of
different end-to-end (e2e) jobs.
Member

there are more than e2e jobs

Suggested change
different end-to-end (e2e) jobs.
different jobs (build, unit test, integration test, end-to-end (e2e) test, etc.)

## Monitoring the health of Kubernetes with TestGrid

TestGrid is a highly-configurable, interactive dashboard for viewing your test
results in a grid, see https://github.com/GoogleCloudPlatform/testgrid.
Member

+1 to clarifying the repo contains the back-end components of testgrid, not the dashboard code itself

These views allow different teams to monitor and understand how their areas
are doing.

We highly encourage anyone to periodically monitor these dashboards.
Member

I don't want to encourage toil. SIGs should be periodically monitoring the dashboards related to subprojects they own.

Comment on lines +37 to +38
**Note**: It is important that all SIGs periodically monitor their jobs and
tests. These are used to figure out when to release Kubernetes.
Member

This is too broad. Not all jobs/tests are used to figure out when to release Kubernetes.

Comment on lines +68 to +78
The number one thing to do is to communicate your findings: a test or job has
been flaking or failing.
If you saw a TestGrid alert on a mailing list, please reply to the thread and
mention that you are looking into it.
It is important to communicate to prevent duplicate work and to ensure CI
problems get attention.

In order to communicate with the rest of the community and to drive the work,
please open up an issue on Kubernetes,
https://github.com/kubernetes/kubernetes/issues/new/choose, and choose the appropriate issue
template.
Member

I would suggest:

  • look to see if there is already an open issue for the relevant repo, if not, create one
    • the relevant repo for kubernetes/kubernetes release-blocking or merge-blocking jobs is kubernetes/kubernetes
  • reply to alert with link to issue
  • all further communication on that issue

How do I decide which kubernetes/kubernetes issue template to use?

  • if the job is failing continuously, failing test
  • if the job is occasionally passing and failing, flaking test
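To make that mapping concrete (this reflects my understanding of the current kubernetes/kubernetes templates, so treat it as a sketch):

  • failing continuously -> "failing test" template, which applies kind/failing-test
  • passing and failing intermittently -> "flaking test" template, which applies kind/flake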

5. **Anything else we need to know**

There is this wonderful page built by SIG testing that often comes in handy:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1
Member

Please use the shortlink; the bucket name is going to change eventually (ref: kubernetes/k8s.io#1305)

Suggested change
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1
https://go.k8s.io/triage

Comment on lines +179 to +186
There is one important detail we have to mention at this point, the job names
you see on TestGrid are often aliases.
For example, when we clicked on a test run for
`node-kubelet-features-master`
(
https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master
), at the top left corner of spyglass the page tells us the real job name,
`ci-kubernetes-node-kubelet-features` (notice the "ci-kubernetes-" prefix).
Member

agree, I suggested earlier in the doc

`node-kubelet-features-master`
(
https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master
), at the top left corner of spyglass the page tells us the real job name,
Member

job name, not tab name

@@ -0,0 +1,212 @@
# Monitoring Kubernetes Health
Member

Suggested change
# Monitoring Kubernetes Health
# Monitoring Kubernetes Test Health

https://github.com/kubernetes/kubernetes/issues/new/choose, choose the appropriate issue
template.

### Fill out the issue
Member

I said something similar in a review of a different PR that added instructions on how to file a flake issue: #5205 (comment)

Instructions on correctly filling out an issue are most likely to be read if they are part of the issue template itself. Alternatively, make a page dedicated just to how to file kubernetes/kubernetes issues, and link to that page from the issue template.

I opened kubernetes/kubernetes#95528 to cover updating the flake template; maybe it should be expanded to cover both.

@knabben
Member

knabben commented Jan 20, 2021

/lgtm

@spiffxp
Member

spiffxp commented Jan 28, 2021

/hold
#5244 (review) remains unaddressed

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 28, 2021
@justaugustus
Member

Sounds like we've got a crew to swarm this! Thanks everyone!

@RobertKielty --

ref: https://kubernetes.slack.com/archives/C2C40FMNF/p1611943998099400

@spiffxp
Member

spiffxp commented Jan 29, 2021

Please tag me as a reviewer as well

@spiffxp
Member

spiffxp commented Feb 5, 2021

/close
Since @RobertKielty has picked this up over in #5449

@k8s-ci-robot
Contributor

@spiffxp: Closed this PR.

In response to this:

/close
Since @RobertKielty has picked this up over in #5449

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/developer-guide Issues or PRs related to the developer guide
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
sig/testing Categorizes an issue or PR as relevant to SIG Testing.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
10 participants