Added guide for monitoring CI #5244 (closed; 1 commit)
`contributors/devel/sig-testing/monitoring.md` (+212 lines)
# Monitoring Kubernetes Health

> **Review comment (Member):**
> Suggested change:
> `- # Monitoring Kubernetes Health`
> `+ # Monitoring Kubernetes Test Health`


**Table of Contents**

- [Monitoring Kubernetes Health](#monitoring-kubernetes-health)
- [Monitoring the health of Kubernetes with TestGrid](#monitoring-the-health-of-kubernetes-with-testgrid)
- [What dashboards should I monitor?](#what-dashboards-should-i-monitor)
- [What do I do when I see a TestGrid alert?](#what-do-i-do-when-i-see-a-testgrid-alert)
- [Communicate your findings](#communicate-your-findings)
- [Fill out the issue](#fill-out-the-issue)
- [Iterate](#iterate)


## Monitoring the health of Kubernetes with TestGrid

TestGrid is a highly configurable, interactive dashboard for viewing your test
results in a grid; its back end is open source at https://github.com/GoogleCloudPlatform/testgrid.

> **Review comment (Contributor):** nit: A better transition would be nice here, like "it is partially open-sourced so you can view the source code here" or something to that effect?
>
> **Reply:** To disambiguate even more: "the back end is open-sourced".
>
> **Reply (Member):** +1 to clarifying the repo contains the back-end components of TestGrid, not the dashboard code itself.


The Kubernetes community has its own instance of TestGrid, https://testgrid.k8s.io/,
which we use to monitor and observe the health of the project.

Each SIG has its own set of dashboards, and each dashboard is composed of
different jobs (build, unit test, integration test, end-to-end (e2e) test, etc.).

> **Review comment (Member):** there are more than e2e jobs.
> Suggested change:
> `- different end-to-end (e2e) jobs.`
> `+ different jobs (build, unit test, integration test, end-to-end (e2e) test, etc.)`

E2E jobs are in turn made up of test stages (e.g., bootstrapping a Kubernetes
cluster, tearing down a Kubernetes cluster) and e2e tests (e.g., "Kubectl client
Kubectl logs should be able to retrieve and filter logs").
These views allow different teams to monitor and understand how their areas
are doing.

We highly encourage anyone to periodically monitor these dashboards.

> **Review comment (Member):** I don't want to encourage toil. SIGs should be periodically monitoring the dashboards related to subprojects they own.

If you see that a job or test has been failing, please raise an issue with the
corresponding SIG in either their mailing list or in Slack.

Help with maintaining tests, fixing broken tests, improving test success rates, and
making overall test improvements is always highly needed throughout the project.

> **Review comment (Contributor):** grammar nit:
> Suggested change:
> `- overall test improvements are always highly needed throughout the project.`
> `+ overall test improvements is always highly needed throughout the project.`
>
> **Reply (Author):** @thejoycekung you sure on this one? 👀
>
> **Reply (Contributor):** Yep! The sentence is basically "Help is always highly needed throughout the project."; we're just qualifying "help" as "help maintaining x, y, and z".


**Note**: It is important that all SIGs periodically monitor their jobs and
tests. Some of these are used to figure out when to release Kubernetes.

> **Review comment (Member), on lines +37 to +38:** This is too broad. Not all jobs/tests are used to figure out when to release Kubernetes.

Furthermore, if jobs or tests are failing or flaking, then pull requests will
take a lot longer to be merged.

> **Review comment (Contributor), on lines +39 to +40:** Might be an opportune time to mention the difference between periodic / presubmit / postsubmit.
>
> **Reply:** Also possibly mentioning metrics.



### What dashboards should I monitor?

This depends on what areas of Kubernetes you want to contribute to.
You should monitor the dashboards owned by the SIG you are working with.
Additionally, you should check:

* https://testgrid.k8s.io/sig-release-master-blocking
* https://testgrid.k8s.io/sig-release-master-informing

since the jobs on these dashboards run tests owned by many different SIGs.
Also, make sure to periodically check on the "blocking" and "informing"
dashboards for past releases.
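
If you prefer to check on dashboards programmatically, TestGrid also serves a JSON summary per dashboard. The following is a minimal sketch, not part of the original guide: the `/summary` endpoint and the `overall_status` field are assumptions based on observing the public instance, so verify them against the TestGrid documentation before relying on them.

```python
# Sketch: poll a TestGrid dashboard's JSON summary and report non-passing tabs.
# Assumes https://testgrid.k8s.io/<dashboard>/summary returns a JSON object
# keyed by tab name, each entry carrying an "overall_status" field
# (PASSING / FLAKY / FAILING); verify before depending on this.
import json
import urllib.request

DASHBOARD = "sig-release-master-blocking"
URL = f"https://testgrid.k8s.io/{DASHBOARD}/summary"

with urllib.request.urlopen(URL) as resp:
    summary = json.load(resp)

for tab, info in sorted(summary.items()):
    status = info.get("overall_status", "UNKNOWN")
    if status != "PASSING":
        print(f"{tab}: {status}")
```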

---

## What do I do when I see a TestGrid alert?

If you are part of a SIG's mailing list, occasionally you may see emails from
TestGrid reporting that a job or a test has recently failed.
If you are casually browsing through TestGrid, you may also see jobs labeled as
"flaky" (in purple) or as "failing" (in red).
This section will help guide you through what to do on these occasions.

### Communicate your findings

The number one thing to do is to communicate your findings: a test or job has
been flaking or failing.
If you saw a TestGrid alert on a mailing list, please reply to the thread and
mention that you are looking into it.
It is important to communicate to prevent duplicate work and to ensure CI
problems get attention.

In order to communicate with the rest of the community and to drive the work,
please open an issue in kubernetes/kubernetes,
https://github.com/kubernetes/kubernetes/issues/new/choose, and choose the appropriate issue
template.
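
Before filing, it is worth checking whether an issue already exists for the same job or test. Here is a minimal sketch using GitHub's issue search API; the query string is only an example, and unauthenticated requests are heavily rate-limited.

```python
# Sketch: search kubernetes/kubernetes for open issues mentioning a test name
# before filing a duplicate. The search endpoint and response shape follow
# GitHub's documented REST API; the query below is illustrative.
import json
import urllib.parse
import urllib.request

query = 'repo:kubernetes/kubernetes is:issue is:open "NodeProblemDetector"'
url = "https://api.github.com/search/issues?" + urllib.parse.urlencode({"q": query})

req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
with urllib.request.urlopen(req) as resp:
    results = json.load(resp)

for item in results["items"]:
    print(f'#{item["number"]}: {item["title"]}')
```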

> **Review comment (Member), on lines +68 to +78:** I would suggest:
>
> * look to see if there is already an open issue for the relevant repo; if not, create one
>   * the relevant repo for kubernetes/kubernetes release-blocking or merge-blocking jobs is kubernetes/kubernetes
> * reply to the alert with a link to the issue
> * all further communication on that issue
>
> How do I decide which kubernetes/kubernetes issue template to use?
>
> * if the job is failing continuously: failing test
> * if the job is occasionally passing and failing: flaking test


### Fill out the issue

> **Review comment (Contributor):** Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? We should talk a little bit about how to title it, e.g. prefix with [Failing Test] or [Flaky Test] (depending on what's happening).
>
> **Reply (Member):** I said something similar in a different PR review that wrote instructions on how to file a flake issue: #5205 (comment). Instructions on correctly filling out an issue are most likely to be read if they are part of the issue template itself. Alternatively, make a page dedicated just to how to file kubernetes/kubernetes issues, and link to that page from the issue template. I opened kubernetes/kubernetes#95528 to cover updating the flake template; maybe it should expand for both.


1. **Which job(s) are failing or flaking**

The job corresponds to the tab in TestGrid that you are looking at (note that the
tab name may be an alias for the full job name; see below).

> **Review comment (Member):** Job name is different from tab name. Click on a TestGrid cell to see what the actual job name is. Alternatively, tabs that don't have a `testgrid-description:` annotation automatically populate the description with the job name. The description is displayed below TestGrid tabs, or available as a hover/tooltip on tab names.
>
> e.g. https://testgrid.k8s.io/sig-release-1.20-blocking#verify-1.20
>
> * tab name is verify-1.20
> * job name is ci-kubernetes-verify-1-20
>
> The difference is important because:
>
> * searching kubernetes/test-infra by job name is far more likely to find the relevant job config
> * https://prow.k8s.io displays/filters job names
> * https://go.k8s.io/triage displays/filters job names; it has no knowledge of TestGrid names


<img src="./testgrid-images/testgrid-jobs.png" height="50%" width="40%">

> **Review comment (Contributor):** question: is alt-text a thing we should do here?
>
> **Reply (Member):** agree, we should alt-text.


The above example was taken from the SIG Release dashboard and we can see that
* `conformance-ga-only` https://testgrid.k8s.io/sig-release-master-blocking#conformance-ga-only,
* `skew-cluster-latest-kubectl-stable1-gce` https://testgrid.k8s.io/sig-release-master-blocking#skew-cluster-latest-kubectl-stable1-gce,
* `gci-gce-ingress` https://testgrid.k8s.io/sig-release-master-blocking#gci-gce-ingress,
* `kind-master-parallel` https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel

are flaky (we should have some issues opened up for these to investigate why
:smile:).

> **Review comment (Member):** it would be useful to include the actual issue examples.

2. **Which tests are failing or flaking**

Let's grab an example from the SIG release dashboards and look at the
`node-kubelet-features-master` job in
https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master.

<img src="./testgrid-images/failed-tests.png" height="70%" width="100%">

Here we see that at 16:07 EDT and 15:07 EDT the test
```
[k8s.io] NodeProblemDetector [NodeFeature:NodeProblemDetector] [k8s.io] SystemLogMonitor should generate node condition and events for corresponding errors [ubuntu]
```
failed for Kubernetes commit `9af86e8db` (this value is a row below the time;
the value above it is the run ID).
The corresponding test-infra commit was `11cb57d36` (the value below the commit
for Kubernetes).

> **Review comment (Contributor), on lines +109 to +112:** The wording here feels a little awkward. I think putting this info with part 3 "Since when has it been failing or flaking" (~L134) would be more useful, because:
>
> * in part 2 we are only concerned about which tests are failing/flaking
> * in part 3 you've written a lovely breakdown of what each row in the header means, so with that context it will be much easier to interpret which commits it failed on for which repo
>
> **Reply (Member):** agree with @thejoycekung


At 15:07 EDT, the test
```
[k8s.io] NodeProblemDetector [NodeFeature:NodeProblemDetector] [k8s.io] SystemLogMonitor should generate node condition and events for corresponding errors [cos-stable2]
```

failed as well.

If one or both of these tests continue failing, or if they fail frequently
enough, we should open an issue and investigate.

3. **Since when has it been failing or flaking**

You can get this information from the header of the page showing you all the
tests.
Going from top to bottom, you will see:
* date
* time
* job run ID
* Kubernetes commit that was tested
* (Most often) test-infra commit

> **Review comment (Contributor):** question (partially for my own knowledge): what else could it be besides test-infra?
>
> **Reply (Author):** Depends on the test. I think there are some jobs that reference the kubeadm commit (for example). Need to also search where this value can be configured ('cause I think it can be configured).
>
> **Reply (Member):** Quirk: jobs that use (deprecated) bootstrap.py have the test-infra commit; jobs that use pod-utils do not (e.g. https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel).
> Customizing docs: https://github.com/kubernetes/test-infra/blob/master/testgrid/config.md#column-headers


4. **Reason for failure**

The aim of this issue is to begin investigating: you don't have to find the
reason for the failure right away (nor the solution).
However, do post any information you find useful.

One way of getting useful information is to click on the failed runs (the red
rectangles).
This will send you to a page called [**Spyglass**](https://github.com/kubernetes/test-infra/tree/master/prow/spyglass).

If we do this for the above test failures in `node-kubelet-features-master`, we
will see the following:

<img src="./testgrid-images/spyglass-summary.png" height="60%" width="100%">

Right away it will show you what tests failed.
Here we see that 2 tests failed (both related to the node problem detector) and
the `e2e.go: Node Tests` stage was marked as failed (because the node problem
detector tests failed).

You will often see "stages" (steps in an e2e job) mixed in with the tests
themselves.
The stages tell you what was going on in the e2e job when an error
occurred.

If we click on the first test error, we will see logs that will (hopefully) help
us figure out why the test failed.

<img src="./testgrid-images/spyglass-result.png" height="60%" width="100%">

Further down the page you will see all the logs for the entire test run.
Please copy any information you think may be useful from here into the issue.
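
If you want to pull a run's logs without clicking through Spyglass, most CI jobs upload artifacts to a public GCS bucket. A minimal sketch follows, assuming the common `kubernetes-jenkins/logs/<job>/<run ID>/build-log.txt` layout; the run ID below is hypothetical, so substitute a real one from the TestGrid header, and confirm the path via the "Artifacts" link in Spyglass.

```python
# Sketch: fetch the tail of a CI run's build log from the public GCS bucket.
# The bucket/path layout is the usual one for ci-* jobs; confirm it through
# Spyglass's "Artifacts" link for your specific job.
import urllib.request

JOB = "ci-kubernetes-node-kubelet-features"
RUN_ID = "1234567890123456789"  # hypothetical; copy a real run ID from TestGrid
URL = f"https://storage.googleapis.com/kubernetes-jenkins/logs/{JOB}/{RUN_ID}/build-log.txt"

with urllib.request.urlopen(URL) as resp:
    log = resp.read().decode("utf-8", errors="replace")

print(log[-2000:])  # the last couple thousand characters usually include the failure summary
```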

> **Review comment (Member, @knabben, Oct 17, 2020):** Maybe it is worth mentioning the artifacts for more logging of other components.

5. **Anything else we need to know**

There is this wonderful page built by SIG Testing that often comes in handy:
https://go.k8s.io/triage

> **Review comment (Member):** Please use the shortlink; the bucket name is going to change eventually (ref: kubernetes/k8s.io#1305).
> Suggested change:
> `- https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1`
> `+ https://go.k8s.io/triage`

This page is called **Triage**.
We can use it to see if a test we see failing in a given job has been failing in
others and, in general, to understand how jobs are behaving.

For example, we can see how the job we have been looking at has been behaving
recently.

There is one important detail we have to mention at this point: the job names
you see on TestGrid are often aliases.
For example, when we clicked on a test run for
`node-kubelet-features-master`
(https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features-master),
the top left corner of Spyglass shows us the real job name,

> **Review comment (Contributor):** another question (partially for my own knowledge): when we log an issue for a failure/flake, should we use the real job name or the alias, or both?
>
> **Reply (Author):** man, I've been taking super long to get through this PR, but either one is good. I think most people will know where to look with either the full name or the alias (usually the full name just adds the "ci-kubernetes-" prefix).
>
> **Reply (Member):** job name, not tab name.

`ci-kubernetes-node-kubelet-features` (notice the "ci-kubernetes-" prefix).
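
Knowing the full job name also makes it easier to find the job's definition in kubernetes/test-infra. A minimal sketch, assuming a local checkout and the usual `config/jobs` layout (a plain `git grep` in the repo does the same thing):

```python
# Sketch: search a local kubernetes/test-infra checkout for the YAML files
# that mention a given job name. Assumes job configs live under config/jobs/.
from pathlib import Path

JOB_NAME = "ci-kubernetes-node-kubelet-features"
ROOT = Path("test-infra/config/jobs")  # adjust to where you cloned the repo

for path in ROOT.rglob("*.yaml"):
    if JOB_NAME in path.read_text(encoding="utf-8", errors="replace"):
        print(path)
```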

> **Review comment (Contributor), on lines +179 to +186:** question: would it be better to intersperse this information throughout the doc? Like, when we talk about the TestGrid tab, mention that these are often aliases; when we introduce Spyglass, mention that the title at the top is the "real" job name.
>
> **Reply (Member):** agree, I suggested this earlier in the doc.

Then we can use this full job name in Triage:

https://go.k8s.io/triage?job=ci-kubernetes-node-kubelet-features

At the time of this writing, we saw the following:

<img src="./testgrid-images/triage.png" height="50%" width="100%">

**Note**: you can also refine your query by filtering or excluding
results based on test name or failure text.
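
For repeatable checks, you can also compose such a filtered query from code. A minimal sketch follows; the `job` parameter appears in the URL above, while the `test` parameter name is an assumption based on the filters in the Triage UI, so confirm it in your browser's address bar after filtering manually.

```python
# Sketch: build a filtered Triage URL. "job" comes from the example above;
# "test" (filter by test name) is assumed; check a UI-generated URL to confirm.
from urllib.parse import urlencode

params = {
    "job": "ci-kubernetes-node-kubelet-features",
    "test": "NodeProblemDetector",  # hypothetical filter value
}
print("https://go.k8s.io/triage?" + urlencode(params))
```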

Sometimes, Triage will help you find patterns to figure out what's wrong.
In this instance, we can also see that this job has been failing rather
frequently (about 2 times per hour).

> **Review comment (Contributor):** Unsure if this is out of scope for this issue or whether it falls more under "revamping the issue template"? We should also mention adding a relevant SIG through `/sig <name>` and/or cc'ing relevant people like `/cc @kubernetes/sig-<foo>-test-failures` or `@kubernetes/ci-signal`.

### Iterate

Once you have filled out the issue, please mention it in the appropriate mailing
list thread (if you see an email from TestGrid mentioning a job or test
failure) and share it with the appropriate SIG in the Kubernetes Slack.

Don't worry if you are not sure how to debug further or how to resolve the
issue!
All issues are unique and require a bit of experience to figure out how to work
on them.
For the time being, reach out to people in Slack or the mailing list.