Move backoff logic into own struct and increase test coverage of timeout logic #3031
Conversation
oooo a data race!
I think this race condition might be another instance of #1307 🤔
@bobcatfish: The following tests failed, say /retest to rerun all failed tests.
// 1. Stop signal, 2. completion or 3. a given Duration to elapse.
func (b *Backoff) SetTimer(runObj StatusKey, d time.Duration) {
	if b.timeoutCallback == nil {
		b.logger.Errorf("attempted to set a timer for %q but no callback has been assigned", runObj)
Sounds very odd (at least to me 😉), but commenting out this logger call fixes the race condition. This function invocation within handlePodCreationError and the following SetConditions call are somehow causing it.
WARNING: DATA RACE
Read at 0x00c000c3c198 by goroutine 103:
...
go.uber.org/zap.(*SugaredLogger).log()
...
Previous write at 0x00c000c3c198 by goroutine 79:
knative.dev/pkg/apis/duck/v1beta1.(*Status).SetConditions()
...
Ah, well spotted! It looks like swapping out the %q for a %p in this log statement also removes the error. Using %q results in a string-formatted struct being printed, which probably walks the entire object and accesses all the fields. My hunch is that the %q formatter is not threadsafe in that case? In hindsight this may be why we have the StatusKey interface - the GetRunKey() func of TaskRun returns a thread-safe descriptor.
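To make the failure mode concrete, here is a minimal, hypothetical Go program (none of these names come from the PR) showing why formatting a whole object races with a concurrent field write: fmt's verbs reflect over every field, so `go run -race` flags the unsynchronized read.

```go
// Illustrative only: a struct-formatting read racing with a field write,
// the same shape as the %q-on-runObj read vs. the SetConditions write above.
package main

import (
	"fmt"
	"time"
)

type run struct {
	name       string
	conditions []string // written by one goroutine, read by fmt below
}

func main() {
	r := &run{name: "my-taskrun"}

	go func() {
		// Analogous to Status.SetConditions mutating the object.
		r.conditions = append(r.conditions, "Succeeded")
	}()

	// %v (like %q) walks every field via reflection, reading
	// r.conditions without synchronization: a data race under -race.
	fmt.Printf("setting timer for %v\n", r)

	time.Sleep(10 * time.Millisecond)
}
```

Logging a stable identifier (e.g. a namespace/name string) instead of the whole object avoids the reflective walk entirely.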
Nice catch, yup it does walk the entire object. The next log statement (using GetRunKey() on runObj) looks safe and just prints the address of the taskRun object instead: "TaskRun/0xc00015af00".
While trying to figure this out I think I found a bigger problem, I'm going to close this PR for now b/c I'll need to rebase it anyway.
In fact I think this is a bug that I introduced - the address of the Run might be thread safe, but it's not going to be the same on every reconcile! So I think that actually every single Run is timing out because the done channel can never be signaled 😅
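A small sketch (hypothetical types and helpers, not the PR's code) of why an address-based key breaks the done-channel bookkeeping across reconciles: each reconcile decodes a fresh object at a new address, so the key used to register the channel never matches the key used to look it up.

```go
// Illustrative only: an address-derived key vs. an identity-derived key.
package main

import "fmt"

type TaskRun struct {
	Namespace, Name string
}

// addrKey mimics the buggy behavior: stable within one reconcile,
// but different for every freshly decoded object.
func addrKey(tr *TaskRun) string { return fmt.Sprintf("TaskRun/%p", tr) }

// stableKey identifies the same logical Run on every reconcile.
func stableKey(tr *TaskRun) string {
	return fmt.Sprintf("TaskRun/%s/%s", tr.Namespace, tr.Name)
}

func main() {
	done := map[string]chan bool{}

	first := &TaskRun{Namespace: "default", Name: "build"}
	done[addrKey(first)] = make(chan bool)

	// Next reconcile: same logical TaskRun, freshly decoded object.
	second := &TaskRun{Namespace: "default", Name: "build"}
	if _, ok := done[addrKey(second)]; !ok {
		fmt.Println("address key missed: channel never found, never signaled")
	}
	if stableKey(first) == stableKey(second) {
		fmt.Println("stable key matches across reconciles")
	}
}
```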
Changes
Separate backoff logic from timeout logic ⏰
The timeout handler was trying to do two different jobs:
- Track timeouts for executing TaskRuns and PipelineRuns; when they time out, they should be re-reconciled, at which point they would fail b/c they took too long
- Track backoffs when TaskRun pod creation fails due to resource quota issues. In this case we want to retry pod creation (by re-reconciling) after a backoff in case the resource issue has been resolved.
The logic for these 2 jobs was combined because they both share at their heart very similar logic: the logic for creating go routines that wait until a period of time has passed.
Now these two jobs are done by 2 different structs, which both use the new Timer struct and logic, so they can share this logic, without having to mix the code on top.
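Something like the following sketch captures the shape described; the actual types in the PR (Timer, Backoff, the handler structs) and their APIs may well differ.

```go
// Illustrative sketch of a shared wait primitive used by both the
// timeout tracker and the backoff tracker, per the description above.
package timeout

import "time"

// timer waits for whichever comes first: a stop signal, completion of
// the run, or the given duration elapsing.
type timer struct {
	stop <-chan struct{} // closed when the controller shuts down
}

func (t *timer) wait(done <-chan bool, d time.Duration, onElapsed func()) {
	select {
	case <-t.stop:
		// controller shutting down; do nothing
	case <-done:
		// the run finished before the period elapsed
	case <-time.After(d):
		onElapsed() // e.g. trigger a re-reconcile
	}
}
```

The timeout tracker and the backoff tracker can then each hold their own job-specific state (done channels, attempt counts) while delegating the goroutine-and-wait plumbing to the shared timer.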
Future things we can do to make this even better:
- Add more test coverage for the new structs
- The "Handler" struct (which may have too generic a name at this point) probably doesn't need to have separate logic for pipelineruns + taskruns since each of these controllers will instantiate it separately
- As discussed in tektoncd#2905, knative/pkg's controller logic will attempt to re-reconcile all the Runs on start up, so we probably don't need to be explicitly doing this as well
Add more timeout test coverage and fix timeout "bug" 🐞
Add more tests to increase coverage of timeout package. This means
updating some code to be more unit testable, specifically injecting the
function used to determine the current time into Backoff.
This revealed a "bug" where the backoff logic wasn't using a default
value when the TaskRun timeout isn't set - however the defaulting logic
ensures that it is always set, so if anything, maybe we should actually
panic in this case.
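As a rough sketch of the injection described (all names here are assumed, not the PR's real API), including a guard for the default-timeout case mentioned:

```go
// Illustrative sketch: injecting the current-time function so unit
// tests can use a fixed clock instead of time.Now.
package timeout

import "time"

const defaultTimeout = 60 * time.Minute // assumed default for this sketch

type Backoff struct {
	now func() time.Time // injected; production code passes time.Now
}

func NewBackoff(now func() time.Time) *Backoff {
	return &Backoff{now: now}
}

// remaining reports how long until the run times out. The zero-timeout
// branch mirrors the "bug" above: defaulting should have set the
// timeout already, so arguably this should panic rather than silently
// substitute a default.
func (b *Backoff) remaining(started time.Time, timeout time.Duration) time.Duration {
	if timeout == 0 {
		timeout = defaultTimeout
	}
	return timeout - b.now().Sub(started)
}
```

A test can then pin the clock, e.g. `NewBackoff(func() time.Time { return someFixedTime })`, and assert on exact durations without sleeping.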
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
- If you've added a new binary in the cmd dir, please update the release Task to build and release this image.
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.
Release Notes
It almost seems like this is fixing a bug, but because of how the defaulting logic works, I don't think it really is.