Add Pod Issue handling for jobs created using Executor API #2065
Conversation
# Conflicts:
#	internal/executor/reporter/fake/job_event_reporter.go
#	internal/executor/reporter/job_event_reporter.go
#	internal/executor/service/cluster_allocation.go
#	internal/executor/service/job_manager.go
#	internal/executor/service/job_manager_test.go
As part of the migration to using the Executor API, we now have 2 flows through the executor. The new flow had no handling of stuck pods, which is what this PR introduces. It was simpler to split this off, as the handling is slightly different from the existing code:
- The existing code is tied in with lease handling.
- The existing code has to make additional calls (return lease, report done), whereas the new Executor API flow is purely event based.
Codecov Report: Base 47.11% // Head 47.38% // Increases project coverage by +0.27%.
@@ Coverage Diff @@
## master #2065 +/- ##
==========================================
+ Coverage 47.11% 47.38% +0.27%
==========================================
Files 229 233 +4
Lines 30251 32143 +1892
==========================================
+ Hits 14252 15231 +979
- Misses 14949 15816 +867
- Partials 1050 1096 +46
const (
	UnableToSchedule IssueType = iota
	StuckStartingUp IssueType = iota
you can get rid of all the `iota`s after the first if you want
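For reference, a minimal sketch of that suggestion, reusing the constant names from the diff above (only the first constant needs the explicit `iota`; the rest inherit it):

```go
type IssueType int

const (
	UnableToSchedule IssueType = iota // 0
	StuckStartingUp                   // 1 - iota is implied for the remaining constants
)
```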
runId := util.ExtractJobRunId(pod)
if runId != "" {
	podsByRunId[runId] = pod
}
what does it mean if `runId` is not present here? Should we log something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
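The updated code isn't shown in this thread; purely as an illustration, logging the missing run id might look something like the sketch below (the log call and message are hypothetical, assuming a logrus-style logger imported as `log`):

```go
runId := util.ExtractJobRunId(pod)
if runId != "" {
	podsByRunId[runId] = pod
} else {
	// Hypothetical: surface pods that unexpectedly have no job run id attached.
	log.Warnf("Ignoring pod %s as no job run id could be extracted from it", pod.Name)
}
```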
result := make([]*issue, 0, len(podIssues))

for _, podIssue := range podIssues {
	relatedPod := podsByRunId[podIssue.RunId]
what happens if `podsByRunId` doesn't have the key?
That is fine, it'll just be nil, which is often expected - such as if the pod was deleted and we are reporting it being deleted unexpectedly.
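A small self-contained illustration of that behaviour - indexing a Go map with a missing key returns the zero value for the value type, so a map of pointers simply yields nil rather than panicking:

```go
package main

import "fmt"

type pod struct{ name string }

func main() {
	podsByRunId := map[string]*pod{"run-1": {name: "pod-a"}}

	relatedPod := podsByRunId["run-2"] // key not present
	fmt.Println(relatedPod == nil)     // prints "true" - callers just need a nil check
}
```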
}

// For retryable issues we must:
// - Report JobUnableToScheduleEvent
I'm not sure the scheduler cares about `unableToScheduleEvent`. What's this used for?
It is informational - really it should just be "StartUpIssueEvent".
We could remove it, and we probably should if we had better state management.
It is sometimes helpful when debugging to see that the issue was detected and an attempt was made to handle it - especially as part of handling the issue is deleting the pod, which could mean the executor loses state during this process (i.e. the executor deletes the pod and then shuts down before it can return the lease).
wg := &sync.WaitGroup{}
processChannel := make(chan K)

for i := 0; i < commonUtil.Min(len(itemsToProcess), maxThreadCount); i++ {
pretty sure there's a `min` function in the standard library now
Are you sure? I can't seem to find it - would you mind linking to anywhere we use it?
New Go does allow writing a generic min function, so you're probably correct.
Ah, I'm wrong - it doesn't exist yet. Hopefully it will soon.
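For context, a generic helper along the lines mentioned above might look like the sketch below (at the time of this exchange there was no `min` in the standard library; Go 1.21 later added built-in `min`/`max`). The package name and placement are illustrative:

```go
package util

import "golang.org/x/exp/constraints"

// Min returns the smaller of two ordered values - a generic stand-in for the
// commonUtil.Min helper referenced in the diff above.
func Min[T constraints.Ordered](a, b T) T {
	if a < b {
		return a
	}
	return b
}
```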