Add Pod Issue handling for jobs created using Executor API #2065

Merged
JamesMurkin merged 28 commits into master from issue_handling_pods on Feb 8, 2023

Conversation

JamesMurkin
Contributor

As part of the migration to the Executor API, there are now two flows through the executor.

The new flow had no handling of stuck pods, which is what this PR introduces.

It was simpler to split this off because the handling differs slightly from the existing code:

  • The existing code is tied in with lease handling.
  • The existing code has to make additional calls (return lease, report done), whereas the new Executor API flow is purely event based.


codecov-commenter commented Jan 26, 2023

Codecov Report

Base: 47.11% // Head: 47.38% // Increases project coverage by +0.27% 🎉

Coverage data is based on head (1337990) compared to base (8116fa6).
Patch coverage: 73.79% of modified lines in pull request are covered.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2065      +/-   ##
==========================================
+ Coverage   47.11%   47.38%   +0.27%     
==========================================
  Files         229      233       +4     
  Lines       30251    32143    +1892     
==========================================
+ Hits        14252    15231     +979     
- Misses      14949    15816     +867     
- Partials     1050     1096      +46     
Flag Coverage Δ
airflow-operator 89.13% <ø> (ø)
armada-server 46.44% <73.79%> (+0.33%) ⬆️
python-client 93.83% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
internal/executor/application.go 10.29% <0.00%> (-0.32%) ⬇️
internal/executor/job/job_context.go 32.27% <0.00%> (+0.50%) ⬆️
internal/executor/util/pod_util.go 76.47% <ø> (-1.74%) ⬇️
internal/executor/service/pod_issue_handler.go 73.39% <73.39%> (ø)
internal/executor/context/cluster_context.go 57.34% <100.00%> (ø)
internal/executor/service/job_manager.go 60.13% <100.00%> (-3.78%) ⬇️
internal/executor/util/process.go 100.00% <100.00%> (ø)
internal/scheduler/schedulerapp.go 0.00% <0.00%> (ø)
internal/scheduleringester/ingester.go 0.00% <0.00%> (ø)
internal/armada/server/submit_to_log.go 0.00% <0.00%> (ø)
... and 18 more

☔ View full report at Codecov.

JamesMurkin marked this pull request as ready for review February 6, 2023 11:02

const (
    UnableToSchedule IssueType = iota
    StuckStartingUp  IssueType = iota
Collaborator

you can get rid of all the iotas after the first if you want
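For reference, a minimal sketch of the suggested simplification: only the first constant needs the explicit IssueType = iota, and later constants repeat the expression implicitly. The type IssueType int declaration below is an assumption about the underlying type, not copied from the PR.

type IssueType int

const (
    UnableToSchedule IssueType = iota // iota starts at 0
    StuckStartingUp                   // implicitly repeats "IssueType = iota", so this is 1
)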

runId := util.ExtractJobRunId(pod)
if runId != "" {
    podsByRunId[runId] = pod
}
Collaborator

what does it mean if runId is not present here? Should we log something?

Contributor Author

Done
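(A hypothetical sketch of the kind of warning that change might add; the helper name, logger, and message below are illustrative rather than the PR's actual diff, and util refers to the executor util package seen in the excerpt.)

import (
    log "github.com/sirupsen/logrus"
    v1 "k8s.io/api/core/v1"
)

// collectPodsByRunId is a hypothetical helper showing where the warning would go.
func collectPodsByRunId(pods []*v1.Pod) map[string]*v1.Pod {
    podsByRunId := map[string]*v1.Pod{}
    for _, pod := range pods {
        runId := util.ExtractJobRunId(pod)
        if runId == "" {
            // Pods managed by the executor are expected to carry a job run id;
            // log so unexpected pods are visible rather than silently skipped.
            log.Warnf("could not extract run id from pod %s/%s, skipping", pod.Namespace, pod.Name)
            continue
        }
        podsByRunId[runId] = pod
    }
    return podsByRunId
}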

result := make([]*issue, 0, len(podIssues))

for _, podIssue := range podIssues {
    relatedPod := podsByRunId[podIssue.RunId]
Collaborator

what happens if podsByRunId doesn't have the key?

Contributor Author

That is fine, it'll just be nil, which is often expected - for example if the pod was deleted and we are reporting that it was deleted unexpectedly.
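(For anyone reading along: indexing a Go map with a missing key returns the value type's zero value rather than panicking, so the lookup itself is safe. A tiny illustration using the names from the excerpt, assuming relatedPod is a *v1.Pod:)

relatedPod := podsByRunId[podIssue.RunId]
if relatedPod == nil {
    // No matching pod - expected when the pod has already been deleted and we are
    // reporting that it went away unexpectedly. Downstream code must tolerate nil.
}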

}

// For retryable issues we must:
// - Report JobUnableToScheduleEvent
Collaborator

I'm not sure the scheduler cares about unableToScheduleEvent. What's this used for?

Contributor Author

It is informational.

Really it should just be "StartUpIssueEvent".

We could remove it - and we probably should if we had better state management.

It is sometimes helpful when debugging to see that the issue was detected and that an attempt was made to handle it:

  • Especially because part of handling the issue is deleting the pod, which could mean the executor loses state during this process (i.e. the executor deletes the pod and then shuts down before it can return the lease).
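A rough, purely illustrative outline of the ordering being described - report the informational event first, then delete the pod - with hypothetical stub names rather than the PR's actual functions:

// Hypothetical types and stubs mirroring the excerpt; not the PR's real definitions.
type issue struct{ RunId string }

func reportStartupIssueEvent(i *issue) error { return nil } // send the informational event
func deleteStuckPod(runId string) error      { return nil } // delete the pod so the run can retry

func handleRetryableStartupIssue(i *issue) error {
    // Report first: if the executor deletes the pod and then shuts down before it
    // can return the lease, the event still records that the issue was detected.
    if err := reportStartupIssueEvent(i); err != nil {
        return err
    }
    return deleteStuckPod(i.RunId)
}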

wg := &sync.WaitGroup{}
processChannel := make(chan K)

for i := 0; i < commonUtil.Min(len(itemsToProcess), maxThreadCount); i++ {
Collaborator

pretty sure there's a min function in the standard library now

Contributor Author

Are you sure? I can't seem to find it - would you mind linking to anywhere we use it?

New Go does allow writing a generic min function, so you're probably correct.

Collaborator

Ah, I'm wrong - it doesn't exist yet. Hopefully it will soon.
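For context: at the time of this thread there was no min in the standard library, but with Go 1.18 type parameters a generic helper is only a few lines (and a built-in min/max later landed in Go 1.21). A minimal sketch, not necessarily how commonUtil.Min is implemented in this repo:

package commonutil

import "golang.org/x/exp/constraints"

// Min returns the smaller of two ordered values.
func Min[T constraints.Ordered](a, b T) T {
    if a < b {
        return a
    }
    return b
}

Usage then matches the call in the excerpt: Min(len(itemsToProcess), maxThreadCount).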

d80tb7 previously approved these changes Feb 7, 2023
JamesMurkin merged commit 13361d8 into master Feb 8, 2023
JamesMurkin deleted the issue_handling_pods branch February 8, 2023 09:59