Add Pod Issue handling for jobs created using Executor API #2065

Merged
JamesMurkin merged 28 commits into master from issue_handling_pods on Feb 8, 2023

Conversation

JamesMurkin
Contributor

As part of the migration to the Executor API, there are now two flows through the executor.

The new flow had no handling of stuck pods, which is what this PR introduces.

It was simpler to split this off because the handling differs slightly from the existing code:

  • The existing code is tied in with lease handling.
  • The existing code has to make additional calls (return lease, report done), whereas the new Executor API flow is purely event based.


codecov-commenter commented Jan 26, 2023

Codecov Report

Base: 47.11% // Head: 47.38% // Increases project coverage by +0.27% 🎉

Coverage data is based on head (1337990) compared to base (8116fa6).
Patch coverage: 73.79% of modified lines in pull request are covered.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2065      +/-   ##
==========================================
+ Coverage   47.11%   47.38%   +0.27%     
==========================================
  Files         229      233       +4     
  Lines       30251    32143    +1892     
==========================================
+ Hits        14252    15231     +979     
- Misses      14949    15816     +867     
- Partials     1050     1096      +46     
Flag Coverage Δ
airflow-operator 89.13% <ø> (ø)
armada-server 46.44% <73.79%> (+0.33%) ⬆️
python-client 93.83% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
internal/executor/application.go 10.29% <0.00%> (-0.32%) ⬇️
internal/executor/job/job_context.go 32.27% <0.00%> (+0.50%) ⬆️
internal/executor/util/pod_util.go 76.47% <ø> (-1.74%) ⬇️
internal/executor/service/pod_issue_handler.go 73.39% <73.39%> (ø)
internal/executor/context/cluster_context.go 57.34% <100.00%> (ø)
internal/executor/service/job_manager.go 60.13% <100.00%> (-3.78%) ⬇️
internal/executor/util/process.go 100.00% <100.00%> (ø)
internal/scheduler/schedulerapp.go 0.00% <0.00%> (ø)
internal/scheduleringester/ingester.go 0.00% <0.00%> (ø)
internal/armada/server/submit_to_log.go 0.00% <0.00%> (ø)
... and 18 more

☔ View full report at Codecov.

JamesMurkin marked this pull request as ready for review February 6, 2023 11:02

const (
    UnableToSchedule IssueType = iota
    StuckStartingUp  IssueType = iota
Collaborator

you can get rid of all the iotas after the first if you want
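For reference, a minimal sketch of the suggested simplification: only the first constant needs the explicit IssueType = iota, and later constants repeat the expression implicitly. The type IssueType int declaration below is an assumption about the underlying type, not copied from the PR.

type IssueType int

const (
    UnableToSchedule IssueType = iota // iota starts at 0
    StuckStartingUp                   // implicitly repeats "IssueType = iota", so this is 1
)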

runId := util.ExtractJobRunId(pod)
if runId != "" {
    podsByRunId[runId] = pod
}
Collaborator

what does it mean if runId is not present here? Should we log something?

Contributor Author

Done
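(A hypothetical sketch of the kind of warning that change might add; the helper name, logger, and message below are illustrative rather than the PR's actual diff, and util refers to the executor util package seen in the excerpt.)

import (
    log "github.com/sirupsen/logrus"
    v1 "k8s.io/api/core/v1"
)

// collectPodsByRunId is a hypothetical helper showing where the warning would go.
func collectPodsByRunId(pods []*v1.Pod) map[string]*v1.Pod {
    podsByRunId := map[string]*v1.Pod{}
    for _, pod := range pods {
        runId := util.ExtractJobRunId(pod)
        if runId == "" {
            // Pods managed by the executor are expected to carry a job run id;
            // log so unexpected pods are visible rather than silently skipped.
            log.Warnf("could not extract run id from pod %s/%s, skipping", pod.Namespace, pod.Name)
            continue
        }
        podsByRunId[runId] = pod
    }
    return podsByRunId
}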

result := make([]*issue, 0, len(podIssues))

for _, podIssue := range podIssues {
    relatedPod := podsByRunId[podIssue.RunId]
Collaborator

what happens if podsByRunId doesn't have the key?

Contributor Author

That is fine, it'll just be nil, which is often expected - for example if the pod was deleted and we are reporting that it was deleted unexpectedly.
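(For anyone reading along: indexing a Go map with a missing key returns the value type's zero value rather than panicking, so the lookup itself is safe. A tiny illustration using the names from the excerpt, assuming relatedPod is a *v1.Pod:)

relatedPod := podsByRunId[podIssue.RunId]
if relatedPod == nil {
    // No matching pod - expected when the pod has already been deleted and we are
    // reporting that it went away unexpectedly. Downstream code must tolerate nil.
}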

}

// For retryable issues we must:
// - Report JobUnableToScheduleEvent
Collaborator

I'm not sure the scheduler cares about unableToScheduleEvent. What's this used for?

Contributor Author

It is informational.

Really it should just be "StartUpIssueEvent".

We could remove it - and we probably should if we had better state management.

It is sometimes helpful when debugging to see that the issue was detected and that an attempt was made to handle it:

  • Especially because part of handling the issue is deleting the pod, which could mean the executor loses state during this process (i.e. the executor deletes the pod and then shuts down before it can return the lease).
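A rough, purely illustrative outline of the ordering being described - report the informational event first, then delete the pod - with hypothetical stub names rather than the PR's actual functions:

// Hypothetical types and stubs mirroring the excerpt; not the PR's real definitions.
type issue struct{ RunId string }

func reportStartupIssueEvent(i *issue) error { return nil } // send the informational event
func deleteStuckPod(runId string) error      { return nil } // delete the pod so the run can retry

func handleRetryableStartupIssue(i *issue) error {
    // Report first: if the executor deletes the pod and then shuts down before it
    // can return the lease, the event still records that the issue was detected.
    if err := reportStartupIssueEvent(i); err != nil {
        return err
    }
    return deleteStuckPod(i.RunId)
}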

wg := &sync.WaitGroup{}
processChannel := make(chan K)

for i := 0; i < commonUtil.Min(len(itemsToProcess), maxThreadCount); i++ {
Collaborator

pretty sure there's a min function in the standard library now

Contributor Author

Are you sure? I can't seem to find it - would you mind linking to anywhere we use it?

New Go does allow writing a generic min function, so you're probably correct.

Collaborator

Ah, I'm wrong - it doesn't exist yet. Hopefully it will soon.
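For context: at the time of this thread there was no min in the standard library, but with Go 1.18 type parameters a generic helper is only a few lines (and a built-in min/max later landed in Go 1.21). A minimal sketch, not necessarily how commonUtil.Min is implemented in this repo:

package commonutil

import "golang.org/x/exp/constraints"

// Min returns the smaller of two ordered values.
func Min[T constraints.Ordered](a, b T) T {
    if a < b {
        return a
    }
    return b
}

Usage then matches the call in the excerpt: Min(len(itemsToProcess), maxThreadCount).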

d80tb7 previously approved these changes Feb 7, 2023
JamesMurkin merged commit 13361d8 into master Feb 8, 2023
JamesMurkin deleted the issue_handling_pods branch February 8, 2023 09:59