Update IsManagedPod to check for job run id and job id #3620

Closed
wants to merge 5 commits into from

Conversation

@Sovietaced (Contributor) commented May 23, 2024

We had an issue where a v3 and a v4 executor were running on the same cluster: the v4 executor was pulling empty job run ids from pods launched by the v3 executor. We realize we shouldn't run two executors in the same cluster, but the filtering logic in the cluster utilization code should still check for the job run id label, since that is what it ultimately reads from the pod and assumes is present.

Fixes: #3618
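As a sketch of the check this PR proposes: a pod is only treated as managed if it carries both the job id and the job run id labels, so pods launched by an executor that doesn't set the run id label (e.g. a v3 executor) are filtered out. Label key constants here are hypothetical, not Armada's real ones.

```go
package main

import "fmt"

// Hypothetical label keys; the real constants live in Armada's codebase.
const (
	jobIDLabel    = "armada_job_id"
	jobRunIDLabel = "armada_job_run_id"
)

// isManagedPod reports whether a pod carries both labels the
// utilization logic later reads, instead of assuming they are present.
func isManagedPod(labels map[string]string) bool {
	if _, ok := labels[jobIDLabel]; !ok {
		return false
	}
	_, ok := labels[jobRunIDLabel]
	return ok
}

func main() {
	// Pod from an older executor: job id only, no run id label.
	v3Pod := map[string]string{jobIDLabel: "01hz0qfqhw4ag8rvtvw8rvde8b"}
	// Pod from a current executor: both labels set.
	v4Pod := map[string]string{
		jobIDLabel:    "01hz0qfqhw4ag8rvtvw8rvde8b",
		jobRunIDLabel: "6cbf7d43-1276-4c41-9b4d-1a081a2b2e51",
	}
	fmt.Println(isManagedPod(v3Pod), isManagedPod(v4Pod)) // false true
}
```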

@Sovietaced Sovietaced marked this pull request as ready for review May 23, 2024 17:49
@dave-gantenbein (Member)

We're going to double check with one of the core devs tomorrow, but most likely we need both labels on the pod. I think what we want to do is validate that the executor has given us valid run ids and reject their update if they haven't.

@Sovietaced (Contributor, Author)

> We're going to double check with one of the core devs tomorrow, but most likely we need both labels on the pod. I think what we want to do is validate that the executor has given us valid run ids and reject their update if they haven't.

I think adding validation on the server side is definitely ideal. I'd be happy to add that as well just let me know!
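The server-side validation discussed above could look roughly like the following: reject executor updates whose run ids are missing or zeroed, rather than trusting the executor to have set them. The function name and the use of the all-zero UUID string are assumptions for illustration (the linked issue #3618 is a scheduler crash on a zero'd job run UUID).

```go
package main

import (
	"fmt"
	"strings"
)

// zeroUUID is the all-zero UUID that issue #3618 reports the
// scheduler crashing on.
const zeroUUID = "00000000-0000-0000-0000-000000000000"

// validRunID sketches a server-side guard: an update's run id must be
// non-empty and not the zero UUID before it is accepted.
func validRunID(runID string) bool {
	runID = strings.TrimSpace(runID)
	return runID != "" && runID != zeroUUID
}

func main() {
	fmt.Println(validRunID(""))                                     // false
	fmt.Println(validRunID(zeroUUID))                               // false
	fmt.Println(validRunID("6cbf7d43-1276-4c41-9b4d-1a081a2b2e51")) // true
}
```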

@dave-gantenbein (Member)

@Sovietaced - by all means!

@Sovietaced Sovietaced changed the title Update IsManagedPod to check for job run id instead of just job id Update IsManagedPod to check for job run id and job id May 24, 2024
@Sovietaced (Contributor, Author)

@dave-gantenbein should be good for review. I think the go mod CI job flaked, so that failure should be unrelated.

@Sovietaced (Contributor, Author)

Got a unit test flake that seems unrelated:

=== RUN   TestGetJobsByState/anyOf
    getjobs_test.go:816: 
        	Error Trace:	/home/runner/work/armada/armada/internal/lookoutv2/repository/getjobs_test.go:816
        	Error:      	Not equal: 
        	            	expected: &model.Job{Annotations:map[string]string{}, Cancelled:<nil>, Cpu:15000, Duplicate:false, EphemeralStorage:107374182400, Gpu:8, JobId:"01hz0qfqhw4ag8rvtvw8rvde8b", JobSet:"job-set-1", LastActiveRunId:(*string)(0xc000726870), LastTransitionTime:time.Date(2022, time.March, 1, 15, 4, 5, 0, time.UTC), Memory:51539607552, Owner:"user-1", Namespace:(*string)(0xc0008184c0), Priority:12, PriorityClass:(*string)(0xc0007267f0), Queue:"queue-1", Runs:[]*model.Run{(*model.Run)(0xc000053e60)}, State:"RUNNING", Submitted:time.Date(2022, time.March, 1, 15, 4, 5, 0, time.UTC), CancelReason:(*string)(nil), Node:(*string)(0xc000726880), Cluster:"cluster-1", ExitCode:(*int32)(nil), RuntimeSeconds:70790764}
        	            	actual  : &model.Job{Annotations:map[string]string{}, Cancelled:<nil>, Cpu:15000, Duplicate:false, EphemeralStorage:107374182400, Gpu:8, JobId:"01hz0qfqhw4ag8rvtvw8rvde8b", JobSet:"job-set-1", LastActiveRunId:(*string)(0xc000796ff0), LastTransitionTime:time.Date(2022, time.March, 1, 15, 4, 5, 0, time.UTC), Memory:51539607552, Owner:"user-1", Namespace:(*string)(0xc000797008), Priority:12, PriorityClass:(*string)(0xc000797020), Queue:"queue-1", Runs:[]*model.Run{(*model.Run)(0xc0007de420)}, State:"RUNNING", Submitted:time.Date(2022, time.March, 1, 15, 4, 5, 0, time.UTC), CancelReason:(*string)(nil), Node:(*string)(0xc0007e0480), Cluster:"cluster-1", ExitCode:(*int32)(nil), RuntimeSeconds:70790765}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -63,3 +63,3 @@
        	            	  ExitCode: (*int32)(<nil>),
        	            	- RuntimeSeconds: (int32) 70790764
        	            	+ RuntimeSeconds: (int32) 70790765
        	            	 })
        	Test:       	TestGetJobsByState/anyOf
--- FAIL: TestGetJobsByState/anyOf (0.00s)
FAIL internal/lookoutv2/repository.TestGetJobsByState/anyOf (0.00s)

@richscott (Member)

@Sovietaced Can you rebase this branch from master, then if the unit tests pass again at least once, this should be good to go.

…de job run ids

Signed-off-by: Jason Parraga <sovietaced@gmail.com>
@Sovietaced (Contributor, Author)

> @Sovietaced Can you rebase this branch from master, then if the unit tests pass again at least once, this should be good to go.

done

@dave-gantenbein (Member)

@Sovietaced - we discussed this and while it won't cause any negative impact, I'm wondering if you've updated things so the executors and server are on the same version, which would be the recommended pattern. If there's something blocking you from doing that let me know here or in slack and we can merge this, but again this isn't how we'd recommend you go about setting things up long term, as other issues could potentially creep in and we won't be testing this scenario...

@Sovietaced (Contributor, Author)

> @Sovietaced - we discussed this and while it won't cause any negative impact, I'm wondering if you've updated things so the executors and server are on the same version, which would be the recommended pattern. If there's something blocking you from doing that let me know here or in slack and we can merge this, but again this isn't how we'd recommend you go about setting things up long term, as other issues could potentially creep in and we won't be testing this scenario...

We haven't updated things yet and have been running a fork, but as I already mentioned, I realize this is not a supported deployment. I don't think the PR makes the codebase any worse, considering there is no validation being done in the scheduler API, but you do you.

@Sovietaced Sovietaced closed this Jun 19, 2024
Successfully merging this pull request may close these issues.

Armada scheduler crashes due to zero'd job run UUID