Update IsManagedPod to check for job run id and job id #3620

Closed
wants to merge 5 commits into from

Conversation

@Sovietaced (Contributor) commented May 23, 2024

We had an issue where a v3 and a v4 executor were running on the same cluster: the v4 executor was pulling empty job run ids from pods launched by the v3 executor. We realize we shouldn't run two executors in the same cluster, but the filtering logic in the cluster utilization code should still check for the job run id label, since that is what it ultimately reads from the pod and assumes is present.

Fixes: #3618
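As a sketch of the check this PR proposes: a pod is only treated as managed if it carries both the job id and the job run id labels, so pods launched by an executor that doesn't set the run id label (e.g. a v3 executor) are filtered out. Label key constants here are hypothetical, not Armada's real ones.

```go
package main

import "fmt"

// Hypothetical label keys; the real constants live in Armada's codebase.
const (
	jobIDLabel    = "armada_job_id"
	jobRunIDLabel = "armada_job_run_id"
)

// isManagedPod reports whether a pod carries both labels the
// utilization logic later reads, instead of assuming they are present.
func isManagedPod(labels map[string]string) bool {
	if _, ok := labels[jobIDLabel]; !ok {
		return false
	}
	_, ok := labels[jobRunIDLabel]
	return ok
}

func main() {
	// Pod from an older executor: job id only, no run id label.
	v3Pod := map[string]string{jobIDLabel: "01hz0qfqhw4ag8rvtvw8rvde8b"}
	// Pod from a current executor: both labels set.
	v4Pod := map[string]string{
		jobIDLabel:    "01hz0qfqhw4ag8rvtvw8rvde8b",
		jobRunIDLabel: "6cbf7d43-1276-4c41-9b4d-1a081a2b2e51",
	}
	fmt.Println(isManagedPod(v3Pod), isManagedPod(v4Pod)) // false true
}
```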

@Sovietaced Sovietaced marked this pull request as ready for review May 23, 2024 17:49
@dave-gantenbein (Member)

We're going to double check with one of the core devs tomorrow, but most likely we need both labels on the pod. I think what we want to do is validate that the executor has given us valid run ids and reject their update if they haven't.

@Sovietaced (Contributor, Author)

> We're going to double check with one of the core devs tomorrow, but most likely we need both labels on the pod. I think what we want to do is validate that the executor has given us valid run ids and reject their update if they haven't.

I think adding validation on the server side is definitely ideal. I'd be happy to add that as well just let me know!
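The server-side validation discussed above could look roughly like the following: reject executor updates whose run ids are missing or zeroed, rather than trusting the executor to have set them. The function name and the use of the all-zero UUID string are assumptions for illustration (the linked issue #3618 is a scheduler crash on a zero'd job run UUID).

```go
package main

import (
	"fmt"
	"strings"
)

// zeroUUID is the all-zero UUID that issue #3618 reports the
// scheduler crashing on.
const zeroUUID = "00000000-0000-0000-0000-000000000000"

// validRunID sketches a server-side guard: an update's run id must be
// non-empty and not the zero UUID before it is accepted.
func validRunID(runID string) bool {
	runID = strings.TrimSpace(runID)
	return runID != "" && runID != zeroUUID
}

func main() {
	fmt.Println(validRunID(""))                                     // false
	fmt.Println(validRunID(zeroUUID))                               // false
	fmt.Println(validRunID("6cbf7d43-1276-4c41-9b4d-1a081a2b2e51")) // true
}
```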

@dave-gantenbein (Member)

@Sovietaced - by all means!

@Sovietaced Sovietaced changed the title Update IsManagedPod to check for job run id instead of just job id Update IsManagedPod to check for job run id and job id May 24, 2024
@Sovietaced (Contributor, Author)

@dave-gantenbein should be good for review. I think the go mod CI job flaked, so that failure should be unrelated.

@Sovietaced (Contributor, Author)

Got a unit test flake that seems unrelated:

=== RUN   TestGetJobsByState/anyOf
    getjobs_test.go:816: 
        	Error Trace:	/home/runner/work/armada/armada/internal/lookoutv2/repository/getjobs_test.go:816
        	Error:      	Not equal: 
        	            	expected: &model.Job{Annotations:map[string]string{}, Cancelled:<nil>, Cpu:15000, Duplicate:false, EphemeralStorage:107374182400, Gpu:8, JobId:"01hz0qfqhw4ag8rvtvw8rvde8b", JobSet:"job-set-1", LastActiveRunId:(*string)(0xc000726870), LastTransitionTime:time.Date(2022, time.March, 1, 15, 4, 5, 0, time.UTC), Memory:51539607552, Owner:"user-1", Namespace:(*string)(0xc0008184c0), Priority:12, PriorityClass:(*string)(0xc0007267f0), Queue:"queue-1", Runs:[]*model.Run{(*model.Run)(0xc000053e60)}, State:"RUNNING", Submitted:time.Date(2022, time.March, 1, 15, 4, 5, 0, time.UTC), CancelReason:(*string)(nil), Node:(*string)(0xc000726880), Cluster:"cluster-1", ExitCode:(*int32)(nil), RuntimeSeconds:70790764}
        	            	actual  : &model.Job{Annotations:map[string]string{}, Cancelled:<nil>, Cpu:15000, Duplicate:false, EphemeralStorage:107374182400, Gpu:8, JobId:"01hz0qfqhw4ag8rvtvw8rvde8b", JobSet:"job-set-1", LastActiveRunId:(*string)(0xc000796ff0), LastTransitionTime:time.Date(2022, time.March, 1, 15, 4, 5, 0, time.UTC), Memory:51539607552, Owner:"user-1", Namespace:(*string)(0xc000797008), Priority:12, PriorityClass:(*string)(0xc000797020), Queue:"queue-1", Runs:[]*model.Run{(*model.Run)(0xc0007de420)}, State:"RUNNING", Submitted:time.Date(2022, time.March, 1, 15, 4, 5, 0, time.UTC), CancelReason:(*string)(nil), Node:(*string)(0xc0007e0480), Cluster:"cluster-1", ExitCode:(*int32)(nil), RuntimeSeconds:70790765}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -63,3 +63,3 @@
        	            	  ExitCode: (*int32)(<nil>),
        	            	- RuntimeSeconds: (int32) 70790764
        	            	+ RuntimeSeconds: (int32) 70790765
        	            	 })
        	Test:       	TestGetJobsByState/anyOf
--- FAIL: TestGetJobsByState/anyOf (0.00s)
FAIL internal/lookoutv2/repository.TestGetJobsByState/anyOf (0.00s)

@richscott (Member)

@Sovietaced Can you rebase this branch from master, then if the unit tests pass again at least once, this should be good to go.

…de job run ids

Signed-off-by: Jason Parraga <sovietaced@gmail.com>
@Sovietaced (Contributor, Author)

> @Sovietaced Can you rebase this branch from master, then if the unit tests pass again at least once, this should be good to go.

done

@dave-gantenbein (Member)

@Sovietaced - we discussed this and while it won't cause any negative impact, I'm wondering if you've updated things so the executors and server are on the same version, which would be the recommended pattern. If there's something blocking you from doing that let me know here or in slack and we can merge this, but again this isn't how we'd recommend you go about setting things up long term, as other issues could potentially creep in and we won't be testing this scenario...

@Sovietaced (Contributor, Author)

> @Sovietaced - we discussed this and while it won't cause any negative impact, I'm wondering if you've updated things so the executors and server are on the same version, which would be the recommended pattern. If there's something blocking you from doing that let me know here or in slack and we can merge this, but again this isn't how we'd recommend you go about setting things up long term, as other issues could potentially creep in and we won't be testing this scenario...

We haven't updated things yet and have been running a fork, but as I already mentioned, I realize this is not a supported deployment. I don't think the PR makes the codebase any worse, considering there is no validation being done in the scheduler API, but you do you.

@Sovietaced Sovietaced closed this Jun 19, 2024
Successfully merging this pull request may close these issues.

Armada scheduler crashes due to zero'd job run UUID