Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call Status.Update once in reconcile #405

Closed
ahg-g opened this issue Feb 10, 2024 · 7 comments · Fixed by #494
Closed

Call Status.Update once in reconcile #405

ahg-g opened this issue Feb 10, 2024 · 7 comments · Fixed by #494
Assignees

Comments

@ahg-g
Copy link
Contributor

ahg-g commented Feb 10, 2024

What would you like to be added:

Status().Update is invoked multiple times in Reconcile, we should refactor the code so that it is called once.

Why is this needed:

Reduces calls to the api sever and avoids conflicts in the case status is actually called more that once.

@googs1025
Copy link
Member

googs1025 commented Feb 17, 2024

I agree with.
I think it is necessary to abstract something like deployJobSet updateJobSetStatus func to complete this issue. We can put all status update operations into updateJobSetStatus to implement.
as follows:

// Reconcile loop
func (r *JobSetController) Reconcile(ctx context.Context,
	req reconcile.Request) (reconcile.Result, error) {

	...
	err := r.client.Get(ctx, req.NamespacedName, jobset)
	if err != nil {
		...
		return reconcile.Result{}, err
	}


	// deploy jobset 
	if err = r.deployJobSet(ctx, jobset); err != nil {
		...
		return reconcile.Result{RequeueAfter: time.Second * 60}, err
	}

	// update status
	if err = r.updateJobSetStatus(ctx, jobset); err != nil {
	        ...
		return reconcile.Result{}, err
	}

	return reconcile.Result{}, nil
}

@ahg-g
Copy link
Contributor Author

ahg-g commented Feb 18, 2024

Thanks! @googs1025 do you want to try and address this?

@googs1025
Copy link
Member

haha, thank @ahg-g ! Bug I have just learned this project lately, and I am not yet suitable to participate in such a large refactoring. I should be involved in smaller aspects first, such as unit testing or fixing spelling errors, etc.

@danielvegamyhre
Copy link
Contributor

/assign

@danielvegamyhre
Copy link
Contributor

danielvegamyhre commented Apr 3, 2024

Update on this: even if I remove all Status.Update calls in the jobset controller (so we are only modifying the jobset status in memory, not persisting it server side yet), then do 1 deferred Status.Update call at the end of each reconcile attempt, the integration tests are still full of “object has been modified, please reapply on latest resource version” errors.

At first I was thinking the tests creating/deleting child Jobs as owned resources would trigger another concurrent reconciliation attempt, which would modify the js.Status.ReplicatedJobsStatus (and thus update the jobset resource version), resulting in a race between concurrent Reconcile attempts for which status update happens first. However, the jobset controller uses the default MaxConcurrentReconciles: 1 so those jobset status updates should be happening sequentially, not concurrently.

I am thinking maybe those logs are the result of the test code itself, still investigating this.

@kannon92
Copy link
Contributor

kannon92 commented Apr 3, 2024

IMO let's focus on eliminating these logs from the prod. Does your update fix the logs on a real k8s cluster?

I also spent a lot of time trying to eliminate those from integration tests and I wasn't able..

@kannon92
Copy link
Contributor

kannon92 commented Apr 3, 2024

I noticed that the logs in Kueue integration tests also have this so I was thinking it was related to the controller runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants