
Server-side Apply problem #1311

Closed
allenhaozi opened this issue Dec 27, 2020 · 17 comments

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. wg/api-expression Categorizes an issue or PR as relevant to WG API Expression.

Comments

@allenhaozi commented Dec 27, 2020

In our internal scenario, we developed a PaaS platform based on OAM.
Our business teams simply declare the Workload and Traits they use.
There are two types of Workload: ServerWorkload (online) and TaskWorkload (offline).
Traits include ManualScalerTrait, AutoScalerTrait, LoadBalanceTrait, etc.

The ServerWorkload renders a Deployment and creates it.
When the ManualScalerTrait observes that this Deployment has been created, it changes the replicas of this Deployment.

In practice, however, we found that once our Operator was restarted, the Pods already running under the Deployment would be restarted too.

We found the following events:

Events:
  Type    Reason             Age                  From                   Message
  ----    ------             ----                 ----                   -------
  Normal  ScalingReplicaSet  13s (x12 over 5h5m)  deployment-controller  Scaled down replica set simple-web-33-5d759bd4cf to 0
  Normal  ScalingReplicaSet  12s (x15 over 5h7m)  deployment-controller  Scaled up replica set simple-web-33-5d759bd4cf to 3

Because the replicas of the Deployment rendered by ServerWorkload is 0 but the value set by ManualScalerTrait is 3, the Deployment scales from 3 to 0 and then back to 3 again, which causes a restart.

Based on this implementation mechanism, it is difficult for ServerWorkload to patch the Deployment, because it renders a complete Deployment template.

If the API server could provide a parameter like client.SkipConflictFields, that would be a perfect solution:

ao := []client.PatchOption{client.FieldOwner(c1.GetUID()), client.SkipConflictFields}
err := r.Patch(ctx, deploy, client.Apply, ao...)

controller-runtime version: 0.6.2

@coderanger (Contributor)

What are those GetUID functions?

@coderanger (Contributor)

Oh the second isn’t a server side apply. Check the patch diff you are sending by logging it or similar. Will probably show something unexpected.
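A minimal debugging sketch along these lines, assuming controller-runtime's client package and a logr.Logger named log (both names are just placeholders here):

// Log the exact JSON body that client.Apply would send for this object
// before issuing the patch.
if data, err := client.Apply.Data(deploy); err == nil {
	log.Info("server-side apply body", "patch", string(data))
}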

@allenhaozi (Author)

Oh the second isn’t a server side apply. Check the patch diff you are sending by logging it or similar. Will probably show something unexpected.

Thank you for your response.
Is there an example of multiple controllers managing one resource using server-side apply?

@coderanger (Contributor)

I do it in my rabbitmq-operator, https://github.com/coderanger/rabbitmq-operator/blob/main/controllers/rabbituser.go though the code is probably going to be hard to follow. Also it's been a huge source of bugs and I really want to get rid of it :D

@allenhaozi (Author)

When I restart the controller pod, I get the following message:

ao := []client.PatchOption{client.FieldOwner(c1.GetUID())}
err := r.Patch(ctx, deploy, client.Apply, ao...)

utils.Pretty(err)
([]interface {}) (len=1 cap=1) {
 (*errors.withStack)(0xc00070efa0)(cannot patch object:apply: Apply failed with 5 conflicts: conflicts with "c060a82f-e150-47a2-a1a3-83c863113143" using apps/v1:
- .spec.replicas
- .spec.template.spec.containers[name="simple-web-0"].resources.requests.memory
- .spec.template.spec.containers[name="simple-web-0"].resources.requests.memory
- .spec.template.spec.containers[name="simple-web-0"].resources.requests.memory
- .spec.template.spec.containers[name="simple-web-0"].resources.requests.memory)
}

When I change ao to []client.PatchOption{client.ForceOwnership, client.FieldOwner(c1.GetUID())},

it leads to the Deployment's Pods restarting, because the C1 controller changes the replicas of the Deployment,

although managedFields shows that replicas is managed by C2.

@coderanger (Contributor)

Field manager names should not be UUIDs, they should be names, usually the name of the controller (or subcontroller element).
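For illustration, a sketch of what named field owners could look like for the two controllers in this thread (the manager names below are made up):

// Stable, human-readable field manager names instead of UIDs.
workloadOpts := []client.PatchOption{client.FieldOwner("serverworkload-controller")}
scalerOpts := []client.PatchOption{client.FieldOwner("manualscaler-controller")}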

@kwiesmueller (Member)

The conflict probably means that your apply contains those fields.

When using SSA one should only send the fields they care about. In this case your operator should only specify (I think) replicas, and not other fields.
Otherwise it will try to own them and potentially overwrite them if they are different.

If two controllers are intended to manage the same resource, they should never specify a patch that would make their ownership collide.

SSA is not meant for sending the full object if you don't care about the full object in your workflow.

I don't know how you're building your object here, are you applying the full object?

/wg api-expression
/cc @apelisse @jpbetz
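To illustrate "only send the fields you care about", here is a hedged sketch of a scaler-side apply that carries nothing but spec.replicas, sent as a raw apply patch (object name, namespace, and field manager are placeholders):

// An apply body that only claims spec.replicas.
patch := []byte(`{
  "apiVersion": "apps/v1",
  "kind": "Deployment",
  "metadata": {"name": "simple-web-33", "namespace": "default"},
  "spec": {"replicas": 3}
}`)

deploy := &appsv1.Deployment{}
deploy.Name = "simple-web-33"
deploy.Namespace = "default"

err := r.Patch(ctx, deploy, client.RawPatch(types.ApplyPatchType, patch),
	client.FieldOwner("manualscaler-controller"))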

@k8s-ci-robot k8s-ci-robot added the wg/api-expression Categorizes an issue or PR as relevant to WG API Expression. label Dec 28, 2020
@allenhaozi (Author)


In my case, I have two controllers.
The first controller is responsible for creating the Deployment with Spec.Replicas set to 0.
The second controller is responsible for scaling the Deployment, setting Spec.Replicas = 3.

When I restart the controllers, I get the following events:

Events:
  Type    Reason             Age                  From                   Message
  ----    ------             ----                 ----                   -------
  Normal  ScalingReplicaSet  13s (x12 over 5h5m)  deployment-controller  Scaled down replica set simple-web-33-5d759bd4cf to 0
  Normal  ScalingReplicaSet  12s (x15 over 5h7m)  deployment-controller  Scaled up replica set simple-web-33-5d759bd4cf to 3

@kwiesmueller (Member)

That should be the representation of a restart for all pods. You scale down and then back up.
Your first controller code above does a Patch and not a create, so it will also update the resource when it's already created.
You might want to check if it's not there already.
Then the second controller can "steal" the field from the create-controller.

Make sure both controllers have different fieldManagers or they might bounce around ownership (one fieldManager has to send the fields it cares about on every apply or they will get dropped).

And make sure that the second controller only sends the replicas field.
I don't know off the top of my head how this is intended to be used with the controller-runtime client (maybe @DirectXMan12 can help?), but in general with SSA you can supply an incomplete JSON that only specifies the fields you care about.
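For what it's worth, a sketch of how that could look through the controller-runtime client, using an unstructured object so that only the replicas field is serialized (object coordinates and field manager name are illustrative, error handling omitted):

// The scaler applies only spec.replicas under its own field manager.
u := &unstructured.Unstructured{}
u.SetAPIVersion("apps/v1")
u.SetKind("Deployment")
u.SetName("simple-web-33")
u.SetNamespace("default")
_ = unstructured.SetNestedField(u.Object, int64(3), "spec", "replicas")

err := r.Patch(ctx, u, client.Apply, client.FieldOwner("manualscaler-controller"))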

@allenhaozi (Author) commented Dec 28, 2020

Is there a way to remove or fill Deployment fields, based on managedFields, between the desired Deployment and the current Deployment?

I have two kinds of controllers:

assume controller 1 is named Workload
and controller 2 is named Scaler.

This Apply function is used by Workload to create or update the Deployment:

func (a *Applicator) Apply(ctx context.Context, parent, current runtime.Object, ao ...client.PatchOption) error {
	// Record the parent as an owner of the rendered object.
	ownerRef := utils.GenOwnerReferenceByObject(parent)
	crmeta.AddOwnerReference(current, ownerRef)

	if deploy, ok := current.(*appsv1.Deployment); ok {
		desired := deploy.DeepCopy()
		err := a.Get(ctx, types.NamespacedName{Name: current.GetName(), Namespace: current.GetNamespace()}, deploy)
		if apierrors.IsNotFound(err) {
			// Not created yet: fall through to server-side apply, owning all fields.
			ao = append(ao, client.ForceOwnership)
		} else {
			// Already exists: merge-patch the rendered object over the live one.
			patchOpts := []client.PatchOption{
				client.FieldOwner(utils.GenFieldOwnerByObject(parent))}
			a.prePatch(parent, desired, deploy)
			return errors.Wrap(a.Patch(ctx, desired, client.MergeFrom(deploy), patchOpts...), "cannot patch object:merge")
		}
	}

	// Server-side Apply
	return errors.Wrap(a.Patch(ctx, current, client.Apply, ao...), "cannot patch object:apply")
}

// Can fields be removed or filled in here based on managedFields?
// For example, use the live Deployment's replicas to replace the desired one's?
func (a *Applicator) prePatch(parent, desired, current runtime.Object) {

}
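A minimal sketch of what prePatch could do, under the assumption that the Workload controller should never fight over replicas: copy the live value from the cluster object into the rendered one before patching (this is just one option; dropping the field from the apply entirely is another):

// Keep whatever replicas value is currently live so the merge patch does not
// scale the Deployment back to the rendered value.
func (a *Applicator) prePatch(parent, desired, current runtime.Object) {
	desiredDeploy, ok1 := desired.(*appsv1.Deployment)
	currentDeploy, ok2 := current.(*appsv1.Deployment)
	if !ok1 || !ok2 {
		return
	}
	if currentDeploy.Spec.Replicas != nil {
		replicas := *currentDeploy.Spec.Replicas
		desiredDeploy.Spec.Replicas = &replicas
	}
}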

This Patch function is used by the Scaler to scale the Deployment:

// Patch method used by the Scaler.
func (a *Applicator) ScalePatch(ctx context.Context, scaler, desired, current runtime.Object) error {

	ownerRef := utils.GenOwnerReferenceByTrait(scaler)
	crmeta.AddOwnerReference(desired, ownerRef)

	if err := a.Patch(ctx, desired, client.MergeFrom(current), client.FieldOwner(utils.GenFieldOwnerByObject(scaler))); err != nil {
		return err
	}

	return nil
}

@kwiesmueller (Member)

I think I don't understand your setup yet.
You said you want one controller to create Deployments and one to set the replicas.
The snippet above looks like you're doing both in one place?

Can you clarify the goal you want to reach?
Right now you're doing either an Apply or a merge patch depending on the object's existence.

@allenhaozi (Author) commented Dec 29, 2020

I know what my problem is.

The Deployment that my WorkloadController controls has the following ownerReferences:

ownerReferences:
  - apiVersion: core.4paradigm.com/v1alpha2
    blockOwnerDeletion: true
    controller: true
    kind: WorkloadController
    name: simple-web-29
    uid: d26e5886-938e-4748-9927-4eed3712feaa
  - apiVersion: core.4paradigm.com/v1alpha2
    blockOwnerDeletion: true
    controller: false
    kind: ScalerController
    name: simple-web-29
    uid: 1c156536-63af-4ce6-aea2-533836b6eef5
  uid: 7ac2aacd-c66b-4918-9731-806d93689d76

I should use a.Patch(ctx, current, client.Apply, ao...) all the time.

But I have to pre-process the Deployment that the WorkloadController renders when the Deployment already exists,
because it contains the replicas field, which has been updated by the ScalerController.

At the beginning I thought Kubernetes SSA would skip the replicas field whose managedFields entry belongs to the ScalerController;
in fact, I got an error when I didn't use client.ForceOwnership.

I need to find a way to deal with fields that have been modified by the ScalerController.

thanks @coderanger @kwiesmueller

If SSA could provide client.SkipConflictFields, that would be wonderful:

ao := []client.PatchOption{client.FieldOwner(c1.GetUID()), client.SkipConflictFields}
err := r.Patch(ctx, deploy, client.Apply, ao...)

@coderanger (Contributor)

That would be a feature request for upstream, but it seems unlikely. As mentioned, you shouldn't be sending a replicas field at all. A common mistake with Apply patches is to use a normal typed object; you almost always want to be using an unstructured. Take a look at the code in https://github.com/coderanger/controller-utils/blob/main/components/template.go (or just use it directly).
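To make the "use an unstructured" suggestion concrete, a hedged sketch of how the Workload controller could prune the fields it should not own (here, spec.replicas) before the server-side apply; the renderedDeploy variable and the field manager name are assumptions:

// Convert the rendered Deployment to unstructured, drop the fields this
// controller should not claim, then apply it server-side.
obj, err := runtime.DefaultUnstructuredConverter.ToUnstructured(renderedDeploy)
if err != nil {
	return err
}
u := &unstructured.Unstructured{Object: obj}
u.SetGroupVersionKind(appsv1.SchemeGroupVersion.WithKind("Deployment"))
unstructured.RemoveNestedField(u.Object, "spec", "replicas")
unstructured.RemoveNestedField(u.Object, "status")
unstructured.RemoveNestedField(u.Object, "metadata", "creationTimestamp")

return r.Patch(ctx, u, client.Apply, client.FieldOwner("serverworkload-controller"))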

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 21, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
