Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReplicaSet has onwer ref of the Deployment that created it #35676

Merged
merged 1 commit into from
Dec 23, 2016

Conversation

krmayankk
Copy link

@krmayankk krmayankk commented Oct 27, 2016

What this PR does / why we need it:
This enabled garbage collection for ReplicaSets and ensures they are owned by their respective Deployment objects.

fixes #33845

This is an initial PR to get feedback. Will update this quickly with unit tests if this seems like in the right direction


This change is Reviewable

@krmayankk
Copy link
Author

@caesarxuchao @Kargakis @janetkuo ptal. If this is in the right direction, please provide more inputs as to how to test this locally. I just ran unit tests.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Oct 27, 2016
@@ -355,6 +371,44 @@ func (dc *DeploymentController) syncDeployment(key string) error {
return err
}

if dc.garbageCollectorEnabled {
rsList, err := dc.rsLister.ReplicaSets(deployment.Namespace).List(labels.Everything())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move all of this in a separate function dc.classifyReplicaSets or something so we can keep the main sync handler clean?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@0xmichalis
Copy link
Contributor

We should also set owner references by default on creation of a replica set.

@0xmichalis
Copy link
Contributor

@kubernetes/deployment

@0xmichalis 0xmichalis added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Oct 27, 2016
return &ReplicaSetControllerRefManager{rsControl, controllerObject, controllerSelector, controllerKind}
}

// Classify first filters out inactive ReplicaSets, then it classify the remaining ReplicaSets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inactive pods will be garbage-collected by the podGC controller. We don't have anything similar for replica sets so I don't think we should filter out any replica sets at this point - @caesarxuchao to keep me correct

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to what @Kargakis said, I filtered out inactive pods because that was the old behavior of replicaset controller , I did the filtering in my PR to respect the old behavior. So we don't need to classify inactive replicaset

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we don't need to filter inactive ReplicaSet. We don't define what "inactive" mean for RS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already use "inactive" in some places for replica sets with no pods but it's nothing we have agreed on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Still, no need to filter them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep removed the function and comment

cm := controller.NewReplicaSetControllerRefManager(dc.rsControl, deployment.ObjectMeta, deploymentSelector, getDeploymentControllerKind())
matchesAndControlled, matchesNeedsController, controlledDoesNotMatch := cm.Classify(rsList)
for _, replicaSet := range matchesNeedsController {
err := cm.AdoptReplicaSet(replicaSet)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These replica sets are coming from a shared cache so you need to deep-copy before mutating them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should be readonly, but @krmayankk could you double check? If it's readonly, could you add a comment here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AdoptReplicaSet seems to patch the owner reference in the replica set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It generates a patch and sends the patch to server. It doesn't change anything locally.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, its readonly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I see it is used just for constructing the patch. ok

// remove the controllerRef for the RS that no longer have matching labels
var errlist []error
for _, replicaSet := range controlledDoesNotMatch {
err := cm.ReleaseReplicaSet(replicaSet)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. Deep-copy before mutating.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Adopt/Release are readonly for the object being passed to its arguments

// are the ReplicaSets that have a controllerRef pointing to the controller, but their
// labels no longer match the selector.
func (m *ReplicaSetControllerRefManager) Classify(replicaSets []*extensions.ReplicaSet) (
matchesAndControlled []*extensions.ReplicaSet,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would advise against the use of named returns since it's harder to process the function if it has multiple return calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through a long debate with lavalamp and wojtek-t last time. The names make the signature more readable. I remember we agreed if we make all returns explicit then having named return variables is not a problem.

controlledDoesNotMatch []*extensions.ReplicaSet) {
for i := range replicaSets {
replicaSet := replicaSets[i]
// TODO: Do we need to ignore inactive RS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the comment

@@ -705,6 +725,10 @@ func FilterActivePods(pods []*api.Pod) []*api.Pod {
return result
}

func IsReplicaSetActive(r *extensions.ReplicaSet) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't need this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -56,9 +59,14 @@ const (
MaxRetries = 5
)

func getDeploymentControllerKind() unversioned.GroupVersionKind {
return v1beta1.SchemeGroupVersion.WithKind("DeploymentController")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably needs to be "Deployment"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, should be "Deployment".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixd

if dc.garbageCollectorEnabled {
rsList, err := dc.rsLister.ReplicaSets(deployment.Namespace).List(labels.Everything())
if err != nil {
glog.Errorf("Error getting ReplicaSets for deployment %q: %v", deployment, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change deployment to d.Name in the arguments.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -56,9 +59,14 @@ const (
MaxRetries = 5
)

func getDeploymentControllerKind() unversioned.GroupVersionKind {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using a separate function, inline the body in the caller site.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i understand the reason for this. Are we saying its just used in one place and hence we should do this way. Keeping it consistent with getRSControllerKind() for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unnecessary indirection in the code and the deployment controller isn't the easiest one to maintain so the simpler things are the better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you now use it elsewhere too. ok

Copy link
Member

@caesarxuchao caesarxuchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll finish the rest tonight.

return &ReplicaSetControllerRefManager{rsControl, controllerObject, controllerSelector, controllerKind}
}

// Classify first filters out inactive ReplicaSets, then it classify the remaining ReplicaSets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to what @Kargakis said, I filtered out inactive pods because that was the old behavior of replicaset controller , I did the filtering in my PR to respect the old behavior. So we don't need to classify inactive replicaset

return &ReplicaSetControllerRefManager{rsControl, controllerObject, controllerSelector, controllerKind}
}

// Classify first filters out inactive ReplicaSets, then it classify the remaining ReplicaSets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we don't need to filter inactive ReplicaSet. We don't define what "inactive" mean for RS.

// are the ReplicaSets that have a controllerRef pointing to the controller, but their
// labels no longer match the selector.
func (m *ReplicaSetControllerRefManager) Classify(replicaSets []*extensions.ReplicaSet) (
matchesAndControlled []*extensions.ReplicaSet,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through a long debate with lavalamp and wojtek-t last time. The names make the signature more readable. I remember we agreed if we make all returns explicit then having named return variables is not a problem.

func (m *ReplicaSetControllerRefManager) AdoptReplicaSet(replicaSet *extensions.ReplicaSet) error {
// we should not adopt any ReplicaSets if the controller is about to be deleted
if m.controllerObject.DeletionTimestamp != nil {
return fmt.Errorf("cancel the adopt attempt for RS %s because the controlller is being deleted",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you print out the controller's name as well? It'd be useful for debugging.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

// ReleaseReplicaSet sends a patch to free the ReplicaSet from the control of the Deployment controller.
// It returns the error if the patching fails. 404 and 422 errors are ignored.
func (m *ReplicaSetControllerRefManager) ReleaseReplicaSet(replicaSet *extensions.ReplicaSet) error {
glog.V(2).Infof("patching pod %s_%s to remove its controllerRef to %s/%s:%s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/pod/replicaset

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

err := m.rsControl.PatchReplicaSet(replicaSet.Namespace, replicaSet.Name, []byte(deleteOwnerRefPatch))
if err != nil {
if errors.IsNotFound(err) {
// If the ReplicaSet no longer exists, ignore it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove the extra space

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// match, which means the ReplicaSet is deleted and then recreated.
// In both cases, the error can be ignored.

// TODO: If the ReplicaSet has owner references, but none of them
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need to copy the TODO

@@ -56,9 +59,14 @@ const (
MaxRetries = 5
)

func getDeploymentControllerKind() unversioned.GroupVersionKind {
return v1beta1.SchemeGroupVersion.WithKind("DeploymentController")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, should be "Deployment".

@@ -84,10 +92,13 @@ type DeploymentController struct {

// Deployments that need to be synced
queue workqueue.RateLimitingInterface

// garbageCollectorEnabled denotes if the garbage collector is enabled.
garbageCollectorEnabled bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think garbageCollectorEnabled is needed anymore. If GC is disabled, adding controllerRef won't break the system, and it still has the benefit of preventing multiple controllers contend for a controllee.

I added garbageCollectorEnabled in RS controller because GC was still alpha (now it's beta) at that time, and we wanted to have a safety button to disable every GC related features.

cm := controller.NewReplicaSetControllerRefManager(dc.rsControl, deployment.ObjectMeta, deploymentSelector, getDeploymentControllerKind())
matchesAndControlled, matchesNeedsController, controlledDoesNotMatch := cm.Classify(rsList)
for _, replicaSet := range matchesNeedsController {
err := cm.AdoptReplicaSet(replicaSet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should be readonly, but @krmayankk could you double check? If it's readonly, could you add a comment here?

return utilerrors.NewAggregate(errlist)
}

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know deployement controller well enough. If the number of controlled replicaset is greater than the max rollback history, will another routine try to delete the surplus? @janetkuo @Kargakis

@caesarxuchao
Copy link
Member

Thanks @krmayankk! Generally looks good. OwnerReferences affects garbage collection behavior (user doc), so could you check these:

@0xmichalis
Copy link
Contributor

I don't think garbageCollectorEnabled is needed anymore. If GC is disabled, adding controllerRef won't break the system, and it still has the benefit of preventing multiple controllers contend for a controllee.

+1. We need to use owner references in a couple of places apart from garbage collecction.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 29, 2016
@krmayankk krmayankk closed this Nov 1, 2016
@krmayankk krmayankk deleted the contr-ref branch November 1, 2016 06:52
@krmayankk krmayankk restored the contr-ref branch November 1, 2016 06:53
@krmayankk krmayankk reopened this Nov 1, 2016
@krmayankk
Copy link
Author

@caesarxuchao @Kargakis the second e2e test i added is failing. i am debugging this. If any of you have some pointers let me know. It seems to always timing out the Poll.

@caesarxuchao
Copy link
Member

Is this flaky or does it always fail? Can you print out the deployment as well? This could be a bug in the garbage collector itself.

@krmayankk
Copy link
Author

@caesarxuchao its always failing. I have added debug infor for deployment as well. hoping to see some clue there.

@k8s-ci-robot
Copy link
Contributor

Jenkins GCE etcd3 e2e failed for commit 363caa8b19c5cd416e0f570b1f66658b52b4077f. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce etcd3 e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins GCI GCE e2e failed for commit 363caa8b19c5cd416e0f570b1f66658b52b4077f. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@caesarxuchao
Copy link
Member

@k8s-bot cvm gce e2e test this

@k8s-ci-robot
Copy link
Contributor

Jenkins GCE e2e failed for commit 363caa8b19c5cd416e0f570b1f66658b52b4077f. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@krmayankk krmayankk force-pushed the contr-ref branch 2 times, most recently from 64c5741 to ceed344 Compare December 22, 2016 23:05
@k8s-ci-robot
Copy link
Contributor

Jenkins Bazel Build failed for commit ceed344. Full PR test history.

The magic incantation to run this job again is @k8s-bot bazel test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

Copy link
Member

@caesarxuchao caesarxuchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, otherwise lgtm. Hopefully we can get this lgtm'ed today!

},
}
}

// verifyRemainingDeploymentsAndReplicaSets verifies if the number of the remaining deployment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you realign the newlines?

return false, fmt.Errorf("Failed to list rs: %v", err)
}

var ret = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit^2: move this to the start of the function, remove the newline, so that the code looks symmetric.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if err != nil {
return false, fmt.Errorf("Failed to list rs: %v", err)
}
// We know there is at least one RS
Copy link
Member

@caesarxuchao caesarxuchao Dec 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make a for loop over rs.Items? and fail if any RS's ownerRefs are not cleaned up.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

err = wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
ret, err := verifyRemainingDeploymentsAndReplicaSets(f, clientSet, deployment, 0, 1)
if err != nil {
return ret, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the convention is return false, err.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if err != nil {
return ret, err
}
if ret {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this entire block out of the Poll()? Deployment shouldn't be deleted if the RS.ownerRefs are not cleaned up yet.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@caesarxuchao
Copy link
Member

This PR lgtm after the nits are addressed. I'll be OOO for the next couple of weeks. @Kargakis could you give the lgtm if I don't get the chance? Thank you all for getting this done!

@caesarxuchao
Copy link
Member

LGTM. Thank you!

@caesarxuchao caesarxuchao added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 23, 2016
@k8s-ci-robot
Copy link
Contributor

Jenkins GKE smoke e2e failed for commit 7779776. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 39059, 39175, 35676, 38655)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ReplicaSet is missing an owner ref of the Deployment that created it
9 participants