
Add support for PodDisruptionBudget #308

Merged: 2 commits into 3scale:master from INTLY-4681-add-pdb on Jan 29, 2020

Conversation

@wei-lee (Contributor) commented Jan 22, 2020

WHAT

This PR adds support for PodDisruptionBudget to the 3scale operator. See JIRA.

HOW

A new feature flag is introduced to the APIManager CRD to allow users to enable this feature. An example can be found here.
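For illustration only, here is a minimal sketch of how such a flag might look in the APIManager Go API types. The field and type names below are assumptions for illustration; the PR defines the actual ones.

// Illustrative sketch only; the PR defines the actual field and type names.
type APIManagerSpec struct {
	// ... existing APIManager fields ...

	// PodDisruptionBudget, when Enabled is true, makes the operator create
	// a PodDisruptionBudget for each scalable 3scale component.
	PodDisruptionBudget *PodDisruptionBudgetSpec `json:"podDisruptionBudget,omitempty"`
}

type PodDisruptionBudgetSpec struct {
	Enabled bool `json:"enabled,omitempty"`
}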

When enabled, the following PDBs are created:

  • apicast-production
  • apicast-staging
  • backend-cron
  • backend-listener
  • backend-worker
  • system-app
  • system-sidekiq
  • zync
  • zync-que

All PDBs have the MaxUnavailable field set to 1. This ensures that when a node is being drained, only 1 pod per component is allowed to be evicted at a time. We do not update/delete these PDBs even if only 1 instance of a component is deployed, as that does not violate the PDB rules. All the components can still be scaled up and down as normal: scaling counts against the PDB rules, but PDBs do not stop replica controllers from scaling pods down.
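For reference, here is a minimal sketch in Go of how one of these PDBs could be built with the policy/v1beta1 API that was current at the time. The helper name and the deploymentConfig selector label are assumptions for illustration, not necessarily the PR's exact code.

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newComponentPDB is an illustrative helper: it builds a PodDisruptionBudget
// that selects a component's pods and allows at most one of them to be
// unavailable due to voluntary disruptions such as a node drain.
func newComponentPDB(component, namespace string) *policyv1beta1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      component, // e.g. "apicast-production"
			Namespace: namespace,
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			// Assumption: the 3scale pods carry a "deploymentConfig" label
			// whose value is the component name.
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"deploymentConfig": component},
			},
			MaxUnavailable: &maxUnavailable,
		},
	}
}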

Verification

To verify:

  1. Follow the instructions here to build the operator and deploy it to an OpenShift 4 cluster. You can use this image quay.io/wei_lee/3scale-operator:test.
  2. Create a new APIManager CR using this example.
  3. Wait for the deployments of all the components to start.
  4. List all the PDBs in the namespace; you should see 9 of them. You can also check their contents to make sure they have the right spec (a programmatic version of this check is sketched after this list).
  5. Try scaling down some of the components; you should still be able to do so.
  6. Delete the APIManager CR that was created and make sure all the PDBs are deleted as well.
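For step 4, oc get pdb in the namespace is enough; alternatively, here is a minimal, self-contained sketch of the same check done programmatically. It assumes a pre-0.18 client-go (matching the operator-sdk 0.8.0 toolchain, where List takes no context argument), and the namespace name "3scale" is a placeholder for the one used in step 2.

package main

import (
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes KUBECONFIG points at the cluster running the APIManager.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	pdbs, err := client.PolicyV1beta1().PodDisruptionBudgets("3scale").List(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("found %d PDBs (expected 9)\n", len(pdbs.Items))
	for _, pdb := range pdbs.Items {
		if pdb.Spec.MaxUnavailable != nil {
			fmt.Printf("  %s: maxUnavailable=%s\n", pdb.Name, pdb.Spec.MaxUnavailable.String())
		}
	}
}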

@wei-lee requested a review from eguzki on January 22, 2020 10:53
@wei-lee changed the title from "Intly 4681 add pdb" to "Add support for PodDisruptionBudget" on Jan 22, 2020
@wei-lee force-pushed the INTLY-4681-add-pdb branch from 056e6a9 to d2f46b6 on January 27, 2020 09:51
@miguelsorianod (Contributor) commented:

Hi @wei-lee,

Thank you for the contribution.

I've been trying the PodDisruptionBudget, and the behaviour as I understand it is the following:

  • As you commented, when the feature is activated this PR ensures that, when draining a node, at most one pod at a time will be evicted from the node for each component. It is important to note that if the node to drain has only one pod for a component, the PodDisruptionBudget still complies with the rules, as MaxUnavailable is set to 1. This means that this functionality does not prevent a node from being left without a pod of a component; what it does is make the drain move all of the node's pods to another node one component pod at a time. That's the intent of this PR, right? Not preventing a node from being left without a pod of a component.
  • As you also commented, the PodDisruptionBudget stats are taken into account when scaling a DeploymentConfig/Deployment up/down or performing some kind of rolling update (for example, when you change the PodTemplate spec and this triggers a new deployment), but they are not enforced in that case, so we should be able to make changes to the Deployment and scale up/down without problems even if the PodDisruptionBudget stats indicate that it shouldn't be possible to evict.

As I understand it, apart from enabling the PodDisruptionBudget field, to really use this functionality the number of replicas of the different components has to be configured to more than one. Otherwise downtime is not prevented.

We are currently reviewing the code and we'll get back to you with feedback about it.

Thank you.

@wei-lee (Contributor, Author) commented Jan 27, 2020

@miguelsorianod

I've been trying the PodDisruptionBudget, and the behaviour as I understand it is the following:

  • As you commented, when the feature is activated this PR ensures that, when draining a node, at most one pod at a time will be evicted from the node for each component. It is important to note that if the node to drain has only one pod for a component, the PodDisruptionBudget still complies with the rules, as MaxUnavailable is set to 1. This means that this functionality does not prevent a node from being left without a pod of a component; what it does is make the drain move all of the node's pods to another node one component pod at a time. That's the intent of this PR, right? Not preventing a node from being left without a pod of a component.

Yes, that's right.

  • As you also commented, the PodDisruptionBudget stats are taken into account when scaling a DeploymentConfig/Deployment up/down or performing some kind of rolling update (for example, when you change the PodTemplate spec and this triggers a new deployment), but they are not enforced in that case, so we should be able to make changes to the Deployment and scale up/down without problems even if the PodDisruptionBudget stats indicate that it shouldn't be possible to evict.

That's right too.

As I understand it, apart from enabling the PodDisruptionBudget field, to really use this functionality the number of replicas of the different components has to be configured to more than one. Otherwise downtime is not prevented.

Yes, right again.


Thank you for reviewing!

@eguzki (Member) commented Jan 27, 2020

Good Job 🏅

The PR introduces lots of changes. Reviewing will continue.

@eguzki (Member) commented Jan 27, 2020

Feel free to ask anything about the code structure. Happy to answer.

@eguzki (Member) commented Jan 28, 2020

No PDB has been created for system-sphinx deploymentconfig. Any reason not to do it or just missed it?

@wei-lee (Contributor, Author) commented Jan 28, 2020

No PDB has been created for system-sphinx deploymentconfig. Any reason not to do it or just missed it?

@eguzki By the look of this code, it seems system-sphinx is not scalable. That's why I didn't create a PDB for it, as it wouldn't make any difference. Should it be scalable too?

@eguzki (Member) commented Jan 28, 2020

@eguzki By the look of this code, it seems system-sphinx is not scalable. That's why I didn't create a PDB for it, as it wouldn't make any difference. Should it be scalable too?

@wei-lee Fair enough. To be honest we do not even know if it is scalable. My bet is that it is actually scalable. Furthermore, the operator creates a service for it.

For this PR, we are not creating a PDB for system-sphinx, but we have created an action point to ask the system team about it and, in case it is scalable, include a PDB for it in some other PR.

@eguzki (Member) commented Jan 28, 2020

Looking so good, congrats @wei-lee 🏅

@wei-lee (Contributor, Author) commented Jan 28, 2020

@miguelsorianod @eguzki Thanks for the review, I really appreciate it. I think I have addressed all your comments now; do you mind taking another look and letting me know if there is anything else you want changed?

Thanks a lot!

@wei-lee (Contributor, Author) commented Jan 28, 2020

Forgot to mention that I have also deployed the 3scale operator to an OpenShift cluster and it works fine.

@miguelsorianod (Contributor) commented:

Functionality-wise it looks good to me 👍. I've also deployed it in my local cluster, tried enabling and disabling PodDisruptionBudget, and it works correctly too 👍

@miguelsorianod (Contributor) left a review comment:

I think this can almost be merged. The only things pending are the go.sum changes and the generated openapi changes that we left you feedback about.

Also, before merging the PR, can you clean up the commit history?
We usually have one or more commits for the code functionality (separated into "logical" commits) and a separate commit with the generated templates, which we usually name "generator: update templates".

Thank you.

@wei-lee force-pushed the INTLY-4681-add-pdb branch 2 times, most recently from 4e56164 to c7824fc on January 28, 2020 17:14
@wei-lee (Contributor, Author) commented Jan 28, 2020

@miguelsorianod I have now reverted changes to the go.sum file and cleaned up the commit history.

I also regenerated the files using version 0.8.0 of the operator-sdk, by running operator-sdk generate k8s and operator-sdk generate openapi. However, the generated files are still different from master and the crd tests are failing. Do you have any idea where I went wrong? Am I using the wrong version of the operator-sdk?

@miguelsorianod (Contributor) commented:

Hi @wei-lee,

The changes in the pkg/apis/apps/v1alpha1/zz_generated.deepcopy.go and pkg/apis/apps/v1alpha1/zz_generated.openapi.go files are expected and correct now: they are due to the addition of the PodDisruptionBudget changes, so it's ok in that regard 👍.

About the deploy/crds/capabilities* files, please exclude those changes from the commits. Those changes happen because some fields in the CRDs are not generated by the operator-sdk's openapi generator tool, but we wanted to include them to offer security at deploy time.
The same happens with some of the fields in 'deploy/crds/apps_v1alpha1_apimanager_crd.yaml'. From there we only want the changes related to the PodDisruptionBudget addition; please do not add the changes that remove some parts of the configurationSecretRef and awsSecretCredentials fields. We know this is unfortunate, but the openapi tool does not seem to include those fields and we do not know how to make it include them (at least in the version used in operator-sdk 0.8.0).

@wei-lee (Contributor, Author) commented Jan 28, 2020

@miguelsorianod understood. Thanks for the detailed explanation. I will make the changes you suggested.

@wei-lee force-pushed the INTLY-4681-add-pdb branch 3 times, most recently from 9ce30d3 to e8a4557 on January 28, 2020 18:56
@wei-lee (Contributor, Author) commented Jan 28, 2020

@miguelsorianod ok, all issues are addressed now. Please let me know if there is anything else you need.

Thanks for the help! 👍

@miguelsorianod (Contributor) commented:

Hi @wei-lee,

The final state of the changes looks good now 👍

Regarding the commit history, I think the commits do not correspond with the changes. In the update-templates commit I do not see the amp-ha template changes, which should be the only file in that commit.
I also see, included in the commits, some changes related to go.sum and some of the changes to the generated deepcopy files that we talked about; those shouldn't be added in any commit. Can you fix that? Thank you.

@wei-lee (Contributor, Author) commented Jan 29, 2020

@miguelsorianod is it ok if I just merge the 2 commits into 1? The commits are already squashed, so it will be tricky to reorganise them again.

@eguzki (Member) commented Jan 29, 2020

@wei-lee this is one of my favorite git tricks.

Merge both commits into one, and then split it as you wish.

Split commit in two

https://emmanuelbernard.com/blog/2014/04/14/split-a-commit-in-two-with-git/

git rebase -i <oldsha1>~1
# mark the expected commit as `edit` (replace pick in front of the line), save and close
git reset HEAD^          # undo the commit, keeping its changes unstaged
git add ...              # stage the files that belong to the first commit
git commit -m "First part"
git add ...              # stage the remaining files
git commit -m "Second part"
git rebase --continue    # finish the rebase

@wei-lee (Contributor, Author) commented Jan 29, 2020

@eguzki thanks for the trick. That's really useful. Will try that now.

@wei-lee force-pushed the INTLY-4681-add-pdb branch from e8a4557 to 1e10de5 on January 29, 2020 11:53
@codeclimate bot commented Jan 29, 2020

Code Climate has analyzed commit 1e10de5 and detected 46 issues on this pull request.

Here's the issue category breakdown:

  • Complexity: 5
  • Duplication: 17
  • Style: 24

View more on Code Climate.

@wei-lee (Contributor, Author) commented Jan 29, 2020

@miguelsorianod thanks again @eguzki for the trick, it was easier than I thought. It should all be ok now.

@miguelsorianod (Contributor) left a review comment:

It looks good to me now, and in my view it can be merged. Good job 👍

Thank you very much, we appreciate your contribution and work on this @wei-lee.

@miguelsorianod merged commit 0e2f40d into 3scale:master Jan 29, 2020