Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stage] UPSTREAM: 49016: PV controller: resync informers manually #16927

Closed
wants to merge 1 commit into from

Conversation

jsafrane
Copy link
Contributor

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1501152

cc @openshift/sig-storage

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 18, 2017
@openshift-merge-robot openshift-merge-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. vendor-update Touching vendor dir or related files labels Oct 18, 2017
@jsafrane
Copy link
Contributor Author

/retest
flake: #16929

@deads2k
Copy link
Contributor

deads2k commented Oct 18, 2017

@jsafrane backport to stage?

@deads2k
Copy link
Contributor

deads2k commented Oct 18, 2017

The resync for other shared informer consumers would be skipped until their requested resync came about, right? Did that feature break somehow?

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2017
@jsafrane jsafrane changed the base branch from master to stage October 18, 2017 13:20
@openshift-merge-robot openshift-merge-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 18, 2017
@jsafrane
Copy link
Contributor Author

rebased to stage

@deads2k, it's fix for this bug: kubernetes/kubernetes#49905 (comment). PV controller may start when informer sync period is already fixed and can't be changed.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 18, 2017

@jsafrane: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/extended_builds 5a6b0b0 link /test extended_builds
ci/openshift-jenkins/extended_conformance_install_update 5a6b0b0 link /test extended_conformance_install_update

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@deads2k
Copy link
Contributor

deads2k commented Oct 18, 2017

@deads2k, it's fix for this bug: kubernetes/kubernetes#49905 (comment). PV controller may start when informer sync period is already fixed and can't be changed.

That sounds like a bug. Who is starting it? We fixed the GC start problem.

@jsafrane
Copy link
Contributor Author

It's started as usual controller in controller-manager, nothing special about it. And someone complained (@LiGgit?) that we should not force all controllers to have 15 second resync period, that's too often. So I added manual resync.

@jsafrane
Copy link
Contributor Author

        "Auto-merging vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go", 
        "CONFLICT (content): Merge conflict in vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go", 

This PR does not touch kuberuntime_manager.go...

/test all

@jsafrane
Copy link
Contributor Author

/test cross
/test extended_builds

@deads2k
Copy link
Contributor

deads2k commented Oct 18, 2017

It's started as usual controller in controller-manager, nothing special about it. And someone complained (@LiGgit?) that we should not force all controllers to have 15 second resync period, that's too often. So I added manual resync.

@liggitt is the skip logic that avoids resyncing everyone broken?

@jsafrane
Copy link
Contributor Author

And someone complained (@LiGgit?) that we should not force all controllers to have 15 second resync period

Found the complaint: kubernetes/kubernetes#48941

@deads2k
Copy link
Contributor

deads2k commented Oct 18, 2017

And someone complained (@LiGgit?) that we should not force all controllers to have 15 second resync period
Found the complaint: kubernetes/kubernetes#48941

@liggitt was that before we ensured startup order and had the opt-in resync?

@liggitt
Copy link
Contributor

liggitt commented Oct 18, 2017

@liggitt was that before we ensured startup order

probably

and had the opt-in resync?

not sure what that is referring to. I still think a 15 second resync, no matter how it is done, is way too short. we have better patterns for retrying failed objects at a shorter interval without resyncing the whole list.

@deads2k
Copy link
Contributor

deads2k commented Oct 18, 2017

not sure what that is referring to. I still think a 15 second resync, no matter how it is done, is way too short. we have better patterns for retrying failed objects at a shorter interval without resyncing the whole list.

No other consumers would see a resync. Only this controller that asked would see it.

I agree it is too short, but this doesn't look better than actually specifying the resync since it doesn't hurt other consumers.

@wongma7
Copy link
Contributor

wongma7 commented Oct 18, 2017

would like to point out it's not just for retrying failed cases, it's a fundamental assumption in the pv controller which is apparently "space shuttle" code and hard to change? The reason for this bug is basically there is a "syncUnboundClaim" but no "syncUnboundVolume": if a volume has just been created, there is no reason to assume there exists a claim looking for it. but if a claim has just been created, obviously the controller should find a volume for it.

another case from today: a Bound volume's pvc is deleted. "syncClaim" will get triggered, but it should not be syncClaim's responsibility to update the Bound volume to Released. that is the job of "syncVolume," which can only be triggered by the periodic resync...

@liggitt
Copy link
Contributor

liggitt commented Oct 18, 2017

it's a fundamental assumption in the pv controller which is apparently "space shuttle" code and hard to change?

requiring resync of all objects to stay responsive on subsets of objects that need reprocessing is an assumption that does not scale and should be redesigned.

@jsafrane
Copy link
Contributor Author

requiring resync of all objects to stay responsive on subsets of objects that need reprocessing is an assumption that does not scale and should be redesigned.

While I agree with this, nobody complained so far that PV controller is too slow. Compare with A/D controller that syncs all pods with attachable volumes every 100 ms.

Redesign is on long-term TODO list, I added a card to our board. https://trello.com/c/ARmicYxn/577-speed-up-pv-controller

@deads2k
Copy link
Contributor

deads2k commented Oct 19, 2017

At any rate, there are now better ways to solve this and you should use them instead of this. However, this is the current state upstream and is better than the bug at the moment.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2017
@openshift-merge-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, jsafrane

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@deads2k
Copy link
Contributor

deads2k commented Oct 19, 2017

@jsafrane you you also open the pull against master so it doesn't regress.

@deads2k
Copy link
Contributor

deads2k commented Oct 19, 2017

@jsafrane holding this until the openshift/origin:master pull is open and labeled.

@deads2k
Copy link
Contributor

deads2k commented Oct 19, 2017

@jsafrane I'm sorry about this, but @eparis found a special-case in online that lets us avoid this problem. repoint to master? still lgtm

@jsafrane jsafrane changed the title UPSTREAM: 49016: PV controller: resync informers manually [stage] UPSTREAM: 49016: PV controller: resync informers manually Oct 20, 2017
@jsafrane
Copy link
Contributor Author

opened PR against master in #16965

openshift-merge-robot added a commit that referenced this pull request Oct 20, 2017
Automatic merge from submit-queue (batch tested with PRs 16667, 16796, 16960, 16965, 16894).

[master] UPSTREAM: 49016: PV controller: resync informers manually

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1501152

This is the same as #16927, just for master instead of stage.

/assign @deads2k
@jsafrane
Copy link
Contributor Author

master counterpart is merged, can we merge this one?

@eparis
Copy link
Member

eparis commented Oct 23, 2017

going to close this one, we'll get it in stage tomorrow night on the next stage rebase.

@eparis eparis closed this Oct 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. vendor-update Touching vendor dir or related files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants