Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Support for cluster-level resource propagation pause and resume capabilities #5118

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

XiShanYongYe-Chang
Copy link
Member

@XiShanYongYe-Chang XiShanYongYe-Chang commented Jul 2, 2024

What type of PR is this?

/kind design
/kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #1567, #4421, #4688

Special notes for your reviewer:

Other related issues: #4937

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/design Categorizes issue or PR as related to design. label Jul 2, 2024
@karmada-bot karmada-bot added kind/documentation Categorizes issue or PR as related to documentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 2, 2024
@XiShanYongYe-Chang
Copy link
Member Author

/cc @CharlesQQ @a7i @chaunceyjiang @whitewindmills
Hi guys, can you help take a look?

@karmada-bot
Copy link
Collaborator

@XiShanYongYe-Chang: GitHub didn't allow me to request PR reviews from the following users: CharlesQQ.

Note that only karmada-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @CharlesQQ @a7i @chaunceyjiang @whitewindmills
Hi guys, can you help take a look?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter
Copy link

codecov-commenter commented Jul 2, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 28.23%. Comparing base (d4c5890) to head (30c5f03).

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5118   +/-   ##
=======================================
  Coverage   28.23%   28.23%           
=======================================
  Files         632      632           
  Lines       43712    43712           
=======================================
+ Hits        12343    12344    +1     
+ Misses      30467    30466    -1     
  Partials      902      902           
Flag Coverage Δ
unittests 28.23% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@whitewindmills
Copy link
Member

@XiShanYongYe-Chang
it's also releated issue

@whitewindmills
Copy link
Member

/assign

@XiShanYongYe-Chang XiShanYongYe-Chang force-pushed the add-work-pause-proposal branch 2 times, most recently from 6bfc022 to 81b39a1 Compare July 5, 2024 06:41
@karmada-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from xishanyongye-chang. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@XiShanYongYe-Chang
Copy link
Member Author

Hi @CharlesQQ @chaunceyjiang @whitewindmills @RainbowMango , it's ready to review again.

Copy link
Contributor

@Vacant2333 Vacant2333 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!
Thanks for your proposal!

In Solution One, the SchedulePause of ResourceBinding is propagated through PropagationPolicy. If I create a PropagationPolicy with a PauseStrategy set to nil by default, is it feasible to only update the SchedulePause of ResourceBinding to true via a webhook when the karmada detector creates the ResourceBinding? If this is feasible, does "propagation" mean that the associated ResourceBinding/Work is only updated when the PropagationPolicy updates the PauseStrategy field?

I also have a suggestion: can this issue be considered as a user story? In brief, it involves pausing the ResourceBinding at the time of creation through a webhook, and then using a custom controller to determine when certain ResourceBindings can be scheduled.

#4919

@Monokaix
Copy link

Monokaix commented Jul 8, 2024

Hi! Thanks for your proposal!

In Solution One, the SchedulePause of ResourceBinding is propagated through PropagationPolicy. If I create a PropagationPolicy with a PauseStrategy set to nil by default, is it feasible to only update the SchedulePause of ResourceBinding to true via a webhook when the karmada detector creates the ResourceBinding? If this is feasible, does "propagation" mean that the associated ResourceBinding/Work is only updated when the PropagationPolicy updates the PauseStrategy field?

I also have a suggestion: can this issue be considered as a user story? In brief, it involves pausing the ResourceBinding at the time of creation through a webhook, and then using a custom controller to determine when certain ResourceBindings can be scheduled.

#4919

+1
What if set rb's schedulePause to true directly by webhook? this case will not get propagation involved, seems it's an independent case and is not like the work pause which is inherited from pp.

@XiShanYongYe-Chang
Copy link
Member Author

is it feasible to only update the SchedulePause of ResourceBinding to true via a webhook when the karmada detector creates the ResourceBinding?

For solution 1, the source of user control is in the PropagationPolicy. If it is modified through webhook, it will be inconsistent with the declaration of PropagationPolicy.

I think solution 2 might be better for your scenario.

@XiShanYongYe-Chang
Copy link
Member Author

In brief, it involves pausing the ResourceBinding at the time of creation through a webhook, and then using a custom controller to determine when certain ResourceBindings can be scheduled.

I understand that this is a specific implementation, not a user case. In fact, the two actions you point out can be done by controlling the pause of RB. Karmada is not aware of when you pause and resume.

That's why I pointed out that solution 2 is more suitable for your use case.

@Monokaix
Copy link

Monokaix commented Jul 8, 2024

When the user sets pause at the pp or cr level, other components such as wenhook may also set pause/resume at the rb or work level. There will be a conflict here. We should explain this limitation: )

@XiShanYongYe-Chang
Copy link
Member Author

Yes, if you need to set it up via a webhook, the user needs to be aware of the restrictions.

@CharlesQQ
Copy link
Contributor

In fact, what I am concerned about is whether the new work object has been modified by the latest overridepolicy after resumption of propagation. This must be ensured.

@XiShanYongYe-Chang
Copy link
Member Author

In fact, what I am concerned about is whether the new work object has been modified by the latest overridepolicy after resumption of propagation. This must be ensured.

Good question! What you say should be the definitive expected behavior, because when you change the op, work is updated, and when work is resumed, it is updated with the latest spec.

@RainbowMango
Copy link
Member

RainbowMango commented Jul 13, 2024

I would prefer to introduce this functionality in PropagationPolicy as it is more convenient for users and straightforward than introducing another API.

Also proposing the API:

// PropagationSpec represents the desired behavior of PropagationPolicy.
type PropagationSpec struct {
	// Suspension declares the policy for suspending different aspects of propagation.
	// nil means no suspension. no default values.
	// +optional
	Suspension *Suspension `json:"suspension,omitempty"`
}

// Suspension defines the policy for suspending different aspects of propagation.
type Suspension struct {
	// Scheduling controls whether scheduling should be suspended.
	// +optional
	// Note: Postpone this until there is a solid use case.
	// Scheduling *bool `json:"scheduling,omitempty"`

	// Dispatching controls whether dispatching should be suspended.
	// nil means not suspend, no default value, only accepts 'true'.
	// Note: true means stop propagating to all clusters. Can not co-exist
	// with DispatchingOnClusters which is used to suspend particular clusters.
	// +optional
	Dispatching *bool `json:"dispatching,omitempty"`

	// DispatchingOnClusters declares a list of clusters to which the dispatching
	// should be suspended.
	// Note: Can not co-exist with Dispatching which is used to suspend all.
	// +optional
	DispatchingOnClusters *SuspendClusters `json:"dispatchingOnClusters,omitempty"`
}

// SuspendClusters represents a group of clusters that should be suspended from propagating.
// Note: No plan to introduce the label selector or field selector to select clusters yet, as it
// would make the system unpredictable.
type SuspendClusters struct {
	// ClusterNames is the list of clusters to be selected.
	// +optional
	ClusterNames []string `json:"clusterNames,omitempty"`
}

Note: ResourceBinding and Work also need similar field(s).

@Monokaix
Copy link

Another question here, after scheduling or distribution paused, will there be any problems with resource occupation calculation? For example, will resources be pre-occupied after the work is paused? If resources are occupied and multiple workare resumed at the same time, will resource calculation errors be caused?

@XiShanYongYe-Chang
Copy link
Member Author

Another question here, after scheduling or distribution paused, will there be any problems with resource occupation calculation? For example, will resources be pre-occupied after the work is paused? If resources are occupied and multiple workare resumed at the same time, will resource calculation errors be caused?

For scheduling paused, resources will not be pre-occupied. For distribution pause, I think it will.

…ilities

Signed-off-by: changzhen <changzhen5@huawei.com>
@XiShanYongYe-Chang
Copy link
Member Author

/retest

@XiShanYongYe-Chang
Copy link
Member Author

Hi guys, at the moment, I'm inclined towards Plan One, as it is more straightforward and convenient for users, without the need for learning new APIs. If we introduce a new API to describe the pause strategy for resources, it might seem a bit limited. Later on, if we provide a Rollout API, we can then focus on thinking about adding new APIs.

How do you think? @CharlesQQ @chaunceyjiang @whitewindmills @RainbowMango @a7i @Monokaix @Vacant2333

@RainbowMango
Copy link
Member

I vote for solution one and sent the draft API on #5118 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/design Categorizes issue or PR as related to design. kind/documentation Categorizes issue or PR as related to documentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to Support dynamically adding overriders
10 participants