[Proposal] Support for cluster-level resource propagation pause and resume capabilities #5118

XiShanYongYe-Chang · 2024-07-02T03:15:07Z

What type of PR is this?

/kind design
/kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #1567, #4421, #4688

Special notes for your reviewer:

Other related issues: #4937

Does this PR introduce a user-facing change?:

NONE

XiShanYongYe-Chang · 2024-07-02T03:23:28Z

/cc @CharlesQQ @a7i @chaunceyjiang @whitewindmills
Hi guys, can you help take a look?

karmada-bot · 2024-07-02T03:23:34Z

@XiShanYongYe-Chang: GitHub didn't allow me to request PR reviews from the following users: CharlesQQ.

Note that only karmada-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @CharlesQQ @a7i @chaunceyjiang @whitewindmills
Hi guys, can you help take a look?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

codecov-commenter · 2024-07-02T03:26:54Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 28.23%. Comparing base (d4c5890) to head (30c5f03).

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5118   +/-   ##
=======================================
  Coverage   28.23%   28.23%           
=======================================
  Files         632      632           
  Lines       43712    43712           
=======================================
+ Hits        12343    12344    +1     
+ Misses      30467    30466    -1     
  Partials      902      902

Flag	Coverage Δ
unittests	`28.23% <ø> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

whitewindmills · 2024-07-02T03:38:30Z

@XiShanYongYe-Chang
it's also releated issue

whitewindmills · 2024-07-02T03:38:58Z

/assign

docs/proposals/work-pause/README.md

karmada-bot · 2024-07-05T06:41:18Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from xishanyongye-chang. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

docs/proposals/work-pause/README.md

XiShanYongYe-Chang · 2024-07-05T08:15:30Z

Hi @CharlesQQ @chaunceyjiang @whitewindmills @RainbowMango , it's ready to review again.

docs/proposals/work-pause/README.md

Vacant2333

Hi!
Thanks for your proposal!

In Solution One, the SchedulePause of ResourceBinding is propagated through PropagationPolicy. If I create a PropagationPolicy with a PauseStrategy set to nil by default, is it feasible to only update the SchedulePause of ResourceBinding to true via a webhook when the karmada detector creates the ResourceBinding? If this is feasible, does "propagation" mean that the associated ResourceBinding/Work is only updated when the PropagationPolicy updates the PauseStrategy field?

I also have a suggestion: can this issue be considered as a user story? In brief, it involves pausing the ResourceBinding at the time of creation through a webhook, and then using a custom controller to determine when certain ResourceBindings can be scheduled.

#4919

docs/proposals/work-pause/README.md

Monokaix · 2024-07-08T07:18:12Z

Hi! Thanks for your proposal!

In Solution One, the SchedulePause of ResourceBinding is propagated through PropagationPolicy. If I create a PropagationPolicy with a PauseStrategy set to nil by default, is it feasible to only update the SchedulePause of ResourceBinding to true via a webhook when the karmada detector creates the ResourceBinding? If this is feasible, does "propagation" mean that the associated ResourceBinding/Work is only updated when the PropagationPolicy updates the PauseStrategy field?

I also have a suggestion: can this issue be considered as a user story? In brief, it involves pausing the ResourceBinding at the time of creation through a webhook, and then using a custom controller to determine when certain ResourceBindings can be scheduled.

#4919

+1
What if set rb's schedulePause to true directly by webhook? this case will not get propagation involved, seems it's an independent case and is not like the work pause which is inherited from pp.

XiShanYongYe-Chang · 2024-07-08T07:29:03Z

is it feasible to only update the SchedulePause of ResourceBinding to true via a webhook when the karmada detector creates the ResourceBinding?

For solution 1, the source of user control is in the PropagationPolicy. If it is modified through webhook, it will be inconsistent with the declaration of PropagationPolicy.

I think solution 2 might be better for your scenario.

XiShanYongYe-Chang · 2024-07-08T07:33:43Z

In brief, it involves pausing the ResourceBinding at the time of creation through a webhook, and then using a custom controller to determine when certain ResourceBindings can be scheduled.

I understand that this is a specific implementation, not a user case. In fact, the two actions you point out can be done by controlling the pause of RB. Karmada is not aware of when you pause and resume.

That's why I pointed out that solution 2 is more suitable for your use case.

Monokaix · 2024-07-08T08:22:44Z

When the user sets pause at the pp or cr level, other components such as wenhook may also set pause/resume at the rb or work level. There will be a conflict here. We should explain this limitation: )

XiShanYongYe-Chang · 2024-07-08T08:29:40Z

Yes, if you need to set it up via a webhook, the user needs to be aware of the restrictions.

docs/proposals/work-pause/README.md

CharlesQQ · 2024-07-10T15:10:19Z

In fact, what I am concerned about is whether the new work object has been modified by the latest overridepolicy after resumption of propagation. This must be ensured.

XiShanYongYe-Chang · 2024-07-11T01:34:17Z

In fact, what I am concerned about is whether the new work object has been modified by the latest overridepolicy after resumption of propagation. This must be ensured.

Good question! What you say should be the definitive expected behavior, because when you change the op, work is updated, and when work is resumed, it is updated with the latest spec.

docs/proposals/work-pause/README.md

RainbowMango · 2024-07-13T11:29:38Z

I would prefer to introduce this functionality in PropagationPolicy as it is more convenient for users and straightforward than introducing another API.

Also proposing the API:

// PropagationSpec represents the desired behavior of PropagationPolicy.
type PropagationSpec struct {
	// Suspension declares the policy for suspending different aspects of propagation.
	// nil means no suspension. no default values.
	// +optional
	Suspension *Suspension `json:"suspension,omitempty"`
}

// Suspension defines the policy for suspending different aspects of propagation.
type Suspension struct {
	// Scheduling controls whether scheduling should be suspended.
	// +optional
	// Note: Postpone this until there is a solid use case.
	// Scheduling *bool `json:"scheduling,omitempty"`

	// Dispatching controls whether dispatching should be suspended.
	// nil means not suspend, no default value, only accepts 'true'.
	// Note: true means stop propagating to all clusters. Can not co-exist
	// with DispatchingOnClusters which is used to suspend particular clusters.
	// +optional
	Dispatching *bool `json:"dispatching,omitempty"`

	// DispatchingOnClusters declares a list of clusters to which the dispatching
	// should be suspended.
	// Note: Can not co-exist with Dispatching which is used to suspend all.
	// +optional
	DispatchingOnClusters *SuspendClusters `json:"dispatchingOnClusters,omitempty"`
}

// SuspendClusters represents a group of clusters that should be suspended from propagating.
// Note: No plan to introduce the label selector or field selector to select clusters yet, as it
// would make the system unpredictable.
type SuspendClusters struct {
	// ClusterNames is the list of clusters to be selected.
	// +optional
	ClusterNames []string `json:"clusterNames,omitempty"`
}

Note: ResourceBinding and Work also need similar field(s).

Monokaix · 2024-07-15T02:07:14Z

Another question here, after scheduling or distribution paused, will there be any problems with resource occupation calculation? For example, will resources be pre-occupied after the work is paused? If resources are occupied and multiple workare resumed at the same time, will resource calculation errors be caused?

XiShanYongYe-Chang · 2024-07-15T07:49:18Z

Another question here, after scheduling or distribution paused, will there be any problems with resource occupation calculation? For example, will resources be pre-occupied after the work is paused? If resources are occupied and multiple workare resumed at the same time, will resource calculation errors be caused?

For scheduling paused, resources will not be pre-occupied. For distribution pause, I think it will.

…ilities Signed-off-by: changzhen <changzhen5@huawei.com>

XiShanYongYe-Chang · 2024-07-15T11:32:00Z

/retest

XiShanYongYe-Chang · 2024-07-16T09:47:04Z

Hi guys, at the moment, I'm inclined towards Plan One, as it is more straightforward and convenient for users, without the need for learning new APIs. If we introduce a new API to describe the pause strategy for resources, it might seem a bit limited. Later on, if we provide a Rollout API, we can then focus on thinking about adding new APIs.

How do you think? @CharlesQQ @chaunceyjiang @whitewindmills @RainbowMango @a7i @Monokaix @Vacant2333

RainbowMango · 2024-07-16T09:59:27Z

I vote for solution one and sent the draft API on #5118 (comment).

karmada-bot added the kind/design Categorizes issue or PR as related to design. label Jul 2, 2024

karmada-bot requested review from Tingtal and whitewindmills July 2, 2024 03:15

karmada-bot added kind/documentation Categorizes issue or PR as related to documentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 2, 2024

This was referenced Jul 2, 2024

Enhance behavioral control of Karmada propagation policies #4559

Open

Ability to suspend work #4688

Open

karmada-bot requested review from a7i and chaunceyjiang July 2, 2024 03:23

karmada-bot assigned whitewindmills Jul 2, 2024

chaunceyjiang reviewed Jul 2, 2024

View reviewed changes