Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve traffic strategy for canary and partition release #219

Merged
merged 1 commit into from
Jul 30, 2024

Conversation

myname4423
Copy link
Contributor

@myname4423 myname4423 commented Jun 12, 2024

Ⅰ. Describe what this PR does

add an additional step before Upgrade step within runCanary function in order to:

  1. check replicas both of integer type and percentage type for partiton-style in case all pods are updated with new version
  2. patch selector to stable Service in case routing traffic to canary pods unintentionally

Ⅱ. Does this pull request fix one issue?

Ⅲ. Special notes for reviews

pkg/trafficrouting/manager.go Show resolved Hide resolved
pkg/controller/rollout/rollout_canary.go Show resolved Hide resolved
pkg/controller/rollout/rollout_canary.go Show resolved Hide resolved
pkg/controller/rollout/rollout_canary.go Outdated Show resolved Hide resolved
pkg/controller/rollout/rollout_canary.go Show resolved Hide resolved
@@ -106,6 +166,12 @@ func (m *canaryReleaseManager) runCanary(c *RolloutContext) error {
return err
} else if done {
canaryStatus.CurrentStepState = v1beta1.CanaryStepStateTrafficRouting
// if it is partition style and the last batch, we can skip the CanaryStepStateTrafficRouting step
// to bypass the bug mentioned above
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz be more specific about the bug, leave the bug url

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/*
The following check is used to solve scenario like this:
steps:
- replicas: 1 # frist batch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

frist -> first

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// if canary svc has been already cleaned up, just return
// even DisableGenerateCanaryService is true, canary svc still exists, because canary service is stable service
if err = m.Get(context.TODO(), client.ObjectKeyFromObject(cService), cService); err != nil {
if err = m.Get(context.TODO(), client.ObjectKeyFromObject(cService), cService); err != nil && stableServiceRestored {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err != nil && !errors.IsNotFound(err), the logic will skip the if clause and start restore stable service, it seems better to just return error and retry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's ok because we will get the stable Service again in the following restoring stable service
whatever, i have refactored this logic, looking forward to your check again, many thanks!

@myname4423 myname4423 force-pushed the traffic2 branch 2 times, most recently from 7e1281a to 7c48f3e Compare July 19, 2024 14:31
To avoid this issue, we restore stable Service before scaling the stable pods down to zero.
This ensures that the backends behind the stable ingress remain active, preventing the bug from being triggered.
*/
expectedReplicas, _ := intstr.GetScaledValueFromIntOrPercent(currentStep.Replicas, int(c.Workload.Replicas), true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems a best effort workaround, if the stable replicas is down and not ready, the stable service will also have zero endpoints thus trigger the bug.

@@ -181,6 +180,13 @@ func (m *Manager) DoTrafficRouting(c *TrafficRoutingContext) (bool, error) {
if serviceModified {
return false, nil
}
} else if c.DisableGenerateCanaryService {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz comment the field CanaryStrategy.DisableGenerateCanaryService and explain in what case it is needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove subset related logic, i will submit a dedicated pr related to this in the future

// only if both conditions are met
if canaryServiceRemoved && stableServiceRestored {
/*
even both of the conditions are met, we call Finalise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even both of the conditions are met

if both of the conditions are met

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic here is bloated and complex, i will simplify it in the coming pr

Signed-off-by: yunbo <yunbo10124scut@gmail.com>
@zmberg
Copy link
Member

zmberg commented Jul 30, 2024

/lgtm
/approve

@kruise-bot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zmberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kruise-bot kruise-bot merged commit 16a3f0a into openkruise:master Jul 30, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants