Fix #61363, Bounded retries for cloud allocator. #61375

Merged

Conversation

satyasm
Contributor

@satyasm satyasm commented Mar 20, 2018

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #61363

Special notes for your reviewer:
Changed the tracking of nodesInProcessing from a set to a map[string]int so that we can count the
number of times we re-process a node, and stop re-queueing it once updateMaxRetries is exceeded.
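
For readers following along, here is a minimal sketch of the set-to-map change described above. It is not the allocator's actual code: `retryTracker`, `insert`, `shouldRequeue`, and `remove` are hypothetical names, and the value 10 for `updateMaxRetries` is taken from the release note.

```go
package main

import (
	"fmt"
	"sync"
)

// updateMaxRetries is the bound named in the PR; 10 is assumed from the release note.
const updateMaxRetries = 10

// retryTracker is a hypothetical stand-in for the allocator's bookkeeping.
type retryTracker struct {
	lock              sync.Mutex
	nodesInProcessing map[string]int // node name -> retries used so far
}

func newRetryTracker() *retryTracker {
	return &retryTracker{nodesInProcessing: map[string]int{}}
}

// insert marks a node as being processed. It returns false if the node
// is already tracked, which is what the original set was used for.
func (t *retryTracker) insert(nodeName string) bool {
	t.lock.Lock()
	defer t.lock.Unlock()
	if _, found := t.nodesInProcessing[nodeName]; found {
		return false
	}
	t.nodesInProcessing[nodeName] = 0
	return true
}

// shouldRequeue counts one more retry for the node and reports whether
// the node is still within the retry budget.
func (t *retryTracker) shouldRequeue(nodeName string) bool {
	t.lock.Lock()
	defer t.lock.Unlock()
	count, found := t.nodesInProcessing[nodeName]
	if !found {
		return false
	}
	count++
	if count > updateMaxRetries {
		return false
	}
	t.nodesInProcessing[nodeName] = count
	return true
}

// remove stops tracking the node.
func (t *retryTracker) remove(nodeName string) {
	t.lock.Lock()
	defer t.lock.Unlock()
	delete(t.nodesInProcessing, nodeName)
}

func main() {
	t := newRetryTracker()
	t.insert("node-a")
	requeues := 0
	for t.shouldRequeue("node-a") {
		requeues++
	}
	t.remove("node-a")
	fmt.Println("re-queued", requeues, "times before giving up")
}
```

The map keeps the "is this node in flight?" check that the set provided, and the integer value adds the per-node retry count.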

Release note:

Bound cloud allocator to 10 retries with 100 ms delay between retries.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 20, 2018
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 20, 2018
@satyasm
Contributor Author

satyasm commented Mar 20, 2018

/assign @bowei
/cc @nicksardo, @shyamjvs

@satyasm
Contributor Author

satyasm commented Mar 20, 2018

@bowei added the check for errors.IsNotFound to skip processing.

@nicksardo
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 20, 2018
@satyasm
Contributor Author

satyasm commented Mar 20, 2018

/retest

1 similar comment
@satyasm
Contributor Author

satyasm commented Mar 20, 2018

/retest

@bowei
Member

bowei commented Mar 22, 2018

Is there a repro and test for the issue that was hit? (How do we know this fixed the problem?)

@satyasm
Contributor Author

satyasm commented Mar 22, 2018

A unit test was added that simulates the problem. Without the bounded retry, the unit test hangs indefinitely and fails with a timeout. Thanks.

@bowei
Member

bowei commented Mar 23, 2018

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 23, 2018
@@ -67,7 +67,7 @@ type cloudCIDRAllocator struct {

 	// Keep a set of nodes that are currently being processed to avoid races in CIDR allocation
 	lock sync.Mutex
-	nodesInProcessing sets.String
+	nodesInProcessing map[string]int
Member

need to document

@@ -67,7 +68,7 @@ type cloudCIDRAllocator struct {

 	// Keep a set of nodes that are currently being processed to avoid races in CIDR allocation
 	lock sync.Mutex
-	nodesInProcessing sets.String
+	nodesInProcessing map[string]int // track number of retries per node
Member

let's make this

type nodeProcessingInfo struct {
  retries int
}
map[string]nodeProcessingInfo

so it's self documenting.
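
A short sketch of the suggestion above, for illustration only. One Go detail matters here: fields of a struct stored by value in a map cannot be assigned in place, so the map either holds pointers or the entry is copied, modified, and stored back. The helper names `bumpByPointer` and `bumpByValue` are hypothetical.

```go
package main

import "fmt"

// nodeProcessingInfo names the per-node state, so the map's value type
// documents itself and can grow more fields later.
type nodeProcessingInfo struct {
	retries int
}

// bumpByPointer increments the retry count in place through a pointer value.
func bumpByPointer(m map[string]*nodeProcessingInfo, name string) int {
	info, ok := m[name]
	if !ok {
		info = &nodeProcessingInfo{}
		m[name] = info
	}
	info.retries++
	return info.retries
}

// bumpByValue does the same with struct values: m[name].retries++ would
// not compile, so the entry is copied, modified, and stored back.
func bumpByValue(m map[string]nodeProcessingInfo, name string) int {
	info := m[name] // zero value if the node is not tracked yet
	info.retries++
	m[name] = info
	return info.retries
}

func main() {
	byPointer := map[string]*nodeProcessingInfo{}
	byValue := map[string]nodeProcessingInfo{}
	bumpByPointer(byPointer, "node-a")
	bumpByPointer(byPointer, "node-a")
	bumpByValue(byValue, "node-b")
	fmt.Println(byPointer["node-a"].retries, byValue["node-b"].retries)
}
```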

return true
}

func (ca *cloudCIDRAllocator) retryNodeInProcessing(nodeName string) bool {
Member

Rename to canRetry(nodeName string); this doesn't actually do the retry, it's a predicate.

@satyasm
Contributor Author

satyasm commented Mar 23, 2018

good observations. updated with the changes. thanks!

@bowei
Member

bowei commented Mar 23, 2018

/lgtm

will need backport to 1.9, 1.8

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 23, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, satyasm


@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 60455, 61365, 61375, 61597, 61491). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 0ab01d1 into kubernetes:master Mar 26, 2018
k8s-github-robot pushed a commit that referenced this pull request Apr 2, 2018
Automatic merge from submit-queue.

Backport Cloud CIDR allocator fixes to 1.9

**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #61363 

**Special notes for your reviewer**:
Manually doing a backport as cherry-pick does not work due to package renaming. 
See #61375 for the change that triggered the need 
for the backport.

**Release note**:

```release-note
Backport Cloud CIDR allocator fixes to 1.9
```
k8s-github-robot pushed a commit that referenced this pull request Apr 4, 2018
…#61124-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #61375 #61124 upstream release 1.10

**What this PR does / why we need it**:
Backport of stability fixes on IPAM controller to 1.10.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #61363, Fixes #61124 

**Special notes for your reviewer**:
These are already merged on master, backport to 1.10 branch.

**Release note**:

```release-note
Backport of stability fixes on IPAM controller to 1.10.
```
Successfully merging this pull request may close these issues.

Cloud CIDR Allocator can get into retry loop
5 participants