Duplicate IP addresses at scale: Possible read/write locking problem? #110

Open
dougbtv opened this issue May 25, 2021 · 13 comments

dougbtv commented May 25, 2021

@pperiyasamy has reported that they're seeing duplicate IP addresses on a large k8s cluster, using the k8s backend.

He's pointed out that the k8s backend may not acquire a lock the way the etcd implementation does (the same problem does not appear when using the etcd backend).

https://github.com/k8snetworkplumbingwg/whereabouts/blob/master/pkg/storage/storage.go#L82-L108

Does this need an additional locking mechanism?

How can we test for the duplicate IP problem under this kind of scaled-up concurrency?

@pperiyasamy

/cc @JanScheurich

crandles commented May 25, 2021

The Kubernetes CRD backend does not use locks, but does use what was at the time a recommended alternative:

// create the patch
patch, err := jsonpatch.CreatePatch(origBytes, modBytes)
if err != nil {
    return err
}
// add additional tests to the patch
ops := []jsonpatch.Operation{
    // ensure patch is applied to appropriate resource version only
    {Operation: "test", Path: "/metadata/resourceVersion", Value: orig.ObjectMeta.ResourceVersion},
}
for _, o := range patch {
    // safeguard add ops -- "add" will update existing paths, this "test" ensures the path is empty
    if o.Operation == "add" {
        var m map[string]interface{}
        ops = append(ops, jsonpatch.Operation{Operation: "test", Path: o.Path, Value: m})
    }
}
If I can find a reference I'll add a few links. IMO, using annotations / resource versions and the jsonpatch "transaction" closely matched how locks are often implemented with etcd.
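For illustration, a patch built this way might serialize to something like the following (the resource version, allocation path and "podref" field are made-up values for this sketch, not taken from the actual IPPool schema):

[
  { "op": "test", "path": "/metadata/resourceVersion", "value": "123456" },
  { "op": "test", "path": "/spec/allocations/42", "value": null },
  { "op": "add", "path": "/spec/allocations/42", "value": { "podref": "default/example-pod" } }
]

The second "test" op is the safeguard from the loop above: it only passes while the path is still empty, and any failed test causes the API server to reject the whole patch.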

The intent here was to use JSON patch for the update operation, with added tests to:

  • Only allow the update operation to occur on the expected resource version
    • this should prevent two operations from being written at the same time (and retry logic should allow eventual success)
  • ensure patched paths (adds) are empty
    • this seemed like an extra safeguard that was good to add

Perhaps something is going wrong with the patch, or with how the tests were assumed to work.

How many IPs are in use when this occurs? Could we be hitting our upper bound? I know that we have one, but I have not looked into it before.

I am a few versions behind, and only use a few IPs so may not be hitting this edge case.

JanScheurich commented May 25, 2021

We see this issue massively (~5 duplicates per IP address) already with a 100-pod deployment in a cluster of 80 nodes when using a huge /8 IPv4 range, so we can rule out a lack of available IPs. In the kubelet logs we observed a huge number of whereabouts retries due to failed patch operations. Apparently the concurrency protection of the patch works at times but does not fully prevent duplicate assignments. I can't rule out that we hit the upper limit of retries, but looking at the code, that should lead to an IP address assignment failure, and we don't see that.

In non-trivial cluster and application deployments, I think the locking approach of the etcd datastore is much more suitable. The lock-less K8s approach, even if it were safe, leads to a lot of blind load on the K8s API due to systematically failing patch attempts.

@pperiyasamy

The intent here was to use JSON patch for the update operation, with added tests to: […]

@crandles I believe this still lets the write succeed by simply overwriting an existing IP address mapped to another pod in p.pool.Spec.Allocations. That's not what we want, is it?
As Jan mentioned, I think we should use a similar approach to the one implemented for the etcd datastore type.


JanScheurich commented May 25, 2021

@pperiyasamy : I have not found any specific documentation about the patch operation of the K8s API that would describe the meaning of tests within a patch and how the K8s API reacts to failing tests. My naive interpretation of what should happen inside the k8s API server is:

  1. Lock the K8s resource to be patched in etcd
  2. Read the locked resource
  3. Check the tests with respect to the locked resource
  4a. If all tests succeed, apply the patch to the resource, write it to etcd and remove the lock. Return success.
  4b. If any of the tests fails, do not apply the patch, remove the lock and return failure.

With the specific test on the original resource version, this should let only one of a set of conflicting concurrent patches succeed (assuming, of course, that the resource version is bumped with every update). The remaining patches should fail and trigger a complete retry (read, modify, patch).

Conceptually this is similar to a spin lock using atomic compare&swap instructions, except here all of the involved operations are heavy on the K8s API. So this doesn't seem like a good approach in scenarios where many transactions happen in parallel, as is systematically the case when K8s replicasets, statefulsets or daemonsets are deployed/deleted with many pod replicas.
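As a rough illustration of that read-modify-write retry pattern (a sketch only: the IPPool and PoolInterface types, field names and client below are hypothetical stand-ins, not the actual whereabouts code), the loop could look like this, using client-go's conflict-retry helper:

package sketch

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/util/retry"
)

// IPPool is a simplified stand-in for the real pool CRD type.
type IPPool struct {
    metav1.ObjectMeta
    Allocations map[string]string // offset -> pod reference
}

// PoolInterface is a hypothetical minimal client for the pool resource.
type PoolInterface interface {
    Get(ctx context.Context, name string, opts metav1.GetOptions) (*IPPool, error)
    Update(ctx context.Context, pool *IPPool, opts metav1.UpdateOptions) (*IPPool, error)
}

func allocateWithRetry(ctx context.Context, pools PoolInterface, offset, podRef string) error {
    // RetryOnConflict re-runs the closure whenever the apiserver rejects the
    // write with 409 Conflict (stale resourceVersion) -- the same failure mode
    // as a failed "test" op in the JSON patch approach.
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        pool, err := pools.Get(ctx, "ippool-sample", metav1.GetOptions{})
        if err != nil {
            return err
        }
        if pool.Allocations == nil {
            pool.Allocations = map[string]string{}
        }
        pool.Allocations[offset] = podRef // modify the freshly read copy
        _, err = pools.Update(ctx, pool, metav1.UpdateOptions{})
        return err // a conflict here triggers another read/modify/update round
    })
}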

crandles commented May 25, 2021

Re: the upper limit, I meant the limit on the number of IPs we can allocate. This is going to depend on the maximum size of CRD resources; I'm not sure what the maximum number of addresses we can store in one pool is. (Doing some rough math, I suspect we can fit around 22k IPs in a single pool.)
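(For a rough sanity check under assumed numbers: etcd's default request size limit is about 1.5 MiB, so if each allocation entry in the pool spec takes on the order of 70 bytes, 1.5 MiB / 70 B ≈ 22k entries, which lands in the same ballpark as the estimate above. Both figures are assumptions, not measurements.)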

@crandles I believe this still lets the write succeed by simply overwriting an existing IP address mapped to another pod in p.pool.Spec.Allocations. That's not what we want, is it?

Correct, but locking doesn't prevent that either.

I just looked at the lock implementation for etcd, and I think we could do something similar by adding an owner (or similar) annotation to lock the resource, watching and waiting if it is already present, and releasing it upon completion; a rough sketch follows.
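A very rough sketch of that idea, reusing the hypothetical PoolInterface/IPPool types from the retry sketch above (the annotation key, polling interval and error handling here are made up, not a proposed implementation):

package sketch

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
)

const ownerAnnotation = "whereabouts.example/owner" // hypothetical annotation key

func acquireLock(ctx context.Context, pools PoolInterface, name, me string) error {
    // Poll until we manage to stamp our identity on the pool, or the context ends.
    return wait.PollImmediateUntil(250*time.Millisecond, func() (bool, error) {
        pool, err := pools.Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return false, err
        }
        if owner, held := pool.Annotations[ownerAnnotation]; held && owner != me {
            return false, nil // someone else holds the lock; wait and retry
        }
        if pool.Annotations == nil {
            pool.Annotations = map[string]string{}
        }
        pool.Annotations[ownerAnnotation] = me
        // The update only succeeds against the resourceVersion we just read,
        // so two contenders cannot both stamp the annotation.
        if _, err := pools.Update(ctx, pool, metav1.UpdateOptions{}); err != nil {
            return false, nil // lost the race (or transient error); retry
        }
        return true, nil
    }, ctx.Done())
}

func releaseLock(ctx context.Context, pools PoolInterface, name, me string) error {
    pool, err := pools.Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }
    if pool.Annotations[ownerAnnotation] == me {
        delete(pool.Annotations, ownerAnnotation)
        _, err = pools.Update(ctx, pool, metav1.UpdateOptions{})
    }
    return err
}

A real version would also need some kind of lease/expiry so that a crashed node doesn't hold the lock forever.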

dougbtv commented May 26, 2021

Thanks for the collaborative exchange, and Chris, I really like the idea of adding a locking mechanism with an "owner" annotation -- I think that's probably pretty feasible.

Tomo also mentioned the possibility of using some kind of leader-election mechanism, and had linked to an example.

@pperiyasamy

@dougbtv I think so too. The locking mechanism with an "owner" annotation looks like a simple approach, as it can just follow Lock and Unlock semantics, which would leave the Store APIs intact. I guess there still won't be any race conditions with this approach.
Leader election is also a good option, but it might break the current Store API contract, as the IPManagement code would have to be executed in LeaderCallbacks as per this example.
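For reference, a minimal sketch of what the client-go leader-election flow looks like (the lease name, namespace, identity and timings are placeholder values, and doIPAM stands in for the IPManagement call; this is not a proposed whereabouts change):

package sketch

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runWithLeaderElection(ctx context.Context, clientset kubernetes.Interface, id string, doIPAM func(context.Context)) {
    lock := &resourcelock.LeaseLock{
        LeaseMeta:  metav1.ObjectMeta{Name: "whereabouts-sketch", Namespace: "kube-system"},
        Client:     clientset.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: id},
    }
    // RunOrDie blocks: the IPAM work only runs while this instance holds the lease.
    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:            lock,
        ReleaseOnCancel: true,
        LeaseDuration:   10 * time.Second,
        RenewDeadline:   5 * time.Second,
        RetryPeriod:     2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                doIPAM(ctx) // allocate/deallocate only while holding the lease
            },
            OnStoppedLeading: func() {
                // lease lost or released; nothing else to do in this sketch
            },
        },
    })
}

This illustrates the concern above: the allocation code has to move inside OnStartedLeading, which changes the shape of the current Store API.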

dougbtv commented Jun 29, 2021

Just an FYI that this problem has been under active development.

Essentially, what it comes down to is that Peri and Tomo have discussed at length designs for moving forward, and we have proposed a number of fixes which have been locally tested for general usage, but are awaiting some tests at scale -- this is non-trivial to replicate locally.

Tomo has proposed a leader election pull request @ #113

Peri has two pull requests available @ #114 & #115 which use a "lock" CRD to denote that a particular instance has acquired the lock.

Tomo's main consideration is that using the k8s libraries to solve this issue ensures long-term maintainability. Peri has proposed a change to the way we use Update(), although this seems to cause some blind load on the cluster. Peri has also proposed pull requests which use the lock CRD as an alternative and to assess performance facets.

@JanScheurich

During Peri's and my testing we found that the actual root cause of the duplicate IPs was incorrect handling of the K8s client context timeout: at context timeout, the current IP was returned to the CNI instead of an error, even though the IP address had not been successfully marked as allocated in the IP pool.

With a corrected context timeout handling, both the original Patch API and the simplified Update API were working correctly without causing duplicate IPs.
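Spelled out as a sketch (not the actual patch; reserve here stands in for whatever persists the allocation in the IPPool), the corrected behaviour is simply to surface the error instead of the unconfirmed IP:

package sketch

import (
    "context"
    "fmt"
    "net"
)

// allocate illustrates the corrected flow: any error from the pool update --
// including a context deadline -- is propagated to the CNI instead of being
// swallowed and the provisionally chosen IP being returned anyway.
func allocate(ctx context.Context, reserve func(context.Context) (net.IPNet, error)) (net.IPNet, error) {
    ip, err := reserve(ctx)
    if err != nil {
        // context.DeadlineExceeded lands here too: the allocation was never
        // confirmed in the IPPool, so the CNI must see a failure, not an IP.
        return net.IPNet{}, fmt.Errorf("failed to reserve IP: %w", err)
    }
    return ip, nil
}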

We then continued testing to compare the behavior and performance of the different proposed solutions:

  1. Existing Patch API with correct context timeout handling (return context timeout error for deallocate #119)
  2. Update API with correct context timeout handling (replace patch with update api to rewrite IPPool #114 )
  3. IPPoolLock protecting the Update API (use IPPoolLock CR for rewriting IPPool #115)
  4. Tomo's Leader Election solution (Add leader election #113)

As a test case, we created and deleted a K8s deployment with 500 pod replicas on a bare-metal cluster of ~70 worker nodes. Each pod had a single secondary OVS CNI interface with whereabouts IPAM. Without any whereabouts patch this always resulted in a significant number of duplicate IPs.

A major consequence of corrected context timeout handling is that the CNI ADD or DEL commands now fail, triggering Kubelet to retry the pod sandbox creation or deletion. We observed that Kubelet apparently retries sandbox creation infinitely, but sandbox deletion only once.

For a deployment of 500 pods, that can give rise to thousands of FailedCreatePodSandBox pod Warnings with reason "context deadline exceeded". Eventually the deployment succeeds, but it takes quite long and the event noise doesn't look good.

The time it took until all 500 pods were in the Running state was:

  1. Existing Patch API with correct context timeout handling: 5m 25s
  2. Update API with correct context timeout handling: 5m 30s
  3. IPPoolLock around the Update API: 4m 04s
  4. Tomo's Leader Election solution: 35 mins

In contrast to 1-3, the Leader Election solution took 7-8 times as long, but produced only 34 "context deadline exceeded" Warning events and Kubelet retries.

Worse than the warning noise during pod creation is that solutions 1-3 all leave approx. 450 stale IPs in the pool when the deployment of 500 pods is deleted. The root cause is again the context timeouts, together with the fact that Kubelet retries deletion of the pod sandbox only once and then deletes the pod regardless of whether the CNI DEL was successful.

The Leader Election patch did not leave stale IPs or "context deadline exceeded" Warnings, but took on the order of 12 minutes to delete all 500 pods.

We believe the large number of context timeouts in solutions 1-3 is a consequence of the many quasi-parallel K8s API operations triggered by the independent but highly synchronized CNI ADD/DEL operations. As each CNI retries its failing K8s API operation many times, the API rate limiter kicks in and slows down the requests even more. We have tried several back-off schemes in whereabouts' K8s storage backend, but didn't manage to improve the bad behavior significantly.

All in all, the tests have confirmed our suspicion that the K8s API server, with its optimistic locking and client retry paradigm, is a particularly bad choice as a storage backend for the whereabouts IP pool use case. The etcd backend with its blocking lock API is much more efficient and suitable.

The leader election approach appears to be working more correctly and with less noise, but its performance still seems unacceptably slow. We haven't investigated that further, yet.

To rely on the K8s API server as storage backend, we'd probably need to redesign the whereabouts IPAM along the lines of Calico IPAM to assign IP blocks to individual nodes through the K8s API server and manage the IP blocks locally on each node without further use of the API server.

@rthakur-est

We have tried #127 on a 100-node setup with deployment/undeployment of 500 pods and a 2500 ms backoff configuration. It solved both the duplicate IP assignment and the stale IP problems. The time taken was as follows:
a. Deployment - 4 min 34 seconds
b. Undeployment - 5 min 49 seconds

@JanScheurich

@dougbtv Has this issue been fixed in the latest whereabouts release 0.5 or 0.5.1?

dougbtv commented Dec 22, 2021

Just a note that 0.5.0 and 0.5.1 should have the race condition fixes; I recommend 0.5.1 as it also has some bug fixes for IP reconciliation.

https://github.com/k8snetworkplumbingwg/whereabouts/releases/tag/v0.5.1
