IPAM raising KeyError #47
Ugh. There shouldn't be any issue with blocks disappearing under anyone's feet, for the simple reason that we never delete blocks, only create and update them. It's odd because the code path you're hitting first tries to do an atomic create of a block, which fails because the block already exists. Then it tries to read the existing block, and that read fails, claiming the block doesn't exist. What's going on here? Does the block exist or not? After a little head-scratching I remembered an etcd issue from a while ago: etcd-io/etcd#741. It seems we may be being bitten by that issue, where reads are not consistent by default (for the definition of consistent that would stop us hitting the behaviour above). @plwhite what's the impact of this issue? Are you blocked by it, etc.?
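To make the failure mode concrete, here's a minimal sketch of that create-then-read path using python-etcd. This is not the actual libcalico code; the key layout and JSON payload are made up for illustration.

```python
import etcd

client = etcd.Client(host='127.0.0.1', port=4001)
block_key = '/calico/ipam/block/192.168.56.0-26'   # hypothetical key layout
new_block = '{"cidr": "192.168.56.0/26", "allocations": []}'  # illustrative payload

try:
    # Atomic create: only succeeds if the key does not already exist.
    client.write(block_key, new_block, prevExist=False)
except etcd.EtcdAlreadyExist:
    # Another host won the race, so the block must exist.  But a plain
    # (non-quorum) read can be served by a follower that hasn't replicated
    # the winning write yet, so this read can still report "key not found".
    # python-etcd's not-found exception derives from KeyError, which is how
    # the failure surfaces to the IPAM caller.
    existing = client.read(block_key)
```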
I have a workaround, because I am catching the KeyError and retrying, so this is not blocking at all. If my workaround doesn't work then I may have to revisit that, but I'm fairly relaxed about that risk. I don't know what's in etcd because I nuked it after saving the logs, unfortunately. However, I did find that another host grabbed the IP address 192.168.56.0 from the block (the first in the block) at 15:07:48,415, which is a few milliseconds before this host attempts to grab the block. Hence I'm very sure that it's contention between two hosts trying to grab the same block, and one of them not quite getting in in time. My guess is that this host's subsequent read simply hadn't caught up with the other host's write.
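For reference, the workaround amounts to something like the sketch below. `assign_address` is a made-up stand-in for whatever IPAM call the host agent makes, not a name taken from libcalico.

```python
import time

def assign_with_retry(ipam_client, hostname, retries=3, delay=0.1):
    """Retry an IPAM call when a stale etcd read surfaces as a KeyError."""
    for attempt in range(retries):
        try:
            return ipam_client.assign_address(hostname)  # hypothetical IPAM call
        except KeyError:
            # Most likely another host just claimed the same block and this
            # host's read hasn't caught up yet; back off briefly and retry.
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```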
That feels plausible if reads are satisfied at the closest node but writes go to the elected master (so two different etcd nodes are being asked).
Yeah, that's more or less exactly what etcd-io/etcd#741 is about.
Link to the updated docs associated with this problem: https://github.com/coreos/etcd/pull/883/files. Until there is a better fix in etcd, it looks like it is necessary to pass quorum=true on a GET, which will increase the response time.
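With python-etcd (assuming a client version that supports the flag), that would look something like the following; the key is the same made-up example as above.

```python
import etcd

client = etcd.Client(host='127.0.0.1', port=4001)

# A quorum read is answered via the leader, so it is guaranteed to see the
# write that made the atomic create fail -- at the cost of extra latency on
# every GET.
result = client.read('/calico/ipam/block/192.168.56.0-26', quorum=True)
print(result.value)
```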
@robbrockbank that's my assessment too. There is some scope for using this sparingly in our implementation. Only certain critical areas can lead to the race condition Pete is seeing.
On further analysis, I've decided we should always use quorum reads. We could build retry logic into the reads instead, but if the follower still hadn't caught up after the retries the read could fail with KeyError anyway. Super confusing, and I think we're better off just taking the constant-factor perf hit.
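One way to make "always use quorum reads" hard to forget would be a thin wrapper that defaults the flag on every read. This is a sketch only, under the assumption of python-etcd; the class name is made up and nothing like it is claimed to exist in libcalico.

```python
import etcd

class QuorumClient(object):
    """Wraps etcd.Client so every read is a quorum read unless overridden."""

    def __init__(self, *args, **kwargs):
        self._client = etcd.Client(*args, **kwargs)

    def read(self, key, **kwargs):
        # Default to quorum=True; callers that tolerate stale data can
        # still pass quorum=False explicitly.
        kwargs.setdefault('quorum', True)
        return self._client.read(key, **kwargs)

    def write(self, key, value, **kwargs):
        # Writes always go through the leader, so no flag is needed here.
        return self._client.write(key, value, **kwargs)
```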
Would be nice if this were a global option on the etcd server, so it wasn't required on each GET request. Worth raising with etcd?
I don't think we'd want to do it as a global option. There are plenty of cases (show commands, for example) where we wouldn't really care about stale reads and don't want the overhead.
It's not impossible that this is a problem with the calling code, but here is a stack trace that, to me, implies a possible bug in the libcalico IPAM.
My suspicion is that the issue was caused by multiple hosts attempting to grab an IP at the same time, and that one of them found that the block it was about to grab went away under its feet (this is a moderately sized scale test with 200 hosts, one of which failed). Is that plausible? If so, I'd like to aim for a workaround (assuming the fix isn't too hard). Would catching the KeyError and trying again be sensible?
The calling code in host-agent.py looks like this.