IPAM raising KeyError #47
Ugh. There shouldn't be any issue with blocks disappearing under anyone's feet, for the simple reason that we never delete blocks, only create and update them. It's odd because the code path you're hitting first tries to do an atomic create of a block, which fails because the block already exists. Then it tries to read the existing block, and that read fails, claiming the block doesn't exist. What's going on here? Does the block exist or not? After a little head-scratching I remembered an etcd issue from a while ago: etcd-io/etcd#741. It seems we may be being bitten by that issue, where reads are not consistent by default (for the definition of consistent that would stop us hitting the behaviour above). @plwhite what's the impact of this issue? Are you blocked by it, etc.?
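To make the failure mode concrete, here's a minimal sketch of that create-then-read path using python-etcd. This is not the actual libcalico code; the key layout and JSON payload are made up for illustration.

```python
import etcd

client = etcd.Client(host='127.0.0.1', port=4001)
block_key = '/calico/ipam/block/192.168.56.0-26'   # hypothetical key layout
new_block = '{"cidr": "192.168.56.0/26", "allocations": []}'  # illustrative payload

try:
    # Atomic create: only succeeds if the key does not already exist.
    client.write(block_key, new_block, prevExist=False)
except etcd.EtcdAlreadyExist:
    # Another host won the race, so the block must exist.  But a plain
    # (non-quorum) read can be served by a follower that hasn't replicated
    # the winning write yet, so this read can still report "key not found".
    # python-etcd's not-found exception derives from KeyError, which is how
    # the failure surfaces to the IPAM caller.
    existing = client.read(block_key)
```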
I have a workaround, because I am catching the KeyError and retrying, so this is not blocking at all. If my workaround doesn't work then I may have to revisit that, but I'm fairly relaxed about that risk. I don't know what's in etcd because I nuked it after saving the logs, unfortunately. However, I did find that another host grabbed the IP address 192.168.56.0 from the block (the first in the block) at 15:07:48,415, which is a few milliseconds before this host attempts to grab the block. Hence I'm very sure that it's contention between two hosts trying to grab the same block, and one of them not quite getting in in time. My guess is that this host's subsequent read simply hadn't caught up with the other host's write.
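For reference, the workaround amounts to something like the sketch below. `assign_address` is a made-up stand-in for whatever IPAM call the host agent makes, not a name taken from libcalico.

```python
import time

def assign_with_retry(ipam_client, hostname, retries=3, delay=0.1):
    """Retry an IPAM call when a stale etcd read surfaces as a KeyError."""
    for attempt in range(retries):
        try:
            return ipam_client.assign_address(hostname)  # hypothetical IPAM call
        except KeyError:
            # Most likely another host just claimed the same block and this
            # host's read hasn't caught up yet; back off briefly and retry.
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```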
That feels plausible if reads are satisfied at the closest node but writes go to the elected master (so two different etcd nodes are being asked).
Yeah, that's more or less exactly what etcd-io/etcd#741 is about.
Link to the updated docs associated with this problem: https://github.com/coreos/etcd/pull/883/files. Until there is a better fix in etcd, it looks like it is necessary to pass quorum=true on a GET, which will increase the response time.
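With python-etcd (assuming a client version that supports the flag), that would look something like the following; the key is the same made-up example as above.

```python
import etcd

client = etcd.Client(host='127.0.0.1', port=4001)

# A quorum read is answered via the leader, so it is guaranteed to see the
# write that made the atomic create fail -- at the cost of extra latency on
# every GET.
result = client.read('/calico/ipam/block/192.168.56.0-26', quorum=True)
print(result.value)
```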
@robbrockbank that's my assessment too. There is some scope for using this sparingly in our implementation. Only certain critical areas can lead to the race condition Pete is seeing.
On further analysis, I've decided we should always use quorum reads. We could build retry logic into the reads instead, but if the follower still hadn't caught up after the retries the read could fail with KeyError anyway. Super confusing, and I think we're better off just taking the constant-factor perf hit.
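One way to make "always use quorum reads" hard to forget would be a thin wrapper that defaults the flag on every read. This is a sketch only, under the assumption of python-etcd; the class name is made up and nothing like it is claimed to exist in libcalico.

```python
import etcd

class QuorumClient(object):
    """Wraps etcd.Client so every read is a quorum read unless overridden."""

    def __init__(self, *args, **kwargs):
        self._client = etcd.Client(*args, **kwargs)

    def read(self, key, **kwargs):
        # Default to quorum=True; callers that tolerate stale data can
        # still pass quorum=False explicitly.
        kwargs.setdefault('quorum', True)
        return self._client.read(key, **kwargs)

    def write(self, key, value, **kwargs):
        # Writes always go through the leader, so no flag is needed here.
        return self._client.write(key, value, **kwargs)
```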
Would be nice if this were a global option on the etcd server, so it wasn't required on each GET request. Worth raising with etcd?
I don't think we'd want to do it as a global option. There are plenty of cases (show commands, for example) where we wouldn't really care about stale reads and don't want the overhead.
It's not impossible that this is a problem with the calling code, but here is a stack trace that, to me, implies a possible bug in the libcalico IPAM.
My suspicion is that the issue was caused by multiple hosts attempting to grab an IP at the same time, and that one of them found that the block it was about to grab went away under its feet (this is a moderately sized scale test with 200 hosts, one of which failed). Is that plausible? If so, I'd like to aim for a workaround (assuming the fix isn't too hard). Would catching the KeyError and trying again be sensible?
The calling code in host-agent.py looks like this.