
Offline migration of expiring TTL content in a cluster causes v3 store to be inconsistent across cluster members #8305

Closed
smarterclayton opened this issue Jul 25, 2017 · 28 comments

Comments

@smarterclayton
Contributor

smarterclayton commented Jul 25, 2017

We have a 3-node etcd 3.1.9 cluster (with three Kubernetes 1.6 API servers contacting it) that we are upgrading from v2 mode to v3. Post-upgrade, one of the API servers appears to be serving stale reads and watches from right about the time of the upgrade: a few of the API requests that call down into etcd retrieve current data, but a large number never see updates (and never see compaction either). For example, with the cluster at resource version 3,000,000 at upgrade time, writes continue to the cluster, and while other members report around 3,012,000 after 20-30 minutes, the affected member is still returning GET/LIST resource versions at or near 3,000,000.

Scenario:

  1. 3 node etcd cluster at 3.1.9 with 3 kube-apiservers talking to them, each apiserver talks to all etcd nodes
  2. Stop apiservers and etcd
  3. Run etcd migrate-storage on each etcd member (see the sketch after this list)
  4. Start etcd, reattach all TTL'd data to v3 leases
  5. Update apiserver config to etcd3 mode
  6. Start all apiservers
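
For reference, a rough shell sketch of steps 2-5 on a single member (the systemd unit names, data directory, and apiserver flag are assumptions for illustration; the lease reattachment in step 4 is handled by separate tooling and is not shown):

  # step 2: stop the clients first, then etcd
  systemctl stop kube-apiserver            # assumed unit name
  systemctl stop etcd

  # step 3: offline v2 -> v3 migration of this member's data directory
  ETCDCTL_API=3 etcdctl migrate --data-dir=/var/lib/etcd   # assumed data dir

  # step 4: bring etcd back up (TTL'd keys are then reattached to v3 leases separately)
  systemctl start etcd

  # steps 5-6: switch the apiserver to --storage-backend=etcd3 and restart it
  systemctl start kube-apiserver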

Outcome:

One of the three API servers responds to GET/LIST/WATCH with resource versions from before or right after the upgrade. It accepts writes, but never returns the results of writes to those key ranges. The other API servers serve reads and writes fine. All etcd instances report the same leader and the same raft term. We verified that the API servers were calling down into etcd.

After a restart of the affected apiserver, it begins serving up-to-date reads. We have not observed any subsequent stale reads.

@smarterclayton
Contributor Author

We did this twice in a row (upgraded, observed the staleness, restored from backup, ran through the whole procedure again) in our stage environment, and it occurred both times. We noticed that the apiserver on the same node as the etcd leader was the one affected both times (going to run it a third time to see if it is consistent).

@xiang90
Contributor

xiang90 commented Jul 25, 2017

I am not sure if this is an etcd issue, an etcd client issue, or a k8s apiserver issue. Some investigation is needed there first.

@smarterclayton can you share the migration data somehow, so someone from the etcd team can easily reproduce the problem?

@smarterclayton
Contributor Author

Unfortunately no, it's private internal data. Some of the things we have ruled out:

  1. we were unable to get etcdctl to return any stale data from any member, whether via serializable reads, watches, or range reads
  2. we verified we were not hitting the watch cache (which was affected, but we called through it on that server); calls via the etcd client using the etcd3 storage backend were definitely talking to one of the etcd servers and returning stale data
  3. restarting the affected apiserver resolved the issue, so it's either transient on the server, or confined to a client

We do have a reproducible env, so we'll try to get additional data from the servers in order to debug.
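
Roughly, the kind of etcdctl checks behind point 1 (the endpoint and key below are placeholders, not our real data; Kubernetes keeps its data under a configurable prefix):

  EP=https://etcd-1.example.com:2379       # placeholder endpoint

  # serializable (member-local) read
  ETCDCTL_API=3 etcdctl --endpoints=$EP get /registry/namespaces/default --consistency=s

  # linearizable range read (the default consistency)
  ETCDCTL_API=3 etcdctl --endpoints=$EP get /registry/namespaces/default --consistency=l

  # watch a prefix and confirm new events keep arriving
  ETCDCTL_API=3 etcdctl --endpoints=$EP watch /registry/ --prefix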

@xiang90
Contributor

xiang90 commented Jul 25, 2017

we were unable to get etcdctl to return any stale data from any member, whether via serializable reads, watches, or range reads

So issuing reads from etcdctl returns new values, but calling etcd reads from inside the apiserver returns stale data?

restarting the affected apiserver resolved the issue, so it's either transient on the server, or confined to a client

Have you tried restarting the etcd server that the apiserver connects to? If restarting etcd does not solve the problem, the issue is probably inside the apiserver or client code.

@smarterclayton
Contributor Author

So issuing reads from etcdctl returns new values, but calling etcd reads from inside the apiserver returns stale data?

Yes

Have you tried restarting the etcd server that the apiserver connects to? If restarting etcd does not solve the problem, the issue is probably inside the apiserver or client code.

Yes, that will be the next attempt.

@xiang90
Contributor

xiang90 commented Jul 25, 2017

3 node etcd cluster at 3.1.9 with 3 kube-apiservers talking to them, each apiserver talks to all etcd nodes

@smarterclayton also, can you try with a 1-node etcd cluster + 1 kube-apiserver?

@smarterclayton
Contributor Author

So we don't have a setup where we can downgrade this particular cluster to a single instance of each. I will say that we have never observed this in v2 mode, and have not yet observed it in any cluster that was started in etcd v3 mode (3+3 is the standard config for OpenShift). We also don't believe we've observed it in non-migration 1+1 setups where the cluster was created from scratch.

We did confirm that this occurred three times in a row, exactly the same, so we can consistently recreate the scenario with this particular cluster config.

Will try to have more info by tomorrow.

@sdodson

sdodson commented Jul 25, 2017

We reproduced this again. This time we performed the migration, thought the environment was sane, and then some 30 minutes later discovered problems with stale data. We then restarted all api servers and the problem persisted. Then I restarted etcd on the leader and the problem went away without restarting the api server.

@xiang90
Contributor

xiang90 commented Jul 26, 2017

We then restarted all api servers and the problem persisted. Then I restarted etcd on the leader and the problem went away without restarting the api server.

Interesting... If you use etcdctl to query the etcd leader, will you see stale data?

@liggitt
Contributor

liggitt commented Jul 26, 2017

after further debugging, one of the etcd nodes actually has different data in its store, despite claiming to be up to date with the raft index.

Our migration process:

  • takes all etcd readers/writers offline
  • ensures all etcd members are at the same raft index
  • takes all etcd members offline
  • runs the v2->v3 migration on each member's data
  • brings up etcd members
  • brings up clients (v3 clients like HA kubernetes API servers, and v2 clients that use the v2 API for leader election)

After a period of time that varies, we observe the etcd nodes reporting different data for the same read queries. This persists across restarts of all etcd members, with all clients shut down, no writes occurring, and all etcd members reporting the same raft index.

@xiang90
Contributor

xiang90 commented Jul 26, 2017

@liggitt

So issuing reads from etcdctl returns new values, but calling etcd reads from inside the apiserver returns stale data?
Yes

So this is NOT the case? We need to get a clear answer on this.

@liggitt
Contributor

liggitt commented Jul 26, 2017

correct, using etcdctl directly against etcd returns different results depending on which member you query.

https://bugzilla.redhat.com/show_bug.cgi?id=1475351#c8 has more details as well

@xiang90
Contributor

xiang90 commented Jul 26, 2017

@liggitt

https://bugzilla.redhat.com/show_bug.cgi?id=1475351#c8 has more details as well

Can you actually summarize the problem and post it here? Thanks.

@liggitt
Contributor

liggitt commented Jul 26, 2017

#8305 (comment) is a good summary

@xiang90
Contributor

xiang90 commented Jul 26, 2017

runs the v2->v3 migration on each member's data

After this, can you check whether the hashes of the members match? (either via the gRPC hash API or by manually checking the contents)

@liggitt
Contributor

liggitt commented Jul 26, 2017

the member data?

@xiang90
Contributor

xiang90 commented Jul 26, 2017

the member data?

The data you can get from the KV store: a range over all keys.
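
For example, one way to do the manual comparison (endpoints are placeholders; serializable reads keep each query on the member it targets):

  # hash the full key space as seen by each member and compare
  for ep in https://etcd-1:2379 https://etcd-2:2379 https://etcd-3:2379; do
    printf '%s  ' "$ep"
    ETCDCTL_API=3 etcdctl --endpoints="$ep" get "" --prefix --consistency=s | sha256sum
  done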

@liggitt
Contributor

liggitt commented Jul 31, 2017

we tracked this down to an issue migrating v2->v3 stores when the v2 store contains actively expiring TTL keys.

Reproducer script at https://gist.github.com/liggitt/0feedecf5d6d1b51113bf58d10a22b4c

We followed the offline migration guide at https://coreos.com/etcd/docs/latest/op-guide/v2-migration.html#offline-migration, which did not mention issues migrating data stores containing actively expiring TTL keys.

This text is not correct in the presence of TTL keys:

First, all members in the etcd cluster must converge to the same state. This can be achieved by stopping all applications that write keys to etcd. Alternatively, if the applications must remain running, configure etcd to listen on a different client URL and restart all etcd members. To check if the states converged, within a few seconds, use the ETCDCTL_API=3 etcdctl endpoint status command to confirm that the raft index of all members match (or differ by at most 1 due to an internal sync raft command).

If differing content is migrated, it puts the mvcc stores in an inconsistent state that can affect future transactions on migrated data, or on new data.

It seems like the following should be done:

  • Update the migration guide to make it very clear that clusters with actively expiring TTL data should not run migration on each individual member if they intend to preserve ttl keys (which is the default), but should run migration on one member, then rejoin the other members as if they were new members.
  • Update the migrate command to warn that migrate should only be run on one member of a cluster if --no-ttl=false and TTL data is encountered.
  • Update health checking to detect when the mvcc store revisions differ at the same raft index. I couldn't tell if alarm: CORRUPT #7125 included detection of that case.
  • Include the store revision in the table output of endpoint status... we were just checking that the raft index matched, and didn't notice the revision was out of sync until we dumped the JSON content (see the example after this list).
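
For anyone checking this by hand in the meantime, the revision and raft index can both be pulled out of the JSON output; a rough sketch (endpoints are placeholders, and jq is assumed to be available):

  # a mismatch in revision at the same raft index means the v3 stores have diverged
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
    endpoint status -w json \
    | jq -r '.[] | [.Endpoint, .Status.header.revision, .Status.raftIndex] | @tsv'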

@smarterclayton smarterclayton changed the title etcd client (kube-apiserver) seeing stale reads and watches, but writes succeed Offline migration of expiring TTL content in a cluster causes v3 store to be inconsistent across cluster members Jul 31, 2017
@xiang90
Contributor

xiang90 commented Jul 31, 2017

@liggitt

It seems the problem is that the data on the members had not converged when you ran the migration tool, due to expiring TTL keys. The doc assumes that the state of the members is consistent. Can you help improve the migration doc to make this clear?

@smarterclayton
Contributor Author

smarterclayton commented Jul 31, 2017

The current doc recommends using a command that requires the master be online. In the presence of TTL there is no way for an online master to be sure it is consistent (expirations in the TTL store don't seem to increment the raft index?). Is there an offline command? In the presence of TTL you'd have to continually stop and start the cluster until you were sure the stores were consistent, and only do the check while offline on each node.

@smarterclayton
Contributor Author

If TTL expiration doesn't go through raft (is it correct that it does not?), how can you be sure the cluster members are consistent?

@heyitsanthony
Contributor

@smarterclayton TTL expiration goes through consensus. Related: https://github.com/coreos/etcd/blob/master/etcdserver/apply_v2.go#L122.

@smarterclayton
Contributor Author

smarterclayton commented Jul 31, 2017

Ok. So we see the following:

  1. have a cluster setup
  2. verify members are consistent
  3. shut down members
  4. migrate

RESULT: members are not consistent.

If the expiration increments the raft index, how do we ensure that we don't race on shutdown with a TTL expiration being applied? If you have any TTLs, then the currently recommended doc is wrong, because there is no way to prevent an expiration from happening between steps 2 and 3. Even if we could check it offline, you'd basically just have to write a start / check / stop / check loop that runs until it gets lucky. That's pretty crazy.
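
To illustrate, that loop would look something like the sketch below (systemd-managed etcd, placeholder endpoints, and jq are all assumptions), and even then it can lose the race:

  # sketch only: bounce the cluster until a check catches every member at the same raft index
  while true; do
    systemctl start etcd                   # on every member
    sleep 5
    n=$(ETCDCTL_API=3 etcdctl \
          --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
          endpoint status -w json \
          | jq -r '.[].Status.raftIndex' | sort -u | wc -l)
    systemctl stop etcd                    # on every member
    [ "$n" -eq 1 ] && break                # every member reported the same raft index
    # a TTL key can still expire between the check and the stop, so even this can race
  done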

@smarterclayton
Contributor Author

Also, the failure mode for a migration with inconsistent content is pretty bad. Is there a better process the doc should recommend to ensure the stores are consistent post migration that we are missing?

@xiang90
Contributor

xiang90 commented Jul 31, 2017

Also, the failure mode for a migration with inconsistent content is pretty bad. Is there a better process the doc should recommend to ensure the stores are consistent post migration that we are missing?

We are working on adding runtime hash checking.

@heyitsanthony
Contributor

Closing in favor of #8348 so inconsistencies can be detected at boot via command line.

@smarterclayton
Contributor Author

Is there going to be a separate issue to track changing the doc?

@heyitsanthony
Contributor

@smarterclayton I added a note on the etcdctl issue; it doesn't need to be tracked separately
