
Offline migration of expiring TTL content in a cluster causes v3 store to be inconsistent across cluster members #8305

Closed
smarterclayton opened this issue Jul 25, 2017 · 28 comments

Comments

@smarterclayton
Contributor

smarterclayton commented Jul 25, 2017

We have a 3-node etcd 3.1.9 cluster (with three Kubernetes 1.6 API servers contacting it) that we are upgrading from v2 mode to v3. Post-upgrade, one of the API servers appears to be serving stale reads and watches from right about the time of the upgrade: a few of the API requests that call down into etcd retrieve current data, but a large number never see updates (and never see compaction either). For example, with the cluster at resource version 3,000,000 at upgrade time, writes continue to the cluster, and while other members report around 3,012,000 after 20-30 minutes, the affected member is still returning GET/LIST resource versions at or near 3,000,000.

Scenario:

  1. 3 node etcd cluster at 3.1.9 with 3 kube-apiservers talking to them, each apiserver talks to all etcd nodes
  2. Stop apiservers and etcd
  3. Run etcd migrate-storage on each etcd member (see the sketch after this list)
  4. Start etcd, reattach all TTL'd data to v3 leases
  5. Update apiserver config to etcd3 mode
  6. Start all apiservers
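
For reference, a rough shell sketch of steps 2-5 on a single member (the systemd unit names, data directory, and apiserver flag are assumptions for illustration; the lease reattachment in step 4 is handled by separate tooling and is not shown):

  # step 2: stop the clients first, then etcd
  systemctl stop kube-apiserver            # assumed unit name
  systemctl stop etcd

  # step 3: offline v2 -> v3 migration of this member's data directory
  ETCDCTL_API=3 etcdctl migrate --data-dir=/var/lib/etcd   # assumed data dir

  # step 4: bring etcd back up (TTL'd keys are then reattached to v3 leases separately)
  systemctl start etcd

  # steps 5-6: switch the apiserver to --storage-backend=etcd3 and restart it
  systemctl start kube-apiserver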

Outcome:

One of the three API servers responds to GET/LIST/WATCH with resource versions from before or right after the upgrade. It accepts writes, but never returns the results of writes to those key ranges. The other API servers serve reads and writes fine. All etcd instances report the same leader and the same raft term. We verified that the API servers were calling down into etcd.

After a restart of the affected apiserver, it begins serving up-to-date reads. We have not observed any subsequent stale reads.

@smarterclayton
Contributor Author

We did this twice in a row (upgraded, observed the staleness, restored from backup, ran through the whole procedure again) in our stage environment, and it occurred both times. We noticed that the apiserver on the same node as the etcd leader was the one affected both times (going to run it a third time to see if it is consistent).

@xiang90
Contributor

xiang90 commented Jul 25, 2017

I am not sure if this is an etcd issue, an etcd client issue, or a k8s apiserver issue. Some investigation is needed there first.

@smarterclayton can you share the migration data somehow, so someone from the etcd team can easily reproduce the problem?

@smarterclayton
Contributor Author

Unfortunately no, it's private internal data. Some of the things we have ruled out:

  1. we were unable to get etcdctl to return any stale data from any member, whether via serializable reads, watches, or range reads
  2. we verified we were not hitting the watch cache (which was affected, but we called through it on that server); calls via the etcd client using the etcd3 storage backend were definitely talking to one of the etcd servers and returning stale data
  3. restarting the affected apiserver resolved the issue, so it's either transient on the server, or confined to a client

We do have a reproducible env, so we'll try to get additional data from the servers in order to debug.
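
Roughly, the kind of etcdctl checks behind point 1 (the endpoint and key below are placeholders, not our real data; Kubernetes keeps its data under a configurable prefix):

  EP=https://etcd-1.example.com:2379       # placeholder endpoint

  # serializable (member-local) read
  ETCDCTL_API=3 etcdctl --endpoints=$EP get /registry/namespaces/default --consistency=s

  # linearizable range read (the default consistency)
  ETCDCTL_API=3 etcdctl --endpoints=$EP get /registry/namespaces/default --consistency=l

  # watch a prefix and confirm new events keep arriving
  ETCDCTL_API=3 etcdctl --endpoints=$EP watch /registry/ --prefix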

@xiang90
Contributor

xiang90 commented Jul 25, 2017

we were unable to get etcdctl to return any stale data from any member, whether via serializable reads, watches, or range reads

So issuing reads from etcdctl returns new values, but calling etcd reads from inside the apiserver returns stale data?

restarting the affected apiserver resolved the issue, so it's either transient on the server, or confined to a client

Have you tried restarting the etcd server that the apiserver connects to? If restarting etcd does not solve the problem, the issue is probably inside the apiserver or client code.

@smarterclayton
Contributor Author

So issuing reads from etcdctl returns new values, but calling etcd reads from inside the apiserver returns stale data?

Yes

Have you tried restarting the etcd server that the apiserver connects to? If restarting etcd does not solve the problem, the issue is probably inside the apiserver or client code.

Yes, that will be the next attempt.

@xiang90
Contributor

xiang90 commented Jul 25, 2017

3 node etcd cluster at 3.1.9 with 3 kube-apiservers talking to them, each apiserver talks to all etcd nodes

@smarterclayton also, can you try with a 1-node etcd cluster + 1 kube-apiserver?

@smarterclayton
Contributor Author

So we don't have a setup where we can downgrade this particular cluster to a single instance of each. I will say that we have never observed this in v2 mode, and have not yet observed it in any cluster that was started in etcd v3 mode (3+3 is the standard config for OpenShift). We also don't believe we've observed it in non-migration 1+1 setups where the cluster was created from scratch.

We did confirm that this occurred three times in a row, exactly the same, so we can consistently recreate the scenario with this particular cluster config.

Will try to have more info by tomorrow.

@sdodson

sdodson commented Jul 25, 2017

We reproduced this again. This time we performed the migration, thought the environment was sane, and then some 30 minutes later discovered problems with stale data. We then restarted all api servers and the problem persisted. Then I restarted etcd on the leader and the problem went away without restarting the api server.

@xiang90
Contributor

xiang90 commented Jul 26, 2017

We then restarted all api servers and the problem persisted. Then I restarted etcd on the leader and the problem went away without restarting the api server.

Interesting... If you use etcdctl to query the etcd leader, will you see stale data?

@liggitt
Contributor

liggitt commented Jul 26, 2017

after further debugging, one of the etcd nodes actually has different data in its store, despite claiming to be up to date with the raft index.

Our migration process:

  • takes all etcd readers/writers offline
  • ensures all etcd members are at the same raft index
  • takes all etcd members offline
  • runs the v2->v3 migration on each member's data
  • brings up etcd members
  • brings up clients (v3 clients like HA kubernetes API servers, and v2 clients that use the v2 API for leader election)

After a period of time that varies, we observe the etcd nodes reporting different data for the same read queries. This persists across restarts of all etcd members, with all clients shut down, no writes occurring, and all etcd members reporting the same raft index.

@xiang90
Contributor

xiang90 commented Jul 26, 2017

@liggitt

So issuing reads from etcdctl returns new values, but calling etcd reads from inside the apiserver returns stale data?
Yes

So this is NOT the case? We need to get a clear answer on this.

@liggitt
Contributor

liggitt commented Jul 26, 2017

correct, using etcdctl directly against etcd returns different results depending on which member you query.

https://bugzilla.redhat.com/show_bug.cgi?id=1475351#c8 has more details as well

@xiang90
Contributor

xiang90 commented Jul 26, 2017

@liggitt

https://bugzilla.redhat.com/show_bug.cgi?id=1475351#c8 has more details as well

Can you actually summarize the problem and post it here? Thanks.

@liggitt
Contributor

liggitt commented Jul 26, 2017

#8305 (comment) is a good summary

@xiang90
Contributor

xiang90 commented Jul 26, 2017

runs the v2->v3 migration on each member's data

After this, can you check whether the hashes of the members match? (either via the gRPC hash API or by manually checking the contents)

@liggitt
Contributor

liggitt commented Jul 26, 2017

the member data?

@xiang90
Contributor

xiang90 commented Jul 26, 2017

the member data?

The data you can get from the KV store: a range over all keys.
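
For example, one way to do the manual comparison (endpoints are placeholders; serializable reads keep each query on the member it targets):

  # hash the full key space as seen by each member and compare
  for ep in https://etcd-1:2379 https://etcd-2:2379 https://etcd-3:2379; do
    printf '%s  ' "$ep"
    ETCDCTL_API=3 etcdctl --endpoints="$ep" get "" --prefix --consistency=s | sha256sum
  done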

@liggitt
Contributor

liggitt commented Jul 31, 2017

we tracked this down to an issue migrating v2->v3 stores when the v2 store contains actively expiring TTL keys.

Reproducer script at https://gist.github.com/liggitt/0feedecf5d6d1b51113bf58d10a22b4c

We followed the offline migration guide at https://coreos.com/etcd/docs/latest/op-guide/v2-migration.html#offline-migration, which did not mention issues migrating data stores containing actively expiring TTL keys.

This text is not correct in the presence of TTL keys:

First, all members in the etcd cluster must converge to the same state. This can be achieved by stopping all applications that write keys to etcd. Alternatively, if the applications must remain running, configure etcd to listen on a different client URL and restart all etcd members. To check if the states converged, within a few seconds, use the ETCDCTL_API=3 etcdctl endpoint status command to confirm that the raft index of all members match (or differ by at most 1 due to an internal sync raft command).

If differing content is migrated, it puts the mvcc stores in an inconsistent state that can affect future transactions on migrated data, or on new data.

It seems like the following should be done:

  • Update the migration guide to make it very clear that clusters with actively expiring TTL data should not run migration on each individual member if they intend to preserve ttl keys (which is the default), but should run migration on one member, then rejoin the other members as if they were new members.
  • Update the migrate command to warn that migrate should only be run on one member of a cluster if --no-ttl=false and TTL data is encountered.
  • Update health checking to detect when the mvcc store revisions differ at the same raft index. I couldn't tell if alarm: CORRUPT #7125 included detection of that case.
  • Include the store revision in the table output of endpoint status... we were just checking that the raft index matched, and didn't notice the revision was out of sync until we dumped the JSON content (see the example after this list).
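
For anyone checking this by hand in the meantime, the revision and raft index can both be pulled out of the JSON output; a rough sketch (endpoints are placeholders, and jq is assumed to be available):

  # a mismatch in revision at the same raft index means the v3 stores have diverged
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
    endpoint status -w json \
    | jq -r '.[] | [.Endpoint, .Status.header.revision, .Status.raftIndex] | @tsv'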

@smarterclayton smarterclayton changed the title etcd client (kube-apiserver) seeing stale reads and watches, but writes succeed Offline migration of expiring TTL content in a cluster causes v3 store to be inconsistent across cluster members Jul 31, 2017
@xiang90
Contributor

xiang90 commented Jul 31, 2017

@liggitt

It seems the problem is that the data on the members had not converged when you ran the migration tool, due to expiring TTL keys. The doc assumes that the state of the members is consistent. Can you help improve the migration doc to make this clear?

@smarterclayton
Contributor Author

smarterclayton commented Jul 31, 2017

The current doc recommends using a command that requires the master be online. In the presence of TTL there is no way for an online master to be sure it is consistent (expirations in the TTL store don't seem to increment the raft index?). Is there an offline command? In the presence of TTL you'd have to continually stop and start the cluster until you were sure the stores were consistent, and only do the check while offline on each node.

@smarterclayton
Contributor Author

If TTL expiration doesn't go through raft (is it correct that it does not?), how can you be sure the cluster members are consistent?

@heyitsanthony
Contributor

@smarterclayton TTL expiration goes through consensus. Related: https://github.com/coreos/etcd/blob/master/etcdserver/apply_v2.go#L122.

@smarterclayton
Contributor Author

smarterclayton commented Jul 31, 2017

Ok. So we see the following:

  1. have a cluster setup
  2. verify members are consistent
  3. shut down members
  4. migrate

RESULT: members are not consistent.

If the expiration increments the raft index, how do we ensure that we don't race on shutdown with a TTL expiration being applied? If you have any TTLs, then the currently recommended doc is wrong, because there is no way to prevent an expiration from happening between steps 2 and 3. Even if we could check it offline, you'd basically just have to write a start / check / stop / check loop that runs until it gets lucky. That's pretty crazy.
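
To illustrate, that loop would look something like the sketch below (systemd-managed etcd, placeholder endpoints, and jq are all assumptions), and even then it can lose the race:

  # sketch only: bounce the cluster until a check catches every member at the same raft index
  while true; do
    systemctl start etcd                   # on every member
    sleep 5
    n=$(ETCDCTL_API=3 etcdctl \
          --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
          endpoint status -w json \
          | jq -r '.[].Status.raftIndex' | sort -u | wc -l)
    systemctl stop etcd                    # on every member
    [ "$n" -eq 1 ] && break                # every member reported the same raft index
    # a TTL key can still expire between the check and the stop, so even this can race
  done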

@smarterclayton
Contributor Author

Also, the failure mode for a migration with inconsistent content is pretty bad. Is there a better process the doc should recommend to ensure the stores are consistent post migration that we are missing?

@xiang90
Contributor

xiang90 commented Jul 31, 2017

Also, the failure mode for a migration with inconsistent content is pretty bad. Is there a better process the doc should recommend to ensure the stores are consistent post migration that we are missing?

We are working on adding runtime hash checking.

@heyitsanthony
Contributor

Closing in favor of #8348 so inconsistencies can be detected at boot via command line.

@smarterclayton
Contributor Author

Is there going to be a separate issue to track changing the doc?

@heyitsanthony
Contributor

@smarterclayton I added a note on the etcdctl issue; it doesn't need to be tracked separately
