
Restoring etcd without breaking API guarantees, aka etcd restore for Kubernetes #16028

Closed
5 tasks done
Tracked by #118501
serathius opened this issue Jun 7, 2023 · 19 comments

Comments

@serathius
Member

serathius commented Jun 7, 2023

What would you like to be added?

This is the etcd-side issue for kubernetes/kubernetes#118501

tl;dr: The etcd restore operation breaks etcd's existing guarantee that the revision never decreases.

Proposal

  • The etcd restore operation should support bumping the revision of subsequent etcd operations by an arbitrary number.
  • The etcd restore operation should support marking old revisions as compacted.
  • The etcd restore operation should support bumping the revision of the latest key versions.

Tracking work

  • Implement an optional --bump-revision flag for the etcdutl snapshot restore operation.
  • Implement an optional --mark-compacted flag for the etcdutl snapshot restore operation.
  • Confirm that those changes are backport-compatible with v3.5 and v3.4.
  • Add e2e tests for etcd restore that confirm that (a rough client-side sketch follows this list):
    • The revision doesn't decrease if the value provided via --bump-revision is high enough
    • Clients are notified about the restore by their watches being broken
    • Clients cannot access revisions from between the backup and the restore operation
  • Backport the changes and tests to v3.5 and v3.4 etcd.
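
A minimal sketch of what those client-facing checks could look like against a restored cluster, using clientv3; the endpoint, key name, and pre-restore revision below are assumptions for illustration, not the actual test code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed endpoint of the restored cluster
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Guarantee 1: the revision must not decrease. After a restore with a big
	// enough --bump-revision, the current revision is at least the last
	// revision observed before the disaster.
	const lastObservedRev = 1000 // assumed last revision seen before the restore
	resp, err := cli.Get(ctx, "foo")
	if err != nil {
		panic(err)
	}
	if resp.Header.Revision < lastObservedRev {
		panic(fmt.Sprintf("revision went backwards: %d < %d", resp.Header.Revision, lastObservedRev))
	}

	// Guarantee 2: revisions from between the backup and the restore are
	// inaccessible. Reading at such a revision must fail with ErrCompacted.
	if _, err := cli.Get(ctx, "foo", clientv3.WithRev(lastObservedRev)); !errors.Is(err, rpctypes.ErrCompacted) {
		panic(fmt.Sprintf("expected ErrCompacted for a pre-restore revision, got: %v", err))
	}
	fmt.Println("restore upholds the revision guarantees")
}
```

The same checks can of course be expressed with etcd's e2e test framework; the point is only that both the monotonic-revision and the compacted-history guarantees are observable through the ordinary client API.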

Why is this needed?

With zero official guidance, each vendor/administrator has had a different plan for etcd restore, many of which were broken or incorrect. Having an official, guided restore operation allows the whole community to work together and share their experience, leading to improvements in Kubernetes disaster recovery handling.

@serathius
Member Author

cc @ptabor @ahrtr

@serathius serathius changed the title Restoring etcd in client friendly way, aka etcd restore for Kubernetes Restoring etcd without breaking API guarantees, aka etcd restore for Kubernetes Jun 7, 2023
@ahrtr
Member

ahrtr commented Jun 8, 2023

Offline compaction

Offline compaction makes sense to me. It's useful to perform compaction + defragmentation before we restore a snapshot. We may need to handle it from both the etcd and bbolt perspectives. Let me take care of it. But it may take some time, because I will be away on personal matters starting next Tuesday and for 1~2 weeks after that.

I propose adding a separate command, etcdutl compact [flags], similar to the existing command etcdutl defrag [flags]. Let's not change the existing etcdctl snapshot ... commands.

Offline bumping revision

I am not against offline bumping revision, but please see my concern/comment in #15381 (comment). Would like to get more feedback before we move on.

@tjungblu
Contributor

tjungblu commented Jun 8, 2023

I think we already have etcdctl compaction <rev> that does that, albeit online.

Maybe before we go into the debate on what we need to implement for compaction: do we even need to compact?
I think what Marek meant is that we return the current revision as "compacted" to force-reset the informer caches, not that we do an actual compaction.

@serathius
Member Author

serathius commented Jun 8, 2023

Offline compaction

Offline compaction makes sense to me. It's useful to perform compaction + defragmentation before we restore a snapshot. We may need to handle it from both the etcd and bbolt perspectives. Let me take care of it. But it may take some time, because I will be away on personal matters starting next Tuesday and for 1~2 weeks after that.

As @tjungblu said, nowhere did I mention that I want to run offline compaction. I just want to mark revisions as compacted so they are inaccessible to clients. For large db files it's preferable to be able to do that without doing any actual compaction work.

Please don't agree to work on the task if what you actually mean is that you don't agree with the design and want to change it; say so instead.

I don't think you understood the design: we don't want to perform offline compaction, we don't care about offline defragmentation, we just want to make restore work. What I mean by "work" is for etcd to uphold https://etcd.io/docs/v3.6/learning/api_guarantees/, meaning the revision is unique and ordered.

@tjungblu
Contributor

tjungblu commented Jun 8, 2023

(moving your arguments over from #15381 (comment) so we don't split the discussion into too many places)

Usually it is risky for an offline tool to modify the db data

Agreed, what @dusk125 wrote so far is just adding a single bump transaction on top of the snapshot data:
https://github.com/etcd-io/etcd/pull/16029/files#diff-c0f9abda5049ad8050c02966bd327db7fcd27b6848a54deac44d78d5b003f954R347-R368

So we're neither modifying the snapshot you restore from (important for integrity checks), nor do we hack surgically around the binary files. I think Marek's approach is even a step further, by just adding "+x" on the next transaction that's executed - not sure this is enough. Maybe @dusk125 has some more input on what worked and what didn't, I believe we tried that before.

When you restore an etcd cluster, usually it means that you already ran into a disastrous failure. In other words, there is already downtime. But if it's a huge cluster (e.g. with hundreds of or thousands of nodes), it might be inconvenient to restart all operators and nodes (actually kubelet, kube-proxy and CNI?). But if it's just a small cluster with a couple of nodes, it might be fine to just restart them.

We need to differentiate on what's down: quorum loss definitely means the k8s API is not writable and no linearizable reads are possible. Workloads on the worker nodes can still continue happily on their own. Operators/controllers (also consider third-party stuff you installed with a helm chart) might be impacted, depending on what they're doing with the API.

Restarting definitely works functionally, but I think there is a window of time between restoring to old_rev - 1000 and the operators catching up again that could be disastrous. Imagine you add a thousand revs again very quickly; then your informer cache gets a totally different state of the world plus some new state. David called this "alternate history", which I find a good term for it.

But for stale data, it means the snapshot isn't up to date, and there is some data loss... what's the potential impact?

Whatever we bring up from a restored snapshot must be pushed far enough into the future that we don't create that alternate history. There must be data loss by definition, but that should be reconciled correctly by the operators (:crossed_fingers:).

@ahrtr
Member

ahrtr commented Jun 8, 2023

I think what Marek meant, is that we return the current revision as "compacted"

Could you please elaborate on this?

@dusk125
Contributor

dusk125 commented Jun 8, 2023

So we're neither modifying the snapshot you restore from (important for integrity checks), nor do we hack surgically around the binary files. I think Marek's approach is even a step further, by just adding "+x" on the next transaction that's executed - not sure this is enough. Maybe @dusk125 has some more input on what worked and what didn't, I believe we tried that before.

In the work I had been doing, I did not modify the snapshot, nor do I think we should modify it in whatever path we take. If, for whatever reason, the restore fails, the user could remove the proposed flags, and go through the normal restore and restart their workload; we wouldn't be able to do this if the snapshot file has been changed.

My testing was focused on getting a watcher to resync automatically as if the key it's watching had changed, without it actually needing to change. The watcher would reconnect to the restored etcd and get a notification that the value was modified, and thus get its "new" value and hopefully continue operating as normal.

If the revision you're watching is shown as compacted, then the watch fails. If the watcher contract is to invalidate and re-watch when a watch fails due to a compacted revision, then I think @serathius's suggestion of marking everything as compacted after restore (not touching the snapshot file) and adding a single dummy revision on top of everything could work; I can start testing this out.
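
A minimal clientv3 sketch of that contract, with the endpoint, key name, and remembered revision as assumptions for illustration: the watcher starts from a revision it saw before the restore, gets canceled with a compact revision, re-reads the current state, and would then re-watch from the fresh revision.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed endpoint of the restored cluster
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	staleRev := int64(42) // assumed revision remembered from before the restore

	for wresp := range cli.Watch(ctx, "foo", clientv3.WithRev(staleRev)) {
		if wresp.CompactRevision != 0 {
			// The watch was canceled because staleRev is marked as compacted by
			// the restore. Invalidate the local cache: re-read the current state
			// and restart the watch right after the fresh read.
			fmt.Printf("watch canceled (%v), compacted at revision %d\n", wresp.Err(), wresp.CompactRevision)
			get, err := cli.Get(ctx, "foo")
			if err != nil {
				panic(err)
			}
			fmt.Printf("re-listed %q at revision %d; would re-watch from %d\n",
				"foo", get.Header.Revision, get.Header.Revision+1)
			return // a real client would loop back into a new Watch here
		}
		for _, ev := range wresp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```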

@chaochn47
Member

Hi @serathius @dusk125 @tjungblu, this is a great proposal. Could you please share how invalidating the client cache (operator/controller) when a watch fails on a compacted, higher revision will work, or how it works conceptually as of today? I think that's the fundamental assumption (theory) we make that drives this design.

It should clear some confusion in this thread.

(or should we move this conversation to k/k issue?)

@tjungblu
Contributor

tjungblu commented Jun 8, 2023

Thanks for all the discussion in the meantime. I think I'm also a little further along in my understanding, and I'll try to summarize it in more detailed pieces.

Let's look at this from a K8s API point of view first. We want to ensure all operators/controllers consuming the watch/list API receive a signal to invalidate their caches. Working backwards from the official docs, we can find that a 410 should be returned to signal a fresh listing from the start.

Which, in turn, is always triggered by etcd returning etcdrpc.ErrCompacted when using etcd watches.
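
For completeness, a hedged client-go sketch of how an operator built on a raw watch would honor that 410; shared informers/reflectors do this internally, and the clientset, namespace, and resource-version handling here are assumptions for illustration:

```go
package restorewatch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// WatchPods watches pods and re-lists whenever the watch is closed with
// 410 Gone (reason "Expired"), which is what the apiserver returns when etcd
// reports ErrCompacted for the requested resource version.
func WatchPods(ctx context.Context, client kubernetes.Interface, ns, rv string) error {
	for {
		w, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{ResourceVersion: rv})
		if err != nil {
			return err
		}
		for ev := range w.ResultChan() {
			if ev.Type == watch.Error {
				if st, ok := ev.Object.(*metav1.Status); ok && st.Reason == metav1.StatusReasonExpired {
					// 410 Gone: our resource version no longer exists (e.g. it was
					// marked compacted by an etcd restore). Drop the cache and
					// re-list from scratch.
					list, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
					if err != nil {
						w.Stop()
						return err
					}
					rv = list.ResourceVersion
					fmt.Println("cache invalidated, re-listed at resource version", rv)
					break // restart the watch from the fresh resource version
				}
				w.Stop()
				return fmt.Errorf("watch error: %v", ev.Object)
			}
			// Added/Modified/Deleted events would update the local cache here.
		}
		w.Stop()
	}
}
```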


From the etcd point of view, we want to ensure that we have a high enough revision to start from after restoring from a snapshot, before we serve any data to the outside world. Yet we also need to ensure that we don't serve any history between the backup revision (BR) and the restored revision (NR, basically BR + a big enough integer). Conveniently, this is exactly what compaction does.

For that, do we really need to run a full compaction? Or is it enough to mark a revision range as already compacted?

I think there are multiple approaches here; one that's hacky would be to use UnsafeSetScheduledCompact to bump the scheduledCompactRev schema key. Not sure if this will already do the trick; @dusk125 wanted to check that out next. Going through the code, there's also an opportunity to change the compactMainRev in mvcc's kvstore.go.

Rewinding a little: how do we establish the revision in the snapshot?

Similar to Allen's approach, we discussed O(n) looping through each key in the snapshot to establish which revision is the highest. If you look into kvstore.go, it already does that. I think there were some reservations about the performance impact of its linear runtime.

Since creating a snapshot already requires iteration, we would also propose caching this information in the snapshot proto as additional/trailing metadata. When contained in a snapshot, it would allow us to skip iterating the whole key space when restoring.
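
For context, this is roughly what that linear scan looks like directly against the restored bbolt file; the bucket name, key encoding, and db path below are assumptions based on etcd's current on-disk schema rather than a stable API:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the restored database read-only.
	db, err := bolt.Open("/var/lib/etcd/member/snap/db", 0o400, &bolt.Options{ReadOnly: true}) // assumed path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var highest uint64
	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key")) // etcd's key bucket
		if b == nil {
			return fmt.Errorf("key bucket not found")
		}
		// Each key starts with an 8-byte big-endian main revision, followed by
		// '_' and an 8-byte sub revision (tombstones carry an extra marker).
		// Scan them all and keep the largest main revision.
		return b.ForEach(func(k, _ []byte) error {
			if len(k) >= 8 {
				if main := binary.BigEndian.Uint64(k[:8]); main > highest {
					highest = main
				}
			}
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("highest revision in snapshot:", highest)
}
```

(With this big-endian encoding the last key in the bucket already carries the highest main revision, so a cursor seek could presumably avoid the full pass; the sketch mirrors the O(n) loop described above for clarity.)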

@dusk125
Contributor

dusk125 commented Jun 8, 2023

I would be happy to take this issue on; can you assign it to me, please?

@serathius
Member Author

@tjungblu @dusk125 Thanks for jumping on a call and writing down our discussion. One thing I noticed during our discussion was that my proposal didn't handle this correctly, so a client could still see the revision fluctuate during a restore, e.g. 100 -> 110 -> 100.

@ahrtr
Member

ahrtr commented Jun 9, 2023

Thanks all for the explanation, but it seems that we are still playing with a not-well-explained concept, "mark a revision range as already compacted". etcd supports compaction, and it returns ErrCompacted if the requested revision is older than the latest compacted revision, or ErrFutureRev if the requested revision is bigger than the current revision. I have never seen "mark a revision range as already compacted" before, nor a clear explanation of it so far.

ensure that we don't serve any history between the backup revision (BR) and restored revision (NR, basically BR + a big enough integer).

Yes, indeed there might be a gap in between. If users request a revision in the gap, they get nothing. Users may be confused because etcd's revisions are supposed to be continuous. So far I do not see any feasible and good approach other than a complete offline compaction, just as I mentioned in #16028 (comment).

@tjungblu
Contributor

tjungblu commented Jun 9, 2023

Users may be confused because etcd's revisions are supposed to be continuous.

Fair point, I think this is just not practically possible because we don't know what the very latest revision was in a disaster recovery scenario. If all you have is a backup snapshot from two days ago and everything else is somehow lost, how do you find out the most recently served revision?

So far I do not see any feasible & good approach other than a complete offline compaction just as I mentioned in #16028 (comment)

We should settle on some non-functional requirements for a restoration. Marek made a good point about the process being as fast and reliable as possible to avoid spooking admins/SREs during an already stressful task. Waiting for a full offline compaction, which might also fail, could take quite some time on big snapshots.

In my experience, starting a big etcd from disk (let's say close to our 8 GB limit) until it's ready to serve the first byte of client requests (TTFB, if you want to call it that) takes on the order of several minutes. Does anyone have reliable numbers from their cloud fleet they can share? Sadly, I don't have this metric in our telemetry.

@ahrtr
Member

ahrtr commented Jun 9, 2023

Just had a quick discussion with @serathius .

There are two things.

  1. Bumping revision. We are on the same page. The purpose is to make sure the revision will never decrease.
  2. Compaction. Two options
    • @serathius's suggestion: "marking old revisions as compacted". It's actually just setting the scheduledCompactRev field in the Meta bucket. Since finishedCompactRev != scheduledCompactRev, etcd treats the previous compaction as unfinished; when it starts, it will try to finish the compaction operation (so there are no performance differences between the two approaches). It's the minimal change to etcdutl, and zero change to etcdserver. The only concern is that it's a little hacky; we need to test it carefully. Let's follow this approach for now. A rough sketch follows this list.
    • @ahrtr's suggestion: offline compaction, just as I mentioned in #16028 (comment). We already have offline defrag; it's natural to enhance etcdutl to support offline compaction. We can discuss this separately. FYI, we might get rid of defrag completely after "Support customizing the rebalance threshold" (bbolt#422) is resolved, but we will always need offline compaction.
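
A rough offline sketch of option 1, assuming the release-3.5 package layout (go.etcd.io/etcd/server/v3/mvcc and its backend package); the db path and revision value are placeholders, and the actual change wires this into etcdutl snapshot restore behind --mark-compacted rather than shipping a standalone tool:

```go
package main

import (
	"log"

	"go.etcd.io/etcd/server/v3/mvcc"
	"go.etcd.io/etcd/server/v3/mvcc/backend"
)

func main() {
	const (
		dbPath     = "/var/lib/etcd/member/snap/db" // assumed path to the restored db
		compactRev = int64(1000004)                 // assumed bumped revision to mark as compacted
	)

	be := backend.NewDefaultBackend(dbPath)
	defer be.Close()

	// Record compactRev as the scheduled compaction. Because no matching
	// finishedCompactRev is written, etcd sees an unfinished compaction at
	// startup, resumes it, and afterwards rejects reads and watches below this
	// revision with ErrCompacted.
	tx := be.BatchTx()
	tx.Lock()
	mvcc.UnsafeSetScheduledCompact(tx, compactRev)
	tx.Unlock()
	be.ForceCommit()

	log.Printf("marked revisions up to %d as compacted", compactRev)
}
```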

@dusk125
Contributor

dusk125 commented Jun 12, 2023

I've been able to verify that marking a revision as compacted (mvcc.UnsafeSetScheduledCompact(tx, revision)) does cause the server to think that it has a compaction in progress, which it will finish sometime on startup after the restore, and it causes watchers to get canceled with the "required revision has been compacted" error.

The following is from running the server after restoring with my most recent changes in my draft PR, using the --bump-revision 1000000 --mark-compacted arguments:

{"caller":"mvcc/kvstore.go:385","msg":"kvstore restored","current-rev":1000004}
{"caller":"mvcc/kvstore.go:392","msg":"resume scheduled compaction","scheduled-compact-revision":1000004}
{"caller":"mvcc/index.go:194","msg":"compact tree index","revision":1000004}
{"caller":"mvcc/kvstore_compaction.go:69","msg":"finished scheduled compaction","compact-revision":1000004,"took":"3.403894ms","hash":1818873421}

@chaochn47
Member

chaochn47 commented Jun 30, 2023

From an etcd cluster administrator's perspective, I would like the etcdctl snapshot restore flag to be named desired-compacted-revision.

I don't want to calculate by how much I should bump the revision since the last etcd snapshot; I just want to supply a desired resource version to be compacted, based on observations in the apiserver logs.

As a mitigation, suffixing the snapshot db file name with the revision and doing the calculation afterwards also works. It's just not as convenient as supplying the desired value directly.

@ahrtr
Member

ahrtr commented Jul 6, 2023

I don't want to calculate by how much I should bump the revision since the last etcd snapshot; I just want to supply a desired resource version to be compacted, based on observations in the apiserver logs.

It is a good point. The concern is that you might not always know the desired target revision, but you are always able to provide a rough, big enough bump. As long as we enhance the etcdutl tool to report the current highest revision, we can easily calculate the bump (desired revision - current highest revision). It doesn't seem like a big problem.

@ahrtr
Member

ahrtr commented Jul 20, 2023

Has this task been completed?

@serathius
Member Author

Yes, the remaining work should be on the Kubernetes side.

@dusk125 @tjungblu would you be interested in finishing kubernetes/kubernetes#118501 ?
