Backport #8009 into 3.0? #8253

Closed
lavalamp opened this issue Jul 12, 2017 · 8 comments

@lavalamp

It would be super nice to have the fix for #8009 backported into 3.0.x.

@wojtek-t @mml @jpbetz

@heyitsanthony
Contributor

heyitsanthony commented Jul 12, 2017

Unlikely to happen. Since the fix doesn't sync down the free list, booting into the patched etcd 3.0 then booting into an older version of 3.0 will leak free pages at best / corrupt the db at worst.
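
The two failure modes named above can be illustrated with a toy model. This is a sketch under stated assumptions, not bbolt's actual on-disk format or code: the `db` type, the `classify` function, and the page numbers are all hypothetical. It assumes a patched binary changes page usage without rewriting the free list on disk, and that an older binary then trusts that stale list.

```go
package main

import "fmt"

// db is a hypothetical toy database: which pages currently hold live data,
// and the free list as it was last written to disk.
type db struct {
	live          map[int]bool
	persistedFree map[int]bool
}

// classify compares the persisted free list with actual page usage.
// leaked:      pages that are unused but not recorded as free (never reused).
// clobberable: pages that hold live data but are recorded as free
//              (a binary trusting the list could overwrite them).
func classify(d db, totalPages int) (leaked, clobberable []int) {
	for p := 0; p < totalPages; p++ {
		switch {
		case !d.live[p] && !d.persistedFree[p]:
			leaked = append(leaked, p)
		case d.live[p] && d.persistedFree[p]:
			clobberable = append(clobberable, p)
		}
	}
	return leaked, clobberable
}

func main() {
	// State as last synced by an older binary: pages 0-2 live, pages 3-5 free.
	d := db{
		live:          map[int]bool{0: true, 1: true, 2: true},
		persistedFree: map[int]bool{3: true, 4: true, 5: true},
	}

	// Assumed behavior of a patched binary: it frees page 1 and reuses page 3,
	// but does not write the updated free list back to disk.
	delete(d.live, 1)
	d.live[3] = true

	leaked, clobberable := classify(d, 6)
	fmt.Println("leaked pages:", leaked)
	fmt.Println("pages at risk of being overwritten:", clobberable)
}
```

In this model, page 1 is the "leak" case and page 3 is the "corrupt" case once an older binary reopens the file.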

@mml

mml commented Jul 13, 2017

What does this imply for our ability to roll back from a fixed 3.1.x to a broken 3.0.x?

@heyitsanthony
Contributor

@mml 3.3.0 will have the patch and proper rollback support. Issuing a rollback would restore the backend to the slow mode with free lists before reverting the cluster version to 3.2.

@jpbetz
Contributor

jpbetz commented Jul 18, 2017

@heyitsanthony If we only backported "Garbage collect pages allocated after minimum txid" (etcd-io/bbolt#3) to etcd 3.0.x and 3.1.x, we might be able to resolve the most pressing issues with #8009 without introducing any rollback or backward compatibility issues in the way freelists are persisted.

We would either need to get boltdb/bolt#694 merged or build a version of bbolt that contains etcd-io/bbolt#3 but not etcd-io/bbolt#1. After that, etcd would just pick up the new versions for the minor release of 3.0.x and 3.1.x.

Does this sound reasonable? I'll be available to contribute.

@xiang90
Contributor

xiang90 commented Jul 18, 2017

> We might be able to resolve the most pressing issues

We tried it, and it will not solve what #8009 hit. A sudden release of free pages due to compaction (following a spike in page usage) will still trigger the problem unless we stop syncing the free pages.

etcd-io/bbolt#3 helps more with reducing page usage when read and write txns run concurrently.
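
A rough back-of-the-envelope sketch of why a large free list is costly to sync: if the free list is rewritten on every commit, the bytes written grow linearly with the number of free pages. The constants below are assumptions for illustration (8-byte page IDs, a 16-byte page header, 4096-byte pages), and `freelistBytes` and `freelistPages` are hypothetical helper names, not bbolt functions.

```go
package main

import "fmt"

const (
	pageIDSize     = 8    // assumed: each free page ID is stored as a 64-bit integer
	pageHeaderSize = 16   // assumed: fixed header in front of the ID array
	pageSize       = 4096 // assumed: common OS page size
)

// freelistBytes estimates the serialized size of a free list holding n page IDs.
func freelistBytes(n int) int {
	return pageHeaderSize + n*pageIDSize
}

// freelistPages rounds that size up to whole pages.
func freelistPages(n int) int {
	return (freelistBytes(n) + pageSize - 1) / pageSize
}

func main() {
	// Compare a small free list with the kind of spike a large compaction could leave behind.
	for _, n := range []int{1000, 100000, 10000000} {
		fmt.Printf("%d free pages -> about %d bytes (%d pages) per free list write\n",
			n, freelistBytes(n), freelistPages(n))
	}
}
```

Under these assumptions, ten million free pages mean roughly 80 MB of free list, which is the kind of cost that goes away only when the free list is no longer synced.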

@jpbetz
Contributor

jpbetz commented Jul 18, 2017

Thanks @xiang90. How did you try it? Do you have a way to replicate #8009?

@abel-von

@xiang90 We are also considering backporting the PRs to 3.1.9 to fix the "database space exceed" issue. Is there any way to do this?

@xiang90
Contributor

xiang90 commented Sep 28, 2017

The backport policy is documented here: https://github.com/coreos/etcd/blob/master/Documentation/branch_management.md

In theory we could backport patches to more than one minor release, but given the people we have today, it is not feasible. I am closing this.

@xiang90 xiang90 closed this as completed Sep 28, 2017