
rethink about log compaction #7162

Closed
xiang90 opened this issue Jan 15, 2017 · 4 comments
Comments

@xiang90
Contributor

xiang90 commented Jan 15, 2017

Right now we compact the raft log every 100,000 entries.

So we keep at most 100,000 entries in memory.

Keeping more entries in memory is good for fast follower recovery. If a follower dies and restarts while lagging the leader by fewer than 100,000 entries, the leader can catch it up by sending entries, without triggering a snapshot send. Sending a snapshot is usually more expensive than sending entries.

However, a fixed count of 100,000 can be dangerous and cause OOM. We assume each entry is around 1KB, so 100,000 entries would only be about 100MB. But the maximum entry size is 1MB; in that case, 100,000 entries cost 100GB.

I propose that we also take entry size into consideration when deciding when to compact.
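A minimal sketch of what a size-aware trigger could look like. The names and thresholds below are made up for illustration and are not etcd's actual snapshot/compaction configuration:

```go
package main

import "fmt"

// Illustrative thresholds only; not etcd's actual configuration knobs.
const (
	maxInMemoryEntries = 100_000           // current fixed count trigger
	maxInMemoryBytes   = 100 * 1024 * 1024 // proposed additional ~100MB size cap
)

// shouldCompact is a hypothetical trigger that fires when either the entry
// count or the total entry size exceeds its limit, so a run of large
// (up to 1MB) entries compacts long before 100,000 of them accumulate.
func shouldCompact(entryCount, totalBytes int) bool {
	return entryCount >= maxInMemoryEntries || totalBytes >= maxInMemoryBytes
}

func main() {
	// 200 entries of 1MB each already exceed the 100MB size cap.
	fmt.Println(shouldCompact(200, 200*1024*1024)) // true
	// 50,000 entries of ~1KB each stay under both limits.
	fmt.Println(shouldCompact(50_000, 50*1024*1024)) // false
}
```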

@mitake
Contributor

mitake commented Jan 16, 2017

This seems interesting and important. logcabin configures its trigger based on both the size and the number of entries not yet covered by a snapshot (although a snapshot is taken only when all of the conditions are satisfied): https://github.com/logcabin/logcabin/blob/master/Server/StateMachine.cc#L593

I also think that even when the number of entries is small, replaying them on a revived follower can take a long time if they contain many puts. Because of the parallelism-unfriendly (nondeterminism-unfriendly) nature of state machine replication, even replay cannot exploit multiple cores. Maybe

  • more accurate estimation of replay cost based on various parameters (e.g. size, the number of puts in a Txn, etc.; a rough sketch follows below)
  • parallelizing replay by analyzing the dependency relations between entries (this won't be easy at all)

would be helpful for the stable operation of an etcd cluster and increase its availability?
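A rough, purely illustrative sketch of the first idea, a parameterized replay-cost estimate. The entry shape and the weights are assumptions, not anything logcabin or etcd actually implements:

```go
package main

import "fmt"

// raftEntry is a simplified stand-in for a log entry; the fields and the
// cost weights below are assumptions made for illustration only.
type raftEntry struct {
	sizeBytes int
	numPuts   int // number of put operations carried in the entry's Txn
}

const (
	costPerByte = 1    // relative cost of decoding/applying one byte
	costPerPut  = 4096 // relative cost of one put against the backend
)

// estimateReplayCost weights each entry by its size and by the number of
// puts it carries, so a compaction (snapshot) trigger could fire once the
// accumulated replay cost, rather than the raw entry count, gets too high.
func estimateReplayCost(entries []raftEntry) int {
	cost := 0
	for _, e := range entries {
		cost += e.sizeBytes*costPerByte + e.numPuts*costPerPut
	}
	return cost
}

func main() {
	small := []raftEntry{{sizeBytes: 1024, numPuts: 1}}
	bigTxn := []raftEntry{{sizeBytes: 1 << 20, numPuts: 128}}
	fmt.Println(estimateReplayCost(small))  // cheap to replay
	fmt.Println(estimateReplayCost(bigTxn)) // far more expensive, despite being a single entry
}
```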

@heyitsanthony heyitsanthony added this to the unplanned milestone Jan 17, 2017
@xiang90 xiang90 self-assigned this Jan 31, 2017
@xiang90
Contributor Author

xiang90 commented Jan 31, 2017

@mitake I assigned this to both you and me. I assume you are interested in this one :)

@mitake
Contributor

mitake commented Feb 1, 2017

@xiang90 sure, of course. Thanks!

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

Development

No branches or pull requests

4 participants