
core/rawdb: fsync the index file after each freezer write #28483

Merged
merged 2 commits into from
Nov 10, 2023

Conversation

rjl493456442
Member

@rjl493456442 rjl493456442 commented Nov 8, 2023

This pull request tries to fix an issue in the freezer, raised in #28105.


Let's briefly recap the issue: if a power failure occurs while Geth is running with the path state scheme, it may not be able to recover on the subsequent restart.

After investigating a bit further, I discovered that the underlying state freezer is corrupted. Specifically, the freezer's index files are corrupted: they contain index entries with the value {file_num = 0; offset = 0} at the end.

Upon opening the freezer table, a repair process is initiated which truncates the excess bytes in the data file. However, due to the "zero" index entry at the end, the freezer table truncates the entire data file while retaining the complete index file.

As a result, the freezer table ends up in a very weird state: the item indexes are present, but their corresponding data is missing.
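For illustration, here is a minimal Go sketch of what such an index entry looks like and why a zeroed tail misleads the repair logic. The field names and the 6-byte layout are assumptions modeled on the {file_num, offset} description above, not the freezer's actual code:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// indexEntry mirrors the index format described above: a 2-byte data-file
// number followed by a 4-byte end offset into that file (illustrative only).
type indexEntry struct {
	filenum uint16 // data file the item lives in
	offset  uint32 // end offset of the item within that file
}

func unmarshal(b []byte) indexEntry {
	return indexEntry{
		filenum: binary.BigEndian.Uint16(b[0:2]),
		offset:  binary.BigEndian.Uint32(b[2:6]),
	}
}

func main() {
	// A power failure can leave the index file extended with zeroed bytes,
	// so the final entry decodes to {filenum: 0, offset: 0}.
	garbage := make([]byte, 6)
	last := unmarshal(garbage)
	fmt.Printf("trailing entry: %+v\n", last)

	// A repair that truncates the data file back to the offset of the last
	// entry would truncate it to zero, while the index still lists all items.
}
```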


The remaining question is: why does the index file contain zero entries at the end after a power failure? It turns out this is related to how the OS manages file writes.

According to the Linux [write(2) manual](https://man7.org/linux/man-pages/man2/write.2.html), the file size is extended when we write items at the end of the file. In other words, the file offset is moved even before the associated data is flushed from the OS cache to the underlying disk.

If the file was [open(2)](https://man7.org/linux/man-pages/man2/open.2.html)ed with O_APPEND, the
file offset is first set to the end of the file before writing.
The adjustment of the file offset and the write operation are
performed as an atomic step.

This feature ensures that concurrent writers to the same file won't overwrite each other, by advancing the file offset atomically.
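To make the failure mode concrete, here is a minimal Go sketch of an append-mode write (the file name is hypothetical). The point is that Write only places the bytes in the OS page cache; the enlarged file size can reach disk before the data itself does:

```go
package main

import (
	"log"
	"os"
)

func main() {
	// O_APPEND: every Write atomically moves the offset to the end of the
	// file and extends it, as described in the man page excerpt above.
	f, err := os.OpenFile("table.cidx", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Write returns as soon as the bytes are in the OS cache. If the power
	// fails now, the file may come back with its new, larger size but with
	// zeroes where the entry should be -- exactly the corruption observed.
	if _, err := f.Write(make([]byte, 6)); err != nil {
		log.Fatal(err)
	}
}
```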

Among the APIs subsequently listed are write() and [writev(2)](https://man7.org/linux/man-pages/man2/writev.2.html).
And among the effects that should be atomic across threads (and
processes) are updates of the file offset.  However, before Linux
3.14, this was not the case: if two processes that share an open
file description (see [open(2)](https://man7.org/linux/man-pages/man2/open.2.html)) perform a write() (or [writev(2)](https://man7.org/linux/man-pages/man2/writev.2.html))
at the same time, then the I/O operations were not atomic with
respect to updating the file offset, with the result that the
blocks of data output by the two processes might (incorrectly)
overlap.  This problem was fixed in Linux 3.14.

The SQLite documentation also points out this issue, in section 6.2:

When data is appended to the end of the rollback journal, SQLite normally makes the pessimistic assumption
that the file is first extended with invalid "garbage" data and that afterwards the correct data replaces the garbage.

In other words, SQLite assumes that the file size is increased first and then afterwards the content is written 
into the file. If a power failure occurs after the file size has been increased but before the file content has been
written, the rollback journal can be left containing garbage data. If after power is restored, another SQLite process 
sees the rollback journal containing the garbage data and tries to roll it back into the original database file, 
it might copy some of the garbage into the database file and thus corrupt the database file.

We can conclude that the issue stems from asynchronous file writes.


How can we fix the issue so that the freezer survives a power failure?

Detect the garbage index data and truncate it in the repair stage

A straightforward idea is to blindly truncate the index items with {0, 0} values at the end of the file. However, because the freezer may store zero-size items, {0, 0} can be a valid index entry if the item itself and all preceding items are zero-sized.

So the open question is: how can we handle this special scenario and still correctly detect the garbage data?
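To make the pitfall concrete, here is a small Go sketch of the naive repair; the type and helper are hypothetical, for illustration only:

```go
package main

import "fmt"

// indexEntry as sketched above: data-file number plus end offset.
type indexEntry struct {
	filenum uint16
	offset  uint32
}

// repairNaive drops trailing {0, 0} entries, treating them as garbage left
// behind by a power failure (hypothetical helper, illustration only).
func repairNaive(index []indexEntry) []indexEntry {
	end := len(index)
	for end > 0 && index[end-1] == (indexEntry{}) {
		end--
	}
	return index[:end]
}

func main() {
	// Flaw: a freezer table may hold zero-sized items. If every item so far
	// is zero-sized, the last *valid* entry is also {0, 0}, and this repair
	// would wrongly discard it.
	valid := []indexEntry{{0, 0}, {0, 0}} // two legitimate zero-size items
	fmt.Println(len(repairNaive(valid)))  // prints 0: valid entries dropped
}
```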

Sync the file after each freezer write

We can minimize the possibility of having "garbage" zero-value index items by syncing the index file after each write.
Once the index file is fsync'd, the stored index items are guaranteed to be complete. But what if a power failure occurs
between the file.Write() and the file.Sync()?

SQLite addresses this with a separate marker that tracks the item count, using a two-step sync approach:
it first flushes the journal file, and only then updates the marker. With this approach, the marker indicates
how many index items to expect, and everything beyond it can be truncated.

This approach is a bit over-complicated and might degrade performance.
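The route taken by this pull request is the simpler one: fsync the index file right after each write. A minimal Go sketch of the idea (the helper and file name are hypothetical, not the PR's actual code):

```go
package main

import (
	"log"
	"os"
)

// appendIndexEntry appends one 6-byte index entry and immediately fsyncs the
// file, so a power failure cannot leave a half-persisted tail behind.
func appendIndexEntry(f *os.File, entry []byte) error {
	if _, err := f.Write(entry); err != nil {
		return err
	}
	// Sync forces the appended bytes (and metadata) to stable storage before
	// the write is considered complete.
	return f.Sync()
}

func main() {
	f, err := os.OpenFile("table.cidx", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := appendIndexEntry(f, make([]byte, 6)); err != nil {
		log.Fatal(err)
	}
}
```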

@rjl493456442
Member Author

[Screenshot 2023-11-09 4:52:02 PM]

The additional file sync operation indeed takes more time, roughly 0.5ms, which is acceptable.

[Screenshot 2023-11-09 4:52:53 PM]

Regarding overall performance, this pull request is even faster than the master branch (no idea why).

@rjl493456442 rjl493456442 changed the title core/rawdb: fsync the index and data file after each freezer write core/rawdb: fsync the index file after each freezer write Nov 9, 2023
@holiman
Contributor

holiman commented Nov 9, 2023

triage discussion: let's benchmark fsyncing the head file too, since otherwise we might end up with corrupted data (all zeroes) in the actual data content, even if the index is correct.

@rjl493456442
Member Author

I ran the benchmark again, syncing both the data file and the index file.

Snap sync

[Screenshot 2023-11-10 9:51:08 AM]

This pull request is ~20 minutes faster than master. Many factors can affect snap sync performance, but at least we can conclude that this pull request won't degrade performance significantly.

Full Sync

I swapped the machines this time, deploying the PR on machine 8 and master on machine 7; the switch time was 19:06. We can see that machine 08 starts to catch up with better performance.
[Screenshot 2023-11-10 9:54:45 AM]

Obviously, the EVM execution is faster with this pull request.
[Screenshot 2023-11-10 9:55:42 AM]

It's also true that the PR spends an extra ~1ms building the state history, which is expected; the overhead is entirely acceptable.
[Screenshot 2023-11-10 9:56:49 AM]

All in all, we can conclude that this pull request won't affect performance and is safe to merge.

Contributor

@holiman holiman left a comment


LGTM 💯

@holiman holiman added this to the 1.13.5 milestone Nov 10, 2023
@karalabe karalabe merged commit 326fa00 into ethereum:master Nov 10, 2023
1 of 2 checks passed
devopsbo3 pushed a commit to HorizenOfficial/go-ethereum that referenced this pull request Nov 10, 2023
…8483)

* core/rawdb: fsync the index and data file after each freezer write

* core/rawdb: fsync the data file in freezer after write
devopsbo3 added a commit to HorizenOfficial/go-ethereum that referenced this pull request Nov 10, 2023
devopsbo3 added a commit to HorizenOfficial/go-ethereum that referenced this pull request Nov 10, 2023
Dergarcon pushed a commit to specialmechanisms/mev-geth-0x2mev that referenced this pull request Jan 31, 2024
…8483)

* core/rawdb: fsync the index and data file after each freezer write

* core/rawdb: fsync the data file in freezer after write
maoueh pushed a commit to streamingfast/go-ethereum that referenced this pull request Feb 23, 2024
…8483)

* core/rawdb: fsync the index and data file after each freezer write

* core/rawdb: fsync the data file in freezer after write