-
Notifications
You must be signed in to change notification settings - Fork 137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
delay storage operations for an extent that is transitioning from uninited to initialized state #1578
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
…nited to initialized state When a file extent that is marked as uninited is written to, all sectors corresponding to the extent are written, including both the data from the write request and any zeroed area(s) surrounding it. This presents a problem when subsequent storage requests are issued on behalf of the extent. Such requests may be made without first waiting for completion of the initializing write. This property of the pagecache - that requests are issued without first waiting for completions for prior requests - is typically not an issue, for read data is cached and not re-read after a write, and writes are posted from the same pagecache buffer, making the order of write completions irrelevant. But the areas of an extent outside of the initial write request, which are written with zeros, could be subsequently read into the cache (or pages overwritten as a whole) prior to the zero writes completing. This has been found to lead to spurious inconsistent data being read - namely, areas that should be zero are read and found to contain nonzero data. To ameliorate this condition, the uninited flag has been changed into an optional record that becomes allocated when a write is first posted to an uninited extent. When subsequent storage requests are made to the extent before the initializing writes complete, they are added to a queue within the uninited record. On completion of the initializing writes, the queued requests are then dispatched, and the uninited state is set to 'initialized', indicating that further operations on the extent should occur normally, as if uninited was not set. (Given that extent references are not tracked and that the extents themselves may be deallocated at any time, the uninited completion never refers to the extent itself, and the record exists for the lifetime of the extent - thus the uninited 'initialized' state having the same semantics as a null uninited field.)
wjhun
force-pushed
the
fix-uninited-extents
branch
from
September 21, 2021 12:30
2cf0f1d
to
432bd88
Compare
francescolavra
added a commit
that referenced
this pull request
Apr 17, 2024
PR #1578 implemented deferred execution of storage operations for TFS file extents that are transitioning from uninited to initialized state. This creates a problem when a pagecache node I/O request spans multiple extents: if one extent is uninited and the next extent is initialized, (a portion of) the SG buffers destined to the first extent are written into the second extent, and vice versa. In order to properly assign SG buffers to extents, the SG list supplied in a given node I/O request must be consumed by each extent in the exact order in which the different extents cover the file range of the I/O request. If a storage operation for a given extent is deferred, this requirement may not be satisfied, which leads to corruption of file contents. This change fixes the above issue by removing deferred execution of storage operations: instead, read requests are executed with zero-filled buffers, and write requests are executed without delay (i.e. without waiting for extent initialization to complete) with the buffers supplied in the requests. This implementation relies on 2 assumptions: 1) disk drivers submit I/O requests to their peripherals in the same order as the requests are generated 2) if multiple write requests are ongoing simultaneously for a given address range, disk peripherals write data to the storage medium in the same order as the requests are submitted The second assumption may not hold in general, because a write operation on a large address range may complete after an operation on a smaller range even if the former has been submitted before the latter; but since the kernel issues write requests using buffers whose minimum size is the page size in the page cache, and the write operations with zero-filled buffers that are submitted when an uninited extent begins transitioning to the initialized state use this minimum size (see the `zero_blocks()` function in the TFS code), it is reasonable to assume that write operations that fill a given file extent with zeros are executed by the disk peripheral before any subsequent write requests for the same extent, even if these subsequent requests are submitted while the zero-filling writes are still ongoing.
rinor
pushed a commit
to rinor/nanos
that referenced
this pull request
Apr 18, 2024
PR nanovms#1578 implemented deferred execution of storage operations for TFS file extents that are transitioning from uninited to initialized state. This creates a problem when a pagecache node I/O request spans multiple extents: if one extent is uninited and the next extent is initialized, (a portion of) the SG buffers destined to the first extent are written into the second extent, and vice versa. In order to properly assign SG buffers to extents, the SG list supplied in a given node I/O request must be consumed by each extent in the exact order in which the different extents cover the file range of the I/O request. If a storage operation for a given extent is deferred, this requirement may not be satisfied, which leads to corruption of file contents. This change fixes the above issue by removing deferred execution of storage operations: instead, read requests are executed with zero-filled buffers, and write requests are executed without delay (i.e. without waiting for extent initialization to complete) with the buffers supplied in the requests. This implementation relies on 2 assumptions: 1) disk drivers submit I/O requests to their peripherals in the same order as the requests are generated 2) if multiple write requests are ongoing simultaneously for a given address range, disk peripherals write data to the storage medium in the same order as the requests are submitted The second assumption may not hold in general, because a write operation on a large address range may complete after an operation on a smaller range even if the former has been submitted before the latter; but since the kernel issues write requests using buffers whose minimum size is the page size in the page cache, and the write operations with zero-filled buffers that are submitted when an uninited extent begins transitioning to the initialized state use this minimum size (see the `zero_blocks()` function in the TFS code), it is reasonable to assume that write operations that fill a given file extent with zeros are executed by the disk peripheral before any subsequent write requests for the same extent, even if these subsequent requests are submitted while the zero-filling writes are still ongoing.
francescolavra
added a commit
that referenced
this pull request
Apr 22, 2024
PR #1578 implemented deferred execution of storage operations for TFS file extents that are transitioning from uninited to initialized state. This creates a problem when a pagecache node I/O request spans multiple extents: if one extent is uninited and the next extent is initialized, (a portion of) the SG buffers destined to the first extent are written into the second extent, and vice versa. In order to properly assign SG buffers to extents, the SG list supplied in a given node I/O request must be consumed by each extent in the exact order in which the different extents cover the file range of the I/O request. If a storage operation for a given extent is deferred, this requirement may not be satisfied, which leads to corruption of file contents. This change fixes the above issue by removing deferred execution of storage operations: instead, read requests are executed with zero-filled buffers, and write requests are executed without delay (i.e. without waiting for extent initialization to complete) with the buffers supplied in the requests. This implementation relies on 2 assumptions: 1) disk drivers submit I/O requests to their peripherals in the same order as the requests are generated 2) if multiple write requests are ongoing simultaneously for a given address range, disk peripherals write data to the storage medium in the same order as the requests are submitted The second assumption may not hold in general, because a write operation on a large address range may complete after an operation on a smaller range even if the former has been submitted before the latter; but since the kernel issues write requests using buffers whose minimum size is the page size in the page cache, and the write operations with zero-filled buffers that are submitted when an uninited extent begins transitioning to the initialized state use this minimum size (see the `zero_blocks()` function in the TFS code), it is reasonable to assume that write operations that fill a given file extent with zeros are executed by the disk peripheral before any subsequent write requests for the same extent, even if these subsequent requests are submitted while the zero-filling writes are still ongoing.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When a file extent that is marked as uninited is written to, all sectors
corresponding to the extent are written, including both the data from the
write request and any zeroed area(s) surrounding it. This presents a problem
when subsequent storage requests are issued on behalf of the extent. Such
requests may be made without first waiting for completion of the initializing
write. This property of the pagecache - that requests are issued without first
waiting for completions for prior requests - is typically not an issue, for
read data is cached and not re-read after a write, and writes are posted from
the same pagecache buffer, making the order of write completions
irrelevant. But the areas of an extent outside of the initial write request,
which are written with zeros, could be subsequently read into the cache (or
pages overwritten as a whole) prior to the zero writes completing. This has
been found to lead to spurious inconsistent data being read - namely, areas
that should be zero are read and found to contain nonzero data.
To ameliorate this condition, the uninited flag has been changed into an
optional record that becomes allocated when a write is first posted to an
uninited extent. When subsequent storage requests are made to the extent
before the initializing writes complete, they are added to a queue within the
uninited record. On completion of the initializing writes, the queued
requests are then dispatched, and the uninited state is set to 'initialized',
indicating that further operations on the extent should occur normally, as if
uninited was not set. (Given that extent references are not tracked and that
the extents themselves may be deallocated at any time, the uninited completion
never refers to the extent itself, and the record exists for the lifetime of
the extent - thus the uninited 'initialized' state having the same semantics
as a null uninited field.)
Resolves #1571