Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor File.ReadAllBytes* improvements #61519

Merged
merged 9 commits into from
Nov 17, 2021

Conversation

adamsitnik
Copy link
Member

@adamsitnik adamsitnik commented Nov 12, 2021

Changes:

  • switch from using FileStream to RandomAccess, reduces the allocations by +- 100 bytes
  • stop using FileOptions.SequentialScan as it requires an additional sys-call on Unix. Modern kernels are really good at recognizing file access patterns, we don't get anything from this hint. Reading small files (< 1kb) on Unix is 8% faster.
  • use Array.MaxLength instead int.MaxSize for determining the limit. It fixes an edge case bug where for files of size > Array.MaxLength but < int.MaxSize where it was throwing OOM instead IOException:
string path = "here.dat";
using (FileStream fs = File.Create(path))
{
    fs.SetLength(Array.MaxLength + 1L);
}

File.ReadAllBytes(path);
PS D:\projects\repros\maxArraySize> dotnet run -c Release
Out of memory.
  • simplify the source code, get rid of returningInternalTask-related logic
  • increase the test coverage for non-seekable files and for those which report 0 length despite not being empty by setting fileLength = 0 for DEBUG builds

…case bug for files: Array.MaxLength < Length < int.MaxValue
…an additional sys call on Unix and every OS is able to recognize the disk access patterns anyway
@adamsitnik adamsitnik added area-System.IO tenet-performance Performance related issue labels Nov 12, 2021
@adamsitnik adamsitnik added this to the 7.0.0 milestone Nov 12, 2021
@ghost
Copy link

ghost commented Nov 12, 2021

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Changes:

  • switch from using FileStream to RandomAccess, reduces the allocations by +- 100 bytes
  • stop using FileOptions.SequentialScan as it requires an additional sys-call on Unix. Modern kernels are really good at recognizing file access patterns, we don't get anything from this hint. Reading small files (< 1kb) on Unix is 8% faster.
  • use Array.MaxLength instead int.MaxSize for determining the limit. It fixes an edge case bug where for files of size > but < we would get OOM instead IOException:
string path = "here.dat";
using (FileStream fs = File.Create(path))
{
    fs.SetLength(Array.MaxLength + 1L);
}

File.ReadAllBytes(path);
PS D:\projects\repros\maxArraySize> dotnet run -c Release
Out of memory.
  • simplify the source code, get rid of returningInternalTask-related logic
Author: adamsitnik
Assignees: -
Labels:

area-System.IO, tenet-performance

Milestone: 7.0.0

@danmoseley
Copy link
Member

Just curious, where does the 2GB file limit come from? Do we have to put it in a byte[]? Since we support buffers larger than Array.MaxLength bytes, eg new long[Array.MaxLength];. I am not sure what the actual size limit on a buffer is.

@stephentoub
Copy link
Member

stephentoub commented Nov 12, 2021

Modern kernels are really good at recognizing file access patterns, we don't get anything from this hint.

There's 0% benefit on any Windows system with any file system? I question whether that's actually the case. Should we simply change the implementation to only specify this on Windows, since there it's (I believe, please correct me if I'm wrong) free whereas we need to make an additional syscall on Unix and thus it really needs to be worth its weight?

@tmds
Copy link
Member

tmds commented Nov 15, 2021

Modern kernels are really good at recognizing file access patterns, we don't get anything from this hint.

fyi, Linux doesn't recognize much. Use of SequentialScan will double the read-ahead buffer.

It makes sense to leave it out for ReadAllBytes because in most cases the cost will be greater than the gain.

Co-authored-by: Dan Moseley <danmose@microsoft.com>
@adamsitnik
Copy link
Member Author

There's 0% benefit on any Windows system with any file system?

According to https://devblogs.microsoft.com/oldnewthing/20120120-00/?p=8493 (which might be outdated as it was referring to Windows 7):

If you issue around six reads in a row, each of which begins where the previous one left off, then the cache manager switches to FILE_FLAG_SEQUENTIAL_SCAN behavior for your file

But here for 99.99% of the cases we perform a single read (the size of the buffer passed to the OS is equal to the file size). I would expect it to be smart enough to recognize the pattern.

Should we simply change the implementation to only specify this on Windows, since there it's (I believe, please correct me if I'm wrong) free

From the benchmarks that I've run (Win 11 + SSD + NTFS + BitLocker) I see no perf gains. I can add it for Windows if you think it's worth it just in case it's beneficial for some other configs (older Windows? FAT32?).

@adamsitnik
Copy link
Member Author

Just curious, where does the 2GB file limit come from?

Afaik because both Array and Span indexers use Int32 for indexing. @jkotas is most probably the best person to ask for more details.

Do we have to put it in a byte[]? Since we support buffers larger than Array.MaxLength bytes, eg new long[Array.MaxLength];.

In this case not, as the public contract states that this method returns byte[].

@stephentoub
Copy link
Member

fyi, Linux doesn't recognize much. Use of SequentialScan will double the read-ahead buffer.
It makes sense to leave it out for ReadAllBytes because in most cases the cost will be greater than the gain.

Should we make SequentialScan then in general a nop on Linux / Unix?


#if DEBUG
fileLength = 0; // improve the test coverage for ReadAllBytesUnknownLength
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead just ensure we have tests that read files that return 0 for the length?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead just ensure we have tests that read files that return 0 for the length?

We have some tests, and you are the person who authored most of them: dotnet/corefx#28388

The problem is that they don't guarantee that all possible code paths are going to be executed (using stack-allocated array, renting an array from the pool, returning it and renting a larger array etc) as they deal with a virtual file system that seems to be exposing files only for reading.

We could use pipes (#58434 introduced some tests for that ), but it would require more work and handling all edge cases.

I know that setting fileLength to 0 with #if DEBUG is ugly, but it gives us what we need with a very little effort.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it gives us what we need with a very little effort.

Does it? It means, for example, that Debug.Asserts in the normal non-zero code paths are now rendered useless.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means, for example, that Debug.Asserts in the normal non-zero code paths are now rendered useless.

Which asserts do you mean? I can't see any in the code path for regular files that report length properly:

private static async Task<byte[]> InternalReadAllBytesAsync(SafeFileHandle sfh, int count, CancellationToken cancellationToken)
{
using (sfh)
{
int index = 0;
byte[] bytes = new byte[count];
do
{
int n = await RandomAccess.ReadAtOffsetAsync(sfh, bytes.AsMemory(index), index, cancellationToken).ConfigureAwait(false);
if (n == 0)
{
ThrowHelper.ThrowEndOfFileException();
}
index += n;
} while (index < count);
return bytes;
}
}

int index = 0;
int count = (int)fileLength;
byte[] bytes = new byte[count];
while (count > 0)
{
int n = RandomAccess.ReadAtOffset(sfh, bytes.AsSpan(index, count), index);
if (n == 0)
{
ThrowHelper.ThrowEndOfFileException();
}
index += n;
count -= n;
}
return bytes;

Copy link
Member

@stephentoub stephentoub Nov 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which asserts do you mean?

Any on any code paths used from here all the way down through the runtime. The change says that the 99.9% code path is never executed in debug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any on any code paths used from here all the way down through the runtime.

These paths (namely RandomAccess.ReadAtOffsetAsync and RandomAccess.ReadAtOffset) have an excellent test code coverage (directly as RandomAccess tests and indirectly via FileStream tests) so I am not worried about it.

I do understand that it's not a clear win and a a trade off.

Comment on lines 528 to 529
// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.
FileOptions options = (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None) | FileOptions.Asynchronous;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

Suggested change
// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.
FileOptions options = (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None) | FileOptions.Asynchronous;
FileOptions options = FileOptions.Asynchronous;
if (OperatingSystem.IsWindows())
{
// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.
options |= FileOptions.SequentialScan;
}

?

Or as this is corelib, it could also be e.g.

Suggested change
// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.
FileOptions options = (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None) | FileOptions.Asynchronous;
const FileOptions Options =
#if TARGET_WINDOWS
FileOptions.Asynchronous | FileOptions.SequentialScan;
#else
// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.
FileOptions.Asynchronous;
#endif

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced, as the proposals require too much space. I've moved FileOptions.Asynchronous to the left, hope it helps:

- FileOptions options = (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None) | FileOptions.Asynchronous;
+ FileOptions options = FileOptions.Asynchronous | (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None);


#if DEBUG
fileLength = 0; // improve the test coverage for InternalReadAllBytesUnknownLengthAsync
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question/comment as earlier.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@tmds
Copy link
Member

tmds commented Nov 15, 2021

Should we make SequentialScan then in general a nop on Linux / Unix?

Supposedly the larger read-ahead is going to deliver performance in some cases.

I think we should not use it by default on Unix, and still allow the user to opt-in when he creates the FileStream/FileSafeHandle himself.

@adamsitnik can you also update the private AsyncStreamReader method similar to the other methods?

// bufferSize == 1 used to avoid unnecessary buffer in FileStream
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 1, FileOptions.SequentialScan))
// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.
FileOptions options = OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe turn this into a static property, like PlatformSequentialScanOption.

@adamsitnik
Copy link
Member Author

can you also update the private AsyncStreamReader method similar to the other methods?

@tmds To be honest I would prefer to use an approach similar to what I did in #58167 and get of this reader entirely. But this is low priority compared to... reviewing your PRs for example ;)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.IO tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants