Minor File.ReadAllBytes* improvements #61519

adamsitnik · 2021-11-12T18:10:06Z

Changes:

switch from using FileStream to RandomAccess, reduces the allocations by +- 100 bytes
stop using FileOptions.SequentialScan as it requires an additional sys-call on Unix. Modern kernels are really good at recognizing file access patterns, we don't get anything from this hint. Reading small files (< 1kb) on Unix is 8% faster.
use Array.MaxLength instead int.MaxSize for determining the limit. It fixes an edge case bug where for files of size > Array.MaxLength but < int.MaxSize where it was throwing OOM instead IOException:

string path = "here.dat";
using (FileStream fs = File.Create(path))
{
    fs.SetLength(Array.MaxLength + 1L);
}

File.ReadAllBytes(path);

PS D:\projects\repros\maxArraySize> dotnet run -c Release
Out of memory.

simplify the source code, get rid of returningInternalTask-related logic
increase the test coverage for non-seekable files and for those which report 0 length despite not being empty by setting fileLength = 0 for DEBUG builds

…case bug for files: Array.MaxLength < Length < int.MaxValue

…an additional sys call on Unix and every OS is able to recognize the disk access patterns anyway

ghost · 2021-11-12T18:10:10Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Changes:

switch from using FileStream to RandomAccess, reduces the allocations by +- 100 bytes
stop using FileOptions.SequentialScan as it requires an additional sys-call on Unix. Modern kernels are really good at recognizing file access patterns, we don't get anything from this hint. Reading small files (< 1kb) on Unix is 8% faster.
use Array.MaxLength instead int.MaxSize for determining the limit. It fixes an edge case bug where for files of size > but < we would get OOM instead IOException:

string path = "here.dat";
using (FileStream fs = File.Create(path))
{
    fs.SetLength(Array.MaxLength + 1L);
}

File.ReadAllBytes(path);

PS D:\projects\repros\maxArraySize> dotnet run -c Release
Out of memory.

simplify the source code, get rid of returningInternalTask-related logic

Author:	adamsitnik
Assignees:	-
Labels:	`area-System.IO`, `tenet-performance`
Milestone:	7.0.0

src/libraries/System.Private.CoreLib/src/System/IO/File.cs

danmoseley · 2021-11-12T19:38:36Z

Just curious, where does the 2GB file limit come from? Do we have to put it in a byte[]? Since we support buffers larger than Array.MaxLength bytes, eg new long[Array.MaxLength];. I am not sure what the actual size limit on a buffer is.

stephentoub · 2021-11-12T22:47:00Z

Modern kernels are really good at recognizing file access patterns, we don't get anything from this hint.

There's 0% benefit on any Windows system with any file system? I question whether that's actually the case. Should we simply change the implementation to only specify this on Windows, since there it's (I believe, please correct me if I'm wrong) free whereas we need to make an additional syscall on Unix and thus it really needs to be worth its weight?

src/libraries/System.Private.CoreLib/src/System/IO/File.cs

src/libraries/System.IO.FileSystem/tests/File/ReadWriteAllBytes.cs

tmds · 2021-11-15T09:02:30Z

Modern kernels are really good at recognizing file access patterns, we don't get anything from this hint.

fyi, Linux doesn't recognize much. Use of SequentialScan will double the read-ahead buffer.

It makes sense to leave it out for ReadAllBytes because in most cases the cost will be greater than the gain.

Co-authored-by: Dan Moseley <danmose@microsoft.com>

adamsitnik · 2021-11-15T09:46:21Z

There's 0% benefit on any Windows system with any file system?

According to https://devblogs.microsoft.com/oldnewthing/20120120-00/?p=8493 (which might be outdated as it was referring to Windows 7):

If you issue around six reads in a row, each of which begins where the previous one left off, then the cache manager switches to FILE_FLAG_SEQUENTIAL_SCAN behavior for your file

But here for 99.99% of the cases we perform a single read (the size of the buffer passed to the OS is equal to the file size). I would expect it to be smart enough to recognize the pattern.

Should we simply change the implementation to only specify this on Windows, since there it's (I believe, please correct me if I'm wrong) free

From the benchmarks that I've run (Win 11 + SSD + NTFS + BitLocker) I see no perf gains. I can add it for Windows if you think it's worth it just in case it's beneficial for some other configs (older Windows? FAT32?).

…cient for edge cases)

adamsitnik · 2021-11-15T11:25:56Z

Just curious, where does the 2GB file limit come from?

Afaik because both Array and Span indexers use Int32 for indexing. @jkotas is most probably the best person to ask for more details.

Do we have to put it in a byte[]? Since we support buffers larger than Array.MaxLength bytes, eg new long[Array.MaxLength];.

In this case not, as the public contract states that this method returns byte[].

src/libraries/System.IO.FileSystem/tests/File/ReadWriteAllBytesAsync.cs

stephentoub · 2021-11-15T14:44:56Z

fyi, Linux doesn't recognize much. Use of SequentialScan will double the read-ahead buffer.
It makes sense to leave it out for ReadAllBytes because in most cases the cost will be greater than the gain.

Should we make SequentialScan then in general a nop on Linux / Unix?

stephentoub · 2021-11-15T14:46:03Z

src/libraries/System.Private.CoreLib/src/System/IO/File.cs

+
+#if DEBUG
+                fileLength = 0; // improve the test coverage for ReadAllBytesUnknownLength
+#endif


Can we instead just ensure we have tests that read files that return 0 for the length?

Can we instead just ensure we have tests that read files that return 0 for the length?

We have some tests, and you are the person who authored most of them: dotnet/corefx#28388

The problem is that they don't guarantee that all possible code paths are going to be executed (using stack-allocated array, renting an array from the pool, returning it and renting a larger array etc) as they deal with a virtual file system that seems to be exposing files only for reading.

We could use pipes (#58434 introduced some tests for that ), but it would require more work and handling all edge cases.

I know that setting fileLength to 0 with #if DEBUG is ugly, but it gives us what we need with a very little effort.

but it gives us what we need with a very little effort.

Does it? It means, for example, that Debug.Asserts in the normal non-zero code paths are now rendered useless.

It means, for example, that Debug.Asserts in the normal non-zero code paths are now rendered useless.

Which asserts do you mean? I can't see any in the code path for regular files that report length properly:

runtime/src/libraries/System.Private.CoreLib/src/System/IO/File.cs

Lines 548 to 567 in 2576390

private static async Task<byte[]> InternalReadAllBytesAsync(SafeFileHandle sfh, int count, CancellationToken cancellationToken)

{

using (sfh)

{

int index = 0;

byte[] bytes = new byte[count];

do

{

int n = await RandomAccess.ReadAtOffsetAsync(sfh, bytes.AsMemory(index), index, cancellationToken).ConfigureAwait(false);

if (n == 0)

{

ThrowHelper.ThrowEndOfFileException();

}

index += n;

} while (index < count);

return bytes;

}

}

runtime/src/libraries/System.Private.CoreLib/src/System/IO/File.cs

Lines 275 to 289 in 2576390

int index = 0;

int count = (int)fileLength;

byte[] bytes = new byte[count];

while (count > 0)

{

int n = RandomAccess.ReadAtOffset(sfh, bytes.AsSpan(index, count), index);

if (n == 0)

{

ThrowHelper.ThrowEndOfFileException();

}

index += n;

count -= n;

}

return bytes;

Which asserts do you mean?

Any on any code paths used from here all the way down through the runtime. The change says that the 99.9% code path is never executed in debug.

Any on any code paths used from here all the way down through the runtime.

These paths (namely RandomAccess.ReadAtOffsetAsync and RandomAccess.ReadAtOffset) have an excellent test code coverage (directly as RandomAccess tests and indirectly via FileStream tests) so I am not worried about it.

I do understand that it's not a clear win and a a trade off.

stephentoub · 2021-11-15T14:50:27Z

src/libraries/System.Private.CoreLib/src/System/IO/File.cs

+            // SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.
+            FileOptions options = (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None) | FileOptions.Asynchronous;


Maybe:

Suggested change

// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.

FileOptions options = (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None) | FileOptions.Asynchronous;

FileOptions options = FileOptions.Asynchronous;

if (OperatingSystem.IsWindows())

{

// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.

options |= FileOptions.SequentialScan;

}

?

Or as this is corelib, it could also be e.g.

Suggested change

// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.

FileOptions options = (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None) | FileOptions.Asynchronous;

const FileOptions Options =

#if TARGET_WINDOWS

FileOptions.Asynchronous | FileOptions.SequentialScan;

#else

// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.

FileOptions.Asynchronous;

#endif

I am not convinced, as the proposals require too much space. I've moved FileOptions.Asynchronous to the left, hope it helps:

- FileOptions options = (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None) | FileOptions.Asynchronous; + FileOptions options = FileOptions.Asynchronous | (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None);

stephentoub · 2021-11-15T14:51:06Z

src/libraries/System.Private.CoreLib/src/System/IO/File.cs

+
+#if DEBUG
+            fileLength = 0; // improve the test coverage for InternalReadAllBytesUnknownLengthAsync
+#endif


Same question/comment as earlier.

stephentoub

Thanks.

tmds · 2021-11-15T16:39:14Z

Should we make SequentialScan then in general a nop on Linux / Unix?

Supposedly the larger read-ahead is going to deliver performance in some cases.

I think we should not use it by default on Unix, and still allow the user to opt-in when he creates the FileStream/FileSafeHandle himself.

@adamsitnik can you also update the private AsyncStreamReader method similar to the other methods?

tmds · 2021-11-15T16:41:42Z

src/libraries/System.Private.CoreLib/src/System/IO/File.cs

-            // bufferSize == 1 used to avoid unnecessary buffer in FileStream
-            using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 1, FileOptions.SequentialScan))
+            // SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.
+            FileOptions options = OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None;


nit: maybe turn this into a static property, like PlatformSequentialScanOption.

adamsitnik · 2021-11-15T16:47:45Z

can you also update the private AsyncStreamReader method similar to the other methods?

@tmds To be honest I would prefer to use an approach similar to what I did in #58167 and get of this reader entirely. But this is low priority compared to... reviewing your PRs for example ;)

adamsitnik added 4 commits November 12, 2021 17:29

switch from FileStream to RandomAccess

eeb5e0b

use Array.MaxLength as a limit for File.ReadAllBytes and fix an edge …

87aaf3d

…case bug for files: Array.MaxLength < Length < int.MaxValue

apply same changes to the async version

2b6b22b

there is no gain of using FileOptions.SequentialScan, as it requires …

39c2621

…an additional sys call on Unix and every OS is able to recognize the disk access patterns anyway

adamsitnik added area-System.IO tenet-performance Performance related issue labels Nov 12, 2021

adamsitnik added this to the 7.0.0 milestone Nov 12, 2021

adamsitnik requested review from carlossanlop and jozkee November 12, 2021 18:10

danmoseley reviewed Nov 12, 2021

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/IO/File.cs Outdated Show resolved Hide resolved

stephentoub reviewed Nov 12, 2021

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/IO/File.cs Outdated Show resolved Hide resolved

stephentoub reviewed Nov 12, 2021

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/IO/File.cs Outdated Show resolved Hide resolved

stephentoub reviewed Nov 12, 2021

View reviewed changes

src/libraries/System.IO.FileSystem/tests/File/ReadWriteAllBytes.cs Show resolved Hide resolved

Update src/libraries/System.Private.CoreLib/src/System/IO/File.cs

29b6b8d

Co-authored-by: Dan Moseley <danmose@microsoft.com>

adamsitnik added 3 commits November 15, 2021 11:47

use SequentialScan on Windows just in case it helps for some configs

9963347

use the previous pattern (the code is more complex but also more effi…

9755a45

…cient for edge cases)

improve test coverage

9fa8190

adamsitnik requested a review from stephentoub November 15, 2021 11:20

stephentoub reviewed Nov 15, 2021

View reviewed changes

src/libraries/System.IO.FileSystem/tests/File/ReadWriteAllBytesAsync.cs Outdated Show resolved Hide resolved

stephentoub reviewed Nov 15, 2021

View reviewed changes

src/libraries/System.IO.FileSystem/tests/File/ReadWriteAllBytesAsync.cs Outdated Show resolved Hide resolved

stephentoub reviewed Nov 15, 2021

View reviewed changes

stephentoub approved these changes Nov 15, 2021

View reviewed changes

address code review feedback

2576390

tmds reviewed Nov 15, 2021

View reviewed changes

adamsitnik merged commit d1b3816 into dotnet:main Nov 17, 2021

adamsitnik deleted the useRandomAccess branch November 17, 2021 12:29

AndyAyersMS mentioned this pull request Nov 22, 2021

arm64 jitstress failing with encoding_found assert #61944

Closed

ghost locked as resolved and limited conversation to collaborators Dec 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minor File.ReadAllBytes* improvements #61519

Minor File.ReadAllBytes* improvements #61519

adamsitnik commented Nov 12, 2021 •

edited

Loading

ghost commented Nov 12, 2021

danmoseley commented Nov 12, 2021

stephentoub commented Nov 12, 2021 •

edited

Loading

tmds commented Nov 15, 2021

adamsitnik commented Nov 15, 2021

adamsitnik commented Nov 15, 2021

stephentoub commented Nov 15, 2021

stephentoub Nov 15, 2021

adamsitnik Nov 15, 2021

stephentoub Nov 15, 2021

adamsitnik Nov 16, 2021

stephentoub Nov 17, 2021 •

edited

Loading

adamsitnik Nov 17, 2021

stephentoub Nov 15, 2021

adamsitnik Nov 15, 2021

stephentoub Nov 15, 2021

stephentoub left a comment

tmds commented Nov 15, 2021

tmds Nov 15, 2021

adamsitnik commented Nov 15, 2021

	private static async Task<byte[]> InternalReadAllBytesAsync(SafeFileHandle sfh, int count, CancellationToken cancellationToken)
	{
	using (sfh)
	{
	int index = 0;
	byte[] bytes = new byte[count];
	do
	{
	int n = await RandomAccess.ReadAtOffsetAsync(sfh, bytes.AsMemory(index), index, cancellationToken).ConfigureAwait(false);
	if (n == 0)
	{
	ThrowHelper.ThrowEndOfFileException();
	}

	index += n;
	} while (index < count);

	return bytes;
	}
	}

	int index = 0;
	int count = (int)fileLength;
	byte[] bytes = new byte[count];
	while (count > 0)
	{
	int n = RandomAccess.ReadAtOffset(sfh, bytes.AsSpan(index, count), index);
	if (n == 0)
	{
	ThrowHelper.ThrowEndOfFileException();
	}

	index += n;
	count -= n;
	}
	return bytes;

		// SequentialScan is a perf hint that requires extra sys-call on non-Windows OSes.
		FileOptions options = (OperatingSystem.IsWindows() ? FileOptions.SequentialScan : FileOptions.None) \| FileOptions.Asynchronous;

Minor File.ReadAllBytes* improvements #61519

Minor File.ReadAllBytes* improvements #61519

Conversation

adamsitnik commented Nov 12, 2021 • edited Loading

ghost commented Nov 12, 2021

danmoseley commented Nov 12, 2021

stephentoub commented Nov 12, 2021 • edited Loading

tmds commented Nov 15, 2021

adamsitnik commented Nov 15, 2021

adamsitnik commented Nov 15, 2021

stephentoub commented Nov 15, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Nov 17, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub left a comment

Choose a reason for hiding this comment

tmds commented Nov 15, 2021

Choose a reason for hiding this comment

adamsitnik commented Nov 15, 2021

adamsitnik commented Nov 12, 2021 •

edited

Loading

stephentoub commented Nov 12, 2021 •

edited

Loading

stephentoub Nov 17, 2021 •

edited

Loading