Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async File IO APIs mimicking Win32 OVERLAPPED #24847

Closed
alexbudmsft opened this issue Jan 30, 2018 · 45 comments · Fixed by #53669
Closed

Async File IO APIs mimicking Win32 OVERLAPPED #24847

alexbudmsft opened this issue Jan 30, 2018 · 45 comments · Fixed by #53669
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.IO
Milestone

Comments

@alexbudmsft
Copy link

alexbudmsft commented Jan 30, 2018

Edit by @carlossanlop: The updated API Proposal is here.


It would be great if the .NET framework had async file APIs that closely modelled Win32 OVERLAPPED without the need for building one ourselves with Interop (e.g. https://dschenkelman.github.io/2013/10/29/asynchronous-io-in-c-io-completion-ports/)

Today, the APIs do not take an offset parameter and instead operate on the internal file position. While there are some workarounds (creating multiple Filestream objects and changing their file pointer before issuing the IO) this is unwieldy and inefficient. Ideally there'd be a single FILE_OBJECT (in kernel mode) opened with FILE_FLAG_OVERLAPPED against which you can queue many async IOs where .NET would allocate an OVERLAPPED internally and set the OffsetLow/High fields.

You'd have to integrate this somehow with completions but keep it as flexible as possible. In Win32 you can check for completions via polling (looking at OVERLAPPED.Internal for STATUS_IO_PENDING), waiting on an event (OVERLAPPED.hEvent), waiting on the FILE_OBJECT's internal event (by waiting on the file handle itself), associating the handle with an IOCP and calling GetQueuedCompletionStatus(), and finally putting the thread in an alertable wait state via SleepEx et al and using ReadFileEx.

We don't necessarily have to support all of these, but at the very least polling and event based should be supported somehow. The .NET "Task" model already wraps an event-like scheme with Task.WaitAny() for example, so we could try to plug into that. For polling, we'd want a method on the OVERLAPPED wrapper (whether we use the existing .NET Overlapped class or hide it is up to you) that checks the Internal field (i.e. HasOverlappedIoCompleted macro equivalent).

This will allow a lot more high performance server style applications that care about the full functionality of the Win32 IO model to be written in C# without any hand-rolled Interop.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@adamsitnik
Copy link
Member

adamsitnik commented Feb 2, 2021

Hi @alexbudmsft

We are currently working on a FileStream rewrite for .NET 6.0. I've done an initial analysis of the problems that we want to solve and it seems that using OVERLAPPED for not just async (we are already doing that) but also sync IO (we are not doing that yet) might help us solve #16354, #16341, #27047, and #25905.

If I am right, it could mean that we could fulfill your request by adding the following overloads to FileStream:

public partial class FileStream : IO.Stream
{
    public override int Read(byte[] buffer, int offset, int count)
    public override int Read(Span<byte> buffer)
+   public virtual int Read(Span<byte> buffer, long fileOffset)
    public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken)
+   public virtual ValueTask<int> ReadAsync(Memory<byte> buffer, long fileOffset, CancellationToken cancellationToken)
    public override void Write(byte[] buffer, int offset, int count) { }
    public override void Write(ReadOnlySpan<byte> buffer) { }
+   public virtual void Write(ReadOnlySpan<byte> buffer, long fileOffset) { }    
    public override Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken)
+   public virtual ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, long fileOffset, CancellationToken cancellationToken)
}

Would that be enough for you as a potential end-user of the feature?

Edit: I think that we can get Position, ReadByte, Seek and WriteByte working fine without throwing.

What would be the expected behavior for you as an end-user? Would throwing for these methods be OK? Or even expected?

This will allow a lot more high performance server style applications that care about the full functionality of the Win32 IO model to be written in C# without any hand-rolled Interop.

Could you say something more about possible usages? I wonder what kind of problems this feature could allow solving. The more I know the better prioritization I can try to get for this request.

@nockawa
Copy link

nockawa commented Mar 1, 2021

One of the use case I have is about raw data reading/writing where a Stream is not particularly useful: I don't need an object that keeps track of the current Position into the file and updates it. I have no use of it and it is a problem if I want to issue concurrent reads and writes.
So yes, being able to issue read/write specifying the position would be ideal and I'd expect this to be as fast as possible, thread-safe, async, and using overlapped io of course.
We now have a very efficient way to address memory with Span/Memory<T> so I'd say the next step if having a very low level API that perform raw read/write to enable many kind of database related operation where you don't need to particularly process the data, but just read/write lots of them with the highest throughput, async and multithreaded.

@mriehm
Copy link

mriehm commented Mar 4, 2021

@adamsitnik
Your proposal above shows adding these new APIs just to the FileStream type, but I'm wondering if you are also considering including these in the base Stream type?

Otherwise, if I own an API that takes a Stream parameter, I cannot use these higher performance APIs even if the runtime type of my parameter happens to be FileStream, unless I try casting to FileStream and have a dedicated code path for that case. Though, in the case that the FileStream has been wrapped in some delegating Stream, then even casting to FileStream would not work.
Also, I understand that #48813 is going to disable parallel calls to FileStream.ReadAsync(), with the justification that consumers could switch to this new, higher-performance API to re-enable that behavior. However, it will not be possible for all Stream consumers to switch to this new API if this API is not on the base Stream type.

I'm not sure how much other Stream implementations would benefit from overriding this API. Perhaps something like MemoryStream could improve performance a tiny bit due to not needing to update the position. It could also enable some degree of thread-safety in the case of only readers.
For everything else, this seems a reasonable base Stream implementation:

public virtual int Read(Span<byte> buffer, long fileOffset)
{
    if (!CanSeek)
    {
        throw new NotSupportedException();
    }

    int oldPosition = Position;
    Position = fileOffset;
    int bytesRead = Read(buffer);
    Position = oldPosition;
    return bytesRead;
}

@stephentoub
Copy link
Member

stephentoub commented Mar 4, 2021

I don't think these should be on either Stream or FileStream.

The purpose of Stream is in reading a sequence of data in a linear manner. Concurrent usage of a file where every request is for a specific position is not a stream. It has dubious interactions with any buffering in the instance. It doesn't work with everything the FileStream could wrap (e.g. a pipe). Suggesting these APIs could be used in parallel but other overloads of the same method can't is confusing and likely to lead to production bugs. And then doing so in a way that overrides a base method that isn't also thread-safe is a recipe for more production bugs.

If we want to add APIs for this scenario, I'm against them being instance methods on FileStream. They could be methods related to a SafeFileHandle, on File, or on some new low-level File I/O type.

@nockawa
Copy link

nockawa commented Mar 5, 2021

@stephentoub I totally agree. Today FileStream is acting as the entry point for disk access and it's stream based, which I can understand. But for low level usage we would ideally need a new set of types, things would be less confusing for sure, and easier to use.

@GSPP
Copy link

GSPP commented Mar 13, 2021

Maybe there should be a type that contains a file handle and that exposes low-level operations on files. As @stephentoub has pointed out, the Stream interface is not capable of exposing everything that an OS-level file is comprised of. There's at least offset-based IO, but there also is deletion by handle for example and I'm sure there are many things more.

It could be an informative exercise to scan the entire catalog of Windows and Linux file API functions that take a file handle. Any such API potentially is a candidate for being exposed on a low-level file handle type.

@adamsitnik
Copy link
Member

If we want to add APIs for this scenario, I'm against them being instance methods on FileStream.

👍

They could be methods related to a SafeFileHandle

To obtain a SafeFileHandle users would need to call native OS API, which would lead to a bad user experience and likely introduce bugs.

The other way is to create a FileStream and then call the SafeFileHandle getter. This is expensive (you need to create a FileStream which is a non-trivial reference type). Users would also need to know that isAsync needs to be set to true in the FileStream ctor.

Moreover, accessing the handle has side-effects, like Flushing the buffered content:

internal override SafeFileHandle SafeFileHandle
{
get
{
Flush();
_exposedHandle = true;
return _fileHandle;
}
}

Or tracking the file offset:

private void VerifyOSHandlePosition()
{
bool verifyPosition = _exposedHandle; // in release, only verify if we've given out the handle such that someone else could be manipulating it
#if DEBUG
verifyPosition = true; // in debug, always make sure our position matches what the OS says it should be
#endif
if (verifyPosition && CanSeek)
{
long oldPos = _filePosition; // SeekCore will override the current _position, so save it now
long curPos = SeekCore(_fileHandle, 0, SeekOrigin.Current);
if (oldPos != curPos)
{
// For reads, this is non-fatal but we still could have returned corrupted
// data in some cases, so discard the internal buffer. For writes,
// this is a problem; discard the buffer and error out.
_readPos = _readLength = 0;
if (_writePos > 0)
{
_writePos = 0;
throw new IOException(SR.IO_FileStreamHandlePosition);
}
}
}
}

When we switch to tracking the offset only in memory (#49145), we might need to sync it with OS offset for cases where SafeFileHandle was exposed.

on File

This would require using handles or opening the file every time we want to perform an operation?

or on some new low-level File I/O type.

IMO this is the best solution. We could introduce a very simple API and moreover separate the sync and async implementations:

public sealed class OverlappedFile : IDisposable
{
    public string Path { get; }
    public FileAccess FileAccess { get; }
    public long Length { get; set; }

    public OverlappedFile(string path, FileMode mode = FileMode.Open, FileAccess access = FileAccess.Read, FileShare share = FileShare.Read) { }

    public int Read(Span<byte> buffer, long fileOffset) { };
    public void Write(ReadOnlySpan<byte> buffer, long fileOffset) { }
}

public sealed class AsynchronousOverlappedFile : IDisposable
{
    public string Path { get; }
    public FileAccess FileAccess { get; }
    public long Length { get; set; }

    public AsynchronousOverlappedFile(string path, FileMode mode = FileMode.Open, FileAccess access = FileAccess.Read, FileShare share = FileShare.Read) { }

    public ValueTask<int> ReadAsync(Memory<byte> buffer, long fileOffset, CancellationToken cancellationToken) { }
    public ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, long fileOffset, CancellationToken cancellationToken) { }
}

Since for Windows and Unix the file must be seekable to perform overlapped IO, there is no need for the CanSeek property. I would also prefer not to add handle-based ctor (keep it simple, don't use handles if you don't have to).

The biggest problem for me here is... naming. My current ideas:

  • OverlappedFile and AsynchronousOverlappedFile
  • RandomAccessFile and AsynchronousRandomAccessFile

@stephentoub
Copy link
Member

To obtain a SafeFileHandle users would need to call native OS API, which would lead to a bad user experience and likely introduce bugs.

This whole issue is about introducing new APIs. I'm not sure why new APIs that return a SafeFileHandle would be off the table.

This would require using handles or opening the file every time we want to perform an operation?

Yes, the former.

@adamsitnik
Copy link
Member

I'm not sure why new APIs that return a SafeFileHandle would be off the table.

One of the problems with SafeFileHandle is that (at least as of today) it does not contain any information about the type of the handle (file, pipe etc) and whether seeking is allowed or not.

On Windows, this is fetched with an additional syscall:

int handleType = Interop.Kernel32.GetFileType(handle);
Debug.Assert(handleType == Interop.Kernel32.FileTypes.FILE_TYPE_DISK || handleType == Interop.Kernel32.FileTypes.FILE_TYPE_PIPE || handleType == Interop.Kernel32.FileTypes.FILE_TYPE_CHAR, "FileStream was passed an unknown file type!");
_canSeek = handleType == Interop.Kernel32.FileTypes.FILE_TYPE_DISK;
_isPipe = handleType == Interop.Kernel32.FileTypes.FILE_TYPE_PIPE;

Which we of course want to avoid in high-perf APIs (not completely, but just do it once for a given file like we do it in FileStream ctor as of today).

If we skip the file type check and just set the offsets for the OVERLAPPED structure, they will be ignored and Windows will just perform the read of the next n bytes, rather than n bytes for given offset:

https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-readfile#parameters

For an hFile that does not support byte offsets, Offset and OffsetHigh are ignored.

So this would give us a silent error.

On Linux, there are some edge cases as well:

https://man7.org/linux/man-pages/man2/pwrite.2.html

POSIX requires that opening a file with the O_APPEND flag should
have no effect on the location at which pwrite() writes data.
However, on Linux, if a file is opened with O_APPEND, pwrite()
appends data to the end of the file, regardless of the value of
offset.

To avoid that, we would have to open the file on our own and ensure that O_APPEND is not set.

Moreover, having a dedicated type that can be more future-proof because allows for a very easy extension of the type and introduction of a new state. Let's say that we want to use new APIs like io_uring which might for example require to have a reference to something like a ring that should be reused for all subsequent operations on the given file.

@mriehm
Copy link

mriehm commented Mar 20, 2021

I can only speak for my own use cases, but an OverlappedFile type that cannot be constructed with a file handle would not likely be useful to me.

Performance Concerns
While I would of course appreciate higher performance Read() and Write() operations, the performance of opening file handles has also been a significant source of performance problems for my application (exacerbated by conditions such as accessing files over the network or on-access antivirus scanning), which has led me to evolve my code to reuse file handles as much as possible. In cases where .NET does not support an operation by handle (not yet anyway 😉), I'll use FileStream.Handle with calls into native code to perform operations such as deletions, renames, or basic setting of file info. If using an OverlappedFile type meant that I could no longer perform these other operations by handle (or that I need to open both FileStream and OverlappedFile), I would likely consider the performance degradation to outweigh the potential performance improvement. Though perhaps this OverlappedFile type could still be useful for a different application, such as a database, that prioritizes differently between Read()/Write() performance and Open() performance.

Other Concerns
-Lack of atomicity. For example, if I want to open a file, read the contents, and then delete the file, if I have to delete the file by path instead of by handle, how can I guarantee that I am deleting the same file that I just read? Maybe the original file has already been deleted/renamed and replaced? Perhaps most applications would not be concerned with this degree of robustness in the face of concurrent file operations, but my application is. Also, these types of issues can sometimes surface as security vulnerabilities (e.g. reading a file inserted by a malicious actor instead of the file that you intended to read).

-File locking issues. If I have to open multiple file handles to do different operations (write, delete, set file info, etc.), and I want these file handles open concurrently, I cannot open these file handles with very restrictive FileShare options, since all of these handles must be compatible with each other. This means that I cannot as effectively lock other applications out of the files that I am currently modifying.

-Specifying File Open Options Though I understand that your API above is a very preliminary proposal, it doesn't currently provide for specifying FileSystemRights or FileOptions or FileSecurity like can be done for FileStream today. Personally, I actually go to native code today for even more control of creating the file handle. I understand that some options will necessarily be incompatible with any implementation of OverlappedFile, but I would consider using a file handle a sufficiently advanced scenario that it is acceptable to put the responsibility on the consumer to use compatible options, as long as OverlappedFile fails in an obvious way if the file handle wasn't opened with the correct options. For example, I would consider not respecting the fileOffset parameter at all to be an obvious failure that would surely be noticed quickly.

I do recognize your concerns about a lack of encapsulation making future growth more difficult. However, generally, couldn't you just have consumers that pass file handles fall back to an older code path if necessary? I think, it's better to have a usable API with limited growth potential, as opposed to an API that can't be used at all (for my use cases).

Thanks.

@adamsitnik adamsitnik added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Mar 25, 2021
@adamsitnik adamsitnik self-assigned this Mar 25, 2021
@adamsitnik adamsitnik modified the milestones: Future, 5.0.x, 6.0.0 Mar 25, 2021
@adamsitnik
Copy link
Member

adamsitnik commented Mar 25, 2021

Background and Motivation

Some high-performance server applications require the possibility to perform parallel reads and|or writes with the overhead being as small as possible.

In theory, it's possible to use FileStream to seek and read|write to|from a given offset, but this is not atomic (two syscalls) and not thread-safe (ReadAsync and WriteAsync are not thread-safe).

Both Windows and Unix provide the necessary APIs that allow for that:

  • ReadFile and WriteFile on Windows (with OVERLAPPED flag enabled if we want to perform async operations)
  • pread and pwrite on Unix
  • io_uring on Linux

Both Windows and Unix support only regular disk files. Unseekable files (pipes, etc.) are not supported.

Proposed API

Underlying OS APIs like OVERLAPPED on Windows require to specify whether the file will be used for async IO up-front. This is why FileSteam ctor accepts isAsync argument.

The problem with FileSteam is that it exposes both sync and async operations and their behavior is different depending on what was passed to the ctor.

This makes the code very hard to read and maintain. Example:

public override Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
{
if (!_useAsyncIO)
{
// If we weren't opened for asynchronous I/O, we still call to the base implementation so that
// Write is invoked asynchronously. But we can do so using the base Stream's internal helper
// that bypasses delegating to BeginWrite, since we already know this is FileStream rather
// than something derived from it and what our BeginWrite implementation is going to do.
return (Task)base.BeginWriteInternal(buffer, offset, count, null, null, serializeAsynchronously: true, apm: false);
}
return WriteAsyncInternal(new ReadOnlyMemory<byte>(buffer, offset, count), cancellationToken).AsTask();
}

So when introducing new APIs, we would like to separate async and sync implementations.

public sealed class RandomAccessFile : IDisposable
{
    public string? Path { get; } // not Name like in FileStream
    public FileAccess Access { get; }
    public SafeFileHandle SafeHandle { get; }
    public ulong Length { get; set; } // ulong, not long

    public RandomAccessFile(string filePath, FileMode mode = FileMode.Open, FileAccess access = FileAccess.Read, FileShare share = FileShare.Read)
    public RandomAccessFile(SafeFileHandle handle, FileAccess access = FileAccess.Read)
    
    public void Dispose()

    public int Read(Span<byte> buffer, ulong fileOffset)
    public int Write(ReadOnlySpan<byte> buffer, ulong fileOffset)
}

public sealed class AsynchronousRandomAccessFile : IDisposable
{
    public string? Path { get; }
    public FileAccess Access { get; }
    public SafeFileHandle SafeHandle { get; }
    public ulong Length { get; set; }

    public AsynchronousRandomAccessFile(string filePath, FileMode mode = FileMode.Open, FileAccess access = FileAccess.Read, FileShare share = FileShare.Read)
    public AsynchronousRandomAccessFile(SafeFileHandle handle, FileAccess access = FileAccess.Read)
    
    public void Dispose()

    public ValueTask<int> ReadAsync(Memory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default)
    public ValueTask<int> WriteAsync(ReadOnlyMemory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default)
}

Alternative names: OverlappedFile and AsynchronousOverlappedFile

Note: The APIs are using ValueTask as it allows for caching of IValueTaskSource instances.

Usage Examples

Copying files in parallel:

static void CopyParallelSync(string sourcePath, string destinationPath)
{
    using (var sourceFile = new RandomAccessFile(sourcePath))
    using (var destinationFile = new RandomAccessFile(destinationPath))
    {
        int bufferLength = (int)(sourceFile.Length / (ulong)Environment.ProcessorCount);

        Parallel.For(0, Environment.ProcessorCount, index =>
        {
            byte[] bytes = ArrayPool<byte>.Shared.Rent(bufferLength);
            ulong fileOffset = (ulong)(index * bufferLength);

            int bytesRead = sourceFile.Read(bytes.AsSpan(0, bufferLength), fileOffset);
            int bytesWritten = destinationFile.Write(bytes.AsSpan(0, bufferLength), fileOffset);
            // for the sake of simplicity we assume that bufferLength == bytesRead == bytesWritten 

            ArrayPool<byte>.Shared.Return(bytes);
        });
    }
}

static async Task CopyParallelAsync(string sourcePath, string destinationPath)
{
    await using (var sourceFile = new AsynchronousRandomAccessFile(sourcePath))
    await using (var destinationFile = new AsynchronousRandomAccessFile(destinationPath))
    {
        int bufferLength = (int)(sourceFile.Length / (ulong)Environment.ProcessorCount);

        Task[] tasks = new Task[Environment.ProcessorCount];

        for (int index = 0; index < tasks.Length; index++)
        {
            tasks[index] = CopyPartAsync(index, bufferLength, sourceFile, destinationFile);
        }

        await Task.WhenAll(tasks);
    }

    async Task CopyPartAsync(int index, int bufferLength, AsynchronousRandomAccessFile sourceFile, AsynchronousRandomAccessFile destinationFile)
    {
        byte[] bytes = ArrayPool<byte>.Shared.Rent(bufferLength);
        ulong fileOffset = (ulong)(index * bufferLength);

        int bytesRead = await sourceFile.ReadAsync(bytes.AsMemory(0, bufferLength), fileOffset);
        int bytesWritten = await destinationFile.WriteAsync(bytes.AsMemory(0, bufferLength), fileOffset);

        ArrayPool<byte>.Shared.Return(bytes);
    }
}

Alternative Designs

Extending FileStream

As @stephentoub wrote:

The purpose of Stream is in reading a sequence of data in a linear manner. Concurrent usage of a file where every request is for a specific position is not a stream. It has dubious interactions with any buffering in the instance. It doesn't work with everything the FileStream could wrap (e.g. a pipe). Suggesting these APIs could be used in parallel but other overloads of the same method can't is confusing and likely to lead to production bugs. And then doing so in a way that overrides a base method that isn't also thread-safe is a recipe for more production bugs.

SafeFileHandle extensions

public sealed partial class SafeFileHandle : Microsoft.Win32.SafeHandles.SafeHandleZeroOrMinusOneIsInvalid
{    
    public SafeFileHandle()
    public SafeFileHandle(System.IntPtr preexistingHandle, bool ownsHandle)
    
+   public static SafeFileHandle OpenFile(string path, FileMode mode = FileMode.Open, FileAccess access = FileAccess.Read, FileShare share = FileShare.Read, FileOptions options = FileOptions.None)

    public override bool IsInvalid { get; }
+   public bool CanSeek { get; }

    protected override bool ReleaseHandle() { throw null; }
}

+public static class SafeFileHandleExtensions
+{
+    public static int Read(this SafeFileHandle fileHandle, Span<byte> buffer, ulong fileOffset)
+    public static int Write(this SafeFileHandle fileHandle, ReadOnlySpan<byte> buffer, ulong fileOffset)
+    
+    public static ValueTask<int> ReadAsync(this SafeFileHandle fileHandle, Memory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default)
+    public static ValueTask<int> WriteAsync(this SafeFileHandle fileHandle, ReadOnlyMemory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default)
+}

The advantage would be the lack of introducing new type(s), but the disadvantage is a potential and quite common performance hit which is something that we want to avoid in a high-performance API.

There are two ways of obtaining safe file handles. The first one is calling native OS API. It's not cross-platform and it requires a very good understanding of the APIs.

As noticed by @jkotas, this could be solved by an introduction of a factory method that creates SafeFileHandle:

When the user provides us such a handle, we don't know whether the file is seekable or not. To find out, we have to perform a syscall:

int handleType = Interop.Kernel32.GetFileType(handle);
Debug.Assert(handleType == Interop.Kernel32.FileTypes.FILE_TYPE_DISK || handleType == Interop.Kernel32.FileTypes.FILE_TYPE_PIPE || handleType == Interop.Kernel32.FileTypes.FILE_TYPE_CHAR, "FileStream was passed an unknown file type!");
_canSeek = handleType == Interop.Kernel32.FileTypes.FILE_TYPE_DISK;
_isPipe = handleType == Interop.Kernel32.FileTypes.FILE_TYPE_PIPE;

This could be mitigated by extending SafeFileHandle with a lazy-initialized CanSeek property.

The other way of obtaining SafeFileHandle is by creating a FileStream and calling it's SafeFileHandle property. The problem is that accessing SafeFileHandle requires at least one syscall.

When buffering is enabled, the FileStream needs to flush the read|write buffer:

internal override SafeFileHandle SafeFileHandle
{
get
{
// BufferedFileStreamStrategy must flush before the handle is exposed
// so whoever uses SafeFileHandle to access disk data can see
// the changes that were buffered in memory so far
Flush();
return _strategy.SafeFileHandle;
}
}

(if the buffers are empty, no syscalls are being performed)

And when file is seekable (which should always be true for this API), we need to synchronise the file offset with the OS (this is a mandatory syscall as of today and it reduces the cost of every ReadAsync and WriteAsync by tracking the offset in memory):

internal sealed override SafeFileHandle SafeFileHandle
{
get
{
if (CanSeek)
{
// Update the file offset before exposing it since it's possible that
// in memory position is out-of-sync with the actual file position.
FileStreamHelpers.Seek(_fileHandle, _path, _filePosition, SeekOrigin.Begin);
}
return _fileHandle;
}
}

This could be minimized by accessing the FileStream.SafeFileHandle just once:

SafeFileHandle safeFileHandle = fileStream.SafeFileHandle;
safeFileHandle.X(..);
safeFileHandle.X(..);
safeFileHandle.X(..);
safeFileHandle.X(..);

but there would be nothing to stop the users from calling the property many times and paying for the overhead every time:

fileStream.SafeFileHandle.X(..);
fileStream.SafeFileHandle.X(..);
fileStream.SafeFileHandle.X(..);
fileStream.SafeFileHandle.X(..);

@jkotas
Copy link
Member

jkotas commented Mar 25, 2021

ulong, not long

Why ulong? long is much more natural to use within .NET.

@jkotas
Copy link
Member

jkotas commented Mar 25, 2021

There are two ways of obtaining safe file handles. The first one is calling native OS API. It's not cross platform and it requires a very good understanding of the APIs.

Can this be addressed by introducing an file SafeHandle factory method that takes similar arguments as FileStream constructor?

Also, I think it is important that the APIs we expose for this give you access to the OS handle if needed, to let you call any specialized OS-specific methods. Otherwise, one would have to switch to something completely different to do an OS-specific call, or resort to private reflection to fetch the OS handle.

@adamsitnik
Copy link
Member

Why ulong? long is much more natural to use within .NET.

With ulong we don't have to check for negative values. It's just a minor personal preference.

Can this be addressed by introducing an file SafeHandle factory method that takes similar arguments as FileStream constructor?

Good point, I am going to add such method to the alternative proposal

Also, I think it is important that the APIs we expose for this give you access to the OS handle if needed,

Another good point, I am going to add a property that exposes the handle

@danmoseley
Copy link
Member

Is overlapped IO primarily a Windows term? Also, overlapped does not suggest what it is for to anyone not familiar with it. I think your suggestions of RandomAccessFile and AsynchronousRandomAccessFile are good if we create these types.

@davidfowl
Copy link
Member

RandomAccessFile 👍🏾

@stephentoub
Copy link
Member

Can this be addressed by introducing an file SafeHandle factory method that takes similar arguments as FileStream constructor?

+1. This is the same thing I suggested earlier in the thread: #24847 (comment).

RandomAccessFile

Sounds fine.

So when introducing new APIs, we would like to separate async and sync implementations.

Are there known consumers for both? Will FileStream sit on top of these new types?

public string Path { get; }

This should be nullable, unless we require that only SafeFileHandles backed by disk files are used and we have a way of retrieving the original path on all OSes.

IAsyncDisposable

What work would DisposeAsync do that's asynchronous?

public Task ReadAsync
public Task WriteAsync

These should be ValueTask rather than Task.

Also, in these APIs, on Windows, Linux, and macOS, is the underlying operation guaranteed to write out all of the data, or might it only do a partial write?

public ulong Length { get; set; }

What does this do if there isn't a known length?

@adamsitnik
Copy link
Member

adamsitnik commented Mar 26, 2021

Are there known consumers for both?

The original author of the feature request asked for async support only, but I've received some complaints on Twitter about the lack of pread and pwrite support (I can search for a link if needed)

Will FileStream sit on top of these new types?

I would rather expect both types to reuse the same helper methods, without one depending on another.

This should be nullable,

Good point, I am going to fix that.

What work would DisposeAsync do that's asynchronous?

Another good point. I assume that Dispose should just Dispose the handle and don't wait until the pending requests have been finished?

These should be ValueTask rather than Task.

I have explained why they are not returning ValueTask in the proposal. Is my thinking incorrect?

Also, in these APIs, on Windows, Linux, and macOS, is the underlying operation guaranteed to write out all of the data, or might it only do a partial write?

This is an excellent catch 👍 . https://man7.org/linux/man-pages/man2/pwrite.2.html

Note that it is not an error for a successful call to transfer
fewer bytes than requested

I am going to update the proposal to return Task<int>.

What does this do if there isn't a known length?

Only seekable files will be supported, so I would assume that Length will always be available. If not, throw the same error as FileStream.Length does.

@benaadams
Copy link
Member

I am going to check the OS limitations (I can already see that ReadFileScatter works only in async mode with buffering disabled on Windows) and update the proposal.

On WIndows they need to be aligned page sized blocks? ReadFileScatter function

Each buffer must be at least the size of a system memory page and must be aligned on a system memory page size boundary. The system reads one system memory page of data into each buffer.

@adamsitnik
Copy link
Member

@benaadams yes, the full list of NO_BUFFERING requirements: #27408 (comment)

@adamsitnik
Copy link
Member

adamsitnik commented May 10, 2021

Updated proposal

public sealed partial class SafeFileHandle : Microsoft.Win32.SafeHandles.SafeHandleZeroOrMinusOneIsInvalid
{    
    public SafeFileHandle()
    public SafeFileHandle(System.IntPtr preexistingHandle, bool ownsHandle)
    
+   public static SafeFileHandle Open(string filePath, FileMode mode = FileMode.Open, FileAccess access = FileAccess.Read, FileShare share = FileShare.Read, FileOptions options = FileOptions.None, long preAllocationSize = 0)

    public override bool IsInvalid { get; }

    protected override bool ReleaseHandle() { throw null; }
    
+   public ulong Length { get; }
    
+   public int ReadAtOffset(Span<byte> buffer, ulong fileOffset);
+   public int WriteAtOffset(ReadOnlySpan<byte> buffer, ulong fileOffset);

+   public long ReadScatterAtOffset(ReadOnlyMemory<Memory<byte>> buffers, ulong fileOffset);
+   public long WriteGatherAtOffset(ReadOnlyMemory<ReadOnlyMemory<byte>> buffers, ulong fileOffset);
    
+   public ValueTask<int> ReadAtOffsetAsync(Memory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default);
+   public ValueTask<int> WriteAtOffsetAsync(ReadOnlyMemory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default);

+   public ValueTask<long> ReadScatterAtOffsetAsync(ReadOnlyMemory<Memory<byte>> buffers, ulong fileOffset, CancellationToken cancellationToken = default);
+   public ValueTask<long> WriteGatherAtOffsetAsync(ReadOnlyMemory<ReadOnlyMemory<byte>> buffers, ulong fileOffset, CancellationToken cancellationToken = default);
}

Mappings to OS-specific methods:

Windows:

  • ReadAtOffset => ReadFile
  • WriteAtOffset => WriteFile
  • ReadScatterAtOffset => n sequential calls to ReadFile
  • WriteGatherAtOffset => n sequential calls to WriteFile
  • ReadAtOffsetAsync => ReadFile
  • WriteAtOffsetAsync => WriteFile
  • ReadScatterAtOffsetAsync => ReadFileScatter if possible, otherwise n concurrent calls to ReadFile
  • WriteGatherAtOffsetAsync => WriteFileGather if possible, otherwise n concurrent calls to WriteFile

Unix:

  • ReadAtOffset => pread
  • WriteAtOffset => pwrite
  • ReadScatterAtOffset => preadv
  • WriteGatherAtOffset => pwritev
  • ReadAtOffsetAsync => pread (on a ThreadPool thread)
  • WriteAtOffsetAsync => pwrite (on a ThreadPool thread)
  • ReadScatterAtOffsetAsync => preadv (on a ThreadPool thread)
  • WriteGatherAtOffsetAsync => pwritev (on a ThreadPool thread)

Linux: as soon as we add io_uring support, switch to io_uring for Async methods

Exceptions

  • all offset-based methods would throw for files that don't support seeking (example: pipes)
  • all async methods would throw for handles that were not opened for async IO
  • all sync methods would throw for handles that were opened for async IO

@adamsitnik adamsitnik added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-needs-work API needs work before it is approved, it is NOT ready for implementation labels May 10, 2021
@bartonjs
Copy link
Member

bartonjs commented May 13, 2021

Video

  • WriteAtOffset should be void-returning, and throw if it can't complete, like Stream.Write
  • Length shouldn't be a property since it has no meaning on pipes (and throws)
  • Scatter/Gather should use IReadOnlyList as the outer generic
  • We moved the IO operations to a new System.IO.RandomAccess static
  • We moved the SafeHandle.Open to File.OpenHandle
  • We had a discussion about extension methods, decided no for now.
  • We had a discussion about signed vs unsigned and decided signed returns with overloaded inputs.
namespace System.IO
{
    public static class RandomAccess
    {
        public static long GetLength(SafeFileHandle handle) => throw null;

        public static int ReadAtOffset(SafeFileHandle handle, Span<byte> buffer, long fileOffset);
        public static int ReadAtOffset(SafeFileHandle handle, Span<byte> buffer, ulong fileOffset);
        public static void WriteAtOffset(SafeFileHandle handle, ReadOnlySpan<byte> buffer, long fileOffset);
        public static void WriteAtOffset(SafeFileHandle handle, ReadOnlySpan<byte> buffer, ulong fileOffset);

        public static long ReadScatterAtOffset(SafeFileHandle handle, IReadOnlyList<Memory<byte>> buffers, long fileOffset);
        public static long ReadScatterAtOffset(SafeFileHandle handle, IReadOnlyList<Memory<byte>> buffers, ulong fileOffset);
        public static void WriteGatherAtOffset(SafeFileHandle handle, IReadOnlyList<ReadOnlyMemory<byte>> buffers, long fileOffset);
        public static void WriteGatherAtOffset(SafeFileHandle handle, IReadOnlyList<ReadOnlyMemory<byte>> buffers, ulong fileOffset);
    
        public static ValueTask<int> ReadAtOffsetAsync(SafeFileHandle handle, Memory<byte> buffer, long fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask<int> ReadAtOffsetAsync(SafeFileHandle handle, Memory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask WriteAtOffsetAsync(SafeFileHandle handle, ReadOnlyMemory<byte> buffer, long fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask WriteAtOffsetAsync(SafeFileHandle handle, ReadOnlyMemory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default);

        public static ValueTask<long> ReadScatterAtOffsetAsync(SafeFileHandle handle, IReadOnlyList<Memory<byte>> buffers, long fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask<long> ReadScatterAtOffsetAsync(SafeFileHandle handle, IReadOnlyList<Memory<byte>> buffers, ulong fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask WriteGatherAtOffsetAsync(SafeFileHandle handle, IReadOnlyList<ReadOnlyMemory<byte>> buffers, long fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask WriteGatherAtOffsetAsync(SafeFileHandle handle, IReadOnlyList<ReadOnlyMemory<byte>> buffers, ulong fileOffset, CancellationToken cancellationToken = default);
    }

    partial class File
    {
        public static SafeFileHandle OpenHandle(string filePath, FileMode mode = FileMode.Open, FileAccess access = FileAccess.Read, FileShare share = FileShare.Read, FileOptions options = FileOptions.None, long preAllocationSize = 0)
    }
}

namespace Microsoft.Win32.SafeHandles
{
    partial class SafeFileHandle
    {
        public bool IsAsync { get; }
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 13, 2021
@stephentoub
Copy link
Member

Why aren't all of these names just Read/Write{Async}? I don't see the benefit of "AtOffset" when there's already an argument "fileOffset" that's required, and there's no precedent for using the terminology "Scatter"/"Gather" in .NET APIs, e.g. the Socket methods that support such things are just overloads of Receive/Send{Async}. These should just be overloads of "Read/Write{Async}". We're talking about adding such overloads to Stream as well, and there these should definitely just be overloads, not names Including this additional terminology.

@bartonjs, @terrajobst, can we revisit this?

@stephentoub stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-approved API was approved in API review, it can be implemented labels May 21, 2021
@bartonjs
Copy link
Member

Why aren't all of these names just Read/Write{Async}? I don't see the benefit of "AtOffset" when there's already an argument "fileOffset" that's required

Personally, I find it providing clarity that it's entirely disassociated with the file handle's inbuilt notion of position. It's a completely different kind of read from fread or read(2), and I see the method name making that abundantly clear is a good thing.

there's no precedent for using the terminology "Scatter"/"Gather" in .NET APIs, e.g. the Socket methods that support such things are just overloads of Receive/Send{Async}. These should just be overloads of "Read/Write{Async}".

The impression I got from the review was that we wanted the terms if we added it to Stream. If there's dissent on that point and you think we're setting up for mismatch, that's worth bringing up again.

@stephentoub
Copy link
Member

stephentoub commented May 21, 2021

I find it providing clarity that it's entirely disassociated with the file handle's inbuilt notion of position.

The fileOffset argument provides that. What else could it mean? On top of that, these are non-extension static methods on a class named RandomAccess, which itself conveys it's not just going from some built-in location for sequential streaming.

The impression I got from the review was that we wanted the terms if we added it to Stream

From my perspective, that's a bad direction. We should not do that. Whether the methods use special OS APIs or do multiple read/writes or copy into a buffer with a single read/write is an implementation detail. At the end of the day, the only difference here is the data type of the argument, and whether it's a Memory or a list or a ReadOnlySequence or whatever, it's still the same read/write operation, just with various types holding it. That's what we use overloads for.

@bartonjs
Copy link
Member

these are non-extension static methods on a class named RandomAccess,

Being extension methods was discussed. If it happens then the call just looks like handle.Read(destination, dataOffset), which could be an expectation of the read length, or an offset into destination, or possibly other interpretations. As handle.ReadAtOffset(destination, dataOffset) it feels pretty unambiguous.

Even as a static invocation, RandomAccess.Read(handle, destination, dataOffset) still doesn't feel as clear to me as RandomAccess.ReadAtOffset(handle, destination, dataOffset). The "RandomAccess" part stands out as "not a normal read", but the integer could easily be a position-relative offset, or some sort of length, or an I/O priority level, or something. Again, I (personally) think that it provides more clarity than it costs in characters typed or line length.

But I think we spent more time talking about the name when it was proposed as an instance method on SafeFileHandle than after it was a static non-extension method on RandomAccess. So maybe I'm the odd man out.

@stephentoub
Copy link
Member

stephentoub commented May 22, 2021

As handle.ReadAtOffset(destination, dataOffset) it feels pretty unambiguous.

I don't see how that disambiguates offset in destination from file offset. If that's really a concern, it should be AtFileOffset rather than AtOffset.

Regarding extension methods, as you say, that was discussed and dismissed. As static non- extension methods, this additional wording is unnecessary. For me, it's too verbose and duplicative, saying the same thing multiple times when it's already evident from the RandomAccess class name and the required fileOffset parameter name exactly what this is.

We call these methods Read and Write everywhere else, with semantics impacted by what arguments they take; we should do the same here.

@stephentoub
Copy link
Member

stephentoub commented May 22, 2021

(That said, I care much more about not including Scatter and Gather in the names. The mechanism by which these are implemented is an implementation detail, we overload by data type in .NET for these situations, I don't want .NET devs to have to learn the scatter/gather terminology, and we already don't use those words for existing APIs in .NET that provide such semantics, e.g. on Socket.)

@bartonjs
Copy link
Member

bartonjs commented May 25, 2021

Video

  • On re-review we decided to remove the names Scatter and Gather, as we already have precedent for these concepts without the name.
  • On re-review we feel that, on the whole, the "AtOffset" particles aren't necessary with the static method invocation. Since there's a fair hesitation for having these as instance methods on the SafeHandle type or even extension methods, we can go ahead and remove the redundancy.
namespace System.IO
{
    public static class RandomAccess
    {
        public static long GetLength(SafeFileHandle handle) => throw null;

        public static int Read(SafeFileHandle handle, Span<byte> buffer, long fileOffset);
        public static int Read(SafeFileHandle handle, Span<byte> buffer, ulong fileOffset);
        public static void Write(SafeFileHandle handle, ReadOnlySpan<byte> buffer, long fileOffset);
        public static void Write(SafeFileHandle handle, ReadOnlySpan<byte> buffer, ulong fileOffset);

        public static long Read(SafeFileHandle handle, IReadOnlyList<Memory<byte>> buffers, long fileOffset);
        public static long Read(SafeFileHandle handle, IReadOnlyList<Memory<byte>> buffers, ulong fileOffset);
        public static void Write(SafeFileHandle handle, IReadOnlyList<ReadOnlyMemory<byte>> buffers, long fileOffset);
        public static void Write(SafeFileHandle handle, IReadOnlyList<ReadOnlyMemory<byte>> buffers, ulong fileOffset);
    
        public static ValueTask<int> ReadAsync(SafeFileHandle handle, Memory<byte> buffer, long fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask<int> ReadAsync(SafeFileHandle handle, Memory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask WriteAsync(SafeFileHandle handle, ReadOnlyMemory<byte> buffer, long fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask WriteAsync(SafeFileHandle handle, ReadOnlyMemory<byte> buffer, ulong fileOffset, CancellationToken cancellationToken = default);

        public static ValueTask<long> ReadAsync(SafeFileHandle handle, IReadOnlyList<Memory<byte>> buffers, long fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask<long> ReadAsync(SafeFileHandle handle, IReadOnlyList<Memory<byte>> buffers, ulong fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask WriteAsync(SafeFileHandle handle, IReadOnlyList<ReadOnlyMemory<byte>> buffers, long fileOffset, CancellationToken cancellationToken = default);
        public static ValueTask WriteAsync(SafeFileHandle handle, IReadOnlyList<ReadOnlyMemory<byte>> buffers, ulong fileOffset, CancellationToken cancellationToken = default);
    }

    partial class File
    {
        public static SafeFileHandle OpenHandle(string filePath, FileMode mode = FileMode.Open, FileAccess access = FileAccess.Read, FileShare share = FileShare.Read, FileOptions options = FileOptions.None, long preAllocationSize = 0)
    }
}

namespace Microsoft.Win32.SafeHandles
{
    partial class SafeFileHandle
    {
        public bool IsAsync { get; }
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 25, 2021
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 3, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 15, 2021
@adamsitnik adamsitnik removed the blocking Marks issues that we want to fast track in order to unblock other important work label Jun 15, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jul 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.IO
Projects
None yet
Development

Successfully merging a pull request may close this issue.