Please consider Node-like pread / pwrite operations #34

Closed
Gozala opened this issue Nov 15, 2018 · 12 comments

@Gozala

Gozala commented Nov 15, 2018

Many p2p use cases depend on pread / pwrite operations, that is, reading or writing a byte range at a provided offset without changing the file position, as that allows scheduling parallel read / write operations.

While it is possible to use a hand-rolled scheduler and queue read / write operations, doing so greatly affects throughput, and writing such code is non-trivial.
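
For reference, a minimal sketch of this pattern using Node's fs.promises FileHandle API (the file name and offsets are placeholders):

```js
import { open } from 'node:fs/promises';

const file = await open('data.bin', 'r+');
const buf = Buffer.alloc(1024);

// Each call carries its own position, so there is no shared file cursor
// to coordinate: the operations can safely run in parallel.
await Promise.all([
  file.read(buf, 0, buf.length, 4096),          // pread-like: read at offset 4096
  file.write(Buffer.from('chunk'), 0, 5, 8192), // pwrite-like: write at offset 8192
]);

await file.close();
```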

@Gozala
Author

Gozala commented Nov 15, 2018

I forgot to elaborate on why this is necessary for p2p use cases: a participating peer usually gets different chunks from different peers in the swarm, and also seeds those chunks as they are requested. That is why reading / writing specific byte ranges is important for such use cases.

@Gozala
Author

Gozala commented Nov 15, 2018

/cc @mafintosh here as he could offer far more detailed insight into this than I can.

@mkruisselbrink
Contributor

At least the currently described API only has pread/pwrite-like operations (well, pwrite; for reading it relies on the existing blob reading APIs, which, thanks to slicing, are more or less pread-like anyway). I expect we'll end up with some combination of a pread/pwrite-like API and a way to read/write using ReadableStream/WritableStream, which hopefully should address all use cases. The exact shape is TBD, but it is good to hear that there is a strong desire for pread/pwrite-style APIs.

@Gozala
Author

Gozala commented Nov 29, 2018

> At least the currently described API only has pread/pwrite-like operations (well, pwrite;

Did it change? I believe when I submitted this issue, the writer did not have a position argument. Glad to see it has one now.

> for reading it relies on the existing blob reading APIs, which, thanks to slicing, are more or less pread-like anyway).

Are you referring to the FileReader API? If so, I would disagree: it requires reading the whole file and taking only a slice of it after the fact. For some of these applications that would be a problem; for instance, Dat stores all its data in a single sparse file, and loading it all into memory just to reach a few bytes would be highly inefficient.

> I expect we'll end up with some combination of a pread/pwrite-like API and a way to read/write using ReadableStream/WritableStream, which hopefully should address all use cases. The exact shape is TBD

👍

@mkruisselbrink
Contributor

> Did it change? I believe when I submitted this issue, the writer did not have a position argument. Glad to see it has one now.

Yeah, at least at some point the sample code didn't have that position argument, but I changed my mind :)

> Are you referring to the FileReader API? If so, I would disagree: it requires reading the whole file and taking only a slice of it after the fact. For some of these applications that would be a problem; for instance, Dat stores all its data in a single sparse file, and loading it all into memory just to reach a few bytes would be highly inefficient.

That is not necessarily true. You can just call slice() on the blob/file you have, and then only pass the sliced blob to FileReader (or wrap the sliced blob in a Response to use a more modern reading API). Creating a slice() should be fairly efficient, and at least for file-backed blobs in Chrome it will not involve any reading or copying of the data.
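
Concretely, a sketch (where `file` is a File or Blob, and `offset` / `length` come from the caller):

```js
// slice() creates a cheap view onto the blob; only the requested byte
// range is actually read from disk when the Response body is consumed.
const chunk = file.slice(offset, offset + length);
const bytes = new Uint8Array(await new Response(chunk).arrayBuffer());
```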

@Gozala
Author

Gozala commented Nov 29, 2018

> That is not necessarily true. You can just call slice() on the blob/file you have, and then only pass the sliced blob to FileReader (or wrap the sliced blob in a Response to use a more modern reading API). Creating a slice() should be fairly efficient, and at least for file-backed blobs in Chrome it will not involve any reading or copying of the data.

I did not realize that; thanks for the clarification!

@mkruisselbrink
Contributor

The writing API was changed, but it still has pwrite/random-access capabilities. The previous `writer.write(position, data)` can now be written as either `writable.write({type: 'write', data: data, position: position})` or `writable.seek(position); writable.write(data);`.
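
For illustration, both forms side by side (a sketch; `fileHandle` is assumed to be a FileSystemFileHandle obtained elsewhere):

```js
const writable = await fileHandle.createWritable();

// Positional write: the offset travels with the call.
await writable.write({ type: 'write', position: 1024, data: chunk });

// Seek-based write: move the cursor, then write at the current position.
await writable.seek(2048);
await writable.write(chunk);

await writable.close();
```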

So, since I believe this is addressed, I'm closing this issue.

@Gozala
Author

Gozala commented Apr 16, 2020

I have not been paying close attention to this, but there's a fundamental difference between reading / writing at random offsets and using a seek-based API: the latter requires task coordination to avoid races on concurrent access, while the former doesn't.
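
A sketch of the hazard (illustrative only; `stream` stands for any shared handle with a seek-based API):

```js
// With a shared cursor, two concurrent callers can interleave between
// seek() and write(), so one task's data lands at the other's offset:
async function seekWrite(stream, offset, chunk) {
  await stream.seek(offset);  // (a) sets the shared cursor
  await stream.write(chunk);  // (b) writes wherever the cursor is *now*
}

// With positional (pwrite-style) calls there is no shared cursor, so
// concurrent calls cannot interfere:
async function positionalWrite(stream, offset, chunk) {
  await stream.write({ type: 'write', position: offset, data: chunk });
}
```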

@Gozala
Author

Gozala commented Apr 16, 2020

From what I can tell looking at the current spec:

> Writes the content of data into the file associated with stream at position bytes from the top of the file. Also updates the current file cursor offset to the end of the written data.

Even writes with a specified offset would exhibit the problem mentioned in my previous comment, as updating the position affects concurrent reads / writes.

@mkruisselbrink
Contributor

Currently all writes operate on a temporary/shadow file, and until you close the writer no changes are reflected in the actual file. You can't read from the shadow file, so there is definitely no issue between reads and writes. (We do realize this is somewhat limiting, and there definitely are use cases where this model doesn't work; in particular, your use case isn't very efficient in this model.)
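
For illustration, the shadow-file semantics look roughly like this (a sketch; `fileHandle` is assumed to be a FileSystemFileHandle):

```js
const writable = await fileHandle.createWritable();
await writable.write({ type: 'write', position: 0, data: chunk });

// At this point the on-disk file is unchanged: the write went to the
// temporary/shadow file, and fileHandle.getFile() still returns the
// old contents.

await writable.close(); // the changes become visible atomically here
```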

I'm not sure I entirely understand the concern about writes either. Is the concern there that if you use both "pwrite" style writes and writes without explicit offset those might interfere with each other? I'm not sure I understand when you'd be in such a situation. I'd expect in most cases you'd either want to write at an explicit offset, or you'd want to just stream data at the "current position". What use cases would require mixing the two?

@Gozala
Author

Gozala commented Apr 17, 2020

@mkruisselbrink let me try to set some context on the use case. Many p2p applications tend to create an arbitrary number of network connections to get chunks of data in parallel, which are then written to the underlying file. This implies that many concurrent read and write operations can occur. A more specific example is Dat, which uses random-access-storage for data storage. The random-access-file implementation is very performant, as Node.js provides pread / pwrite and therefore read / write operations do not require task queues. On the other end of the spectrum is random-access-idb, which is the only cross-browser solution, but it also performs poorly because IndexedDB has no way to read ranges of data. And then there is random-access-idb-mutable-file, which uses a non-standard Mozilla IndexedDB extension. That API is the most similar to the one proposed here. If two concurrent reads/writes occur, the file position is updated mid-read/write, and the result is incorrect data being read (or written, causing corruption). To avoid this problem the implementation uses a task queue that ensures a new read / write does not start while another is in flight (see the sketch below).
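
The workaround looks roughly like this (a sketch; `readRange` and `writeRange` are hypothetical helpers that seek and then read or write):

```js
// Chain every operation onto a promise "tail" so that no read / write
// starts while another one is still in flight.
let tail = Promise.resolve();

function enqueue(op) {
  const result = tail.then(op);
  tail = result.catch(() => {}); // keep the chain alive after failures
  return result;
}

// Concurrent callers are transparently serialized:
enqueue(() => readRange(file, 0, 4096));
enqueue(() => writeRange(file, 4096, chunk));
```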

This queuing limitation significantly affects performance, which is why I brought this issue up. That being said, if changes aren't reflected until the file is closed, that is probably going to be even more problematic than the lack of pread/pwrite.

It is also worth pointing out that this is not tied to Dat specifically. Any use case that assumes concurrent access to a file large enough to be impractical to keep in memory will face these challenges.

@mkruisselbrink
Contributor

Thanks for the detailed explanation. Unfortunately, that is indeed a use case that isn't very well supported today. Something like the NativeIO proposal might be a better fit, with the caveat that, as currently envisioned, it only gives access to origin-scoped/private files. That might still work for P2P applications, though, if they're fine with not storing the final downloaded file to a user-specified location until the entire file is downloaded?

Either way, we'll definitely keep use cases like this in mind as we think about the future of this API...
