How to represent a sequence of bytes #39
Comments
One of the nice advantages to using a |
Yes, these buffers are meant to be shared between threads. When querying a layer, all intermediate objects (SubjectLookup, SubjectPredicateLookup, PredicateLookup, ObjectLookup) will contain their own Arcs with the byte arrays that are relevant to their query. This allows lookup objects to be safely transferred between threads, completely independent of what may happen to the layer afterwards. I think you're probably right that Bytes is a better interface for us. I was not aware that we could write custom implementations for it, so that's pretty nice. That said, I'm seriously considering ditching the SharedMmap in favor of just reading the data into memory normally. This'll eliminate our only use of unsafe, and it'll actually be safer, as we currently aren't prepared to deal with files in the store disappearing or changing while they're mmap'ed. So regarding best representation, I think you're right about Bytes being a good candidate for that. There are actually some places in the code where we now explicitly track an offset+length, which we wouldn't have to with Bytes. |
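As a rough illustration of the sharing described in this comment, here is a minimal std-only sketch of a reference-counted buffer view that carries its own offset and length. `SharedBuf`, `slice`, `as_slice` and `sum_on_other_thread` are hypothetical names invented for this example; they are not terminus-store types and not the `bytes` crate API, although `Bytes` offers comparable zero-copy slicing.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a shared, read-only byte buffer.
/// The view remembers its own offset and length, so callers do not
/// have to track them separately.
#[derive(Clone, Debug)]
pub struct SharedBuf {
    data: Arc<[u8]>,
    offset: usize,
    len: usize,
}

impl SharedBuf {
    /// Takes ownership of `bytes` and wraps them in a reference-counted buffer.
    pub fn new(bytes: Vec<u8>) -> Self {
        let len = bytes.len();
        SharedBuf { data: Arc::from(bytes), offset: 0, len }
    }

    /// Returns a view of `start..end` (relative to this view) without copying.
    pub fn slice(&self, start: usize, end: usize) -> Self {
        assert!(start <= end && end <= self.len, "slice out of bounds");
        SharedBuf {
            data: Arc::clone(&self.data),
            offset: self.offset + start,
            len: end - start,
        }
    }

    /// Borrows the bytes covered by this view.
    pub fn as_slice(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
}

/// Moves a view to another thread and sums its bytes there.
/// The view keeps the backing allocation alive on its own.
pub fn sum_on_other_thread(view: SharedBuf) -> u32 {
    thread::spawn(move || view.as_slice().iter().map(|&b| u32::from(b)).sum::<u32>())
        .join()
        .expect("worker thread panicked")
}
```

A lookup object could hold such a view and outlive the layer it came from: dropping the original `SharedBuf` only decrements the reference count, and the bytes stay valid for every remaining view.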
Thanks for the explanation!
I'm not sure what you meant by custom implementation here, so, just in case it wasn't clear, what I wrote in #40 is an alternative to As for actually customizing
Okay. Can you say a bit about why you chose to use
I'd be happy to have a go at implementing this. Shall I? |
Ah, I misunderstood.
My main motivation for using mmap was that it'd allow us to easily load more graph layers into memory than we have physical memory for. Any mmap'ed file can be swapped in and out of memory freely since it is backed by physical storage anyway. There's other problems with using mmap too. Mmap basically loads in pages on demand, whenever they are needed due to a memory access (actual implementations are actually a bit smarter and will try to predict what pages you need in the future too, but I digress). When it needs to load a page, it suspends the thread, does a bunch of disk io to load in the page, then it resumes the thread. This does not play nice with tokio, which assumes that threads don't randomly go to sleep. It is also more unpredictable when exactly such a load will take place, unlike with an explicit file read.
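To make the contrast with mmap concrete, here is a minimal sketch of an explicit read, assuming a plain blocking `std::fs::read` call. `load_file` is a hypothetical helper name, not something taken from terminus-store; inside a tokio runtime one would presumably use an async read such as `tokio::fs::read` instead, which this std-only example does not show.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::sync::Arc;

/// Reads the whole file at `path` into memory in one explicit step.
/// All disk I/O happens inside this call; later accesses to the
/// returned buffer never touch the file again.
pub fn load_file(path: &Path) -> io::Result<Arc<[u8]>> {
    let contents: Vec<u8> = fs::read(path)?;
    Ok(Arc::from(contents))
}
```

With this approach the point at which the thread may block on disk is visible in the code, and the buffer remains valid even if the file is later removed or changed. The trade-off mentioned above still applies: the data now occupies ordinary heap memory rather than pages the OS can drop and reload from the file.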
Feel free to do so! |
That all makes sense. It also sounds eerily similar to the motivation many database management systems use for managing their own memory and disk I/O rather than hand that over to the virtual memory system in the OS. I'll work on this next, after I finish what I'm doing now. |
* Transition all uses of `AsRef<[u8]>` to `Bytes`
* Use `clone` in `LogArrayIterator`, remove `OwnedLogArrayIterator`
* Track `first` element in `LogArray`, remove `LogArraySlice`
* Use `Bytes` in `FileLoad`, remove `Map` associated type
* Read files into memory, remove `memmap::Mmap`

Closes terminusdb#39
I'm opening up this issue to discuss the appropriate representation for a buffer (i.e. an arbitrary contiguous sequence of bytes) in `terminus-store`. This discussion will help me to get an understanding for the motivation and mechanics of the current approach and to probe for reactions to an alternative approach, which I propose at the end. Please feel free to comment on anything or to correct my understanding if necessary.

Currently, the predominant view of a buffer appears to be `M: AsRef<[u8]>`. This type implies two things:

* `data: M` has the operation `data.as_ref()` that returns `&[u8]`. This gives a read-only view of a buffer that can be shared between threads without the option of writing to it.
* `data: M` owns the value referencing the buffer. There is no borrowing of references here.

This appears to have been changed from a previously predominant view of a buffer as a slice: `data: &'a [u8]` (1deedbf, bf6416b, ad7dd42, e5a50a0, c6a14f9). This view meant:

* `data: &'a [u8]` cannot be shared between threads.
* `'a` lifetime.

Now, given that the buffers currently seem to be backed by one of the two following `struct`s:

`pub struct SharedVec(pub Arc<Vec<u8>>);`
`pub struct SharedMmap(Option<Arc<FileBacking>>);`

which both have `Arc`, I presume that the data is being shared read-only between threads. (I'm actually not yet clear on where the sharing is occurring, so if you want to enlighten me, I'd appreciate it!) If there was no sharing, I think the slice approach is better, since (a) there is less runtime work to manage usage of the buffers and (b) the type system keeps track of the lifetimes.

I think using `M: AsRef<[u8]>` is somewhat painful as a schema for typing a buffer. It's too general and leads to trait bounds such as `M: 'static + AsRef<[u8]> + Clone + Send + Sync` in many places.

After doing some research, I think something like `Bytes` from the `bytes` crate would work better. `Bytes` is a thread-shareable container representing a contiguous sequence of bytes. It satisfies `'static + AsRef<[u8]> + Clone + Send + Sync`. It also supports operations like `split_to` and `split_off`, which I think would work well when you want to segment a buffer into different representations. Replacing `data: M` with `data: Bytes` would make many of the trait bounds disappear.

Unfortunately, `Bytes` does not support `memmap::Mmap`, which means it would not suit `terminus-store`'s current usage of `AsRef<[u8]>`. However, I've already implemented an adaptation of `Bytes` that does support `memmap::Mmap`. Others have, too. See tokio-rs/bytes#359.

Here are some questions prompted by the above:

* `terminus-store`?
* `AsRef<[u8]>`? Could that type be a `struct` instead of a set of trait bounds?
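The trait-bound concern in the issue description can be sketched as follows. This is a std-only illustration under stated assumptions: `GenericHeader`, `ConcreteHeader`, `Buf` and `first_byte` are hypothetical names made up for the example, and `Buf` is only a stand-in that mimics the documented behaviour of `Bytes::split_to` (return the first `at` bytes, keep the rest); it is not the `bytes` crate itself.

```rust
use std::sync::Arc;

/// Generic style: the buffer type is a parameter, and the bounds
/// have to be restated on every impl block that uses it.
pub struct GenericHeader<M> {
    data: M,
}

impl<M: 'static + AsRef<[u8]> + Clone + Send + Sync> GenericHeader<M> {
    pub fn new(data: M) -> Self {
        GenericHeader { data }
    }

    pub fn first_byte(&self) -> Option<u8> {
        let bytes: &[u8] = self.data.as_ref();
        bytes.first().copied()
    }
}

/// Concrete style: a hypothetical, std-only stand-in for a `Bytes`-like type.
#[derive(Clone, Debug)]
pub struct Buf {
    data: Arc<[u8]>,
    start: usize,
    end: usize,
}

impl Buf {
    pub fn new(bytes: Vec<u8>) -> Self {
        let end = bytes.len();
        Buf { data: Arc::from(bytes), start: 0, end }
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.data[self.start..self.end]
    }

    /// Splits off and returns the first `at` bytes; `self` keeps the rest.
    /// Both halves share the same allocation, so nothing is copied.
    pub fn split_to(&mut self, at: usize) -> Buf {
        assert!(at <= self.end - self.start, "split_to out of bounds");
        let head = Buf {
            data: Arc::clone(&self.data),
            start: self.start,
            end: self.start + at,
        };
        self.start += at;
        head
    }
}

/// With a concrete buffer type, no trait bounds are needed at all.
pub struct ConcreteHeader {
    data: Buf,
}

impl ConcreteHeader {
    pub fn new(data: Buf) -> Self {
        ConcreteHeader { data }
    }

    pub fn first_byte(&self) -> Option<u8> {
        self.data.as_slice().first().copied()
    }
}
```

The two header types do the same thing, but only the generic one forces every caller and impl to carry the `'static + AsRef<[u8]> + Clone + Send + Sync` bound list. The `split_to` method shows how one file-sized buffer could be segmented into several independent views.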