Replace buffer management with Arrow buffers #173
Conversation
This implements a second version of the Query interface using Arrow Arrays. The core idea here is that we can share Arrow arrays freely and then convert them to mutable buffers as long as there are no external references to them. This is all done safely and returns an error if any buffer is externally referenced.
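For concreteness, a minimal sketch of that shared -> mutable round trip, assuming arrow-buffer's public `Buffer::into_mutable` API (this is an illustration, not the PR's own code):

```rust
use arrow_buffer::Buffer;

fn main() {
    // A shared, immutable Arrow buffer.
    let shared: Buffer = Buffer::from_vec(vec![0u64; 1024]);

    // `into_mutable` succeeds only when this handle is the sole
    // reference; otherwise it hands the shared buffer back unchanged.
    match shared.into_mutable() {
        Ok(mut mutable) => {
            // Exclusive ownership: safe to let an FFI call write here.
            mutable.as_slice_mut()[0] = 42;
            // Transition back to shared for zero-copy consumption.
            let _shared_again: Buffer = mutable.into();
        }
        Err(still_shared) => {
            // An external reference exists; surface an error instead
            // of copying or aliasing the memory.
            drop(still_shared);
        }
    }
}
```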
@rroelke When you go to review this I'd start with the new …

As mentioned earlier, the thing I'm least happy about is that …

Another thing I've not fully fleshed out is the strict/relaxed compatibility modes. This is not about UTF-8 validation, but about mapping unrepresentable TileDB types. Strict mode rejects attempting to read things like …

I'm also not 100% sold on the QueryBuilder interface that I currently have. I think with some slight changes we could decrease the verbosity even further.

Another thing I'd expect to change that isn't addressed by this PR is figuring out a way to allow for using Arrow types when creating schemas. It's a bit weird that we create schemas with the TileDB types and then read data using Arrow types; I'd expect to be able to specify Arrow types in the schema creation code. I've left that off since it's something to tackle once we've decided whether or not to go this direction.

Also, there are absolutely zero tests in this PR other than the ported examples that touch arrays, which will obviously need to change.

Bottom line: consider this an official proposal to switch our query interface to use Arrow arrays as the buffer interface in tiledb-rs. As currently implemented, this is orthogonal to the existing query interface so it could be merged as is, but feel free to come up with a whole laundry list of things you'd like to see added/changed/removed before accepting. I just think this is a pretty good point to try and get your buy-in that this is a solid change worth pursuing throughout the rest of this repo and the others using these crates.
Submitting my first round of comments. Certainly I'm not done reviewing.
```rust
    .add_range("cols", &[1, 4])
    .end_subarray()
    .build()
    .map_err(|e| TileDBError::Other(format!("{e}")))?;
```
Looks like an easy fix, but it shouldn't be necessary to map the error from the tiledb API to turn it into a tiledb result.

It's not something to add to this PR, but I tend to think we should go back to tiledb-rs and re-do our errors in the style that we've been doing in tables.
I just pushed a commit that cleans all this up. For the first few examples I ignored it as orthogonal, but then got annoyed at it. All I've done is pushed an `impl From<MyError> for TileDBError`, though, so the end result is the same: formatting into an `Other` variant. But the code is cleaner.

Beyond that, I totes agree that the error handling needs to be overhauled now that we've got a better handle on how things fit together.
```rust
    .map_err(|e| TileDBError::Other(format!("{e}")))?;

if !matches!(status, QueryStatus::Completed) {
    return Err(TileDBError::Other("Make this better.".to_string()));
```
Seems easy to miss this without something easy to search for, like `TODO`, `FIXME`, or `todo!()`.
Yeah, fixed now. I was just hurrying to get the examples updated.
```rust
use super::QueryBuilder;

/// Default field capacity is 10MiB
const DEFAULT_CAPACITY: usize = 1024 * 1024 * 10;
```
As a per-field value this inevitably leads to under-utilized memory. If you have a u8 and a u64 field, for example, then allocating 16MiB for each means the u64 buffer holds 2M cells, so only 2MiB of the u8 buffer would ever be used.

A default seems important, but I'd suggest pushing the default into the query, which can choose a default for each field. Probably multiply the number of fields by 10MiB and then parcel it out.

Of course, this likely is not a new problem, and perhaps also not one to solve within this PR... so I think my recommendation here would be that we create a story prior to merging.
+1. I didn't spend any time thinking too hard on this, and as you noticed I removed the entire `weight` concept to avoid going down a rabbit hole. Currently, buffer allocation is basically hard-coded and done inside the query/QueryBuilder interface. My general thought is that we'd move buffer allocation out of the query interface and into something like the query adaptor interface.
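A rough sketch of the "total budget parceled per field" idea; `FieldInfo` and the sizing rule here are hypothetical, not code from this PR:

```rust
/// The existing per-field default, kept as the basis of the total budget.
const DEFAULT_PER_FIELD: usize = 1024 * 1024 * 10; // 10MiB

/// Hypothetical per-field description; not a type from this PR.
struct FieldInfo {
    name: String,
    /// Byte width of one value of this field's type.
    value_size: usize,
}

/// Split `fields.len() * DEFAULT_PER_FIELD` so that every field can
/// hold the same number of cells rather than the same number of bytes.
fn allocate(fields: &[FieldInfo]) -> Vec<(String, usize)> {
    let budget = fields.len() * DEFAULT_PER_FIELD;
    let bytes_per_cell: usize = fields.iter().map(|f| f.value_size).sum();
    let cells = budget / bytes_per_cell.max(1);
    fields
        .iter()
        .map(|f| (f.name.clone(), cells * f.value_size))
        .collect()
}
```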
```rust
    }
}

pub struct QueryFieldsBuilderForQuery {
```
Needs a doc comment. If this one is "ForQuery" then what's the first one for?
I infer that the answer is that this one can add fields to an existing query.
First off, totes agree that we're gonna need better documentation. Though this is the API stuff I'm not 100% sold on; a lot of this was just written in "make the interesting bits possible" mode, so it's just enough to work. I think the fields handling in particular could use some improvement now that I've used it a bit.

The basic idea here was to support two ways of interacting with the QueryBuilder for adding fields:
```rust
let mut query = QueryBuilder::new(array, QueryType::Read)
    .start_fields()
    .field("foo")
    .field("bar")
    .end_fields()
    .build()?;

let fields = QueryFields::default()
    .field("foo")
    .field("bar")
    .build();

let mut query = QueryBuilder::new(array, QueryType::Read)
    .with_fields(fields)
    .build()?;
```
And amusingly, I've used both of those in the examples. The first one is useful enough for writing example code, though I wouldn't be sad to lose it in the grand scheme of things. The second approach is used in the `reading_incomplete_arrow.rs` example because it fits with the approach I landed on for buffer reallocation.
```rust
}

pub fn end_fields(self) -> QueryBuilder {
    self.query_builder.with_fields(self.fields_builder.build())
```
If a field is specified in the first `QueryBuilder`, then this will replace it with the new request, right?
Yep, the idea was that users would use one or the other, not both. I really think the field spec needs work because of the way it limits easily adding options. Off the top of my head, perhaps something like this:
```rust
let buffer = Arc::new(Int32Array::from(vec![0i32; 10]));

let mut query = QueryBuilder::new()
    .with_fields([
        ("example_1", buffer),
        ("example_2", vec![FieldArg::Capacity(100)]),
        ("example_3", vec![FieldArg::Capacity(100), FieldArg::RelaxedConversion]),
    ])
    .build()?;
```
And then `with_fields` would just take an `Into<QueryFields>` generic type. I suddenly miss atoms for API design...
```rust
RequiresVarSized(adt::DataType),

#[error("TileDB does not support timezones on timestamps")]
TimeZonesNotSupported,
```
Love this
```rust
    nullable: bool,
) -> Result<adt::DataType> {
    if let Some(arrow_type) = self.default_arrow_type(dtype) {
        self.convert_datatype_to(dtype, cvn, nullable, arrow_type)
```
This is great; the source/target type pairing is important, but this also makes it not required. Very nice.
Though perhaps the `default_arrow_type` should also account for the cvn and nullability? When requesting the default type, conversion to whatever the default is should always work.

When requesting a specific type, it's odd that `convert_datatype_to` can return something else. "Yes we can work with that" or "no way bro" seems like a more natural answer than "yes, here's what you're actually going to get".
I'm quite happy with how it separates the "pick what to convert to" logic from the "check if the conversion is legit" logic.

> it's odd that `convert_datatype_to` can return something else.

This one took me a second even after I went back and re-read that method. I totally glossed over the List/FixedSizeList conversion before realizing that's what you must be referring to.

While writing this, I've started to think of the List/FixedSizeList/Binary/Utf8 types as just an encoding of TileDB's CellValNum concept into the type system. As far as I'm concerned, these are basically non-negotiable parts of the conversion process. I.e., given a TileDB field with `Datatype::Int64, CellValNum(2)`, it's not really representable in anything other than a `FixedSizeList(fieldWithInt64Type, 2)`. Same for `CellValNum::Var` going into a `List`. Now, we can obviously change the `Int64` to anything with a compatible width (i.e., any time type, UInt64, w/e), but the List/FixedSizeList modifiers can't change.

The only wrench in that logic is obviously the Binary/Utf8 dichotomy, since those are basically unrelated (type-system wise) though basically logically equivalent to `List`. My guess is that this is more engineering practicality than anything else, i.e., allowing for more efficient conversion to and from String types and the like, rather than encoding that on top of the more abstract `List` interface.
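A sketch of that mapping, with an illustrative `CellValNum` standing in for the TileDB concept; the wrapper is dictated by the cvn while the element type stays swappable:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field};

/// Illustrative stand-in for TileDB's cell val num.
enum CellValNum {
    Single,
    Fixed(u32),
    Var,
}

fn wrap_for_cvn(element: DataType, cvn: CellValNum) -> DataType {
    let item = |dt: DataType| Arc::new(Field::new("item", dt, false));
    match cvn {
        // CellValNum(1): the element type stands alone.
        CellValNum::Single => element,
        // CellValNum(n): non-negotiably a FixedSizeList of width n.
        CellValNum::Fixed(n) => DataType::FixedSizeList(item(element), n as i32),
        // CellValNum::Var: non-negotiably a variable-length List.
        CellValNum::Var => DataType::List(item(element)),
    }
}
```

Swapping the `Int64` element for any other eight-byte type changes only the inner field, never the wrapper, which is the non-negotiability described above.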
Selecting a bit from this discussion: apache/datafusion#11513

> some logically equivalent array types in the Arrow format have very different DataTypes – for example a logical string array can be represented in the Arrow array of DataType Utf8, Dictionary(Utf8), RunEndEncoded(Utf8), and StringView

Utf8 and List(u8) are physically the same but logically different, because the first one requires valid UTF-8 and prevents construction if you don't have that.

> While writing this, I've started to think of the List/FixedSizeList/Binary/Utf8 types as just an encoding of TileDB's CellValNum concept into the type system

Yep, this is how I see it too. The initial arrow conversion layer I wrote ran with this, with the Exact/Inexact enums. In retrospect, LogicalMatch and PhysicalMatch might have been better names.
> Utf8 and List(u8) are physically the same but logically different because the first one requires valid UTF-8 and prevents construction if you don't have that.

Huh, I must've said it in a different comment, but I totes agree there. I'm just vaguely surprised that `Utf8` doesn't wrap a `List(u8)` internally, as they do for `OffsetsBuffer` wrapping a `ScalarBuffer<i64>`. I.e., the `ScalarBuffer` is responsible for "Yep, these bytes work as a buffer of i64 values" and `OffsetsBuffer` then adds "This buffer of i64 values is a valid list of non-negative, monotonically increasing i64 values". I'm just assuming hand-wavy String type compatibility or some such as the reason.
I think the action I'm leaning towards here is actually to remove this function (convert) and replace it with something like `is_physically_compatible` (not necessarily that name, but here I choose to be verbose).

- Use case 1: the user just asked for the field and we are allocating the buffer. They will deal with the type later. In this case there's no "conversion"; we're just going to come up with some compatible type.
- Use case 2: the user supplies an arrow array and requests that we transfer data into it. In this case we must check compatibility, but there's still no notion of a "conversion".
- Use case 3: the user requests an arrow data type but we allocate. That's not implemented in this PR; it probably looks like use case 2.
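A sketch of that direction; the name is the one floated above, but the primitive-width rule is a placeholder for whatever the real compatibility rules are:

```rust
use arrow_schema::DataType;

/// "Yes we can work with that" or "no way": same physical layout,
/// no implied conversion.
fn is_physically_compatible(tiledb_value_width: usize, arrow_type: &DataType) -> bool {
    match arrow_type.primitive_width() {
        Some(width) => width == tiledb_value_width,
        // Nested and var-sized types would need their own rules.
        None => false,
    }
}
```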
```rust
    }
} else if matches!(arrow_type, adt::DataType::Boolean) {
    if !cvn.is_single_valued() {
        return Err(Error::RequiresSingleValued(arrow_type));
```
How come this can't also be a list?
Is that a limitation of arrow?
To be honest, I have no idea. I'm guessing I wrote it that way because I was focused on something else and didn't want to context switch. Totes agree that I should go figure out whether Arrow does support it, and add that if so.
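For reference, arrow-rs's builders do appear to support lists of booleans; a quick sketch, assuming the `ListBuilder`/`BooleanBuilder` API:

```rust
use arrow_array::builder::{BooleanBuilder, ListBuilder};
use arrow_array::Array;

fn main() {
    let mut builder = ListBuilder::new(BooleanBuilder::new());
    builder.values().append_value(true);
    builder.values().append_value(false);
    builder.append(true); // close the first list: [true, false]
    let list = builder.finish();
    assert_eq!(list.len(), 1);
}
```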
```rust
    }
}

pub fn convert_datatype_to(
```
This function looks to me like it is written only with the default datatype case in mind, so it's missing some things.
- the user might request a List type which isn't covered here
- the binary/utf-8 case looks too permissive in that it allows i64 attributes; but not permissive enough in that fixed cell val num also oughta work (and we would later have to handle the offsets)
Whoa. You are 100% right that I wrote this with the blinders of having just written the default conversion stuff. Definitely seeing a whole bunch of dumb ways this would fail. And to be fair, I have currently elided another major safety bit around validating user-provided buffer compatibility, beyond generating random errors when a conversion fails.

As to Binary/Utf8 with fixed cvn's: for Binary, I just discovered `FixedSizeBinary` the other day and realized that'll give us some of this. But for Utf8, I don't see a way to have Arrow enforce a string length, and without that I hesitate to add support for it this early. Certainly there's a possibility where we manually enforce an offsets check to assert the correct CVN for all values, but that seems like a "let's polish that off later" sort of issue rather than make or break for the initial work.
I reckon the default type selection comes in handy for a fixed cell val num, since then FixedSizeBinary is clearly the right thing. But if the user requests UTF-8 from your fixed-size attribute (which is definitely a tables-realm scenario), then I'd say it should be treated as compatible, and the query can just whip up some offsets easily.

The other direction, going from a var cell val num into a FixedSizeBinary, isn't a scenario I expect we will encounter, so we can skip it. But I think allocating space for offsets, writing them, and checking that they are all the same length would be a fine thing to do.
You've confused me on this one. For `FixedSizeBinary` we've got a direct data type that'd cover all cases (though I wonder what it'd do for `CellValNum::single()`), so I'd expect that to map directly 100% of the time.

For strings, as you point out, we can just generate our own offsets for read compatibility. You mentioned that writing fixed-length binary is unlikely (though it will be transparently supportable). It's the string side that worries me, when folks attempt to write arbitrary strings into a fixed-width cell. I'm fairly certain it would be supportable, just in a "let's kick that off into its own PR" way, after we're sure how we want to do it, as I could see a number of approaches on where that conversion/enforcement happens.
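A sketch of that offset generation for reading a fixed cell val num string attribute as `Utf8`; the function name and shape are illustrative, and `try_new` is used so the UTF-8 check surfaces as an error:

```rust
use arrow_array::StringArray;
use arrow_buffer::{Buffer, OffsetBuffer};
use arrow_schema::ArrowError;

/// Read fixed-width cells as a Utf8 array by synthesizing offsets
/// 0, width, 2*width, ... over the raw bytes.
fn fixed_width_to_utf8(bytes: Vec<u8>, width: usize) -> Result<StringArray, ArrowError> {
    let n = bytes.len() / width;
    let offsets = OffsetBuffer::from_lengths(std::iter::repeat(width).take(n));
    // try_new validates UTF-8, which is exactly the Utf8-vs-Binary
    // distinction discussed above.
    StringArray::try_new(offsets, Buffer::from_vec(bytes), None)
}
```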
Making sure I understand the state machine / flow:
I kind of suspect that there is a simpler path to achieving the title of the PR but this addresses some other issues as well.
From a design standpoint, while I fully understand it, and I think I understand why you did it, I am not in love with the treatment of Arc as a marker for shared/mutable. It does have the issue you've acknowledged where buffers are dropped if there is an intermediate error, which we might want to deal with for production, but... I also fully understand the choice, and I'm not even sure that RefCell is compatible with using arrow's …

But regardless, I think we'll be better off if the caller is responsible for ensuring that their Arc can be unwrapped, and if the caller is responsible for handling any failures of that. I recall you do not feel fondly about RefCell.
The discussion of what to do about types mostly fits in the other thread, but I'll add a thought that I'm having here... we could add a new super trait, something like … and then provide implementations based on PrimitiveArray (for example) for the timestamp femtosecond. Kind of like treating arrow as a physical type (even though we know it isn't) and the tiledb type as a logical type (which the Datatype part of it is).
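A very loose sketch of that super-trait idea; every name here is hypothetical, standing in for whatever the elided code above proposed:

```rust
use std::sync::Arc;
use arrow_array::Array;

/// Stand-in for tiledb's Datatype enum.
enum Datatype {
    DateTimeFemtosecond,
}

/// Hypothetical super trait: arrow supplies the physical storage,
/// the TileDB Datatype supplies the logical interpretation.
trait LogicalArray {
    fn tiledb_datatype(&self) -> Datatype;
    fn physical(&self) -> Arc<dyn Array>;
}
```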
I think we should consider breaking this down into pieces and modifying the existing query crate(s) for a smoother transition.
You've got it. The important bit that I think this PR shows is how to go back and forth between a shared arrow array and a mutable buffer that can be passed to FFI calls safely. All of the other parts of this PR are mostly just me writing enough to show that the shared/mutable dichotomy is viable.
So, I'm not 100% on your meaning here, but I think I can squint and sorta see where you're getting RefCell vibes from. So I'll try and explain why I think this isn't RefCell. But first, the decision to use Arc was pretty simple: it's what Arrow uses.

On to the RefCell thing. I think you're seeing interior mutability where there isn't any. Instead there's merely an ownership transition between exclusive and shared.

```mermaid
---
title: RefCell
---
flowchart BT
    B["`struct B { value: RefCell<A> }`"] --> A
    C["`struct C { value: RefCell<A> }`"] --> A
```

A RefCell is for when we want to have two different ways to mutably write the same piece of RAM. In the poorly drawn diagram above I'm attempting to show two structs that hold a RefCell to the same value. Either of those RefCells can be used to mutate A, as long as there's coordination to avoid runtime panics.
```mermaid
---
title: Arc
---
flowchart BT
    subgraph Query
        direction BT
        Shared <--> Mutable
    end
    Query <--> libtiledb.so
    Arrow <--> Query
```
On the other hand, the Arc/MutableBuffer transitions move ownership from shared (Arc) to exclusive (MutableBuffer). So it's not an Arc masquerading as some RefCell; it's just an Arc while the outside world has access to it. Remember that if they hold onto that Arc and call …
Calling this out specifically since I've actually fixed this part. You can see an example here: tiledb-rs/tiledb/api/src/query_arrow/buffers.rs, lines 518 to 544 in 905e5c5.

The basic idea is that if we fail the shared -> mutable transition, we just rebuild the shared array. The implementation of `try_from` for ListArrays is the gnarliest of all of those.
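A minimal sketch of that rebuild-on-failure pattern at single-buffer granularity (not the buffers.rs code itself, which works on whole arrays):

```rust
use arrow_buffer::{Buffer, MutableBuffer};

enum BufferState {
    Shared(Buffer),
    Mutable(MutableBuffer),
}

/// Attempt the shared -> mutable transition; on failure nothing is
/// dropped, because `into_mutable` returns the original buffer back.
fn to_mutable(state: BufferState) -> Result<BufferState, BufferState> {
    match state {
        BufferState::Shared(buf) => match buf.into_mutable() {
            Ok(mutable) => Ok(BufferState::Mutable(mutable)),
            // Rebuild the shared state from the returned buffer.
            Err(still_shared) => Err(BufferState::Shared(still_shared)),
        },
        mutable => Ok(mutable),
    }
}
```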
First thing is that RefCell isn't even Sync, so it'd be pretty awkward to use in conjunction with an Arc. However, with this PR it'd be trivial to get a …
This is already how things work. I've just pushed this commit that shows the behavior when a client keeps external references to those Arrow arrays:
I got nothing against RefCell! It's got its uses.
I've considered similar things, like storing a copy of the original Datatype in the Array metadata or some such. However, I just can't figure out how that'd work when passing these arrays to a bunch of other places. I remember DataFusion yelling at me when things had different metadata values, for instance. I think there could be something to this; I just don't have a good idea of how it'd work.
I'm totes on board with figuring out a way to make this transition smaller/more incremental/what have you. Though I'll have to sit down and have a think on how that'd best be accomplished. I'm more than fine splitting this PR, or something like it, up into a set of smaller steps, though you haven't quite got me convinced that evolving the existing query interface is necessarily the way to go. I think it'd be a lot easier to agree on a target API that we're convinced will handle all of our use cases (via adaptors or whatnot), implement that, then start upgrading all of our uses of the old APIs to the new version before finally dropping the old APIs. That's not to say you can't convince me otherwise; I just get scared of the shadows when I start seeing paste macros....