-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace arrowrs with arrow2 #224
Conversation
# Conflicts: # .gitignore
rust/src/encodings/plain.rs
Outdated
(*self.file).seek(SeekFrom::Start(self.position + offset as u64))?; | ||
// let mut mutable_buf = Buffer::new(read_len * T::get_byte_width()); | ||
let byte_size = std::mem::size_of::<T>(); | ||
let mut buf = vec![0u8; read_len * byte_size];//TODO is it right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines are obviously different from @eddyxu 's arrow-rs
version because there are missing equivalent things in arrow2
, need deep review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T is native rust type? Does it support bool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T is an Arrow2 trait that extends native rust Type. From this list I think bool is not included, but maybe bool is just UInt8?
@Renkai let's target this PR to |
@eddyxu retargeted, but the diff is too large, would you update branch rust with newest main branch? |
# Conflicts: # rust/Cargo.toml # rust/src/encodings.rs # rust/src/encodings/plain.rs # rust/src/schema.rs
branch updated, now the diff page is clearer |
use DataType::*; | ||
matches!( | ||
t, | ||
UInt8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those indents are not right?
Aren't arrow2
has similar check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it's copied from arrow-rs and keeps the original indents, at least Clion believe it indents properly.
Arrow2 also has a similar one, but it's not public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not put this as public interface on Schema
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, private it.
@@ -100,6 +100,44 @@ impl Field { | |||
} | |||
} | |||
|
|||
/// Return Arrow Data Type name. | |||
pub fn type_str(t: &DataType) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this method? Can we use #Debug()
or format
traits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used for keeping compatible with the names in protobuf. I'm not sure if the results of #Debug()
or format
are compatible with the content from protobuf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.rs/arrow2/latest/arrow2/datatypes/enum.DataType.html#impl-Debug-for-DataType
What are the returns from here?
This implementation lacks support for quite some data types and causes panic in production code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result is
println!("UInt8: {:?}",DataType::UInt8);
println!("UInt16: {:?}",DataType::UInt16);
println!("Binary: {:?}",DataType::Binary);
println!("Extension: {:?}", DataType::Extension("the_str".to_string(), Box::new(DataType::Date64), Option::Some("option_string".into())));
-----
UInt8: UInt8
UInt16: UInt16
Binary: Binary
Extension: Extension("the_str", Date64, Some("option_string"))
I think what you said about panic is right but maybe we can start with simple cases first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get a parity to https://github.com/eto-ai/lance/blob/ef3a53cfdedb39569b75e4c2343290452b2b8f30/cpp/src/lance/arrow/type.cc#L222 ?
These are the types we are currently using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Referred to this, but the C++ version of Timestamp doesn't have a timezone field, while the rust version has one, for compatibility reasons I ignored the timezone field in rust.
rust/src/encodings.rs
Outdated
type ArrowType; | ||
|
||
fn decode(&mut self, offset: i32, length: &Option<i32>) -> Result<ArrayRef>; | ||
fn decode(&mut self, offset: i32, length: &Option<i32>) -> Result<Arc<dyn Array>> ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why some of them return Arc<dyn Array>
some of them return Box<dyn Array>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make our interface consistent? either all return Box<dyn Array>
or all <Arc<dyn Array>>
?
Conceptually, decode / take should all return "array" from user API's aspect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to box
rust/src/encodings/plain.rs
Outdated
let byte_size = std::mem::size_of::<T>(); | ||
let mut buf = vec![0u8; read_len * byte_size];//TODO is it right? | ||
(*self.file).read_exact(&mut buf)?; | ||
let mut builder = MutablePrimitiveArray::with_capacity(read_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there an arrow2 api to directly pass the continuous memory buffer into Array?
This assignment of each element is not efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the API that can batch convert raw bytes to PrimitiveArray in Arrow2. If we want batch convert, maybe we can use ReadBytesExt? The code would become similar to blow
match T::PRIMITIVE {
PrimitiveType::Int8 => {
let mut vec = vec![0i8; read_len];
(*self.file).read_i8_into(&mut vec)?;
PrimitiveArray::from_slice(vec).arced();
}
PrimitiveType::Int16 => {
let mut vec = vec![0i16; read_len];
(*self.file).read_i16::<LittleEndian>(&mut vec)?;
PrimitiveArray::from_slice(vec).arced();
}
/**
Boilerplate code ...
**/
PrimitiveType::Int32 => {}
PrimitiveType::Int64 => {}
PrimitiveType::Int128 => {}
PrimitiveType::Int256 => {}
PrimitiveType::UInt8 => {}
PrimitiveType::UInt16 => {}
PrimitiveType::UInt32 => {}
PrimitiveType::UInt64 => {}
PrimitiveType::Float16 => {}
PrimitiveType::Float32 => {}
PrimitiveType::Float64 => {}
PrimitiveType::DaysMs => {}
PrimitiveType::MonthDayNano => {}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way is not applicable for DaysMs
and MonthDayNano
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that we are using arrow is that arrow is a "zero-copy" representation of arrays. Copying / constructing an array should be just setting a few pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So as an implementation of Arrow, Arrow2 should definitely have a function that can convert [u8]
to PrimitiveArray
without copy data? I would take a further look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more question, Arrow2 said it's transmute-free.
In my understanding, the meaning of transmute-free is:
We don't provide such a way that "Copying / constructing an array should be just setting a few pointers."
Wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think that's the meaning of transmute-free
.
@changhiskhan could you help to clarify ?
So arrow is zero-copy
representation of arrays. If we copy each elements, we can rather just use a Vec<>
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've found some very similar question in the Arrow2 issues, will copy from it.
@@ -118,6 +166,15 @@ impl Field { | |||
} | |||
} | |||
|
|||
fn to_str(unit: &TimeUnit) -> &'static str { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we implement this a fmt::Display
trait over TimeUnit
?
rust/src/bin/lance.rs
Outdated
println!("Schema: {}\n", reader.schema()); | ||
use std::any::TypeId; | ||
let is_little_endian = TypeId::of::<byteorder::NativeEndian>() == TypeId::of::<byteorder::LittleEndian>(); | ||
println!("is little endian {:?}", is_little_endian) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is
=> Is
Co-authored-by: Lei Xu <lei@eto.ai>
Co-authored-by: Lei Xu <lei@eto.ai>
Co-authored-by: Lei Xu <lei@eto.ai>
Co-authored-by: Lei Xu <lei@eto.ai>
Co-authored-by: Lei Xu <lei@eto.ai>
Co-authored-by: Lei Xu <lei@eto.ai>
No description provided.