Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace arrowrs with arrow2 #224

Merged
merged 23 commits into from
Oct 19, 2022
Merged

replace arrowrs with arrow2 #224

merged 23 commits into from
Oct 19, 2022

Conversation

Renkai
Copy link
Contributor

@Renkai Renkai commented Oct 7, 2022

No description provided.

@Renkai Renkai added the WIP work in progress label Oct 7, 2022
(*self.file).seek(SeekFrom::Start(self.position + offset as u64))?;
// let mut mutable_buf = Buffer::new(read_len * T::get_byte_width());
let byte_size = std::mem::size_of::<T>();
let mut buf = vec![0u8; read_len * byte_size];//TODO is it right?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are obviously different from @eddyxu 's arrow-rs version because there are missing equivalent things in arrow2, need deep review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T is native rust type? Does it support bool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T is an Arrow2 trait that extends native rust Type. From this list I think bool is not included, but maybe bool is just UInt8?

@Renkai Renkai requested a review from eddyxu October 7, 2022 06:44
@Renkai Renkai removed the WIP work in progress label Oct 7, 2022
@Renkai Renkai marked this pull request as ready for review October 7, 2022 06:44
@eddyxu
Copy link
Contributor

eddyxu commented Oct 7, 2022

@Renkai let's target this PR to rust branch?

@Renkai Renkai changed the base branch from main to rust October 7, 2022 07:04
@Renkai
Copy link
Contributor Author

Renkai commented Oct 7, 2022

@eddyxu retargeted, but the diff is too large, would you update branch rust with newest main branch?

# Conflicts:
#	rust/Cargo.toml
#	rust/src/encodings.rs
#	rust/src/encodings/plain.rs
#	rust/src/schema.rs
@Renkai
Copy link
Contributor Author

Renkai commented Oct 8, 2022

branch updated, now the diff page is clearer

use DataType::*;
matches!(
t,
UInt8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those indents are not right?

Aren't arrow2 has similar check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it's copied from arrow-rs and keeps the original indents, at least Clion believe it indents properly.

Arrow2 also has a similar one, but it's not public.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not put this as public interface on Schema as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, private it.

@@ -100,6 +100,44 @@ impl Field {
}
}

/// Return Arrow Data Type name.
pub fn type_str(t: &DataType) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this method? Can we use #Debug() or format traits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used for keeping compatible with the names in protobuf. I'm not sure if the results of #Debug() or format are compatible with the content from protobuf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.rs/arrow2/latest/arrow2/datatypes/enum.DataType.html#impl-Debug-for-DataType

What are the returns from here?

This implementation lacks support for quite some data types and causes panic in production code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result is

            println!("UInt8: {:?}",DataType::UInt8);
            println!("UInt16: {:?}",DataType::UInt16);
            println!("Binary: {:?}",DataType::Binary);
            println!("Extension: {:?}", DataType::Extension("the_str".to_string(), Box::new(DataType::Date64), Option::Some("option_string".into())));
-----
UInt8: UInt8
UInt16: UInt16
Binary: Binary
Extension: Extension("the_str", Date64, Some("option_string"))

I think what you said about panic is right but maybe we can start with simple cases first?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Referred to this, but the C++ version of Timestamp doesn't have a timezone field, while the rust version has one, for compatibility reasons I ignored the timezone field in rust.

type ArrowType;

fn decode(&mut self, offset: i32, length: &Option<i32>) -> Result<ArrayRef>;
fn decode(&mut self, offset: i32, length: &Option<i32>) -> Result<Arc<dyn Array>> ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why some of them return Arc<dyn Array> some of them return Box<dyn Array>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Arc is a mimic of the original arrow-rs version, the Box is deduced from this arrow2 function. It's very similar to the arrow-rs version except for the signature. I don't know the design intention of Arrow2 here, just keep it unchanged when it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make our interface consistent? either all return Box<dyn Array> or all <Arc<dyn Array>>?
Conceptually, decode / take should all return "array" from user API's aspect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to box

let byte_size = std::mem::size_of::<T>();
let mut buf = vec![0u8; read_len * byte_size];//TODO is it right?
(*self.file).read_exact(&mut buf)?;
let mut builder = MutablePrimitiveArray::with_capacity(read_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an arrow2 api to directly pass the continuous memory buffer into Array?

This assignment of each element is not efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the API that can batch convert raw bytes to PrimitiveArray in Arrow2. If we want batch convert, maybe we can use ReadBytesExt? The code would become similar to blow

        match T::PRIMITIVE {
            PrimitiveType::Int8 => {
                let mut vec = vec![0i8; read_len];
                (*self.file).read_i8_into(&mut vec)?;
                PrimitiveArray::from_slice(vec).arced();
            }
            PrimitiveType::Int16 => {
                let mut vec = vec![0i16; read_len];
                (*self.file).read_i16::<LittleEndian>(&mut vec)?;
                PrimitiveArray::from_slice(vec).arced();
            }
            /**
            Boilerplate code ...
            **/
            PrimitiveType::Int32 => {}
            PrimitiveType::Int64 => {}
            PrimitiveType::Int128 => {}
            PrimitiveType::Int256 => {}
            PrimitiveType::UInt8 => {}
            PrimitiveType::UInt16 => {}
            PrimitiveType::UInt32 => {}
            PrimitiveType::UInt64 => {}
            PrimitiveType::Float16 => {}
            PrimitiveType::Float32 => {}
            PrimitiveType::Float64 => {}
            PrimitiveType::DaysMs => {}
            PrimitiveType::MonthDayNano => {}
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way is not applicable for DaysMs and MonthDayNano.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that we are using arrow is that arrow is a "zero-copy" representation of arrays. Copying / constructing an array should be just setting a few pointers.

https://arrow.apache.org/docs/format/Columnar.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as an implementation of Arrow, Arrow2 should definitely have a function that can convert [u8] to PrimitiveArray without copy data? I would take a further look into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question, Arrow2 said it's transmute-free.
In my understanding, the meaning of transmute-free is:
We don't provide such a way that "Copying / constructing an array should be just setting a few pointers."
Wrong?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think that's the meaning of transmute-free.

@changhiskhan could you help to clarify ?

So arrow is zero-copy representation of arrays. If we copy each elements, we can rather just use a Vec<> , right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found some very similar question in the Arrow2 issues, will copy from it.

@Renkai Renkai requested a review from eddyxu October 16, 2022 07:57
@@ -118,6 +166,15 @@ impl Field {
}
}

fn to_str(unit: &TimeUnit) -> &'static str {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we implement this a fmt::Display trait over TimeUnit?

println!("Schema: {}\n", reader.schema());
use std::any::TypeId;
let is_little_endian = TypeId::of::<byteorder::NativeEndian>() == TypeId::of::<byteorder::LittleEndian>();
println!("is little endian {:?}", is_little_endian)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is => Is

@Renkai Renkai merged commit 3e457ba into lancedb:rust Oct 19, 2022
eddyxu added a commit that referenced this pull request Oct 19, 2022
Co-authored-by: Lei Xu <lei@eto.ai>
eddyxu added a commit that referenced this pull request Oct 28, 2022
Co-authored-by: Lei Xu <lei@eto.ai>
eddyxu added a commit that referenced this pull request Oct 28, 2022
Co-authored-by: Lei Xu <lei@eto.ai>
eddyxu added a commit that referenced this pull request Nov 10, 2022
Co-authored-by: Lei Xu <lei@eto.ai>
eddyxu added a commit that referenced this pull request Nov 11, 2022
Co-authored-by: Lei Xu <lei@eto.ai>
eddyxu added a commit that referenced this pull request Dec 9, 2022
Co-authored-by: Lei Xu <lei@eto.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants