
build: Restore CI by making parquet and arrow version consistent #280

Merged (1 commit, Mar 19, 2024)
16 changes: 8 additions & 8 deletions Cargo.toml
@@ -18,10 +18,10 @@
[workspace]
resolver = "2"
members = [
-    "crates/catalog/*",
-    "crates/examples",
-    "crates/iceberg",
-    "crates/test_utils",
+    "crates/catalog/*",
+    "crates/examples",
+    "crates/iceberg",
+    "crates/test_utils",
]

[workspace.package]
@@ -37,9 +37,9 @@ rust-version = "1.75.0"
anyhow = "1.0.72"
apache-avro = "0.16"
array-init = "2"
-arrow-arith = { version = ">=46" }
-arrow-array = { version = ">=46" }
-arrow-schema = { version = ">=46" }
+arrow-arith = { version = "51" }
+arrow-array = { version = "51" }
+arrow-schema = { version = "51" }
async-stream = "0.3.5"
async-trait = "0.1"
bimap = "0.6"
@@ -61,7 +61,7 @@ murmur3 = "0.5.2"
once_cell = "1"
opendal = "0.45"
ordered-float = "4.0.0"
-parquet = "50"
+parquet = "51"
Member Author:
I'm not sure why the parquet version was previously inconsistent with the arrow version. Mismatched versions can cause errors like:

    = note: `RecordBatch` and `arrow_array::record_batch::RecordBatch` have similar names, but are actually distinct types
    note: `RecordBatch` is defined in crate `arrow_array`
       --> /Users/liangchi/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-50.0.0/src/record_batch.rs:72:1
        |
    72  | pub struct RecordBatch {
        | ^^^^^^^^^^^^^^^^^^^^^^
    note: `arrow_array::record_batch::RecordBatch` is defined in crate `arrow_array`
       --> /Users/liangchi/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-51.0.0/src/record_batch.rs:72:1
        |
    72  | pub struct RecordBatch {
        | ^^^^^^^^^^^^^^^^^^^^^^
        = note: perhaps two different versions of crate `arrow_array` are being used?
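For reference, the fix is simply to keep the arrow crates and parquet on the same major version. A sketch of the resulting `[workspace.dependencies]` section (reconstructed from the diff above, not a verbatim copy of the file):

```toml
[workspace.dependencies]
# arrow-* and parquet are released in lockstep upstream; keeping them on
# the same major version prevents two copies of arrow_array in the build.
arrow-arith = { version = "51" }
arrow-array = { version = "51" }
arrow-schema = { version = "51" }
parquet = "51"
```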

Collaborator:
Is it possible to change it to parquet `>=46, <51`? We faced similar problems in icelake. cc @Xuanwo. I still recommend pinning the arrow version, since arrow doesn't guarantee compatibility across major version upgrades.

Member:
We can temporarily fix the CI by pinning arrow and parquet. However, since our users rely on various versions of arrow, we cannot force them to always use the latest version. Ultimately, we will determine an MSAV (minimum supported arrow version) through other means.

Member:
And committing Cargo.lock is still recommended, to make sure CI works as expected.

Collaborator:
Cool, let's pin this first to unblock the failing CI.

pilota = "0.10.0"
pretty_assertions = "1.4.0"
port_scanner = "0.1.5"
18 changes: 8 additions & 10 deletions crates/iceberg/src/transform/temporal.rs
@@ -17,10 +17,8 @@

use super::TransformFunction;
use crate::{Error, ErrorKind, Result};
-use arrow_arith::{
-    arity::binary,
-    temporal::{month_dyn, year_dyn},
-};
+use arrow_arith::temporal::DatePart;
+use arrow_arith::{arity::binary, temporal::date_part};

Member Author: `month_dyn` and `year_dyn` are deprecated in the latest arrow release.
use arrow_array::{
types::Date32Type, Array, ArrayRef, Date32Array, Int32Array, TimestampMicrosecondArray,
};
@@ -43,8 +41,8 @@ pub struct Year;

impl TransformFunction for Year {
fn transform(&self, input: ArrayRef) -> Result<ArrayRef> {
-        let array =
-            year_dyn(&input).map_err(|err| Error::new(ErrorKind::Unexpected, format!("{err}")))?;
+        let array = date_part(&input, DatePart::Year)
+            .map_err(|err| Error::new(ErrorKind::Unexpected, format!("{err}")))?;
Ok(Arc::<Int32Array>::new(
array
.as_any()
@@ -61,15 +59,15 @@ pub struct Month;

impl TransformFunction for Month {
fn transform(&self, input: ArrayRef) -> Result<ArrayRef> {
-        let year_array =
-            year_dyn(&input).map_err(|err| Error::new(ErrorKind::Unexpected, format!("{err}")))?;
+        let year_array = date_part(&input, DatePart::Year)
+            .map_err(|err| Error::new(ErrorKind::Unexpected, format!("{err}")))?;
         let year_array: Int32Array = year_array
             .as_any()
             .downcast_ref::<Int32Array>()
             .unwrap()
             .unary(|v| 12 * (v - UNIX_EPOCH_YEAR));
-        let month_array =
-            month_dyn(&input).map_err(|err| Error::new(ErrorKind::Unexpected, format!("{err}")))?;
+        let month_array = date_part(&input, DatePart::Month)
+            .map_err(|err| Error::new(ErrorKind::Unexpected, format!("{err}")))?;
Ok(Arc::<Int32Array>::new(
binary(
month_array.as_any().downcast_ref::<Int32Array>().unwrap(),
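For context on what this transform computes: Iceberg's month partition transform maps a date to the number of whole months since 1970-01. The `unary` call above contributes `12 * (year - 1970)`, and the month component (1-based, as `date_part(.., DatePart::Month)` yields it) is folded in by the `binary` step that the diff truncates. A stdlib-only sketch of that arithmetic; the `months_since_epoch` helper is illustrative, not part of the crate:

```rust
// Iceberg's month transform: whole months elapsed since 1970-01.
const UNIX_EPOCH_YEAR: i32 = 1970;

/// `month` is 1-based (January = 1), matching DatePart::Month output.
fn months_since_epoch(year: i32, month: i32) -> i32 {
    12 * (year - UNIX_EPOCH_YEAR) + (month - 1)
}

fn main() {
    assert_eq!(months_since_epoch(1970, 1), 0);   // epoch month itself
    assert_eq!(months_since_epoch(1970, 12), 11); // Dec 1970
    assert_eq!(months_since_epoch(2017, 11), 574); // 47 years + 10 months
    println!("ok");
}
```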
33 changes: 8 additions & 25 deletions crates/iceberg/src/writer/file_writer/parquet_writer.rs
@@ -18,7 +18,6 @@
//! The module contains the file writer for parquet file format.

 use std::{
-    cmp::max,
     collections::HashMap,
     sync::{atomic::AtomicI64, Arc},
 };
@@ -43,9 +42,6 @@ use super::{
/// ParquetWriterBuilder is used to builder a [`ParquetWriter`]
#[derive(Clone)]
pub struct ParquetWriterBuilder<T: LocationGenerator, F: FileNameGenerator> {
-    /// `buffer_size` determines the initial size of the intermediate buffer.
-    /// The intermediate buffer will automatically be resized if necessary
-    init_buffer_size: usize,
props: WriterProperties,
schema: ArrowSchemaRef,

@@ -55,21 +51,16 @@ pub struct ParquetWriterBuilder<T: LocationGenerator, F: FileNameGenerator> {
}

impl<T: LocationGenerator, F: FileNameGenerator> ParquetWriterBuilder<T, F> {
-    /// To avoid EntiryTooSmall error, we set the minimum buffer size to 8MB if the given buffer size is smaller than it.
-    const MIN_BUFFER_SIZE: usize = 8 * 1024 * 1024;

/// Create a new `ParquetWriterBuilder`
/// To construct the write result, the schema should contain the `PARQUET_FIELD_ID_META_KEY` metadata for each field.
pub fn new(
-        init_buffer_size: usize,
props: WriterProperties,
schema: ArrowSchemaRef,
file_io: FileIO,
location_generator: T,
file_name_generator: F,
) -> Self {
Self {
-            init_buffer_size,
props,
schema,
file_io,
@@ -112,20 +103,14 @@ impl<T: LocationGenerator, F: FileNameGenerator> FileWriterBuilder for ParquetWr
.generate_location(&self.file_name_generator.generate_file_name()),
)?;
let inner_writer = TrackWriter::new(out_file.writer().await?, written_size.clone());
-        let init_buffer_size = max(Self::MIN_BUFFER_SIZE, self.init_buffer_size);
-        let writer = AsyncArrowWriter::try_new(
-            inner_writer,
-            self.schema.clone(),
-            init_buffer_size,
-            Some(self.props),
-        )
-        .map_err(|err| {
-            Error::new(
-                crate::ErrorKind::Unexpected,
-                "Failed to build parquet writer.",
-            )
-            .with_source(err)
-        })?;
+        let writer = AsyncArrowWriter::try_new(inner_writer, self.schema.clone(), Some(self.props))
+            .map_err(|err| {
+                Error::new(
+                    crate::ErrorKind::Unexpected,
+                    "Failed to build parquet writer.",
+                )
+                .with_source(err)
+            })?;

Ok(ParquetWriter {
writer,
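In arrow 51 `AsyncArrowWriter::try_new` no longer takes a buffer-size argument, but the surrounding error-wrapping stays the same: map the library error into the crate's error while keeping the original as the source. A stdlib-only sketch of that pattern; the `BuildError` type and `try_new` stub are hypothetical stand-ins, not the crate's actual API:

```rust
use std::error::Error as StdError;
use std::fmt;

// Hypothetical domain error carrying the underlying cause, mirroring
// Error::new(..).with_source(err) in the diff above.
#[derive(Debug)]
struct BuildError {
    message: String,
    source: Box<dyn StdError>,
}

impl fmt::Display for BuildError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.message, self.source)
    }
}

// Stub standing in for a fallible constructor like AsyncArrowWriter::try_new.
fn try_new(fail: bool) -> Result<(), std::io::Error> {
    if fail {
        Err(std::io::Error::new(std::io::ErrorKind::Other, "bad schema"))
    } else {
        Ok(())
    }
}

fn build_writer(fail: bool) -> Result<(), BuildError> {
    // Same shape as the diff: wrap the library error, preserving the source.
    try_new(fail).map_err(|err| BuildError {
        message: "Failed to build parquet writer.".to_string(),
        source: Box::new(err),
    })
}

fn main() {
    assert!(build_writer(false).is_ok());
    let err = build_writer(true).unwrap_err();
    assert_eq!(err.to_string(), "Failed to build parquet writer.: bad schema");
    println!("ok");
}
```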
@@ -311,7 +296,6 @@ mod tests {

// write data
let mut pw = ParquetWriterBuilder::new(
-            0,
WriterProperties::builder().build(),
to_write.schema(),
file_io.clone(),
@@ -551,7 +535,6 @@

// write data
let mut pw = ParquetWriterBuilder::new(
-            0,
WriterProperties::builder().build(),
to_write.schema(),
file_io.clone(),