-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11599: [Rust] Add function to create array with all nulls #9469
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format?
See also: |
@alamb do you mind completing this for me? I'd otherwise only be able to work on it in my evening (GMT+2). The tests pass, but I need to add more tests (on the Arrow side), and complete some of the |
Nice, this is something useful! |
Thanks @nevi-me -- I will try and give it a look / try in a few hours |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions from me
#[test] | ||
fn test_null_list_primitive() { | ||
let data_type = | ||
DataType::List(Box::new(Field::new("item", DataType::Int32, true))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a soundness hole here, that applies more for parquet. If I created this array as non-nullable, there wouldn't have been an error, even though this doesn't comply with the spec.
@alamb @jorgecarleitao @Dandandan should we panic, or should I change the signature to Result<ArrayRef>
so I can validate this condition?
I'm leaning to a panic, just like we panic on impl From<ArrayDataRef> for {type}Array
. Wrapping a function that doesn't really error out, in a Result feels like assembing the Avengers for a single villain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote panic
for now if you think it is a logic error that a user should be avoiding anyways.
Or alternately, we can make it panic
now and Result
later if someone finds a need
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that hole exists in so many places, unfortunately. Since ArrayData
has a null_count
, we have been relying on it for this.
Technically, this goes back to the issue that we are not performing any validation whether the datatype is consistent with the array's content. :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @nevi-me -- thank you. I left some hopefully helpful additional test suggestions. But all in all this looks great
rust/arrow/src/array/array.rs
Outdated
} | ||
DataType::Float16 => unreachable!(), | ||
DataType::Int32 | ||
| DataType::UInt32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it is probably ok, it took me a while to convince myself that new_null_sized_array::<Int32Type>
was ok to use for different types like Float32
I wonder if it would be better to explicitly list out the types here
DataType::Float32 => new_null_sized_array::<Float32Type>(data_type, length),
DataType::Date32 => new_null_sized_array::<Int32Type>(data_type, length),
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change them, seemed tedious though because f32 and i32 use the same underlying size. I was looking for a convenient way of getting the byte size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, imo we have been doing this incorrectly all along. We currently have one PrimitiveArray
per logical type (e.g. Date32
, Int32
) because there is a one-to-one correspondence between ArrowPrimitiveType
and DataType
. This implies that we will never be able to support Timestamp(_,X)
with PrimitiveArray
, as that type has many variations (one per time zone).
Having one array type per logical type also makes the whole code less maintainable, because we have to downcast for every variation of the DataType
, instead of for every variation of the underlying physical type.
IMO we should have 1 PrimitiveArray
per physical type. I.e. PrimitiveArray<i32>
, PrimitiveArray<i64>
, etc. The DataType
just denotes how the physical values are to be interpreted (logical), not their layout in memory nor any safety considerations (e.g. &str
vs &[u8]
).
More generally, a generic parameter T
represents a variation of the underlying physical / memory layout, and we have been using PrimitiveArray<T>
to represent the same layout but different logical representations.
This is one of the main changes in the proposal I wrote, on which a primitive array is defined by its physical representation, PrimitiveArray<i64>
and its logical representation (DataType
), on which the datatype is a many to one to a physical representation (e.g. Date32
, Time32
and Int32
all map to i32
).
4d62854
to
f43426b
Compare
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
f43426b
to
8c45e47
Compare
Codecov Report
@@ Coverage Diff @@
## master #9469 +/- ##
==========================================
- Coverage 82.31% 82.28% -0.04%
==========================================
Files 234 234
Lines 54482 54594 +112
==========================================
+ Hits 44849 44920 +71
- Misses 9633 9674 +41
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks again @nevi-me -- I am keen to get this merged in sooner rather than later (so I can get #9468 in and then pull this change into my downstream project).
Any objections @jorgecarleitao ?
@alamb you can merge this so it unblocks your work. If there are further questions, I'm happy to address them in a follow-up PR |
Thanks @nevi-me -- I think this is ready to go then. Thanks again! |
FYI @maxburke |
As pointed out by @tustvold on influxdata/influxdb_iox#783 (comment), it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint. Currently trying to print them out causes a panic: ``` #[test] fn test_pretty_format_null() -> Result<()> { // define a schema. let schema = Arc::new(Schema::new(vec![ Field::new("a", DataType::Utf8, true), Field::new("b", DataType::Int32, true), ])); let num_rows = 4; // define data (null) let batch = RecordBatch::try_new( schema, vec![ Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)), Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)), ], )?; let table = pretty_format_batches(&[batch])?; } ``` Panics: ``` failures: ---- util::pretty::tests::test_pretty_format_null stdout ---- thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27 ``` # Note _Update: the issue with `NullArray` claiming to be types such as `Int32` has been fixed by @nevi-me in #9469 _ Closes #9468 from alamb/alamb/support-null-printing Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This allows creating an array with n null values Closes apache#9469 from nevi-me/make-empty-arrays Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Wakahisa <nevilledips@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
As pointed out by @tustvold on https://github.com/influxdata/influxdb_iox/pull/783#discussion_r574414243, it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint. Currently trying to print them out causes a panic: ``` #[test] fn test_pretty_format_null() -> Result<()> { // define a schema. let schema = Arc::new(Schema::new(vec![ Field::new("a", DataType::Utf8, true), Field::new("b", DataType::Int32, true), ])); let num_rows = 4; // define data (null) let batch = RecordBatch::try_new( schema, vec![ Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)), Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)), ], )?; let table = pretty_format_batches(&[batch])?; } ``` Panics: ``` failures: ---- util::pretty::tests::test_pretty_format_null stdout ---- thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27 ``` # Note _Update: the issue with `NullArray` claiming to be types such as `Int32` has been fixed by @nevi-me in apache#9469 _ Closes apache#9468 from alamb/alamb/support-null-printing Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
As pointed out by @tustvold on influxdata/influxdb_iox#783 (comment), it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint. Currently trying to print them out causes a panic: ``` #[test] fn test_pretty_format_null() -> Result<()> { // define a schema. let schema = Arc::new(Schema::new(vec![ Field::new("a", DataType::Utf8, true), Field::new("b", DataType::Int32, true), ])); let num_rows = 4; // define data (null) let batch = RecordBatch::try_new( schema, vec![ Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)), Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)), ], )?; let table = pretty_format_batches(&[batch])?; } ``` Panics: ``` failures: ---- util::pretty::tests::test_pretty_format_null stdout ---- thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27 ``` # Note _Update: the issue with `NullArray` claiming to be types such as `Int32` has been fixed by @nevi-me in apache/arrow#9469 _ Closes #9468 from alamb/alamb/support-null-printing Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This allows creating an array with n null values Closes apache#9469 from nevi-me/make-empty-arrays Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Wakahisa <nevilledips@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
As pointed out by @tustvold on https://github.com/influxdata/influxdb_iox/pull/783#discussion_r574414243, it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint. Currently trying to print them out causes a panic: ``` #[test] fn test_pretty_format_null() -> Result<()> { // define a schema. let schema = Arc::new(Schema::new(vec![ Field::new("a", DataType::Utf8, true), Field::new("b", DataType::Int32, true), ])); let num_rows = 4; // define data (null) let batch = RecordBatch::try_new( schema, vec![ Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)), Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)), ], )?; let table = pretty_format_batches(&[batch])?; } ``` Panics: ``` failures: ---- util::pretty::tests::test_pretty_format_null stdout ---- thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27 ``` # Note _Update: the issue with `NullArray` claiming to be types such as `Int32` has been fixed by @nevi-me in apache#9469 _ Closes apache#9468 from alamb/alamb/support-null-printing Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This allows creating an array with n null values Closes apache#9469 from nevi-me/make-empty-arrays Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Wakahisa <nevilledips@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
As pointed out by @tustvold on https://github.com/influxdata/influxdb_iox/pull/783#discussion_r574414243, it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint. Currently trying to print them out causes a panic: ``` #[test] fn test_pretty_format_null() -> Result<()> { // define a schema. let schema = Arc::new(Schema::new(vec![ Field::new("a", DataType::Utf8, true), Field::new("b", DataType::Int32, true), ])); let num_rows = 4; // define data (null) let batch = RecordBatch::try_new( schema, vec![ Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)), Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)), ], )?; let table = pretty_format_batches(&[batch])?; } ``` Panics: ``` failures: ---- util::pretty::tests::test_pretty_format_null stdout ---- thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27 ``` # Note _Update: the issue with `NullArray` claiming to be types such as `Int32` has been fixed by @nevi-me in apache#9469 _ Closes apache#9468 from alamb/alamb/support-null-printing Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This allows creating an array with n null values