Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11599: [Rust] Add function to create array with all nulls #9469

Closed
wants to merge 7 commits into from

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Feb 11, 2021

This allows creating an array with n null values

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@nevi-me
Copy link
Contributor Author

nevi-me commented Feb 11, 2021

@alamb do you mind completing this for me? I'd otherwise only be able to work on it in my evening (GMT+2). The tests pass, but I need to add more tests (on the Arrow side), and complete some of the unimplemented!() cases.

@ritchie46
Copy link
Contributor

Nice, this is something useful!

@alamb
Copy link
Contributor

alamb commented Feb 11, 2021

Thanks @nevi-me -- I will try and give it a look / try in a few hours

@nevi-me nevi-me changed the title WIP: make null arrays ARROW-11599: [Rust] Add function to create array with all nulls Feb 11, 2021
@github-actions
Copy link

Copy link
Contributor Author

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions from me

#[test]
fn test_null_list_primitive() {
let data_type =
DataType::List(Box::new(Field::new("item", DataType::Int32, true)));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a soundness hole here, that applies more for parquet. If I created this array as non-nullable, there wouldn't have been an error, even though this doesn't comply with the spec.

@alamb @jorgecarleitao @Dandandan should we panic, or should I change the signature to Result<ArrayRef> so I can validate this condition?

I'm leaning to a panic, just like we panic on impl From<ArrayDataRef> for {type}Array. Wrapping a function that doesn't really error out, in a Result feels like assembing the Avengers for a single villain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote panic for now if you think it is a logic error that a user should be avoiding anyways.

Or alternately, we can make it panic now and Result later if someone finds a need

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that hole exists in so many places, unfortunately. Since ArrayData has a null_count, we have been relying on it for this.

Technically, this goes back to the issue that we are not performing any validation whether the datatype is consistent with the array's content. :(

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @nevi-me -- thank you. I left some hopefully helpful additional test suggestions. But all in all this looks great

}
DataType::Float16 => unreachable!(),
DataType::Int32
| DataType::UInt32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it is probably ok, it took me a while to convince myself that new_null_sized_array::<Int32Type> was ok to use for different types like Float32

I wonder if it would be better to explicitly list out the types here

 DataType::Float32 => new_null_sized_array::<Float32Type>(data_type, length),
 DataType::Date32 => new_null_sized_array::<Int32Type>(data_type, length),
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change them, seemed tedious though because f32 and i32 use the same underlying size. I was looking for a convenient way of getting the byte size

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, imo we have been doing this incorrectly all along. We currently have one PrimitiveArray per logical type (e.g. Date32, Int32) because there is a one-to-one correspondence between ArrowPrimitiveType and DataType. This implies that we will never be able to support Timestamp(_,X) with PrimitiveArray, as that type has many variations (one per time zone).

Having one array type per logical type also makes the whole code less maintainable, because we have to downcast for every variation of the DataType, instead of for every variation of the underlying physical type.

IMO we should have 1 PrimitiveArray per physical type. I.e. PrimitiveArray<i32>, PrimitiveArray<i64>, etc. The DataType just denotes how the physical values are to be interpreted (logical), not their layout in memory nor any safety considerations (e.g. &str vs &[u8]).

More generally, a generic parameter T represents a variation of the underlying physical / memory layout, and we have been using PrimitiveArray<T> to represent the same layout but different logical representations.

This is one of the main changes in the proposal I wrote, on which a primitive array is defined by its physical representation, PrimitiveArray<i64> and its logical representation (DataType), on which the datatype is a many to one to a physical representation (e.g. Date32, Time32 and Int32 all map to i32).

@nevi-me nevi-me force-pushed the make-empty-arrays branch 2 times, most recently from 4d62854 to f43426b Compare February 11, 2021 22:45
@codecov-io
Copy link

Codecov Report

Merging #9469 (8c45e47) into master (7660a22) will decrease coverage by 0.03%.
The diff coverage is 56.86%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9469      +/-   ##
==========================================
- Coverage   82.31%   82.28%   -0.04%     
==========================================
  Files         234      234              
  Lines       54482    54594     +112     
==========================================
+ Hits        44849    44920      +71     
- Misses       9633     9674      +41     
Impacted Files Coverage Δ
rust/datafusion/src/scalar.rs 56.73% <15.00%> (+1.04%) ⬆️
rust/arrow/src/array/array.rs 75.50% <62.99%> (-13.03%) ⬇️
rust/arrow/src/array/null.rs 87.50% <66.66%> (-5.10%) ⬇️
rust/datafusion/src/physical_plan/parquet.rs 88.05% <66.66%> (-0.06%) ⬇️
rust/arrow/src/array/equal/utils.rs 76.47% <0.00%> (+0.98%) ⬆️
rust/arrow/src/array/transform/fixed_binary.rs 84.21% <0.00%> (+5.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7660a22...8c45e47. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks again @nevi-me -- I am keen to get this merged in sooner rather than later (so I can get #9468 in and then pull this change into my downstream project).

Any objections @jorgecarleitao ?

@nevi-me
Copy link
Contributor Author

nevi-me commented Feb 13, 2021

@alamb you can merge this so it unblocks your work. If there are further questions, I'm happy to address them in a follow-up PR

@alamb
Copy link
Contributor

alamb commented Feb 13, 2021

Thanks @nevi-me -- I think this is ready to go then. Thanks again!

@alamb alamb closed this in 88e9eb8 Feb 13, 2021
@alamb
Copy link
Contributor

alamb commented Feb 13, 2021

FYI @maxburke

alamb added a commit that referenced this pull request Feb 14, 2021
As pointed out by @tustvold  on influxdata/influxdb_iox#783 (comment), it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint.

Currently trying to print them out causes a panic:

```
    #[test]
    fn test_pretty_format_null() -> Result<()> {
        // define a schema.
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Utf8, true),
            Field::new("b", DataType::Int32, true),
        ]));

        let num_rows = 4;

        // define data (null)
        let batch = RecordBatch::try_new(
            schema,
            vec![
                Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)),
                Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)),
            ],
        )?;

        let table = pretty_format_batches(&[batch])?;
}

```

Panics:

```
failures:

---- util::pretty::tests::test_pretty_format_null stdout ----
thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27
```

# Note

_Update: the issue with  `NullArray` claiming to be types such as  `Int32` has been fixed by @nevi-me  in #9469 _

Closes #9468 from alamb/alamb/support-null-printing

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Feb 17, 2021
This allows creating an array with n null values

Closes apache#9469 from nevi-me/make-empty-arrays

Lead-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Wakahisa <nevilledips@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Feb 17, 2021
As pointed out by @tustvold  on https://github.com/influxdata/influxdb_iox/pull/783#discussion_r574414243, it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint.

Currently trying to print them out causes a panic:

```
    #[test]
    fn test_pretty_format_null() -> Result<()> {
        // define a schema.
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Utf8, true),
            Field::new("b", DataType::Int32, true),
        ]));

        let num_rows = 4;

        // define data (null)
        let batch = RecordBatch::try_new(
            schema,
            vec![
                Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)),
                Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)),
            ],
        )?;

        let table = pretty_format_batches(&[batch])?;
}

```

Panics:

```
failures:

---- util::pretty::tests::test_pretty_format_null stdout ----
thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27
```

# Note

_Update: the issue with  `NullArray` claiming to be types such as  `Int32` has been fixed by @nevi-me  in apache#9469 _

Closes apache#9468 from alamb/alamb/support-null-printing

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
As pointed out by @tustvold  on influxdata/influxdb_iox#783 (comment), it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint.

Currently trying to print them out causes a panic:

```
    #[test]
    fn test_pretty_format_null() -> Result<()> {
        // define a schema.
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Utf8, true),
            Field::new("b", DataType::Int32, true),
        ]));

        let num_rows = 4;

        // define data (null)
        let batch = RecordBatch::try_new(
            schema,
            vec![
                Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)),
                Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)),
            ],
        )?;

        let table = pretty_format_batches(&[batch])?;
}

```

Panics:

```
failures:

---- util::pretty::tests::test_pretty_format_null stdout ----
thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27
```

# Note

_Update: the issue with  `NullArray` claiming to be types such as  `Int32` has been fixed by @nevi-me  in apache/arrow#9469 _

Closes #9468 from alamb/alamb/support-null-printing

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This allows creating an array with n null values

Closes apache#9469 from nevi-me/make-empty-arrays

Lead-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Wakahisa <nevilledips@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
As pointed out by @tustvold  on https://github.com/influxdata/influxdb_iox/pull/783#discussion_r574414243, it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint.

Currently trying to print them out causes a panic:

```
    #[test]
    fn test_pretty_format_null() -> Result<()> {
        // define a schema.
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Utf8, true),
            Field::new("b", DataType::Int32, true),
        ]));

        let num_rows = 4;

        // define data (null)
        let batch = RecordBatch::try_new(
            schema,
            vec![
                Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)),
                Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)),
            ],
        )?;

        let table = pretty_format_batches(&[batch])?;
}

```

Panics:

```
failures:

---- util::pretty::tests::test_pretty_format_null stdout ----
thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27
```

# Note

_Update: the issue with  `NullArray` claiming to be types such as  `Int32` has been fixed by @nevi-me  in apache#9469 _

Closes apache#9468 from alamb/alamb/support-null-printing

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This allows creating an array with n null values

Closes apache#9469 from nevi-me/make-empty-arrays

Lead-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Wakahisa <nevilledips@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
As pointed out by @tustvold  on https://github.com/influxdata/influxdb_iox/pull/783#discussion_r574414243, it would seem the whole point of `NullArray::new_with_type`, is to cheaply construct entirely null columns, with a smaller memory footprint.

Currently trying to print them out causes a panic:

```
    #[test]
    fn test_pretty_format_null() -> Result<()> {
        // define a schema.
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Utf8, true),
            Field::new("b", DataType::Int32, true),
        ]));

        let num_rows = 4;

        // define data (null)
        let batch = RecordBatch::try_new(
            schema,
            vec![
                Arc::new(NullArray::new_with_type(num_rows, DataType::Utf8)),
                Arc::new(NullArray::new_with_type(num_rows, DataType::Int32)),
            ],
        )?;

        let table = pretty_format_batches(&[batch])?;
}

```

Panics:

```
failures:

---- util::pretty::tests::test_pretty_format_null stdout ----
thread 'util::pretty::tests::test_pretty_format_null' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/util/display.rs:201:27
```

# Note

_Update: the issue with  `NullArray` claiming to be types such as  `Int32` has been fixed by @nevi-me  in apache#9469 _

Closes apache#9468 from alamb/alamb/support-null-printing

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants