`NA` should be automatically cast to the right type when creating a `DataFrame` with list-columns #795

etiennebacher · 2024-02-08T17:53:56Z

Very annoying to have to specify NA_integer_ or the other variants:

library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(x = list(1L, 2L, NA_integer_))
#> shape: (3, 1)
#> ┌───────────┐
#> │ x         │
#> │ ---       │
#> │ list[i32] │
#> ╞═══════════╡
#> │ [1]       │
#> │ [2]       │
#> │ [null]    │
#> └───────────┘

# NA should be automatically cast to the correct type
pl$DataFrame(x = list(1L, 2L, NA))
#> Error: Execution halted with the following contexts
#>    0: In R: in $DataFrame():
#>    1: When constructing polars literal from Robj
#>    2: Encountered the following error in Rust-Polars:
#>          expected Series

The text was updated successfully, but these errors were encountered:

etiennebacher · 2024-02-11T19:36:59Z

Smaller reprex:

pl$lit(list(1L, 2L, NA))

eitsupi · 2024-02-12T03:43:54Z

Perhaps a mechanism is needed to guess the type and cast it.

I don't know how type conversion is currently done between R and Polars.
Related to #425 and #592

etiennebacher · 2024-02-12T20:45:36Z

There's already a mechanism actually, @sorhawell left this comment here:

r-polars/src/rust/src/conversion_r_to_s.rs

Lines 288 to 305 in 4e9160a

    
           // boubble any errors 
        
           let series_vec = series_vec_result?; 
        
           //TODO cast leafs type to any shared see polars_core::utils::get_supertype() 
        
           // check for any type mismatch to avoid polars panics 
        
           let mut s_iter = series_vec.iter(); 
        
           let first_s = s_iter.next(); 
        
           for s_ref in s_iter { 
        
               if s_ref.dtype() != first_s.expect("could not loop if none first_s").dtype() { 
        
                   Err(pl::PolarsError::SchemaMismatch(format!( 
        
                       "When building series from R list; some parsed sub-elements did not match: One element was {} and another was {}", 
        
                       first_s.expect("dont worry about it").dtype(),s_ref.dtype() 
        
                   ).into()))?; 
        
               } 
        
           } 
        
           // use polars new method to concat concatenated series 
        
           Ok(pl::Series::new(name, series_vec))

eitsupi · 2024-02-23T05:10:55Z

I noticed that the arrow package does not support this either.

> arrow::arrow_table(x = list(1L, 2L, NA))
Error: Invalid: cannot convert

> arrow::arrow_table(x = list(1L, 2L, NA_integer_))
Table
3 rows x 1 columns
$x: list<item <int32>>

eitsupi · 2024-04-11T16:19:11Z

I think using vctrs::list_of() instead of list() is simpler solution for now.
I believe this function has been widely used in the tidyverse world for years.

eitsupi · 2024-06-14T14:12:24Z

I was able to make this work in an implementation I am rewriting from scratch using savvy.

> as_polars_series(list(1L, 2L, NA))
shape: (3,)
Series: '' [list[i32]]
[
        [1]
        [2]
        [null]
]

I think the implementation is far cleaner now that I've made it completely branch on the S3 method.

To bring this into this repository would require a considerable rewrite of what is currently there, and I don't have the bandwidth to do it right now, but it should be possible.

as_polars_series.list <- function(x, name = NULL, ...) {
  series_list <- lapply(x, \(child) {
    if (is.null(child)) {
      NULL
    } else {
      as_polars_series(child)$`_s`
    }
  })

  PlRSeries$new_series_list(name %||% "", series_list) |>
    wrap()
}

    fn new_series_list(name: &str, values: ListSexp) -> savvy::Result<Self> {
        let series_vec: Vec<Option<Series>> = values
            .values_iter()
            .map(|value| match value.into_typed() {
                TypedSexp::Null(_) => None,
                TypedSexp::Environment(e) => {
                    let ptr = e.get(".ptr").unwrap().unwrap();
                    let r_series = <&PlRSeries>::try_from(ptr).unwrap();
                    Some(r_series.series.clone())
                }
                _ => panic!("Expected a list of Series"),
            })
            .collect();

        let dtype = series_vec
            .iter()
            .map(|s| {
                if let Some(s) = s {
                    s.dtype().clone()
                } else {
                    DataType::Null
                }
            })
            .reduce(|acc, b| try_get_supertype(&acc, &b).unwrap_or(DataType::String))
            .unwrap_or(DataType::Null);

        let casted_series_vec: Vec<Option<Series>> = series_vec
            .into_iter()
            .map(|s| {
                if let Some(s) = s {
                    Some(s.cast(&dtype).unwrap())
                } else {
                    None
                }
            })
            .collect();

        Ok(Series::new(name, casted_series_vec).into())
    }

eitsupi · 2024-06-22T11:18:21Z

Related change in Python Polars 1.0.0 pola-rs/polars#16939
Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.

etiennebacher · 2024-06-22T13:10:40Z

Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.

Probably but I don't think it's related to this issue. The average R user doesn't know (and I think also doesn't need to know) all the variants of NA. Passing NA in a list should work automatically, just like it does when we pass a vector.

eitsupi · 2024-06-22T13:24:40Z

Ultimately the problem here is that each element in R's list is not guaranteed to have the same type; Apache Arrow's list requires that all elements have the same type, so converting R's list to Arrow's list can either fail without cast or or cast.

Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA in something like case_when() would fail due to a type mismatch.
This has been fixed by vctrs (tidyverse/dplyr#6300), so we may be able to implement the same thing (cast NA) here by looking at the vctrs implementation.

etiennebacher · 2024-06-22T13:44:02Z

Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA in something like case_when() would fail due to a type mismatch.

Yes but that was far from ideal and raised confusion, which is why they changed the behavior with vctrs. One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).

eitsupi · 2024-06-22T13:59:58Z

One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).

If you want to do it on the R side, you probably need to ask for help from the vctrs package.

If you want to do it on the Rust side, the following approach I took with neo-r-polars (I found the original process for this by searching in the polars repository).
https://github.com/eitsupi/neo-r-polars/blob/e4f953ff1f5ea830c777af0d5211e144adeceedf/src/rust/src/series/construction.rs#L76-L97

My concern is that this is extra processing being performed to find the supertype, and that if even one string is mixed in the list, the whole thing becomes string type.
I imagine cases where implicit casts are not desired.

etiennebacher · 2024-06-22T14:09:28Z

I was thinking doing it on the Rust side.

if even one string is mixed in the list, the whole thing becomes string type. I imagine cases where implicit casts are not desired.

That's where the strict parameter is needed. To clarify, I do think this parameter is important, I just think it should ignore NA and therefore is unrelated to this issue. Here's the current behavior with the beta of py-polars 1.0:

>>> pl.Series([[1, 2, "a"]], strict=True) # error

>>> pl.Series([[1, 2, "a"]], strict=False)
shape: (1,)
Series: '' [list[str]]
[
        ["1", "2", "a"]
]

eitsupi · 2024-06-22T14:19:02Z

I guess it depends on how far we extend the "ignore NA" thing.
For example, is it only length 1, or is the length arbitrary and all elements are polars' null?

I can also create a vector of length 0 in R, which in Polars would be a Series of length 0 with an explicit type, how should this ideally work?

> polars::as_polars_series(list(character(), integer()))
Error: Execution halted with the following contexts
   0: In R: in as_polars_series():
   0: During function call [polars::as_polars_series(list(character(), integer()))]
   1: Encountered the following error in Rust-Polars:
        When building series from R list; some parsed sub-elements did not match: One element was str and another was i32

eitsupi · 2024-06-22T14:22:54Z

In my opinion NULL should be used to represent missing values in R's list, and currently NULL works properly in Polars as well, being a Series of type null.
is there any reason why you would want to use NA instead of NULL?

> polars::as_polars_series(list(1, 2, NULL, 3))
polars Series: shape: (4,)
Series: '' [list[f64]]
[
        [1.0]
        [2.0]
        []
        [3.0]
]

Edit: The implementation of neo-r-polars would look like this. Perhaps this is preferable?

> as_polars_series(list(1, 2, NULL, 3))
shape: (4,)
Series: '' [list[f64]]
[
        [1.0]
        [2.0]
        null
        [3.0]
]

eitsupi · 2024-09-06T08:07:27Z

Now works on the next branch.

library(neopolars)

as_polars_series(list(1, NULL, 1L), strict = TRUE)
#> Error in `as_polars_series()`:
#> ! Evaluation failed.
#> Caused by error:
#> ! If `strict = TRUE`, all elements of the list except `NULL` must have the same datatype. expected: `f64`, got: `i32` at index: 3

as_polars_series(list(1, NULL, 1L))
#> shape: (3,)
#> Series: '' [list[f64]]
#> [
#>  [1.0]
#>  null
#>  [1.0]
#> ]

^{Created on 2024-09-06 with reprex v2.1.1}

Probably the safest choice for most users is to use vctrs::list_of().

etiennebacher · 2024-09-06T08:11:31Z

Does it work with NA as in the example of the first post?

eitsupi · 2024-09-06T08:32:01Z

No, only NULLs are treated specially; use strict=FALSE to cast NA to NA_integer_.

eitsupi · 2024-09-06T08:44:48Z

I was initially going to treat only the polars' Null data type specially, but gave up when I realized that when passing a vector of Series to the Series::new() function, if the first element is Null, all subsequent Series will be cast to Null and the values will all be null.

r-polars/src/rust/src/series/construction.rs

Lines 104 to 113 in 5d5be1f

    
           for (i, s) in series_vec.iter().enumerate() { 
        
               if let Some(s) = s { 
        
                   if s.dtype() != &expected_dtype { 
        
                       return Err( 
        
                           format!("If `strict = TRUE`, all elements of the list except `NULL` must have the same datatype. expected: `{}`, got: `{}` at index: {}", expected_dtype, s.dtype(), i + 1).into() 
        
                       ); 
        
                   } 
        
               } 
        
           } 
        
           return Ok(Series::new(name, series_vec).into());

Since even Series of length 0 and None are treated differently in this way, it is quite distorted to give special treatment to NAs that fall under Series of length 1.

etiennebacher · 2024-09-06T10:46:45Z

No, only NULLs are treated specially; use strict=FALSE to cast NA to NA_integer_.

In this case, could we have strict=FALSE by default? I don't think many average users know the different variants of NA and want to use NA whatever the datatype

eitsupi · 2024-09-06T11:00:18Z

In this case, could we have strict=FALSE by default? I don't think many average users know the different variants of NA and want to use NA whatever the datatype

Of course it would be better that way.
vctrs::list_of() also automatically casts the super type by default.

r-polars/R/as_polars_series.R

Line 274 in 5d5be1f

as_polars_series.list <- function(x, name = NULL, ..., strict = FALSE) {

The situation is so different from Python that perhaps the default value should be FALSE even if this argument is to be exposed to pl$Series() (it is not currently exposed).

eitsupi added the enhancement New feature or request label Feb 8, 2024

eitsupi added this to the Rewrite milestone Jun 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`NA` should be automatically cast to the right type when creating a `DataFrame` with list-columns #795

`NA` should be automatically cast to the right type when creating a `DataFrame` with list-columns #795

etiennebacher commented Feb 8, 2024 •

edited

Loading

etiennebacher commented Feb 11, 2024

eitsupi commented Feb 12, 2024

etiennebacher commented Feb 12, 2024

eitsupi commented Feb 23, 2024

eitsupi commented Apr 11, 2024

eitsupi commented Jun 14, 2024 •

edited

Loading

eitsupi commented Jun 22, 2024

etiennebacher commented Jun 22, 2024

eitsupi commented Jun 22, 2024

etiennebacher commented Jun 22, 2024

eitsupi commented Jun 22, 2024

etiennebacher commented Jun 22, 2024

eitsupi commented Jun 22, 2024 •

edited

Loading

eitsupi commented Jun 22, 2024 •

edited

Loading

eitsupi commented Sep 6, 2024

etiennebacher commented Sep 6, 2024

eitsupi commented Sep 6, 2024

eitsupi commented Sep 6, 2024

etiennebacher commented Sep 6, 2024

eitsupi commented Sep 6, 2024 •

edited

Loading

NA should be automatically cast to the right type when creating a DataFrame with list-columns #795

NA should be automatically cast to the right type when creating a DataFrame with list-columns #795

Comments

etiennebacher commented Feb 8, 2024 • edited Loading

etiennebacher commented Feb 11, 2024

eitsupi commented Feb 12, 2024

etiennebacher commented Feb 12, 2024

eitsupi commented Feb 23, 2024

eitsupi commented Apr 11, 2024

eitsupi commented Jun 14, 2024 • edited Loading

eitsupi commented Jun 22, 2024

etiennebacher commented Jun 22, 2024

eitsupi commented Jun 22, 2024

etiennebacher commented Jun 22, 2024

eitsupi commented Jun 22, 2024

etiennebacher commented Jun 22, 2024

eitsupi commented Jun 22, 2024 • edited Loading

eitsupi commented Jun 22, 2024 • edited Loading

eitsupi commented Sep 6, 2024

etiennebacher commented Sep 6, 2024

eitsupi commented Sep 6, 2024

eitsupi commented Sep 6, 2024

etiennebacher commented Sep 6, 2024

eitsupi commented Sep 6, 2024 • edited Loading

`NA` should be automatically cast to the right type when creating a `DataFrame` with list-columns #795

`NA` should be automatically cast to the right type when creating a `DataFrame` with list-columns #795

etiennebacher commented Feb 8, 2024 •

edited

Loading

eitsupi commented Jun 14, 2024 •

edited

Loading

eitsupi commented Jun 22, 2024 •

edited

Loading

eitsupi commented Jun 22, 2024 •

edited

Loading

eitsupi commented Sep 6, 2024 •

edited

Loading