Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NA should be automatically cast to the right type when creating a DataFrame with list-columns #795

Open
etiennebacher opened this issue Feb 8, 2024 · 20 comments
Labels
enhancement New feature or request
Milestone

Comments

@etiennebacher
Copy link
Collaborator

etiennebacher commented Feb 8, 2024

Very annoying to have to specify NA_integer_ or the other variants:

library(polars)
options(polars.do_not_repeat_call = TRUE)

pl$DataFrame(x = list(1L, 2L, NA_integer_))
#> shape: (3, 1)
#> ┌───────────┐
#> │ x         │
#> │ ---       │
#> │ list[i32] │
#> ╞═══════════╡
#> │ [1]       │
#> │ [2]       │
#> │ [null]    │
#> └───────────┘

# NA should be automatically cast to the correct type
pl$DataFrame(x = list(1L, 2L, NA))
#> Error: Execution halted with the following contexts
#>    0: In R: in $DataFrame():
#>    1: When constructing polars literal from Robj
#>    2: Encountered the following error in Rust-Polars:
#>          expected Series
@eitsupi eitsupi added the enhancement New feature or request label Feb 8, 2024
@etiennebacher
Copy link
Collaborator Author

Smaller reprex:

pl$lit(list(1L, 2L, NA))

@eitsupi
Copy link
Collaborator

eitsupi commented Feb 12, 2024

Perhaps a mechanism is needed to guess the type and cast it.

I don't know how type conversion is currently done between R and Polars.
Related to #425 and #592

@etiennebacher
Copy link
Collaborator Author

There's already a mechanism actually, @sorhawell left this comment here:

// boubble any errors
let series_vec = series_vec_result?;
//TODO cast leafs type to any shared see polars_core::utils::get_supertype()
// check for any type mismatch to avoid polars panics
let mut s_iter = series_vec.iter();
let first_s = s_iter.next();
for s_ref in s_iter {
if s_ref.dtype() != first_s.expect("could not loop if none first_s").dtype() {
Err(pl::PolarsError::SchemaMismatch(format!(
"When building series from R list; some parsed sub-elements did not match: One element was {} and another was {}",
first_s.expect("dont worry about it").dtype(),s_ref.dtype()
).into()))?;
}
}
// use polars new method to concat concatenated series
Ok(pl::Series::new(name, series_vec))

@eitsupi
Copy link
Collaborator

eitsupi commented Feb 23, 2024

I noticed that the arrow package does not support this either.

> arrow::arrow_table(x = list(1L, 2L, NA))
Error: Invalid: cannot convert

> arrow::arrow_table(x = list(1L, 2L, NA_integer_))
Table
3 rows x 1 columns
$x: list<item <int32>>

@eitsupi
Copy link
Collaborator

eitsupi commented Apr 11, 2024

I think using vctrs::list_of() instead of list() is simpler solution for now.
I believe this function has been widely used in the tidyverse world for years.

@eitsupi
Copy link
Collaborator

eitsupi commented Jun 14, 2024

I was able to make this work in an implementation I am rewriting from scratch using savvy.

> as_polars_series(list(1L, 2L, NA))
shape: (3,)
Series: '' [list[i32]]
[
        [1]
        [2]
        [null]
]

I think the implementation is far cleaner now that I've made it completely branch on the S3 method.

To bring this into this repository would require a considerable rewrite of what is currently there, and I don't have the bandwidth to do it right now, but it should be possible.

as_polars_series.list <- function(x, name = NULL, ...) {
  series_list <- lapply(x, \(child) {
    if (is.null(child)) {
      NULL
    } else {
      as_polars_series(child)$`_s`
    }
  })

  PlRSeries$new_series_list(name %||% "", series_list) |>
    wrap()
}
    fn new_series_list(name: &str, values: ListSexp) -> savvy::Result<Self> {
        let series_vec: Vec<Option<Series>> = values
            .values_iter()
            .map(|value| match value.into_typed() {
                TypedSexp::Null(_) => None,
                TypedSexp::Environment(e) => {
                    let ptr = e.get(".ptr").unwrap().unwrap();
                    let r_series = <&PlRSeries>::try_from(ptr).unwrap();
                    Some(r_series.series.clone())
                }
                _ => panic!("Expected a list of Series"),
            })
            .collect();

        let dtype = series_vec
            .iter()
            .map(|s| {
                if let Some(s) = s {
                    s.dtype().clone()
                } else {
                    DataType::Null
                }
            })
            .reduce(|acc, b| try_get_supertype(&acc, &b).unwrap_or(DataType::String))
            .unwrap_or(DataType::Null);

        let casted_series_vec: Vec<Option<Series>> = series_vec
            .into_iter()
            .map(|s| {
                if let Some(s) = s {
                    Some(s.cast(&dtype).unwrap())
                } else {
                    None
                }
            })
            .collect();

        Ok(Series::new(name, casted_series_vec).into())
    }

@eitsupi
Copy link
Collaborator

eitsupi commented Jun 22, 2024

Related change in Python Polars 1.0.0 pola-rs/polars#16939
Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.

@etiennebacher
Copy link
Collaborator Author

Perhaps two functions may be needed on the Rust side for list, one to autocast and one not to autocast.

Probably but I don't think it's related to this issue. The average R user doesn't know (and I think also doesn't need to know) all the variants of NA. Passing NA in a list should work automatically, just like it does when we pass a vector.

@eitsupi
Copy link
Collaborator

eitsupi commented Jun 22, 2024

Ultimately the problem here is that each element in R's list is not guaranteed to have the same type; Apache Arrow's list requires that all elements have the same type, so converting R's list to Arrow's list can either fail without cast or or cast.

Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA in something like case_when() would fail due to a type mismatch.
This has been fixed by vctrs (tidyverse/dplyr#6300), so we may be able to implement the same thing (cast NA) here by looking at the vctrs implementation.

@etiennebacher
Copy link
Collaborator Author

Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying NA in something like case_when() would fail due to a type mismatch.

Yes but that was far from ideal and raised confusion, which is why they changed the behavior with vctrs. One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).

@eitsupi
Copy link
Collaborator

eitsupi commented Jun 22, 2024

One way to handle that internally would probably be to get the supertype of all non-NA values and then cast NA to this supertype (don't know how to do that though).

If you want to do it on the R side, you probably need to ask for help from the vctrs package.

If you want to do it on the Rust side, the following approach I took with neo-r-polars (I found the original process for this by searching in the polars repository).
https://github.com/eitsupi/neo-r-polars/blob/e4f953ff1f5ea830c777af0d5211e144adeceedf/src/rust/src/series/construction.rs#L76-L97

My concern is that this is extra processing being performed to find the supertype, and that if even one string is mixed in the list, the whole thing becomes string type.
I imagine cases where implicit casts are not desired.

@etiennebacher
Copy link
Collaborator Author

I was thinking doing it on the Rust side.

if even one string is mixed in the list, the whole thing becomes string type. I imagine cases where implicit casts are not desired.

That's where the strict parameter is needed. To clarify, I do think this parameter is important, I just think it should ignore NA and therefore is unrelated to this issue. Here's the current behavior with the beta of py-polars 1.0:

>>> pl.Series([[1, 2, "a"]], strict=True) # error

>>> pl.Series([[1, 2, "a"]], strict=False)
shape: (1,)
Series: '' [list[str]]
[
        ["1", "2", "a"]
]

@eitsupi
Copy link
Collaborator

eitsupi commented Jun 22, 2024

I guess it depends on how far we extend the "ignore NA" thing.
For example, is it only length 1, or is the length arbitrary and all elements are polars' null?

I can also create a vector of length 0 in R, which in Polars would be a Series of length 0 with an explicit type, how should this ideally work?

> polars::as_polars_series(list(character(), integer()))
Error: Execution halted with the following contexts
   0: In R: in as_polars_series():
   0: During function call [polars::as_polars_series(list(character(), integer()))]
   1: Encountered the following error in Rust-Polars:
        When building series from R list; some parsed sub-elements did not match: One element was str and another was i32

@eitsupi
Copy link
Collaborator

eitsupi commented Jun 22, 2024

In my opinion NULL should be used to represent missing values in R's list, and currently NULL works properly in Polars as well, being a Series of type null.
is there any reason why you would want to use NA instead of NULL?

> polars::as_polars_series(list(1, 2, NULL, 3))
polars Series: shape: (4,)
Series: '' [list[f64]]
[
        [1.0]
        [2.0]
        []
        [3.0]
]

Edit: The implementation of neo-r-polars would look like this. Perhaps this is preferable?

> as_polars_series(list(1, 2, NULL, 3))
shape: (4,)
Series: '' [list[f64]]
[
        [1.0]
        [2.0]
        null
        [3.0]
]

@eitsupi eitsupi added this to the Rewrite milestone Jun 27, 2024
@eitsupi
Copy link
Collaborator

eitsupi commented Sep 6, 2024

Now works on the next branch.

library(neopolars)

as_polars_series(list(1, NULL, 1L), strict = TRUE)
#> Error in `as_polars_series()`:
#> ! Evaluation failed.
#> Caused by error:
#> ! If `strict = TRUE`, all elements of the list except `NULL` must have the same datatype. expected: `f64`, got: `i32` at index: 3
as_polars_series(list(1, NULL, 1L))
#> shape: (3,)
#> Series: '' [list[f64]]
#> [
#>  [1.0]
#>  null
#>  [1.0]
#> ]

Created on 2024-09-06 with reprex v2.1.1

Probably the safest choice for most users is to use vctrs::list_of().

@etiennebacher
Copy link
Collaborator Author

Does it work with NA as in the example of the first post?

@eitsupi
Copy link
Collaborator

eitsupi commented Sep 6, 2024

No, only NULLs are treated specially; use strict=FALSE to cast NA to NA_integer_.

@eitsupi
Copy link
Collaborator

eitsupi commented Sep 6, 2024

I was initially going to treat only the polars' Null data type specially, but gave up when I realized that when passing a vector of Series to the Series::new() function, if the first element is Null, all subsequent Series will be cast to Null and the values will all be null.

for (i, s) in series_vec.iter().enumerate() {
if let Some(s) = s {
if s.dtype() != &expected_dtype {
return Err(
format!("If `strict = TRUE`, all elements of the list except `NULL` must have the same datatype. expected: `{}`, got: `{}` at index: {}", expected_dtype, s.dtype(), i + 1).into()
);
}
}
}
return Ok(Series::new(name, series_vec).into());

Since even Series of length 0 and None are treated differently in this way, it is quite distorted to give special treatment to NAs that fall under Series of length 1.

@etiennebacher
Copy link
Collaborator Author

No, only NULLs are treated specially; use strict=FALSE to cast NA to NA_integer_.

In this case, could we have strict=FALSE by default? I don't think many average users know the different variants of NA and want to use NA whatever the datatype

@eitsupi
Copy link
Collaborator

eitsupi commented Sep 6, 2024

In this case, could we have strict=FALSE by default? I don't think many average users know the different variants of NA and want to use NA whatever the datatype

Of course it would be better that way.
vctrs::list_of() also automatically casts the super type by default.

as_polars_series.list <- function(x, name = NULL, ..., strict = FALSE) {

The situation is so different from Python that perhaps the default value should be FALSE even if this argument is to be exposed to pl$Series() (it is not currently exposed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants