NA should be automatically cast to the right type when creating a DataFrame with list-columns #795
Comments
Smaller reprex: pl$lit(list(1L, 2L, NA)) |
There's already a mechanism actually, @sorhawell left this comment here: r-polars/src/rust/src/conversion_r_to_s.rs Lines 288 to 305 in 4e9160a
|
I noticed that the arrow package does not support this either.
> arrow::arrow_table(x = list(1L, 2L, NA))
Error: Invalid: cannot convert
> arrow::arrow_table(x = list(1L, 2L, NA_integer_))
Table
3 rows x 1 columns
$x: list<item <int32>> |
I think using |
I was able to make this work in an implementation I am rewriting from scratch using savvy.
> as_polars_series(list(1L, 2L, NA))
shape: (3,)
Series: '' [list[i32]]
[
[1]
[2]
[null]
]
I think the implementation is far cleaner now that I've made it completely branch on the S3 method. To bring this into this repository would require a considerable rewrite of what is currently there, and I don't have the bandwidth to do it right now, but it should be possible.
as_polars_series.list <- function(x, name = NULL, ...) {
  series_list <- lapply(x, \(child) {
    if (is.null(child)) {
      NULL
    } else {
      as_polars_series(child)$`_s`
    }
  })
  PlRSeries$new_series_list(name %||% "", series_list) |>
    wrap()
}
fn new_series_list(name: &str, values: ListSexp) -> savvy::Result<Self> {
    let series_vec: Vec<Option<Series>> = values
        .values_iter()
        .map(|value| match value.into_typed() {
            TypedSexp::Null(_) => None,
            TypedSexp::Environment(e) => {
                let ptr = e.get(".ptr").unwrap().unwrap();
                let r_series = <&PlRSeries>::try_from(ptr).unwrap();
                Some(r_series.series.clone())
            }
            _ => panic!("Expected a list of Series"),
        })
        .collect();
    let dtype = series_vec
        .iter()
        .map(|s| {
            if let Some(s) = s {
                s.dtype().clone()
            } else {
                DataType::Null
            }
        })
        .reduce(|acc, b| try_get_supertype(&acc, &b).unwrap_or(DataType::String))
        .unwrap_or(DataType::Null);
    let casted_series_vec: Vec<Option<Series>> = series_vec
        .into_iter()
        .map(|s| {
            if let Some(s) = s {
                Some(s.cast(&dtype).unwrap())
            } else {
                None
            }
        })
        .collect();
    Ok(Series::new(name, casted_series_vec).into())
} |
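The dtype-resolution step in the Rust function above (fold the element dtypes pairwise into a supertype, fall back to string, and use null for an empty list) can be sketched in plain Python. This is only a toy model: `toy_supertype`, `common_dtype`, and the three-entry widening table are hypothetical names and assumptions made for illustration, not the Polars API, and real Polars supertype rules cover many more dtypes.

```python
from functools import reduce
from typing import Optional, Sequence

# Toy numeric widening order (an assumption; Polars has far more dtypes and rules).
_NUMERIC_RANK = {"i32": 0, "i64": 1, "f64": 2}


def toy_supertype(a: str, b: str) -> Optional[str]:
    """Return a common dtype for a and b, or None if this toy model has none."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if a in _NUMERIC_RANK and b in _NUMERIC_RANK:
        return a if _NUMERIC_RANK[a] >= _NUMERIC_RANK[b] else b
    return None


def common_dtype(dtypes: Sequence[str]) -> str:
    """Fold dtypes pairwise; fall back to "str" when no supertype exists."""

    def step(acc: str, b: str) -> str:
        st = toy_supertype(acc, b)
        return st if st is not None else "str"

    # Starting from "null" mirrors the `unwrap_or(DataType::Null)` for empty input.
    return reduce(step, dtypes, "null")
```

Under these assumptions, `common_dtype(["i32", "i32", "null"])` resolves to `"i32"`, while a single `"str"` element anywhere in the input turns the whole result into `"str"`.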
Related change in Python Polars 1.0.0 pola-rs/polars#16939 |
Probably but I don't think it's related to this issue. The average R user doesn't know (and I think also doesn't need to know) all the variants of |
Ultimately the problem here is that the elements of an R list are not guaranteed to share a type, whereas Apache Arrow's list requires all elements to have the same type, so converting an R list to an Arrow list must either fail without casting or cast. Users who have been using dplyr for a few years would know that we had to use explicit type of NA because specifying |
Yes but that was far from ideal and raised confusion, which is why they changed the behavior with |
If you want to do it on the R side, you probably need help from the vctrs package. If you want to do it on the Rust side, you could follow the approach I took with neo-r-polars (I found the original process for this by searching in the polars repository). My concern is that finding the supertype is extra processing, and that if even one string is mixed into the list, the whole thing becomes string type. |
I was thinking doing it on the Rust side.
That's where the
>>> pl.Series([[1, 2, "a"]], strict=True) # error
>>> pl.Series([[1, 2, "a"]], strict=False)
shape: (1,)
Series: '' [list[str]]
[
["1", "2", "a"]
] |
I guess it depends on how far we extend the "ignore NA" thing. I can also create a vector of length 0 in R, which in Polars would be a Series of length 0 with an explicit type, how should this ideally work?
> polars::as_polars_series(list(character(), integer()))
Error: Execution halted with the following contexts
0: In R: in as_polars_series():
0: During function call [polars::as_polars_series(list(character(), integer()))]
1: Encountered the following error in Rust-Polars:
When building series from R list; some parsed sub-elements did not match: One element was str and another was i32 |
In my opinion NULL should be used to represent missing values in R's list, and currently NULL works properly in Polars as well, being a Series of type null.
> polars::as_polars_series(list(1, 2, NULL, 3))
polars Series: shape: (4,)
Series: '' [list[f64]]
[
[1.0]
[2.0]
[]
[3.0]
]
Edit: The implementation of neo-r-polars would look like this. Perhaps this is preferable?
> as_polars_series(list(1, 2, NULL, 3))
shape: (4,)
Series: '' [list[f64]]
[
[1.0]
[2.0]
null
[3.0]
] |
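The two outputs above differ only in how a `NULL` element is stored: as an empty sub-list, or as a null row. A minimal plain-Python sketch of that difference follows; `to_list_rows` and its `null_as_empty` flag are hypothetical names, and Python's `None` stands in for R's `NULL`.

```python
from typing import Any, List, Optional, Sequence


def to_list_rows(
    values: Sequence[Optional[Any]], null_as_empty: bool = False
) -> List[Optional[list]]:
    """Wrap each scalar in a one-element row; map None to [] or to a null row."""
    rows: List[Optional[list]] = []
    for value in values:
        if value is None:
            # Either an empty sub-list (first output) or a missing row (second output).
            rows.append([] if null_as_empty else None)
        else:
            rows.append([value])
    return rows


current = to_list_rows([1.0, 2.0, None, 3.0], null_as_empty=True)
proposed = to_list_rows([1.0, 2.0, None, 3.0])
```

Only the second form lets a consumer distinguish "missing" from "present but empty", which is the practical argument for the neo-r-polars behaviour.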
Now works on the
library(neopolars)
as_polars_series(list(1, NULL, 1L), strict = TRUE)
#> Error in `as_polars_series()`:
#> ! Evaluation failed.
#> Caused by error:
#> ! If `strict = TRUE`, all elements of the list except `NULL` must have the same datatype. expected: `f64`, got: `i32` at index: 3
as_polars_series(list(1, NULL, 1L))
#> shape: (3,)
#> Series: '' [list[f64]]
#> [
#> [1.0]
#> null
#> [1.0]
#> ]
Created on 2024-09-06 with reprex v2.1.1
Probably the safest choice for most users is to use |
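The strict and non-strict behaviour shown in the reprex above can be modelled roughly as follows. This is a sketch under stated assumptions, not the neopolars implementation: `resolve_dtype` and the widening table are hypothetical, dtypes are plain strings, `None` stands in for an R `NULL` element, and the index in the error is 1-based to match the R message.

```python
from typing import Optional, Sequence

# Toy numeric widening order (an assumption made for this sketch).
_RANK = {"i32": 0, "i64": 1, "f64": 2}


def resolve_dtype(dtypes: Sequence[Optional[str]], strict: bool = False) -> str:
    """Pick one dtype for all elements, skipping None (R's NULL)."""
    expected: Optional[str] = None
    for index, dtype in enumerate(dtypes, start=1):
        if dtype is None:
            continue  # NULL elements never constrain the dtype
        if expected is None or dtype == expected:
            expected = dtype
        elif strict:
            raise ValueError(
                f"expected: `{expected}`, got: `{dtype}` at index: {index}"
            )
        elif expected in _RANK and dtype in _RANK:
            # Non-strict: widen to the larger numeric type.
            expected = max(expected, dtype, key=_RANK.__getitem__)
        else:
            # Non-strict and no numeric supertype: fall back to string.
            expected = "str"
    return expected if expected is not None else "null"
```

With `strict=False`, `resolve_dtype(["f64", None, "i32"])` gives `"f64"`; with `strict=True` the same input raises, reporting the mismatch at index 3.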
Does it work with NA as in the example of the first post? |
No, only |
I was initially going to treat only the polars' r-polars/src/rust/src/series/construction.rs Lines 104 to 113 in 5d5be1f
Since even Series of length 0 and |
In this case, could we have |
Of course it would be better that way. Line 274 in 5d5be1f
The situation is so different from Python that perhaps the default value should be |
Very annoying to have to specify
NA_integer_
or the other variants: