Use blob::blob() class instead of bare list(raw) #115

Open
eitsupi opened this issue Feb 5, 2025 · 4 comments
Labels
feature a feature request or enhancement

Comments

@eitsupi

eitsupi commented Feb 5, 2025

Currently, nanoparquet seems to map some Parquet types to list(raw) in R.
I think it would be worthwhile to use blob vectors instead of a list of raw vectors, as nanoarrow and others have done.

@gaborcsardi gaborcsardi added the feature a feature request or enhancement label Feb 5, 2025
@gaborcsardi
Member

What are the advantages of this? Isn't it a problem that you'll get different behavior if the blob package is not installed? Or would you make this opt-in?

@eitsupi
Author

eitsupi commented Feb 10, 2025

> What are the advantages of this?

When mapping types to something like Arrow, which allows nested types, it may not be possible to tell whether a list(raw) is meant as a vector of raw (binary) values or as a list type whose elements are raw.

> Isn't it a problem that you'll get different behavior if the blob package is not installed?

I think the same can be said about hms::hms.
Both blob::blob and hms::hms are data structures that can be represented with base R.
I believe the data will still work even if those packages are not installed.

For reference, nanoarrow behaves as follows:

nanoarrow::infer_nanoarrow_schema(list(raw()))
#> <nanoarrow_schema binary>
#>  $ format    : chr "z"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 2
#>  $ children  : list()
#>  $ dictionary: NULL
nanoarrow::infer_nanoarrow_schema(blob::blob())
#> <nanoarrow_schema binary>
#>  $ format    : chr "z"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 2
#>  $ children  : list()
#>  $ dictionary: NULL

nanoarrow::as_nanoarrow_array(list(raw())) |>
  as.vector()
#> <blob[1]>
#> [1] blob[0 B]

Created on 2025-02-10 with reprex v2.1.1

Without blob installed:

nanoarrow::as_nanoarrow_array(list(raw())) |>
  as.vector()
#> [[1]]
#> raw(0)
#>
#> attr(,"ptype")
#> raw(0)
#> attr(,"class")
#> [1] "blob"          "vctrs_list_of" "vctrs_vctr"    "list"

Created on 2025-02-10 with reprex v2.1.1
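
The output above suggests that a blob vector is an ordinary list of raw vectors carrying a `ptype` attribute and a class vector. As a minimal sketch, assuming those attributes are all that matters, such an object can be rebuilt with base R alone. The variable names `x` and `sizes` are illustrative, and this is not how blob or nanoarrow construct the object internally.

```r
# Base R only: a blob-like object is a list of raw vectors plus attributes.
# The attribute values are copied from the printed output above.
x <- structure(
  list(as.raw(c(1, 2, 3)), raw(0)),
  ptype = raw(0),
  class = c("blob", "vctrs_list_of", "vctrs_vctr", "list")
)

# Without blob or vctrs loaded, the default list methods still apply.
first <- x[[1]]
sizes <- vapply(unclass(x), length, integer(1))

# The class tag is what lets a type mapper tell "binary" from "list of raw".
is_binary <- inherits(x, "blob")
```

The data stays usable as a plain list, and the class tag gives a type mapper something to key on; only printing and the other blob methods are lost when the package is absent.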

@gaborcsardi
Member

> I think the same can be said about hms::hms.

Indeed! I am trying to understand if adding these "very soft" types is useful at all, and how much confusion it causes.

I wonder if nanoparquet should auto-load the corresponding package when it adds an hms, blob, etc. column. And if that package is not installed then maybe give a warning? Or a message?
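
One way the auto-load-or-warn idea could look, as a rough sketch only: `check_soft_type()` is a hypothetical helper, not part of nanoparquet, and the message wording is made up. It relies on base R's `requireNamespace()`, which loads the package namespace when it is available and returns `FALSE` otherwise.

```r
# Hypothetical helper (not a nanoparquet function): try to load the package
# that provides methods for a column class, and warn if it is not installed.
check_soft_type <- function(pkg, cls) {
  # requireNamespace() loads the namespace if it can, so S3 methods such as
  # print and format become available for the class.
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(TRUE))
  }
  warning(
    "Column has class '", cls, "' but package '", pkg,
    "' is not installed; it will behave like a plain base R object.",
    call. = FALSE
  )
  invisible(FALSE)
}
```

A reader function could call `check_soft_type("blob", "blob")` once per returned column class. Swapping `warning()` for `message()` would give the quieter variant mentioned above.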

@eitsupi
Author

eitsupi commented Feb 11, 2025

I think it makes sense for read_parquet() to show a warning if those packages are not installed.
