Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

class does not work in read_parquet() #104

Closed
gaborcsardi opened this issue Jan 23, 2025 · 0 comments
Closed

class does not work in read_parquet() #104

gaborcsardi opened this issue Jan 23, 2025 · 0 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@gaborcsardi
Copy link
Member

Apologies for (possibly) submitting a support request disguised as an issue report, but I am failing to get the class option to work and can't figure out what is going wrong. Here is the reprex:

# Load nanoparquet

library(nanoparquet)

packageVersion("nanoparquet")

#> [1] '0.3.1.9000'



# Write a data.frame

d <- data.frame(letters)

f <- tempfile()

write_parquet(d, f)



# Read after setting options

options(nanoparquet.class = c("tbl_df", "tbl", "data.frame"))

read_parquet(f) |> class()

#> [1] "tbl"        "data.frame"



# Read using explicit parquet_options()

read_parquet(f, options = parquet_options(class = c("tbl_df", "tbl", "data.frame"))) |> class()

#> [1] "tbl"        "data.frame"



# Try alternative specifications

read_parquet(f, options = parquet_options(class = c("data.frame")))    |> class()

#> [1] "tbl"        "data.frame"

read_parquet(f, options = parquet_options(class = character()))        |> class()

#> [1] "tbl"        "data.frame"

read_parquet(f, options = parquet_options(class = "tbl_df"))           |> class()

#> [1] "tbl"        "data.frame"

read_parquet(f, options = parquet_options(class = ""))                 |> class()

#> [1] "tbl"        "data.frame"

A few other somewhat related thoughts:

  • I appreciate the concern that one does not want to restore a broken object. One possibility for allowing the storage of different types of data frames would be to have an accept_classes (or simply classes) parameter, which would only allow reading (and writing?) objects if any classes were included in the accept_classes list, punting the responsibility to the user. Trying to read/write objects with unlisted class entries would result in an error. A default value of c("tbl_df", "tbl", "data.frame") would allow some very common use cases (i.e. base data frames and tibbles) to be transparently round-tripped. Interaction with class could work by some heuristic about priorities, such as:

    • On write, add class data if accept_classes is set

    • On read, if accept_classes is set and class data is found when reading a parquet file, use that

    • If no class data is found when reading a parquet file, use value from class

    • If one or the other params are null/NA, use the other

    • If both are null/NA, error out

  • Would it make sense to reserve ... as the first parameter in parquet_options() to force the use of name-based parameters and prevent location-based usage? It feels like parquet_options() might accumulate a lot of parameters over time, and having the flexibility to group related parameters would help with documentation, but reordering them would break existing location-based usage. Putting ... first would of course break backwards compatibility now, but ensure that later reordering of parameters would from then on be backwards compatible. Since it is still early days for nanoparquet I think it might be worthwhile trade-off.

  • I wonder why the default value of class is c("tbl", "data.frame") and not c("tbl_df", "tbl", "data.frame")? It seems to me that tbl is more of an "interface" than a class, since all actual instances of tbls that I have encountered previously are actually also other classes such as either tbl_df or more complex types (such as those found in dbplyr). And the returned object is structurally a tbl_df, right, so it would be useful for any functions taking it to be able to use the most specific class information?

Originally posted by @torfason in #82

@gaborcsardi gaborcsardi added the bug an unexpected problem or unintended behavior label Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

1 participant