format = "auto"
#1311
Comments
This looks mostly achievable. The best fit internally would be to treat
I thought about making |
Currently, So now I am totally on board with adding this as an optional feature. Before making it the default, I think I would prefer to wait until |
Yeah, I was proposing you add it and then in a future release make it the default, if it turns out to work well for folks. |
Just need to be careful here and consider the case where a user starts with e.g. |
That's another good case to think through. If
|
Offline, @shikokuchuo pointed out reproducibility issues if |
To clarify: suppose you save a data frame with With So I will plan to write a |
As for |
Some thoughts as a targets user:
|
@multimeric is that a common case? I didn't think many people actually used list-columns, but if you'd find support for them to be helpful, it would be super useful if you could file an issue with a motivating use case so we can add support 😄 I don't think the default would hash those files because it would rely on file stamps. How are you currently ensuring that future steps are run correctly if you're not using targets to manage the recomputation? |
I use list columns a lot, because I'm often working with wacky S4 objects etc. Other bioconductor users are likely going to be in the same situation. Even if list columns were implemented, surely it won't ever be possible to represent an arbitrary R data structure in parquet? For example, what about attributes? This is my concern about making parquet the default for data frames. Sorry, you did propose
list(
  tar_target(root, c("/filesystem_a", "/filesystem_b")),
  tar_target(input_data, 1:10),
  tar_target(
    save_data,
    {
      # One directory per branch, named after the branch.
      dir_path <- file.path(root, tar_name())
      dir.create(dir_path)
      file_path <- file.path(dir_path, "file.rds")
      saveRDS(input_data, file_path)
      # Return the directory so format = "file" tracks it.
      dir_path
    },
    pattern = cross(root, input_data),
    format = "file"
  )
) |
@multimeric you are right that parquet will never store that sort of data. I think it's generally rare in the spectrum of R users, but |
This feels like a similar design point to something like the So I think RDS fits the bill in this regard in that it mostly 'just works', but crappily enough that it gets users looking for better alternatives or rolling their own format. As a heavy targets user I almost never use |
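A minimal base R sketch of the "RDS mostly just works" point, using the list-column and attribute concerns raised above. The `Sample` class, the `units` attribute, and the column names are made up for illustration; nothing here calls targets or nanoparquet.

```r
library(methods)

# Hypothetical S4 class standing in for a "wacky" domain object.
setClass("Sample", representation(id = "character", values = "numeric"))

# A data frame with a list column and a custom attribute.
df <- data.frame(group = c("a", "b"))
df$payload <- list(new("Sample", id = "s1", values = c(1, 2)), 1:3)
attr(df, "units") <- "mg/L"

# Round trip through RDS.
path <- tempfile(fileext = ".rds")
saveRDS(df, path)
restored <- readRDS(path)

# TRUE when the list column, the S4 object, and the attribute all survive.
round_trip_ok <- identical(df, restored)
```

The same object is exactly the kind of data frame a columnar format would have to reject or flatten, which is why a general-purpose fallback format is still needed.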
On the plane ride to Posit Conf 2024, I implemented a rendition of format = "auto" based on "file" and "qs" (leaving out "nanoparquet" because of the issues I mentioned above and "file_fast" because some file systems have extremely imprecise time stamps). |
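To illustrate the time stamp concern in general terms, here is a base R sketch. It is not how targets checks files internally; it only simulates a file system whose modification time does not advance between two writes, by forcing both writes to the same time stamp with `Sys.setFileTime()`.

```r
path <- tempfile(fileext = ".txt")
# Assumed fixed time stamp, standing in for a coarse file system clock.
fixed_time <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")

writeLines("version 1", path)
Sys.setFileTime(path, fixed_time)
stamp_v1 <- as.numeric(file.mtime(path))
size_v1 <- file.size(path)
hash_v1 <- unname(tools::md5sum(path))

# Change the contents, then pretend the time stamp did not move.
writeLines("version 2", path)
Sys.setFileTime(path, fixed_time)
stamp_v2 <- as.numeric(file.mtime(path))
size_v2 <- file.size(path)
hash_v2 <- unname(tools::md5sum(path))

# A check based on time stamp and size sees no change; a hash does.
looks_unchanged <- stamp_v1 == stamp_v2 && size_v1 == size_v2
really_changed <- hash_v1 != hash_v2
```

Here `looks_unchanged` and `really_changed` are both `TRUE`, which is the failure mode a format that trusts time stamps would be exposed to on such a file system.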
When qs2 is stable, "auto" will use "qs2" instead of "qs". |
To follow up, ropensci/tarchetypes#197 implements the nanoparquet storage format we discussed. This is the first general-ish storage format I have considered in a long time, maybe even the first since I introduced tar_format() for user-defined formats. Going forward, my plan is to keep legacy formats like "keras" and "fst" in targets itself, but delegate new formats to tarchetypes and implement them with tar_format(). I think this will help manage scope creep. I may make an exception for qs2 because it is so general, but I haven't decided. |
Prework
Proposal
I propose that tar_target() take a new type of format called auto and, if successful, that it become the default. auto would behave in the following way:
For data frames, it would use format = "nanoparquet", which would be a new format that uses nanoparquet to create parquet files. nanoparquet is a zero-dependency parquet reader/writer, so targets could take a hard dependency on this package, ensuring that it's available to all users.
If all(file.exists(output)) is true, it would use format = "file_fast" (unless trust_object_timestamps is FALSE, in which case it would use "file").
This steers the new user towards high-performance formats while allowing experienced users to continue to pick the best defaults for them.
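The dispatch rules in the proposal can be sketched in base R roughly as follows. `choose_auto_format()` is a hypothetical helper, not a targets function. The data frame condition is inferred from the discussion in this thread, and the `is.character()` guard and the `"rds"` fallback are assumptions added so the sketch runs on arbitrary objects.

```r
# Hypothetical helper: NOT part of targets. It only mirrors the proposed rules.
choose_auto_format <- function(output,
                               trust_object_timestamps = TRUE,
                               fallback = "rds") {
  # Data frames go to the proposed parquet-backed format.
  if (is.data.frame(output)) {
    return("nanoparquet")
  }
  # Non-empty character vectors that all point at existing files are
  # treated as file targets. The guards are assumptions of this sketch:
  # file.exists() needs character input, and all() of an empty vector is TRUE.
  if (is.character(output) && length(output) > 0L && all(file.exists(output))) {
    return(if (trust_object_timestamps) "file_fast" else "file")
  }
  # Assumed fallback; the proposal text shown here does not name one.
  fallback
}

existing <- tempfile()
file.create(existing)

choose_auto_format(data.frame(x = 1))                         # "nanoparquet"
choose_auto_format(existing)                                  # "file_fast"
choose_auto_format(existing, trust_object_timestamps = FALSE) # "file"
choose_auto_format(1:10)                                      # "rds"
```

Note that the choice depends on the value a target returns, so the same target could switch formats between runs if its output type changes, which is the case raised earlier in the thread.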