`format = "auto"` #1311

hadley · 2024-07-29T22:11:26Z

Prework

I understand and agree to help guide.
I understand and agree to contributing guide.
New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.

Proposal

I propose that tar_target() takes a new type of format called auto, and if successful, that would become the default. auto would behave in the following way:

- If the output of the target is a data frame, use format = "nanoparquet", which would be a new format that uses nanoparquet to create parquet files. nanoparquet is a zero-dependency parquet reader/writer, so targets could take a hard dependency on this package, ensuring that it's available to all users.
- If the output is a character vector and all(file.exists(output)) is true, it would use format = "file_fast" (unless trust_object_timestamps is FALSE, in which case it would use "file").
- For all other types, "rds". (Unless you'd be willing to add qs as a hard dependency, in which case I'd argue that qs is basically uniformly superior to rds. Alternatively you could use qs if it's installed, but I don't know enough about the architecture of targets to fully understand the consequences of that.)

This steers the new user towards high-performance formats while allowing experienced users to continue to pick the best defaults for them.
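
As a rough illustration of the three rules proposed above, here is a minimal base R sketch. The helper name resolve_auto_format() is hypothetical and not part of targets, and the guard against zero-length character vectors is an added assumption (all() of an empty vector is TRUE, which would otherwise classify character(0) as a file).

```r
# Hypothetical helper: sketches the proposed "auto" rules. Not a targets API.
resolve_auto_format <- function(output, trust_object_timestamps = TRUE) {
  # Rule 1: data frames go to the proposed parquet-based format.
  if (is.data.frame(output)) {
    return("nanoparquet")
  }
  # Rule 2: character vectors whose elements all exist on disk are files.
  # The length check is an assumption added for this sketch.
  if (is.character(output) && length(output) > 0L &&
        all(file.exists(output))) {
    return(if (trust_object_timestamps) "file_fast" else "file")
  }
  # Rule 3: everything else falls back to the general-purpose format.
  "rds"
}

existing_path <- tempdir()   # exists for the lifetime of the R session
missing_path <- tempfile()   # a path name that has not been created

resolve_auto_format(data.frame(x = 1:3))
resolve_auto_format(existing_path)
resolve_auto_format(existing_path, trust_object_timestamps = FALSE)
resolve_auto_format(missing_path)
```

Note that the decision can only be made after the target has produced its output, since it depends on the class and contents of the return value.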

wlandau · 2024-07-30T17:24:20Z

> I propose that tar_target() takes a new type of format called auto, and if successful, that would become the default.

This looks mostly achievable. The best fit internally would be to treat format = "auto" as a placeholder for a lazily-resolved format to be determined after the target runs but before the output is saved/recorded. That way, all the complicated decision-making around repository = "local" vs repository = "aws" should just work. Right now, my only concern is about subsequent runs of the pipeline. _targets.R will have format = "auto", but the existing metadata will have format = "file" or some other fixed format. Then if the cue setting is tar_cue(format = TRUE) (default), the target will rerun because targets will think the format changed. Of course it is possible to ignore tar_cue(format = TRUE) for the special case of format = "auto", but this would slightly weaken reproducibility. I will need more time to think about this.

> format = "nanoparquet"

nanoparquet sounds like a great format to support natively. In the meantime, users can define it themselves with tar_format().
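
For readers who want to try this today, a user-defined parquet format might look roughly like the _targets.R fragment below. It is a sketch that assumes both targets and nanoparquet are installed; the names parquet_format and cars_summary are made up for illustration.

```r
# _targets.R (sketch): a user-defined parquet format via tar_format().
library(targets)

# The read and write functions should be self-contained,
# so nanoparquet is referenced with `::`.
parquet_format <- tar_format(
  read = function(path) {
    nanoparquet::read_parquet(path)
  },
  write = function(object, path) {
    nanoparquet::write_parquet(object, path)
  }
)

list(
  tar_target(
    cars_summary,
    data.frame(speed = mean(cars$speed), dist = mean(cars$dist)),
    format = parquet_format
  )
)
```

Data frames with list columns or custom attributes may not round-trip through parquet, so a format like this is best reserved for plain tabular targets.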

I thought about making qs the default format, but old toolchains like the one at my workplace seem to have trouble compiling it. Also, the maintainer is working on a rewrite: https://github.com/traversc/qs2

wlandau · 2024-07-31T20:02:58Z

> Right now, my only concern is about subsequent runs of the pipeline.

Currently, tar_cue(format = TRUE) says to invalidate the target if the format listed in _targets/meta/meta disagrees with the one in _targets.R. But maybe we can relax this: don't necessarily invalidate the target if _targets.R has format = "auto" but _targets/meta/meta has format %in% c("file", "file_fast", "nanoparquet", "qs"). I think this would work. _targets/meta/meta is read-only from a user perspective anyway. And if "auto" is changed to any other value in _targets.R, then the tar_cue(format = TRUE) rule will revert to its usual behavior.
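
The relaxed rule can be sketched in base R as follows. This only illustrates the comparison described above and is not how targets implements its cues; format_cue_triggers() and auto_formats are hypothetical names.

```r
# Formats that "auto" could resolve to, as listed in the comment above.
auto_formats <- c("file", "file_fast", "nanoparquet", "qs")

# Hypothetical helper: TRUE means the format cue would invalidate the target.
format_cue_triggers <- function(script_format, meta_format) {
  # Relaxation: "auto" in _targets.R is compatible with any resolved format
  # already recorded in the metadata.
  if (identical(script_format, "auto") && meta_format %in% auto_formats) {
    return(FALSE)
  }
  # Usual behavior: any disagreement invalidates the target.
  !identical(script_format, meta_format)
}

format_cue_triggers("auto", "qs")    # relaxed case, no rerun
format_cue_triggers("qs", "rds")     # usual case, rerun
format_cue_triggers("rds", "rds")    # nothing changed, no rerun
```

Under this sketch, moving from "auto" to any fixed format falls through to the usual comparison, matching the last sentence of the comment.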

So now I am totally on board with adding this as an optional feature. Before making it the default, I think I would prefer to wait until qs2 stabilizes.

hadley · 2024-07-31T20:10:45Z

Yeah, I was proposing you add it and then in a future release make it the default, if it turns out to work well for folks.

shikokuchuo · 2024-08-01T09:58:23Z

> Currently, tar_cue(format = TRUE) says to invalidate the target if the format listed in _targets/meta/meta disagrees with the one in _targets.R. But maybe we can relax this: don't necessarily invalidate the target if _targets.R has format = "auto" but _targets/meta/meta has format %in% c("file", "file_fast", "nanoparquet", "qs"). I think this would work. _targets/meta/meta is read-only from a user perspective anyway. And if "auto" is changed to any other value in _targets.R, then the tar_cue(format = TRUE) rule will revert to its usual behavior.
>
> So now I am totally on board with adding this as an optional feature. Before making it the default, I think I would prefer to wait until qs2 stabilizes.

Just need to be careful here and consider the case where a user starts with e.g. format = "file", then moves to format = "auto". If not re-run, the pipeline would differ from one run from scratch if 'auto' would have picked 'nanoparquet' etc.

wlandau · 2024-08-01T12:47:10Z

> Just need to be careful here and consider the case where a user starts with e.g. format = "file", then moves to format = "auto". If not re-run, the pipeline would differ from one run from scratch if 'auto' would have picked 'nanoparquet' etc.

That's another good case to think through. If "file" => "auto" is the only change, it seems fine to skip the target. If the target runs the same R code, then presumably "auto" would still choose "file" format anyway (or "file_fast"). The decision would be left to the other criteria in tar_cue().

targets should also treat "file" and "file_fast" as equivalent for this decision.

wlandau · 2024-08-01T18:04:55Z

Offline, @shikokuchuo pointed out reproducibility issues if "auto" in _targets.R becomes something different in _targets/meta/meta. An alternate route is to keep "auto" as the metadata entry but reconstruct the store object with the subclass that best matches the output. (EDIT: wouldn't work because then tar_read() wouldn't know how to read the object.)

wlandau · 2024-08-01T20:58:12Z

> @shikokuchuo pointed out reproducibility issues if "auto" in _targets.R becomes something different in _targets/meta/meta.

To clarify: suppose you save a data frame with format = "rds", then switch to format = "auto". At that point, targets would not necessarily rerun the pipeline or recompute the hash of the output, but a hypothetical rerun from scratch would save a nanoparquet file with a different hash. The old hash would be incorrect from a reproducibility standpoint.
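
A small base R experiment makes the hash mismatch concrete. It is only an analogy: CSV stands in for parquet so the snippet needs no extra packages, and tools::md5sum() stands in for whatever hashing targets uses internally.

```r
# The same data frame saved in two formats yields two different file hashes.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")  # stand-in for a parquet file

saveRDS(df, rds_path)
write.csv(df, csv_path, row.names = FALSE)

hashes <- unname(tools::md5sum(c(rds_path, csv_path)))
hashes_differ <- hashes[1] != hashes[2]
hashes_differ
```

So if the metadata keeps the hash of the old .rds file while a from-scratch run would write a different file type, the recorded hash no longer describes what the pipeline would produce.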

With "qs" vs "file", we don't have that problem because these formats are mutually exclusive. There, a format = "auto" would definitely work. And it is exactly where it would provide the most value. More than the other formats, I regularly hear how disruptive it is to have to manually write format = "file", e.g. https://fosstodon.org/@grrrck/112853425729179661.

So I will plan to write a format = "auto" which supports "qs", "file", and "file_fast". (And later, "qs2" instead of "qs".)

wlandau · 2024-08-01T21:03:47Z

As for nanoparquet, I got started on the implementation, and I am remembering my original reluctance to add more built-in storage formats for all but the most general cases. For anything more specific than "qs" or "qs2", it would be nice to handle this through tar_format()-powered wrappers in tarchetypes (possibly in community-driven PRs).

multimeric · 2024-08-05T01:07:44Z

Some thoughts as a targets user:

- I'm wary of making parquet the default format for data frames considering the common cases like list columns where parquet isn't compatible: https://www.tidyverse.org/blog/2024/06/nanoparquet-0-3-0/#limitations. Should the default format not be the most compatible?
- I'm also worried about the automatic file detection, and think this works better as an explicit thing. I'm currently working on a pipeline that has some targets that just return filesystem paths to facilitate downstream targets. They shouldn't be tracked as files because there are TBs of data in there that targets would hash, but format = "auto" seems like it would do this.
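
To illustrate the second concern: in base R, file.exists() is also TRUE for directories, so a target that merely returns the path of an existing directory would satisfy the proposed detection rule. The directory name below is made up for the demonstration.

```r
# A target that only returns a directory path for downstream use.
output_dir <- file.path(tempdir(), "auto-format-demo")
dir.create(output_dir, showWarnings = FALSE)

paths <- output_dir

# The check from the proposal: a character vector of existing paths.
looks_like_files <- is.character(paths) && all(file.exists(paths))
is_directory <- dir.exists(output_dir)

looks_like_files
is_directory
```

Both values are TRUE here, so an automatic rule based only on file.exists() cannot tell "a path I want tracked" from "a path I am just passing along".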

hadley · 2024-08-05T13:29:13Z

@multimeric is that a common case? I didn't think many people actually used list-columns, but if you'd find support for them to be helpful, it would be super useful if you could file an issue with a motivating use case so we can add support 😄

I don't think the default would hash those files because it would rely on file timestamps. How are you currently ensuring that future steps are run correctly if you're not using targets to manage the recomputation?

multimeric · 2024-08-06T01:50:15Z

I use list columns a lot, because I'm often working with wacky S4 objects etc. Other bioconductor users are likely going to be in the same situation. Even if list columns were implemented, surely it won't ever be possible to represent an arbitrary R data structure in parquet? For example, what about attributes? This is my concern about making parquet the default for data frames.

Sorry, you did propose file_fast which doesn't hash. My target just returns a file path that will be used by the downstream branches as a location to save their data. This target doesn't track changes to the paths at all, but the downstream ones that actually write to subdirectories do do that. Something like:

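library(targets)

list(
    # Root directories only; this target does not track their contents.
    tar_target(root, c("/filesystem_a", "/filesystem_b")),
    tar_target(input_data, 1:10),
    tar_target(
        save_data,
        {
            # Each branch writes into its own subdirectory of one root.
            dir_path <- file.path(root, tar_name())
            dir.create(dir_path, recursive = TRUE, showWarnings = FALSE)
            file_path <- file.path(dir_path, "file.rds")
            saveRDS(input_data, file_path)
            # Return the directory so that format = "file" tracks it.
            dir_path
        },
        pattern = cross(root, input_data),
        format = "file"
    )
)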

hadley · 2024-08-06T12:14:36Z

@multimeric you are right that parquet will never store that sort of data. I think it's generally rare across the spectrum of R users, but format = "auto" could also use nanoparquet only for data frames that don't contain list columns.
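
A check along those lines could be sketched in base R as below. The helper names and the "rds" fallback are assumptions for illustration; the sketch also ignores custom attributes, which were raised earlier as a separate concern.

```r
# Hypothetical helper: TRUE if any column of the data frame is a list.
has_list_columns <- function(df) {
  any(vapply(df, is.list, logical(1)))
}

# Hypothetical helper: pick parquet only for flat data frames.
choose_data_frame_format <- function(df, fallback = "rds") {
  if (has_list_columns(df)) fallback else "nanoparquet"
}

flat <- data.frame(id = 1:2, label = c("a", "b"))

nested <- data.frame(id = 1:2)
nested$payload <- list(1:3, letters)  # a list column

choose_data_frame_format(flat)
choose_data_frame_format(nested)
```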

MilesMcBain · 2024-08-12T10:09:57Z

This feels like a similar design point to something like the {ggplot2} 'bad defaults' idea where you have certain things like histogram bin width etc where there is a bad but serviceable default that will pretty much do an okay job, but is also a nudge toward learning customisation options on offer.

So I think RDS fits the bill in this regard in that it mostly 'just works', but crappily enough that it gets users looking for better alternatives or rolling their own format.

As a heavy targets user I almost never use tar_target, always preferring tar_fst or tar_parquet from {tarchetypes} or my own custom formats, so this actually wouldn't affect me that much. But I do wonder if it is a more fitting candidate for implementation as tarchetypes::tar_auto or similar, since you get to opt in to the convenience (and potential edge case issues), and it keeps the core engine implementation free of special cases.

wlandau · 2024-08-12T22:52:24Z

On the plane ride to Posit Conf 2024, I implemented a rendition of format = "auto" based on "file" and "qs" (leaving out "nanoparquet" because of the issues I mentioned above and "file_fast" because some file systems have extremely imprecise time stamps).

wlandau · 2024-08-12T22:53:37Z

When qs2 is stable, "auto" will use "qs2" instead of "qs".

wlandau · 2024-09-25T22:27:19Z

To follow up, ropensci/tarchetypes#197 implements the nanoparquet storage format we discussed.

This is the first general-ish storage format I have considered in a long time, maybe even the first since I introduced tar_format() for user-defined formats. Going forward, my plan is to keep legacy formats like "keras" and "fst" in targets itself, but delegate new formats to tarchetypes and implement them with tar_format(). I think this will help manage scope creep. I may make an exception for qs2 because it is so general, but I haven't decided.

hadley added the type: new feature label Jul 29, 2024

hadley assigned wlandau Jul 29, 2024

wlandau mentioned this issue Aug 1, 2024

Proposal: read with the same class that the object had on save r-lib/nanoparquet#82

Open

wlandau-lilly closed this as completed in 5878741 Aug 12, 2024

This was referenced Aug 16, 2024

Automatic detection of the low-level file system #1315

Closed

nanoparquet format powered by tar_format() ropensci/tarchetypes#190

Closed

wlandau-lilly pushed a commit that referenced this issue Sep 9, 2024

start working on #1311

4ef51fa

wlandau-lilly pushed a commit that referenced this issue Sep 9, 2024

Fix #1311

f90c1cd

Aariq mentioned this issue Oct 2, 2024

Is format = "auto" possible for geotargets? njtierney/geotargets#100

Closed
