Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

format = "auto" #1311

Closed
3 tasks done
hadley opened this issue Jul 29, 2024 · 16 comments
Closed
3 tasks done

format = "auto" #1311

hadley opened this issue Jul 29, 2024 · 16 comments
Assignees

Comments

@hadley
Copy link
Member

hadley commented Jul 29, 2024

Prework

  • I understand and agree to help guide.
  • I understand and agree to contributing guide.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.

Proposal

I propose that tar_target() takes a new type of format called auto, and if successful, that would become the default. auto would behave in the following way:

  • If the output of the target is a data frame, use format = "nanoparquet", which would be a new format that uses nanoparquet to create parquet files. nanoparquet is a zero-dependency parquet reader/writer so targets could take a hard dependency on this package ensuring that it's available to all users.
  • If the output is a character vector and all(file.exists(output)) is true, it would use format = "file_fast" (unless trust_object_timestamps is FALSE, in which case it would use "file").
  • For all other types "rds". (Unless you'd be willing to add qs as a hard dependency, in which case I'd argue that qs is basically uniformly superior to rds. Alternatively you could use qs if it's installed, but I don't know enough about the architecture of targets to fully understand the consequences of that).

This steers the new user towards high-performance formats while allowing experienced users to continue to pick the best defaults for them.

@wlandau
Copy link
Member

wlandau commented Jul 30, 2024

I propose that tar_target() takes a new type of format called auto, and if successful, that would become the default.

This looks mostly achievable. The best fit internally would be to treat format = "auto" as a placeholder for a lazily-resolved format to be determined after the target runs but before the output is saved/recorded. That way, all the complicated decision-making around repository = "local" vs repository = "aws" should just work. Right now, my only concern is about subsequent runs of the pipeline. _targets.R will have format = "auto", but the existing metadata will have format = "file" or some other fixed format. Then if the cue setting is tar_cue(format = TRUE) (default), the target will rerun because targets will think the format changed. Of course it is possible to ignore tar_cue(format = TRUE) for the special case of format = "auto", but this would slightly weaken reproducibility. I will need more time to think about this.

format = "nanoparquet"

nanoparquet sounds like a great format to support natively. In the meantime, users can define it themselves with tar_format().

I thought about making qs the default format, but old toolchains like the one at my workplace seem to have trouble compiling it. Also, the maintainer is working on a rewrite: https://github.com/traversc/qs2

@wlandau
Copy link
Member

wlandau commented Jul 31, 2024

Right now, my only concern is about subsequent runs of the pipeline.

Currently, tar_cue(format = TRUE) says to invalidate the target if the format listed in _targets/meta/meta disagrees with the one in _targets.R. But maybe we can relax this: don't necessarily invalidate the target if _targets.R has format = "auto" but _targets/meta/meta has format %in% c("file", "file_fast", "nanoparquet", "qs")). I think this would work. _targets/meta/meta is read-only from a user perspective anyway. And if "auto" is changed to any other value in _targets.R, then the tar_cue(format = TRUE) rule will revert to its usual behavior.

So now I am totally on board with adding this as an optional feature. Before making it the default, I think I would prefer to wait until qs2 stabilizes.

@hadley
Copy link
Member Author

hadley commented Jul 31, 2024

Yeah, I was proposing you add it and then in a future release make it the default, if it turns out to work well for folks.

@shikokuchuo
Copy link
Contributor

Currently, tar_cue(format = TRUE) says to invalidate the target if the format listed in _targets/meta/meta disagrees with the one in _targets.R. But maybe we can relax this: don't necessarily invalidate the target if _targets.R has format = "auto" but _targets/meta/meta has format %in% c("file", "file_fast", "nanoparquet", "qs")). I think this would work. _targets/meta/meta is read-only from a user perspective anyway. And if "auto" is changed to any other value in _targets.R, then the tar_cue(format = TRUE) rule will revert to its usual behavior.

So now I am totally on board with adding this as an optional feature. Before making it the default, I think I would prefer to wait until qs2 stabilizes.

Just need to be careful here and consider the case where a user starts with e.g. format = "file", then moves to format = "auto". If not re-run, the pipeline would differ from one run from scratch if 'auto' would have picked 'nanoparquet' etc.

@wlandau
Copy link
Member

wlandau commented Aug 1, 2024

Just need to be careful here and consider the case where a user starts with e.g. format = "file", then moves to format = "auto". If not re-run, the pipeline would differ from one run from scratch if 'auto' would have picked 'nanoparquet' etc.

That's another good case to think through. If "file" => "auto" is the only change, it seems fine to skip the target. If the target runs the same R code, then presumably "auto" would still choose "file" format anyway (or "file_fast"). The decision would be left to the other criteria in tar_cue().

targets should also treat "file" and "file_fast" as equivalent for this decision.

@wlandau
Copy link
Member

wlandau commented Aug 1, 2024

Offline, @shikokuchuo pointed out reproducibility issues if "auto" in _targets.R becomes something different in _targets/meta/meta. An alternate route is to keep "auto" as the metadata entry but reconstruct the store object with the subclass that best matches the output. (EDIT: wouldn't work because then tar_read() wouldn't know how to read the object.)

@wlandau
Copy link
Member

wlandau commented Aug 1, 2024

@shikokuchuo pointed out reproducibility issues if "auto" in _targets.R becomes something different in _targets/meta/meta.

To clarify: suppose you save a data frame with format = "rds", then switch to format = "auto". At that point, targets would not necessarily rerun the pipeline or recompute the hash of the output, but a hypothetical rerun from scratch would save a nanoparquet file with a different hash. The old hash would be incorrect from a reproducibility standpoint.

With "qs" vs "file", we don't have that problem because these formats are mutually exclusive. There, a format = "auto" would definitely work. And it is exactly where it would provide the most value. More than the other formats, I regularly hear how disruptive it is have to manually write format = "file", e.g. https://fosstodon.org/@grrrck/112853425729179661.

So I will plan to write a format = "auto" which supports "qs", "file", and "file_fast". (And later, "qs2" instead of "qs".)

@wlandau
Copy link
Member

wlandau commented Aug 1, 2024

As for nanoarrow, I got started on the implementation, and I am remembering my original reluctance to add more built-in storage formats for all but the most general cases. For anything more specific than "qs" or "qs2", it would be nice to handle this through tar_format()-powered wrappers in tarchetypes (possibly in community-driven PRs).

@multimeric
Copy link

Some thoughts as a targets user:

  • I'm wary of making parquet the default format for data frames considering the common cases like list columns where parquet isn't compatible: https://www.tidyverse.org/blog/2024/06/nanoparquet-0-3-0/#limitations. Should the default format not be the most compatible?
  • I'm also worried about the automatic file detection, and think this works better as an explicit thing. I'm currently working on a pipeline that has some targets that just return filesystem paths to facilitate downstream targets. They shouldn't be tracked as files because there are TBs of data in there that targets would hash, but format = "auto" seems like it would do this.

@hadley
Copy link
Member Author

hadley commented Aug 5, 2024

@multimeric is that a common case? I didn't think many people actually used list-columns, but if you'd find support for them to be helpful, it would be super useful if you could file an issue with a motivating use case so we can add support 😄

I don't think the default would hash those files because it would rely on file stamps. How are you currently ensuring that future steps are run correctly if you're not using targets to manage the recomputation?

@multimeric
Copy link

I use list columns a lot, because I'm often working with wacky S4 objects etc. Other bioconductor users are likely going to be in the same situation. Even if list columns were implemented, surely it won't ever be possible to represent an arbitrary R data structure in parquet? For example, what about attributes? This is my concern about making parquet the default for data frames.

Sorry, you did propose file_fast which doesn't hash. My target just returns a file path that will be used by the downstream branches as a location to save their data. This target doesn't track changes to the paths at all, but the downstream ones that actually write to subdirectories do do that. Something like:

list(
    tar_target(root, c( "/filesystem_a", "/filesystem_b" )),
    tar_target(input_data, 1:10),
    tar_target(
        save_data, 
        {
            dir_path <- file.path(root, tar_name())
            dir.create(dir_path)
            file_path <- file.path(dir_path, "file.rds")
            saveRDS(input_data, file_path)
            dir_path
        }, 
        pattern = cross(root, input_data),
        format = "file"
    )
)

@hadley
Copy link
Member Author

hadley commented Aug 6, 2024

@multimeric you are right that parquet will never store that sort of data. I think it's generally rare in the spectrum of R user, but format = "auto" could also use nanoparquet only for data frames that don't contain list columns.

@MilesMcBain
Copy link

This feels like a similar design point to something like the {ggplot2} 'bad defaults' idea where you have certain things like histogram bin width etc where there is a bad but serviceable default that will pretty much do an okay job, but is also a nudge toward learning customisation options on offer.

So I think RDS fits the bill in this regard in that it mostly 'just works', but crappily enough that it gets users looking for better alternatives or rolling their own format.

As a heavy targets user I almost never use tar_target always preferring the tar_fst or tar_parquet from {tarchetypes} or my own custom formats, so this actually wouldn't affect me that much, but I do wonder if it is a more fitting candidate for implementation as tarchetypes::tar_auto or similar, since you get to opt in to the convenience (and potential edge case issues), and it keeps the core engine implementation free of special cases.

@wlandau
Copy link
Member

wlandau commented Aug 12, 2024

On the plane ride to Posit Conf 2024, I implemented a rendition of format = “auto” based on “file” and “qs” (leaving out “nanoparquet” because of the issues I mentioned above and “file_fast” because some file systems have extremely imprecise time stamps.)

@wlandau
Copy link
Member

wlandau commented Aug 12, 2024

when qs2 is stable, “auto” will use “qs2” instead of “qs”

@wlandau
Copy link
Member

wlandau commented Sep 25, 2024

To follow up, ropensci/tarchetypes#197 Implements the nanoparquet storage format we discussed.

This is the first general-ish storage format I have considered in a long time, maybe even the first since I introduced tar_format() for user-defined formats. Going forward, my plan is to keep legacy formats like “keras” and “fst” in targets itself, but delegate new formats to tarchetypes and implement them with tar_format(). I think this will help manage scope creep. I may make an exception for qs2 because it is so general, but I haven’t decided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants