I am aware of an interesting issue that may affect other (especially cloud-based) WDL execution engines. Without going into too much detail: under some extreme scenarios involving both huge numbers of inputs and nested scatters, the number of inputs/outputs (e.g. from the collect step of a scatter) can be large enough to exceed the guardrails of the database used to maintain job state. For example, a workflow that takes thousands of samples as input, performs an all-vs-all comparison of their variants, and generates O(n^2) outputs. I will not comment on whether I think such a workflow is a good idea - I only know that it happens in the wild.
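For illustration only, here is a minimal sketch of the workflow shape being described (the task and tool names are made up); the collected output of the nested scatter grows quadratically with the number of samples:

```wdl
version 1.1

# Hypothetical comparison task; "compare-tool" stands in for whatever
# pairwise variant comparison is actually being run.
task compare_variants {
  input {
    File left
    File right
  }
  command <<<
    compare-tool ~{left} ~{right} > comparison.tsv
  >>>
  output {
    File comparison = "comparison.tsv"
  }
}

workflow all_vs_all {
  input {
    Array[File] sample_vcfs  # thousands of per-sample variant files
  }

  # Nested scatter: the collected output below is an Array[Array[File]]
  # holding O(n^2) file references, which is what strains the engine's
  # job-state database.
  scatter (a in sample_vcfs) {
    scatter (b in sample_vcfs) {
      call compare_variants { input: left = a, right = b }
    }
  }

  output {
    Array[Array[File]] comparisons = compare_variants.comparison
  }
}
```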
One idea under consideration for how to deal with this scenario is to create a new archive format that packages up the contents of a complex value containing nested files (e.g. `Array[File]` or `Map[String, Pair[File, File]]`) as a self-describing archive. The archive contains a manifest with the serialized form of the actual type and value, along with all of the files referenced in the value. The actual implementation of this archive format will use SquashFS.
There are two viable ways to use these archive files - one that is transparent to the user (i.e. handled at the runtime-engine level), and one that is explicit, either by adding functions to the standard library or by doing the archiving in a task.
In the implicit solution, there is a runtime flag that says "automatically convert all my complex outputs to archives, and recognize when I'm trying to pass an archive as an input to a parameter with a complex type and automatically unpack it". In practice, this can be very complex to implement and results in a lot of limitations.
In the explicit solution, we either 1) add functions to the standard library, 2) support UDFs so these functions can be provided without needing to change the spec, or 3) implement the archive and unarchive processes in a task. My plan is to go with option 3 unless the community thinks that having these functions would be generally useful. I have already written up a specification for the archive format and will be happy to share it if the community thinks this is a useful feature.
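To make option 3 a little more concrete, here is a rough, illustrative sketch of what a task-based archive step could look like for an `Array[File]`. It is only a sketch under my own assumptions: the manifest layout, the use of `write_json`, and the container image are placeholders, not the archive specification mentioned above.

```wdl
version 1.1

# Illustrative only: packs an Array[File] plus a small manifest into a
# single SquashFS image. The real manifest/archive layout is defined by
# the separate specification, not by this sketch.
task archive_files {
  input {
    Array[File] files
  }
  command <<<
    set -euo pipefail
    mkdir archive
    # Record what was packed; write_json is standard WDL, but using it
    # as the manifest format here is an assumption.
    cp ~{write_json(files)} archive/manifest.json
    i=0
    for f in ~{sep=" " files}; do
      # Prefix with an index to avoid basename collisions.
      cp "$f" "archive/$i-$(basename "$f")"
      i=$((i + 1))
    done
    mksquashfs archive value.sqfs -noappend
  >>>
  output {
    File packed = "value.sqfs"
  }
  runtime {
    # Hypothetical image name; any image with squashfs-tools installed works.
    docker: "example.org/squashfs-tools:latest"
  }
}
```

An unarchive counterpart would do the reverse: unsquash the image, read the manifest, and re-emit the member files as task outputs.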
If the standard-library route were taken instead, the two functions would be:

- `File archive(Any)`: Convert any value to an archive file.
- `Any unarchive(File)`: Unpack an archive file and convert it back to its original value. The return value can be assigned to a variable of any type - i.e. static type checking will always succeed - but an error is thrown at runtime if the unarchived value is assigned to a variable whose type does not match (is not coercible from) the original type.
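Purely for illustration, usage of these (currently hypothetical) functions might look like the following:

```wdl
version 1.1

workflow archive_usage {
  input {
    # e.g. the collected output of a nested scatter.
    Array[Array[File]] comparisons
  }

  # Hypothetical: collapse the whole value (and every file it references)
  # into a single archive file, so the engine tracks one output instead
  # of O(n^2) of them.
  File packed = archive(comparisons)

  # Hypothetical: restore the original value. This declaration always
  # type-checks statically; a runtime error is raised only if the
  # declared type is not coercible from the archived value's type.
  Array[Array[File]] restored = unarchive(packed)

  output {
    File packed_comparisons = packed
    Int n_groups = length(restored)
  }
}
```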