I am aware of an interesting issue that may affect other (especially cloud-based) WDL execution engines. Without going into too much detail: under some extreme scenarios involving both huge numbers of inputs and nested scatters, the number of inputs/outputs (e.g. from the collect step of a scatter) can be large enough to exceed the guardrails of the database used to maintain job state. For example, a workflow that takes thousands of samples as input, performs an all-vs-all comparison of their variants, and generates O(n^2) outputs. I will not comment on whether I think such a workflow is a good idea - I only know that it happens in the wild.
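For illustration only, here is a minimal sketch of the workflow shape being described (the task and tool names are made up); the collected output of the nested scatter grows quadratically with the number of samples:

```wdl
version 1.1

# Hypothetical comparison task; "compare-tool" stands in for whatever
# pairwise variant comparison is actually being run.
task compare_variants {
  input {
    File left
    File right
  }
  command <<<
    compare-tool ~{left} ~{right} > comparison.tsv
  >>>
  output {
    File comparison = "comparison.tsv"
  }
}

workflow all_vs_all {
  input {
    Array[File] sample_vcfs  # thousands of per-sample variant files
  }

  # Nested scatter: the collected output below is an Array[Array[File]]
  # holding O(n^2) file references, which is what strains the engine's
  # job-state database.
  scatter (a in sample_vcfs) {
    scatter (b in sample_vcfs) {
      call compare_variants { input: left = a, right = b }
    }
  }

  output {
    Array[Array[File]] comparisons = compare_variants.comparison
  }
}
```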
One idea under consideration for how to deal with this scenario is to create a new archive format that packages up the contents of a complex value containing nested files (e.g. `Array[File]` or `Map[String, Pair[File, File]]`) as a self-describing archive. The archive contains a manifest with the serialized form of the actual type and value, along with all of the files referenced in the value. The actual implementation of this archive format will use SquashFS.
There are two viable ways to use these archive files - one that is transparent to the user (i.e. handled at the runtime-engine level), and one that is explicit, either by adding functions to the standard library or by doing the archiving in a task.
In the implicit solution, there is a runtime flag that says "automatically convert all my complex outputs to archives, and recognize when I'm trying to pass an archive as an input to a parameter with a complex type and automatically unpack it". In practice, this can be very complex to implement and results in a lot of limitations.
In the explicit solution, we either 1) add functions to the standard library, 2) support UDFs so these functions can be provided without needing to change the spec, or 3) implement the archive and unarchive processes in a task. My plan is to go with option 3 unless the community thinks that having these functions would be generally useful. I have already written up a specification for the archive format and will be happy to share it if the community thinks this is a useful feature.
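To make option 3 a little more concrete, here is a rough, illustrative sketch of what a task-based archive step could look like for an `Array[File]`. It is only a sketch under my own assumptions: the manifest layout, the use of `write_json`, and the container image are placeholders, not the archive specification mentioned above.

```wdl
version 1.1

# Illustrative only: packs an Array[File] plus a small manifest into a
# single SquashFS image. The real manifest/archive layout is defined by
# the separate specification, not by this sketch.
task archive_files {
  input {
    Array[File] files
  }
  command <<<
    set -euo pipefail
    mkdir archive
    # Record what was packed; write_json is standard WDL, but using it
    # as the manifest format here is an assumption.
    cp ~{write_json(files)} archive/manifest.json
    i=0
    for f in ~{sep=" " files}; do
      # Prefix with an index to avoid basename collisions.
      cp "$f" "archive/$i-$(basename "$f")"
      i=$((i + 1))
    done
    mksquashfs archive value.sqfs -noappend
  >>>
  output {
    File packed = "value.sqfs"
  }
  runtime {
    # Hypothetical image name; any image with squashfs-tools installed works.
    docker: "example.org/squashfs-tools:latest"
  }
}
```

An unarchive counterpart would do the reverse: unsquash the image, read the manifest, and re-emit the member files as task outputs.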
If the standard-library route were taken instead, the two functions would be:

- `File archive(Any)`: Convert any value to an archive file.
- `Any unarchive(File)`: Unpack an archive file and convert it back to its original value. The return value can be assigned to a variable of any type - i.e. static type checking will always succeed - but an error is thrown at runtime if the unarchived value is assigned to a variable whose type does not match (is not coercible from) the original type.
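Purely for illustration, usage of these (currently hypothetical) functions might look like the following:

```wdl
version 1.1

workflow archive_usage {
  input {
    # e.g. the collected output of a nested scatter.
    Array[Array[File]] comparisons
  }

  # Hypothetical: collapse the whole value (and every file it references)
  # into a single archive file, so the engine tracks one output instead
  # of O(n^2) of them.
  File packed = archive(comparisons)

  # Hypothetical: restore the original value. This declaration always
  # type-checks statically; a runtime error is raised only if the
  # declared type is not coercible from the archived value's type.
  Array[Array[File]] restored = unarchive(packed)

  output {
    File packed_comparisons = packed
    Int n_groups = length(restored)
  }
}
```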