\documentclass[conference]{IEEEtran}
\usepackage{cite}
\usepackage{amsmath}
\usepackage{framed}
\usepackage{listings}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{cleveref}
\begin{document}
\title{An Algebra for Robust Workflow Transformations}
\author{
\IEEEauthorblockN{Nicholas Hazekamp}
\IEEEauthorblockA{University of Notre Dame \\
Notre Dame, Indiana 46556 \\
nhazekam@nd.edu}
\and
\IEEEauthorblockN{Douglas Thain}
\IEEEauthorblockA{University of Notre Dame \\
Notre Dame, Indiana 46556 \\
dthain@nd.edu}
}
\date{18 June 2018}
\maketitle
\begin{abstract}
Scientific workflows
are often designed with a
particular compute site in mind.
As a user changes sites
the workflow needs to adjust.
These changes include
moving from a cluster to a cloud,
updating an operating system,
or investigating failures on a new cluster.
As a workflow is moved,
its tasks do not fundamentally change,
but the steps to configure,
execute, and evaluate tasks differ.
When handling these changes it may be necessary
to use a script to analyze execution failure or
run a container to use the correct operating system.
To improve workflow portability and robustness,
it is necessary to have
a rigorous method that allows transformations on a workflow.
These transformations do not change the tasks,
only the way tasks are \emph{invoked}.
Using technologies such as containers, resource managers, and scripts
to transform workflows allows for portability,
but combining these technologies
can lead to complications with execution and error
handling.
We define an algebra to reason about task transformations
at the workflow level and express it
in a declarative form using JSON.
We implemented this algebra in the
Makeflow workflow system and demonstrate
how transformations can be used for
resource monitoring, failure analysis, and software deployment across three sites.
\end{abstract}
\section{Introduction}
% Context: Scientific workflows (give examples)
% Goal: Run on different sites, configurations. (give examples)
% Problem: Modifying workflows is hard to get right. (give examples)
% Solution: An algebra for workflow transformations.
% Implementation: How did you build it?
% Evaluation: Summarize case studies.
Scientific workflows define a set of tasks
and their interdependencies to provide
performance, reproducibility, and portability.
Workflows are used every day in
bioinformatics\cite{pmid20080505, pmid2231712, makeflow-examples, giardine2005galaxy, blankenberg2010galaxy, goecks2010galaxy},
high energy physics\cite{10.1007/978-3-540-24669-5_107, lobster-cluster-2015},
astronomy\cite{10.1007/978-3-540-28642-4_2},
and many other domains.
Workflow management systems provide support for
expressing the required resources,
environment, and configuration for each task.
Correctly specified workflows are explicit
about required setup and environments.
Often these workflows are designed for a
specific site, failing when
moved to different sites.
Differences between execution sites
make porting workflows hard
and debugging complex.
It is common for workflows to
assume libraries and programs are available,
use applications configured for only one operating system,
or rely on unspecified configurations,
all of which fail on different sites.
Accommodating each site's configuration
requires a number of unique transformations for the workflow to execute properly.
The tasks themselves do not change,
but the environment, error handling, and configuration
may.
A typical use case is the
need to deploy the same operating system and
software stack on several available
compute sites.
Unfortunately, each site may have a unique
operating system or lack the necessary software.
Users need a way to quickly switch between
each site, but do not want to rewrite the
workflow for each one.
The simple answer is to use containers,
but how do we easily apply these containers
to tasks?
This is further complicated when each site
may use different container technologies
(i.e. Docker\cite{Merkel:2014:DLL:2600239.2600241}
and Singularity\cite{Singularity}).
The ability to combine available tools is required
to handle unique configurations and environments.
Unfortunately,
no single tool can address these changes;
multiple tools must be used in conjunction.
As the number and variety of tools increases,
the complexity of combining them increases as well.
For example, if
Singularity and a custom script
are both applied to a simple task by prepending their
commands, characteristics of execution like
exit status, provenance of files, and
the final executed command become opaque.
Properly nesting the container inside of a
script allows for differentiating failures,
debugging, and consistent execution.
Each additional layer requires a more nuanced transformation
as it becomes necessary to nest technologies
such as containers,
resource monitoring,
and error handling.
Different combinations of tools are required
depending on the site's unique configuration.
Because the required tools vary by site,
tools should be applied to a workflow only as needed,
rather than written into the workflow specification at each site.
We define an algebra for workflow transformations
to address the complexity of
nesting different tools and technologies.
Based on the sandbox model of execution,
this algebra formalizes the operations
for applying transformations to tasks
which produce new tasks.
These transformations can then be applied
in series to produce a task that incorporates
all applied transformations.
Using formalized task transformations,
we can precisely apply multiple transformations to a workflow
and map them cleanly onto each task.
We express this algebra in JSON
so that it is independent of
(and therefore portable to)
a variety of systems.
Using this JSON expression, we wrote a driver
in Makeflow\cite{makeflow-sweet12} that applies transformations
to a full workflow.
We discuss the challenges
in applying transformations and
how these methods can
be applied incorrectly or incompletely.
To demonstrate the efficacy of this solution
we present several case studies.
The first uses a Singularity container
to provide consistent environments,
a resource monitor to give accurate usage statistics, and
a sandbox to isolate the available files and workspace.
The second shows a failure handler that captures a
core-dump and converts it into a stack trace,
streamlining analysis and lowering data transfer.
The final case study executes the same workflow on
several sites using an environment builder
that dynamically builds required software at each task.
\section{Background and Challenges}
% What is a workflow.
Scientific workflows are a widely used means of organizing
a large amount of computational work. A workflow consists
of a large number of tasks, typically organized in a graph structure
such that the outputs of some tasks can be used as the inputs
of other tasks. Each task is a unit of work that can be dispatched
to a batch system or cloud facility and can range in scale anywhere
from a brief function call lasting a few seconds to a large parallel
application running on hundreds of nodes for several hours.
Examples of widely used workflow management systems include
Pegasus~\cite{pegasus},
Kepler~\cite{doi:10.1002-cpe.94},
Swift~\cite{swift},
Makeflow~\cite{makeflow-sweet12}, and many others. Figure~\ref{fig:workflow} shows an example of a typical
workflow structure.
\begin{figure}[t]
\includegraphics[width=\columnwidth]{graphics/example_workflow.pdf}
\caption{Example Workflow.}
\small
\emph{This workflow shows
standard split-join behavior.
Each circle is a task
and each square is a file.
The first task partitions the data,
the next set of tasks analyze the individual partitions,
and the last task aggregates them.
Each task executes independently
and tasks are often run on batch execution systems.}
\label{fig:workflow}
\end{figure}
% Primary goal: run the app, but multiple secondary goals of tailoring, monitoring, debugging
A workflow primarily describes the researcher's work
to run a set of simulations,
to analyze a dataset,
to produce a visualization, etc.
However, like any kind of program, there may be a number of secondary
requirements that must be met to complete the work:
a particular software environment should be constructed,
resource controls for the batch system should be selected,
monitoring and debugging tools should be applied to the task,
and so forth. This might involve
setting environment variables,
providing additional inputs, capturing additional outputs,
invoking helper processes, and more.
% These actions require modifying the workflow, but can get confused with the primary task and make it difficult to port.
The first version of a workflow, constructed at a particular computing site,
may have all of these aspects intertwined with the definition of the
tasks to be done.
The application may depend upon
software environments installed in fixed paths in a shared filesystem.
Environment controls may be set within individual tasks.
Resources may be
hard-coded for a particular batch system.
The graph structure may reflect
the current set of debugging tools enabled.
While this may work well at the
first site, it may become necessary to move the workflow to another
site in order to improve performance, increase scale, or to apply the workflow in a new context.
All these site-specific controls are unlikely to work in the new context,
and the receiving user is then stuck with the problem of disentangling the
core code from the local peculiarities.
% Idea: wrappers
An appealing approach to this problem is to define
simple modifications that can be individually applied to tasks
(transformations) in order
to achieve specific local effects. For example, one might have a transformation
to run a task in a container environment, another transformation to perform monitoring
and troubleshooting, and a final transformation to configure a software environment
for the local site. With this approach, the scientific objective of the workflow
can be expressed in a portable way.
A set of external transformations are used to
modify the tasks as needed for the local site. Porting a workflow from one site
to another becomes the simple job of adjusting a few transformations rather than
rewriting the workflow from scratch.
If it is necessary to transform
the workflow in a new way,
a transformation can be written, shared,
and applied to many workflows.
% Challenges of wrappers
However, our experience is that designing and using transformations
is not so easily done. What may seem like a simple and obvious
transformation can end up creating complex interactions and
incorrect results. As a simple example, suppose that
we want to run each task inside a Singularity
container named {\tt centos.img}.
At first, this sounds
as simple as prepending {\tt singularity run centos.img} to each
command string then running the task. While this
works in limited cases, the general case for workflows
with complex task definitions fails. There are several reasons for this:
%\begin{itemize}
%\item
{\bf Substitution semantics.}
Using basic string substitution to embed one command
inside another often complicates execution.
Commands that use
input/output redirection,
consume files,
or change the environment
collide when combined through basic substitution.
Addressing this uncertainty with shell quoting only
further complicates the matter and may
change the execution.
%\item
{\bf Workflow modifications.}
Applying a transformation to a task not
only changes the individual task,
but may also have an effect on the
global structure of the workflow.
A command transformed by a container
now has an additional input (i.e., {\tt centos.img})
which must be accounted for as a
dependency in the workflow.
Container images are large and affect the
scheduling and resource management of the workflow.
In a similar way, the container produces additional
outputs which must be collected and managed by the
workflow.
%\item
{\bf Namespace conflicts.}
Transformations can modify
the local filesystem namespace.
Log files with fixed names,
temporary generated files,
files based on task input files,
or modifications to the working directory
all alter the task and the workflow.
These actions blindly modify files
outside the workflow or cause
race conditions with other concurrent transformations.
Since it is not always possible to alter these
hard-coded paths into unique filenames,
collisions are inevitable.
%\item
{\bf Troubleshooting complications.}
The exit semantics of a transformed task are complex
as it is not sufficient for a transformation to simply
return the task's integer exit status.
Each exit status should be differentiated,
as transformations may fail separately.
For example, preparing the environment may fail because a
necessary software dependency is not present,
a container may fail when pulling the container image over the network, or
a resource monitor may exit when resources are exhausted.
In each of these cases, we must have a means of distinguishing
between \emph{transformation failure} and \emph{task failure}.
When multiple transformations are applied, the result of the
task looks more like a stack trace than a single integer.
%\end{itemize}
% Concl: We need an algebra
To address these challenges, we need a more
rigorous way of defining tasks and the transformations
on those tasks such that any valid transformation applied
to any valid task gives the expected result in a
way that can be nested. In short, we need an algebra
of workflow transformations in order to make scientific
workflows more robust, portable, and usable.
%\input{outtake_background.tex}
\section{An Algebra of Workflow Transformations}
We designed a formal abstraction
to accommodate the execution behavior
of various tools.
This formalism isolates each transformation
for consistent execution,
allowing for organized nesting.
In particular, our abstraction describes how to define
a transformation for a given tool,
as well as aspects of execution to consider.
Transformations are based on the
sandbox model of execution,
which describes all aspects of execution
for which a transformation is responsible.
\subsection{Notation}
For the purpose of expressing tasks and transformations in a
precise way, we use a notation that is based on
JavaScript Object Notation (JSON).
In addition to the standard JSON elements of
atomic values (\verb$true$, \verb$123$, \verb$"hello"$), \emph{dictionaries} \verb${ name: value }$, and \emph{lists} \verb$[ 10, 20, ... ]$,
we add:
\begin{itemize}
\item {\tt let X = Y} is used to bind the name {\tt X} to the value {\tt Y}.
\item {\tt define F(X) = Y} defines a function {\tt F} that will evaluate to the value {\tt Y} using the bound variable {\tt X}.
\item Simple expressions can be built up using standard arithmetic operators and function calls on values and bound variables.
\item {\tt eval X} evaluates the expression X and returns its value.
\end{itemize}
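As a brief illustration of this notation (the names here are arbitrary):
\begin{framed}
\begin{verbatim}
let X = 10
define Double(Y) = Y * 2
eval Double(X) yields 20
\end{verbatim}
\end{framed}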
Using this notation, a single task ($T1$) in a workflow
is expressed in JSON like this:
\begin{figure}[H]
\begin{framed}
%\small
\begin{verbatim}
let T1 = {
"command": {
"pre":[ ],
"cmd": "sim.exe < in.txt > out.txt",
"post":[ ],
},
"inputs" : [ "sim.exe", "in.txt" ],
"outputs" : [ "out.txt" ],
"environment": {},
"resources" :
{"cores":1, "memory":1G, "disk":10G }
}
\end{verbatim}
\end{framed}
\label{basic-task}
\end{figure}
Note that the schema is fixed.
Every task consists of a command with a \verb$pre$, \verb$cmd$, and \verb$post$ component, a list of input files, a list of output files,
a dictionary of environment variables,
and a dictionary of necessary resources (all defined in detail later).
Importantly, the formal list of
inputs and outputs is distinct from the command-line to be executed,
as guessing the precise set of files needed from
an arbitrary command-line is difficult. For example,
a program might implicitly require a calibration file {\tt calib.dat}
and yet not mention that on the command line.
The base task's list of inputs
and outputs is drawn from the structure of the DAG by the workflow
manager.
\subsection{Semantics}
\label{sec:sandboxing}
Makeflow allows for tasks in this form to be executed on a wide
variety of execution platforms, including traditional batch
systems (such as
SLURM\cite{Jette02slurm:simple},
HTCondor\cite{condor-hunter},
and SGE\cite{Microsystems:2001:SGE:560889.792378}),
cluster container managers, and cloud services.
Because each of these systems differ in considerable ways,
it is necessary to define precise semantics about the
execution of the task and the namespace in which it lives.
Once these semantics are established, it becomes possible
to write transformations that work correctly regardless of the underlying system.
To accommodate these varied systems, we introduce the
sandbox model of execution.
The {\bf sandbox model} of execution isolates the environment
and limits interactions to only specified files.
Isolating the task to run in only the specified environment allows for greater
flexibility in where the task can run, as well as increasing the reproducibility
of execution. Limiting the locally available files helps
prevent undocumented file usage, enforcing the accuracy of the
file lists.
Applying a sandbox to a task is
a multi-step process for ensuring consistent environment creation.
The steps are as follows:
\begin{enumerate}
\item Allocate/ensure appropriate space for execution, based on resources.
\item Create sandbox directory.
\item Link/copy inputs to ensure correct in-sandbox name, based on inputs.
\item Enumerate environment variables based on the specified environment.
\item Run task defined command,
using \emph{pre}, \emph{cmd}, and \emph{post}.
\item Move/copy outputs outside of sandbox with appropriate out-sandbox name, based on outputs.
\item Exit and destroy sandbox.
\end{enumerate}
\begin{figure}[H]
\includegraphics[width=\columnwidth]{graphics/sandboxing_short.pdf}
\caption{The sandbox model of task execution.
This shows the different steps needed to isolate
the task from the underlying workflow environment
to prevent side effects on the environment and
filesystem.}
\label{fig:sandbox}
\end{figure}
\begin{figure}[t]
\begin{framed}
%\small
\begin{verbatim}
define Singularity(T)
{
"command" : {
"cmd": "singularity run image " +
T.script + " > log." + T.ID
},
"inputs" : T.inputs +
["image", T.script],
"outputs" : T.outputs +
["log."+T.ID],
"resources" : {
"disk" : T.resources{disk} + 3G
}
}
\end{verbatim}
\end{framed}
\caption{Abstract {\tt Singularity} transformation}
\small
\emph{Describes the Singularity command, added files
(such as image and output log), and increases
the required disk space. Note, several of the
variables are unbound and will be resolved when
applied to a task. Unaltered fields are left
undefined.}
\label{sing-wrap}
\end{figure}
\subsection{Transformations as Functions}
A transformation is an abstraction of a task
and provides the information needed
to translate a raw program invocation into
a properly defined task.
A transformation contains the same fields defined in a task:
a command, inputs, outputs,
resources, and environment.
However, it is an incomplete task with
unbound variables that are resolved
when applied to a task as a function.
\Cref{sing-wrap} illustrates
Singularity written as a transformation.
As mentioned above, the generic definition of
the transformation contains unbound variables
such as {\tt T.cmd}, {\tt T.inputs}, and
{\tt T.outputs}.
When the transformation is
applied to a task, those variables
are bound from the task's structure.
Singularity requires
additional space (3G) to account
for the Singularity image.
Here, resources are not defined as
a static value.
Rather, they are added to the
underlying task's resources.
Additionally, the Singularity transformation
does not
define an environment, so it is left out.
\begin{figure}[t]
\begin{framed}
%\small
\begin{verbatim}
eval Singularity(T1) yields
{
"command": {
"pre":[ ],
"cmd": "singularity run image " +
"t_ID.sh > log.ID"
"post":[ ],
},
"inputs" : ["sim.exe", "in.txt",
"image", "t_ID.sh" ],
"outputs" : ["out.txt",
"log.ID"],
"environment" : {}
"resources" : {
"cores" : 1,
"memory" : 1G,
"disk" : 13G,
}
}
\end{verbatim}
\end{framed}
\caption{Resulting task of applying {\tt Singularity} to {\tt T1}.}
\small
\emph{The transformed task has all of the variables bound.
The file lists have combined the previously defined files
with the files added by {\tt Singularity}.
The resources are resolved and the required values account
for the original task and the transformation.}
\label{sing-task}
\end{figure}
The resulting task of evaluating
\verb$Singularity(T1)$
can be seen in \Cref{sing-task}.
The previously unbound variables have
been resolved, such as {\tt T.inputs}
becoming {\tt ["sim.exe, "in.txt"]}.
The values that were not defined or
extended by \verb$Singularity$ were
resolved from the underlying task,
such as {\tt cores} and {\tt memory}.
Importantly, to create a valid task,
even empty fields like $pre$, $post$,
and $environment$ are still specified,
allowing further evaluation and additional
transformations to be applied.
If you look carefully at \Cref{sing-wrap}
you will notice two variables not
bound by the underlying task
directly, {\tt T.script} and {\tt T.ID}.
As part of the abstraction, the
underlying task is emitted as a script
that is called in place of the command,
creating {\tt T.script}.
The ability to treat transformations as functions
is achieved by isolating
each transformation as a separate process.
Isolating a transformation provides
several key benefits:
clearly defined ordering of transformations,
instantiated environments persist only
in that process and its children,
and exit status can be attributed
at each level to track failures.
In practice this is achieved by
producing a script that
defines the task, as seen in \Cref{task-script}.
The second variable, {\tt T.ID},
is key to this method's success.
The ability to uniquely identify
each task provides a clear
mapping to the workflow.
A unique identifier is created
using the checksum of the
current task, which incorporates the
command, input files' names and contents,
output files' names, environment, and
resources.
This identifier names the output
script and can be used by the
transformation to uniquely identify
files in the workflow.
Additionally, as applying a transformation
produces a new task, the identifier
is updated after each transformation.
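As a rough sketch of how such a checksum could be computed for {\tt T1}
(the choice of hash and the exact field encoding are implementation
details not specified here):
\begin{framed}
\small
\begin{verbatim}
#!/bin/sh
# Hypothetical sketch: hash the command, the
# input file names and contents, the output
# file names, and the resources of task T1.
ID=$( (echo "sim.exe < in.txt > out.txt";
       cat sim.exe in.txt;
       echo "out.txt";
       echo "cores=1 memory=1G disk=10G") \
      | sha1sum | cut -d' ' -f1 )
echo "$ID"
\end{verbatim}
\end{framed}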
\begin{figure}[h]
\begin{framed}
\small
\begin{verbatim}
#!/bin/sh
#ID TASK_CHECKSUM
# POST function
POST(){
# Store exit code for use in analysis.
EXIT=$?
# Run post commands.
# Exit with stored EXIT which may
# have been updated by post.
exit $EXIT
}
# Trap on exit and call POST.
trap POST EXIT INT TERM
# Export specified environment.
# Run pre commands.
# Run core command.
sim.exe < in.txt > out.txt
\end{verbatim}
\end{framed}
\caption{Script created when evaluating {\tt Singularity(T1)}.}
\label{task-script}
\end{figure}
\subsection{Applying the Sandbox Model}
The creation of a script from a task
focuses on isolating just the transformation
but relies on finalization of the task sandbox
laid out in \Cref{sec:sandboxing}.
To consistently apply the sandbox model to a task
we define a sandbox procedure to produce a script
that creates a sandbox, handles files, and runs
the command. This procedure is applied to a
task prior to execution to isolate the task to
a single sandbox directory.
This begins with
creating a unique identifier, based on the task checksum.
The identifier is used to create the
sandbox and script names used in execution. In the
script, a {\tt POST} function captures
the exit status, executes {\tt post} commands, and
returns the outputs.
This function is set as a trap so that
failures are also analyzed.
Next, the sandbox is created and
inputs are linked into it.
The process changes directories,
exports the environment, and
runs the {\tt pre} commands.
After the environment is set up,
the task {\tt cmd} can run.
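A minimal sketch of such a script for task {\tt T1} is shown below;
the script Makeflow actually generates may differ in detail.
\begin{framed}
\small
\begin{verbatim}
#!/bin/sh
# Sketch of the sandbox procedure: make a
# sandbox, link inputs, run the command,
# then return outputs and clean up.
ID=TASK_CHECKSUM
TOP="$PWD"
SANDBOX="$TOP/sandbox.$ID"
POST(){
  EXIT=$?
  # Return outputs to the workflow.
  cp "$SANDBOX/out.txt" "$TOP/"
  # Destroy the sandbox.
  cd "$TOP" && rm -rf "$SANDBOX"
  exit $EXIT
}
trap POST EXIT INT TERM
mkdir -p "$SANDBOX"
# Link inputs into the sandbox.
ln "$TOP/sim.exe" "$TOP/in.txt" "$SANDBOX/"
cd "$SANDBOX"
# Export the environment, run pre commands,
# then the core command.
./sim.exe < in.txt > out.txt
\end{verbatim}
\end{framed}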
\section{Transformations in Practice}
In applying the above algebra, there are
design considerations to be made.
To maintain the ability to nest several
transformations together, it is important
to consider naming conflicts, the
differentiation of {\tt pre},
{\tt cmd}, and {\tt post}, file management,
resource specification, and how the
environment of a task is elaborated.
\subsection{ Composability versus Commutability }
An important aspect of this algebra is the ability
to reason about how combinations of
different transformations interact and whether they
can be applied to create a valid task.
Using the previously defined application of
transformations, we find that transformations
are composable, but not commutable.
Transformations are not commutable because
the order in which they are applied changes the core
evaluation of the task.
This is, by design, to allow for the differentiation
of transformation ordering.
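For example, composing the {\tt Singularity} transformation of
\Cref{sing-wrap} with the resource monitor transformation
{\tt RMonitor} of \Cref{json-file} in either order gives two
different nestings (a sketch; the script names are illustrative):
\begin{framed}
\small
\begin{verbatim}
eval Singularity(RMonitor(T1)) runs:
  singularity run image t_2.sh > log.2
    where t_2.sh runs: rmonitor -- t_1.sh

eval RMonitor(Singularity(T1)) runs:
  rmonitor -- t_2.sh
    where t_2.sh runs:
      singularity run image t_1.sh > log.1
\end{verbatim}
\end{framed}
In the first ordering the monitor measures only the work inside
the container; in the second it also measures the container
runtime itself.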
Transformations, in general, are composable.
Any transformation can be
applied to any task and produce a valid task,
with the exception of static name collisions.
A static name collision can result when
an application uses hard-coded or
default names for files,
careless naming, or even randomly
generated names.
Running a single transformation at a time
may not cause a collision,
but nested transformations and concurrent tasks
make collisions inevitable,
as is often seen with output logs
and files sharing names between tasks.
Naming is resolved at the local level by detecting
when applying a transformation
creates overlapping names.
If collisions are detected, the
transformation is not applied and
a failure is returned.
Though this restricts some combinations,
this can be overcome by
better understanding the application
and using options to
produce unique files.
However, if the same restrictions were
applied to tasks across the workflow,
transformations with static names would
be prohibited entirely.
Because static names may be unavoidable,
statically named files may be remapped
to unique names in the workflow.
As each task is isolated in a sandbox,
static files can be renamed when moving to the
global namespace using the task identifiers.
Remapping a file relies on
a more verbose file specification:
a JSON object instead of a string filename.
The JSON object allows the transformation
to give an \emph{inner\_name},
the name inside the sandbox,
and an \emph{outer\_name},
the name in the workflow context.
An example of how this would look with
a statically named file can be seen in
\Cref{json-file} which defines a resource
monitor transformation.
\begin{figure}[h]
\begin{framed}
\small
\begin{verbatim}
define RMonitor(T) {
"command" : [
"cmd": "rmonitor -- " + T.script
]
"inputs" : T.inputs + ["rmonitor",
T.script]
"outputs" : T.outputs +
[{"outer_name="summary."+ID,
"inner_name"="summary"}]
}
\end{verbatim}
\end{framed}
\caption{Verbose JSON object file specification.}
\small
\emph{In this example, the resource monitor uses
a statically named default summary, "summary".
In this case, the summary
file's name is static and will collide in
the global workflow context. To avert this
collision the file is specified with its
static inner\_name, and a unique outer\_name
using the task's ID.}
\label{json-file}
\end{figure}
\begin{figure*}[t]
\includegraphics[width=\textwidth]{graphics/environment_extrapolation_pwrap_ctask_simp.pdf}
\caption{Environment elaboration across workflow and task execution.}
\small
\emph{The environment that exists at task execution is
derived from several sources.
The environment starts at the DAG where
variables are resolved internally and from the host machine.
These values define the task's initial environment.
Transformations are applied to the task and extend the
environment (\textit{a}, \textit{b}, \textit{c} are generic transformations),
but these extensions take effect only at execution.
At the execution site, the environment is
defined by the execution node and batch system.
At execution, each transformation is
applied and invokes its environment, limited
to that transformation's execution.}
\label{figure:env-extrap}
\end{figure*}
\subsection{Command Description}
Commands express the setup, execution, and post processing
of a task.
Commands are broken up into three parts,
${pre}$, ${cmd}$, and ${post}$
based on the command structure outlined.
${Pre}$ is a set of commands that run prior to task invocation
and set up the task sandbox.
This includes setting environment variables,
configuring dependencies, and loading modules or software.
For example, a Docker transformation would use ${pre}$ to load or pull images.
${Post}$ is a set of commands that run after task invocation
and is used to
handle failures by interpreting or masking them,
create outputs to prevent batch system failures due to missing files,
or validate the correctness of outputs.
${Post}$ can differentiate
Docker failing to load an image from
a task execution failure,
allowing more nuanced debugging.
The ${cmd}$ string defines the context in which
the underlying command is invoked:
how the underlying task is called and
how the effects of the calling transformation are isolated.
A benefit of separating the command into these parts
is that it allows us to differentiate the failures or
problems that result from each part.
This is useful for determining that the setup of
a container failed (so the task should not run),
or for preventing a failed $post$ analysis from
falsely indicating a task failure.
This separation also allows for each transformation
to be clearly expressed in a script,
enabling simplified debugging.
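As a hypothetical illustration of a transformation that uses all
three parts (this {\tt Docker} transformation, the image
{\tt centos:7}, and the output check on {\tt out.txt} are
illustrative and not drawn from the paper's case studies):
\begin{framed}
\small
\begin{verbatim}
define Docker(T)
{
  "command" : {
    "pre" : ["docker pull centos:7"],
    "cmd" : "docker run --rm -v $PWD:/w " +
            "-w /w centos:7 ./" + T.script,
    "post": ["test -s out.txt || " +
             "echo 'no output' >&2"]
  },
  "inputs" : T.inputs + [T.script],
  "outputs" : T.outputs,
  "resources" : {
    "disk" : T.resources{disk} + 1G
  }
}
\end{verbatim}
\end{framed}
Here ${pre}$ pulls the image before invocation, ${cmd}$ runs the
emitted task script inside the container, and ${post}$ checks
that the expected output was produced.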
\subsection{File List Management}
As transformations are applied, the list of inputs and outputs grows.
For transformations to compose correctly, the
set of required files must be captured in the task structure,
allowing the submitting system to confirm required inputs
and verify expected outputs.
It is possible for a transformation to rename or mask an existing
file in the list.
By doing so, the transformation changes the
context of the task when evaluated.
This can be done to allow for redirecting shared files or when using
installed reference material.
Maintaining a correct set of files helps
prevent task collision.
This information can also be used to apply
$pre$ or $post$ actions to the files,
estimate the space needed for execution,
or log these files for later analysis.
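As an illustration of how these lists grow, nesting the transformations
defined earlier might yield the following (a sketch, using an informal
{\tt .inputs}/{\tt .outputs} accessor; the identifiers {\tt 1} and
{\tt 2} stand for the checksums at each step):
\begin{framed}
\small
\begin{verbatim}
eval RMonitor(Singularity(T1)).inputs
 yields [ "sim.exe", "in.txt",
          "image", "t_1.sh",
          "rmonitor", "t_2.sh" ]

eval RMonitor(Singularity(T1)).outputs
 yields [ "out.txt", "log.1",
          {"outer_name": "summary.2",
           "inner_name": "summary"} ]
\end{verbatim}
\end{framed}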
\subsection{Resource Provisioning}
The resources define the necessary allocation for proper task execution.
This value is extended and augmented by transformations
as the context and required resources change.
Commonly, as transformations are applied,
additional disk space is needed to store new files
(like container images).
Resource provisioning may not only be additive
as transformations are applied,
but also maximal.
This is typically the case for cores:
the number of cores does not grow as
transformations are added, but reflects
the largest number of cores needed by any
transformation.
For example, MPI uses a static number of cores;
to reflect this, the resource specification
takes the maximum of the provided value and the previous
specification.
The value of the resources required for a task
tracks the largest set of each resource.
After the transformations have been applied,
the final task contains a single specification
reflecting the total expected usage.
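One way to summarize this combination rule for a transformation $X$
applied to a task $T$ (a sketch; the paper does not prescribe a single
formula) is:
\begin{align*}
R_{X(T)}.\mathrm{disk}  &= R_{T}.\mathrm{disk} + R_{X}.\mathrm{disk}\\
R_{X(T)}.\mathrm{cores} &= \max\big(R_{T}.\mathrm{cores},\, R_{X}.\mathrm{cores}\big)
\end{align*}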
\subsection{Environment Elaboration}
%GIVE INTRO TO ENVIRONMENT HANDLING
An important aspect of a task is the
execution environment.
The environment defines a variety of values
that control execution such as available executables,
required libraries and values,
and available machines for cluster execution.
However, the environment is often overlooked
or ignored by the researcher,
which causes corruption, errors, and failures.
This can be mitigated at a single site,
but as more sites are used, managing these environments
becomes unrealistic.
It is important to understand
how the task environment is defined,
when transformations are applied, and how to
direct the execution environment.
\Cref{figure:env-extrap} illustrates the many
places from which a task's environment,
or expected environment,
is derived and how it
changes over the course of execution.
The workflow is
executed with the submit machine's environment (${E_S}$),
defines an internal DAG environment (${E_D}$),
and dispatches a task-specific environment (${E_T}$).
The task environment is defined with values derived from
the DAG and submit machine, but crucially should
not include variables that
reference programs, libraries, and values
that do not exist at the execution site.
After the task is produced,
transformations are applied that may
append, update, or mask the provided variables.
As a transformation is applied, ${E_T}$ is
written out to a script.
The transformation can update values set
in the task and add values as needed,
such as adding applications to the $PATH$
or required libraries.
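A hypothetical fragment of such a generated script (the paths and
values shown are illustrative) might export the elaborated environment
as follows:
\begin{framed}
\small
\begin{verbatim}
# Export the task environment E_T as
# extended by the transformation; these
# values exist only for this process
# and its children.
export PATH="/opt/sim/bin:$PATH"
export LD_LIBRARY_PATH="/opt/sim/lib"
export SIM_CONFIG="calib.dat"
\end{verbatim}
\end{framed}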