-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Defining Environment Variables Within a Task #504
Conversation
Thanks @patmagee for moving this along! As you know, I strongly support the general idea of enabling environment variables as an alternative (often a superior one) to I'd prefer not to need the separate So the call would just look like
(Aside- I think we should get rid of the useless Another idea worth considering was to use type specializations instead of a separate section, keeping the current declaration syntax, e.g.,
This allows the task author to control which environment variables the caller can and can't override. My feeling is that either of the above approaches do the job well and are somewhat simpler for both WDL authors and engine implementers. |
Another angle on the prior comment: the container environment variables are an implementation detail that should be encapsulated inside the task. The caller doesn't really need to know whether a given input will turn into a container environment variable or not. The task author might want to change how they consume an input from interpolation to environment variable, and such a change needn't trigger required changes in calls too. |
@mlin that is a very valid point. The one thing I was trying to avoid was the automatically setting environment variables in the task based on input strings. Without a special prefix that would likely be error prone and potentially break many different tasks. I must admit, I was initially torn between a type and a an Option 1: Keep the block in a task but require values to be setThis option would be a good way of being very declarative. We can keep the The benefit to this approach is that it proves fine grained control over what gets exposed in the environment and it groups everything together. task foo {
input {
String myInp
Int maxNum
}
env {
MYINP = myInp
OTHER = "someValue"
MAXNUM = maxNum
}
} The drawback to this approach is that it adds potentially more boilerplate to a task. The declared environment variables would still only be usable within the command block, but the workflow has no concept of environment variables Option 2: Use a declared typeThis would be following your suggestion and use a type like
Option 3: add some sort of qualifier to the typeWe could also go the route of adding an additional qualifier to the type or the declaration some ideas would be. task foo {
input {
# explicitly state where to expose the string. this would be my favourite probably
String FOO in env
# use a type qualifier preceding the type. This one could also be my favorite
env String FOO
# Use some sort of qualifier coming after the type declaration
String@env FOO
String:env FOO
String{env}
Type qualifiers have the drawback (or benefit) of being applicable to any type. Therefore any type could potentially be set in the env and it's the engines role to figure that out. Ie serialization of a struct or other complex type I am definitely with you that we do not want to expand the concept of what constitutes an input, at least at this point. |
I like this too! One further question would be whether to allow I think I'd say yes, but could be convinced otherwise |
@mlin I do not see a reason to limit this to only strings. I think allowing any type would gain a lot of support from the community. You could put it in a struct then easily use something like
One question then is whether the |
Adding something like directives as a concept can generalize as well. For example in a task you could define |
Due to environment variables not being usable in WDL expressions, that limits their power quite a bit. For some input variables being able to derive other variables from them is quite essential for the good functioning of a task. Therefore environment variables are never a complete solution. As such there is not one way to define inputs, but two equally valid ways when this becomes part of the spec.
I find this a bit shaky ground, as this means that the engine and the shell will be more tightly coupled. I know the current spec defines My main problem with adding environment variables is that instead of one, we create two ways of defining inputs. This is quite confusing, as it is quite arbitrary which method is to be preferred. If I may quote the Zen of Python: "There should be one-- and preferably only one --obvious way to do it." This change violates that principle and therefore it makes WDL less usable. Given that the actual problem that we are trying to solve here is " injection attacks or poor formatting during execution" I think the solution should be more tailored towards that end. Another solution is to quote all WDL variables by default while templating and to provide a function that disables this (conveniently called |
Environment variables are a universal mechanism that every shell/language interpreter will have an idiomatic way to consume, and will also be more familiar to authors learning WDL who do have other scripting experience. I think shakier ground is the suggestion that engines could escape each value in a way that would be so universally safe and compatible. That suggestion works better if, on the contrary, we double-down on specifying that bash is the interpreter -- but even then, we'd face the common pattern of the bash script running an embedded heredoc in another language. Re "Environment variables are not usable within other WDL expressions" -- I doubt that restriction is really needed. I view them as additional input declarations, so of course, they should be usable in other WDL expressions. Re creating two ways of defining inputs -- it's an important point but my personal opinion is that the interpolation of strings directly into the command script was a minor design flaw from the beginning, and we should gradually deprecate them in favor of environment variables; but I don't expect everyone to share that opinion exactly... |
@mlin. Agreed. Redesigning WDL to only use environment variables and deprecate the templating is the better option.
If I add up things correctly I come to the following conclusions:
This solves the templating/injection issue. While restricting the language instead of expanding it. That seems preferable to me. |
That's a really interesting concept. It would let us remove the (command) interpolation syntax entirely. And allow things like arrays to be supplied natively. Are there any use cases which would get worse if this were the case? |
Supplying optional variables comes to mind. Where you can do I guess that could still be made possible for optional values without templating? So that it either evaluates to In retrospect my statement "Input variables can be used in expressions in any part of the WDL. Except the command block." was a bit harsh. I guess you can still allow any sort of expression, as long as it evaluates to a singular variable that can be saved as an environment variable and does not depend on being templated to text. |
And another thing that is hairy: How to work with tools that use the same flag multipe times to specify output files. For instance |
CWL has a pretty nice mechanism for constructing command lines, I think there's value in being separate from that though, for that reason I do like the command block in WDL, and generally making information available via env variables |
So, how about this for a concrete proposal: Instead of the In relation to environment variables, we can use the task foo {
input {
env String HELLO
}
# Also works
String HELLO_FROM_BOB = HELLO + " from bob"
command <<<
echo ${HELLO}
# Still works
echo ~{HELLO}
>>>
}
|
@patmagee while I like the concrete proposal As an example of the can of worms I think that could open, a long time ago @orodeh and I sketched out adding a metadata section for each input declaration, e.g.
To be clear: I'm NOT trying to advance that now, just mentioning the kind of thing that might bog us down if we get into use cases for a general concept like Directive/Decorator. |
@mlin good point, I am happy to keep this restricted to just the |
I modified the PR to remove the explicit |
@mlin @illusional @rhpvorderman @cjllanwarne do you have any issues with the current approach? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally happy with this approach, except I don't think you should be able to mark an environment variable from a workflow call
I'm still not sure how you would achieve something like this with the new approach.
--param1 value1 --param2 value2
versions/development/SPEC.md
Outdated
@@ -2791,7 +2868,7 @@ workflow main_workflow { | |||
|
|||
Any required workflow inputs (i.e. those that are not initialized with a default value or expression) must have their values provided when invoking the workflow. Inputs may be specified for a workflow invocation using any mechanism supported by the execution engine, including the [standard JSON format](#json-input-format). | |||
|
|||
By default, all calls to subworkflows and tasks must have values provided for all required inputs by the caller. However, the execution engine may allow the workflow to leave some subworkflow/task inputs undefined - to be specified by the user at runtime - by setting the `allowNestedInputs` flag to `true` in the `meta` section of the top-level workflow. For example: | |||
By default, all calls to subworkflows and tasks must have values provided for all required inputs and environment variables by the caller. However, the execution engine may allow the workflow to leave some subworkflow/task inputs/environment variables undefined - to be specified by the user at runtime - by setting the `allowNestedInputs` flag to `true` in the `meta` section of the top-level workflow. For example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still relevant? I think it's more clear to say "any input marked with the env
directive is required, or not too.
versions/development/SPEC.md
Outdated
@@ -2576,14 +2652,15 @@ workflow wf { | |||
|
|||
# Calls my_task with one required input - it is okay to not | |||
# specify a value for my_task.opt_string since it is optional. | |||
call my_task { input: num = i } | |||
call my_task { input: num = i env: NUM = i } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't super love this. I think all tool interactions should be handled by the tool wrapper and not be the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should definitely be removed, I think its outdated from the previous iteration. I will update the PR
GOVERNANCE.md
Outdated
| Patrick Magee | DNAstack | [patmagee](https://github.com/patmagee) | | ||
| Brian O'Connor | University of California, Santa Cruz | [briandoconnor](https://github.com/briandoconnor) | | ||
| Geraldine Van der Auwera | Broad Institute | [vdauwera](https://github.com/vdauwera) | | ||
| Christopher Llanwarne | Broad Institute | [cjllanwarne](https://github.com/cjllanwarne) | | ||
| John Didion | DNAnexus | [jdidion](https://github.com/jdidion) | | ||
| Michael Franklin | Centre for Population Genomics | [illusional](https://github.com/illusional) | | ||
| Amy Paguirigan | Fred Hutch | [vortexing](https://github.com/vortexing) | | ||
| Ruben Vorderman | Leiden University Medical Center | [rhpvorderman](https://github.com/rhpvorderman) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some PR cross-contamination here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, will fix!
@patmagee, my primary comment is that I think this is a inverse approach by requiring the maintainer to put env in front off everything to make sure no shell code can be injected. Forget the Speaking from a BioWDL perspective: we don't have any pressing security concerns since all our tasks are containerized, and only user inputs are mounted in the containers. The containers are run as the UID exclusively and not as root. Furthermore our pipelines are not run as a service, and run by individual users on our compute cluster. They can only break things they have the right to break in the first place so it makes no sense adding lots of guards against this. In practice security is a non-issue for us. I say this not to invalidate the PR, but to give the proper weight to my own opinion: very low. We have no stake in this. |
I have been thinking somewhat more about this and I start to wonder: what does this feature protect you from that containerization does not already? If users can submit their own workflows to the system, then this addition is pointless. So this protects specifically those instances where an entity provides WDL as a service, but did not containerize the backend properly? (Instead opting to provide all bioinformatics tools under the sun in a single environment without dependency conflicts). That seems a very unlikely use case. |
@rhpvorderman you make some good points. To start, containerization gets you pretty far down the road of security assuming that the system is using a secure container technology. Unfortunately even in the most restricted environments where users can only pick from a list of pre-approved workflows there is the opportunity for abuse. For example, If you happen to know the execution environment, you could construct a nefarious string that would run a specific attack within the container (not so bad), but more importantly within the network the container is running in potentially under any authority assigned to that container/vm. Google tools are great examples of this, where there is a metadata api allowing you to run gsutil within a container and actually retrieve a service account dynamically simply by running gsutil on a VM. So all that is to say, I think in some circumstances the security concerns are warranted, but I agree that the overall impact to the user should be as minimal as possible. To be honest, the original motivator was to not override any existing variable within the container by automagically converting inputs into env variables. This actually may not be a big concern, especially not if this is a well documented feature. If all values were accessible either through the WDL semantics ( |
Included this feature in miniwdl v1.8.0 |
@mlin I tried out this feature, and I have to say, really love it as defined. Honestly I am 100% happy with it :D |
I think this is ready for final review, and if everything looks good we can get it voted on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - just a few small suggestions/questions
versions/development/SPEC.md
Outdated
### Environment Variables | ||
|
||
Any input to a task may be converted into an environment variable that will be accessible from within the task's | ||
shell environment during command execution. Unlike inputs, environment variables are not templated into the command when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might say "interpolated" or "replaced in" rather than "templated into"
versions/development/SPEC.md
Outdated
Any input to a task may be converted into an environment variable that will be accessible from within the task's | ||
shell environment during command execution. Unlike inputs, environment variables are not templated into the command when | ||
preparing the execution script, but are actually available at runtime directly from the environment. They may be accessed using | ||
normal shell variable access semantics (ie `$FOO` or `${FOO}`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e.
versions/development/SPEC.md
Outdated
environment, it does not change its access to normal WDL expressions within a command. That means you can refer to a declaration | ||
annotated with `env` either through the shell semantics (`${FOO}`) or through normal WDL semantics (`~{FOO}`). | ||
|
||
When an input is annotated with `env` it is the engines responsibility to serialize the value appropriately into a string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
engine's
versions/development/SPEC.md
Outdated
annotated with `env` either through the shell semantics (`${FOO}`) or through normal WDL semantics (`~{FOO}`). | ||
|
||
When an input is annotated with `env` it is the engines responsibility to serialize the value appropriately into a string. | ||
Serializing complex values (`Map`, `Pair`, `Struct`, `Array`) should be equivalent to calling `write_json` on the given |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a link [WDL Value Serialization and Deserialization](#appendix-a-wdl-value-serialization-and-deserialization)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, write_json
writes the value to a file. Is that the desired behavior - that the environment variable will have a path to the file with the serialized value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point to discuss:
On the one hand, if the developer definitely wants the JSON in a file then they can easily add a private declaration env File x_json = write_json(x)
. Given that's easy, maybe the environment variable for a compound value should contain the JSON directly.
On the other hand, JSON directly in the environment variable isn't very natural to work with in shell; often you'd end up just dumping it into a file to be parsed anyway. Also, developers should be discouraged from from cramming large compound values directly into environment variables, because there will usually be some system-imposed limit on environment size.
TBH, env File x_json = write_json(x)
makes the intent much clearer to somebody reading the code, so whichever way we go on this question, I think we should actually discourage using env
with compound values at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mlin JSON directly as environment variables is actually not as cumbersome as you may think. We regularly use it in our own CI/CD infrastructure and pass values to and from tools like jq
which are great at parsing json. We rarely ever need to use temporary files and just pipe input from one command to the other, storing intermediate entries in a environment variable.
I think pointing to the example of the write_json
is actually not an accurate representation of what is happening. we probably want to just say that complex types will be serialized to JSON
prior to being set as an environment variable.
There is of course the limitation on size of the environment variables. So Maybe we can provide guidance here: "Engines may impose limits on the total length a single environment variable is allowed to occupy as well as the number of environment variables that are allowed to be passed into a single task. If such limitations exist, engines should clearly document them for the end user."
I think with the recent updates, this PR is ready for voting and inclusion in the 1.2 Spec. If there is no one opposed, I will happily mark this as open for voting |
@patmagee For your consideration...I asked GPT-4 to rewrite a few of the prose paragraphs more concisely. (Turning into a standard practice!)
(129 words, down from 245) |
I have some doubts about this:
I think it's quite unusual to bind the environment variable to an escaped value. It means our environment variables have a peculiar "format", which seems to counteract the goal of using less-WDL-specific practices. I think the environment variables should be bound to the exact (stringified) value, so that task commands would inherit shell scripting's best practices and pitfalls for handling potentially-malicious environment variables securely. Chiefly, they should be used inside double-quotes, or if necessary explicitly escaped using Automatically escaping the stringified value feels to me like something I wish we'd done when we introduced the WDL-specific |
@mlin I think you are right, and the stanza you highlighted is really a bit of a misguided attempt and enforcing how env variables are set. Fundamentally what I was trying to achieve with that statement is the ability to set environment variables without breaking scripts. This is probabyl more perscriptive then it needs to be, as all I really am trying to state is the engine should set the literal value similar to how the docker .env files works |
@patmagee please change target to |
@jdidion I have update the spec against 1.2 |
This is probably not a use case that we would want to advertise, at least for containers on remote machines. Putting credentials into environment variables for containers is not a great idea. It is better to have the underlying host machine use an appropriate IAM role. For local execution it is probably OK (as long as you know the workflow is not going to echo them to logs or someone else's machines). |
When a task is run outside of a container then presumably the environment is applied to the shell executing that tasks command. Do we need to specify where the environment variable is applied for a containerized task? Would the environment variable be applied to the container (for example as part of some docker command) or would the variable be applied to the command shell that will run in the container? |
@markjschreiber is the distinction that the environment variable may change depending on whether it is injected into the task shell script vs into the docker image? |
Pretty much. I think it's fine to clarify with something like - if the task is run in a container the env var is "injected" into the container and not applied to the shell on the host that runs the container. |
@markjschreiber would saying something like the following work The environment variable should be evaluated by the engine prior to injecting it into the execution environment. if the task is run in a container the env var is "injected" into the container and not applied to the shell on the host that runs the container. |
Perfect! |
This PR is the result of a few discussions that have been ongoing over the past while (#458 and #443) of different ways to make improve the security and usability of WDL.
Declare Environment Variables within a Task
This PR adds the ability to declare environment variables within the task scope, as well as to pass values for those variables within a
call
block. Most WDL engines rely on templating in values to the command section prior to command execution. While this is convenient and allows us to define our own expression language within a command block, it does mean that any WDL command is subject injection attacks or poor formatting during execution.Using environment variables circumvents this issue by not relying on templating at all. Instead the execution engine safely escapes strings prior to setting them within the container (or other execution) environment at runtime. Environment variables are not usable within other WDL expressions and once defined are accessed using the normal shell semantics associated with the given interpreter.
Environment variables are all implicitly strings, and therefore can be set to any expression which either evaluates to a string, or is coercible to one.
Checklist