Feature Flags, Errors and Exclude #3134
Comments
This is a terrific write-up. I love the example use cases, press-release preview, the drawbacks to the design, and the alternatives considered. Well done @yhakbar 👏 Some feedback in a somewhat random order:
|
Responding to @brikis98:

Feature Flag Dynamicity

This design did assume that it would be fully compatible with usage of external web services for feature flag management! I wanted to focus on the core functionality of how the

I would guess that the majority of users leveraging the feature flag functionality proposed here would be setting and adjusting environment variables dynamically in their CI/CD pipelines like GitHub Actions, GitLab CI, Jenkins, etc. Prioritizing the ability to toggle feature flags via environment variables and CLI flags was a way to ensure that the feature flag functionality could be used in a wide variety of CI/CD environments, without relying on an external service.

e.g. In the context of a GitHub Actions workflow, configuration like the following would allow for the

    env:
      TG_FLAG_use_service_module_v2: ${{ vars.TG_FLAG_use_service_module_v2 }}
    run: terragrunt apply -auto-approve

Now, for users who are currently using a feature flag management service, I think the current design does not preclude them from using it. There are two ways that I would expect users to use the feature flag functionality as currently proposed in conjunction with a feature flag management service:
I like the idea of seamless integration with feature flag management services that doesn't require leveraging Terragrunt functionality in a manner this sophisticated, however. If this is commonly done within the community, it might be worth it to prioritize a system for integrating with these services directly. Maybe a plugin system that provides nice interfaces for common feature flag management services?

Mixing of Concerns

I agree that there's definitely tension between the feature flag concept, the error suppression concept and the module skip concept. The error suppression and module skip concepts do end up constricting the feature flag implementation in such a way that it's not as flexible as folks typically want feature flags to be. Tying it to those concepts requires that the feature flag is boolean to allow for the module to be skipped or not, and that the feature flag is used to determine whether or not to suppress errors. As you described, this prevents usage of string or numeric feature flags.

At the same time, I could imagine users wanting to tightly integrate those concepts, as it might only make sense to suppress particular errors within the context of a feature flag being enabled.

What do we think about having three separate configuration blocks for feature flags, error suppression, and module skipping? This would allow for more flexibility in how these concepts are used together, and would allow for more complex feature flag configurations that don't necessarily involve error suppression or module skipping.

So, instead of:

    feature "feature_name" {
      default = false # Optionally default it so that you can opt-in or out.

      # Conditions that result in the feature being skipped.
      skip {
        actions = ["all"] # Actions to skip when active. Other options might be ["plan", "apply", "all_except_output"], etc
      }

      # Alter behavior on failure
      failure {
        ignorable_errors = ".*" # Specify a pattern that will be detected in the error for ignores, or just ignore any error
        message = "Flaky feature failing here!" # Add an optional warning message if it fails
        # Key-value map that can be used to emit signals on failure
        signals = {
          safe_to_revert = true # Signal that the apply is safe to revert on failure
        }
      }
    }

We could have:

    feature "feature_name" {
      default = "A"
    }

    skip {
      if = feature.feature_name.value == "A"
      actions = ["all"]
    }

    failure {
      ignorable_errors = feature.feature_name.value == "A" ? [".*"] : []
      message = feature.feature_name.value == "A" ? "Flaky feature failing here!" : "Woah, this feature is supposed to be solid!"
      signals = {
        safe_to_revert = feature.feature_name.value == "A"
      }
    }

And folks might just conventionally keep the blocks together within the

I worry that this might introduce quite a bit of complexity to the configuration, but it might be worth it for the added flexibility. It would allow for the values of feature flags to take on more complex values, and for the other concepts to be used outside of the context of feature flags.

Feature Flags as Functions

I like the idea of not needing additional configuration blocks for feature flags, and instead using them as functions that can be used in various places in the configuration. I don't know if one would end up being more expensive to maintain than the other, so it might be worth preferring the cheaper option.

There may be advantages to having the feature flag defined via a block, however. e.g. It might be easier to see all of the feature flags that are available in configuration at a glance:

    feature "feature_name" {
      default = "A"
    }

    terraform {
      source = "github.com/foo//bar?ref=${feature.feature_name.value == "A" ? "v2" : "v1"}"
    }

Might be easier to spot than:

    terraform {
      source = "github.com/foo//bar?ref=${feature_flag("feature_name") == "A" ? "v2" : "v1"}"
    }

That would be especially relevant when searching for feature flags to remove once features are stable. This might even lend itself to a

There is also functionality that could be added to the feature block that would be difficult to add to a function. For example, configuring a default value for a feature flag might be more likely to be consistent when done via a block than when done via a function. e.g. To keep a default value consistent across all uses of a function, you might have to do something like:

    locals {
      do_experiment = feature_flag("DO_EXPERIMENT", false) # Where the second argument is the default value
      value1 = local.do_experiment ? "A" : "B"
      value2 = local.do_experiment ? "C" : "D"
      # Because a different default is used here, it's harder to reason about the value of the feature flag
      value3 = feature_flag("DO_EXPERIMENT", true) ? "E" : "F"
    }

Whereas, with a block, it's a lot more explicit:

    feature "do_experiment" {
      default = false
    }

    locals {
      value1 = feature.do_experiment.value ? "A" : "B"
      value2 = feature.do_experiment.value ? "C" : "D"
      # Here we're explicitly negating the value of the feature flag, the default can't vary between uses
      value3 = !feature.do_experiment.value ? "E" : "F"
    }

Having a block also allows for more complex feature flag configurations in the future, like the ability to configure a provider for integration with a feature flag management service or to have validations, etc.
|
I think the env-var and

Note that I'm only looking for guidance here; not first-class features built into TG itself. At least, not at this stage. If this somehow becomes super popular, sure, we can think about native support in plugins or whatever, but for now, I just want to make sure that if we say "TG supports feature flags," that we support its most common use case, which is enabling/disabling features with a click in a UI.
I'm a big +1 on that. I think we'd want to iterate on exactly what the blocks are, but having these as separate entities seems much more powerful, maintainable, understandable, etc.
Your analysis is convincing. The block approach wins, hands-down, for helping with readability, understanding, and static analysis/commands based off feature toggles.
I think making it clear what the behavior will be when a module is skipped (or fails and the failure is ignored) is important. If we use mock outputs or skip or whatever else, we need to make sure it's clear and expected for the user. Maybe even some sort of "use last known in case of skipped or failed dependency" setting, where we use the last known good outputs? Not sure on this, but again, clarity is king here :) |
To address some feedback that has been brought up regarding this RFC:

"Headless" IAC Updates On Feature Toggle

Some folks have asked about the ability to have feature flag updates triggering infrastructure updates, similar to @brikis98's suggestion above. The envisioned behavior would be something like updating a feature flag in feature flag management software, then having an event dispatched to drive an infrastructure update without having to manually run another Terragrunt update (I'm calling that a "headless" IAC update, but it will likely be called something different if implemented). This might be delivered as any of:
While a feature that users would likely appreciate, it is out of scope for this RFC. The primary goal of this RFC is to provide a convenient mechanism for exposing dynamic runtime behavior configuration in Terragrunt, not to provide a way to trigger infrastructure updates based on feature flag changes. This is something we can revisit at a later date after feature flags are released.

Error Handling

Feedback has also been provided that the mechanism here for handling error suppression may be too simplistic. For edge nodes in the DAG, it is likely sufficient behavior that errors can be optionally ignored, and for the status code of the entire

For nodes within the middle or start of the DAG, users may want to handle errors in a more nuanced way. If a node in the middle of the DAG fails, users may want to stop or change the execution of the rest of the DAG. This may be because the failing node is known to be flaky, and that certain errors in its execution can be safely ignored, but that execution of the rest of the DAG should be stopped if that node fails, as there may be no point in continuing.

As such, while keeping the behavior of error suppression the same (i.e. the rest of the DAG will continue executing if the previous node fails), some additional configuration will be proposed as to how

Expose Error Handling Configuration

The proposed

In addition, the proposed

The

e.g.

    # ./parent/terragrunt.hcl
    errors {
      # Errors of type foo are retryable, and should be retried up to 3 times with a 5 second sleep interval
      retry "foo" {
        retryable_errors = [".*Error: foo.*"]
        max_attempts = 3
        sleep_interval_sec = 5
      }
      # Errors of type bar are ignorable, and should be ignored
      ignore "bar" {
        ignorable_errors = [".*Error: bar.*"]
        message = "Ignoring error: bar"
      }
      # Errors of type baz are ignorable, and should suppress the rest of the DAG
      ignore "baz" {
        ignorable_errors = [".*Error: baz.*"]
        message = "Ignoring error: baz"
      }
    }

    # ./child/terragrunt.hcl
    dependency "parent" {
      config_path = "../parent"
    }
    skip {
      # Skip child if any errors are ignored in the parent
      if = dependency.parent.errors.ignored
      # Skip for any `terragrunt` action except `output`. Important, as dependencies will need to extract output from the parent.
      actions = ["all_except_output"]
      # Skip dependencies if errors are ignored of type baz
      skip_dependencies = dependency.parent.errors.ignore.baz.ignored
    }

    # ./grandchild/terragrunt.hcl
    dependency "child" {
      config_path = "../child"
    }
    # Does not run if a `baz` error occurred in `parent`, but will if error `bar` was ignored.

The objective here is to provide a more nuanced way to handle errors within the DAG, but to keep behavior relatively predictable.

Trade-offs

One trade-off in this adjustment is that unexpected behavior may occur in the DAG for grandchildren of a node that has suppressed errors. They will have no configuration that indicates that their parent has suppressed errors, but may be skipped if a grandparent has suppressed errors. This is currently the case when a grandparent fails with an error, but we currently emit an exit code of 1, and throw an error in that scenario.

This approach also requires that child dependencies have explicit error handling of ignored errors in parents, which may be very cumbersome for flaky nodes with many dependants. In the scenario that a flaky node has many dependants, it is likely worth making this trade-off, however, as there may be better context for whether a skip is appropriate within a child than in the parent.

This also complicates the API for the

In addition to complicating the API of the |
A couple of notes/questions after reading this RFC:

I think all introduced blocks should be named; this way we can see which one was triggered, like:

Not sure how cases with multiple "skip" blocks with contradictory flags will be handled, like:

Or such constructions shouldn't be used (if blocks aren't named, only one will be allowed).

Usage of a nested setup like:

will not be helpful if users want to have feature flags in the parent file and different behavior based on feature flags in an unknown number of children.

The part about signals is not quite clear. |
The idea was that you shouldn't have multiple

When skips occur, a warning should be emitted to stderr that points out which

Sorry I wasn't clear about this, but I initially started out the RFC with the configurations for

Ya, the |
Prepared beta release with support of feature flags:

    # terragrunt.hcl
    feature "run_hook" {
      default = false
    }

    terraform {
      before_hook "feature_flag" {
        commands = ["apply", "plan", "destroy"]
        execute = feature.run_hook.value ? ["sh", "-c", "feature_flag_script.sh"] : ["sh", "-c", "exit", "0"]
      }
    }

Passing feature flags:

    terragrunt --feature run_hook=true apply
    terragrunt --feature run_hook=true --feature string_flag=dev apply

https://github.com/gruntwork-io/terragrunt/releases/tag/v0.68.8-beta2024110601 |
Cut a beta release that supports
https://github.com/gruntwork-io/terragrunt/releases/tag/v0.68.15-beta2024111501 |
Released in https://github.com/gruntwork-io/terragrunt/releases/tag/v0.68.16 |
Prepared alpha release with support of errors block: https://github.com/gruntwork-io/terragrunt/releases/tag/v0.69.4-alpha2024120101

    errors {
      # Retry block for transient errors
      retry "retry_network" {
        retryable_errors = [".*Error: network timeout.*"]
        max_attempts = 3
        sleep_interval_sec = 5
      }
      # Ignore block for non-critical errors
      ignore "ignore_warnings" {
        ignorable_errors = [
          ".*Warning: non-critical issue.*"
        ]
        message = "Ignoring non-critical warnings"
      }
    }
|
Published handling of https://github.com/gruntwork-io/terragrunt/releases/tag/v0.69.6 Closing this RFC ticket as all requested features have been successfully implemented and released. |
Summary
Provide first class support for feature flags as part of Terragrunt HCL configuration.
Allow for dynamic configuration of behavior in select `terragrunt.hcl` files based on the presence or absence of feature flags that are set via environment variables and CLI flags.

In addition, update how Terragrunt errors and excludes are handled to ensure that unstable configuration can be handled gracefully.
Motivation
Terragrunt is frequently used in monorepo contexts, and it lends itself to this in how it segments IAC state into separate directories. One definition of monorepos is a single codebase with multiple independent, but related, projects. By this definition, Terragrunt is very much an IAC monorepo tool. Multiple units of IAC are defined independently, and as a whole, they represent a repository of IAC.
Feature flags are a common way to manage the complexity of a monorepo. They allow for the gradual rollout of new features, the ability to turn off features that are not ready for production, and the ability to manage the complexity of a large codebase.
This is especially important in the context of Terragrunt, where infrastructure is most safely updated when updated in small, incremental changes. In addition, the ability to control how failure is handled in IAC is extremely important. Preventing full resolution of an apply across multiple Terragrunt units because a known flaky unit is failing is not always something that can be remediated by the use of retries, and it can be expensive to do so. Occasionally, it is better to ignore the failure of a known flaky unit and continue with the rest of applies, assuming that the failure is not critical to the overall success of the apply.
An example of such a failure would be a dependency chain where one service is deployed by Terragrunt, and has a `url` output where the service can be accessed, and another service which uses a `dependency` block to pass that `url` into the environment variables of a second service.

In this example, if the first service fails to deploy, the second service will also fail to deploy. However, if the first service is known to be flaky, and the second service is not dependent on the first service being deployed successfully, it is better to ignore the failure of the first service and continue with the deployment of the second service, leveraging the `url` output from a previous successful apply.

Reasons that a unit might be marked in this way include:
Proposal
Provide a combination of:
Proposed Syntax
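As a rough sketch only, the three blocks discussed in the comments on this RFC (`feature`, `exclude` and `errors`) might combine as follows. The flag name `example_flag`, the error patterns and the message are hypothetical, and the attribute names follow the syntax used elsewhere in this thread rather than a confirmed final API.

```hcl
# Hypothetical flag; off unless enabled via environment variable or CLI flag.
feature "example_flag" {
  default = false
}

# Exclude this unit from all actions while the flag is off.
exclude {
  if      = !feature.example_flag.value
  actions = ["all"]
}

errors {
  # Retry errors matching an assumed transient pattern.
  retry "transient" {
    retryable_errors   = [".*Error: transient.*"]
    max_attempts       = 3
    sleep_interval_sec = 5
  }

  # Ignore matching errors only while the flag is on.
  ignore "flaky" {
    ignorable_errors = feature.example_flag.value ? [".*Error: flaky.*"] : []
    message          = "Ignoring known flaky error"
  }
}
```

Keeping the flag definition separate from the `exclude` and `errors` blocks means the flag can hold non-boolean values, and the other two blocks can be used without any flag at all.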
Examples
The syntax is intended to be flexible enough to support a couple different use-cases that are common when using feature flags.
Dynamic Module Example
Mark a `terragrunt.hcl` file as having a feature that triggers usage of a new module that is not yet stable. In lower environments, this flag is enabled, and in production, it is disabled.

In addition, if the apply fails, it is safe to revert the apply, and a special error message is logged to the console.
In this contrived example, the "v2" tag of the module is not currently stable, however, to encourage continuous integration, the platform team has decided to merge in configurations that can use it when a flag is enabled. In the dev environment, the feature flag is enabled, and in the production environment, it is disabled.
When an apply fails, as is expected, a special message is emitted to STDERR to indicate that the source of failure is due to a failure in a feature flag.
In addition, on error, a special `error-signals.json` file will be created in the same directory as the `terragrunt.hcl` file with a payload that the platform team knows will be useful to handle the error intelligently. In this scenario, the logic that the team has agreed upon is that if any `terragrunt apply` fails, revert to the last commit and re-run the apply, if a `safe_to_revert` entry is found in the `error-signals.json` for the corresponding `terragrunt.hcl` file that was applied.

The logic here is definitely not what would work for most organizations to achieve a reliable mechanism for reverting a failed apply. It is merely a demonstration of why authors might want signals emitted on failure.
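A minimal sketch of the kind of configuration this example describes, assuming the `feature` and `errors` syntax discussed in the comments on this RFC. The module source `github.com/foo//bar` and the flag name `use_service_module_v2` are borrowed from examples in the comments; placing `signals` on an `ignore` block, and how that payload reaches `error-signals.json`, are assumptions.

```hcl
feature "use_service_module_v2" {
  default = false # Enabled in dev, left off in production.
}

terraform {
  # Use the not-yet-stable "v2" tag only while the flag is on.
  source = "github.com/foo//bar?ref=${feature.use_service_module_v2.value ? "v2" : "v1"}"
}

errors {
  ignore "unstable_v2" {
    # Only match errors while the flag is on.
    ignorable_errors = feature.use_service_module_v2.value ? [".*"] : []
    message          = "Flaky feature failing here!"

    # Key-value payload for downstream tooling to act on.
    signals = {
      safe_to_revert = true
    }
  }
}
```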
Unreliable Module Example
Mark a `terragrunt.hcl` file as being unreliable, and ignore any failures with errors matching `Networking Error` that might occur when applying it.

    $ tree
    .
    ├── reliable
    │   └── terragrunt.hcl
    └── unreliable
        └── terragrunt.hcl

In this example, users are able to mark the `terragrunt.hcl` file in the `unreliable` directory as being unreliable, knowing that it predictably produces an error with the message `Networking Error` that can be safely ignored when re-applied.

The ability to ignore errors in the `unreliable` module is handy here, as the `reliable` module reads a static output from the `unreliable` module that doesn't change much, and uses it as an input.

Examples of modules that can have this kind of relationship include:
The dependent modules can continue to codify their dependency relationships to get access to inputs like the database hostname, which is frequently required to connect to the database, and the cluster ID can be passed to the pod, so that its placement can be targeted to the cluster.
In both scenarios, users might find it convenient to be able to avoid failing to successfully deploy the dependent modules when predictable, intermittent errors occur in the dependency.
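The two files in the tree above could look roughly like this, assuming the `feature` and `errors` syntax discussed in the comments on this RFC. The output name `hostname` and the input name `db_hostname` are hypothetical, chosen to match the database example.

```hcl
# ./unreliable/terragrunt.hcl
feature "unreliable" {
  default = true # Error suppression is on unless explicitly opted out.
}

errors {
  ignore "networking" {
    # Only suppress the known error while the flag is on.
    ignorable_errors = feature.unreliable.value ? [".*Networking Error.*"] : []
    message          = "Ignoring known networking error in unreliable unit"
  }
}

# ./reliable/terragrunt.hcl
dependency "unreliable" {
  config_path = "../unreliable"
}

inputs = {
  # Reads an output that rarely changes between applies.
  db_hostname = dependency.unreliable.outputs.hostname
}
```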
When using feature flags to support this kind of functionality, the feature flag can be opted out of by setting an environment variable like so:

    TG_FLAG_unreliable='false'

This allows for platform teams to safely test removal of ignored failures until the `feature` configuration blocks can be removed (possibly by only disabling the feature in lower environments).

In-progress Module Example
Mark a `terragrunt.hcl` file as being in-progress, excluding all operations on it until a certain feature is complete. The feature can be manually turned on when developing locally, but is off by default.

When developing the module locally, use the following flag to activate the module:
This is a simple way to allow incomplete IaC work to be integrated into a code-base without requiring that the code be fully mature before merging it in.
Rapid, frequent and incremental integration is the standard in Continuous Integration, and this provides a mechanism for achieving that for large IaC code bases.
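A sketch of what such a configuration could look like, assuming the `exclude` block naming adopted in the Edits to this RFC and the `TG_FLAG_<feature name>` environment variable convention. The flag name `new_module` is hypothetical.

```hcl
# Off by default, so the unit is excluded everywhere unless opted in.
feature "new_module" {
  default = false
}

exclude {
  if                   = !feature.new_module.value
  actions              = ["all"]
  exclude_dependencies = true
}

# Local development, using the env var convention proposed in this RFC:
#   TG_FLAG_new_module=true terragrunt plan
```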
In addition, note the `exclude_dependencies` field being used here, which allows for skipping the dependencies of the module as well. This is useful when building out multiple modules that are dependent on each other, and you want to skip the entire chain of dependencies while a module is in-progress.

Technical Details
Some components that will definitely be impacted include:
- `feature` blocks.
- `error_hook`s and `retryable_errors` already alter behavior of a normal Terragrunt execution on failure. This would be another tool that can change how errors are handled in Terragrunt due to the `feature.failure` block.
- `TG_FLAG_<feature name>`.
- `--feature` (or maybe `--terragrunt-feature`).
- `terragrunt command` when the `feature.skip` conditions are met.
- `terragrunt run-all command` when they have `feature.skip` conditions met.

Press Release
First Class Feature Flags
Terragrunt now has built-in support for feature flags, allowing behavior of Terragrunt executions to be altered dynamically at runtime.
Feature flags are a staple of modern DevOps best practices, and using them in Terragrunt will allow you to improve the scalability of your IaC code base.
Use feature flags to support the following, and more:
Feature flags are available as of [RELEASE]. To learn more about how to use them, click [here](link to feature flag documentation).
Drawbacks
Some drawbacks of this proposal include:
- `terragrunt.hcl` file. Users have already been encountering `terragrunt.hcl` files that are too long and difficult to maintain. This added complexity might make `terragrunt.hcl` files even more difficult to reason about.
- `terragrunt.hcl` files might be very difficult to reason about.
- `exclude` logic, during execution of the module if the `enabled` status of the feature is used in controlling behavior, and if `failure` logic is used to handle failure.

Alternatives
1. `get_env`, and adding custom logic to adjust behavior of executions based on the values of the environment variables.
2. `ignored_errors` companion to the `retryable_errors` that just ignores errors instead of retrying them. Customers have been asking for functionality like this to handle both failures that are not intermittent enough to recover from retrying over a short duration, and errors in modules that are computationally or temporally expensive to retry soon after failure.
3. `get_env` and `run_cmd`. Provide nice walkthroughs on how to achieve common feature flag patterns with existing tooling in Terragrunt.

These alternatives, while less expensive than undertaking the introduction of net new functionality in Terragrunt, were considered less beneficial, as first class support for feature flags is generally something that makes a good match for Terragrunt, in my opinion.
Option #2 is also not necessarily mutually exclusive. It might be a good idea to pursue that anyways.
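For comparison, the first alternative can be approximated today with the existing `get_env` function. This is only an illustration: the variable name reuses the `TG_FLAG_` convention from this RFC, and the module source is the placeholder used in the comments.

```hcl
locals {
  # get_env(NAME, DEFAULT) returns a string, so compare explicitly.
  use_v2 = get_env("TG_FLAG_use_service_module_v2", "false") == "true"
}

terraform {
  source = "github.com/foo//bar?ref=${local.use_v2 ? "v2" : "v1"}"
}
```

This works without new functionality, but each unit has to repeat the parsing and default handling itself, and nothing in the configuration declares which flags exist.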
Migration Strategy
None
Unresolved Questions
See the section above about the syntax of feature flags.
I also am not sure how expensive this functionality would be to implement and maintain.
Would the community be interested in this functionality, or would they be more interested in any of the alternatives?
References
Proof of Concept Pull Request
N/A
Edits
- `feature`, `skip`, and `errors`. In addition, the proposal now includes some logic for skipping dependencies.
- `skip` to `exclude`, there is already `skip` attribute in HCL
- `skip_dependencies` to `exclude_dependencies` to match naming convention