Add the option to create VCS-backed workspaces using k8s. This can be achieved by adding a `vcs` section in the workspace CR spec like this:

```yaml
vcs:
  token_id: "oauth client id taken from TFC (or TFE)"
  repo_identifier: "repository identifier (e.g. koikonom/terraform-nullresource-example)"
```

The operator now loops every 60 seconds, and if it detects a run that took place "out of band" (i.e. a user queued it in the UI, or new code was pushed to the repository), it updates the outputs in the k8s resource accordingly.
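For context, the periodic check piggybacks on controller-runtime's requeue mechanism. A rough sketch of the idea (not the literal controller code in this PR; `ReconcileWorkspace` and `syncOutputsFromLatestRun` are illustrative names, and the usual controller-runtime imports are assumed):

```go
// Rough sketch of the periodic "out of band" run detection. The reconciler
// requeues itself every 60 seconds; names below are illustrative only.
func (r *ReconcileWorkspace) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	// ... fetch the Workspace CR and query TFC/TFE for its latest run ...

	// If a run happened outside the operator (queued in the UI or triggered
	// by a VCS push), refresh the outputs stored on the Workspace resource.
	if err := r.syncOutputsFromLatestRun(request); err != nil {
		return reconcile.Result{}, err
	}

	// Come back in 60 seconds to check again.
	return reconcile.Result{RequeueAfter: 60 * time.Second}, nil
}
```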
Thanks for putting this together! It's a really neat feature. I like your approach to this in general, and I think it's pretty clever to use the requeue to check periodically. I didn't know about that feature until I read about it yesterday.
I have some ideas for improvement in some areas, as you'll see in the comments below. Aside from those comments, I also might need some help to get this branch running in my local minikube. I think there's an issue in the CRD, or else I'm doing something wrong. But it kept me from being able to deploy and test this.
EDIT: I just got this running by adding ingress_submodules: false in my Workspace CR. The failure below is what happens if that is omitted.
Once I can deploy it, I plan to run through a full test with a VCS repo. Here's the problem I hit, and what I've done so far:
# Set up minikube to run a registry pod inside of the minikube cluster.
# This allows me to build the image using rootless podman v2
# without having to install docker on my local system.
# More info at https://minikube.sigs.k8s.io/docs/handbook/registry/
minikube start --vm-driver=kvm2 --container-runtime=docker --insecure-registry "10.0.0.0/24"
minikube addons enable registry
cd /home/dakini/go/src/github.com/hashicorp/terraform-k8s
docker system prune -a -f
make dev-docker
docker images |grep k8s-dev
docker tag localhost/terraform-k8s-dev:latest $(minikube ip):5000/terraform-k8s-dev:latest
docker push --tls-verify=false $(minikube ip):5000/terraform-k8s-dev:latest
cd /home/dakini/go/src/github.com/hashicorp/terraform-helm
# make sure values.yaml looks like this:
imagePullPolicy: "Never"
imageK8S: "localhost:5000/terraform-k8s-dev:latest"
kubectl create ns preprod
kubectl create -f example/configmap.yaml -n preprod
kubectl create -n preprod secret generic terraformrc --from-file=credentials=/home/dakini/.terraformrc
kubectl create -n preprod secret generic workspacesecrets --from-literal=AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} --from-literal=AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
kubectl create -n preprod secret generic sensitivevars --from-literal=myvar=supersecret
rm -rf ~/.config/helm
rm -rf ~/.cache/helm
rm -rf ~/.helm
rm -rf ~/.local/share/helm
helm install -n preprod operator-test .
kubectl get pods -n preprod
# copy CRD from terraform-k8s repo
cp ../terraform-k8s/deploy/crds/* crds/
# create the workspace
kubectl apply -f example/workspace_git.yaml -n preprod
Result:
I'm able to build and run the image from this branch; however, creating the example Workspace CR fails validation.
$ kubectl apply -f example/workspace_git.yaml -n preprod
error: error validating "example/workspace_git.yaml": error validating data: ValidationError(Workspace.spec.vcs): missing required field "ingress_submodules" in io.terraform.app.v1alpha1.Workspace.spec.vcs; if you choose to ignore these errors, turn validation off with --validate=false
Here is my workspace CR that I'm trying to deploy:
$ cat example/workspace_git.yaml
---
apiVersion: app.terraform.io/v1alpha1
kind: Workspace
metadata:
  name: salutations
spec:
  vcs:
    token_id: REDACTED GITHUB TOKEN
    branch: master
    repo_identifier: git@github.com:dak1n1/queues.git
  sshKeyID: testkey
  organization: operator-test
  secretsMountPath: "/tmp/secrets"
  module:
    source: "git::https://github.com/dak1n1/queues.git"
  outputs:
    - key: queue_urls
      moduleOutputName: queue_urls
    - key: json_string
      moduleOutputName: json_string
  variables:
    - key: application
      value: goodbye
      sensitive: false
      environmentVariable: false
    - key: environment
      value: preprod
      sensitive: false
      environmentVariable: false
    - key: some_list
      value: '["hello","jello"]'
      hcl: true
      sensitive: false
      environmentVariable: false
    - key: some_object
      value: '{hello={name= "test"}}'
      hcl: true
      sensitive: false
      environmentVariable: false
    - key: AWS_DEFAULT_REGION
      valueFrom:
        configMapKeyRef:
          name: aws-configuration
          key: region
      sensitive: false
      environmentVariable: true
    - key: AWS_ACCESS_KEY_ID
      sensitive: true
      environmentVariable: true
    - key: AWS_SECRET_ACCESS_KEY
      sensitive: true
      environmentVariable: true
    - key: CONFIRM_DESTROY
      value: "1"
      sensitive: false
      environmentVariable: true
pkg/controller/workspace/tfc_run.go
retryCnt := 0
foundCV := false
for retryCnt < 2 {
	configVersions, err := t.Client.ConfigurationVersions.List(context.Background(), workspace.Status.WorkspaceID, tfc.ConfigurationVersionListOptions{})
hmm, is context.Background() the best one to use here? That one is uninterruptible, as far as I'm aware. I wouldn't want to block the process during this List operation. https://golang.org/pkg/context/#Background
context.Background() is usually used by the main function, and then you would derive a child context from it and pass that child to functions like this one. (Deriving a child context from context.Background() lets you set cancellations and timeouts.)
Alternatively, it's a really common pattern in operator-sdk to use context.TODO() everywhere. context.TODO() is non-nil and empty, like the one you have there. So that would be easy to swap out.
If you do decide to go with context.Background(), it should probably be created in the controller and then passed into this function from here:
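Something like this is what I had in mind, purely as a sketch (parentCtx is a hypothetical argument threaded in from the controller; the 30-second timeout is arbitrary, and the call mirrors the List invocation above):

```go
// Sketch: the controller derives a cancellable child context and passes it in,
// rather than this function creating context.Background() itself.
// parentCtx is a hypothetical parameter coming from the reconcile loop.
ctx, cancel := context.WithTimeout(parentCtx, 30*time.Second)
defer cancel()

configVersions, err := t.Client.ConfigurationVersions.List(
	ctx,
	workspace.Status.WorkspaceID,
	tfc.ConfigurationVersionListOptions{},
)
if err != nil {
	return err
}
```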
That was a genuine learning opportunity! I will change it :)
		break
	}
	retryCnt++
	time.Sleep(100 * time.Millisecond)
Generally when using operator-sdk, we try to avoid sleeps and retries such as this, because operator-sdk already has a retry mechanism with exponential backoff built in. In cases where you need to retry, it's generally sufficient to exit the reconcile loop (return reconcile.Result{}, err in the controller). When it starts up again, it will retry, then back off, then retry. This way it's not a blocking operation.
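Roughly, the retry loop could collapse down to something like this (a sketch only, reusing the names already in this file; ctx stands in for whatever context gets passed down):

```go
// Sketch: no in-process sleep/retry; just return the error and let
// controller-runtime requeue with exponential backoff.
configVersions, err := t.Client.ConfigurationVersions.List(ctx, workspace.Status.WorkspaceID, tfc.ConfigurationVersionListOptions{})
if err != nil {
	return err
}
if len(configVersions.Items) == 0 {
	return fmt.Errorf("no configuration version found yet")
}
```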
OK I will do that, thanks!
What would you suggest I use between reconcile.Result{Requeue: true}, nil and reconcile.Result{}, err? Is there a difference?
They should both requeue, but the difference is the type of message left in the operator log. It could be most helpful to the user if we passed them the error returned from t.Client.ConfigurationVersions.List. That way TFC/TFE might tell them "I couldn't find a ConfigurationVersion for this workspace", and then they could investigate why that is. Meanwhile, the operator will keep trying to find it, which gives them a chance to log into TFC and fix the problem. (Disclaimer: I don't know much about TFC and I'm not sure what a ConfigurationVersion is offhand. heh)
I found this piece of documentation helpful yesterday when looking at requeuing options:
return reconcile.Result{}, nil
  The reconcile process finished with no errors and does not require another pass through the reconcile loop.

return reconcile.Result{}, err
  The reconcile failed due to an error and Kubernetes should requeue it to try again.

return reconcile.Result{Requeue: true}, nil
  The reconcile did not encounter an error, but Kubernetes should requeue it to run for another iteration.

return reconcile.Result{RequeueAfter: time.Second*5}, nil
  Similar to the previous result, but this will wait for the specified amount of time before requeuing the request. This approach is useful when there are multiple steps that must run serially, but may take some time to complete. For example, if a backend service needs a running database prior to starting, the reconcile can be requeued with a delay to give the database time to start. Once the database is running, the Operator does not requeue the reconcile request, and the rest of the steps continue.
The ConfigurationVersion object is not immediately available when using a VCS backed workspace, which is why I put the retry in the first place.
Based on the description above I think RequeueAfter is the best course of action.
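Something like this on the controller side is what I have in mind (a sketch; ErrNoConfigVersionYet is a hypothetical sentinel error, and the receiver/field names are approximate):

```go
// Sketch: a VCS-backed workspace may not have produced a ConfigurationVersion
// yet, so treat that case as "try again later" rather than a hard failure.
if err := r.tfclient.CreateRun(workspace, terraform); err != nil {
	if errors.Is(err, ErrNoConfigVersionYet) { // hypothetical sentinel error
		return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return reconcile.Result{}, err
}
```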
	if !foundCV {
		return fmt.Errorf("No config version found")
	}
} else {
Following the logic chain here makes me think it would be simpler to separate these out into a separate function. That way the CreateRun function can focus on CreateRun logic. For example:
// CreateRun runs a new Terraform Cloud configuration
func (t *TerraformCloudClient) CreateRun(workspace *v1alpha1.Workspace, terraform []byte) error {
	configVersion, err := t.checkConfigVersion(workspace, terraform)
	if err != nil {
		return err
	}
	message := fmt.Sprintf("%s, apply", TerraformOperator)
	options := tfc.RunCreateOptions{
		Message:              &message,
		ConfigurationVersion: configVersion,
		Workspace: &tfc.Workspace{
			ID: workspace.Status.WorkspaceID,
		},
	}
	run, err := t.Client.Runs.Create(context.TODO(), options)
	if err != nil {
		return err
	}
	workspace.Status.RunID = run.ID
	workspace.Status.RunStatus = string(run.Status)
	return nil
}

func (t *TerraformCloudClient) checkConfigVersion(workspace *v1alpha1.Workspace, terraform []byte) (*tfc.ConfigurationVersion, error) {
	if workspace.Spec.VCS == nil {
		configVersion, err := t.CreateConfigurationVersion(workspace.Status.WorkspaceID)
		if err != nil {
			return nil, err
		}
		os.Mkdir(moduleDirectory, 0777)
		if err := ioutil.WriteFile(configurationFilePath, terraform, 0777); err != nil {
			return nil, err
		}
		if err := t.UploadConfigurationFile(configVersion.UploadURL); err != nil {
			return nil, err
		}
		return configVersion, nil
	}
	if workspace.Spec.VCS != nil {
		configVersions, err := t.Client.ConfigurationVersions.List(context.TODO(), workspace.Status.WorkspaceID, tfc.ConfigurationVersionListOptions{})
		if err != nil {
			return nil, err
		}
		if len(configVersions.Items) > 0 {
			return configVersions.Items[0], nil
		}
	}
	return nil, fmt.Errorf("No config version found")
}
The reason I bring this up is because I noticed CreateRun logic being run for new ConfigurationVersion but not for existing ConfigurationVersions. That is, the options.ConfigurationVersion is never set for existing configurations.
VCS-backed workspaces generate their ConfigurationVersion objects automatically by cloning the linked git repo. By not setting a ConfigurationVersion, TFC will automatically use the latest one.
When we specify a module, we use a different workspace mode where we have to create the ConfigurationVersion ourselves by generating the necessary tf config and uploading it using the TFC API.
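Concretely, the two modes end up with slightly different run options, roughly like this (a sketch, mirroring the refactor suggested above):

```go
// Sketch: only module-mode workspaces set ConfigurationVersion explicitly.
// VCS-backed workspaces leave it unset, so TFC uses the latest version
// produced from the linked repository.
options := tfc.RunCreateOptions{
	Message:   &message,
	Workspace: &tfc.Workspace{ID: workspace.Status.WorkspaceID},
}
if workspace.Spec.VCS == nil {
	options.ConfigurationVersion = configVersion
}
run, err := t.Client.Runs.Create(context.TODO(), options)
if err != nil {
	return err
}
```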
... that said, you're right. I'll split this function to make it easier to read.
Nice, thanks for that explanation. I wasn't sure what a ConfigurationVersion was, but it sounds like it's not something we need to fetch from TFC in the case of using the VCS option. 👍
	}
	return true, nil
} else if err != nil {
	r.reqLogger.Error(err, "Failed to get Terraform ConfigMap")
	return updated, err
	return false, err
Thanks for making this code more clear!
}
return true, nil
if found.Data[TerraformConfigMap] == string(template) {
	return false, nil
Nice, this is more clear too!
@@ -15,7 +15,7 @@ import (
 	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
 )

-func configMapForTerraform(name string, namespace string, template []byte) *corev1.ConfigMap {
+func configMapForTerraform(name string, namespace string, template []byte, runID string) *corev1.ConfigMap {
I see you added a parameter here... did you forget to use it in the function?
Hmm yeah leftover from a previous approach. I will remove it.
Co-authored-by: Stef Forrester <dak1n1@users.noreply.github.com>
BTW I found it easier to test locally by using this command:
Edit: just remembered that the operator CLI tool expects main.go to be in a specific location so before I run the command above I had to do this:
Also added some missing fields to the CRD, minor post-review cleanup and linter fixes.
This looks good! Thanks for all the work on this feature. I tested it today and it automatically ran the apply when I updated my VCS repo.
Super exciting!
Sometimes the workspace may not have a Run structure associated with it. If we attempt to access it without first checking whether LastRun is nil, we may cause the operator to panic().
Co-authored-by: Stef Forrester <dak1n1@users.noreply.github.com>
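(A rough sketch of the guard described in the commit message above; the variable and field names here are approximate, not the operator's exact code:)

```go
// Sketch of the nil guard described above; field names are approximate.
if ws.CurrentRun == nil {
	// No run exists yet (e.g. a freshly created VCS-backed workspace),
	// so there are no outputs to sync; check again on the next pass.
	return nil
}
```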
Fixes #59