This repository has been archived by the owner on Nov 20, 2024. It is now read-only.

Support for VCS backed workspaces #70

Merged
merged 8 commits into master from vcs_workspace on Oct 7, 2020

Conversation

koikonom
Contributor

Add the option to create VCS-backed workspaces using k8s. This can be
achieved by adding a `vcs` section in the workspace CR spec like this:
```
 vcs:
   token_id: "OAuth client ID taken from TFC (or TFE)"
   repo_identifier: "repository identifier (e.g. koikonom/terraform-nullresource-example)"
```

The operator now loops every 60 seconds, and if it detects a run that
took place "out of band" (e.g. a user queued it in the UI, or new code
was pushed to the repository), it updates the outputs in the k8s
resource accordingly.
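
A minimal sketch of how such a periodic check can be expressed with controller-runtime's requeue mechanism (the reconciler and field names below are illustrative, not the exact code in this PR):

```go
// Illustrative sketch: after a successful sync, ask controller-runtime to
// requeue the Workspace after 60 seconds so runs triggered out of band
// (UI, VCS pushes) are eventually picked up.
func (r *ReconcileWorkspace) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	// ... fetch the Workspace CR and sync run status/outputs from TFC ...

	return reconcile.Result{RequeueAfter: 60 * time.Second}, nil
}
```
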
@koikonom koikonom requested a review from a team September 25, 2020 09:47
Contributor

@dak1n1 dak1n1 left a comment

Thanks for putting this together! It's a really neat feature. I like your approach to this in general, and I think it's pretty clever to use the requeue to check periodically. I didn't know about that feature until I read about it yesterday.

I have some ideas for improvement in some areas, as you'll see in the comments below. Aside from those comments, I also might need some help to get this branch running in my local minikube. I think there's an issue in the CRD, or else I'm doing something wrong. But it kept me from being able to deploy and test this.

EDIT: I just got this running by adding `ingress_submodules: false` to my Workspace CR. The failure below is what happens when it is omitted.
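
For reference, a sketch of a `vcs` block with that field added (all values here are illustrative placeholders, not working credentials):

```yaml
vcs:
  token_id: "ot-XXXXXXXXXXXXXXXX"   # OAuth token ID from TFC/TFE (placeholder)
  repo_identifier: "dak1n1/queues"  # <org>/<repo> identifier (placeholder)
  branch: master
  ingress_submodules: false
```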

Once I can deploy it, I plan to run through a full test with a VCS repo. Here's the problem I hit, and what I've done so far:

```
# Set up minikube to run a registry pod inside of the minikube cluster.
# This allows me to build the image using rootless podman v2
# without having to install docker on my local system.
# More info at https://minikube.sigs.k8s.io/docs/handbook/registry/

minikube start --vm-driver=kvm2 --container-runtime=docker --insecure-registry "10.0.0.0/24"
minikube addons enable registry

cd /home/dakini/go/src/github.com/hashicorp/terraform-k8s
docker system prune -a -f
make dev-docker
docker images |grep k8s-dev
docker tag localhost/terraform-k8s-dev:latest $(minikube ip):5000/terraform-k8s-dev:latest
docker push --tls-verify=false $(minikube ip):5000/terraform-k8s-dev:latest

cd /home/dakini/go/src/github.com/hashicorp/terraform-helm

# make sure values.yaml looks like this:
   imagePullPolicy: "Never"
+  imageK8S: "localhost:5000/terraform-k8s-dev:latest"

kubectl create ns preprod
kubectl create -f example/configmap.yaml -n preprod
kubectl create -n preprod secret generic terraformrc --from-file=credentials=/home/dakini/.terraformrc
kubectl create -n preprod secret generic workspacesecrets --from-literal=AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} --from-literal=AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
kubectl create -n preprod secret generic sensitivevars --from-literal=myvar=supersecret

rm -rf ~/.config/helm
rm -rf ~/.cache/helm
rm -rf ~/.helm
rm -rf ~/.local/share/helm

helm install -n preprod operator-test .
kubectl get pods -n preprod

# copy CRD from terraform-k8s repo
cp ../terraform-k8s/deploy/crds/* crds/

# create the workspace
kubectl apply -f  example/workspace_git.yaml  -n preprod
```

Result:

I'm able to build and run the image from this branch; however, helm fails to create the example Workspace CR.

```
$ kubectl apply -f  example/workspace_git.yaml  -n preprod
error: error validating "example/workspace_git.yaml": error validating data: ValidationError(Workspace.spec.vcs): missing required field "ingress_submodules" in io.terraform.app.v1alpha1.Workspace.spec.vcs; if you choose to ignore these errors, turn validation off with --validate=false
```

Here is my workspace CR that I'm trying to deploy:

```
$ cat  example/workspace_git.yaml
---
apiVersion: app.terraform.io/v1alpha1
kind: Workspace
metadata:
  name: salutations
spec:
  vcs:
    token_id: REDACTED GITHUB TOKEN
    branch: master
    repo_identifier: git@github.com:dak1n1/queues.git
  sshKeyID: testkey
  organization: operator-test
  secretsMountPath: "/tmp/secrets"
  module:
    source: "git::https://github.com/dak1n1/queues.git"
  outputs:
    - key: queue_urls
      moduleOutputName: queue_urls
    - key: json_string
      moduleOutputName: json_string
  variables:
    - key: application
      value: goodbye
      sensitive: false
      environmentVariable: false
    - key: environment
      value: preprod
      sensitive: false
      environmentVariable: false
    - key: some_list
      value: '["hello","jello"]'
      hcl: true
      sensitive: false
      environmentVariable: false
    - key: some_object
      value: '{hello={name= "test"}}'
      hcl: true
      sensitive: false
      environmentVariable: false
    - key: AWS_DEFAULT_REGION
      valueFrom:
        configMapKeyRef:
          name: aws-configuration
          key: region
      sensitive: false
      environmentVariable: true
    - key: AWS_ACCESS_KEY_ID
      sensitive: true
      environmentVariable: true
    - key: AWS_SECRET_ACCESS_KEY
      sensitive: true
      environmentVariable: true
    - key: CONFIRM_DESTROY
      value: "1"
      sensitive: false
      environmentVariable: true
```

pkg/controller/workspace/tfc_org.go
```go
retryCnt := 0
foundCV := false
for retryCnt < 2 {
	configVersions, err := t.Client.ConfigurationVersions.List(context.Background(), workspace.Status.WorkspaceID, tfc.ConfigurationVersionListOptions{})
```
Contributor

Hmm, is `context.Background()` the best one to use here? That one is uninterruptible, as far as I'm aware, and I wouldn't want to block the process during this List operation. https://golang.org/pkg/context/#Background

`context.Background()` is usually used by the main function; you would then derive a child context from it and pass that child to functions like this one. (Deriving a child context from `context.Background()` lets you set cancellations and timeouts.)

Alternatively, it's a really common pattern in operator-sdk to use `context.TODO()` everywhere. `context.TODO()` is non-nil and empty, like the one you have there, so that would be easy to swap out.

If you do decide to go with context.Background(), it should probably be created in the controller and then passed into this function from here:

https://github.com/hashicorp/terraform-k8s/blob/master/pkg/controller/workspace/workspace_controller.go#L238
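
A minimal sketch of that pattern, deriving a cancellable child context in the caller and passing it down instead of calling `context.Background()` inside the helper (the timeout value and placement are illustrative):

```go
// Illustrative: the caller derives a child context with a timeout so the
// List call can be cancelled instead of blocking indefinitely.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

configVersions, err := t.Client.ConfigurationVersions.List(ctx, workspace.Status.WorkspaceID, tfc.ConfigurationVersionListOptions{})
if err != nil {
	return err
}
```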

Contributor Author

That was a genuine learning opportunity! I will change it :)

```go
		break
	}
	retryCnt++
	time.Sleep(100 * time.Millisecond)
```
Contributor

Generally when using operator-sdk, we try to avoid sleeps and retries such as this, because operator-sdk already has a retry mechanism with exponential backoff built in. In cases where you need to retry, it's generally sufficient to exit the reconcile loop (`return reconcile.Result{}, err` in the controller). When it starts up again, it will retry, back off, then retry again. This way it's not a blocking operation.

Contributor Author

OK I will do that, thanks!

Contributor Author

@koikonom koikonom Oct 2, 2020

What would you suggest I use between `reconcile.Result{Requeue: true}, nil` and `reconcile.Result{}, err`? Is there a difference?

Contributor

They should both requeue, but the difference is the type of message left in the operator log. It would be most helpful to the user if we passed along the error returned from `t.Client.ConfigurationVersions.List`. That way TFC/TFE might tell them "I couldn't find a ConfigurationVersion for this workspace", and they could investigate why that is. Meanwhile, the operator will keep trying to find it, which gives them a chance to log into TFC and fix the problem. (Disclaimer: I don't know much about TFC and I'm not sure what a ConfigurationVersion is offhand. heh)

I found this piece of documentation helpful yesterday when looking at requeuing options:

- `return reconcile.Result{}, nil`
  The reconcile process finished with no errors and does not require another pass through the reconcile loop.
- `return reconcile.Result{}, err`
  The reconcile failed due to an error and Kubernetes should requeue it to try again.
- `return reconcile.Result{Requeue: true}, nil`
  The reconcile did not encounter an error, but Kubernetes should requeue it to run for another iteration.
- `return reconcile.Result{RequeueAfter: time.Second*5}, nil`
  Similar to the previous result, but this will wait for the specified amount of time before requeuing the request. This approach is useful when there are multiple steps that must run serially, but may take some time to complete. For example, if a backend service needs a running database prior to starting, the reconcile can be requeued with a delay to give the database time to start. Once the database is running, the Operator does not requeue the reconcile request, and the rest of the steps continue.

https://www.redhat.com/cms/managed-files/cl-oreilly-kubernetes-operators-ebook-f21452-202001-en_2.pdf

Contributor Author

The ConfigurationVersion object is not immediately available when using a VCS-backed workspace, which is why I added the retry in the first place.

Based on the description above, I think `RequeueAfter` is the best course of action.
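
For illustration, a sketch of what returning a delayed requeue from the controller could look like (the delay value is an assumption, not the exact diff):

```go
// Illustrative: if the VCS-backed workspace has no ConfigurationVersion yet,
// requeue after a delay instead of sleeping and retrying inline.
if len(configVersions.Items) == 0 {
	r.reqLogger.Info("ConfigurationVersion not available yet, requeuing")
	return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
}
```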

```go
	if !foundCV {
		return fmt.Errorf("No config version found")
	}
} else {
```
Contributor

Following the logic chain here makes me think it would be simpler to split this out into a separate function, so that the `CreateRun` function can focus on CreateRun logic. For example:

```go
// CreateRun runs a new Terraform Cloud configuration
func (t *TerraformCloudClient) CreateRun(workspace *v1alpha1.Workspace, terraform []byte) error {
	configVersion, err := t.checkConfigVersion(workspace, terraform)
	if err != nil {
		return err
	}
	message := fmt.Sprintf("%s, apply", TerraformOperator)
	options := tfc.RunCreateOptions{
		Message:              &message,
		ConfigurationVersion: configVersion,
		Workspace: &tfc.Workspace{
			ID: workspace.Status.WorkspaceID,
		},
	}

	run, err := t.Client.Runs.Create(context.TODO(), options)
	if err != nil {
		return err
	}
	workspace.Status.RunID = run.ID
	workspace.Status.RunStatus = string(run.Status)
	return nil
}

func (t *TerraformCloudClient) checkConfigVersion(workspace *v1alpha1.Workspace, terraform []byte) (*tfc.ConfigurationVersion, error) {
	if workspace.Spec.VCS == nil {
		configVersion, err := t.CreateConfigurationVersion(workspace.Status.WorkspaceID)
		if err != nil {
			return nil, err
		}
		os.Mkdir(moduleDirectory, 0777)
		if err := ioutil.WriteFile(configurationFilePath, terraform, 0777); err != nil {
			return nil, err
		}
		if err := t.UploadConfigurationFile(configVersion.UploadURL); err != nil {
			return nil, err
		}
		return configVersion, nil
	}

	if workspace.Spec.VCS != nil {
		configVersions, err := t.Client.ConfigurationVersions.List(context.TODO(), workspace.Status.WorkspaceID, tfc.ConfigurationVersionListOptions{})
		if err != nil {
			return nil, err
		}
		if len(configVersions.Items) > 0 {
			return configVersions.Items[0], nil
		}
	}
	return nil, fmt.Errorf("No config version found")
}
```

Contributor

The reason I bring this up is that I noticed the CreateRun logic being run for a new ConfigurationVersion but not for existing ConfigurationVersions. That is, `options.ConfigurationVersion` is never set for existing configurations.

Contributor Author

VCS-backed workspaces generate their ConfigurationVersion objects automatically by cloning the linked git repo. If we don't set a ConfigurationVersion, TFC automatically uses the latest one.

When we specify a module, we use a different workspace mode where we have to create the ConfigurationVersion ourselves by generating the necessary tf config and uploading it using the TFC API.

Contributor Author

... that said, you're right. I'll split this function to make it easier to read.

Contributor

Nice, thanks for that explanation. I wasn't sure what a ConfigurationVersion was, but it sounds like it's not something we need to fetch from TFC in the case of using the VCS option. 👍

pkg/controller/workspace/workspace_controller.go
```
}
return true, nil
} else if err != nil {
r.reqLogger.Error(err, "Failed to get Terraform ConfigMap")
return updated, err
return false, err
```
Contributor

Thanks for making this code more clear!

```
}
return true, nil
if found.Data[TerraformConfigMap] == string(template) {
return false, nil
```
Contributor

Nice, this is more clear too!

```diff
@@ -15,7 +15,7 @@ import (
 	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
 )
 
-func configMapForTerraform(name string, namespace string, template []byte) *corev1.ConfigMap {
+func configMapForTerraform(name string, namespace string, template []byte, runID string) *corev1.ConfigMap {
```
Contributor

I see you added a parameter here... did you forget to use it in the function?

Contributor Author

Hmm, yeah, that's a leftover from a previous approach. I will remove it.

pkg/controller/workspace/k8s_configmap.go
@dak1n1
Contributor

dak1n1 commented Oct 2, 2020

Once I added `ingress_submodules: false` to my CR, the Workspace CR was created and the operator started processing it. However, it gets stuck here:

{"level":"error","ts":1601663745.6996603,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"workspace-controller","request":"preprod/salutations","error":"unprocessable entity\n\nOAuth token does not exist","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}

I get the feeling I'm using my token incorrectly. I put a GitHub OAuth token into the `token_id` field. Was that wrong? It looks like GitHub never received the request.
[screenshot: token]

koikonom and others added 2 commits October 2, 2020 21:53
Co-authored-by: Stef Forrester <dak1n1@users.noreply.github.com>
Co-authored-by: Stef Forrester <dak1n1@users.noreply.github.com>
@koikonom
Contributor Author

koikonom commented Oct 2, 2020

BTW, I found it easier to test locally by using this command:

```
operator-sdk-v0.18.0-x86_64-apple-darwin run local --watch-namespace <namespace> --operator-flags "sync-workspace --k8s-watch-namespace=<namespace>"
```

Edit: I just remembered that the operator CLI tool expects `main.go` to be in a specific location, so before running the command above I had to do this:

```
mkdir -p cmd/manager
ln -s <full_path_to_repo>/*.go ./cmd/manager
```

Contributor

@dak1n1 dak1n1 left a comment

This looks good! Thanks for all the work on this feature. I tested it today and it automatically ran the apply when I updated my VCS repo.

pkg/apis/app/v1alpha1/workspace_types.go
@aareet
Contributor

aareet commented Oct 7, 2020

Super exciting!

koikonom and others added 3 commits October 7, 2020 21:41
Sometimes the workspace may not have a Run structure associated with it.
If we attempt to access it without checking whether the LastRun is nil,
we may cause the operator to panic().
Co-authored-by: Stef Forrester <dak1n1@users.noreply.github.com>
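
A minimal sketch of the kind of guard that commit describes (the `CurrentRun` field and surrounding names are assumptions on my part, not necessarily the exact diff):

```go
// Illustrative: skip syncing run status when the workspace has no run yet,
// so a nil dereference cannot panic the operator.
if ws.CurrentRun == nil {
	// No run has been queued for this workspace yet; nothing to sync.
	return nil
}
workspace.Status.RunID = ws.CurrentRun.ID
workspace.Status.RunStatus = string(ws.CurrentRun.Status)
```
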
@koikonom koikonom merged commit 62b2e90 into master Oct 7, 2020
@koikonom koikonom deleted the vcs_workspace branch October 7, 2020 20:56
@koikonom
Contributor Author

Fixes #59
