This repository has been archived by the owner on Nov 20, 2024. It is now read-only.

Support for VCS backed workspaces #70

Merged
merged 8 commits into master from vcs_workspace on Oct 7, 2020

Conversation

koikonom
Contributor

Add the option to create VCS-backed workspaces using k8s. This can be
achieved by adding a `vcs` section in the workspace CR spec like this:
```
 vcs:
   token_id: "OAuth client ID taken from TFC (or TFE)"
   repo_identifier: "repository identifier (e.g. koikonom/terraform-nullresource-example)"
```

The operator now loops every 60 seconds, and if it detects a run that
took place "out of band" (e.g. a user queued it in the UI, or new code
was pushed to the repository), it updates the outputs in the k8s
resource accordingly.
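
A minimal sketch of how such a periodic check can be expressed with controller-runtime's requeue mechanism (the reconciler and field names below are illustrative, not the exact code in this PR):

```go
// Illustrative sketch: after a successful sync, ask controller-runtime to
// requeue the Workspace after 60 seconds so runs triggered out of band
// (UI, VCS pushes) are eventually picked up.
func (r *ReconcileWorkspace) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	// ... fetch the Workspace CR and sync run status/outputs from TFC ...

	return reconcile.Result{RequeueAfter: 60 * time.Second}, nil
}
```
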
@koikonom koikonom requested a review from a team September 25, 2020 09:47
Contributor

@dak1n1 dak1n1 left a comment

Thanks for putting this together! It's a really neat feature. I like your approach to this in general, and I think it's pretty clever to use the requeue to check periodically. I didn't know about that feature until I read about it yesterday.

I have some ideas for improvement in some areas, as you'll see in the comments below. Aside from those comments, I also might need some help to get this branch running in my local minikube. I think there's an issue in the CRD, or else I'm doing something wrong. But it kept me from being able to deploy and test this.

EDIT: I just got this running by adding `ingress_submodules: false` to my Workspace CR. The failure below is what happens when it is omitted.
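
For reference, a sketch of a `vcs` block with that field added (all values here are illustrative placeholders, not working credentials):

```yaml
vcs:
  token_id: "ot-XXXXXXXXXXXXXXXX"   # OAuth token ID from TFC/TFE (placeholder)
  repo_identifier: "dak1n1/queues"  # <org>/<repo> identifier (placeholder)
  branch: master
  ingress_submodules: false
```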

Once I can deploy it, I plan to run through a full test with a VCS repo. Here's the problem I hit, and what I've done so far:

```
# Set up minikube to run a registry pod inside of the minikube cluster.
# This allows me to build the image using rootless podman v2
# without having to install docker on my local system.
# More info at https://minikube.sigs.k8s.io/docs/handbook/registry/

minikube start --vm-driver=kvm2 --container-runtime=docker --insecure-registry "10.0.0.0/24"
minikube addons enable registry

cd /home/dakini/go/src/github.com/hashicorp/terraform-k8s
docker system prune -a -f
make dev-docker
docker images |grep k8s-dev
docker tag localhost/terraform-k8s-dev:latest $(minikube ip):5000/terraform-k8s-dev:latest
docker push --tls-verify=false $(minikube ip):5000/terraform-k8s-dev:latest

cd /home/dakini/go/src/github.com/hashicorp/terraform-helm

# make sure values.yaml looks like this:
   imagePullPolicy: "Never"
+  imageK8S: "localhost:5000/terraform-k8s-dev:latest"

kubectl create ns preprod
kubectl create -f example/configmap.yaml -n preprod
kubectl create -n preprod secret generic terraformrc --from-file=credentials=/home/dakini/.terraformrc
kubectl create -n preprod secret generic workspacesecrets --from-literal=AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} --from-literal=AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
kubectl create -n preprod secret generic sensitivevars --from-literal=myvar=supersecret

rm -rf ~/.config/helm
rm -rf ~/.cache/helm
rm -rf ~/.helm
rm -rf ~/.local/share/helm

helm install -n preprod operator-test .
kubectl get pods -n preprod

# copy CRD from terraform-k8s repo
cp ../terraform-k8s/deploy/crds/* crds/

# create the workspace
kubectl apply -f  example/workspace_git.yaml  -n preprod
```

Result:

I'm able to build and run the image from this branch; however, helm fails to create the example Workspace CR.

```
$ kubectl apply -f  example/workspace_git.yaml  -n preprod
error: error validating "example/workspace_git.yaml": error validating data: ValidationError(Workspace.spec.vcs): missing required field "ingress_submodules" in io.terraform.app.v1alpha1.Workspace.spec.vcs; if you choose to ignore these errors, turn validation off with --validate=false
```

Here is my workspace CR that I'm trying to deploy:

```
$ cat  example/workspace_git.yaml
---
apiVersion: app.terraform.io/v1alpha1
kind: Workspace
metadata:
  name: salutations
spec:
  vcs:
    token_id: REDACTED GITHUB TOKEN
    branch: master
    repo_identifier: git@github.com:dak1n1/queues.git
  sshKeyID: testkey
  organization: operator-test
  secretsMountPath: "/tmp/secrets"
  module:
    source: "git::https://github.com/dak1n1/queues.git"
  outputs:
    - key: queue_urls
      moduleOutputName: queue_urls
    - key: json_string
      moduleOutputName: json_string
  variables:
    - key: application
      value: goodbye
      sensitive: false
      environmentVariable: false
    - key: environment
      value: preprod
      sensitive: false
      environmentVariable: false
    - key: some_list
      value: '["hello","jello"]'
      hcl: true
      sensitive: false
      environmentVariable: false
    - key: some_object
      value: '{hello={name= "test"}}'
      hcl: true
      sensitive: false
      environmentVariable: false
    - key: AWS_DEFAULT_REGION
      valueFrom:
        configMapKeyRef:
          name: aws-configuration
          key: region
      sensitive: false
      environmentVariable: true
    - key: AWS_ACCESS_KEY_ID
      sensitive: true
      environmentVariable: true
    - key: AWS_SECRET_ACCESS_KEY
      sensitive: true
      environmentVariable: true
    - key: CONFIRM_DESTROY
      value: "1"
      sensitive: false
      environmentVariable: true
```

pkg/controller/workspace/tfc_org.go
```go
retryCnt := 0
foundCV := false
for retryCnt < 2 {
	configVersions, err := t.Client.ConfigurationVersions.List(context.Background(), workspace.Status.WorkspaceID, tfc.ConfigurationVersionListOptions{})
```
Contributor

Hmm, is `context.Background()` the best one to use here? That one is uninterruptible, as far as I'm aware, and I wouldn't want to block the process during this List operation. https://golang.org/pkg/context/#Background

`context.Background()` is usually used by the main function; you would then derive a child context from it and pass that child to functions like this one. (Deriving a child context from `context.Background()` lets you set cancellations and timeouts.)

Alternatively, it's a really common pattern in operator-sdk to use `context.TODO()` everywhere. `context.TODO()` is non-nil and empty, like the one you have there, so that would be easy to swap out.

If you do decide to go with context.Background(), it should probably be created in the controller and then passed into this function from here:

https://github.com/hashicorp/terraform-k8s/blob/master/pkg/controller/workspace/workspace_controller.go#L238
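
A minimal sketch of that pattern, deriving a cancellable child context in the caller and passing it down instead of calling `context.Background()` inside the helper (the timeout value and placement are illustrative):

```go
// Illustrative: the caller derives a child context with a timeout so the
// List call can be cancelled instead of blocking indefinitely.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

configVersions, err := t.Client.ConfigurationVersions.List(ctx, workspace.Status.WorkspaceID, tfc.ConfigurationVersionListOptions{})
if err != nil {
	return err
}
```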

Contributor Author

That was a genuine learning opportunity! I will change it :)

```go
		break
	}
	retryCnt++
	time.Sleep(100 * time.Millisecond)
```
Contributor

Generally when using operator-sdk, we try to avoid sleeps and retries such as this, because operator-sdk already has a retry mechanism with exponential backoff built in. In cases where you need to retry, it's generally sufficient to exit the reconcile loop (`return reconcile.Result{}, err` in the controller). When it starts up again, it will retry, back off, then retry again. This way it's not a blocking operation.

Contributor Author

OK I will do that, thanks!

Contributor Author

@koikonom koikonom Oct 2, 2020

What would you suggest I use between `reconcile.Result{Requeue: true}, nil` and `reconcile.Result{}, err`? Is there a difference?

Contributor

They should both requeue, but the difference is the type of message left in the operator log. It would be most helpful to the user if we passed along the error returned from `t.Client.ConfigurationVersions.List`. That way TFC/TFE might tell them "I couldn't find a ConfigurationVersion for this workspace", and they could investigate why that is. Meanwhile, the operator will keep trying to find it, which gives them a chance to log into TFC and fix the problem. (Disclaimer: I don't know much about TFC and I'm not sure what a ConfigurationVersion is offhand. heh)

I found this piece of documentation helpful yesterday when looking at requeuing options:

- `return reconcile.Result{}, nil`
  The reconcile process finished with no errors and does not require another pass through the reconcile loop.
- `return reconcile.Result{}, err`
  The reconcile failed due to an error and Kubernetes should requeue it to try again.
- `return reconcile.Result{Requeue: true}, nil`
  The reconcile did not encounter an error, but Kubernetes should requeue it to run for another iteration.
- `return reconcile.Result{RequeueAfter: time.Second*5}, nil`
  Similar to the previous result, but this will wait for the specified amount of time before requeuing the request. This approach is useful when there are multiple steps that must run serially, but may take some time to complete. For example, if a backend service needs a running database prior to starting, the reconcile can be requeued with a delay to give the database time to start. Once the database is running, the Operator does not requeue the reconcile request, and the rest of the steps continue.

https://www.redhat.com/cms/managed-files/cl-oreilly-kubernetes-operators-ebook-f21452-202001-en_2.pdf

Contributor Author

The ConfigurationVersion object is not immediately available when using a VCS-backed workspace, which is why I added the retry in the first place.

Based on the description above, I think `RequeueAfter` is the best course of action.
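
For illustration, a sketch of what returning a delayed requeue from the controller could look like (the delay value is an assumption, not the exact diff):

```go
// Illustrative: if the VCS-backed workspace has no ConfigurationVersion yet,
// requeue after a delay instead of sleeping and retrying inline.
if len(configVersions.Items) == 0 {
	r.reqLogger.Info("ConfigurationVersion not available yet, requeuing")
	return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
}
```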

```go
	if !foundCV {
		return fmt.Errorf("No config version found")
	}
} else {
```
Contributor

Following the logic chain here makes me think it would be simpler to split this out into a separate function, so that the `CreateRun` function can focus on CreateRun logic. For example:

```go
// CreateRun runs a new Terraform Cloud configuration
func (t *TerraformCloudClient) CreateRun(workspace *v1alpha1.Workspace, terraform []byte) error {
	configVersion, err := t.checkConfigVersion(workspace, terraform)
	if err != nil {
		return err
	}
	message := fmt.Sprintf("%s, apply", TerraformOperator)
	options := tfc.RunCreateOptions{
		Message:              &message,
		ConfigurationVersion: configVersion,
		Workspace: &tfc.Workspace{
			ID: workspace.Status.WorkspaceID,
		},
	}

	run, err := t.Client.Runs.Create(context.TODO(), options)
	if err != nil {
		return err
	}
	workspace.Status.RunID = run.ID
	workspace.Status.RunStatus = string(run.Status)
	return nil
}

func (t *TerraformCloudClient) checkConfigVersion(workspace *v1alpha1.Workspace, terraform []byte) (*tfc.ConfigurationVersion, error) {
	if workspace.Spec.VCS == nil {
		configVersion, err := t.CreateConfigurationVersion(workspace.Status.WorkspaceID)
		if err != nil {
			return nil, err
		}
		os.Mkdir(moduleDirectory, 0777)
		if err := ioutil.WriteFile(configurationFilePath, terraform, 0777); err != nil {
			return nil, err
		}
		if err := t.UploadConfigurationFile(configVersion.UploadURL); err != nil {
			return nil, err
		}
		return configVersion, nil
	}

	if workspace.Spec.VCS != nil {
		configVersions, err := t.Client.ConfigurationVersions.List(context.TODO(), workspace.Status.WorkspaceID, tfc.ConfigurationVersionListOptions{})
		if err != nil {
			return nil, err
		}
		if len(configVersions.Items) > 0 {
			return configVersions.Items[0], nil
		}
	}
	return nil, fmt.Errorf("No config version found")
}
```

Contributor

The reason I bring this up is that I noticed the CreateRun logic being run for a new ConfigurationVersion but not for existing ConfigurationVersions. That is, `options.ConfigurationVersion` is never set for existing configurations.

Contributor Author

VCS-backed workspaces generate their ConfigurationVersion objects automatically by cloning the linked git repo. If we don't set a ConfigurationVersion, TFC automatically uses the latest one.

When we specify a module, we use a different workspace mode where we have to create the ConfigurationVersion ourselves by generating the necessary tf config and uploading it using the TFC API.

Contributor Author

... that said, you're right. I'll split this function to make it easier to read.

Contributor

Nice, thanks for that explanation. I wasn't sure what a ConfigurationVersion was, but it sounds like it's not something we need to fetch from TFC in the case of using the VCS option. 👍

pkg/controller/workspace/workspace_controller.go
```
}
return true, nil
} else if err != nil {
r.reqLogger.Error(err, "Failed to get Terraform ConfigMap")
return updated, err
return false, err
```
Contributor

Thanks for making this code more clear!

```
}
return true, nil
if found.Data[TerraformConfigMap] == string(template) {
return false, nil
```
Contributor

Nice, this is more clear too!

```diff
@@ -15,7 +15,7 @@ import (
 	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
 )
 
-func configMapForTerraform(name string, namespace string, template []byte) *corev1.ConfigMap {
+func configMapForTerraform(name string, namespace string, template []byte, runID string) *corev1.ConfigMap {
```
Contributor

I see you added a parameter here... did you forget to use it in the function?

Contributor Author

Hmm, yeah, that's a leftover from a previous approach. I will remove it.

pkg/controller/workspace/k8s_configmap.go
@dak1n1
Contributor

dak1n1 commented Oct 2, 2020

Once I added `ingress_submodules: false` to my CR, the Workspace CR was created and the operator started processing it. However, it gets stuck here:

{"level":"error","ts":1601663745.6996603,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"workspace-controller","request":"preprod/salutations","error":"unprocessable entity\n\nOAuth token does not exist","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/home/dakini/go/src/github.com/hashicorp/terraform-k8s/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}

I get the feeling I'm using my token incorrectly. I put a GitHub OAuth token into the `token_id` field. Was that wrong? It looks like GitHub never received the request.
[screenshot: token]

koikonom and others added 2 commits October 2, 2020 21:53
Co-authored-by: Stef Forrester <dak1n1@users.noreply.github.com>
Co-authored-by: Stef Forrester <dak1n1@users.noreply.github.com>
@koikonom
Contributor Author

koikonom commented Oct 2, 2020

BTW, I found it easier to test locally by using this command:

```
operator-sdk-v0.18.0-x86_64-apple-darwin run local --watch-namespace <namespace> --operator-flags "sync-workspace --k8s-watch-namespace=<namespace>"
```

Edit: I just remembered that the operator CLI tool expects `main.go` to be in a specific location, so before running the command above I had to do this:

```
mkdir -p cmd/manager
ln -s <full_path_to_repo>/*.go ./cmd/manager
```

Contributor

@dak1n1 dak1n1 left a comment

This looks good! Thanks for all the work on this feature. I tested it today and it automatically ran the apply when I updated my VCS repo.

pkg/apis/app/v1alpha1/workspace_types.go
@aareet
Contributor

aareet commented Oct 7, 2020

Super exciting!

koikonom and others added 3 commits October 7, 2020 21:41
Sometimes the workspace may not have a Run structure associated with it.
If we attempt to access it without checking whether the LastRun is nil,
we may cause the operator to panic().
Co-authored-by: Stef Forrester <dak1n1@users.noreply.github.com>
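
A minimal sketch of the kind of guard that commit describes (the `CurrentRun` field and surrounding names are assumptions on my part, not necessarily the exact diff):

```go
// Illustrative: skip syncing run status when the workspace has no run yet,
// so a nil dereference cannot panic the operator.
if ws.CurrentRun == nil {
	// No run has been queued for this workspace yet; nothing to sync.
	return nil
}
workspace.Status.RunID = ws.CurrentRun.ID
workspace.Status.RunStatus = string(ws.CurrentRun.Status)
```
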
@koikonom koikonom merged commit 62b2e90 into master Oct 7, 2020
@koikonom koikonom deleted the vcs_workspace branch October 7, 2020 20:56
@koikonom
Contributor Author

Fixes #59
