Documentation updates

mumoshu · Nov 9, 2021 · b8330d7
1 parent 8d028fd
commit b8330d7

Showing 3 changed files with 108 additions and 40 deletions.
diff --git a/README.md b/README.md
@@ -16,11 +16,10 @@ If you've been using ephemeral Kubernetes clusters and employed blue-green or ca
 
 In a standard scenario, a system update with `okra` would like the below.
 
-- You provision one or more new clusters with cluster tags like `name=web-1-v2, role=web, version=v2`
-- An external system like ArgoCD with ApplicationSet deploys your apps to the new clusters
-- Okra's `cell-controller` starts discovering the new clusters by tags like `role=web`
-- Once there are enough clusters (with e.g. `role=web`) for the latest version tag like `version=v1`, `cell-controller` starts updating the loadbalancer configuration to gradually migrate traffic from the old to the new clusters.
-- `Okra` run various steps to ensure there are no errors, and it reverts the loadbalancer configuration changes when there are too many errors or test failures.
+- **You** provision one or more new clusters with cluster tags like `name=web-1-v2, role=web, version=v2`
+- **Okra** auto-imports the clusters into **ArgoCD**
+- **ArgoCD ApplicationSet** deploys your apps onto the new clusters
+- **Okra** updates the loadbalancer configuration to gradually migrate traffic to the new clusters, while running various checks to ensure application availability
 
 ## Project Status and Scope
 

diff --git a/cli.md b/cli.md
@@ -4,8 +4,11 @@
 
 It is currently used to run and test various operations used by various okra controllers, providing following commands.
 
-- [create argocdclustersecret](#create-argocdclustersecret)
+- [create cluster](#create-cluster)
 - [create awstargetgroup](#create-awstargetgroup)
+- [list targetgroupbindings](#list-targetgroupbindings)
+- [list awstargetgroups](#list-awstargetgroups)
+- [list latest-awstargetgroups](#list-latest-awstargetgroups)
 - [create cell](#create-cell)
 - [sync cell](#sync-cell)
 - [create awsalbupdate](#create-awsalbupdate)
@@ -18,13 +21,13 @@ It is currently used to run and test various operations used by various okra con
 
 This command runs okra's controller-manager that is composed of several Kubernetes controllers that powers [CRDs](#/crd.md).
 
-## create argocdclustersecret
+## create cluster
 
-`create argocdclustersecret` command replicates the behaviour of `clusterset` controller.
+`create cluster` command replicates the behaviour of `clusterset` controller.
 
 When `--dry-run` is provided, it emits a Kubernetes manifest YAML of a Kubernetes secret to stdout so that it can be applied using e.g. `kubectl apply -f -`.
 
-### create argocdclustersecret --awseks-cluster-name $NAME --version $VERSION
+### create cluster --awseks-cluster-name $NAME --version $VERSION
 
 This command calls AWS EKS DescribeCluster API on the EKS cluster whose name equals to `$NAME` and use the CA data of the cluster to create an ArgoCD cluster secret. The cluster secret is named `$NAME`. ArgoCD ApplicationSet can discover the cluster secret and start a deployment.
 
@@ -56,7 +59,7 @@ stringData:
 
 When `--add-target-group-annotations` is provided, the resulting cluster secret can be annotated with `okra.mumoshu.github.io/target-group/NAME: {"target-group-arn":"TG_ARN"}`, where `NAME` can be any id that can be a part of the annotation key, and `TG_ARN` is the ARN of the target group associated to the cluster.
 
-### create argocdclustersecret --awseks-cluster-tags $KEY=$VALUE --version-from-tag $VERSION_TAG_KEY
+### create cluster --awseks-cluster-tags $KEY=$VALUE --version-from-tag $VERSION_TAG_KEY
 
 This command calls AWS EKS ListClusters API to list all the clusters in your AWS account, and then calls `DescribeCluster` API for each cluster to get the tags of the respective cluster. For each cluster whose tags matches the selector specified via `--cluster-tags`, the command command creates a ArgoCD cluster secret whose name is set to the same value as the name of the EKS cluster.
 
@@ -68,13 +71,35 @@ The `okra.mumo.co/version` label value of the resulting cluster secret is genera
 
 When `--dry-run` is provided, it emits a Kubernetes manifest YAML of a `AWSTargetGroup` resource to stdout so that it can be applied using e.g. `kubectl apply -f -`.
 
-### create awstargetgroup $$RESOURCE_NAME --arn $ARN --label role=$ROLE
+### create awstargetgroup $$RESOURCE_NAME --arn $ARN --labels role=$ROLE
 
 This command creates a `AWSTargetGroup` resource whose name is `$RESOURCE_NAME` and the target group arn is `$ARN` and the `role` label is set to `$ROLE`.
 
-### create awstargetgroup $RESOURCE_NAME --cluster-name $NAME --arn-from-target-group-binding-name $TG_BINDING_NAME --label role=$ROLE
+### create-missing-awstargetgroups --cluster-name $NAME --target-group-binding-selector name=$TG_BINDING_NAME --labels role=$ROLE
 
-This command get the `TargetGroupBinding` resource named `$TG_BINDING_NAME` from the targeted EKS cluster, and create a `AWSTargetGroup` resource with the target group ARN found in the binding resource.
+This command gets the `TargetGroupBinding` resource named `$TG_BINDING_NAME` from the targeted EKS cluster and creates an `AWSTargetGroup` resource with the target group ARN found in the binding resource. Each `AWSTargetGroup` gets the `role=$ROLE` label as specified by the `--labels` flag.
+
+The part that finds `TargetGroupBinding` can be run independently with [list targetgroupbindings](#list-targetgroupbindings).
+
+### create-outdated-awstargetgroups --cluster-name $NAME --target-group-binding-selector name=$TG_BINDING_NAME --labels role=$ROLE
+
+This command gets the `TargetGroupBinding` resource named `$TG_BINDING_NAME` from the targeted EKS cluster, lists the `AWSTargetGroup` resources on the management cluster whose labels contain `role=$ROLE`, and deletes any `AWSTargetGroup` resource that doesn't have a corresponding `TargetGroupBinding`.
+
+The part that finds `TargetGroupBinding` can be run independently with [list targetgroupbindings](#list-targetgroupbindings).
+
+## list targetgroupbindings
+
+This command fetches and outputs all the `TargetGroupBinding` resources in the target cluster. The target cluster is denoted by the name of an ArgoCD cluster secret.
+
+## list awstargetgroups
+
+This command fetches and outputs all the `AWSTargetGroup` resources in the management cluster.
+
+## list latest-awstargetgroups
+
+This command outputs the latest `AWSTargetGroup` resources in the management cluster.
+
+It does so by first fetching all the `AWSTargetGroup` resources that match the selector (`role=web` for example), grouping the matched resources by `okra.mumo.co/version` tag values (by default), and sorting the groups in descending order of semver, assuming the tag value contains a semver.
 
 ## create cell
 
@@ -84,45 +109,63 @@ This command get the `TargetGroupBinding` resource named `$TG_BINDING_NAME` from
 
 ## sync cell
 
-`sync cell` replicates the behaviour of `cell` controller.
+`sync cell` runs the main reconciliation logic of `cell-controller`.
+
+### sync cell --name $NAME --target-group-selector $SELECTOR
+
+This command syncs the `Cell` resource named `$NAME` with various settings.
 
-### sync cell $NAME
+It starts by fetching all the `AWSTargetGroup` resources that match the selector, grouping the matched resources by `okra.mumo.co/version` tag values (by default), and sorting the groups in descending order of semver, assuming the tag value contains a semver. This part can be run independently with [list latest-awstargetgroups](#list-latest-awstargetgroups).
 
-This command loads `Cell` resource named `$NAME`.
+For example, if the selector is `role=web`, it fetches all the `AWSTargetGroup` resources whose `metadata.labels` match the selector. It then groups the resources by their `okra.mumo.co/version` tag values. If the version tag values are `v1.0.0` and `v1.1.0`, the group for the newest version `v1.1.0` comes first and hence becomes the next deployment candidate.
 
-It then fetches all the `AWSTargetGroup` resources that matches the selector (`role=web` for example), group the matched resources by `okra.mumo.co/version` tag values (by default), and sort the groups in an descending order of the semver assuming the tag value contains a semver.
+When the group with the newest version has the desired number of target groups denoted by `spec.replicas`, it starts updating the AWS ALB listener denoted by `$LISTENER_ARN`.
 
-When the group with the newest version has `$REPLICAS` or more target groups in it, it starts updating the AWS ALB listener denoted by `$LISTENER_ARN`.
+Before updating the listener, it first ensures that there's exactly one loadbalancer config resource. If it doesn't find one, it creates one, which is either an `AWSApplicationLoadBalancerConfig` or an `AWSNetworkLoadBalancerConfig` resource depending on the loadbalancer specified in the `Cell` resource.
 
-The lister update is done by creating either an `AWSALBUpdate` or `AWSNLBUpdate` resource depending on the loadbalancer specified in the `Cell` resource. The creation part can be run independently by using [create awsalbupdate](#create-awsalbupdate).
+The initial config's spec field is derived from the current state of the loadbalancer obtained by calling AWS API. The creation part can be run independently by using [create awsapplicationloadbalancerconfig](#create-awsapplicationloadbalancerconfig).
 
-If there was an ongoing `AWSALBUpdate` resource whose `status.phase` is still `InProgress`, the command exists with code 0 without creating another `AWSALBUpdate` resource.
+If there is an ongoing `AWSApplicationLoadBalancerConfig` resource whose `status.phase` is still `Updating`, the command exits with code 0 without creating another `AWSApplicationLoadBalancerConfig` resource.
 
-`sync cell` uses `Cell`'s status to signal other K8s controller or clients. It doesn't use the status as a state store.
+If there's an `AWSApplicationLoadBalancerConfig` resource and its `status.phase` is `Updated` or `Created`, an update starts. An update works differently depending on the current step index. The current step index is either derived from `cell.status` or the `--step-index` flag of this command.
 
-## create awsalbupdate
+If the current step has `stepWeight`, it updates the target groups' weights. The desired weights are computed from the step index: the controller sums up all the `stepWeight` values of the steps from 0 to the current index.
 
-This command creates a new `AWSALBUpdate` resource. To sync it, use [sync awsalbupdate](#sync-awsalbupdate).
+If and only if the desired weights are different from the current weights, it commits a listener update.
 
-### create awsalbupdate $NAME --listener-arn $LISTENER_ARN --from-target-group-arns $OLD_TG_ARN1 --to-target-group-arns $TO_TG_ARN1
+More concretely, when the listener is being updated, it compares the current target group ARNs and weights stored in `AWSApplicationLoadBalancerConfig` against the desired target group ARNs and weights computed by the `cell-controller`, and determines the next state with the updated target group ARNs and weights. In the controller, this happens on each reconciliation loop, so the weights appear to change gradually.
 
-This command creates a `AWSALBUpdate` resource whose name is `$NAME`.
+If the current step has `sleep`, it exits after updating `cell.status`.
 
-## sync awsalbupdate
+If the current step has `analysis`, it creates a new analysis run from it and exits.
 
-### sync awsalbupdate $NAME
+In any case, if the previous step has `sleep`, it loads the start and end time of the sleep from `cell.status`, and exits if the current time is before the sleep end time. Similarly, if the previous step has `analysis`, it loads the previous analysis run name from `cell.status` and checks whether the analysis run has phase `Completed`. If it doesn't, it exits.
 
-This command loads `AWSALBUpdate` resource named `$NAME` and reconciles it.
+In other words, it usually either (1) sleeps for a while or (2) runs an analysis before updating the listener. If it was a sleep, the next listener update is held until the sleep duration elapses.
 
-Before actually updating the listener, it runs analysis. A listener update is pended until there are enough number of successful analysises that happened after the lastest ALB forward config update.
+If it was an analysis, the listener update is held until there are enough successful analyses that happened after the latest ALB forward config update. To complete the canary deployment, you need to rerun `sync cell` after `run analysis` has completed.
 
-A analysis run can be trigered via [run analysis](#run-analysis).
+An analysis run can also be triggered via [run analysis](#run-analysis).
 
-To complete the canary deployment, you need to rerun `sync awsalbupdate` once again after `run analysis` completed.
+`sync cell` updates `Cell`'s status to signal other K8s controllers or clients. It doesn't use the status as a state store.
 
-`sync awsalbupdate` uses `AWSALBUpdate`'s status to signal `cell-controller` about the completion of the update.
+## create awsapplicationloadbalancerconfig
 
-More concretely, `status.phase` is set to `Succeeded`, `Error`, or `Canceled` depending on the result of analysis runs, AWS API outputs, etc. It becomes `Error` when it failed to update the ALB listener forward config before the deadline, or any of the analysis runs failed with `status.phase` of `Error` . It becomes `Canceled` when `spec.canceled` is set to `true` by `cell-controller`.
+This command creates a new `AWSApplicationLoadBalancerConfig` resource. To sync it, use [sync awsapplicationloadbalancerconfig](#sync-awsapplicationloadbalancerconfig).
+
+### create awsapplicationloadbalancerconfig $NAME --listener-arn $LISTENER_ARN
+
+This command creates an `AWSApplicationLoadBalancerConfig` resource whose name is `$NAME`. The target group ARNs and their weights are derived from the current state of the loadbalancer and the listener, obtained by calling the AWS API.
+
+## sync awsapplicationloadbalancerconfig
+
+### sync awsapplicationloadbalancerconfig $NAME
+
+This command loads `AWSApplicationLoadBalancerConfig` resource named `$NAME` and reconciles it.
+
+`sync awsapplicationloadbalancerconfig` uses `AWSApplicationLoadBalancerConfig`'s status to signal `cell-controller` about the completion of the update.
+
+More concretely, `status.phase` is set to `Created`, `Error`, or `Updated` depending on the situation. It is initially `Created`. If the spec has been changed but the controller failed to update it (i.e. AWS API error), the phase becomes `Error`. If the spec update has been successfully applied to the loadbalancer, the phase becomes `Updated`.
 
 ## run analysis
 
@@ -145,7 +188,7 @@ spec:
   args:
   - name: service-name
   - name: prometheus-port
-    value: 9090
+    value: "9090"
   metrics:
   - name: success-rate
     successCondition: result[0] >= 0.95
@@ -168,22 +211,27 @@ kind: AnalysisRun
 metadata:
   name: run1
 spec:
+  args:
+  - name: service-name
+    value: foo
+  - name: prometheus-port
+    value: "9090"
   metrics:
   - name: success-rate
     successCondition: result[0] >= 0.95
     provider:
       prometheus:
-        address: "http://prometheus.example.com:9090"
+        address: "http://prometheus.example.com:{{args.prometheus-port}}"
         query: |
           sum(irate(
-            istio_requests_total{reporter="source",destination_service=~"foo",response_code!~"5.*"}[5m]
+            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
           )) /
           sum(irate(
-            istio_requests_total{reporter="source",destination_service=~"foo"}[5m]
+            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
           ))
 ```
 
-`AnalysisRun`'s spec is mostly equivalent to that of `AnalysisTemplate`'s, except that `{{args.service-name}}` in the template is replaced with `foo` and `{{args.prometheus-port}}` is replaced with `9090`. `foo` is from the `--args service-name=foo` and `9090` is from the default value defined in the template's args field.
+`AnalysisRun`'s spec is mostly equivalent to that of `AnalysisTemplate`'s, except that the `service-name` arg's `value` in the template is updated to `foo`. `foo` is from the `--args service-name=foo` and `9090` is from the default value defined in the template's args field.
 
 When `--wait` is provided, the command waits until the run to complete. It's considered complete when `status.phase` is either `Error` or `Succeeded`. If phase was `Error`, the command prints a summary of the last `status.metricResults[].measurements[]` item, and exists with code 1.
 
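Note on the `list latest-awstargetgroups` and `sync cell` text in `cli.md` above: the select, group, and sort step can be sketched roughly as follows. This is only an illustration, not okra's implementation. The dict shape for a target group, the function names, and the simplified semver parsing (pre-release and build suffixes are ignored) are all assumptions made for the example; only the `okra.mumo.co/version` key and the `role=web` selector come from the docs.

```python
import re
from collections import defaultdict

# Label key taken from the docs; everything else here is hypothetical.
VERSION_LABEL = "okra.mumo.co/version"


def parse_semver(value):
    """Parse 'v1.2.3' or '1.2.3' into (major, minor, patch).

    Simplification for this sketch: pre-release/build suffixes are ignored.
    """
    m = re.match(r"^v?(\d+)\.(\d+)\.(\d+)", value)
    if m is None:
        raise ValueError(f"not a semver: {value!r}")
    return tuple(int(part) for part in m.groups())


def matches(labels, selector):
    """True when every key=value pair of the selector is present in labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def latest_target_groups(target_groups, selector, version_label=VERSION_LABEL):
    """Group matching target groups by version, newest version first."""
    groups = defaultdict(list)
    for tg in target_groups:
        labels = tg["labels"]
        if matches(labels, selector) and version_label in labels:
            groups[labels[version_label]].append(tg["name"])
    # Sort numerically by semver so that v1.10.0 ranks above v1.9.0.
    return sorted(groups.items(), key=lambda kv: parse_semver(kv[0]), reverse=True)


target_groups = [
    {"name": "web-1-v1", "labels": {"role": "web", VERSION_LABEL: "v1.0.0"}},
    {"name": "web-1-v2", "labels": {"role": "web", VERSION_LABEL: "v1.1.0"}},
    {"name": "web-2-v2", "labels": {"role": "web", VERSION_LABEL: "v1.1.0"}},
    {"name": "api-1", "labels": {"role": "api", VERSION_LABEL: "v2.0.0"}},
]
ordered = latest_target_groups(target_groups, {"role": "web"})
```

With the sample data, the first entry of `ordered` is the `v1.1.0` group, which the docs describe as the next deployment candidate once it has the number of target groups given by `spec.replicas`.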

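Note on the `stepWeight` paragraphs in `cli.md` above: a minimal sketch of the weight computation, under stated assumptions. The docs only say that the `stepWeight` values from step 0 to the current index are summed. Treating that sum as the weight of the new target group, giving the remainder of 100 to the old one, and the `prev`/`next` naming (borrowed from the `AWSApplicationLoadBalancerConfig` example in `crd.md`) are assumptions here, as are the function names and the sample step values.

```python
def desired_weights(steps, step_index, total=100):
    """Sum stepWeight over steps[0..step_index] to get the desired weights.

    Assumption: weights add up to `total`, and the old target group
    receives whatever the new one does not.
    """
    next_weight = sum(s.get("stepWeight", 0) for s in steps[: step_index + 1])
    next_weight = min(next_weight, total)  # never exceed the total
    return {"prev": total - next_weight, "next": next_weight}


def needs_listener_update(current, desired):
    """A listener update is committed only when the weights differ."""
    return current != desired


# Hypothetical canary steps; the sleep and analysis bodies are placeholders.
steps = [
    {"stepWeight": 20},
    {"sleep": "30s"},
    {"stepWeight": 40},
    {"analysis": {}},
    {"stepWeight": 40},
]

at_step_0 = desired_weights(steps, 0)  # {"prev": 80, "next": 20}
at_step_2 = desired_weights(steps, 2)  # {"prev": 40, "next": 60}
at_step_4 = desired_weights(steps, 4)  # {"prev": 0, "next": 100}
```

Steps without `stepWeight` contribute nothing to the sum, so the desired weights at a `sleep` or `analysis` step equal those of the preceding weight step and `needs_listener_update` returns `False` for them.
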
diff --git a/crd.md b/crd.md
@@ -167,6 +167,28 @@ metadata:
     name: cart
 ```
 
+
+# AWSApplicationLoadBalancerConfig
+
+`AWSApplicationLoadBalancerConfig` represents a desired configuration of a specific AWS Application Loadbalancer.
+
+```
+kind: AWSApplicationLoadBalancerConfig
+metadata:
+  name: ...
+spec:
+  listenerARN: $LISTENER_ARN
+  forwardTargetGroups:
+  - name: prev
+    arn: prev1
+    weight: 40
+  - name: next
+    arn: prev2
+    weight: 60
+```
+
+`cell-controller` is responsible for gradually updating `forwardTargetGroups` depending on `stepWeight`. The `awsapplicationloadbalancerconfig-controller` updates the target ALB exactly as described in the config.
+
 # AWSTargetGroupSet
 
 `AWSTargetGroupSet` auto-discovers clusters and generates `AWSTargetGroup`.
@@ -184,15 +206,14 @@ spec:
   generators:
   - awseks:
       clusterSelector:
-        matchTags:
+        matchLabels:
           role: "web"
       bindingSelector:
         matchLabels:
           role: "web"
   # template is a template for dynamically generated AWSTargetGroup resources
   template:
     metadata:
-      name: web-"{{.awseks.cluster.name}}"
       labels:
         role: "{{.awseks.cluster.tags.role}}"
   # bindingTemplate is optional, and used only when you want to dynamically generate AWSTargetGroupBinding