Setup for blue/green deployment mode #180

glaksh100 · 2020-02-28T01:37:03Z

This PR adds the setup required for blue/green deploys but does not implement a blue/green deploy yet. I'm opening it mainly to get feedback on how the status sub-resource would change for applications that don't opt in to the BlueGreen deployment mode.

Changelog

Spec updates:
- DeploymentModeBlueGreen: A deployment mode for opting in to blue/green deploys
- FlinkApplicationDualRunning: A phase for when two Flink applications are running (not yet implemented)
- FlinkApplicationTeardown: A phase for when one of the Flink applications is being torn down (not yet implemented)
Status sub-resource:
- []FlinkApplicationVersionStatus: An array of Flink application statuses. Each entry in turn contains the FlinkClusterStatus, FlinkJobStatus and FlinkApplicationVersion
- A DesiredApplicationCount to indicate how many applications are expected in any of the Updating phases.
Code changes:
- Adding annotations and environment variables with the application version (i.e. blue/green) where applicable
- Alongside this, a few helper methods in flink.go that do the index math for the status array and make it easier to get and update values on the status object.
CRD updates:
- Additional printer columns updated to account for the FlinkApplicationVersion array object
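
For reviewers, the status shape described in the changelog can be sketched roughly as below. This is a trimmed illustration, not the actual types.go from this PR: the field sets are cut down, the Version field's type and all JSON tags other than appStatus and clusterStatus (which appear in the printer-column JSONPath) are assumptions, and newSingleAppStatus is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed stand-ins for the operator's existing status types.
type FlinkClusterStatus struct {
	Health string `json:"health"`
}

type FlinkJobStatus struct {
	Health string `json:"health"`
	State  string `json:"state"`
}

// FlinkApplicationVersionStatus groups the statuses of one application.
// The Version field's type and JSON tag are assumptions.
type FlinkApplicationVersionStatus struct {
	Version       string             `json:"version,omitempty"`
	ClusterStatus FlinkClusterStatus `json:"clusterStatus"`
	JobStatus     FlinkJobStatus     `json:"jobStatus"`
}

// FlinkApplicationStatus carries one AppStatus entry per running application.
type FlinkApplicationStatus struct {
	Phase                   string                          `json:"phase"`
	DesiredApplicationCount int32                           `json:"desiredApplicationCount"`
	AppStatus               []FlinkApplicationVersionStatus `json:"appStatus"`
}

// newSingleAppStatus (hypothetical) builds the status of an application that
// has not opted in to blue/green: a single entry with no version set.
func newSingleAppStatus() FlinkApplicationStatus {
	return FlinkApplicationStatus{
		Phase:                   "DeployFailed",
		DesiredApplicationCount: 1,
		AppStatus: []FlinkApplicationVersionStatus{{
			ClusterStatus: FlinkClusterStatus{Health: "Green"},
			JobStatus:     FlinkJobStatus{Health: "Green", State: "RUNNING"},
		}},
	}
}

func main() {
	out, err := json.MarshalIndent(newSingleAppStatus(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With this shape, an application that does not opt in still gets a one-element array, which is the client-visible change this PR asks for feedback on.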

Output
A new Flink application status for an application in the Dual deployment mode looks like below:

Status:
  App Status:
    Cluster Status:
      Available Task Slots:     0
      Cluster Overview URL:     http://localhost:8001/api/v1/namespaces/simple-flink-k8s-staging/services/simple-flink-k8s:8081/proxy/#/overview
      Health:                   Green
      Healthy Task Managers:    1
      Number Of Task Managers:  1
      Number Of Task Slots:     2
    Job Status:
      Completed Checkpoint Count:  17
      Entry Class:                 com.lyft.streamingplatform.SimpleFlinkApp
      Health:                      Green
      Jar Name:                    simple-flink-k8s-1.0.0-SNAPSHOT.jar
      Job ID:                      dce2b8d7a3cd3514291e5a76df8dee88
      Job Overview URL:            http://localhost:8001/api/v1/namespaces/simple-flink-k8s-staging/services/simple-flink-k8s:8081/proxy/#/jobs/dce2b8d7a3cd3514291e5a76df8dee88
      Job Restart Count:           1
      Last Checkpoint:             file:/tmp/checkpoints/flink/externalized-checkpoints/dce2b8d7a3cd3514291e5a76df8dee88/chk-19
      Last Checkpoint Time:        2020-02-27T18:52:32Z
      Last Failing Time:           <nil>
      Parallelism:                 2
      Restore Path:                file:/tmp/checkpoints/flink/savepoints/savepoint-604aac-b08a519286c9
      Restore Time:                2020-02-27T18:35:33Z
      Running Tasks:               5
      Start Time:                  2020-02-27T18:35:33Z
      State:                       RUNNING
      Total Tasks:                 5
  Deploy Hash:                     fc0329ec
  Desired Application Count:       1
  Failed Deploy Hash:              912eaf77
  Last Updated At:                 2020-02-27T18:52:42Z
  Phase:                           DeployFailed
  Rollback Hash:                   fc0329ec

The printer columns look like below:

NAME               PHASE          APPLICATION VERSION   CLUSTER HEALTH   JOB HEALTH   JOB RESTARTS   AGE
simple-flink-k8s   DeployFailed                         Green            Green        1              21m

glaksh100 · 2020-02-28T01:38:22Z

@anandswaminathan @mwylde When you have a chance, I'd like to get high-level feedback on the approach. I thought I'd start here to make reviewing the upcoming changes easier. Let me know if you'd prefer that I make these changes a different way.

mwylde · 2020-03-03T00:21:53Z

deploy/crd.yaml

    - name: Cluster Health
      type: string
      description: The health of the Flink cluster
-      JSONPath: .status.clusterStatus.health
+      JSONPath: .status.appStatus[*].clusterStatus.health


What does this return when there are multiple app statuses?

glaksh100 · 2020-03-04

What I have here doesn't quite work for multiple app statuses. Printer columns only seem to support the types boolean, date, integer, number and string, and they don't support JSONPath operators/expressions :(
I think that to surface multiple app statuses, I'd have to explicitly specify the array index, and the output would look something like:

NAME                PHASE         APPLICATION VERSION   CLUSTER HEALTH   JOB HEALTH   JOB RESTARTS   APPLICATION VERSION   CLUSTER HEALTH   JOB HEALTH   JOB RESTARTS   AGE
operator-test-app   DualRunning   green                 Green            Green                       blue                  Green            Green                       6h4

This would mean a few empty columns when there's only 1 application running.
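
To make the per-index idea concrete, here is a small sketch that generates one set of printer columns per array index. It is only an illustration: PrinterColumn and columnsForIndex are hypothetical names, the struct simply mirrors the name/type/description/JSONPath keys visible in deploy/crd.yaml, and the .jobStatus.health path is assumed by analogy with .clusterStatus.health.

```go
package main

import "fmt"

// PrinterColumn mirrors the keys used for a printer column in deploy/crd.yaml.
type PrinterColumn struct {
	Name        string
	Type        string
	Description string
	JSONPath    string
}

// columnsForIndex (hypothetical) returns the health columns for the app
// status at a fixed array index, since wildcards such as [*] are not usable.
func columnsForIndex(i int) []PrinterColumn {
	prefix := fmt.Sprintf(".status.appStatus[%d]", i)
	return []PrinterColumn{
		{
			Name:        "Cluster Health",
			Type:        "string",
			Description: "The health of the Flink cluster",
			JSONPath:    prefix + ".clusterStatus.health",
		},
		{
			Name:        "Job Health",
			Type:        "string",
			Description: "The health of the Flink job",
			// Assumed path, by analogy with clusterStatus.health.
			JSONPath: prefix + ".jobStatus.health",
		},
	}
}

func main() {
	// Two indices: one per application that can run at the same time.
	for i := 0; i < 2; i++ {
		for _, c := range columnsForIndex(i) {
			fmt.Printf("- name: %s\n  type: %s\n  description: %s\n  JSONPath: %s\n",
				c.Name, c.Type, c.Description, c.JSONPath)
		}
	}
}
```

The columns for index 1 would stay empty whenever only one application is running, which is the trade-off described above.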

pkg/apis/app/v1beta1/types.go

glaksh100 · 2020-03-11T05:40:47Z

This PR is now ready for review again. Key changes since the last update:

- A new CRD version, v1beta2, encapsulates all the spec/status additions required for blue/green deploys.
- v1beta2 keeps the top-level (repeated) ClusterStatus and JobStatus fields, which are subsequently copied into the VersionStatuses array to ensure backward compatibility between v1beta1 and v1beta2.
- During the CRD version upgrade, I ran into a Kubernetes bug that prevents status sub-resources from being updated when the stored version of the CRD changes. The fix is available in Kubernetes 1.15, and we can remove the workaround when we upgrade.
- Aside from unit/integration tests, I've also tested this locally by deploying applications on v1beta1, then performing a CRD upgrade and checking that applications in all states continue to be served successfully on the new version (a few edge cases were discovered and fixed in the process).
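
As a rough illustration of the copy step mentioned in the list above, the sketch below mirrors the top-level statuses into the array. The type and function names are hypothetical stand-ins, and using index 0 for the currently running application is an assumption; the thread does not state which index is used.

```go
package main

import "fmt"

// Trimmed, hypothetical stand-ins for the operator's status types.
type ClusterStatus struct{ Health string }
type JobStatus struct{ JobID, State string }

type VersionStatus struct {
	ClusterStatus ClusterStatus
	JobStatus     JobStatus
}

type Status struct {
	ClusterStatus   ClusterStatus // top-level fields read by existing clients
	JobStatus       JobStatus
	VersionStatuses []VersionStatus
}

// syncVersionStatuses (hypothetical) copies the top-level statuses into the
// first array entry, creating that entry if the array is still empty.
// Index 0 is an assumption made for this sketch.
func syncVersionStatuses(s *Status) {
	if len(s.VersionStatuses) == 0 {
		s.VersionStatuses = make([]VersionStatus, 1)
	}
	s.VersionStatuses[0].ClusterStatus = s.ClusterStatus
	s.VersionStatuses[0].JobStatus = s.JobStatus
}

func main() {
	s := Status{
		ClusterStatus: ClusterStatus{Health: "Green"},
		JobStatus:     JobStatus{JobID: "example-job-id", State: "RUNNING"},
	}
	syncVersionStatuses(&s)
	fmt.Printf("%+v\n", s.VersionStatuses[0])
}
```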

@mwylde @anandswaminathan PTAL when you have a chance! TIA for the review; I know this is a laborious PR :(

glaksh100 · 2020-03-19T22:22:58Z

@mwylde and I chatted offline and came to the following conclusion:

- Since Kubernetes only allows one stored version of the CRD at a time, we would still have one unified status schema for both CRD versions. This in turn minimizes the benefit of upgrading the CRD. So, we will continue to use the v1beta1 version of the CRD. All changes will be backward compatible.
- To keep the Status sub-resource unchanged for Dual mode deployments, we will still populate Status.ClusterStatus and Status.JobStatus for Dual mode applications; the VersionStatuses[] array will be populated only for BlueGreen deployments.

With the above in mind, I've now modified the PR. Since we've iterated a bit, a changelog relative to the last review could be confusing, so here is the cumulative effect of the change instead:

types.go:
- DeploymentMode: BlueGreen
- New states: DualRunning, Teardown
- Status additions (unused for Dual mode):
  - FlinkApplicationVersionStatus: Describes the ClusterStatus, JobStatus, Version, VersionHash for each blue/green version
  - VersionStatuses: []FlinkApplicationVersionStatus
  - DeployVersion and UpdatingVersion
flink.go:
- Helper methods to abstract out the status updates depending on the deployment mode. For Dual deployment mode, the top-level JobStatus and ClusterStatus will be updated. For BlueGreen deployment mode, the corresponding FlinkApplicationVersionStatus in the VersionStatuses array will be updated.
flink_state_machine.go:
- initializeAppStatusIfEmpty(): Initializes the VersionStatuses array if the deployment mode is BlueGreen. This keeps the Status sub-resource close to its existing format (output from testing below).
- All references that get/update jobID/jobStatus/clusterStatus should only be made through the Get/Update methods in the flink controller.
Container setup for BlueGreen deploys:
- Adding a flink-application-version annotation and a FLINK_APPLICATION_VERSION environment variable for BlueGreen deployments.
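
The mode-dependent helpers described for flink.go could look roughly like the sketch below. It is not the code in this PR: the types are trimmed stand-ins, DeploymentModeDual and the function signatures are assumptions, and looking up the array entry by a version string is one hypothetical way to find the "corresponding" FlinkApplicationVersionStatus.

```go
package main

import "fmt"

type DeploymentMode string

const (
	DeploymentModeDual      DeploymentMode = "Dual" // assumed constant name
	DeploymentModeBlueGreen DeploymentMode = "BlueGreen"
)

type JobStatus struct{ JobID, State string }

type VersionStatus struct {
	Version   string
	JobStatus JobStatus
}

// Application is a flattened, hypothetical stand-in for the CRD object.
type Application struct {
	DeploymentMode  DeploymentMode
	JobStatus       JobStatus       // top-level status, used outside BlueGreen
	VersionStatuses []VersionStatus // per-version statuses, used for BlueGreen
}

// jobStatusRef picks the status location that matches the deployment mode.
func jobStatusRef(app *Application, version string) (*JobStatus, error) {
	if app.DeploymentMode != DeploymentModeBlueGreen {
		return &app.JobStatus, nil
	}
	for i := range app.VersionStatuses {
		if app.VersionStatuses[i].Version == version {
			return &app.VersionStatuses[i].JobStatus, nil
		}
	}
	return nil, fmt.Errorf("no status entry for version %q", version)
}

// UpdateJobStatus writes the job status wherever the mode says it lives.
func UpdateJobStatus(app *Application, version string, js JobStatus) error {
	ref, err := jobStatusRef(app, version)
	if err != nil {
		return err
	}
	*ref = js
	return nil
}

// GetJobStatus reads the job status from the same mode-dependent location.
func GetJobStatus(app *Application, version string) (JobStatus, error) {
	ref, err := jobStatusRef(app, version)
	if err != nil {
		return JobStatus{}, err
	}
	return *ref, nil
}

func main() {
	dual := &Application{DeploymentMode: DeploymentModeDual}
	bg := &Application{
		DeploymentMode:  DeploymentModeBlueGreen,
		VersionStatuses: []VersionStatus{{Version: "blue"}, {Version: "green"}},
	}
	running := JobStatus{JobID: "example-job-id", State: "RUNNING"}
	if err := UpdateJobStatus(dual, "", running); err != nil {
		panic(err)
	}
	if err := UpdateJobStatus(bg, "green", running); err != nil {
		panic(err)
	}
	fmt.Println(dual.JobStatus.State, bg.VersionStatuses[1].JobStatus.State)
}
```

Routing every read and write through one pair of helpers is what lets the state machine stay unaware of where the status is stored.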

@mwylde I've now modified the status to not initialize the VersionStatuses at all when the deployment mode is not BlueGreen. So, I believe clients that use the status sub-resource can remain unchanged (verified with our command line spcli client). I feel better about the overall change now, thanks for helping think through this :)

Example of the status sub-resource:

Status:
  Cluster Status:
    Available Task Slots:     2
    Cluster Overview URL:     http://localhost:8001/api/v1/namespaces/operator-test-app-staging/services/operator-test-app:8081/proxy/#/overview
    Health:                   Green
    Healthy Task Managers:    1
    Number Of Task Managers:  1
    Number Of Task Slots:     2
  Deploy Hash:                e3a2e2a0
  Job Status:
    Completed Checkpoint Count:  94
    Entry Class:                 com.lyft.OperatorTestApp
    Health:                      Green
    Jar Name:                    operator-test-app-1.0.0-SNAPSHOT.jar
    Job ID:                      3baf7b47dc58b6d4450f20f9eb04b302
    Job Overview URL:            http://localhost:8001/api/v1/namespaces/operator-test-app-staging/services/operator-test-app:8081/proxy/#/jobs/d4ab43e120f3a30311b588bc81c64308
    Last Checkpoint:             file:/checkpoints/flink/externalized-checkpoints/d4ab43e120f3a30311b588bc81c64308/chk-94
    Last Checkpoint Time:        2020-03-19T22:02:17Z
    Parallelism:                 2
    Running Tasks:               4
    Start Time:                  2020-03-19T21:54:30Z
    State:                       RUNNING
    Total Tasks:                 4
  Last Updated At:               2020-03-19T22:02:50Z
  Phase:                         SubmittingJob
  Savepoint Path:                file:/checkpoints/flink/savepoints/savepoint-d4ab43-0878eae508df
  Savepoint Trigger Id:          d87898ce219937f3c5e167272b73fcb6

glaksh100 · 2020-03-19T22:27:22Z

pkg/controller/k8/cluster.go

+			// This should only ever be encountered once (per application)
+			// when a new CRD version is deployed and an older version of the application exists
+			// As a workaround, we try to update the entire resource instead of only the status
+			// TODO Remove this block when we upgrade to k8s 1.15


I kept this around to help future CRD upgrades (though no longer required for this PR because there's no CRD upgrade anymore). Let me know if you don't want this here.

pkg/controller/flink/job_manager_controller_test.go

pkg/controller/flink/task_manager_controller.go

glaksh100 · 2020-03-26T16:27:47Z

@mwylde Re-requesting review; I made the flink_state_machine.go diffs cleaner by removing unwarranted changes. Thanks for all the reviewing :)

glaksh100 added 11 commits February 19, 2020 15:19

Working version 1

0c3fb79

Create setup for blue green deploys

3fa7911

[WIP] Setup status sub-resource for blue green deploys

a8fbe00

Updates

df89376

fix bug

d5543d7

Fixes

2d7919e

Make running jobs calculation idempotent

f89a2e1

Fix bugs

469cc96

Reset running jobs in recovering phase

0409316

Make status index calculation simpler

57882da

Add container env and annotations

08e5ce4

glaksh100 requested review from anandswaminathan, kumare3 and mwylde as code owners February 28, 2020 01:37

mwylde reviewed Mar 3, 2020

glaksh100 added 14 commits March 7, 2020 16:33

Update CRD to v1beta2

5fdbc22

Update CRD to v1beta2

cd44057

Fix CRD update issues

1fdf5a1

Fix lint

1cad6dd

Merge master and restore v1beta1 to original version

7aff3f0

Merge master and restore v1beta1 to original version

dcba167

Upgrade integ test to v1beta2

4783327

Backward compatibility changes

32d4a60

Work around status subresource bug

1a4b8c0

Rename status array to VersionStatuses and add comment on k8s bug

ea2c93b

Remove DesiredApplicationCount

937b965

Minor updates

612de70

Minor updates

5c17983

Initialize counter

b543ab0

glaksh100 added 9 commits March 10, 2020 11:35

Handle edge case for jobId

4e2d93e

Debug

6ef5216

Debug

facab34

fixes

dc63057

Fix edge case

8d33782

Fix unit tests

1577b77

Debug logs

9e11592

Fix overwriting of versionstatuses

3c4d0be

Remove debug logs

bd689d9

glaksh100 changed the title ~~[WIP] Setup for blue/green deployment mode~~ Setup for blue/green deployment mode Mar 11, 2020

glaksh100 added 5 commits March 19, 2020 09:25

Merge master

6abcfe2

Merge master

88c535d

Revert CRD upgrade

9e0a682

Keep Status.ClusterStatus and Status.JobStatus unchanged for Dual mode

d5fbbc0

Remove unwarranted changes

0073401

glaksh100 requested a review from mwylde March 19, 2020 22:23

glaksh100 commented Mar 19, 2020

mwylde reviewed Mar 23, 2020

pkg/controller/flink/job_manager_controller_test.go (outdated, resolved)

pkg/controller/flink/task_manager_controller.go (resolved)

glaksh100 added 2 commits March 24, 2020 09:29

Make import name more descriptive

527fab0

Revert file name to add_v1beta1

719e881

mwylde previously approved these changes Mar 25, 2020

Remove an unnecessary diff

578a82b

glaksh100 dismissed mwylde’s stale review via 578a82b March 26, 2020 15:46

Remove an unnecessary diff

0b381e2

glaksh100 requested a review from mwylde March 26, 2020 20:25

mwylde approved these changes Mar 26, 2020

glaksh100 merged commit 4c377a3 into master Mar 26, 2020
