Setup for blue/green deployment mode #180

Merged
merged 43 commits into from
Mar 26, 2020

Conversation

glaksh100
Contributor

This PR adds the setup required for blue/green deploys but does not yet implement a blue/green deploy. I'm opening it mainly to get feedback on how the status sub-resource would change for applications that don't opt in to the BlueGreen deployment mode.

Changelog

  • Spec updates:
    • DeploymentModeBlueGreen: A deployment mode to opt in to blue/green deploys
    • FlinkApplicationDualRunning: A phase for when we have two flink applications running (not yet implemented)
    • FlinkApplicationTeardown: A phase for when one of the flink applications is being torn down (not yet implemented)
  • Status sub-resource:
    • []FlinkApplicationVersionStatus: An array of flink application statuses. This in turn contains the FlinkClusterStatus, FlinkJobStatus and FlinkApplicationVersion
    • A DesiredApplicationCount to indicate how many applications are expected in any of the Updating phases.
  • Code changes:
    • Adding annotations and environment variables with the application version (i.e. blue/green) where applicable
    • Alongside this, a few helper methods in flink.go to do the index math for the status array and also make updates/gets to/from the status object easier.
  • CRD updates:
    • Additional printer columns updated to account for the FlinkApplicationVersion array object
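
As a rough illustration of the status shape described in the changelog above, here is a minimal Go sketch. Only the type names FlinkApplicationVersionStatus, FlinkClusterStatus, FlinkJobStatus, FlinkApplicationVersion and the DesiredApplicationCount field come from this PR description; the AppStatus field name is inferred from the "App Status" output and the printer column JSONPath, and the fields inside each struct plus the exampleStatus helper are assumptions made for the example.

```go
package main

import "fmt"

// Hypothetical, trimmed-down versions of the status types named in the
// changelog. The real types carry many more fields.
type FlinkClusterStatus struct{ Health string }

type FlinkJobStatus struct {
	Health string
	State  string
}

type FlinkApplicationVersion string

// FlinkApplicationVersionStatus groups the per-version cluster and job status.
type FlinkApplicationVersionStatus struct {
	Version       FlinkApplicationVersion
	ClusterStatus FlinkClusterStatus
	JobStatus     FlinkJobStatus
}

// FlinkApplicationStatus holds one entry per running application version.
type FlinkApplicationStatus struct {
	Phase                   string
	DesiredApplicationCount int
	AppStatus               []FlinkApplicationVersionStatus // field name is an assumption
}

// exampleStatus is a hypothetical helper that builds a status with a single
// application version, as for an application that has not opted in to BlueGreen.
func exampleStatus() FlinkApplicationStatus {
	return FlinkApplicationStatus{
		Phase:                   "DeployFailed",
		DesiredApplicationCount: 1,
		AppStatus: []FlinkApplicationVersionStatus{{
			ClusterStatus: FlinkClusterStatus{Health: "Green"},
			JobStatus:     FlinkJobStatus{Health: "Green", State: "RUNNING"},
		}},
	}
}

func main() {
	s := exampleStatus()
	fmt.Println(s.Phase, len(s.AppStatus), s.AppStatus[0].JobStatus.State)
}
```

With a single application the array has one entry and DesiredApplicationCount is 1; a blue/green deploy would presumably hold two entries while both versions run.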

Output
A new Flink application status for an application in the Dual deployment mode looks like this:

Status:
  App Status:
    Cluster Status:
      Available Task Slots:     0
      Cluster Overview URL:     http://localhost:8001/api/v1/namespaces/simple-flink-k8s-staging/services/simple-flink-k8s:8081/proxy/#/overview
      Health:                   Green
      Healthy Task Managers:    1
      Number Of Task Managers:  1
      Number Of Task Slots:     2
    Job Status:
      Completed Checkpoint Count:  17
      Entry Class:                 com.lyft.streamingplatform.SimpleFlinkApp
      Health:                      Green
      Jar Name:                    simple-flink-k8s-1.0.0-SNAPSHOT.jar
      Job ID:                      dce2b8d7a3cd3514291e5a76df8dee88
      Job Overview URL:            http://localhost:8001/api/v1/namespaces/simple-flink-k8s-staging/services/simple-flink-k8s:8081/proxy/#/jobs/dce2b8d7a3cd3514291e5a76df8dee88
      Job Restart Count:           1
      Last Checkpoint:             file:/tmp/checkpoints/flink/externalized-checkpoints/dce2b8d7a3cd3514291e5a76df8dee88/chk-19
      Last Checkpoint Time:        2020-02-27T18:52:32Z
      Last Failing Time:           <nil>
      Parallelism:                 2
      Restore Path:                file:/tmp/checkpoints/flink/savepoints/savepoint-604aac-b08a519286c9
      Restore Time:                2020-02-27T18:35:33Z
      Running Tasks:               5
      Start Time:                  2020-02-27T18:35:33Z
      State:                       RUNNING
      Total Tasks:                 5
  Deploy Hash:                     fc0329ec
  Desired Application Count:       1
  Failed Deploy Hash:              912eaf77
  Last Updated At:                 2020-02-27T18:52:42Z
  Phase:                           DeployFailed
  Rollback Hash:                   fc0329ec

The printer columns look like this:

NAME               PHASE          APPLICATION VERSION   CLUSTER HEALTH   JOB HEALTH   JOB RESTARTS   AGE
simple-flink-k8s   DeployFailed                         Green            Green        1              21m

@glaksh100
Contributor Author

@anandswaminathan @mwylde When you have a chance, I'd like to get high-level feedback on the approach. I thought I'd start here to make reviewing the upcoming changes easier. Let me know if you'd prefer that I make these changes a different way.

deploy/crd.yaml Outdated
  - name: Cluster Health
    type: string
    description: The health of the Flink cluster
-   JSONPath: .status.clusterStatus.health
+   JSONPath: .status.appStatus[*].clusterStatus.health
Contributor

What does this return when there are multiple app statuses?

Contributor Author

What I have here doesn't quite work for multiple app statuses. Printer columns only seem to support the types boolean, date, integer, number, and string, and they don't support JSONPath operators/expressions :(
I think to surface multiple app statuses, I'd have to explicitly specify the array index, and the output would look something like:

NAME                PHASE         APPLICATION VERSION   CLUSTER HEALTH   JOB HEALTH   JOB RESTARTS   APPLICATION VERSION   CLUSTER HEALTH   JOB HEALTH   JOB RESTARTS   AGE
operator-test-app   DualRunning   green                 Green            Green                       blue                  Green            Green                       6h4

This would mean a few empty columns when there's only 1 application running.
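
To make "explicitly specify the array index" concrete, here is a small Go sketch that builds one set of column definitions per fixed array position. The printerColumn struct and columnsForIndex function are hypothetical stand-ins for CRD printer column entries, and the .jobStatus.health path is assumed by analogy with the .clusterStatus.health path in the diff above.

```go
package main

import "fmt"

// printerColumn is a hypothetical stand-in for one printer column entry
// in the CRD; it is not a type from any Kubernetes library.
type printerColumn struct {
	Name     string
	Type     string
	JSONPath string
}

// columnsForIndex returns the health columns for one fixed position in the
// status array, since wildcard expressions are not usable in printer columns.
func columnsForIndex(i int) []printerColumn {
	prefix := fmt.Sprintf(".status.appStatus[%d]", i)
	return []printerColumn{
		{Name: "Cluster Health", Type: "string", JSONPath: prefix + ".clusterStatus.health"},
		// The job health path is an assumption made for this example.
		{Name: "Job Health", Type: "string", JSONPath: prefix + ".jobStatus.health"},
	}
}

func main() {
	// Two positions: one per application version in a blue/green deploy.
	for i := 0; i < 2; i++ {
		for _, c := range columnsForIndex(i) {
			fmt.Printf("%-15s %s\n", c.Name, c.JSONPath)
		}
	}
}
```

Columns for index 1 would stay empty whenever only one application is running, which matches the empty columns described above.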

@glaksh100 glaksh100 changed the title from "[WIP] Setup for blue/green deployment mode" to "Setup for blue/green deployment mode" Mar 11, 2020
@glaksh100
Contributor Author

This PR is now ready for review again. Key changes since the last update:

  • A new CRD version v1beta2 that encapsulates all the required spec/status additions for Blue Green deploys.
  • v1beta2 has top-level (repeated) ClusterStatus and JobStatus fields that are subsequently copied into the VersionStatuses array to ensure backward compatibility between v1beta1 and v1beta2.
  • During the CRD version upgrade, I ran into this bug that prevents status sub-resources from being updated when the stored version of the CRD changes. The fix is available in Kubernetes 1.15 and we can remove the workaround when we upgrade.
  • Aside from unit/integration tests, I've also tested this out locally by deploying applications with v1beta1 versions and subsequently performing a CRD upgrade and ensuring that applications in all states continue to successfully be served on the new version (a few edge cases were discovered and fixed in the process)

@mwylde @anandswaminathan PTAL when you have a chance! TIA for the review; I know this is a laborious PR :(

@glaksh100
Contributor Author

@mwylde and I chatted offline and came to the following conclusion:

  • Since Kubernetes only allows one stored version of the CRD at a time, we would still have one unified status schema for both CRD versions. This minimizes the benefit of upgrading the CRD, so we will continue to use the v1beta1 version of the CRD. All changes will be backward compatible.
  • To keep the Status sub-resource for Dual mode deployments unchanged, we will still populate Status.ClusterStatus and Status.JobStatus for Dual mode applications, and the VersionStatuses[] array will be populated only for BlueGreen deployments.

With that in mind, I've modified the PR. Since we've iterated a bit, a changelog relative to the last review could be confusing, so here is the cumulative effect of the change:

  • types.go:

    • DeploymentMode: BlueGreen
    • New states: DualRunning, Teardown
    • Status additions (unused for Dual mode):
      • FlinkApplicationVersionStatus: Describes the ClusterStatus, JobStatus, Version, VersionHash for each blue/green version
      • VersionStatuses: []FlinkApplicationVersionStatus
      • DeployVersion and UpdatingVersion
  • flink.go

    • Helper methods to abstract out the status updates depending on the deployment mode. For Dual deployment mode, the top-level JobStatus and ClusterStatus will be updated. For BlueGreen deployment mode, the corresponding FlinkApplicationVersionStatus in the VersionStatuses array will be updated.
  • flink_state_machine.go

    • initializeAppStatusIfEmpty(): Initializes the VersionStatuses array only if the deployment mode is BlueGreen. This keeps the Status sub-resource in its existing format for other deployment modes (output from testing below)
    • All references to get/update jobID/jobStatus/clusterStatus should only be made through the Get/Update methods in the flink controller.
  • Container setup for BlueGreen deploys: Adding a flink-application-version annotation and a FLINK_APPLICATION_VERSION environment variable for BlueGreen deployments.
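
The mode-dependent status handling listed above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the struct fields, the DeploymentModeDual constant name, the UpdateJobID/GetJobID signatures, the currentIndex helper, and the choice to allocate a single VersionStatuses entry are all assumptions made for the example.

```go
package main

import "fmt"

type DeploymentMode string

const (
	DeploymentModeDual      DeploymentMode = "Dual" // constant name assumed by analogy
	DeploymentModeBlueGreen DeploymentMode = "BlueGreen"
)

// Trimmed-down, hypothetical status types; the real ones carry more fields.
type FlinkJobStatus struct{ JobID string }

type FlinkApplicationVersionStatus struct {
	Version   string
	JobStatus FlinkJobStatus
}

type FlinkApplication struct {
	DeploymentMode  DeploymentMode
	JobStatus       FlinkJobStatus                  // top-level status, used outside BlueGreen
	VersionStatuses []FlinkApplicationVersionStatus // only populated for BlueGreen
}

// initializeAppStatusIfEmpty allocates VersionStatuses only for BlueGreen.
// Allocating exactly one entry is an assumption of this sketch.
func initializeAppStatusIfEmpty(app *FlinkApplication) {
	if app.DeploymentMode == DeploymentModeBlueGreen && len(app.VersionStatuses) == 0 {
		app.VersionStatuses = make([]FlinkApplicationVersionStatus, 1)
	}
}

// currentIndex stands in for the index math, which this thread does not
// spell out; the sketch always uses the first entry.
func currentIndex(app *FlinkApplication) int { return 0 }

// UpdateJobID writes the job ID to the location that matches the deployment mode.
func UpdateJobID(app *FlinkApplication, jobID string) {
	initializeAppStatusIfEmpty(app)
	if app.DeploymentMode == DeploymentModeBlueGreen {
		app.VersionStatuses[currentIndex(app)].JobStatus.JobID = jobID
		return
	}
	app.JobStatus.JobID = jobID
}

// GetJobID reads the job ID from the location that matches the deployment mode.
func GetJobID(app *FlinkApplication) string {
	if app.DeploymentMode == DeploymentModeBlueGreen {
		if len(app.VersionStatuses) == 0 {
			return ""
		}
		return app.VersionStatuses[currentIndex(app)].JobStatus.JobID
	}
	return app.JobStatus.JobID
}

func main() {
	dual := &FlinkApplication{DeploymentMode: DeploymentModeDual}
	UpdateJobID(dual, "job-a")
	blueGreen := &FlinkApplication{DeploymentMode: DeploymentModeBlueGreen}
	UpdateJobID(blueGreen, "job-b")
	fmt.Println(GetJobID(dual), len(dual.VersionStatuses))
	fmt.Println(GetJobID(blueGreen), len(blueGreen.VersionStatuses))
}
```

Routing every read and write through one pair of accessors means callers never need to know which layout a given application uses, and Dual mode applications never see a VersionStatuses array at all.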

@mwylde I've now modified the status to not initialize the VersionStatuses at all when the deployment mode is not BlueGreen. So, I believe clients that use the status sub-resource can remain unchanged (verified with our command line spcli client). I feel better about the overall change now, thanks for helping think through this :)

Example of the status sub-resource:

Status:
  Cluster Status:
    Available Task Slots:     2
    Cluster Overview URL:     http://localhost:8001/api/v1/namespaces/operator-test-app-staging/services/operator-test-app:8081/proxy/#/overview
    Health:                   Green
    Healthy Task Managers:    1
    Number Of Task Managers:  1
    Number Of Task Slots:     2
  Deploy Hash:                e3a2e2a0
  Job Status:
    Completed Checkpoint Count:  94
    Entry Class:                 com.lyft.OperatorTestApp
    Health:                      Green
    Jar Name:                    operator-test-app-1.0.0-SNAPSHOT.jar
    Job ID:                      3baf7b47dc58b6d4450f20f9eb04b302
    Job Overview URL:            http://localhost:8001/api/v1/namespaces/operator-test-app-staging/services/operator-test-app:8081/proxy/#/jobs/d4ab43e120f3a30311b588bc81c64308
    Last Checkpoint:             file:/checkpoints/flink/externalized-checkpoints/d4ab43e120f3a30311b588bc81c64308/chk-94
    Last Checkpoint Time:        2020-03-19T22:02:17Z
    Parallelism:                 2
    Running Tasks:               4
    Start Time:                  2020-03-19T21:54:30Z
    State:                       RUNNING
    Total Tasks:                 4
  Last Updated At:               2020-03-19T22:02:50Z
  Phase:                         SubmittingJob
  Savepoint Path:                file:/checkpoints/flink/savepoints/savepoint-d4ab43-0878eae508df
  Savepoint Trigger Id:          d87898ce219937f3c5e167272b73fcb6

@glaksh100 glaksh100 requested a review from mwylde March 19, 2020 22:23
// This should only ever be encountered once (per application)
// when a new CRD version is deployed and an older version of the application exists
// As a workaround, we try to update the entire resource instead of only the status
// TODO Remove this block when we upgrade to k8s 1.15
Contributor Author

I kept this around to help future CRD upgrades (though no longer required for this PR because there's no CRD upgrade anymore). Let me know if you don't want this here.
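
For readers unfamiliar with the workaround described in the code comment above, here is a hedged Go sketch of the general pattern: try the status-only update first and fall back to updating the entire resource if that fails. The resourceClient interface, fakeClient, and updateStatusWithFallback are hypothetical names for this example, and falling back on any error is an assumption; the actual block may check for a specific error.

```go
package main

import (
	"errors"
	"fmt"
)

// resourceClient is a hypothetical abstraction over the Kubernetes client
// calls; it is not an interface from client-go or from this repository.
type resourceClient interface {
	UpdateStatus(name string) error // updates only the status sub-resource
	Update(name string) error       // updates the entire resource
}

// errStatusRejected simulates the status update failing after the stored
// CRD version changes.
var errStatusRejected = errors.New("status update rejected")

// updateStatusWithFallback tries the status-only update first and falls back
// to a full resource update. Falling back on any error is an assumption.
func updateStatusWithFallback(c resourceClient, name string) error {
	err := c.UpdateStatus(name)
	if err == nil {
		return nil
	}
	if fullErr := c.Update(name); fullErr != nil {
		return fmt.Errorf("status update failed (%v), full update failed: %w", err, fullErr)
	}
	return nil
}

// fakeClient is an in-memory stand-in used to exercise the fallback path.
type fakeClient struct {
	statusErr   error
	fullUpdates int
}

func (f *fakeClient) UpdateStatus(string) error { return f.statusErr }

func (f *fakeClient) Update(string) error {
	f.fullUpdates++
	return nil
}

func main() {
	c := &fakeClient{statusErr: errStatusRejected}
	err := updateStatusWithFallback(c, "operator-test-app")
	fmt.Println(err, c.fullUpdates)
}
```

Per the comment, the fallback should only be hit once per application after a CRD version change, so the extra full update is rare in practice.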

mwylde
mwylde previously approved these changes Mar 25, 2020
@glaksh100
Contributor Author

@mwylde Re-requesting review; I made the flink_state_machine.go diffs cleaner by removing unwarranted changes. Thanks for all the reviewing :)

@glaksh100 glaksh100 requested a review from mwylde March 26, 2020 20:25
@glaksh100 glaksh100 merged commit 4c377a3 into master Mar 26, 2020