
add deployments to tracking model #4837

Merged: merged 2 commits into master on Jul 24, 2021

Conversation

@cgardens (Contributor) commented Jul 19, 2021

depends on #4797
(arguably) blocks #4716

What

  • For a workspace, we want to know "where" it is deployed (e.g. on cloud, OSS)
  • For a workspace, we want to identify if it is colocated with other workspaces. Our tracking model is about to become workspace-centric, which means that if a user has multiple workspaces on their self-hosted install, we would not be able to track that they are the same user. Thus we want some identifier that gives us some notion of this. The more rigorous way of doing this would be to add an organization concept, but that is a more invasive change. For now, we can make the assumption that workspaces (that are not on cloud) that are colocated on the same install (deployment) belong to the same user.

How

  • Add a deployment id that tracks an install of Airbyte. Its lifecycle will be the same as a volume's.
    • if an Airbyte instance is turned off and on, the deployment id remains the same
    • if an Airbyte instance is spun down and its volumes destroyed, then when it is spun back up the deployment id will change (call it deployment id prime). even when data is then imported into this instance, the deployment id will not be overwritten by the import; it will remain deployment id prime.
  • When Airbyte first starts up with fresh volumes / persistence, it generates a deployment id (see ServerApp.java).
  • When Airbyte imports data, it needs to make sure that it keeps its original deployment id (ConfigDumpImporter.java). A sketch of both behaviors follows this list.
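
A minimal sketch of both behaviors, assuming JobPersistence exposes a getDeployment/setDeployment pair (signatures here are illustrative; the real code lives in ServerApp.java and ConfigDumpImporter.java):

import java.io.IOException;
import java.util.Optional;
import java.util.UUID;

public class DeploymentIdLifecycleSketch {

  // Minimal stand-in for the real JobPersistence interface from this PR.
  interface JobPersistence {

    Optional<UUID> getDeployment() throws IOException;

    void setDeployment(UUID deploymentId) throws IOException;

  }

  // Server boot (ServerApp.java): mint an id only when the volumes are fresh.
  // Restarting reuses the stored id; wiping the volumes yields a new one.
  static void createDeploymentIfNoneExists(final JobPersistence persistence) throws IOException {
    if (persistence.getDeployment().isEmpty()) {
      persistence.setDeployment(UUID.randomUUID());
    }
  }

  // Import (ConfigDumpImporter.java): the id already on the instance wins; an
  // id carried inside the imported archive is ignored.
  static void preserveDeploymentOnImport(final JobPersistence persistence,
                                         final UUID importedDeploymentId) throws IOException {
    final Optional<UUID> existing = persistence.getDeployment();
    persistence.setDeployment(existing.orElse(importedDeploymentId));
  }
}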

to do

  • unit tests (going to wait until i get a sanity check on the approach before i write tests)
  • track OSS versus cloud

@github-actions github-actions bot added the area/platform issues related to the platform label Jul 19, 2021
@cgardens cgardens requested a review from jrhizor July 19, 2021 21:42
@cgardens cgardens marked this pull request as ready for review July 19, 2021 21:42
@cgardens cgardens requested a review from davinchia July 19, 2021 21:44
/**
* deploymentId - identifier for the deployment
*/
private final UUID deploymentId;
Contributor:

opinion: I think it's useful to leave a comment here summarising how deploymentId and workspaces tie together.

Contributor:

Maybe instanceId? I think instance has been the most common way to refer to the concept of a deployment (and that's also reflected in our docs). Either is fine with me, just don't want to diverge naming. Feels like it's an AirbyteInstance with an instanceId and deploymentMode/deploymentEnvironment (vars that describe the deployment of the instance with that id).

@cgardens (Author), Jul 20, 2021:

added the comment.

cgardens (Author):

discussed with @jrhizor offline. our goal here is to make sure the data model matches naming in the docs. we are keeping deployment, but i updated naming in docs to match deployment.

/**
* Returns a deployment UUID.
*/
Optional<UUID> getDeployment() throws IOException;
Contributor:

leaving the "deployment id -> all workspaces for a deployment" comment here is also fine.
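
For illustration, the documented accessor could read something like this (the javadoc wording is my assumption, not the PR's):

/**
 * A deployment id identifies a single installation of Airbyte. Every workspace
 * on that installation shares the same deployment id, which is how the
 * tracking model recognizes colocated workspaces as part of one deployment.
 *
 * @return the deployment id, if one has been set for this installation.
 */
Optional<UUID> getDeployment() throws IOException;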

@@ -330,6 +338,36 @@ public void importDatabaseFromArchive(final Path storageRoot, final String airby
}
}

/**
* The deployment concept is specific to the environment that Airbyte is running in (not the data
Contributor:

amazing. I was just wondering why.

@@ -176,6 +179,18 @@ public void configure() {
server.join();
}

private static void createDeploymentIfNoneExists(final JobPersistence jobPersistence) throws IOException {
Contributor:

to make sure I understand: it makes sense to have the server do this because the scheduler's start up depends on the server. this way, we know the scheduler can access the deployment id as soon as it starts up?

cgardens (Author):

yeah. at least how we have everything structured now, we know that the main method of the server runs first and the scheduler waits for the database to be in a ready state (in other words, fully set up by the server) before it does anything.

cgardens (Author):

which is actually a great reminder... because now the tracker depends on the database for the deployment id, so we cannot initialize the tracking client until after the busy loop that waits for the db to be set up.
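
A rough, self-contained sketch of the startup ordering this implies (method names are stand-ins, not the actual ones in ServerApp.java):

import java.util.UUID;

public class ServerStartupOrderingSketch {

  // Stubs standing in for the real server wiring.
  static void waitForDatabaseReady() { /* the existing busy loop */ }

  static UUID getOrCreateDeploymentId() { return UUID.randomUUID(); }

  static void initializeTrackingClient(UUID deploymentId) { /* tracking setup */ }

  public static void main(String[] args) {
    // 1. Block until the database is fully set up by the server.
    waitForDatabaseReady();
    // 2. Only now can the deployment id safely be read or created.
    final UUID deploymentId = getOrCreateDeploymentId();
    // 3. The tracking client reads the deployment id, so it comes last.
    initializeTrackingClient(deploymentId);
  }
}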

@davinchia (Contributor) left a comment:

One comment - looks good!

Base automatically changed from cgardens/refactor_import_export to master July 20, 2021 16:23
@jrhizor (Contributor) left a comment:

looks good overall. just some comments on naming.

@cgardens cgardens force-pushed the cgardens/add_deployment branch 2 times, most recently from 171c035 to a60ccd9 on July 23, 2021 19:33
@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Jul 23, 2021
Commits:
  • deployment
  • clean
  • clean up key/value; add deployment mode
  • add comment explaining deployment id lifecycle
  • move client initialization so that it will include deployment id
  • unit test
  • fix key case
  • docs instance => deployment
  • fix setdeployment method
@cgardens cgardens merged commit 67dd8a3 into master Jul 24, 2021
@cgardens cgardens deleted the cgardens/add_deployment branch July 24, 2021 00:22
configs.getTrackingStrategy(),
// todo (cgardens) - we need to do the `#runServer` pattern here that we do in `ServerApp` so that
// the deployment mode can be set by the cloud version.
new Deployment(DeploymentMode.OSS, jobPersistence.getDeployment().orElseThrow(), configs.getWorkerEnvironment()),
@cgardens (Author):

@jrhizor i did this heinous thing to get around the merge conflict. Obviously this is not how we want to set deployment mode. do you have thoughts on the right way to inject it? i'm still wrapping my head around the new factory.

throws IOException {
// filter out the deployment record from the import data, if it exists.
Stream<JsonNode> stream = metadataTableStream
.filter(record -> record.get(DefaultJobPersistence.METADATA_KEY_COL).asText().equals(DatabaseSchema.AIRBYTE_METADATA.toString()));
@tuliren (Contributor), Jul 26, 2021:

@cgardens, this line looks problematic. The metadata records are like this:

Record: {"key":"server_uuid","value":"e895a584-7dbf-48ce-ace6-0bc9ea570c34"}
Record: {"key":"deployment_id","value":"55b923a9-42e5-4b0d-a5d2-b8d0316bfb2b"}
Record: {"key":"airbyte_version","value":"dev"}
Record: {"key":"2021-07-26T03:34:31.996221Z_init_db","value":"dev"}

The METADATA_KEY_COL column is never AIRBYTE_METADATA. So none of the metadata will be imported. Based on the comment, what you want is the following, right?

Suggested change
.filter(record -> record.get(DefaultJobPersistence.METADATA_KEY_COL).asText().equals(DatabaseSchema.AIRBYTE_METADATA.toString()));
.filter(record -> !record.get(DefaultJobPersistence.METADATA_KEY_COL).asText().equals(DEPLOYMENT_ID_KEY));

I think this is going to cause a severe problem for anyone who tries to import their configs. Right now the database connection depends on the existence of the server_uuid metadata record. If that record does not exist, the server and scheduler will wait for the connection forever.
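
To see the bug concretely, here is a small runnable demonstration using the record shapes from the comment above (the column name and DEPLOYMENT_ID_KEY are stand-ins for the real constants):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.stream.Collectors;

public class MetadataFilterDemo {

  private static final String METADATA_KEY_COL = "key";            // stand-in for DefaultJobPersistence.METADATA_KEY_COL
  private static final String DEPLOYMENT_ID_KEY = "deployment_id"; // stand-in for the real constant

  public static void main(String[] args) throws Exception {
    final ObjectMapper mapper = new ObjectMapper();
    final List<JsonNode> records = List.of(
        mapper.readTree("{\"key\":\"server_uuid\",\"value\":\"e895a584-7dbf-48ce-ace6-0bc9ea570c34\"}"),
        mapper.readTree("{\"key\":\"deployment_id\",\"value\":\"55b923a9-42e5-4b0d-a5d2-b8d0316bfb2b\"}"),
        mapper.readTree("{\"key\":\"airbyte_version\",\"value\":\"dev\"}"));

    // Original predicate: keeps only rows whose key equals the table name
    // "airbyte_metadata" -- no row ever matches, so every record is dropped.
    final List<JsonNode> buggy = records.stream()
        .filter(r -> r.get(METADATA_KEY_COL).asText().equals("airbyte_metadata"))
        .collect(Collectors.toList());
    System.out.println(buggy.size()); // 0 -- server_uuid is lost on import

    // Suggested predicate: drop only the deployment_id row, keep the rest.
    final List<JsonNode> fixed = records.stream()
        .filter(r -> !r.get(METADATA_KEY_COL).asText().equals(DEPLOYMENT_ID_KEY))
        .collect(Collectors.toList());
    System.out.println(fixed.size()); // 2 -- server_uuid survives
  }
}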

Contributor:

Fix here: #4977.
