
add deployments to tracking model #4837

Merged: merged 2 commits into master on Jul 24, 2021

Conversation

@cgardens (Contributor) commented Jul 19, 2021

depends on #4797
(arguably) blocks #4716

What

  • For a workspace, we want to know "where" it is deployed (e.g. on cloud, OSS)
  • For a workspace, we want to identify if it is colocated with other workspaces. Our tracking model is about to become workspace-centric, which means that if a user has multiple workspaces on their self-hosted install, we would not be able to track that they are the same user. Thus we want some identifier that gives us some notion of this. The more rigorous way of doing this would be to add an organization concept, but that is a more invasive change. For now, we can make the assumption that workspaces (that are not on cloud) that are colocated on the same install (deployment) belong to the same user.

How

  • Add a deployment id that tracks an install of Airbyte. Its lifecycle will be the same as a volume's.
    • if an Airbyte instance is turned off and on, the deployment id remains the same
    • if an Airbyte instance is spun down and its volumes destroyed, then when it is spun back up the deployment id will change (call it deployment id prime). even when data is then imported into this instance, the deployment id will not be overwritten by the import; it will remain deployment id prime.
  • When Airbyte first starts up with fresh volumes / persistence, it generates a deployment id (see ServerApp.java).
  • When Airbyte imports data, it needs to make sure that it keeps its original deployment id (ConfigDumpImporter.java). A sketch of both behaviors follows this list.
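
A minimal sketch of both behaviors, assuming JobPersistence exposes a getDeployment/setDeployment pair (signatures here are illustrative; the real code lives in ServerApp.java and ConfigDumpImporter.java):

import java.io.IOException;
import java.util.Optional;
import java.util.UUID;

public class DeploymentIdLifecycleSketch {

  // Minimal stand-in for the real JobPersistence interface from this PR.
  interface JobPersistence {

    Optional<UUID> getDeployment() throws IOException;

    void setDeployment(UUID deploymentId) throws IOException;

  }

  // Server boot (ServerApp.java): mint an id only when the volumes are fresh.
  // Restarting reuses the stored id; wiping the volumes yields a new one.
  static void createDeploymentIfNoneExists(final JobPersistence persistence) throws IOException {
    if (persistence.getDeployment().isEmpty()) {
      persistence.setDeployment(UUID.randomUUID());
    }
  }

  // Import (ConfigDumpImporter.java): the id already on the instance wins; an
  // id carried inside the imported archive is ignored.
  static void preserveDeploymentOnImport(final JobPersistence persistence,
                                         final UUID importedDeploymentId) throws IOException {
    final Optional<UUID> existing = persistence.getDeployment();
    persistence.setDeployment(existing.orElse(importedDeploymentId));
  }
}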

to do

  • unit tests (going to wait until i get a sanity check on the approach before i write tests)
  • track OSS versus cloud

@github-actions github-actions bot added the area/platform issues related to the platform label Jul 19, 2021
@cgardens cgardens requested a review from jrhizor July 19, 2021 21:42
@cgardens cgardens marked this pull request as ready for review July 19, 2021 21:42
@cgardens cgardens requested a review from davinchia July 19, 2021 21:44
/**
* deploymentId - identifier for the deployment
*/
private final UUID deploymentId;
Contributor:

opinion: I think it's useful to leave a comment here summarising how deploymentId and workspaces tie together.

Contributor:

Maybe instanceId? I think instance has been the most common way to refer to the concept of a deployment (and that's also reflected in our docs). Either is fine with me, just don't want to diverge naming. Feels like it's an AirbyteInstance with an instanceId and deploymentMode/deploymentEnvironment (vars that describe the deployment of the instance with that id).

@cgardens (Author), Jul 20, 2021:

added the comment.

cgardens (Author):

discussed with @jrhizor offline. our goal here is to make sure the data model matches naming in the docs. we are keeping deployment, but i updated naming in docs to match deployment.

/**
* Returns a deployment UUID.
*/
Optional<UUID> getDeployment() throws IOException;
Contributor:

leaving the "deployment id -> all workspaces for a deployment" comment here is also fine.
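
For illustration, the documented accessor could read something like this (the javadoc wording is my assumption, not the PR's):

/**
 * A deployment id identifies a single installation of Airbyte. Every workspace
 * on that installation shares the same deployment id, which is how the
 * tracking model recognizes colocated workspaces as part of one deployment.
 *
 * @return the deployment id, if one has been set for this installation.
 */
Optional<UUID> getDeployment() throws IOException;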

@@ -330,6 +338,36 @@ public void importDatabaseFromArchive(final Path storageRoot, final String airby
}
}

/**
* The deployment concept is specific to the environment that Airbyte is running in (not the data
Contributor:

amazing. I was just wondering why.

@@ -176,6 +179,18 @@ public void configure() {
server.join();
}

private static void createDeploymentIfNoneExists(final JobPersistence jobPersistence) throws IOException {
Contributor:

to make sure I understand: it makes sense to have the server do this because the scheduler's start up depends on the server. this way, we know the scheduler can access the deployment id as soon as it starts up?

cgardens (Author):

yeah. at least how we have everything structured now, we know that the main method of the server runs first and the scheduler waits for the database to be in a ready state (in other words, fully set up by the server) before it does anything.

cgardens (Author):

which is actually a great reminder... because now the tracker depends on the database for the deployment id, so we cannot initialize the tracking client until after the busy loop that waits for the db to be set up.
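
A rough, self-contained sketch of the startup ordering this implies (method names are stand-ins, not the actual ones in ServerApp.java):

import java.util.UUID;

public class ServerStartupOrderingSketch {

  // Stubs standing in for the real server wiring.
  static void waitForDatabaseReady() { /* the existing busy loop */ }

  static UUID getOrCreateDeploymentId() { return UUID.randomUUID(); }

  static void initializeTrackingClient(UUID deploymentId) { /* tracking setup */ }

  public static void main(String[] args) {
    // 1. Block until the database is fully set up by the server.
    waitForDatabaseReady();
    // 2. Only now can the deployment id safely be read or created.
    final UUID deploymentId = getOrCreateDeploymentId();
    // 3. The tracking client reads the deployment id, so it comes last.
    initializeTrackingClient(deploymentId);
  }
}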

@davinchia (Contributor) left a comment:

One comment - looks good!

Base automatically changed from cgardens/refactor_import_export to master July 20, 2021 16:23
@jrhizor (Contributor) left a comment:

looks good overall. just some comments on naming.

@cgardens cgardens force-pushed the cgardens/add_deployment branch 2 times, most recently from 171c035 to a60ccd9 on July 23, 2021 19:33
@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Jul 23, 2021
Commits:
  • deployment
  • clean
  • clean up key/value; add deployment mode
  • add comment explaining deployment id lifecycle
  • move client initialization so that it will include deployment id
  • unit test
  • fix key case
  • docs instance => deployment
  • fix setdeployment method
@cgardens cgardens merged commit 67dd8a3 into master Jul 24, 2021
@cgardens cgardens deleted the cgardens/add_deployment branch July 24, 2021 00:22
configs.getTrackingStrategy(),
// todo (cgardens) - we need to do the `#runServer` pattern here that we do in `ServerApp` so that
// the deployment mode can be set by the cloud version.
new Deployment(DeploymentMode.OSS, jobPersistence.getDeployment().orElseThrow(), configs.getWorkerEnvironment()),
@cgardens (Author):

@jrhizor i did this heinous thing to get around the merge conflict. Obviously this is not how we want to set deployment mode. do you have thoughts on the right way to inject it? i'm still wrapping my head around the new factory.

throws IOException {
// filter out the deployment record from the import data, if it exists.
Stream<JsonNode> stream = metadataTableStream
.filter(record -> record.get(DefaultJobPersistence.METADATA_KEY_COL).asText().equals(DatabaseSchema.AIRBYTE_METADATA.toString()));
@tuliren (Contributor), Jul 26, 2021:

@cgardens, this line looks problematic. The metadata records are like this:

Record: {"key":"server_uuid","value":"e895a584-7dbf-48ce-ace6-0bc9ea570c34"}
Record: {"key":"deployment_id","value":"55b923a9-42e5-4b0d-a5d2-b8d0316bfb2b"}
Record: {"key":"airbyte_version","value":"dev"}
Record: {"key":"2021-07-26T03:34:31.996221Z_init_db","value":"dev"}

The METADATA_KEY_COL column is never AIRBYTE_METADATA. So none of the metadata will be imported. Based on the comment, what you want is the following, right?

Suggested change
.filter(record -> record.get(DefaultJobPersistence.METADATA_KEY_COL).asText().equals(DatabaseSchema.AIRBYTE_METADATA.toString()));
.filter(record -> !record.get(DefaultJobPersistence.METADATA_KEY_COL).asText().equals(DEPLOYMENT_ID_KEY));

I think this is going to cause a severe problem for anyone who tries to import their configs. Right now the database connection depends on the existence of the server_uuid metadata record. If that record does not exist, the server and scheduler will wait for the connection forever.
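
To see the bug concretely, here is a small runnable demonstration using the record shapes from the comment above (the column name and DEPLOYMENT_ID_KEY are stand-ins for the real constants):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.stream.Collectors;

public class MetadataFilterDemo {

  private static final String METADATA_KEY_COL = "key";            // stand-in for DefaultJobPersistence.METADATA_KEY_COL
  private static final String DEPLOYMENT_ID_KEY = "deployment_id"; // stand-in for the real constant

  public static void main(String[] args) throws Exception {
    final ObjectMapper mapper = new ObjectMapper();
    final List<JsonNode> records = List.of(
        mapper.readTree("{\"key\":\"server_uuid\",\"value\":\"e895a584-7dbf-48ce-ace6-0bc9ea570c34\"}"),
        mapper.readTree("{\"key\":\"deployment_id\",\"value\":\"55b923a9-42e5-4b0d-a5d2-b8d0316bfb2b\"}"),
        mapper.readTree("{\"key\":\"airbyte_version\",\"value\":\"dev\"}"));

    // Original predicate: keeps only rows whose key equals the table name
    // "airbyte_metadata" -- no row ever matches, so every record is dropped.
    final List<JsonNode> buggy = records.stream()
        .filter(r -> r.get(METADATA_KEY_COL).asText().equals("airbyte_metadata"))
        .collect(Collectors.toList());
    System.out.println(buggy.size()); // 0 -- server_uuid is lost on import

    // Suggested predicate: drop only the deployment_id row, keep the rest.
    final List<JsonNode> fixed = records.stream()
        .filter(r -> !r.get(METADATA_KEY_COL).asText().equals(DEPLOYMENT_ID_KEY))
        .collect(Collectors.toList());
    System.out.println(fixed.size()); // 2 -- server_uuid survives
  }
}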

Contributor:

Fix here: #4977.
