
Generate seed connector specs on build #7501

Merged: 28 commits, Nov 3, 2021

Conversation

@lmossman (Contributor) commented Oct 29, 2021

What

Resolves #7142.

The result of this PR is that during the platform build, seed connector definitions are read from the resource files, and specs for them are fetched from GCS and written to corresponding resource files, unless the specs have already been written there. The goal is that specs for all seeded connector definitions are also available in the seed and they never need to be fetched during server startup.

How

This is achieved through a gradle build script that fetches the current seed definition and spec resource files, compares them to see which specs need to be fetched, fetches the specs, and writes them to the spec resource files.

Separately, a change is made to the YamlSeedConfigPersistence class to read the spec resource files along with the definition files, and merge the specs into the definitions before returning them.
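The merge step described above can be sketched roughly as follows. This is a simplified illustration, not the actual YamlSeedConfigPersistence code: definitions and specs are modeled as plain maps instead of Jackson JsonNodes, and the field names ("dockerRepository", "dockerImageTag", "dockerImage", "spec") mirror the seed YAML files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of merging seed specs into seed definitions.
public class SeedSpecMerger {

  // Index specs by docker image coordinate, then attach each spec to the
  // definition with the matching image:tag.
  public static List<Map<String, Object>> merge(
      final List<Map<String, Object>> definitions,
      final List<Map<String, Object>> specs) {
    final Map<String, Object> specByImage = new HashMap<>();
    for (final Map<String, Object> s : specs) {
      specByImage.put((String) s.get("dockerImage"), s.get("spec"));
    }
    final List<Map<String, Object>> merged = new ArrayList<>();
    for (final Map<String, Object> def : definitions) {
      final String image = def.get("dockerRepository") + ":" + def.get("dockerImageTag");
      final Map<String, Object> copy = new HashMap<>(def);
      copy.put("spec", specByImage.get(image));
      merged.add(copy);
    }
    return merged;
  }
}
```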

Recommended reading order

  1. SeedConnectorSpecGenerator.java / test
  2. airbyte-config/specs/build.gradle
  3. YamlSeedConfigPersistence.java

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/SUMMARY.md
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions
  • Connector added to connector index like described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions
  • Connector version bumped like described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here

Connector Generator

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed.

@github-actions bot added the area/platform and area/scheduler labels Oct 29, 2021
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class GcsBucketSpecFetcher {
Contributor Author (lmossman):
This class replaces the logic that used to live inside the GcsBucketSpecCacheSchedulerClient because this logic is also needed by the new SeedConnectorSpecGenerator script. Moving it was necessary to avoid a circular dependency.
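A fetcher like this mostly needs to turn a docker image coordinate into an object path in the spec-cache bucket. The sketch below shows only that path construction; the bucket layout (specs/<image-name>/<image-tag>/spec.json) is an assumption for illustration, not a statement of the actual bucket structure.

```java
// Sketch of how a spec fetcher might compute the GCS object path for a
// connector's cached spec. Layout is an illustrative assumption.
public class SpecPath {

  // Split "name:tag" on the last colon so registry hosts with ports still work.
  public static String forImage(final String dockerImage) {
    final int i = dockerImage.lastIndexOf(':');
    if (i < 0) {
      throw new IllegalArgumentException("expected <name>:<tag>, got: " + dockerImage);
    }
    final String name = dockerImage.substring(0, i);
    final String tag = dockerImage.substring(i + 1);
    return String.format("specs/%s/%s/spec.json", name, tag);
  }
}
```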

lmossman (Contributor Author) commented Oct 30, 2021

My only concern with these changes is that if someone is actively using an old version of a connector in the seed, then this logic may cause the spec of the most recent version of that connector to be loaded into the config database, which will cause the SpecFetcher to then return that updated spec, despite the fact that the old version of the docker container is actually being used in the connection.

Is this a valid concern?

cgardens (Contributor) left a comment:
Very nice! A bunch of cosmetic comments but otherwise looks good.

@sherifnada do you have an opinion on whether for seeds we should store specs as a single file or multiple files?

@@ -10,37 +10,37 @@
import static org.mockito.Mockito.verifyNoInteractions;
import static org.mockito.Mockito.when;

import io.airbyte.config.specs.GcsBucketSpecFetcher;
Contributor:

It seems like most of the tests in here should move to GcsBucketSpecFetcherTest, since this suite is mostly testing that logic. If you move the tests along with the functionality, all that's left here is some mocks sanity-checking that the client uses the fetcher sensibly.

Contributor Author (lmossman):

To me it actually seems like this test suite is not really testing the GcsBucketSpecFetcher logic, as it mocks out that class. These tests are checking if the calls are delegated to the next client in the chain if the call to the GCS fetcher fails.

However, your comment inspired me to write tests for GcsBucketSpecFetcher that truly test the logic in that class: https://github.com/airbytehq/airbyte/pull/7501/files#diff-7ba5d616dd88de6fdfcaaef8c81e0e68a42c28f09bf67d0de0266796f151a538R20

Contributor:

agreed! thank you!

Comment on lines 59 to 62
static {
SEED_ROOT_OPTION.setRequired(true);
OPTIONS.addOption(SEED_ROOT_OPTION);
}
Contributor:
I think we can avoid this static block if you declare OPTIONS after SEED_ROOT_OPTION and do something like:

  private static final Options OPTIONS = new Options().addOption(SEED_ROOT_OPTION);

Contributor Author (lmossman):

Also had to switch to using the Option builder so that I could set required to true at initialization time, but now the static block is gone!

return new DockerImageSpec().withDockerImage(dockerImage).withSpec(spec);
}

private static CommandLine parse(final String[] args) {
Contributor:

See Clis#parse to DRY this up. (just added to master): https://github.com/airbytehq/airbyte/pull/7301/files

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
Contributor:

good comment! mention fetching from the gcs bucket and that it will fail if it's not there and why we made that choice.

Contributor Author (lmossman):

Done 👍

Contributor:

I think the choice we made here to only pull from the bucket cache is specifically worth calling out; someone is going to come along and not understand why we did it. There's value in saying that, since this happens at build time, we depend on the bucket cache instead of needing to run docker containers during the build, which could be slow and unwieldy. If there is a failure, look in the bucket cache and figure out how to get the spec for the connector in there.

Contributor Author (lmossman):

That makes sense, I have updated the comment

Contributor:

nice comment

seedConnectorSpecGenerator.run(outputRoot, ConnectorType.DESTINATION);
}

public void run(final Path seedRoot, final ConnectorType connectorType) throws IOException {
Contributor:

the enum is a nice touch.

Contributor Author (lmossman):

Unfortunately, I just realized that I probably need to get rid of the enum. We want to reuse this class in the cloud build, and the seed definition files in cloud have slightly different names (source_definitions_mask.yaml vs source_definitions.yaml), so with the current code I am unable to pass the desired filenames to this class from cloud. And I'm guessing we don't want something like a CLOUD_SOURCE value in this enum, since it lives in OSS.

Contributor Author (lmossman):

Actually, I'll probably keep the enum and just change the signature of this run() method to take in those two filenames separately


Contributor Author (lmossman):

Arghhh, ignore all of my comments above. I was just looking at the wrong resources folder in cloud, the file names actually do match between both. Going to revert this change.
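The ConnectorType enum discussed in this thread might look roughly like the sketch below. The definition file names come from the comment above; the spec file names and accessor names are illustrative assumptions.

```java
// Illustrative sketch of a ConnectorType enum parameterizing the generator run.
public enum ConnectorType {

  SOURCE("source_definitions.yaml", "source_specs.yaml"),
  DESTINATION("destination_definitions.yaml", "destination_specs.yaml");

  private final String definitionFileName;
  private final String specFileName;

  ConnectorType(final String definitionFileName, final String specFileName) {
    this.definitionFileName = definitionFileName;
    this.specFileName = specFileName;
  }

  public String getDefinitionFileName() {
    return definitionFileName;
  }

  public String getSpecFileName() {
    return specFileName;
  }
}
```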


@VisibleForTesting
final List<DockerImageSpec> fetchUpdatedSeedSpecs(final JsonNode seedDefinitions, final JsonNode currentSeedSpecs) {
final List<String> seedDefinitionsDockerImages = MoreIterators.toList(seedDefinitions.elements()).stream()
Contributor:

by convention we generally put the .stream on the new line by itself when a stream chain is going to be broken up over multiple lines.

json -> new DockerImageSpec().withDockerImage(json.get(DOCKER_IMAGE_FIELD).asText())
.withSpec(Jsons.object(json.get(SPEC_FIELD), ConnectorSpecification.class))));

return seedDefinitionsDockerImages.stream()
Contributor:

this is a smart approach for handling filtering out old specs. very nice!
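The filtering approach being praised here can be sketched as follows. This is a rough illustration, not the real fetchUpdatedSeedSpecs: fetching is injected as a function so the sketch stays self-contained, and all names are illustrative. Cached specs whose image no longer appears in the definitions are dropped simply by never being copied over.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch: reuse a cached spec when the docker image is unchanged; fetch only
// for images with no cached entry (new connectors or bumped versions).
public class SpecReconciler {

  public static Map<String, String> reconcile(
      final List<String> definitionImages,
      final Map<String, String> cachedSpecsByImage,
      final Function<String, String> fetchSpec) {
    final Map<String, String> result = new LinkedHashMap<>();
    for (final String image : definitionImages) {
      final String spec = cachedSpecsByImage.containsKey(image)
          ? cachedSpecsByImage.get(image) // already in the seed spec file
          : fetchSpec.apply(image);       // new or version-bumped image
      result.put(image, spec);
    }
    return result;
  }
}
```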

assertEquals("https://docs.airbyte.io/integrations/destinations/s3", s3Destination.getDocumentationUrl());
assertEquals(URI.create("https://docs.airbyte.io/integrations/destinations/s3"), s3Destination.getSpec().getDocumentationUrl());

System.out.println(mysqlSource.getSpec());
Contributor:

get rid of the print statement

@@ -49,14 +51,18 @@ public void testGetConfig() throws Exception {
assertEquals(s3DestinationId, s3Destination.getDestinationDefinitionId().toString());
assertEquals("S3", s3Destination.getName());
assertEquals("airbyte/destination-s3", s3Destination.getDockerRepository());
assertEquals("https://docs.airbyte.io/integrations/destinations/s3", s3Destination.getDocumentationUrl());
Contributor:

lol. probably worthwhile to keep this line too. i know it is silly and duplicative but for now the system expects both.

Contributor Author (lmossman):

Whoops, meant to add not replace. Thanks

lmossman (Contributor Author) commented Nov 2, 2021

I have added a README to explain this process: https://github.com/airbytehq/airbyte/blob/lmossman/fetch-seed-specs-on-build/airbyte-config/specs/README.md
But happy to update any other documentation if necessary

lmossman (Contributor Author) commented Nov 2, 2021

@davinchia one other question about this: Charles and I also realized that there is a need to run the SeedConnectorSpecGenerator script that this PR adds in the cloud build. I started to work on a PR for this (here), but I realized that I need the :airbyte-config:init module that this PR adds to publish a jar so that the cloud build can depend on it.

I have found a couple of PRs that seem to be related to enabling/disabling jar publishing for submodules:
#5275
#5097
I don't fully understand the state of jar publishing at this time. Could you help me understand what work needs to be done to have a jar published for :airbyte-config:init so that the cloud build can depend on it?

tuliren (Contributor) commented Nov 2, 2021

What is the intention of the getNewFields logic in the DatabaseConfigPersistence class?

@lmossman, the rationale is that sometimes we want to add new fields to connector definitions. The sourceType in the source connector is an example. The getNewFields logic takes care of that. Without this update, the existing connector definitions can be invalid (i.e. missing a required field).
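The getNewFields idea described here can be sketched as a set difference over top-level field names. This is an illustrative model, not the actual DatabaseConfigPersistence code: definitions are represented as maps rather than JsonNodes, and the exclusion of "spec" reflects the fix agreed later in this thread (so that merging new fields never overwrites the spec of a pinned old connector version).

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: fields present in the latest seed definition but absent from the
// record currently in the database.
public class DefinitionFieldDiff {

  public static Set<String> getNewFields(final Map<String, Object> currentDefinition,
                                         final Map<String, Object> latestDefinition) {
    final Set<String> newFields = new HashSet<>(latestDefinition.keySet());
    newFields.removeAll(currentDefinition.keySet());
    // Temporary exclusion: never treat "spec" as a new field to merge.
    newFields.remove("spec");
    return newFields;
  }
}
```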

tuliren (Contributor) commented Nov 2, 2021

My only concern with these changes is that if someone is actively using an old version of a connector in the seed, then this logic may cause the spec of the most recent version of that connector to be loaded into the config database, which will cause the SpecFetcher to then return that updated spec, despite the fact that the old version of the docker container is actually being used in the connection.

Is this a valid concern?

I think this is a valid concern. But this code block will only add the new fields to the connector definition. It will not bump the version of the connector. So I think the spec should remain the same, thus not a problem.

lmossman (Contributor Author) commented Nov 2, 2021

After meeting with @tuliren and @cgardens, we agreed on a solution to the getNewFields issue - to ignore the spec field in that logic for now, and then after the faux major version bump we will remove the getNewFields logic entirely. Here is the commit to ignore it: f02e8a3

cgardens (Contributor) left a comment:
awesome!

@@ -366,7 +366,14 @@ void updateConfigsFromSeed(final DSLContext ctx, final ConfigPersistence seedCon

final ConnectorInfo connectorInfo = connectorRepositoryToIdVersionMap.get(repository);
final JsonNode currentDefinition = connectorInfo.definition;
final Set<String> newFields = getNewFields(currentDefinition, latestDefinition);

// todo (lmossman) - this logic to remove the "spec" field is temporary; it is necessary to avoid
Contributor:
thank you. excellent explanatory comment!

@@ -0,0 +1,16 @@
# Generating Seed Connector Specs
Contributor:
thank you for adding the readme!

davinchia (Contributor): Spoke to Lake offline on the publishing question.

lmossman (Contributor Author) commented Nov 3, 2021

Yep, Davin was able to help me understand how the cloud repo interacts with the airbyte-server jar which contains the dependencies as well, and I was able to test this out successfully!

@lmossman lmossman merged commit a534bb2 into master Nov 3, 2021
@lmossman lmossman deleted the lmossman/fetch-seed-specs-on-build branch November 3, 2021 05:03
lmossman added a commit that referenced this pull request Nov 3, 2021
lmossman added a commit that referenced this pull request Nov 3, 2021
@lmossman lmossman restored the lmossman/fetch-seed-specs-on-build branch November 3, 2021 22:19
schlattk pushed a commit to schlattk/airbyte that referenced this pull request Jan 4, 2022
* add specs module with logic to fetch specs on build

* format + build and add gradle dependency for new script

* check seed file for existing specs + refactor

* add tests + a bit more refactoring

* run gw format

* update yaml config persistence to merge specs into definitions

* add comment

* delete secrets migration to be consistent with master

* add dep

* add tests for GcsBucketSpecFetcher

* get rid of static block + format

* DRY up parse call

* add GCS details to comment

* formatting + fix test

* update comment

* do not format seed specs files

* change signature of run to allow cloud to reuse this script

* run gw format

* revert commits that change signature of run

* fix comment typo

Co-authored-by: Davin Chia <davinchia@gmail.com>

* rename enum to be distinct from the enum in cloud

* add missing dependencies between modules

* add readme for seed connector spec generator

* reword

* reference readme in comment

* ignore 'spec' field in newFields logic

Co-authored-by: Davin Chia <davinchia@gmail.com>
schlattk pushed a commit to schlattk/airbyte that referenced this pull request Jan 4, 2022
Development

Successfully merging this pull request may close these issues.

When building the platform, add the specs for each connector in the seed
5 participants