Generate seed connector specs on build #7501
Conversation
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class GcsBucketSpecFetcher {
This class replaces the logic that used to live inside the GcsBucketSpecCacheSchedulerClient, because this logic is also needed by the new SeedConnectorSpecGenerator script. Moving it was necessary to avoid a circular dependency.

My only concern with these changes: if someone is actively using an old version of a connector in the seed, this logic may cause the spec of the most recent version of that connector to be loaded into the config database. The SpecFetcher would then return that updated spec, even though the old version of the docker container is actually being used in the connection. Is this a valid concern?
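Setting the staleness question aside, the fetcher's core job is to look up a cached spec in the GCS bucket by docker image coordinate. A minimal sketch of that lookup key, assuming a `specs/<repository>/<tag>/spec.json` bucket layout (illustrative only, not necessarily Airbyte's actual layout):

```java
// Sketch of how a spec fetcher might locate a cached spec in a GCS bucket.
// The "specs/<repository>/<tag>/spec.json" layout is an assumption for
// illustration, not the actual Airbyte bucket layout.
public class GcsSpecPathSketch {

  // Builds the object key under which the spec for a given docker image
  // coordinate (e.g. "airbyte/source-mysql:0.4.0") would be cached.
  static String specPath(final String dockerImage) {
    final int idx = dockerImage.lastIndexOf(':');
    if (idx < 0) {
      throw new IllegalArgumentException("expected <repository>:<tag>, got: " + dockerImage);
    }
    final String repository = dockerImage.substring(0, idx);
    final String tag = dockerImage.substring(idx + 1);
    return String.join("/", "specs", repository, tag, "spec.json");
  }

  public static void main(final String[] args) {
    // prints "specs/airbyte/source-mysql/0.4.0/spec.json"
    System.out.println(specPath("airbyte/source-mysql:0.4.0"));
  }
}
```

Keying the cache by the full image coordinate is what makes the staleness concern above possible: a lookup for an old tag only works if that tag's spec was cached in the bucket.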
Very nice! A bunch of cosmetic comments but otherwise looks good.
@sherifnada do you have an opinion on whether for seeds we should store specs as a single file or multiple files?
@@ -10,37 +10,37 @@
import static org.mockito.Mockito.verifyNoInteractions;
import static org.mockito.Mockito.when;

import io.airbyte.config.specs.GcsBucketSpecFetcher;
it seems like most of the tests in here should move to GcsBucketSpecFetcherTest; this test suite is mostly testing that logic. I think if you move the tests along with the functionality, then all that's left here is some mocks sanity-checking that the client uses the fetcher sensibly.
To me it actually seems like this test suite is not really testing the GcsBucketSpecFetcher logic, as it mocks out that class. These tests are checking if the calls are delegated to the next client in the chain if the call to the GCS fetcher fails.
However, your comment inspired me to write tests for GcsBucketSpecFetcher that truly test the logic in that class: https://github.com/airbytehq/airbyte/pull/7501/files#diff-7ba5d616dd88de6fdfcaaef8c81e0e68a42c28f09bf67d0de0266796f151a538R20
agreed! thank you!
static {
  SEED_ROOT_OPTION.setRequired(true);
  OPTIONS.addOption(SEED_ROOT_OPTION);
}
I think we can avoid this static block if you declare OPTIONS after SEED_ROOT_OPTION and do something like:
private static final Options OPTIONS = new Options().addOption(SEED_ROOT_OPTION);
Also had to switch to using the Option builder so that I could set required to true at initialization time, but now the static block is gone!
return new DockerImageSpec().withDockerImage(dockerImage).withSpec(spec);
}

private static CommandLine parse(final String[] args) {
See Clis#parse to DRY this up (just added to master): https://github.com/airbytehq/airbyte/pull/7301/files
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
good comment! mention fetching from the gcs bucket and that it will fail if it's not there and why we made that choice.
Done 👍
i think the choice we made here to only pull from the bucket cache is worth mentioning specifically. someone is going to come along and not understand why we did it. there's value in saying: since this happens at build time, we depend on the bucket cache instead of needing to run docker containers during the build, which could be slow and unwieldy. if there is a failure, look in the bucket cache and figure out how to get the spec for the connector in there.
That makes sense, I have updated the comment
nice comment
seedConnectorSpecGenerator.run(outputRoot, ConnectorType.DESTINATION);
}

public void run(final Path seedRoot, final ConnectorType connectorType) throws IOException {
the enum is a nice touch.
Unfortunately, I just realized that I probably need to get rid of the enum. We want to reuse this class in the cloud build, and the seed definition files in cloud have slightly different names (source_definitions_mask.yaml vs source_definitions.yaml), so with the current code I am unable to pass the desired filenames to this class from cloud. And I'm guessing we don't want to have something like a CLOUD_SOURCE value in this enum, as it is in OSS.
Actually, I'll probably keep the enum and just change the signature of this run() method to take in those two filenames separately.
Here is the change: https://github.com/airbytehq/airbyte/pull/7501/files/59ad7e63d4b46e08119c81012ede079e6df9b054..81d33ee538f5b6f9561e9c446f8b09e2f6408564
Not as pretty as before, but at least cloud can properly use this now.
Arghhh, ignore all of my comments above. I was just looking at the wrong resources folder in cloud; the file names actually do match between both. Going to revert this change.
@VisibleForTesting
final List<DockerImageSpec> fetchUpdatedSeedSpecs(final JsonNode seedDefinitions, final JsonNode currentSeedSpecs) {
  final List<String> seedDefinitionsDockerImages = MoreIterators.toList(seedDefinitions.elements()).stream()
by convention we generally put the .stream on a new line by itself when a stream chain is going to be broken up over multiple lines.
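As a small self-contained illustration of that convention (the example data and method names here are made up for demonstration):

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class StreamStyleSketch {

  // When a stream chain spans multiple lines, ".stream()" goes on its own
  // line, followed by one chained call per line.
  static List<String> upperCase(final List<String> images) {
    return images
        .stream()
        .map(s -> s.toUpperCase(Locale.ROOT))
        .collect(Collectors.toList());
  }

  public static void main(final String[] args) {
    System.out.println(upperCase(List.of("airbyte/source-mysql")));
  }
}
```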
json -> new DockerImageSpec().withDockerImage(json.get(DOCKER_IMAGE_FIELD).asText())
    .withSpec(Jsons.object(json.get(SPEC_FIELD), ConnectorSpecification.class))));

return seedDefinitionsDockerImages.stream()
this is a smart approach for handling filtering out old specs. very nice!
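The filtering being praised here, fetching specs only for seed definitions whose docker image has no entry in the current seed spec file, can be sketched with plain collections standing in for the JSON nodes (the names below are illustrative, not the actual Airbyte API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SeedSpecFilterSketch {

  // Returns the docker images from the seed definitions that have no spec
  // recorded yet, i.e. the only ones worth fetching from the bucket.
  static List<String> imagesNeedingSpecs(final List<String> seedDefinitionImages,
                                         final Map<String, ?> currentSpecsByImage) {
    return seedDefinitionImages
        .stream()
        .filter(image -> !currentSpecsByImage.containsKey(image))
        .collect(Collectors.toList());
  }

  public static void main(final String[] args) {
    // Only destination-s3 is missing a cached spec, so only it is returned.
    System.out.println(imagesNeedingSpecs(
        List.of("airbyte/source-mysql:0.4.0", "airbyte/destination-s3:0.1.12"),
        Map.of("airbyte/source-mysql:0.4.0", "cached-spec")));
  }
}
```

Because unchanged images are filtered out up front, specs are fetched at most once per image tag, which is what keeps the build fast and the spec files stable.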
assertEquals("https://docs.airbyte.io/integrations/destinations/s3", s3Destination.getDocumentationUrl());
assertEquals(URI.create("https://docs.airbyte.io/integrations/destinations/s3"), s3Destination.getSpec().getDocumentationUrl());

System.out.println(mysqlSource.getSpec());
get rid of the print statement
@@ -49,14 +51,18 @@ public void testGetConfig() throws Exception {
assertEquals(s3DestinationId, s3Destination.getDestinationDefinitionId().toString());
assertEquals("S3", s3Destination.getName());
assertEquals("airbyte/destination-s3", s3Destination.getDockerRepository());
assertEquals("https://docs.airbyte.io/integrations/destinations/s3", s3Destination.getDocumentationUrl());
lol. probably worthwhile to keep this line too. i know it is silly and duplicative but for now the system expects both.
Whoops, meant to add not replace. Thanks
I have added a README to explain this process: https://github.com/airbytehq/airbyte/blob/lmossman/fetch-seed-specs-on-build/airbyte-config/specs/README.md
@davinchia one other question about this: Charles and I also realized that there is a need to run the SeedConnectorSpecGenerator script that this PR adds in the cloud build. I started to work on a PR for this (here), but I realized that I need the …

I have found a couple of PRs that seem to be related to enabling/disabling jar publishing for submodules:
@lmossman, the rationale is that sometimes we want to add new fields to connector definitions. The …
I think this is a valid concern, but this code block will only add the new fields to the connector definition; it will not bump the version of the connector. So I think the spec should remain the same, and thus this is not a problem.
awesome!
@@ -366,7 +366,14 @@ void updateConfigsFromSeed(final DSLContext ctx, final ConfigPersistence seedCon

final ConnectorInfo connectorInfo = connectorRepositoryToIdVersionMap.get(repository);
final JsonNode currentDefinition = connectorInfo.definition;
final Set<String> newFields = getNewFields(currentDefinition, latestDefinition);

// todo (lmossman) - this logic to remove the "spec" field is temporary; it is necessary to avoid
thank you. excellent explanatory comment!
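A rough sketch of the seed-update behavior discussed in this thread, copying fields that are new in the latest definition while ignoring "spec", with plain maps standing in for the JSON nodes (illustrative, not the actual getNewFields implementation):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NewFieldsSketch {

  // Fields present in the latest definition but absent from the current one,
  // ignoring "spec" so the seed spec file stays the source of truth for specs.
  static Set<String> getNewFields(final Map<String, ?> current, final Map<String, ?> latest) {
    final Set<String> newFields = new HashSet<>(latest.keySet());
    newFields.removeAll(current.keySet());
    newFields.remove("spec");
    return newFields;
  }

  // Copies only the new fields onto the current definition; the version of the
  // connector (and therefore its spec) is left untouched.
  static Map<String, Object> mergeNewFields(final Map<String, ?> current, final Map<String, ?> latest) {
    final Map<String, Object> merged = new HashMap<>(current);
    for (final String field : getNewFields(current, latest)) {
      merged.put(field, latest.get(field));
    }
    return merged;
  }

  public static void main(final String[] args) {
    final Map<String, String> current = Map.of("name", "S3");
    final Map<String, String> latest = Map.of(
        "name", "S3",
        "documentationUrl", "https://docs.airbyte.io/integrations/destinations/s3",
        "spec", "{...}");
    // documentationUrl is copied over; spec is deliberately skipped.
    System.out.println(mergeNewFields(current, latest).keySet());
  }
}
```

Skipping "spec" here is what makes the earlier point hold: new definition fields flow into the database without ever overwriting a spec for a connector version that is still in use.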
@@ -0,0 +1,16 @@
# Generating Seed Connector Specs
thank you for adding the readme!
Spoke to Lake offline on the publishing question.
Yep, Davin was able to help me understand how the cloud repo interacts with the airbyte-server jar which contains the dependencies as well, and I was able to test this out successfully!
* add specs module with logic to fetch specs on build
* format + build and add gradle dependency for new script
* check seed file for existing specs + refactor
* add tests + a bit more refactoring
* run gw format
* update yaml config persistence to merge specs into definitions
* add comment
* delete secrets migration to be consistent with master
* add dep
* add tests for GcsBucketSpecFetcher
* get rid of static block + format
* DRY up parse call
* add GCS details to comment
* formatting + fix test
* update comment
* do not format seed specs files
* change signature of run to allow cloud to reuse this script
* run gw format
* revert commits that change signature of run
* fix comment typo

Co-authored-by: Davin Chia <davinchia@gmail.com>

* rename enum to be distinct from the enum in cloud
* add missing dependencies between modules
* add readme for seed connector spec generator
* reword
* reference readme in comment
* ignore 'spec' field in newFields logic

Co-authored-by: Davin Chia <davinchia@gmail.com>
This reverts commit a534bb2.
What
Resolves #7142.
The result of this PR is that during the platform build, seed connector definitions are read from the resource files, and specs for them are fetched from GCS and written to corresponding resource files, unless the specs have already been written there. The goal is that specs for all seeded connector definitions are also available in the seed and they never need to be fetched during server startup.
How
This is achieved through a gradle build script that fetches the current seed definition and spec resource files, compares them to see which specs need to be fetched, fetches the specs, and writes them to the spec resource files.
Separately, a change is made to the YamlSeedConfigPersistence class to read the spec resource files along with the definition files, and merge the specs into the definitions before returning them.
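That merge step can be sketched as a join of definitions and specs on the docker image coordinate. Plain maps stand in for the YAML/JSON nodes, and the field names below are assumptions based on the discussion, not the actual YamlSeedConfigPersistence code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SeedMergeSketch {

  // Joins each seed definition with the spec cached for the same docker image,
  // so callers receive definitions that already contain their specs.
  static List<Map<String, Object>> mergeSpecsIntoDefinitions(final List<? extends Map<String, ?>> definitions,
                                                             final List<? extends Map<String, ?>> specs) {
    // Index specs by docker image, e.g. "airbyte/destination-s3:0.1.12".
    final Map<String, Object> specByImage = specs.stream()
        .collect(Collectors.toMap(s -> (String) s.get("dockerImage"), s -> (Object) s.get("spec")));

    final List<Map<String, Object>> merged = new ArrayList<>();
    for (final Map<String, ?> def : definitions) {
      final String image = def.get("dockerRepository") + ":" + def.get("dockerImageTag");
      final Map<String, Object> withSpec = new HashMap<>(def);
      withSpec.put("spec", specByImage.get(image));
      merged.add(withSpec);
    }
    return merged;
  }

  public static void main(final String[] args) {
    final List<Map<String, Object>> defs = List.of(
        Map.of("dockerRepository", "airbyte/destination-s3", "dockerImageTag", "0.1.12"));
    final List<Map<String, Object>> specs = List.of(
        Map.of("dockerImage", "airbyte/destination-s3:0.1.12", "spec", "SPEC"));
    // prints "SPEC": the definition comes back with its spec attached.
    System.out.println(mergeSpecsIntoDefinitions(defs, specs).get(0).get("spec"));
  }
}
```

Doing the join at read time keeps the definition and spec files independent on disk while still returning fully populated definitions, which is why the server never needs to fetch specs at startup.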
Recommended reading order
Pre-merge Checklist
Expand the relevant checklist and delete the others.
New Connector

Community member or Airbyter

- airbyte_secret
- ./gradlew :airbyte-integrations:connectors:<name>:integrationTest
- README.md
- bootstrap.md. See description and examples
- docs/SUMMARY.md
- docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
- docs/integrations/README.md
- airbyte-integrations/builds.md

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- /test connector=connectors/<name> command is passing
- /publish command described here

Updating a connector

Community member or Airbyter

- airbyte_secret
- ./gradlew :airbyte-integrations:connectors:<name>:integrationTest
- README.md
- bootstrap.md. See description and examples
- docs/integrations/<source or destination>/<name>.md including changelog. See changelog example

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- /test connector=connectors/<name> command is passing
- /publish command described here

Connector Generator

- Modules with -scaffold in their name have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates, then checking in your changes