airbyte-ci: consolidate the use of Gradle #32877


@alafanechere alafanechere commented Nov 28, 2023

What

Closes #32557
The format command did not use any Gradle caching logic, which could make it faster.
We already have such logic in the GradleTask step class.
This PR attempts to consolidate a single way to run Gradle tasks across multiple use cases.

How

  • Create a GradleTaskExecutor context manager class by copying the logic that exists in the GradleTask(Step) class.
    A context manager is helpful because it makes it easy to seed the gradle cache on setup (__aenter__) and persist the gradle cache on teardown (__aexit__).
  • Use it in format check/fix java
  • Use it in GradleTask(Step)

Benefits:

  • A more readable GradleTask(Step)
  • Cache sync for free thanks to the use of the context manager pattern
  • Reusable class for any future Gradle related use case
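
As a rough illustration of the context manager pattern described above, here is a minimal sketch in plain Python. It does not use dagger: FakeCacheVolume and CachedTaskExecutor are hypothetical stand-ins, not the actual GradleTaskExecutor. The cache is seeded in __aenter__ and persisted in __aexit__, so callers get the cache sync without writing it themselves.

```python
import asyncio


class FakeCacheVolume:
    """Hypothetical stand-in for a persistent cache volume (not a dagger API)."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}


class CachedTaskExecutor:
    """Illustrative only: seeds a working cache on enter, persists it on exit."""

    def __init__(self, volume: FakeCacheVolume) -> None:
        self.volume = volume
        self.gradle_home: dict[str, str] = {}

    async def __aenter__(self) -> "CachedTaskExecutor":
        # Setup: copy the cache volume into the working "gradle home".
        self.gradle_home = dict(self.volume.files)
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Teardown: write everything back so the next run starts warm.
        self.volume.files.update(self.gradle_home)

    async def run_task(self, task_name: str) -> str:
        # Pretend the task downloads one dependency unless it is already cached.
        key = f"{task_name}.jar"
        status = "cache hit" if key in self.gradle_home else "downloaded"
        self.gradle_home[key] = "bytes"
        return f"{task_name}: {status}"


async def main() -> list[str]:
    volume = FakeCacheVolume()
    results = []
    for _ in range(2):
        async with CachedTaskExecutor(volume) as executor:
            results.append(await executor.run_task("spotlessCheck"))
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this toy version the second run reports a cache hit because the first run's teardown wrote its files back to the volume.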


@alafanechere alafanechere marked this pull request as ready for review November 28, 2023 17:05
@alafanechere alafanechere force-pushed the augustin/11-28-airbyte-ci_consolidate_the_use_of_Gradle_btw_format_and_connector_test/build branch from 4b2b3a8 to 833288e Compare November 28, 2023 17:21
    stdout = await (await executor.run_task("spotlessCheck")).stdout()
    return CommandResult(check.commands["java"], status=StepStatus.SUCCESS, stdout=stdout)
except dagger.ExecError as e:
    return CommandResult(check.commands["java"], status=StepStatus.FAILURE, stderr=e.stderr, stdout=e.stdout, exc_info=e)

Contributor

Is it worth factoring this CommandResult constructor boilerplate? It's basically get_check_command_result but without the run_check call.

Contributor Author

Will do
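
One possible shape for the factoring suggested above, as a minimal sketch. StepStatus, CommandResult, and FakeExecError are simplified stand-ins defined here for illustration (the real classes and dagger.ExecError have more to them), and success_result and failure_result are hypothetical helper names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StepStatus(Enum):  # stand-in for the real enum
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class CommandResult:  # stand-in with only the fields used in the snippet
    command: str
    status: StepStatus
    stdout: Optional[str] = None
    stderr: Optional[str] = None
    exc_info: Optional[Exception] = None


class FakeExecError(Exception):  # stand-in for dagger.ExecError
    def __init__(self, stdout: str, stderr: str) -> None:
        super().__init__(stderr)
        self.stdout = stdout
        self.stderr = stderr


def success_result(command: str, stdout: str) -> CommandResult:
    """Build the result for a command that exited cleanly."""
    return CommandResult(command, status=StepStatus.SUCCESS, stdout=stdout)


def failure_result(command: str, error: FakeExecError) -> CommandResult:
    """Build the result for a command that raised an exec error."""
    return CommandResult(
        command,
        status=StepStatus.FAILURE,
        stdout=error.stdout,
        stderr=error.stderr,
        exc_info=error,
    )
```

With helpers like these, each try/except branch at a call site reduces to a single call.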

# Ensure that the .m2 directory exists.
f"mkdir -p {local_maven_repository_path}",
# Copy the gradle cache from the volume to the gradle home.
f"(rsync -a --stats --mkpath {gradle_dep_cache_path}/ {gradle_home_path} || true)",

Contributor

You can't break up the with_exec block formerly defined in with_whole_git_repo; it won't work, because dagger's layer caching will get in the way. It's so painful to get right, trust me.

Contributor Author

Oh, right.
So I should always chain rsync + gradle exec when running a gradle task, right?

sh_dash_c(
    [
        # Ensure that the .m2 directory exists.
        f"mkdir -p {self.LOCAL_MAVEN_REPOSITORY_PATH}",
        # Load from the cache volume.
        f"(rsync -a --stats --mkpath {self.GRADLE_DEP_CACHE_PATH}/ {self.GRADLE_HOME_PATH} || true)",
        # Resolve all dependencies and write their checksums to './gradle/verification-metadata.dryrun.xml'.
        self._get_gradle_command("help", *warm_dependency_cache_args),
        # Build the CDK and publish it to the local maven repository.
        self._get_gradle_command(":airbyte-cdk:java:airbyte-cdk:publishSnapshotIfNeeded"),
        # Store to the cache volume.
        f"(rsync -a --stats {self.GRADLE_HOME_PATH}/ {self.GRADLE_DEP_CACHE_PATH} || true)",
    ]
)

Contributor

More accurately: you want exactly that block above to run before running any other gradle task. Those subsequent tasks don't need to rsync because the S3 build cache will take over.
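
To illustrate the chaining discussed above, here is a minimal sketch that joins the cache load, the warm-up tasks, and the cache store into a single sh -c invocation, so that they would run as one exec rather than several. chain_in_one_shell, the three paths, and the ./gradlew entry point are assumptions made for illustration; the real sh_dash_c and _get_gradle_command helpers may behave differently.

```python
GRADLE_DEP_CACHE_PATH = "/root/gradle-cache"  # hypothetical path
GRADLE_HOME_PATH = "/root/.gradle"  # hypothetical path
LOCAL_MAVEN_REPOSITORY_PATH = "/root/.m2"  # hypothetical path


def chain_in_one_shell(commands: list[str]) -> list[str]:
    """Join commands with '&&' so they run in one shell and stop at the first failure."""
    return ["sh", "-c", " && ".join(commands)]


def warm_up_exec(warm_dependency_cache_args: list[str]) -> list[str]:
    """Build the argv for a single exec that loads, warms, and stores the cache."""
    gradle = "./gradlew"  # assumed entry point
    return chain_in_one_shell(
        [
            # Ensure that the .m2 directory exists.
            f"mkdir -p {LOCAL_MAVEN_REPOSITORY_PATH}",
            # Load from the cache volume; '|| true' keeps a failed rsync from aborting the chain.
            f"(rsync -a --stats --mkpath {GRADLE_DEP_CACHE_PATH}/ {GRADLE_HOME_PATH} || true)",
            # Resolve dependencies.
            f"{gradle} help {' '.join(warm_dependency_cache_args)}",
            # Build the CDK and publish it to the local maven repository.
            f"{gradle} :airbyte-cdk:java:airbyte-cdk:publishSnapshotIfNeeded",
            # Store to the cache volume.
            f"(rsync -a --stats {GRADLE_HOME_PATH}/ {GRADLE_DEP_CACHE_PATH} || true)",
        ]
    )
```

The resulting list could then be passed to a single exec call, which is the point of the comment above: the load, warm-up, and store steps stay together.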

@alafanechere alafanechere force-pushed the augustin/11-28-airbyte-ci_consolidate_the_use_of_Gradle_btw_format_and_connector_test/build branch 2 times, most recently from b676c6f to 2718129 Compare November 29, 2023 16:50
@alafanechere
Contributor Author

@postamar I believe I made the change you suggested.

  • The cache volume is only used to cache dependencies: no rsync happens outside of __aenter__
  • It's warmed up when entering the context manager (in __aenter__)
  • The cache rsync + gradle help + cdk install commands are chained

I made a slight change to the GradleTask(Step) logic.
We previously:

  1. Warmed up the deps cache by mounting the whole repo on the base gradle container
  2. Created a second container on which we mount the connector code and its dependencies + the local maven deps to run the final gradle task (build, test)

I changed it to:

  1. Provision the gradle container with a warmed-up dependency cache during __aenter__ of GradleTaskExecutor
  2. Run the final gradle task on this gradle container

Let me know if this change is impactful. So far the source-postgres tests are running correctly and in ~7 min.

@property
def dependency_cache_volume(self) -> dagger.CacheVolume:
    """This cache volume is for sharing gradle dependencies (jars and poms) across all pipeline runs."""
    return self.dagger_client.cache_volume(self.DEPENDENCY_CACHE_VOLUME_NAME)

Contributor

Just curious: why do we want to have this cache volume as a property when we're never going to reference it outside the class?

"""

rsync_cache_volume_to_gradle_home = f"(rsync -a --stats --mkpath {gradle_dep_cache_path}/ {gradle_home_path} || true)"
rsync_gradle_home_to_cache_volume = f"(rsync -a --stats {self.GRADLE_HOME_PATH}/ {self.GRADLE_DEP_CACHE_PATH} || true)"

Contributor

nit: does defining these strings as variables really improve readability? To me it's just more lines of code.


Returns:
Callable: A function which takes a gradle container and returns a gradle container with the S3 build cache credentials set as environment variables.
"""

Contributor

@postamar postamar Nov 29, 2023

Again, nobody is ever going to call this, nor should we encourage them to. Is this docstring really useful? Same comment applies elsewhere.


rsync_cache_volume_to_gradle_home = f"(rsync -a --stats --mkpath {gradle_dep_cache_path}/ {gradle_home_path} || true)"
rsync_gradle_home_to_cache_volume = f"(rsync -a --stats {self.GRADLE_HOME_PATH}/ {self.GRADLE_DEP_CACHE_PATH} || true)"
# Running a gradle task like "help" with these arguments will trigger updating all dependencies.

Contributor

Actually, it's --write-verification-metadata --dry-run which warms the cache. Gradle's CLI kind of sucks here.

Best to move the warm_dependency_cache_args = ["--write-verification-metadata", "sha256", "--dry-run"] if not self.local_execution else ["--dry-run"] in here; the last thing we want is people customizing this stuff. The whole point of this change is to reuse the same container for all gradle tasks, after all, and only the execution context should change things.
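
The expression quoted above could be kept internal along these lines. This is only a sketch: the function names are hypothetical, ./gradlew is an assumed entry point, and the real code builds its command through _get_gradle_command.

```python
def warm_dependency_cache_args(local_execution: bool) -> list[str]:
    """Return the Gradle arguments used to warm the dependency cache.

    Mirrors the expression quoted in the review comment: verification
    metadata is only written when not executing locally.
    """
    if local_execution:
        return ["--dry-run"]
    return ["--write-verification-metadata", "sha256", "--dry-run"]


def warm_up_command(local_execution: bool) -> str:
    # "./gradlew" is an assumed entry point; "help" is the task used in the thread.
    return " ".join(["./gradlew", "help", *warm_dependency_cache_args(local_execution)])
```

Only the execution context (local or not) changes the arguments; callers have no way to pass their own.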

dagger.Container: The gradle container with the task executed.
"""
if "gradlew" not in await self.gradle_container.directory(".").entries():
    raise Exception("gradlew not found in the current directory, please mount a directory with gradlew to the container.")

Contributor

Gratuitous complexity? The failure will become apparent rather quickly, won't it?

"""
if "gradlew" not in await self.gradle_container.directory(".").entries():
    raise Exception("gradlew not found in the current directory, please mount a directory with gradlew to the container.")
self.gradle_container = self.gradle_container.with_exec(sh_dash_c([self._get_gradle_command(task_name, *args)]))

Contributor

@postamar postamar Nov 29, 2023

I really don't think we want to mutate self.gradle_container here.

Contributor

@postamar postamar left a comment

The tests aren't happy but I haven't looked into that. I only have one blocking comment: I don't think self.gradle_container should be mutated.

Otherwise, I'd like to encourage you to trim down the complexity of this code by not anticipating future problems too much. The airbyte-ci codebase is dangerously large as it is.
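
For reference, a non-mutating variant of run_task might look roughly like the sketch below. FakeContainer is a hypothetical immutable stand-in rather than the dagger API, and Executor is not the PR's class; the only point is that run_task returns a new container and leaves self.gradle_container as it was.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeContainer:
    """Hypothetical immutable stand-in for a container; not the dagger API."""

    executed: tuple[str, ...] = ()

    def with_exec(self, command: str) -> "FakeContainer":
        # Returns a new object, leaving the original untouched.
        return replace(self, executed=self.executed + (command,))


class Executor:
    def __init__(self, base: FakeContainer) -> None:
        self.gradle_container = base

    def run_task(self, task_name: str) -> FakeContainer:
        # No assignment to self.gradle_container: each call starts from the same base.
        return self.gradle_container.with_exec(f"./gradlew {task_name}")


executor = Executor(FakeContainer(executed=("warm-up",)))
build = executor.run_task("build")
test = executor.run_task("test")
```

Here build and test each carry the warm-up step plus their own task, and neither sees the other's exec.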

build_and_publish_local_cdk = self._get_gradle_command(":airbyte-cdk:java:airbyte-cdk:publishSnapshotIfNeeded")

gradle_commands = [resolve_deps_and_write_checksum]
if build_cdk:

Contributor

Please remove this argument and always run the CDK build task. Gradle will do the right thing.

self.gradle_container = await (
    self.get_base_gradle_container()
    .with_(self.set_s3_build_cache_credentials(self.s3_build_cache_access_key_id, self.s3_build_cache_secret_key))
    .with_mounted_directory(self.workdir, self.gradle_project_dir)

Contributor

This is not going to do what we want. You want to maintain the old behaviour where the whole repo is mounted to download all the dependencies. This way, in practice, that task is a no-op. Otherwise the cache volume gets polluted from one run to the next depending on which connector you build and you end up downloading stuff, which is slow.

Please, maintain the existing behaviour exactly. It took me several days to get right because it's so, so full of gotchas.

@alafanechere alafanechere force-pushed the augustin/11-28-airbyte-ci_consolidate_the_use_of_Gradle_btw_format_and_connector_test/build branch from 7f60e4f to 2baf06b Compare December 1, 2023 10:44
@alafanechere alafanechere force-pushed the augustin/11-28-airbyte-ci_consolidate_the_use_of_Gradle_btw_format_and_connector_test/build branch from 2baf06b to f06d3a6 Compare December 1, 2023 10:45
@alafanechere alafanechere changed the title airbyte-ci: consolidate the use of Gradle btw format and connector test/build airbyte-ci: consolidate the use of Gradle Dec 1, 2023
@alafanechere
Contributor Author

@postamar I removed the changes related to format, as I definitely prefer the approach you implemented in #32999. I'll put this PR in draft mode. Feel free to iterate on it for the daggerization of gradle.yml.

@alafanechere alafanechere marked this pull request as draft December 1, 2023 10:47
@postamar
Contributor

postamar commented Dec 1, 2023

Sounds good! This certainly changes things. I'll get to it right now.

@postamar postamar dismissed their stale review December 1, 2023 17:20

dismissing my own review

@postamar
Contributor

postamar commented Dec 1, 2023

OK, I understand why we mutate self.gradle_container. Still not sure if I like it but I can't come up with anything better.

Anyway, here's what I came up with in light of the changes to airbyte-ci format.

Development

Successfully merging this pull request may close these issues.

can we cache the gradle binary in airbyte-ci?
2 participants