
airbyte-ci: better gradle caching #31535

Merged (30 commits) on Oct 23, 2023
Conversation

@postamar (Contributor) commented Oct 18, 2023

Informs #31439

This PR defines two cache volumes for java connector pipelines:

  • one which is persisted globally and is used to cache downloaded jars and poms, and nothing more, because we don't trust gradle too much and would prefer clean builds from source;
  • another which is private to the pipeline and whose purpose is to avoid duplicated effort across distTar, test and integrationTest.

Care is taken not to bust the dagger container layer cache, because updating the global cache requires mounting the whole git repo.
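For reference, the two volumes are wired up roughly like this in the pipeline code (a simplified sketch based on the snippets quoted in the review below; the class name and surrounding scaffolding are illustrative):

```python
from dagger import CacheVolume

class GradleTask:  # simplified sketch of the airbyte-ci gradle step
    @property
    def persistent_cache_volume(self) -> CacheVolume:
        """This cache volume is for sharing gradle state across all pipeline runs."""
        return self.context.dagger_client.cache_volume("gradle-persistent-cache")

    @property
    def connector_transient_cache_volume(self) -> CacheVolume:
        """This cache volume is for sharing gradle state across tasks within a single connector pipeline run."""
        return self.context.dagger_client.cache_volume(
            f"gradle-{self.context.connector.technical_name}-transient-cache"
        )
```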


@postamar marked this pull request as ready for review on October 18, 2023 03:15

@postamar (Author) commented Oct 18, 2023

This works locally but I haven't yet assessed its impact on source-postgres CI performance. I'm cautiously optimistic.
First run: https://github.com/airbytehq/airbyte/actions/runs/6555788875

╭─────────────────────── SOURCE-POSTGRES - TEST RESULTS ───────────────────────╮
│                                Steps results                                 │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Step                                             ┃ Result     ┃ Duration ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ Build connector tar                              │ Successful │ 04mn31s  │ │
│ │ Java Connector Unit Tests                        │ Successful │ 16mn16s  │ │
│ │ Build source-*** docker image for           │ Successful │ 16.04s   │ │
│ │ platform(s) linux/x86_64                         │            │          │ │
│ │ Acceptance tests                                 │ Successful │ 02mn52s  │ │
│ │ Java Connector Integration Tests                 │ Successful │ 06mn32s  │ │
│ │ Validate metadata for source-***            │ Successful │ 31.09s   │ │
│ │ Connector version semver check                   │ Successful │ 0.30s    │ │
│ │ Connector version increment check                │ Skipped    │ 0.00s    │ │
│ │ QA checks                                        │ Successful │ 39.29s   │ │
│ └──────────────────────────────────────────────────┴────────────┴──────────┘ │
│ ℹ️  You can find more details with step executions logs in the saved HTML     │
│ report.                                                                      │
╰────────── ⏲️  Total pipeline duration for source-***: 21mn12s ───────────╯

Second run: https://github.com/airbytehq/airbyte/actions/runs/6559827735

╭─────────────────────── SOURCE-POSTGRES - TEST RESULTS ───────────────────────╮
│                                Steps results                                 │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Step                                             ┃ Result     ┃ Duration ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ Build connector tar                              │ Successful │ 03mn56s  │ │
│ │ Java Connector Unit Tests                        │ Successful │ 15mn59s  │ │
│ │ Build source-*** docker image for           │ Successful │ 10.02s   │ │
│ │ platform(s) linux/x86_64                         │            │          │ │
│ │ Acceptance tests                                 │ Successful │ 02mn52s  │ │
│ │ Java Connector Integration Tests                 │ Successful │ 06mn21s  │ │
│ │ Validate metadata for source-***            │ Successful │ 4.04s    │ │
│ │ Connector version semver check                   │ Successful │ 0.32s    │ │
│ │ Connector version increment check                │ Skipped    │ 0.00s    │ │
│ │ QA checks                                        │ Successful │ 3.28s    │ │
│ └──────────────────────────────────────────────────┴────────────┴──────────┘ │
│ ℹ️  You can find more details with step executions logs in the saved HTML     │
│ report.                                                                      │
╰────────── ⏲️  Total pipeline duration for source-***: 20mn03s ───────────╯

edit: unsurprisingly, the second run didn't run on the same runner as the first, so it started with a cold cache again. The penalty for this is fairly low, on the order of 2m30. (edit edit: I'm less sure now.) A couple of days after merging this change we'll want to check that the caches are usually warm. This is the case when running airbyte-ci locally, so I'm not worried about whether the cache works. Note that prior to this change it didn't (because we mounted the cache volume before mounting the source files).

@alafanechere (Contributor) left a comment

I need a couple of clarifications and would like to see a performance boost before approving.
I believe you did not notice a performance boost because we have to enable remote caching of your new persistent volume here to get a cross-runner remote cache.

My understanding so far is that:

  1. This change will allow us to cache java dependencies across pipelines, avoiding the costly downloads.
  2. The gradle build cache is only used for the duration of a connector pipeline. Connectors sharing the same tasks won't benefit from incremental build caching.

I'd like to understand the reasoning behind 2 a bit more. Do you plan on enabling the S3 remote cache to get a shared build cache?

Comment on lines 41 to 42
"""This cache volume is for sharing gradle state across all pipeline runs."""
return self.context.dagger_client.cache_volume("gradle-persistent-cache")
@alafanechere:
The actual cross-pipeline-run persistence of this cache can be enabled by adding gradle-persistent-cache to this list.

@postamar:

Perhaps it would be simpler to rename this volume to gradle-cache then. WDYT?

@alafanechere:

Can we rename it to gradle-dependency-cache?

@postamar:

OK I'll open a PR in airbyte-infra.

@postamar:

@property
def connector_transient_cache_volume(self) -> CacheVolume:
"""This cache volume is for sharing gradle state across tasks within a single connector pipeline run."""
return self.context.dagger_client.cache_volume(f"gradle-{self.context.connector.technical_name}-transient-cache")
@alafanechere:

Can we add a pipeline identifier (like the git_revision) to the cache volume name?
If dagger ever enables automatic cross-pipeline-run persistence of cache volumes, this will guarantee the volume is only used for a single commit.
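For illustration, the suggestion could look something like this (a sketch; a git_revision attribute on the context is an assumption, not an existing airbyte-ci API):

```python
@property
def connector_transient_cache_volume(self) -> CacheVolume:
    """Cache volume scoped to a single connector pipeline run on a single commit."""
    # git_revision is a hypothetical context attribute used to scope the volume to one commit.
    name = f"gradle-{self.context.connector.technical_name}-{self.context.git_revision}-transient-cache"
    return self.context.dagger_client.cache_volume(name)
```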

@postamar:

I have no objections. We can discuss this offline but I was under the impression that mounting this volume as PRIVATE would be enough. On second thought, perhaps not.

Comment on lines 45 to 47
def connector_transient_cache_volume(self) -> CacheVolume:
"""This cache volume is for sharing gradle state across tasks within a single connector pipeline run."""
return self.context.dagger_client.cache_volume(f"gradle-{self.context.connector.technical_name}-transient-cache")
@alafanechere commented Oct 18, 2023:

I'm concerned that using a "per pipeline" build cache will disable remote incremental build caching that could benefit multiple connectors on different pipelines.
E.g. with our current implementation, I thought that if a pipeline builds the java base it would seed the cache, and subsequent pipelines (on other connectors) would hit the cache and not rebuild the java base.

An alternative approach to support remote caching of incremental builds would be to use the gradle S3 remote cache.

@postamar:

I don't think this concern is applicable now that we have the CDK. The "java base" you mention is now the CDK, and when building a connector we just pull the jars. Regarding more aggressive caching, see my top-level response.

with_whole_git_repo = (
gradle_container_base
# Mount the whole repo.
.with_mounted_directory("/airbyte", self.context.get_repo_dir("."))
@alafanechere:

The last time I tried mounting the whole repo without include, it was a very long operation when running locally, due to the number and size of files to mount into the dagger build context.
Can we only mount the "gradle projects"?

@postamar:

In my experience it wasn't noticeably long, a couple of seconds at most. I didn't look into it too deeply. In any case, yes, let's only mount directories which either have a build.gradle file or recursively contain a subdirectory with one. Is there a shorthand for this? (See the sketch below.)
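For illustration, a helper along these lines could build the include list; gradle_project_dirs is a hypothetical name, not an existing airbyte-ci function:

```python
from pathlib import Path

def gradle_project_dirs(repo_root: str) -> list[str]:
    """Relative paths of all directories that directly contain a build.gradle file."""
    return sorted(
        str(build_file.parent.relative_to(repo_root))
        for build_file in Path(repo_root).rglob("build.gradle")
    )

# The result could then feed the mount, e.g.:
# self.context.get_repo_dir(".", include=gradle_project_dirs("."))
```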

@alafanechere:

No shorthand, probably a glob like here

@postamar:

I don't think a glob is going to cut it. Are you OK with me doing this change in a follow-up PR?

@alafanechere:

Yep!

# Mount the whole repo.
.with_mounted_directory("/airbyte", self.context.get_repo_dir("."))
# Mount the cache volume for the gradle cache which is persisted throughout all pipeline runs.
# We deliberately don't mount any cache volumes before mounting the git repo otherwise these will effectively be always empty.
@alafanechere:

Why would it be empty, given that it doesn't share a path with /airbyte?

@postamar:

I honestly don't know why. Perhaps it's a dagger bug? In any case, if you mount the cache volume before the repo, that container layer gets aggressively cached by the dagger engine (as it should) and never gets rebuilt for subsequent airbyte-ci runs (which makes sense); but, somehow, this means that the cache volume is always empty on my laptop. Mounting it after the repo, whose layer gets busted by any source file change, makes it perform as expected.

Presumably the same thing happens in our runners. It's weird and unexpected for sure.
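A minimal sketch of the ordering that worked, per the observation above (dagger Python SDK; names are illustrative):

```python
container = (
    gradle_container_base
    # The repo mount layer is rebuilt whenever source files change...
    .with_mounted_directory("/airbyte", self.context.get_repo_dir("."))
    # ...which forces this later cache-volume mount to be re-executed too,
    # so the volume's contents are actually visible instead of a stale empty layer.
    .with_mounted_cache("/root/.gradle", self.persistent_cache_volume)
)
```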

Comment on lines 177 to 179
.with_mounted_directory("/airbyte", self.context.get_repo_dir(".", include=include))
# Mount the sources for the connector and its dependencies.
.with_mounted_directory(str(self.context.connector.code_directory), await self.context.get_connector_dir())
@alafanechere:

In order to avoid doubling the mount operation from the host FS to the container, I think we can get the directory from with_whole_git_repo:
.with_mounted_directory("/airbyte", with_whole_git_repo.directory("/airbyte", include=include))

@postamar:

Good idea! I hadn't thought about the performance impact of this.

@postamar:

See comment below for why I couldn't do this in the end.

Comment on lines 177 to 179
.with_mounted_directory("/airbyte", self.context.get_repo_dir(".", include=include))
# Mount the sources for the connector and its dependencies.
.with_mounted_directory(str(self.context.connector.code_directory), await self.context.get_connector_dir())
@alafanechere:

Don't we already have the connector code at this stage, since we already mounted the full repo in the previous with_mounted_directory operation?

@postamar:

No, we deliberately don't base this container off the container with the whole repo mounted.

The reason for this is that mounting the whole repo is only necessary to, more or less, download all the jars and compile the CDK. These outputs change far less often than the git repo itself.

Consider the following scenario: you create a PR which makes changes to two unrelated connectors. We'll have airbyte-ci run and the following will happen:

  • All the layers in with_whole_git_repo need to be rebuilt; that's normal, the git repo is different than on master.
  • All the layers in gradle_container also need to be rebuilt, as expected.

Now, let's say you push a commit which only changes one of the two connectors and airbyte-ci runs on the same runner:

  • All the layers in with_whole_git_repo need to be rebuilt; that's normal, the git repo is different again.
  • The resulting /root/.gradle and /root/.m2 directories are rebuilt, but they end up the same as in the previous airbyte-ci run, because there were no changes to the dependencies or to the CDK sources.
  • For the connector which was not changed, the layers in gradle_container can be re-used.

What this means is that the layers for the connector container (with the partial git repo mount) are more likely to be reused.

@postamar:

To be clear: this is mostly speculation on my part and this deserves a test. I'll try this out on my machine to confirm!

@postamar:

As it turns out, this speculation was wrong. I rewrote this half a dozen times and still couldn't get /root/.gradle to be reused, even when paring it down to /root/.gradle/caches/modules-2 as explained in https://docs.gradle.org/current/userguide/dependency_resolution.html#sub:cache_copy and below.
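For context, persisting just the dependency portion of the gradle home would look something like this (a sketch; the volume name is made up):

```python
# modules-2 holds the downloaded jars and poms plus their metadata.
container = container.with_mounted_cache(
    "/root/.gradle/caches/modules-2",
    self.context.dagger_client.cache_volume("gradle-modules-2-cache"),
)
```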

@postamar:

I was able to get this to work in the end, but it required mounting connector sources from the host instead of from the other container!

# This will download gradle itself, a bunch of poms and jars, compile the gradle plugins, configure tasks, etc.
.with_exec(self._get_gradle_command(":airbyte-cdk:java:airbyte-cdk:publishSnapshotIfNeeded"))
# Mount the cache volume for the transient gradle cache used for this connector only.
# This volume is PRIVATE meaning it exists only for the duration of the dagger pipeline.
@alafanechere:

Dagger has a pipeline terminology that differs from what we call a pipeline.

It's not clear to me whether PRIVATE sharing means that the cache volume will exist for the duration of a dagger session or for the duration of a sub-pipeline. We currently have Step sub-pipelines.
If a private cache volume only exists for the duration of a sub-pipeline, it will defeat your purpose of sharing the cache across steps.
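For reference, the sharing mode in question is set on the mount itself in dagger's Python SDK; a minimal sketch:

```python
import dagger

container = container.with_mounted_cache(
    "/root/.gradle",
    self.connector_transient_cache_volume,
    # PRIVATE keeps the volume for a single build pipeline instead of sharing it.
    sharing=dagger.CacheSharingMode.PRIVATE,
)
```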

@postamar:

Interesting. I did find the caching to work, but now I'm starting to think that it only worked accidentally.

@alafanechere:

What do you think about writing an integration test that would help us actually test the caching behavior? Something using a dummy GradleTask subclass and running gradle commands on a dummy gradle project. It'd be helpful to validate our caching assumptions and prevent regressions. (See the sketch below.)
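Something along these lines, perhaps (a rough sketch; DummyGradleTask, make_dummy_gradle_project and run_task are hypothetical helpers, not existing airbyte-ci APIs):

```python
import pytest

@pytest.mark.anyio
async def test_gradle_cache_is_reused_across_steps(dagger_client, tmp_path):
    project = make_dummy_gradle_project(tmp_path)  # writes a trivial build.gradle
    first = await run_task(DummyGradleTask, dagger_client, project)
    second = await run_task(DummyGradleTask, dagger_client, project)
    # If the cache volume really survives across steps, the second run should hit it.
    assert "FROM-CACHE" in second.logs or "UP-TO-DATE" in second.logs
```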

@postamar:

That sounds like a good idea. Do we already have similar kinds of integration tests in airbyte-ci?

@alafanechere:

Kind of. In this test I wrote some assertions about caching / cache busting.

@postamar:

Thanks!

@postamar (Author):

Thanks for this review!

My understanding so far is that: ...

Your understanding matches mine. I think that for CI purposes a clean build is always favourable. In other words, if we can build from source, and the source is in the repo, then we should.

I think if we can get this scheme to work then we can do without the S3 cache. Configuring the gradle build and downloading the jars are what's expensive, and we should be comfortable caching those steps.

@postamar (Author):

The magicache cache sync isn't enabled yet, so there's no point in running this more than once, but here's a run on source-postgres: https://github.com/airbytehq/airbyte/actions/runs/6565091506

@alafanechere (Contributor) commented Oct 19, 2023

@postamar I re-ran the test pipeline:
https://github.com/airbytehq/airbyte/actions/runs/6573520735

Expectation: this will seed the persistent dependency cache.

We should trigger a second run a couple of minutes after it succeeds and check whether we get a performance boost: we expect the deps to be cached in the remote volume.

First run duration: 23m 5s
Second run: 22mn

@postamar (Author):

@alafanechere I actually opened up a PR based on this one to test these assumptions: #31580

I looked at your runs also. Something's still off. The dependency cache is handled correctly from what I can tell, but the transient cache isn't. I think I know why.

@postamar (Author):

I'm fed up with my own two-dagger-cache scheme. For a start, it doesn't work. Furthermore, I'm worried it's unmaintainable. It's certainly untestable. So I'm getting rid of it.

@octavia-squidington-iv requested a review from a team on October 20, 2023 03:45
@airbyte-oss-build-runner (Collaborator):

source-paypal-transaction test report (commit d057b84600) - ✅

⏲️ Total pipeline duration: 59.00s

Step Result
Build source-paypal-transaction docker image for platform(s) linux/x86_64
Acceptance tests
Code format checks
Validate metadata for source-paypal-transaction
Connector version semver check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-paypal-transaction test

@airbyte-oss-build-runner (Collaborator):

source-paypal-transaction test report (commit 3bf9fcfe8a) - ✅

⏲️ Total pipeline duration: 01mn20s

Step Result
Build source-paypal-transaction docker image for platform(s) linux/x86_64
Acceptance tests
Code format checks
Validate metadata for source-paypal-transaction
Connector version semver check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-paypal-transaction test

@c-p-b (Contributor) commented Oct 20, 2023

When I looked at the second run (in the comment above) I found that it had run for longer because the deps cache was cold.

I've gotten to the bottom of why the dagger cache volume so often appears to be empty: it's because it is! As you know, the dagger engine relies on buildkit, which periodically performs garbage collection to get rid of orphaned container layers and so forth, presumably including cache volumes. This volume contains many gigabytes of data, and the gradle layers are in all likelihood equally fat. It's worth noting that the cache gets emptied less often on my laptop than on the runners; perhaps that's because the engine is less active on my machine?

The default buildkit policies are documented here: https://docs.docker.com/build/cache/garbage-collection/ (I know this is a docker page, but it's the same for buildkit). I don't know whether the dagger engine uses the same default buildkit policy, but we should probably edit whatever policy dagger uses to suit our needs: we want to hang on to the cache volumes as long as possible, bearing in mind that we build lots of fat layers many times every day.

cc @cpdeethree @alafanechere

edit: here's another run to test that the pipeline works: https://github.com/airbytehq/airbyte/actions/runs/6583132357

There is a setting in the buildkit engine related to this. I believe it's called keepBytes. I currently have it set to 50%; we could bump that up to 90% or so.

@postamar (Author):

Indeed, keepBytes is one of them, defined in [[worker.oci.gcpolicy]] sections. It's the same structure as the doc I linked, and I believe there's more to it than that one setting. I'll open up a PR in airbyte-infra and we can discuss there.
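For the sake of illustration, such a section might look like this (values made up; whether dagger consumes exactly the buildkitd.toml schema is an assumption):

```toml
# hypothetical engine config fragment
[worker.oci]
  gc = true
  [[worker.oci.gcpolicy]]
    # keep as much of the cache around as the disk allows
    all = true
    keepBytes = 500000000000  # ~500GB; tune to runner disk size
```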

@postamar (Author):

Running https://github.com/airbytehq/airbyte/actions/runs/6592254879 to validate.

This should be good to go once https://github.com/airbytehq/airbyte-infra/pull/63 is approved, merged and deployed in prod.

@alafanechere (Contributor) left a comment

My main concern is about setting the env var on the container in the mounted_connector_secrets function

@postamar (Author):

Let's see where this goes.

@postamar merged commit 9835f6b into master on Oct 23, 2023
22 checks passed
@postamar deleted the postamar/better-gradle-caching branch on October 23, 2023 15:26
ariesgun pushed a commit to ariesgun/airbyte that referenced this pull request Oct 23, 2023
Co-authored-by: postamar <postamar@users.noreply.github.com>