
Make spotless tasks all cacheable with the Gradle Build Cache #280

Closed
JLLeitschuh opened this issue Aug 9, 2018 · 15 comments


@JLLeitschuh
Member

This should be as simple as adding the @CacheableTask annotation to all of the relevant tasks.
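For context, the change under discussion is roughly this (a hedged sketch, not the actual Spotless source; the class body is elided):

```java
import org.gradle.api.DefaultTask;
import org.gradle.api.tasks.CacheableTask;

// Sketch only: @CacheableTask tells Gradle it may store and restore this
// task's declared outputs in the build cache, keyed by its declared inputs.
@CacheableTask
public class SpotlessTask extends DefaultTask {
    // ... existing task inputs, outputs, and actions ...
}
```

The annotation alone is only sound if the task's inputs and outputs are modelled correctly, which is exactly what the rest of this thread digs into.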

@nedtwigg
Member

nedtwigg commented Aug 9, 2018

I'm skeptical that this will be a performance improvement. I'd want to see a benchmark before merging.

@JLLeitschuh
Member Author

I'm guessing the performance benefits won't be large on a project with few sources, but on a project with thousands of files, the build cache should provide a nice short circuit for an initial checkout and run of spotlessCheck.

@nedtwigg
Member

nedtwigg commented Jan 2, 2020

I'm still skeptical, but thanks to @jitpack, it's now trivially easy for any curious party to run an integration-test benchmark and find out. Make the one-line change, push to any public GitHub repo, and you can test it like so.

@bigdaz
Contributor

bigdaz commented Apr 2, 2020

@nedtwigg We have a customer that is currently working on optimizing their build using Gradle Enterprise. At this stage, when they run gradle clean build, the majority of the time is spent in com.diffplug.gradle.spotless.SpotlessTask. That means clean build runs in 4s instead of 1s, because everything else is available in the local cache.

I can share a GE screenshot directly with you if necessary.

@nedtwigg
Member

nedtwigg commented Apr 2, 2020

I did a quick investigation, and it looks like we can probably support this use case by bumping our minimum-required Gradle to 3.0, and we can certainly support it if we bump to 3.5. We have to go all the way to 4.9 to get completely native configuration-avoidance APIs, which is the other sticking point we've had.

No need for a screenshot; a Gradle engineer with a real customer use case is proof enough for me that there's value in bumping the minimum. I'm definitely on board with bumping to 3.0 (3.5 if necessary), and if this heavy-duty use case can also show a measurable benefit from 4.9, then I'm okay with that too. Implementing this is not high on my apocalypse-todo-list, but releasing good PRs quickly is always high on my todo list, even in an apocalypse.

@bigdaz
Contributor

bigdaz commented Apr 13, 2020

Thanks @nedtwigg. I took a deeper look, and it's not going to be as simple as I'd hoped. It looks like SpotlessTask is used in two different "modes": check and apply, but in both cases the source input files are also registered as output files for the task.

I don't think this model is really correct:

  • when doing a spotlessJavaCheck, the java source files are inputs to the task, but they should not be part of the task outputs.
  • when doing a spotlessJavaApply, the java source files are indeed both input and output of the task.

Caching for the check case makes sense, with the cache key being the hash of all of the source files, plus any other input parameters.
Caching for the apply case doesn't really make sense: since the task modifies its inputs in place, it's not sensible to cache the outputs for a set of inputs.

What do you think? Would it make sense to split these two cases into two separate task implementations, allowing the inputs/outputs to be modelled more correctly?

@nedtwigg
Member

I agree that it doesn't make sense to cache apply. For check, the only output that can be cached is effectively Unit: either the code is formatted correctly or it is not.

I also agree that Spotless does not model the task correctly. I'm grateful to have a Gradle dev look at it; the reason for the weird model we chose is a limitation (as I understand it) in the Gradle incremental API.

The downside of breaking it into two tasks is that you duplicate incremental-change tracking in check vs apply. For example, say you run check on a big project with no cache; it chugs for 3s and then passes. Then, after changing no files, you run apply. That apply ought to know "since check is now up-to-date, so am I" and complete immediately, and as implemented today that happens. It's easy to create a situation where check is instantaneous for a clean checkout of master but day-to-day performance is strictly worse, because check and apply can no longer take advantage of each other's up-to-date knowledge.

A best-of-both-worlds answer might be this:

  • define @CacheableTask SpotlessTask as f(config, files) = filenames_with_dirty_formatting, and it writes out those filenames in a newline-delimited file build/spotless-dirty.txt as its only output file.
  • SpotlessCheckTask depends on SpotlessTask, and its only input is build/spotless-dirty.txt, and it just throws an error if it's not empty.
  • SpotlessApplyTask depends on SpotlessTask, and its only input is also build/spotless-dirty.txt, but it will modify those files (although they aren't declared as outputs), which will trigger the incremental mechanism in @CacheableTask SpotlessTask on the next run.

The hard part of the structure I outlined above is that @CacheableTask SpotlessTask will have to use build/spotless-dirty.txt as an input (not just an output), because it needs to remember that just because a file wasn't changed (and thus wasn't passed along by the Gradle incremental update system), that doesn't mean it's now clean. To keep caching happy, it probably cannot be listed as an official input. This might open up a complicated bug, but it seems doable...

I'm happy to split SpotlessTask, but I think it's important for apply and check to be in the same "up-to-date" pass, since they are linked so tightly (don't want to optimize for fresh checkout performance at the expense of normal usage performance).
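The three-task structure above can be sketched in plain Java (illustrative names only — `SpotlessSketch`, `findDirty`, `runWorker`, and `runCheck` are hypothetical, not the Spotless API, and the Gradle task wiring is omitted):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.function.UnaryOperator;

public class SpotlessSketch {
    // f(config, files) = filenames_with_dirty_formatting:
    // a file is "dirty" when running the formatter over it changes its content.
    static List<String> findDirty(Map<String, String> files, UnaryOperator<String> formatter) {
        List<String> dirty = new ArrayList<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            if (!e.getValue().equals(formatter.apply(e.getValue()))) {
                dirty.add(e.getKey());
            }
        }
        return dirty;
    }

    // The cacheable worker step: writes the dirty list, newline-delimited,
    // as its only output file (build/spotless-dirty.txt in the proposal).
    static void runWorker(Map<String, String> files, UnaryOperator<String> formatter,
                          Path dirtyList) throws IOException {
        Files.createDirectories(dirtyList.getParent());
        Files.write(dirtyList, findDirty(files, formatter));
    }

    // The check step: its only input is the dirty list; fail if non-empty.
    static void runCheck(Path dirtyList) throws IOException {
        List<String> dirty = Files.readAllLines(dirtyList);
        if (!dirty.isEmpty()) {
            throw new IllegalStateException("Formatting violations: " + dirty);
        }
    }
}
```

Because the dirty-file list is the worker's only output, it is cheap to cache, and the check result is derived entirely from that one small file.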

@bigdaz
Contributor

bigdaz commented Apr 15, 2020

Thanks for the detailed analysis.

First up, I would caution against making overly strong assumptions when optimizing for "normal usage". In my experience, people tend to use tools in many varied and unexpected ways!

  • I'm seeing fresh checkouts and ephemeral build agents used more and more frequently. (Yesterday I encountered a second customer who had 8s of a 16s build taken up by SpotlessTask.)
  • A user may work on a small part of a large multi-project build: whenever retrieving other changes from VCS, they would benefit greatly from caching of checks for other modules.
  • It's possible that people run check hundreds of times more often than apply.

Proposed solution

I think the solution you outline could work. What I like about this mechanism is that there's no single task that has the same inputs and outputs. However, if I understand correctly, one downside is that the analysis for any badly formatted file would need to be executed twice: once to generate the spotless-dirty.txt file and again to actually modify the target files.

An alternative

Another possible solution would be for the main SpotlessTask to do the analysis for each changed file and generate a patch to correct any file that has bad formatting. These patch files would be regular build outputs, and their generation would be easily cacheable. The check task would then simply fail if any patch files were present, and the apply task would apply the changes to the main source tree. Neither the check nor the apply task would need to be cacheable.

Instead of using patch files, it would also be possible for the main SpotlessTask to simply write the "fixed" content for each bad file into its build output directory. Then, the apply task could simply copy these files into the main source tree.

WDYT?
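The second variant — writing corrected content into the build directory and letting apply copy it back — might look like this (a plain-Java sketch with hypothetical names, assuming a flat source directory and omitting all Gradle wiring):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.function.UnaryOperator;

public class SpotlessPatchSketch {
    // Worker: for each badly formatted file, write the corrected content
    // under outDir. Clean files produce no output, so an empty outDir
    // means the whole tree is clean and check can pass immediately.
    static void runWorker(Path srcDir, Path outDir, UnaryOperator<String> formatter)
            throws IOException {
        Files.createDirectories(outDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(srcDir)) {
            for (Path src : files) {
                String content = Files.readString(src);
                String fixed = formatter.apply(content);
                if (!fixed.equals(content)) {
                    Files.writeString(outDir.resolve(src.getFileName()), fixed);
                }
            }
        }
    }

    // Apply: copy every corrected file from outDir back over the original.
    static void runApply(Path srcDir, Path outDir) throws IOException {
        try (DirectoryStream<Path> fixed = Files.newDirectoryStream(outDir)) {
            for (Path f : fixed) {
                Files.copy(f, srcDir.resolve(f.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```

Note the trade-off discussed below: writing output only for dirty files keeps the cached artifact small, but makes Gradle's up-to-date checking of the "maybe a file, maybe nothing" output less straightforward.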

@nedtwigg
Member

^^^ All excellent points. A few comments:

I love the patch idea, but I think just writing out the plain fixed content is probably easier to implement, debug, and maintain.

If you're writing out the "clean" state for every file, whether it is dirty or not, then you're copying the entire source tree on every run of apply or check, and caching it as an output too (not just hashing it as an input). If you try to optimize by not writing a file for clean files, as a way to signal that they are clean, then up-to-date checking gets tricky. Gradle handles the 1:1 transformation of .java -> .class very well, but .java -> either clean.java or nothing is a little less straightforward. For that latter case, you might still need build/spotless-dirty.txt to track that yourself.

My instinct is that the formatter runs fast and the disk is slow, but your caching results show that I'm wrong about that. I'm sure a Gradle engineer has better heuristics around that than me. I'm always happy to merge performance-enhancing PRs with a real-world use case that shows a speed increase. I definitely believe that caching build/spotless-dirty.txt will improve initial-checkout performance, with a small hit to "single-changed-file" performance, which is already very fast. I'm a little skeptical that caching the entire source tree as an input and an output is a performance improvement, but I would take your word for it if you told me that it was.

@nedtwigg
Member

nedtwigg commented May 4, 2020

An issue that we're going to have (it just popped up in #559) is that most of our FormatterStep cache keys are machine-dependent at the moment. The task reorganization that @bigdaz has proposed is a definite improvement, and will make the local buildcache useful to users. But in order for Spotless to be path-independent (necessary for a shared buildcache to be useful), we need to make FileSignature path- and machine-independent. I think we should discuss and implement this separately; the issue lives at #566.

@nedtwigg
Member

Just did a little experiment. Now that @bigdaz has implemented #576, it makes sense to mark SpotlessTask as @CacheableTask. And lucky for us, we can do that, and Gradle 2.x doesn't seem to mind. So we don't have to bump the minimum Gradle to support the buildcache. However, the buildcache will not work for the vast majority of cases until #566 is resolved. So rather than thrash bandwidth on the buildcache where we know the keys will never match, we might as well wait for #566 to get resolved. So whatever PR solves #566 can just add that one annotation and we should be good!

@bigdaz
Contributor

bigdaz commented May 14, 2020

However, the buildcache will not work for the vast majority of cases until #566 is resolved

Do you mean that even the local cache doesn't work in many instances?
From what I've seen, running clean builds with a populated local cache is a common use case. In many cases there's no real reason for clean (it's hard to wean users off clean!). Another common use case involving the local cache is switching between local branches.

@nedtwigg
Member

Local cache will work in all instances. My impression was that the value came from using the remote cache results on dev builds. I assumed it would be expensive to fill a remote build cache with entries that will never hit anywhere besides the CI checkout path they were built on.

But it's very easy to turn on, so here it is! 971e32c. Hopefully we'll have a fix soon, so that even if we are thrashing a cache, we won't be for long.

@nedtwigg nedtwigg pinned this issue May 14, 2020
@nedtwigg
Member

The local build cache works as of plugin-gradle 4.0.0. The remote build-cache will always miss until #566 is resolved.

@nedtwigg nedtwigg unpinned this issue Jun 4, 2020
@nedtwigg
Member

nedtwigg commented Jul 2, 2020

As of plugin-gradle 4.5.0, the remote buildcache should be working. If anyone has problems with cache misses that they think should hit, please let us know in this issue.

@nedtwigg nedtwigg closed this as completed Jul 2, 2020