
Standardize usage of machine-wide caches #3930

Open
lihaoyi opened this issue Nov 9, 2024 · 22 comments

Comments

@lihaoyi
Member

lihaoyi commented Nov 9, 2024

In the past we only had Coursier managing its own machine-global cache for Maven Central, but the list of such caches is starting to grow:

  • Custom Java version support needs somewhere to put the JVMs
  • Android support needs somewhere to put the Android SDK and other heavyweight tools
  • The example PythonModule needs somewhere to cache its PIP downloads
  • The example TypescriptModule could use somewhere to cache its NPM downloads
  • Once the filesystem-independent out/ folder layout lands, we could begin sharing out/ folder results between projects, and that would also need some standard place to put the artifacts

The basic idea is that when someone calls clean, they usually don't want to start all the way from scratch, re-downloading every jar from Maven Central and re-downloading their JVM. These external downloads are typically fetched once and cached forever, and only very rarely do people want to clean them, compared to how often they need to clean local build outputs.

We should try to standardize how these "global" cached downloads are handled, so people can plug into the standard rather than creatively coming up with their own solutions that end up half-baked or inconsistent.

@0xnm
Contributor

0xnm commented Nov 12, 2024

My 2c on this: such cache structures vary a lot between ecosystems, because each ecosystem dictates its own structure.

For example, in the case of Maven dependencies, the standard file cache layout includes the artifact version as part of the artifact's full path, so it is easy to keep different artifact versions out-of-project, somewhere on the machine.

If we look at NPM dependencies, by contrast, node_modules doesn't support hosting different versions of the same artifact by default (unless package aliasing is used? or maybe workspaces can help?). In that case such a cache can hardly be machine-wide and becomes more project-wide (especially since some dependencies are built-in, e.g. test reporting in Kotlin/JS, and versions can differ between Mill installations). Now, say we have a project with several JS modules: it wouldn't be very wise to download the same NPM dependency several times (once per module), so it probably makes sense to have some folder at the root of the project, but there is no:

  • built-in API to access the root module (or root out folder). The closest way to access it with the current API (instead of using T.dest, which is bound to the task) will probably be something like the following. Update: okay, it seems T.workspace can access the root, so it would be:
val dir = T.workspace / "out" / "js"
  • way to ensure there is only one job downloading a particular dependency, rather than several in parallel. For example, if Module A declares a dependency foo and Module B declares the same dependency foo, there shouldn't be parallel downloads: if Module A is already downloading foo, Module B should wait and consume the result instead of launching its own download job (is this the case currently with Maven dependency downloads?).

@alexarchambault
Contributor

  • if Module A is already downloading foo, Module B should wait and consume the result instead of shooting its own download job (is it the case currently with Maven dependencies download?)

For coursier (which handles Maven dependencies), it's fine / safe to have several tasks or threads attempt to download the same file. Only one of them will actually download the file, while the others wait for it (and may download other things of their own in parallel); another may take over if the one doing the download is interrupted (resuming the download if the repository supports it, rather than redoing it from zero).
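To illustrate this safe-concurrency property, here is a sketch only: it assumes coursier's FileCache API as shown further down in this thread, and the Guava URL is just an arbitrary example:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import coursier.cache.FileCache
import coursier.util.Artifact

val cache = FileCache()
val url =
  "https://repo1.maven.org/maven2/com/google/guava/guava/33.0.0-jre/guava-33.0.0-jre.jar"

// Two tasks request the same URL concurrently. coursier's file locking
// ensures only one of them actually downloads; the other waits and then
// picks up the same cached file.
val downloads = Future.sequence(Seq.fill(2)(Future {
  cache.file(Artifact(url)).run.unsafeRun()(cache.ec)
}))
val results = Await.result(downloads, Duration.Inf)
// Both results point at the same path under the coursier cache
```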

@alexarchambault
Contributor

We addressed many parallel download issues triggered by Mill's heavy use of concurrency recently 😅; it seems to be working fine now.

@alexarchambault
Contributor

Wanted to briefly mention a few details about the coursier cache, following the short discussion in #4066.

Unlike other Maven dependency resolvers (Maven itself, Gradle, Ivy), the cache of coursier is quite generic, and simply caches artifacts using their URL as the key. To illustrate with the coursier CLI cs, the get sub-command lets you download a file and print its path in the cache:

$ cs get https://repo1.maven.org/maven2/com/google/cloud/libraries-bom/26.50.0/libraries-bom-26.50.0.pom
~/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/com/google/cloud/libraries-bom/26.50.0/libraries-bom-26.50.0.pom

This works well as most artifacts are immutable. Mutable artifacts have to be flagged as "changing" upfront (internally, coursier does this for snapshot artifacts and version listings), and a TTL is used to decide whether to check for new content:

$ cs get --changing https://repo1.maven.org/maven2/org/scala-lang/scala-library/maven-metadata.xml
~/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/maven-metadata.xml
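From code, a rough equivalent of cs get might look like this (a sketch based on the cache API, with error handling reduced to .toTry.get):

```scala
import coursier.cache.FileCache
import coursier.util.Artifact

val cache = FileCache()
val url =
  "https://repo1.maven.org/maven2/com/google/cloud/libraries-bom/26.50.0/libraries-bom-26.50.0.pom"
// Downloads the file on first use, then keeps returning the cached copy,
// with the URL itself acting as the cache key
val file = cache.file(Artifact(url)).run
  .unsafeRun()(cache.ec)
  .toTry.get
println(file) // path under the coursier cache, keyed by the URL
```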

@alexarchambault
Contributor

alexarchambault commented Dec 3, 2024

AFAIK, Nix has a quite generic cache mechanism too, except it uses a checksum as key rather than a URL (not sure Nix can be easily used on all platforms from Mill though)

@alexarchambault
Contributor

And Bazel must offer similar features too

@lihaoyi
Member Author

lihaoyi commented Dec 4, 2024

I think it would make sense to offer this as a built in feature using Coursier.

  • Keying on URLs is generally fine, since [Cool URLs Don't Change](https://www.w3.org/Provider/Style/URI)
  • Coursier naturally has to provide all the bells and whistles around concurrency, interruption, etc. that would be a lot of work to re-implement in Mill.
  • We can generally be confident in the robustness of the coursier implementation, since it is heavily exercised via dependency resolution in Mill's build and test suite, more so than any ad-hoc implementation we come up with.

I think to make this a reality, a few steps are necessary:

  1. Write reference documentation for programmatic use of the coursier cache API somewhere. https://get-coursier.io/docs/cache only covers the behavior of the cache but not how to actually use it from code. We could add the API usage docs there, or on mill-build.org, but it needs to live somewhere

  2. Decide whether to expose Coursier's API as-is or wrap it in a Mill wrapper helper. This depends largely on how well Coursier's Cache API fits the com-lihaoyi style that Mill and its libraries are written in (simple function calls and objects, lots of default parameters, minimal builder or factory or config objects)

  3. Review and discuss the exposed API and semantics to make sure it is what we want it to look like. Stable APIs in Mill are restricted from changing, so we should get it right the first time rather than needing to wait a year for the next breaking release to fix issues that could have been prevented up front

  4. Write an example test in example/fundamentals/libraries/ with documentation and include it on fundamentals/bundled-libraries.adoc together with usage docs for other bundled libraries

  5. Update the example/javalib/dependencies/4-downloading-unmanaged-jars examples to use this new API rather than requests.get directly

It's a bunch of work but seems straightforward. Let's hold off on #4066 until we've done this, and then we can dogfood our new APIs to download the test repos rather than doing it ad-hoc.

@alexarchambault probably makes sense for you to take the lead on this whenever you have some spare cycles
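As a strawman for step 2, a com-lihaoyi-style wrapper might look something like the sketch below. Every name here (GlobalCache, download, downloadArchive) is hypothetical, not an existing Mill or coursier API:

```scala
import scala.concurrent.duration.FiniteDuration

// Hypothetical wrapper: plain function calls with default parameters,
// no builder/factory/config objects. Names are illustrative only.
object GlobalCache {
  /** Download `url` into the machine-wide cache and return its local path. */
  def download(
      url: String,
      changing: Boolean = false,
      ttl: Option[FiniteDuration] = None
  ): os.Path = ???

  /** Like `download`, but also extract the archive and return the directory. */
  def downloadArchive(url: String, changing: Boolean = false): os.Path = ???
}

// Usage from a task:
def toolchain = Task {
  PathRef(GlobalCache.downloadArchive("https://example.com/toolchain.zip"))
}
```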

@jodersky
Member

jodersky commented Dec 4, 2024

I'd like to add a feature request to this: allow arbitrary postprocessing of a file before caching it. For example, if I download a zip archive, I'd like to unzip it before caching the result.

@jodersky
Member

jodersky commented Dec 4, 2024

I'm just thinking about this in general, and this came to mind.

If we had reproducible out/ folders (#4065), would it make sense to allow the results to be cached in a global directory? We could do this optionally by introducing a Task.Global{} (or something like that), where the result would be cached in a well-known location, under some checksum, and the dest folder would also be preserved there. We'd need some locking mechanism as well, but I think that this way we could share results "purely" with mill. For example, we could continue using requests.get and do arbitrary post-processing in tasks, and mark the task to use a global cache.


// Task.Global is not saved in `out/`, but in `~/.cache/mill/<some checksum>`
def tool = Task.Global {
  val archive = requests.get("...")
  unzip(archive, T.dest) // T.dest in a global task is also saved in the global cache
  ...
  PathRef(T.dest)
}

// global tasks can be reused like any other, except they'll be shared across mill builds on a machine, and maybe across many machines in the future
def invokeTool = Task.Command {
  exec(tool(), "foo", "bar")
}

It would also be a step in the direction of supporting remote caches.

I'm not quite sure how we'd address the cache in a way that does not depend on the task name, but if we figure that it out I think we could try this approach.

Edit: I just saw there's already been a similar discussion in #2101.


@lihaoyi
Member Author

lihaoyi commented Dec 4, 2024

@jodersky I think the main reason I would want to separate project-level caches in the out/ folder from global caches in ~/.cache/mill is the level of reliability:

  • I expect project-level caches to get into bad states every once in a while and need a clean
  • I expect global caches to never get into bad states and need to be cleaned

Thus it has always made sense to be able to put things in the out/ folder that are cheap-ish to re-compute and occasionally get thrown away, while expensive-to-download stuff that never gets re-computed goes in ~/.cache/mill. Mill tries to blur the two a bit by making the stuff in out/ more reliable (than SBT/Maven/Gradle at least), and thus less likely to need to be cleaned, but I don't think I can say it's reliable enough to match the stability of e.g. the global coursier artifact cache.

@jodersky
Member

jodersky commented Dec 4, 2024

I agree, that's why I was thinking of specifically marking tasks somehow that would be suited for global caching. Maybe this mechanism could then also be used to mark which tasks should be cached remotely, when we add that feature.

Expanding on that, to guarantee more reliability, we could create a separate type of serializer for global tasks, and selectively implement which return types we accept for globally cached tasks. For instance, I think that even with #4065, allowing arbitrary PathRefs in a global cache seems risky. @lefou floated the idea of a "ContentRef" type in #2101, which we could use to represent the content of all of T.dest. It seems like it could be a good fit to get started, and would be much more resistant to accidental corruption.

@lihaoyi
Member Author

lihaoyi commented Dec 4, 2024

Yeah, it's possible. The remote caching POC (#2777) used a Mill query to select which tasks you wanted to remote-cache, but there are other ways we could annotate it as well. Now that we've started using Task(foo = true){ ... } to pass flags to tasks, we could use that approach to mark remote-cacheable tasks in code rather than separately from the CLI.
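A sketch of that in-code marking follows; the remoteCache flag is hypothetical, named by analogy with the existing Task(persistent = true) style:

```scala
// Hypothetical: opt a task into global/remote caching via a Task flag.
// `remoteCache` does not exist today; `persistent` is the existing precedent.
def bigToolchain = Task(remoteCache = true) {
  val archive = T.dest / "tool.zip"
  os.write(archive, requests.get.stream("https://example.com/tool.zip"))
  // ... deterministic post-processing would go here ...
  PathRef(T.dest)
}
```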

@sequencer
Contributor

I personally would like an option in Mill that downloads nothing itself (sandboxing), but instead exposes the dependencies in a machine-readable way so that an external tool can download them and put them in the correct folder. That way we can safely cache them and reuse them on different clean machines.

The current state is that Mill always assumes it has a network connection, which is far from sandboxing. Sometimes Ivy downloads are not even reproducible at all, and we need to cache them with an input hash (see here).

Eventually what we expect is an out directory that depends only on the build.sc, and we use nix to take build.sc as input. For the Maven dependencies, we'd like Mill to dump the entire dependency list, so we can use coursier to download and cache each of them, then expose the nix cache directory to Mill.

@sideeffffect
Contributor

sideeffffect commented Dec 5, 2024

@jodersky

allow arbitrary postprocessing of a file before caching it. For example, if I download a zip archive, I'd like to unzip it before caching the result

I think I have good news regarding this. Among the many cool things Coursier can do, it downloads/"installs" various JVM distributions. And those are certainly distributed as zip or other archives. So it must be possible already today -- Coursier already does this.

https://get-coursier.io/docs/cli-java

@lihaoyi
Member Author

lihaoyi commented Dec 5, 2024

One thing we may need to be concerned about for post-processing is where this post-processing is defined, how it is keyed for uniqueness, and how error-prone it is. The nice thing about the dumb "download URL" workflow is that there is not much config to think about, it gives a great cache key to start with (the URL itself), and it rarely fails (and can always be retried).

Technically the downloaded contents can vary based on HTTP headers and all that (e.g. around authentication), but generally downloading stuff from public URLs is pretty straightforward

If we start allowing arbitrary post-processing, we need some way to hash the post-processing logic to add it to the cache key (maybe using methodCodeHashSignatures?), and we need to ensure the post-processing is appropriately configurable (I guess if it's arbitrary code, that's already the case?) and that it doesn't fail or cause the caches to get into a bad state. These are probably solvable problems, but definitely things we need to be careful about if we choose to go beyond the dumb "download URL" approach to the global caches.
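For illustration, combining the URL with a fingerprint of the post-processing logic (e.g. something derived from methodCodeHashSignatures) into a single cache key could look like this sketch:

```scala
import java.security.MessageDigest

// Illustration: the cache key covers both the URL and a fingerprint of the
// post-processing code, so changing the logic invalidates the cached entry.
def cacheKey(url: String, postProcessHash: String): String = {
  val md = MessageDigest.getInstance("SHA-256")
  md.update(url.getBytes("UTF-8"))
  md.update(postProcessHash.getBytes("UTF-8"))
  md.digest().map("%02x".format(_)).mkString
}
```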

@jodersky
Member

jodersky commented Dec 5, 2024 via email

@lihaoyi
Member Author

lihaoyi commented Dec 8, 2024

Apart from the download (and maybe unpack) kinds of caching, we also need some story for what to do with external tools outside of our control that require cache locations. PIP and NPM are two examples we're facing right now, but I'm sure there'll be others.
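One low-tech option might be to point each tool's own cache at a subdirectory of a standard Mill cache root. In the sketch below, PIP_CACHE_DIR and npm_config_cache are the tools' real cache-location environment variables, while millCacheRoot and the helper functions are hypothetical:

```scala
// Sketch: redirect external tools' caches into a shared Mill cache root.
def millCacheRoot: os.Path = os.home / ".cache" / "mill"

def pipInstall(pkgs: Seq[String]) =
  os.proc("pip", "install", pkgs)
    .call(env = Map("PIP_CACHE_DIR" -> (millCacheRoot / "pip").toString))

def npmInstall(workingDir: os.Path) =
  os.proc("npm", "install")
    .call(cwd = workingDir,
          env = Map("npm_config_cache" -> (millCacheRoot / "npm").toString))
```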

@alexarchambault
Contributor

I'd like to add a feature request to this: allow arbitrary postprocessing of a file before caching it. For example, if I download a zip archive, I'd like to unzip it before caching the result.

As @sideeffffect mentioned, this is already supported by coursier. Support for that was added a while ago, for JVM handling.

From the command-line, one can pass --archive to cs get to download and extract an archive:

$ cs get https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip
~/Library/Caches/Coursier/v1/https/cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip

$ cs get https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip --archive
~/Library/Caches/Coursier/arc/https/cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip

$ ls "$(cs get https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip --archive)"
LICENSE.gpl2        LICENSE.lgpl3       bin                 libexec
LICENSE.gpl3        README.md           include             x86_64-linux-cosmo
LICENSE.lgpl2       aarch64-linux-cosmo lib

(note the arc instead of v1 in the path)

From the API, one needs to use coursier.cache.ArchiveCache:

import coursier.cache.ArchiveCache
import coursier.util.Artifact
val cache = ArchiveCache()
val dir = cache.get(Artifact("https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip"))
  .unsafeRun()(cache.cache.ec).toTry.get
os.list(os.Path(dir))

@alexarchambault
Contributor

For arbitrary post-processing, I wouldn't recommend modifying files or directories that coursier returns, but… some changes should be fine: making a file executable, adding new files in an extracted archive, etc. Once an archive is extracted, coursier doesn't check for added (or removed) files.
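For example, an additive tweak like marking extracted binaries executable should be safe. A sketch, where markBinariesExecutable is just an illustrative helper:

```scala
// Sketch: additive post-processing on a directory returned by ArchiveCache.
// coursier doesn't re-check extracted archives for added files or changed
// permissions, so this won't be undone on the next cache hit.
def markBinariesExecutable(extractedDir: os.Path): Unit =
  for (bin <- os.list(extractedDir / "bin") if os.isFile(bin))
    os.perms.set(bin, "rwxr-xr-x")
```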

@alexarchambault
Contributor

I think to make this a reality, a few steps are necessary:

  1. Write reference documentation for programmatic use of the coursier cache API somewhere. https://get-coursier.io/docs/cache only covers the behavior of the cache but not how to actually use it from code. We could add the API usage docs there, or on mill-build.org, but it needs to live somewhere

I agree more API docs are needed (sometimes, even any API docs at all)

@alexarchambault
Contributor

alexarchambault commented Dec 18, 2024

2. Decide whether to expose Coursier's API as-is or wrap it in a Mill wrapper helper. This depends largely on how well Coursier's Cache API fits the com-lihaoyi style that Mill and its libraries are written in (simple function calls and objects, lots of default parameters, minimal builder or factory or config objects)

I wouldn't say it fits the com-lihaoyi style, currently. It needs a lot of polishing, at a minimum. Here is how one can enable logging with progress bars, in the archive example above:

import coursier.cache.{ArchiveCache, FileCache}
import coursier.cache.loggers.RefreshLogger
import coursier.util.Artifact

val cache = FileCache().withLogger(RefreshLogger.create())
val archiveCache = ArchiveCache().withCache(cache)
val dir = archiveCache.get(Artifact("https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip"))
  .unsafeRun()(cache.ec).toTry.get
os.list(os.Path(dir))

Default values were despised in the first years of coursier (which was actually a good thing in the core of the resolver, to avoid accidentally using a default value). Then I started using data classes a lot, so there are a lot of such data classes around. But there are very few method calls with default values. And coursier has its own IO type (wrapping scala.concurrent.Future), coursier.util.Task, which is useful internally, but less so for end users.
