
Standardize usage of machine-wide caches #3930

Open
lihaoyi opened this issue Nov 9, 2024 · 22 comments

Comments

@lihaoyi
Member

lihaoyi commented Nov 9, 2024

In the past we only had Coursier managing its own machine-global cache for Maven Central, but the list of such caches is starting to grow:

  • Custom Java version support needs somewhere to put the JVMs
  • Android support needs somewhere to put the Android SDK and other heavyweight tools
  • The example PythonModule needs somewhere to cache its PIP downloads
  • The example TypescriptModule could use somewhere to cache its NPM downloads
  • Once the filesystem-independent out/ folder layout lands, we could begin sharing out/ folder results between projects, and that would also need some standard place to put the artifacts

The basic idea is that when someone calls clean, they usually don't want to start all the way from scratch, re-downloading every jar from Maven Central and re-downloading their JVM. These external downloads are typically fetched once and cached forever, and only very rarely do people want to clean them, compared to how often they need to clean local build outputs.

We should try to standardize how these "global" cached downloads are handled, so people can plug into the standard rather than creatively coming up with their own solutions that end up half-baked or inconsistent.

@0xnm
Contributor

0xnm commented Nov 12, 2024

My 2c on this: such cache structures vary a lot between ecosystems, because each ecosystem dictates its own structure.

For example, in the case of Maven dependencies, the standard file cache layout includes the artifact version as part of the artifact's full path, so it is easy to keep different artifact versions out-of-project, somewhere on the machine.

If we look at NPM dependencies, by contrast, node_modules doesn't support hosting different versions of the same artifact by default (unless package aliasing is used? or maybe workspaces can help?). In that case such a cache can hardly be machine-wide and becomes more project-wide (especially since some dependencies are built-in, e.g. test reporting in Kotlin/JS, and versions can differ between Mill installations). Now, say we have a project with several JS modules: it wouldn't be very wise to download the same NPM dependency several times (once per module), so it probably makes sense to have some folder at the root of the project, but there is no:

  • built-in API to access the root module (or root out folder). The closest way to access it with the current API (instead of using T.dest, which is bound to the task) will probably be something like the following. Update: okay, it seems T.workspace can access the root, so it would be:
val dir = T.workspace / "out" / "js"
  • way to ensure there is only one job downloading a particular dependency, rather than several in parallel. For example, if Module A declares a dependency foo and Module B declares the same dependency foo, there shouldn't be parallel downloads: if Module A is already downloading foo, Module B should wait and consume the result instead of launching its own download job (is this the case currently with Maven dependency downloads?).

@alexarchambault
Contributor

  • if Module A is already downloading foo, Module B should wait and consume the result instead of shooting its own download job (is it the case currently with Maven dependencies download?)

For coursier (which handles Maven dependencies), it's fine / safe to have several tasks or threads attempt to download the same file. Only one of them will actually download the file, while the others wait for it (and may download other things of their own in parallel); another may take over if the one doing the download is interrupted (resuming the download if the repository supports it, rather than redoing it from zero).
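To illustrate this safe-concurrency property, here is a sketch only: it assumes coursier's FileCache API as shown further down in this thread, and the Guava URL is just an arbitrary example:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import coursier.cache.FileCache
import coursier.util.Artifact

val cache = FileCache()
val url =
  "https://repo1.maven.org/maven2/com/google/guava/guava/33.0.0-jre/guava-33.0.0-jre.jar"

// Two tasks request the same URL concurrently. coursier's file locking
// ensures only one of them actually downloads; the other waits and then
// picks up the same cached file.
val downloads = Future.sequence(Seq.fill(2)(Future {
  cache.file(Artifact(url)).run.unsafeRun()(cache.ec)
}))
val results = Await.result(downloads, Duration.Inf)
// Both results point at the same path under the coursier cache
```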

@alexarchambault
Contributor

We addressed many parallel download issues triggered by Mill's heavy use of concurrency recently 😅; it seems to be working fine now.

@alexarchambault
Contributor

Wanted to briefly mention a few details about the coursier cache, following the short discussion in #4066.

Unlike other Maven dependency resolvers (Maven itself, Gradle, Ivy), the cache of coursier is quite generic, and simply caches artifacts using their URL as the key. To illustrate with the coursier CLI cs, the get sub-command lets you download a file and print its path in the cache:

$ cs get https://repo1.maven.org/maven2/com/google/cloud/libraries-bom/26.50.0/libraries-bom-26.50.0.pom
~/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/com/google/cloud/libraries-bom/26.50.0/libraries-bom-26.50.0.pom

This works well as most artifacts are immutable. Mutable artifacts have to be flagged as "changing" upfront (internally, coursier does this for snapshot artifacts and version listings), and a TTL is used to decide whether to check for new content:

$ cs get --changing https://repo1.maven.org/maven2/org/scala-lang/scala-library/maven-metadata.xml
~/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/maven-metadata.xml
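From code, a rough equivalent of cs get might look like this (a sketch based on the cache API, with error handling reduced to .toTry.get):

```scala
import coursier.cache.FileCache
import coursier.util.Artifact

val cache = FileCache()
val url =
  "https://repo1.maven.org/maven2/com/google/cloud/libraries-bom/26.50.0/libraries-bom-26.50.0.pom"
// Downloads the file on first use, then keeps returning the cached copy,
// with the URL itself acting as the cache key
val file = cache.file(Artifact(url)).run
  .unsafeRun()(cache.ec)
  .toTry.get
println(file) // path under the coursier cache, keyed by the URL
```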

@alexarchambault
Contributor

alexarchambault commented Dec 3, 2024

AFAIK, Nix has a quite generic cache mechanism too, except it uses a checksum as key rather than a URL (not sure Nix can be easily used on all platforms from Mill though)

@alexarchambault
Contributor

And Bazel must offer similar features too

@lihaoyi
Member Author

lihaoyi commented Dec 4, 2024

I think it would make sense to offer this as a built in feature using Coursier.

  • Keying on URLs is generally fine, since [Cool URLs Don't Change](https://www.w3.org/Provider/Style/URI)
  • Coursier naturally has to provide all the bells and whistles around concurrency, interruption, etc. that would be a lot of work to re-implement in Mill.
  • We can generally be confident in the robustness of the coursier implementation, since it is heavily exercised via dependency resolution in Mill's build and test suite, more so than any ad-hoc implementation we come up with.

I think to make this a reality, a few steps are necessary:

  1. Write reference documentation for programmatic use of the coursier cache API somewhere. https://get-coursier.io/docs/cache only covers the behavior of the cache but not how to actually use it from code. We could add the API usage docs there, or on mill-build.org, but it needs to live somewhere

  2. Decide whether to expose Coursier's API as-is or wrap it in a Mill wrapper helper. This depends largely on how well Coursier's Cache API fits the com-lihaoyi style that Mill and its libraries are written in (simple function calls and objects, lots of default parameters, minimal builder or factory or config objects)

  3. Review and discuss the exposed API and semantics to make sure it is what we want it to look like. Stable APIs in Mill are restricted from changing, so we should get it right the first time rather than needing to wait a year for the next breaking release to fix issues that could have been prevented up front

  4. Write an example test in example/fundamentals/libraries/ with documentation and include it on fundamentals/bundled-libraries.adoc together with usage docs for other bundled libraries

  5. Update the example/javalib/dependencies/4-downloading-unmanaged-jars examples to use this new API rather than requests.get directly

It's a bunch of work but seems straightforward. Let's hold off on #4066 until we've done this, and then we can dogfood our new APIs to download the test repos rather than doing it ad-hoc.

@alexarchambault probably makes sense for you to take the lead on this whenever you have some spare cycles
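As a strawman for step 2, a com-lihaoyi-style wrapper might look something like the sketch below. Every name here (GlobalCache, download, downloadArchive) is hypothetical, not an existing Mill or coursier API:

```scala
import scala.concurrent.duration.FiniteDuration

// Hypothetical wrapper: plain function calls with default parameters,
// no builder/factory/config objects. Names are illustrative only.
object GlobalCache {
  /** Download `url` into the machine-wide cache and return its local path. */
  def download(
      url: String,
      changing: Boolean = false,
      ttl: Option[FiniteDuration] = None
  ): os.Path = ???

  /** Like `download`, but also extract the archive and return the directory. */
  def downloadArchive(url: String, changing: Boolean = false): os.Path = ???
}

// Usage from a task:
def toolchain = Task {
  PathRef(GlobalCache.downloadArchive("https://example.com/toolchain.zip"))
}
```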

@jodersky
Member

jodersky commented Dec 4, 2024

I'd like to add a feature request to this: allow arbitrary postprocessing of a file before caching it. For example, if I download a zip archive, I'd like to unzip it before caching the result.

@jodersky
Member

jodersky commented Dec 4, 2024

I'm just thinking about this in general, and this came to mind.

If we had reproducible out/ folders (#4065), would it make sense to allow the results to be cached in a global directory? We could do this optionally by introducing a Task.Global{} (or something like that), where the result would be cached in a well-known location, under some checksum, and the dest folder would also be preserved there. We'd need some locking mechanism as well, but I think that this way we could share results "purely" with mill. For example, we could continue using requests.get and do arbitrary post-processing in tasks, and mark the task to use a global cache.


// Task.Global is not saved in `out/`, but in `~/.cache/mill/<some checksum>`
def tool = Task.Global {
  val archive = requests.get("...")
  unzip(archive, T.dest) // T.dest in a global task is also saved in the global cache
  ...
  PathRef(T.dest)
}

// global tasks can be reused like any other, except they'll be shared across mill builds on a machine, and maybe across many machines in the future
def invokeTool = Task.Command {
  exec(tool(), "foo", "bar")
}

It would also be a step in the direction of supporting remote caches.

I'm not quite sure how we'd address the cache in a way that does not depend on the task name, but if we figure that it out I think we could try this approach.

Edit: I just saw there's already been a similar discussion in #2101.


@lihaoyi
Member Author

lihaoyi commented Dec 4, 2024

@jodersky I think the main reason I would want to separate project-level caches in the out/ folder from global caches in ~/.cache/mill is the level of reliability:

  • I expect project-level caches to get into bad states every once in a while and need a clean
  • I expect global caches to never get into bad states and need to be cleaned

Thus it has always made sense to be able to put things in the out/ folder that are cheap-ish to re-compute and occasionally get thrown away, while expensive-to-download stuff that never gets re-computed goes in ~/.cache/mill. Mill tries to blur the two a bit by making the stuff in out/ more reliable (than SBT/Maven/Gradle at least), and thus less likely to need to be cleaned, but I don't think I can say it's reliable enough to match the stability of e.g. the global coursier artifact cache.

@jodersky
Member

jodersky commented Dec 4, 2024

I agree, that's why I was thinking of specifically marking tasks somehow that would be suited for global caching. Maybe this mechanism could then also be used to mark which tasks should be cached remotely, when we add that feature.

Expanding on that, to guarantee more reliability, we could create a separate type of serializer for global tasks, and selectively implement which return types we accept for globally cached tasks. For instance, I think that even with #4065, allowing arbitrary PathRefs in a global cache seems risky. @lefou floated the idea of a "ContentRef" type in #2101, which we could use to represent the content of all of T.dest. It seems like it could be a good fit to get started, and would be much more resistant to accidental corruption.

@lihaoyi
Member Author

lihaoyi commented Dec 4, 2024

Yeah, it's possible. The remote caching POC (#2777) used a Mill query to select which tasks you wanted to remote-cache, but there are other ways we could annotate it as well. Now that we've started using Task(foo = true){ ... } to pass flags to tasks, we could use that approach to mark remote-cacheable tasks in code rather than separately from the CLI.
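A sketch of that in-code marking follows; the remoteCache flag is hypothetical, named by analogy with the existing Task(persistent = true) style:

```scala
// Hypothetical: opt a task into global/remote caching via a Task flag.
// `remoteCache` does not exist today; `persistent` is the existing precedent.
def bigToolchain = Task(remoteCache = true) {
  val archive = T.dest / "tool.zip"
  os.write(archive, requests.get.stream("https://example.com/tool.zip"))
  // ... deterministic post-processing would go here ...
  PathRef(T.dest)
}
```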

@sequencer
Contributor

I personally would like an option in Mill that downloads nothing itself (sandboxing), but instead exposes the dependencies in a machine-readable way so that an external tool can download them and put them in the correct folder. That way we can safely cache them and reuse them on different clean machines.

The current state is that Mill always assumes it has a network connection, which is far from sandboxing. Sometimes Ivy downloads are not even reproducible at all, and we need to cache them with an input hash (see here).

Eventually what we expect is an out directory that depends only on the build.sc, and we use nix to take build.sc as input. For the Maven dependencies, we'd like Mill to dump the entire dependency list, so we can use coursier to download and cache each of them, then expose the nix cache directory to Mill.

@sideeffffect
Contributor

sideeffffect commented Dec 5, 2024

@jodersky

allow arbitrary postprocessing of a file before caching it. For example, if I download a zip archive, I'd like to unzip it before caching the result

I think I have good news regarding this. Among the many cool things Coursier can do, it downloads/"installs" various JVM distributions. And those are certainly distributed as zip or other archives. So it must be possible already today -- Coursier already does this.

https://get-coursier.io/docs/cli-java

@lihaoyi
Member Author

lihaoyi commented Dec 5, 2024

One thing we may need to be concerned about for post-processing is where this post-processing is defined, how it is keyed for uniqueness, and how error-prone it is. The nice thing about the dumb "download URL" workflow is that there is not much config to think about, it gives a great cache key to start with (the URL itself), and it rarely fails (and can always be retried).

Technically the downloaded contents can vary based on HTTP headers and all that (e.g. around authentication), but generally downloading stuff from public URLs is pretty straightforward

If we start allowing arbitrary post-processing, we need some way to hash the post-processing logic to add it to the cache key (maybe using methodCodeHashSignatures?), and we need to ensure the post-processing is appropriately configurable (I guess if it's arbitrary code, that's already the case?) and that it doesn't fail or cause the caches to get into a bad state. These are probably solvable problems, but definitely things we need to be careful about if we choose to go beyond the dumb "download URL" approach to the global caches.
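For illustration, combining the URL with a fingerprint of the post-processing logic (e.g. something derived from methodCodeHashSignatures) into a single cache key could look like this sketch:

```scala
import java.security.MessageDigest

// Illustration: the cache key covers both the URL and a fingerprint of the
// post-processing code, so changing the logic invalidates the cached entry.
def cacheKey(url: String, postProcessHash: String): String = {
  val md = MessageDigest.getInstance("SHA-256")
  md.update(url.getBytes("UTF-8"))
  md.update(postProcessHash.getBytes("UTF-8"))
  md.digest().map("%02x".format(_)).mkString
}
```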

@jodersky
Member

jodersky commented Dec 5, 2024 via email

@lihaoyi
Member Author

lihaoyi commented Dec 8, 2024

Apart from the download (and maybe unpack) kinds of caching, we also need some story for what to do with external tools outside of our control that require cache locations. PIP and NPM are two examples we're facing right now, but I'm sure there'll be others.
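One low-tech option might be to point each tool's own cache at a subdirectory of a standard Mill cache root. In the sketch below, PIP_CACHE_DIR and npm_config_cache are the tools' real cache-location environment variables, while millCacheRoot and the helper functions are hypothetical:

```scala
// Sketch: redirect external tools' caches into a shared Mill cache root.
def millCacheRoot: os.Path = os.home / ".cache" / "mill"

def pipInstall(pkgs: Seq[String]) =
  os.proc("pip", "install", pkgs)
    .call(env = Map("PIP_CACHE_DIR" -> (millCacheRoot / "pip").toString))

def npmInstall(workingDir: os.Path) =
  os.proc("npm", "install")
    .call(cwd = workingDir,
          env = Map("npm_config_cache" -> (millCacheRoot / "npm").toString))
```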

@alexarchambault
Contributor

I'd like to add a feature request to this: allow arbitrary postprocessing of a file before caching it. For example, if I download a zip archive, I'd like to unzip it before caching the result.

As @sideeffffect mentioned, this is already supported by coursier. Support for that was added a while ago, for JVM handling.

From the command-line, one can pass --archive to cs get to download and extract an archive:

$ cs get https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip
~/Library/Caches/Coursier/v1/https/cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip

$ cs get https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip --archive
~/Library/Caches/Coursier/arc/https/cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip

$ ls "$(cs get https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip --archive)"
LICENSE.gpl2        LICENSE.lgpl3       bin                 libexec
LICENSE.gpl3        README.md           include             x86_64-linux-cosmo
LICENSE.lgpl2       aarch64-linux-cosmo lib

(note the arc instead of v1 in the path)

From the API, one needs to use coursier.cache.ArchiveCache:

import coursier.cache.ArchiveCache
import coursier.util.Artifact
val cache = ArchiveCache()
val dir = cache.get(Artifact("https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip"))
  .unsafeRun()(cache.cache.ec).toTry.get
os.list(os.Path(dir))

@alexarchambault
Contributor

For arbitrary post-processing, I wouldn't recommend modifying files or directories that coursier returns, but… some changes should be fine: making a file executable, adding new files in an extracted archive, etc. Once an archive is extracted, coursier doesn't check for added (or removed) files.
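For example, an additive tweak like marking extracted binaries executable should be safe. A sketch, where markBinariesExecutable is just an illustrative helper:

```scala
// Sketch: additive post-processing on a directory returned by ArchiveCache.
// coursier doesn't re-check extracted archives for added files or changed
// permissions, so this won't be undone on the next cache hit.
def markBinariesExecutable(extractedDir: os.Path): Unit =
  for (bin <- os.list(extractedDir / "bin") if os.isFile(bin))
    os.perms.set(bin, "rwxr-xr-x")
```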

@alexarchambault
Contributor

I think to make this a reality, a few steps are necessary:

  1. Write reference documentation for programmatic use of the coursier cache API somewhere. https://get-coursier.io/docs/cache only covers the behavior of the cache but not how to actually use it from code. We could add the API usage docs there, or on mill-build.org, but it needs to live somewhere

I agree more API docs are needed (sometimes, even any API docs at all)

@alexarchambault
Contributor

alexarchambault commented Dec 18, 2024

2. Decide whether to expose Coursier's API as-is or wrap it in a Mill wrapper helper. This depends largely on how well Coursier's Cache API fits the com-lihaoyi style that Mill and its libraries are written in (simple function calls and objects, lots of default parameters, minimal builder or factory or config objects)

I wouldn't say it fits the com-lihaoyi style, currently. It needs a lot of polishing, at a minimum. Here is how one can enable logging with progress bars, in the archive example above:

import coursier.cache.{ArchiveCache, FileCache}
import coursier.cache.loggers.RefreshLogger
import coursier.util.Artifact

val cache = FileCache().withLogger(RefreshLogger.create())
val archiveCache = ArchiveCache().withCache(cache)
val dir = archiveCache.get(Artifact("https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip"))
  .unsafeRun()(cache.ec).toTry.get
os.list(os.Path(dir))

Default values were despised in the first years of coursier (which was actually a good thing in the core of the resolver, to avoid accidentally using a default value). Then I started using data classes a lot, so there are a lot of such data classes around. But there are very few method calls with default values. And coursier has its own IO type (wrapping scala.concurrent.Future), coursier.util.Task, which is useful internally, but less so for end users.
