Standardize usage of machine-wide caches #3930
My 2c on this: such cache structures vary a lot between ecosystems, because the particular ecosystem dictates the structure. For example, in the case of Maven dependencies, the standard file cache layout encodes the artifact version as part of the artifact's full path, so it is easy to keep different artifact versions out-of-project, somewhere on the machine. If we look at NPM dependencies, for example,
For coursier (which handles Maven dependencies), it's fine / safe to have several tasks or threads attempt to download the same file. Only one of them will effectively download the file, while the others wait for it (and may also download other things of their own in parallel), and another might take over if the one doing the download is interrupted (resuming the download if the repository supports it, not re-doing it from zero).
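To make that behaviour concrete, here is a rough sketch (not Mill code; the URL and artifact are just examples) of several threads asking coursier's FileCache for the same file at once. Only one of them should perform the actual download; the others wait for it and reuse the cached result:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import coursier.cache.FileCache
import coursier.util.Artifact

val cache = FileCache()
implicit val ec: scala.concurrent.ExecutionContext = cache.ec

val url =
  "https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.13.15/scala-library-2.13.15.jar"

// Four concurrent requests for the same URL: coursier locks the cache entry,
// so only one request performs the download, and the others reuse it.
val downloads = Future.traverse((1 to 4).toList) { _ =>
  Future(cache.file(Artifact(url)).run.unsafeRun()(cache.ec))
}

// All four results point at the same file in the machine-wide cache.
Await.result(downloads, Duration.Inf).foreach(println)
```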
We recently addressed many parallel download issues triggered by Mill's heavy use of concurrency 😅; it seems to be working fine now.
Wanted to briefly mention a few details about the coursier cache, following the short discussion in #4066. Unlike other Maven dependency resolvers (Maven itself, Gradle, Ivy), coursier's cache is quite generic, and simply caches artifacts using their URL as the key. To illustrate with the coursier CLI
This works well as most artifacts are immutable. Mutable ones have to be flagged as "changing" upfront (internally, coursier does this for snapshot artifacts and version listings), and a TTL is used to decide whether to check for new content or not:
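A minimal sketch of what this looks like through the coursier API (the URLs and the one-hour TTL below are just examples, and the with-style setters follow coursier's data-class conventions):

```scala
import scala.concurrent.duration._
import coursier.cache.FileCache
import coursier.util.Artifact

val cache = FileCache().withTtl(1.hour)

// Immutable artifact: keyed by its URL, downloaded once and reused forever.
val release = cache
  .file(Artifact("https://repo1.maven.org/maven2/com/lihaoyi/os-lib_2.13/0.11.3/os-lib_2.13-0.11.3.jar"))
  .run.unsafeRun()(cache.ec)

// "Changing" artifact (e.g. a snapshot or a version listing): checked again
// for new content once the TTL has expired.
val snapshot = cache
  .file(Artifact("https://example.org/snapshots/app-latest.jar").withChanging(true))
  .run.unsafeRun()(cache.ec)
```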
AFAIK, Nix has a quite generic cache mechanism too, except it uses a checksum as the key rather than a URL (not sure Nix can be easily used on all platforms from Mill though).
And Bazel must offer similar features too.
I think it would make sense to offer this as a built-in feature using Coursier.
I think to make this a reality, a few steps are necessary:
It's a bunch of work but seems straightforward. Let's hold off on #4066 until we've done this, and then we can dogfood our new APIs to download the test repos rather than doing it ad-hoc. @alexarchambault probably makes sense for you to take the lead on this whenever you have some spare cycles.
I'd like to add a feature request to this: allow arbitrary postprocessing of a file before caching it. For example, if I download a zip archive, I'd like to unzip it before caching the result.
I'm just thinking about this in general, and this came to mind. If we had reproducible
It would also be a step in the direction of supporting remote caches. I'm not quite sure how we'd address the cache in a way that does not depend on the task name, but if we figure that out I think we could try this approach. Edit: I just saw there's already been a similar discussion in #2101.
fwiw, we are using mill+nix in our project, see a minimal example here: https://github.com/chipsalliance/chisel-nix/blob/6a8222ad27b3fa4f2cd871152696afdc636c5b44/templates/chisel/nix/pkgs/mill-builder.nix#L6
@jodersky I think the main reason I would want to separate project-level caches in the
Thus it has always made sense to be able to put things in the
I agree, that's why I was thinking of specifically marking tasks somehow that would be suited for global caching. Maybe this mechanism could then also be used to mark which tasks should be cached remotely, when we add that feature. Expanding on that, to guarantee more reliability, we could create a separate type of serializer for global tasks, and selectively implement which return types we accept for globally cached tasks. For instance, I think that even with #4065, allowing arbitrary PathRefs in a global cache seems risky. @lefou floated the idea of a "ContentRef" type in #2101, which we could use to represent the content of all of T.dest. It seems like it could be a good fit to get started, and would be much more resistant to accidental corruption.
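To make the idea concrete, here is a purely hypothetical sketch of what such a "ContentRef" could look like (the name and shape are made up for illustration; this is not an existing Mill API). It snapshots the files under a task's dest directory by relative path and content hash, so the cached value doesn't embed machine-local absolute paths the way a PathRef does:

```scala
import java.security.MessageDigest

// Hypothetical type: a content-addressed snapshot of a directory.
final case class ContentRef(relativePaths: Seq[String], sha256: String)

object ContentRef {
  // Hash relative paths and file contents in a stable order, so the same
  // directory contents produce the same ContentRef on any machine.
  def of(dest: os.Path): ContentRef = {
    val files = os.walk(dest).filter(os.isFile).sortBy(_.relativeTo(dest).toString)
    val md = MessageDigest.getInstance("SHA-256")
    files.foreach { f =>
      md.update(f.relativeTo(dest).toString.getBytes("UTF-8"))
      md.update(os.read.bytes(f))
    }
    ContentRef(
      files.map(_.relativeTo(dest).toString),
      md.digest().map("%02x".format(_)).mkString
    )
  }
}
```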
yeah it's possible. The remote caching POC (#2777) used a Mill query to select which tasks you wanted to remote cache, but there are other ways we could annotate it as well. Now we've started using
I would personally like an option in Mill that doesn't download anything (sandboxing) but instead exposes the dependencies in a machine-readable way, so an external tool can download them and put them in the correct folder. That way we can safely cache them and reuse them on different clean machines. The current state is that Mill always assumes it has a network connection, which is far from sandboxing. Sometimes ivy downloads are not even reproducible at all, and we need to cache them with an input hash, see here. Eventually what we expect is an
I think I have good news regarding this. Among the many cool things Coursier can do, it downloads/"installs" various JVM distributions. And those are certainly distributed as zip or other archives. So it must be possible already today -- Coursier already does this.
One thing we may need to be concerned about for post-processing is where this post-processing is defined, how it is keyed for uniqueness, and how error-prone it is. The nice thing about the dumb "download URL" workflow is that there is not much config to think about, it gives a great cache key to start with (the URL itself), and it rarely fails (and can always be retried). Technically the downloaded contents can vary based on HTTP headers and all that (e.g. around authentication), but generally downloading stuff from public URLs is pretty straightforward.

If we start allowing arbitrary post-processing, we need some way to hash the post-processing logic to add it to the cache key (maybe using methodCodeHashSignatures?), and we need to ensure the post-processing is appropriately configurable (I guess if it's arbitrary code that's already the case?) and that it doesn't fail or cause the caches to get into a bad state. Probably solvable problems, but definitely things we need to be careful about if we choose to go beyond just the "dumb download URL" approach to the global caches.
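As a rough sketch of the key-composition part only (assuming some hash of the post-processing logic is available, e.g. derived from methodCodeHashSignatures; the helper name is made up):

```scala
import java.security.MessageDigest

// Combine the download URL with a hash of the post-processing code, so that
// changing either the URL or the post-processing logic invalidates the entry.
def globalCacheKey(url: String, postProcessCodeHash: String): String = {
  val md = MessageDigest.getInstance("SHA-256")
  md.update(url.getBytes("UTF-8"))
  md.update(postProcessCodeHash.getBytes("UTF-8"))
  md.digest().map("%02x".format(_)).mkString
}
```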
I think we can indeed defer *arbitrary* postprocessing to a later point (maybe once we have a robust solution to global/remote caching of tasks). However, I would argue that archive extraction is a common enough scenario to support.
Apart from the download (and maybe unpack) kinds of caching, we also need some story for what to do with external tools outside of our control which require cache locations. PIP and NPM are two examples we're facing right now, but I'm sure there'll be others.
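For those two specifically, here is a sketch of what pointing them at a standard location could look like, assuming their documented environment variables (pip reads PIP_CACHE_DIR, npm reads npm_config_cache); the directory layout is illustrative only:

```scala
// Illustrative machine-wide location for third-party tool caches.
val toolCaches = os.home / ".cache" / "mill" / "tool-caches"

// Point pip's wheel/HTTP cache at the shared location.
os.proc("pip", "install", "-r", "requirements.txt")
  .call(env = Map("PIP_CACHE_DIR" -> (toolCaches / "pip").toString))

// Point npm's package cache at the shared location.
os.proc("npm", "install")
  .call(env = Map("npm_config_cache" -> (toolCaches / "npm").toString))
```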
As @sideeffffect mentioned, this is already supported by coursier. Support for that was added some time ago, as part of the JVM handling support. From the command-line, one can pass … (note the …). From the API, one needs to use:

```scala
import coursier.cache.ArchiveCache
import coursier.util.Artifact

val cache = ArchiveCache()
val dir = cache.get(Artifact("https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip"))
  .unsafeRun()(cache.cache.ec)
  .toTry.get
os.list(os.Path(dir))
```
For arbitrary post-processing, I wouldn't recommend modifying files or directories that coursier returns, but… some changes should be fine: making a file executable, adding new files in an extracted archive, etc. Once an archive is extracted, coursier doesn't check for added (or removed) files.
I agree that more API docs (sometimes, even any API docs at all) are needed.
I wouldn't say it fits the com-lihaoyi style, currently. It needs a lot of polishing, at a minimum. Here is how one can enable logging with progress bars, in the archive example above:

```scala
import coursier.cache.{ArchiveCache, FileCache}
import coursier.cache.loggers.RefreshLogger
import coursier.util.Artifact

val cache = FileCache().withLogger(RefreshLogger.create())
val archiveCache = ArchiveCache().withCache(cache)
val dir = archiveCache.get(Artifact("https://cosmo.zip/pub/cosmocc/cosmocc-3.9.7.zip"))
  .unsafeRun()(cache.ec)
  .toTry.get
os.list(os.Path(dir))
```

Default values were despised in the first years of coursier (which was actually a good thing in the core of the resolver, not to accidentally use a default value). Then I started using data classes a lot, so there are a lot of such data classes around. But there are very few method calls with default values. And coursier has its own IO (that wraps
In the past we only had Coursier managing its own machine-global cache for maven central, but this is starting to grow:
The basic idea is that when someone calls `clean`, they usually don't want to start all the way from scratch, re-downloading every jar from Maven Central and re-downloading their JVM. These external downloads are typically downloaded once and cached forever, and only very rarely do people want to clean them, compared to how often they need to clean local build outputs. We should try to standardize how these "global" cached downloads are handled, so people can just plug into the standard rather than creatively coming up with their own solutions that end up being half-baked or inconsistent.