
Proof of concept: Remote Caching #726

Closed
wants to merge 9 commits

Conversation

psilospore

@psilospore psilospore commented Nov 3, 2019

Hey @lihaoyi,

I spoke to you on Gitter about remote caching a month or so ago, and here's my attempt.
I have a remote caching server implementation and a write up here: https://github.com/psilospore/mill-remote-cache-server

This was a fun exercise! Let me know how I could improve this.

In addition to the information in the readme linked above, here's what I wrote on the Mill side:

  • A remote caching client, RemoteCacher, that talks to my remote caching server.
  • It uses some dependencies that are only there for the proof of concept; I could do it differently if you'd like.
  • It checks which tasks are cached, uploads tasks as a gzipped tar, and fetches tasks by uncompressing them and moving artifacts to their proper place in out.
  • You mentioned I should first make Mill work with relative paths. I struggled with that a bit, but another method I considered was just making paths relative when I upload. There are probably some problems with that approach; I'll mention them later.
  • I'm not sure if I should use Apache Commons Compress or call out to tar with ammonite.ops.%%.

@lihaoyi
Member

lihaoyi commented Nov 3, 2019

Interesting! @psilospore how well does this work? Are we caching all tasks or just certain ones? Mill has a lot of very-small intermediate tasks (e.g. computing version numbers, grouping classpaths into bigger classpaths) where the round trip time of fetching from the cache would likely dwarf the time taken to do the computation locally

@lihaoyi
Member

lihaoyi commented Nov 3, 2019

Also, how does it work for things like coursier-resolved jars, which typically live in ~/.coursier outside the project folder?

@psilospore
Author

psilospore commented Nov 3, 2019

are we caching all tasks or just certain ones? Mill has a lot of very-small intermediate tasks (e.g. computing version numbers, grouping classpaths into bigger classpaths) where the round trip time of fetching from the cache would likely dwarf the time taken to do the computation locally

I was thinking of maybe adding a flag to Task, like remoteCaching, with true as the default. I attempted changing Target.apply to implicit def apply[T](t: T, remotelyCaching: Boolean = true), so in my build I could write something like def t = T.apply({ 2 + 2 }, remotelyCaching = false), but it seems I can't have default arguments with macros. Even if I could, I'm not sure that's ideal. Would you have a suggestion?

Also, there is a task forkEnv that has references to environment-specific variables, which I think we also wouldn't want to upload.

Also, how does it work for things like coursier-resolved jars, which typically live in ~/.coursier outside the project folder?

Oh, that's the thing I said I would mention later but forgot. Yes, Coursier-resolved jars won't work with what I have. The logic I wrote to change absolute paths to relative paths works fine for the compile task, for example:

{
    "value": [
        "ref:c9969f7e:/Users/syedajafri/dev/mill/scratch/bar/src/Main.scala"
    ],
    "valueHash": -286262790,
    "inputsHash": -475356575
}

to

{
    "value": [
        "ref:c9969f7e:../../bar/src/Main.scala"
    ],
    "valueHash": -286262790,
    "inputsHash": -475356575
}

But the compileClasspath task, which points to a directory in the Coursier cache, would be:

{
    "value": [
        "qref:e9b58b05:../../../../../Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.8/scala-library-2.12.8.jar"
    ],
    "valueHash": -1288013907,
    "inputsHash": -615246851
}

This will point to the wrong path when someone else builds, but maybe I could upload it with an environment variable, something like $COURSIER_CACHE:

{
    "value": [
        "qref:e9b58b05:$COURSIER_CACHE/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.8/scala-library-2.12.8.jar"
    ],
    "valueHash": -1288013907,
    "inputsHash": -615246851
}

I could also do something similar for the project directory with $WORKSPACE, so the compile task could instead be:

{
    "value": [
        "ref:c9969f7e:$WORKSPACE/bar/src/Main.scala"
    ],
    "valueHash": -286262790,
    "inputsHash": -475356575
}
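The substitution above could be sketched as a small helper that rewrites machine-specific prefixes into placeholders before upload and expands them again after download. This is only an illustration: PathPlaceholders, toPortable, and fromPortable are made-up names, and the $WORKSPACE / $COURSIER_CACHE placeholders are taken from the examples in this thread, not from any actual Mill API.

```scala
// Hypothetical sketch: rewrite machine-specific absolute prefixes into
// placeholders on upload, and expand them again on download.
// Replacing the Coursier cache prefix first handles the case where the
// cache happens to live inside the workspace.
object PathPlaceholders {
  def toPortable(ref: String, workspace: String, coursierCache: String): String =
    ref.replace(coursierCache, "$COURSIER_CACHE").replace(workspace, "$WORKSPACE")

  def fromPortable(ref: String, workspace: String, coursierCache: String): String =
    ref.replace("$COURSIER_CACHE", coursierCache).replace("$WORKSPACE", workspace)
}
```

So a ref like "ref:c9969f7e:/Users/alice/proj/bar/src/Main.scala" with workspace "/Users/alice/proj" becomes "ref:c9969f7e:$WORKSPACE/bar/src/Main.scala" before upload.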

@lihaoyi
Member

lihaoyi commented Nov 6, 2019

w.r.t. flagging on specific targets to be remote cached, maybe we could make it something like T.remoteCached{ ... }? The fact that macros can't take defaults is annoying, but copy-pasting the T.apply{ ... } macro under a different name should work.

w.r.t. Coursier resolved jars, perhaps one solution could be to copy the jar files into the T.ctx().dest folder before returning them to Mill? That way they can be properly relativized to the root of the Mill repository, similar to any other compiled files. There shouldn't be too many files, so copying them shouldn't waste too much time or space (a few hundred MB?).

We could symlink them instead, so they get properly picked up and uploaded to the remote cache, but do not waste space on the uploader's computer. OTOH they would still waste space on the downloader's computer, unless we have a separate step in the remote cache downloader that first downloads the remote-cached PathRefs to ~/.mill/remote-cache/<hash-of-target-inputs>/... and then symlinks them into the relevant folder. That way they won't waste space on either uploader or downloader computers
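The download-then-symlink idea could be sketched roughly as below. SharedCacheLinker and linkIntoOut are illustrative names only, and the shared-store layout keyed by the target's input hash follows the comment above rather than any existing Mill API.

```scala
import java.nio.file.{Files, Path}

// Hypothetical sketch of "download once, symlink into place": cached
// artifacts land in a shared store (e.g. ~/.mill/remote-cache) keyed by
// the target's input hash, and the out/ folder only holds symlinks, so
// the artifact bytes exist once per machine.
object SharedCacheLinker {
  def linkIntoOut(cacheStore: Path, inputsHash: Int, artifact: String, outDir: Path): Path = {
    val source = cacheStore.resolve(inputsHash.toHexString).resolve(artifact)
    val link   = outDir.resolve(artifact)
    Files.createDirectories(link.getParent) // ensure the out/ subfolder exists
    Files.deleteIfExists(link)              // replace any stale entry
    Files.createSymbolicLink(link, source)  // out/ entry points at the shared store
  }
}
```

With this layout, neither the uploader nor the downloader pays twice for disk space; only the shared store holds real files.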

@psilospore
Author

psilospore commented Nov 10, 2019

w.r.t. flagging on specific targets to be remote cached, maybe we could make it something like T.remoteCached{ ... }?

Also, we probably want some nice way to set the remote cache location. Bazel lets you specify your remote cache with a command-line option, --remote_cache=localhost:7000. Someone setting up the build might want other devs to just run bazel build without a bunch of arguments, so Bazel lets you check a .bazelrc into your VCS with some default options set: https://docs.bazel.build/versions/master/remote-caching.html#run-bazel-using-the-remote-cache Maybe we could take a similar approach? Something like .millrc?

Another configuration option we could add is a remote cache threshold, with some default value. Any task under that threshold wouldn't be uploaded; it would be a cache miss in other builds and get computed locally.

That being said, what if we had remote caching on by default, with a default threshold value the user could override with a command-line arg or in their .millrc?
Also, there might be some special cases where a user wants to prevent a task from being remotely cached, like forkEnv. Maybe for those special cases we could have a T.disableRemoteCache?

For the coursier resolved jars let me get back to you on that. I want to look at how mill works with coursier and maybe test some stuff out first when I have some time.

By the way I'm going to Scale by the bay. I saw that you were giving a talk. Maybe we can chat some time during the conference.

@lihaoyi
Member

lihaoyi commented Nov 11, 2019

A command line option to specify a cache URL seems reasonable. These would apply to all(-ish) targets within a build, so a top level flag is justified

Not sure what you mean by threshold, but having that as a flag seems reasonable too.

Now that we have ./mill bootstrap launchers, that becomes a natural place to put any such auxiliary flags or config that is specific to a particular project. It already specifies the MILL_VERSION after all. No need for a separate .millrc

I'll be at Scala by the bay, happy to talk

@psilospore
Author

psilospore commented Nov 24, 2019

Sorry I didn't get to catch up with you during the conference. Great talk, though! I read the blog post form of it a while ago; after being frustrated with SBT, it ended up getting me interested in Bazel and subsequently Mill.

Now that we have ./mill bootstrap launchers, that becomes a natural place to put any such auxiliary flags or config that is specific to a particular project. It already specifies the MILL_VERSION after all. No need for a separate .millrc

Didn't know about the launcher but that sounds good.

Not sure what you mean by threshold, but having that as a flag seems reasonable too.

I mean a minimum size for a task to upload, to solve the issue you mentioned earlier:

very-small intermediate tasks (e.g. computing version numbers, grouping classpaths into bigger classpaths) where the round trip time of fetching from the cache would likely dwarf the time taken to do the computation locally

So if there's a task whose output is 1 KB after compression, it's probably not worth uploading and fetching it. We could set the threshold to 1 MB, for example, and such a task would not be uploaded; for anyone else that would be a cache miss and it would be recomputed.

But rethinking that, it would probably be better to base the threshold on time rather than size. Locally, in one of my mill-profile.json files, I see a lot of tasks that take only 1 to 10 ms to compute, and some tasks that take a long time, like 8000 ms. We could upload only tasks that take longer than the specified threshold. There could be a default value as well if the user does not specify one.
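As a sketch of that time-based rule, assuming task durations are available (e.g. parsed from mill-profile.json): UploadFilter, worthUploading, and the 1000 ms default are illustrative names and values, not anything Mill actually ships.

```scala
// Hypothetical sketch: only upload targets whose recorded duration
// exceeds a threshold; everything else stays local and simply shows up
// as a cache miss on other machines.
object UploadFilter {
  val defaultThresholdMillis: Long = 1000L // assumed default, not a Mill setting

  def worthUploading(
      durationsMillis: Map[String, Long],
      thresholdMillis: Long = defaultThresholdMillis
  ): Set[String] =
    durationsMillis.filter { case (_, ms) => ms >= thresholdMillis }.keySet
}
```

A version-number task at 3 ms would never be uploaded, while an 8000 ms compile would be.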

Also sorry but I might be busy prepping and searching for a job for a while so it may take me a while to update this.

@lihaoyi
Member

lihaoyi commented Dec 1, 2019

@psilospore no problem! I am also in no hurry, as many of my own Mill projects are small enough that remote caching isn't a huge necessity. This PR can take as long as it needs to take.

@sequencer
Contributor

Any progress on remote build/cache?
I also want to use/implement this in my project; I'm not sure if there is already a great wheel to use.

@psilospore
Author

Hey @sequencer, I've been busy so I haven't had time, although I have been thinking about this, especially recently, since I'm attempting to use Mill at my new gig and this might be nice to have, especially for CI. I think I might get to this after one of my other open source initiatives is finished.

If you have time to contribute that would be greatly appreciated! Even input on some of the discussions we have had would really help.

@sequencer
Contributor

Hello @psilospore, thank you for working on this!
I'd like to have these features:

  1. store metadata on the local PC.
  2. send the metadata to a server, which will execute the task remotely.
  3. after the server finishes the job, it informs the client and sends back the result of the task.
  4. the client keeps the task metadata and remote machine information for caching.

Maybe this exceeds the scope of your cache server somewhere, but if you want to work on this, I'll be very willing to cooperate with you.

@psilospore
Author

@sequencer are you talking about implementing remote execution like bazel has? I think that would be very cool but I probably wouldn't have time to do that in the near future. It also seems like it would be complex. I would be happy to review your PR, make suggestions, and maybe contribute in a limited amount if you would take the lead on that. We probably want @lihaoyi's opinion on potentially adding a feature like remote execution first.

@sequencer
Contributor

Yes, but I'm still new to Scala and Mill; I'm trying to implement a VLSI build system with Mill.
If there are any ideas (architectures), I can try to implement a demo of this.

@lefou
Member

lefou commented Mar 28, 2020

If I understand the current approach correctly, we check whether we have a cached execution result in the remote cache; if not, we execute the target locally and upload the result to the remote cache. How about only uploading execution results that were "expensive" enough, e.g. took more than a few seconds to compute locally? Then we would simply never find cached results for "small" targets, and there would be no need to explicitly mark cacheable targets. Of course, asking whether a target is cached needs to be cheap and fast; ideally we could ask in advance for the cache status of the whole build graph.

@psilospore
Author

psilospore commented Mar 31, 2020

@lefou Thanks for the feedback! I think that's what I proposed earlier.

But rethinking that, it would probably be better to base the threshold on time rather than size. Locally, in one of my mill-profile.json files, I see a lot of tasks that take only 1 to 10 ms to compute, and some tasks that take a long time, like 8000 ms. We could upload only tasks that take longer than the specified threshold. There could be a default value as well if the user does not specify one.

Getting your feedback makes me think I should go forward with that.


Of course, asking whether a target is cached needs to be cheap and fast; ideally we could ask in advance for the cache status of the whole build graph.

I load the cached targets into a Map initially, which would essentially achieve that.

case class Cached(hashesAvailable: Map[String, List[Int]])

If it's in the cache then I fetch the artifacts associated with that target.
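A minimal sketch of how that lookup might work on top of the Cached map above; isHit is an illustrative helper, not part of the PR:

```scala
// Sketch: a target is a remote cache hit if the server advertises the
// target's current inputsHash among the hashes it has stored for that task.
case class Cached(hashesAvailable: Map[String, List[Int]]) {
  def isHit(task: String, inputsHash: Int): Boolean =
    hashesAvailable.get(task).exists(_.contains(inputsHash))
}
```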

@lefou
Member

lefou commented Mar 31, 2020

@lefou Thanks for the feedback! I think that's what I proposed earlier.

@psilospore I'm sorry, I must have somehow overlooked that.

Getting your feedback makes me think I should go forward with that.

Absolutely.

I load the cached targets into a Map initially, which would essentially achieve that.

case class Cached(hashesAvailable: Map[String, List[Int]])

If it's in the cache then I fetch the artifacts associated with that target.

I think it would make sense to refetch the cache state in between, to benefit from freshly inserted cached targets. Consider that another build may run in parallel, e.g. on another CI node, or a co-worker who just checked out the same latest commit as you did.

I tend to think we should consider local caches too. I sometimes used ccache (a compiler cache for C-like compilers) in the past, and my Gentoo machines still use it under the hood today. It would speed up the build in cases where I switch branches often.

@lefou
Member

lefou commented Mar 31, 2020

Regarding caching dependencies with coursier, I think we should disable caching those targets completely. The best thing to cache those would be some kind of shared dist/proxy/mirror repository. I don't know if coursier already supports this, but Maven does and most Linux distros also have a concept of mirror repositories.

@lefou lefou added the rebase-needed This pull request needs a git rebase before a merge is possible label Mar 31, 2020
@psilospore
Author

Regarding caching dependencies with coursier, I think we should disable caching those targets completely. The best thing to cache those would be some kind of shared dist/proxy/mirror repository.

Good idea and it would simplify stuff when I get back to it. Thanks!

I think it would make sense to refetch the cache state in between, to benefit from freshly inserted cached targets. Consider that another build may run in parallel, e.g. on another CI node, or a co-worker who just checked out the same latest commit as you did.

I might need to follow up on that. I think I have an idea but not sure if I'm completely following along.

I tend to think we should consider local caches too. I sometimes used ccache (a compiler cache for C-like compilers) in the past, and my Gentoo machines still use it under the hood today. It would speed up the build in cases where I switch branches often.

I'll have to research that.

@lefou lefou marked this pull request as draft April 22, 2020 21:28
@lefou
Member

lefou commented Jun 17, 2020

We should probably only consider targets for caching whose paths are all relative to the project base dir. All other paths (either absolute, or relative but pointing outside of the project (out) dir) should be ignored and never cached. As a result, we will never be able to cache all targets, but on the other hand we can cache some targets with relatively small effort.
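That eligibility rule could be sketched as a check that every path a target produces, once resolved, still lives under the project base dir. CacheEligibility and staysInside are illustrative names only.

```scala
import java.nio.file.Path

// Sketch: a target is cacheable only if all of its produced paths resolve
// to somewhere inside the project base dir. Absolute paths and relative
// paths that escape the project (e.g. into the Coursier cache) fail the check.
object CacheEligibility {
  def staysInside(baseDir: Path, refs: Seq[Path]): Boolean =
    refs.forall { p =>
      // resolve() leaves absolute paths as-is, so those fail startsWith below
      baseDir.resolve(p).normalize().startsWith(baseDir.normalize())
    }
}
```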

@lefou lefou mentioned this pull request Jan 9, 2021
@lefou
Member

lefou commented Apr 8, 2022

Some follow up discussions can be found here: #1400

@lefou
Member

lefou commented Apr 8, 2022

I'm closing this for now. We can re-open it or create a new one at any time.
