
Proof of concept: Remote Caching #726

Closed
wants to merge 9 commits

Conversation

psilospore

@psilospore psilospore commented Nov 3, 2019

Hey @lihaoyi,

I spoke to you on Gitter about remote caching a month or so ago, and here's my attempt.
I have a remote caching server implementation and a write up here: https://github.com/psilospore/mill-remote-cache-server

This was a fun exercise! Let me know how I could improve this.

In addition to the information in the readme linked above, here's what I wrote on the Mill side:

  • A remote caching client, RemoteCacher, that talks to my remote caching server.
  • It uses some dependencies that are only there for the proof of concept; I could do it differently if you'd like.
  • It checks which tasks are cached, uploads tasks as a gzipped tar, and fetches tasks by uncompressing them and moving artifacts to their proper place in out.
  • You mentioned I should first make Mill work with relative paths. I struggled with that a bit, but another method I considered was just making paths relative when I upload. There are probably some problems with that approach; I'll mention them later.
  • I'm not sure if I should use Apache Commons Compress or call out to tar with ammonite.ops.%%.

@lihaoyi
Member

lihaoyi commented Nov 3, 2019

Interesting! @psilospore how well does this work? Are we caching all tasks or just certain ones? Mill has a lot of very-small intermediate tasks (e.g. computing version numbers, grouping classpaths into bigger classpaths) where the round trip time of fetching from the cache would likely dwarf the time taken to do the computation locally

@lihaoyi
Member

lihaoyi commented Nov 3, 2019

Also, how does it work for things like coursier-resolved jars, which typically live in ~/.coursier outside the project folder?

@psilospore
Author

psilospore commented Nov 3, 2019

are we caching all tasks or just certain ones? Mill has a lot of very-small intermediate tasks (e.g. computing version numbers, grouping classpaths into bigger classpaths) where the round trip time of fetching from the cache would likely dwarf the time taken to do the computation locally

I was thinking of maybe adding a flag to Task, like remoteCaching, with true as the default. I attempted changing Target.apply to implicit def apply[T](t: T, remotelyCaching: Boolean = true), so in my build I could write something like def t = T.apply({ 2 + 2 }, remotelyCaching = false), but it seems I can't have default arguments with macros. Even if I could, I'm not sure that's ideal. Would you have a suggestion?

Also, there is a task forkEnv that has references to environment-specific variables, which I think we also wouldn't want to upload.

Also, how does it work for things like coursier-resolved jars, which typically live in ~/.coursier outside the project folder?

Oh, that's the thing I said I would mention later but forgot. Yes, Coursier-resolved jars won't work with what I have. The logic I wrote to change absolute paths to relative paths works fine for the compile task, for example:

{
    "value": [
        "ref:c9969f7e:/Users/syedajafri/dev/mill/scratch/bar/src/Main.scala"
    ],
    "valueHash": -286262790,
    "inputsHash": -475356575
}

to

{
    "value": [
        "ref:c9969f7e:../../bar/src/Main.scala"
    ],
    "valueHash": -286262790,
    "inputsHash": -475356575
}

But the compileClasspath task, which points to a directory in the Coursier cache, would be:

{
    "value": [
        "qref:e9b58b05:../../../../../Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.8/scala-library-2.12.8.jar"
    ],
    "valueHash": -1288013907,
    "inputsHash": -615246851
}

This will point to the wrong path when someone else builds, but maybe I could upload it with an environment variable, something like $COURSIER_CACHE:

{
    "value": [
        "qref:e9b58b05:$COURSIER_CACHE/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.8/scala-library-2.12.8.jar"
    ],
    "valueHash": -1288013907,
    "inputsHash": -615246851
}

I could also do something similar for the project directory with $WORKSPACE, so the compile task could instead be:

{
    "value": [
        "ref:c9969f7e:$WORKSPACE/bar/src/Main.scala"
    ],
    "valueHash": -286262790,
    "inputsHash": -475356575
}
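The substitution above could be sketched as a small helper that rewrites machine-specific prefixes into placeholders before upload and expands them again after download. This is only an illustration: PathPlaceholders, toPortable, and fromPortable are made-up names, and the $WORKSPACE / $COURSIER_CACHE placeholders are taken from the examples in this thread, not from any actual Mill API.

```scala
// Hypothetical sketch: rewrite machine-specific absolute prefixes into
// placeholders on upload, and expand them again on download.
// Replacing the Coursier cache prefix first handles the case where the
// cache happens to live inside the workspace.
object PathPlaceholders {
  def toPortable(ref: String, workspace: String, coursierCache: String): String =
    ref.replace(coursierCache, "$COURSIER_CACHE").replace(workspace, "$WORKSPACE")

  def fromPortable(ref: String, workspace: String, coursierCache: String): String =
    ref.replace("$COURSIER_CACHE", coursierCache).replace("$WORKSPACE", workspace)
}
```

So a ref like "ref:c9969f7e:/Users/alice/proj/bar/src/Main.scala" with workspace "/Users/alice/proj" becomes "ref:c9969f7e:$WORKSPACE/bar/src/Main.scala" before upload.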

@lihaoyi
Member

lihaoyi commented Nov 6, 2019

w.r.t. flagging on specific targets to be remote cached, maybe we could make it something like T.remoteCached{ ... }? The fact that macros can't take defaults is annoying, but copy-pasting the T.apply{ ... } macro under a different name should work.

w.r.t. Coursier resolved jars, perhaps one solution could be to copy the jar files into the T.ctx().dest folder before returning them to Mill? That way they can be properly relativized to the root of the Mill repository, similar to any other compiled files. There shouldn't be too many files, so copying them shouldn't waste too much time or space (a few hundred MB?).

We could symlink them instead, so they get properly picked up and uploaded to the remote cache, but do not waste space on the uploader's computer. OTOH they would still waste space on the downloader's computer, unless we have a separate step in the remote cache downloader that first downloads the remote-cached PathRefs to ~/.mill/remote-cache/<hash-of-target-inputs>/... and then symlinks them into the relevant folder. That way they won't waste space on either uploader or downloader computers
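The download-then-symlink idea could be sketched roughly as below. SharedCacheLinker and linkIntoOut are illustrative names only, and the shared-store layout keyed by the target's input hash follows the comment above rather than any existing Mill API.

```scala
import java.nio.file.{Files, Path}

// Hypothetical sketch of "download once, symlink into place": cached
// artifacts land in a shared store (e.g. ~/.mill/remote-cache) keyed by
// the target's input hash, and the out/ folder only holds symlinks, so
// the artifact bytes exist once per machine.
object SharedCacheLinker {
  def linkIntoOut(cacheStore: Path, inputsHash: Int, artifact: String, outDir: Path): Path = {
    val source = cacheStore.resolve(inputsHash.toHexString).resolve(artifact)
    val link   = outDir.resolve(artifact)
    Files.createDirectories(link.getParent) // ensure the out/ subfolder exists
    Files.deleteIfExists(link)              // replace any stale entry
    Files.createSymbolicLink(link, source)  // out/ entry points at the shared store
  }
}
```

With this layout, neither the uploader nor the downloader pays twice for disk space; only the shared store holds real files.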

@psilospore
Author

psilospore commented Nov 10, 2019

w.r.t. flagging on specific targets to be remote cached, maybe we could make it something like T.remoteCached{ ... }?

Also, we probably want some nice way to set the remote cache location. Bazel lets you specify your remote cache with a command-line option, --remote_cache=localhost:7000. Someone setting up the build might want other devs to just run bazel build without a bunch of arguments, so Bazel lets you check a .bazelrc into your VCS with some default options set: https://docs.bazel.build/versions/master/remote-caching.html#run-bazel-using-the-remote-cache Maybe we could take a similar approach? Something like .millrc?

Another configuration option we could add is a remote cache threshold, with some default value. Any task under that threshold wouldn't be uploaded; it would be a cache miss in other builds and get computed locally.

That being said, what if we had remote caching on by default, with a default threshold value the user could override with a command-line arg or in their .millrc?
Also, there might be some special cases where a user wants to prevent a task from being remotely cached, like forkEnv. Maybe for those special cases we could have a T.disableRemoteCache?

For the coursier resolved jars let me get back to you on that. I want to look at how mill works with coursier and maybe test some stuff out first when I have some time.

By the way I'm going to Scale by the bay. I saw that you were giving a talk. Maybe we can chat some time during the conference.

@lihaoyi
Member

lihaoyi commented Nov 11, 2019

A command line option to specify a cache URL seems reasonable. These would apply to all(-ish) targets within a build, so a top level flag is justified

Not sure what you mean by threshold, but having that as a flag seems reasonable too.

Now that we have ./mill bootstrap launchers, that becomes a natural place to put any such auxiliary flags or config that is specific to a particular project. It already specifies the MILL_VERSION after all. No need for a separate .millrc

I'll be at Scala by the bay, happy to talk

@psilospore
Author

psilospore commented Nov 24, 2019

Sorry I didn't get to catch up with you during the conference. Great talk, though! I read the blog post form of it a while ago; after being frustrated with SBT, it ended up getting me interested in Bazel and subsequently Mill.

Now that we have ./mill bootstrap launchers, that becomes a natural place to put any such auxiliary flags or config that is specific to a particular project. It already specifies the MILL_VERSION after all. No need for a separate .millrc

Didn't know about the launcher but that sounds good.

Not sure what you mean by threshold, but having that as a flag seems reasonable too.

I mean a minimum size for a task to upload, to solve the issue you mentioned earlier:

very-small intermediate tasks (e.g. computing version numbers, grouping classpaths into bigger classpaths) where the round trip time of fetching from the cache would likely dwarf the time taken to do the computation locally

So if there's a task whose output is 1 KB after compression, it's probably not worth uploading and fetching it. We could set the threshold to 1 MB, for example, and such a task would not be uploaded; for anyone else that would be a cache miss and it would be recomputed.

But rethinking that, it would probably be better to base the threshold on time rather than size. Locally, in one of my mill-profile.json files, I see a lot of tasks that take only 1 to 10 ms to compute, and some tasks that take a long time, like 8000 ms. We could upload only tasks that take longer than the specified threshold. There could be a default value as well if the user does not specify one.
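As a sketch of that time-based rule, assuming task durations are available (e.g. parsed from mill-profile.json): UploadFilter, worthUploading, and the 1000 ms default are illustrative names and values, not anything Mill actually ships.

```scala
// Hypothetical sketch: only upload targets whose recorded duration
// exceeds a threshold; everything else stays local and simply shows up
// as a cache miss on other machines.
object UploadFilter {
  val defaultThresholdMillis: Long = 1000L // assumed default, not a Mill setting

  def worthUploading(
      durationsMillis: Map[String, Long],
      thresholdMillis: Long = defaultThresholdMillis
  ): Set[String] =
    durationsMillis.filter { case (_, ms) => ms >= thresholdMillis }.keySet
}
```

A version-number task at 3 ms would never be uploaded, while an 8000 ms compile would be.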

Also sorry but I might be busy prepping and searching for a job for a while so it may take me a while to update this.

@lihaoyi
Member

lihaoyi commented Dec 1, 2019

@psilospore no problem! I am also in no hurry, as many of my own Mill projects are small enough that remote caching isn't a huge necessity. This PR can take as long as it needs to take.

@sequencer
Contributor

Any progress on remote build/cache?
I also want to use/implement this in my project; I'm not sure if there is already a great wheel to use.

@psilospore
Author

Hey @sequencer, I've been busy so I haven't had time, although I have been thinking about this, especially recently, since I'm attempting to use Mill at my new gig and this might be nice to have, especially for CI. I think I might get to this after one of my other open source initiatives is finished.

If you have time to contribute that would be greatly appreciated! Even input on some of the discussions we have had would really help.

@sequencer
Contributor

Hello @psilospore, thank you for working on this!
I'd like to have these features:

  1. store metadata on the local PC.
  2. send the metadata to a server, which will execute the task remotely.
  3. after the server finishes the job, it informs the client and sends back the result of the task.
  4. the client keeps the task metadata and remote machine information for caching.

Maybe this exceeds the scope of your cache server somewhere, but if you want to work on this, I'll be very willing to cooperate with you.

@psilospore
Author

@sequencer are you talking about implementing remote execution like bazel has? I think that would be very cool but I probably wouldn't have time to do that in the near future. It also seems like it would be complex. I would be happy to review your PR, make suggestions, and maybe contribute in a limited amount if you would take the lead on that. We probably want @lihaoyi's opinion on potentially adding a feature like remote execution first.

@sequencer
Contributor

Yes, but I'm still new to Scala and Mill; I'm trying to implement a VLSI build system with Mill.
If there are any ideas (architectures), I can try to implement a demo of this.

@lefou
Member

lefou commented Mar 28, 2020

If I understand the current approach correctly, we check whether we have a cached execution result in the remote cache; if not, we execute the target locally and upload the result to the remote cache. How about only uploading execution results that were "expensive" enough, e.g. took more than a few seconds to compute locally? Then we would simply never find cached results for "small" targets, and there would be no need to explicitly mark cacheable targets. Of course, asking whether a target is cached needs to be cheap and fast; ideally we could ask in advance for the cache status of the whole build graph.

@psilospore
Author

psilospore commented Mar 31, 2020

@lefou Thanks for the feedback! I think that's what I proposed earlier.

But rethinking that, it would probably be better to base the threshold on time rather than size. Locally, in one of my mill-profile.json files, I see a lot of tasks that take only 1 to 10 ms to compute, and some tasks that take a long time, like 8000 ms. We could upload only tasks that take longer than the specified threshold. There could be a default value as well if the user does not specify one.

Getting your feedback makes me think I should go forward with that.


Of course, asking whether a target is cached needs to be cheap and fast; ideally we could ask in advance for the cache status of the whole build graph.

I load the cached targets into a Map initially, which would essentially achieve that.

case class Cached(hashesAvailable: Map[String, List[Int]])

If it's in the cache then I fetch the artifacts associated with that target.
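A minimal sketch of how that lookup might work on top of the Cached map above; isHit is an illustrative helper, not part of the PR:

```scala
// Sketch: a target is a remote cache hit if the server advertises the
// target's current inputsHash among the hashes it has stored for that task.
case class Cached(hashesAvailable: Map[String, List[Int]]) {
  def isHit(task: String, inputsHash: Int): Boolean =
    hashesAvailable.get(task).exists(_.contains(inputsHash))
}
```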

@lefou
Member

lefou commented Mar 31, 2020

@lefou Thanks for the feedback! I think that's what I proposed earlier.

@psilospore I'm sorry, I must have somehow overlooked that.

Getting your feedback makes me think I should go forward with that.

Absolutely.

I load the cached targets into a Map initially, which would essentially achieve that.

case class Cached(hashesAvailable: Map[String, List[Int]])

If it's in the cache then I fetch the artifacts associated with that target.

I think it would make sense to refetch the cache state in between, to benefit from freshly inserted cached targets. Consider that another build may run in parallel, e.g. on another CI node, or a co-worker who just checked out the same latest commit as you did.

I tend to think we should consider local caches too. I sometimes used ccache (a compiler cache for C-like compilers) in the past, and my Gentoo machines still use it under the hood today. It would speed up the build in cases where I switch branches often.

@lefou
Member

lefou commented Mar 31, 2020

Regarding caching dependencies with coursier, I think we should disable caching those targets completely. The best thing to cache those would be some kind of shared dist/proxy/mirror repository. I don't know if coursier already supports this, but Maven does and most Linux distros also have a concept of mirror repositories.

@lefou lefou added the rebase-needed This pull request needs a git rebase before a merge is possible label Mar 31, 2020
@psilospore
Author

Regarding caching dependencies with coursier, I think we should disable caching those targets completely. The best thing to cache those would be some kind of shared dist/proxy/mirror repository.

Good idea and it would simplify stuff when I get back to it. Thanks!

I think it would make sense to refetch the cache state in between, to benefit from freshly inserted cached targets. Consider that another build may run in parallel, e.g. on another CI node, or a co-worker who just checked out the same latest commit as you did.

I might need to follow up on that. I think I have an idea but not sure if I'm completely following along.

I tend to think we should consider local caches too. I sometimes used ccache (a compiler cache for C-like compilers) in the past, and my Gentoo machines still use it under the hood today. It would speed up the build in cases where I switch branches often.

I'll have to research that.

@lefou lefou marked this pull request as draft April 22, 2020 21:28
@lefou
Member

lefou commented Jun 17, 2020

We should probably only consider targets for caching whose paths are all relative to the project base dir. All other paths (either absolute, or relative but pointing outside of the project (out) dir) should be ignored and never cached. As a result, we will never be able to cache all targets, but on the other hand we can cache some targets with relatively small effort.
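That eligibility rule could be sketched as a check that every path a target produces, once resolved, still lives under the project base dir. CacheEligibility and staysInside are illustrative names only.

```scala
import java.nio.file.Path

// Sketch: a target is cacheable only if all of its produced paths resolve
// to somewhere inside the project base dir. Absolute paths and relative
// paths that escape the project (e.g. into the Coursier cache) fail the check.
object CacheEligibility {
  def staysInside(baseDir: Path, refs: Seq[Path]): Boolean =
    refs.forall { p =>
      // resolve() leaves absolute paths as-is, so those fail startsWith below
      baseDir.resolve(p).normalize().startsWith(baseDir.normalize())
    }
}
```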

@lefou lefou mentioned this pull request Jan 9, 2021
@lefou
Member

lefou commented Apr 8, 2022

Some follow up discussions can be found here: #1400

@lefou
Member

lefou commented Apr 8, 2022

I'm closing this for now. We can re-open it or create a new one at any time.
