Proof of concept: Remote Caching #726
Conversation
…hecking if task is cached and fetching
…ater try uploading compressed content to tmp file then reading it?
Interesting! @psilospore how well does this work? Are we caching all tasks or just certain ones? Mill has a lot of very small intermediate tasks (e.g. computing version numbers, grouping classpaths into bigger classpaths) where the round-trip time of fetching from the cache would likely dwarf the time taken to do the computation locally.
Also, how does it work for things like coursier-resolved jars, which typically live in a machine-specific cache directory outside the project?
I was thinking of maybe adding a flag to … Also there is a task …
Oh, that's the thing I said I would mention later but forgot. But yes, coursier-resolved jars won't work with what I have. The thing I wrote to change absolute paths to relative paths works fine for paths inside the project, but the coursier cache path will point to the wrong location when someone else builds. Maybe I could upload it with an environment-variable placeholder, something like $COURSIER_CACHE. I could also do something like that for the project directory, with $WORKSPACE.
w.r.t. flagging specific targets to be remote cached, maybe we could make it something like …

w.r.t. Coursier-resolved jars, perhaps one solution could be to copy the jar files into the target's dest folder. We could symlink them instead, so they get properly picked up and uploaded to the remote cache, but do not waste space on the uploader's computer. OTOH they would still waste space on the downloader's computer, unless we have a separate step in the remote cache downloader that first downloads the remote-cached …
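The symlink variant suggested above could look roughly like this. This is a sketch under the assumption that the uploader walks the dest folder; `linkIntoDest` is a hypothetical helper, not a Mill API.

```scala
// Hedged sketch: symlink a coursier-resolved jar into a target's dest
// folder so the cache uploader picks it up, without duplicating the bytes
// on the uploader's machine. Helper name is hypothetical.
import java.nio.file.{Files, Path}

def linkIntoDest(jar: Path, destDir: Path): Path = {
  Files.createDirectories(destDir)            // make sure the dest folder exists
  val link = destDir.resolve(jar.getFileName) // keep the original jar file name
  if (!Files.exists(link)) Files.createSymbolicLink(link, jar)
  link
}
```

As noted above, this only saves space on the uploader's side; the downloader still materializes real files unless the download step is made cache-aware too.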
Also, we probably want some nice way to set the remote cache location. Bazel lets you specify your remote cache with a command line option.

Another configuration we could set is a remote cache threshold, with some default value: any tasks under that threshold won't be uploaded, would be a cache miss in other builds, and would just be computed locally. That being said, what if we had remote caching on by default, with a default threshold value the user could override with a command line arg or in their build file?

For the coursier-resolved jars, let me get back to you on that. I want to look at how mill works with coursier and maybe test some stuff out first when I have some time.

By the way, I'm going to Scala by the Bay. I saw that you were giving a talk. Maybe we can chat some time during the conference.
A command line option to specify a cache URL seems reasonable. These would apply to all(-ish) targets within a build, so a top-level flag is justified.

Not sure what you mean by a remote cache threshold.

Now that we have …

I'll be at Scala by the Bay, happy to talk.
Sorry I didn't get to catch up with you during the conference. Great talk though! I read the blog post form of that a while ago; after being frustrated with SBT, it ended up getting me interested in Bazel and subsequently mill.
Didn't know about the launcher but that sounds good.
I mean a minimum size for a task to upload, to solve the issue you mentioned earlier:

So if there's a task that is 1 KB after compression, it's probably not worth uploading and fetching it. We could set a threshold to 1 MB, for example, and anything smaller would not be uploaded; for anyone else that would be a cache miss and it would be recomputed locally. But rethinking that, it would probably be better to base the threshold on time rather than size. Locally, in one of my …

Also, sorry, but I might be busy prepping and searching for a job for a while, so it may take me a while to update this.
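The time-based threshold floated above could be expressed as something like the following sketch. The config class, field names, and the default cutoff are all illustrative assumptions, not anything from the PR or from Mill's API.

```scala
// Hedged sketch of a time-based upload threshold: only upload results
// whose local computation took longer than a configurable cutoff, since
// for cheap tasks the round trip would dwarf recomputation.
// RemoteCacheConfig and its defaults are hypothetical.
final case class RemoteCacheConfig(
    url: String,
    minComputeMillis: Long = 1000L // illustrative default, overridable per build
)

def shouldUpload(computeMillis: Long, cfg: RemoteCacheConfig): Boolean =
  computeMillis >= cfg.minComputeMillis
```

A task that took 50 ms locally would be skipped and simply recomputed by other builds, while a 30-second compile would be uploaded.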
@psilospore no problem! I am also in no hurry, as many of my own Mill projects are small enough that remote caching isn't a huge necessity. This PR can take as long as it needs to take.
Any progress on remote build/cache?
Hey @sequencer, I've been busy so I haven't had time, although I have been thinking about this, especially recently, since I'm attempting to use mill at my new gig and this might be nice to have, especially for CI. I think I might get to this after one of my other open source initiatives is finished. If you have time to contribute, that would be greatly appreciated! Even input on some of the discussions we have had would really help.
Hello @psilospore, thank you for working on this!
Maybe it exceeds the scope of your cache server somewhere, but if you want to work on this, I'll be very willing to cooperate with you.
@sequencer are you talking about implementing remote execution like Bazel has? I think that would be very cool, but I probably wouldn't have time to do that in the near future. It also seems like it would be complex. I would be happy to review your PR, make suggestions, and maybe contribute in a limited amount if you would take the lead on that. We probably want @lihaoyi's opinion on potentially adding a feature like remote execution first.
Yes. But I'm still new to Scala and mill; I'm trying to implement a VLSI build system with mill.
If I understand the current approach correctly, we check if we have a cached execution result in the remote cache; if not, we execute the target locally and upload it to the remote cache. How about only uploading execution results that were "expensive" enough, e.g. that took more than a few seconds to compute locally? We would then never find cached results for "small" targets, and there would be no need to explicitly mark cacheable targets. Of course, asking whether a target is cached needs to be cheap and fast; ideally we can ask in advance for the cache status of the whole build graph.
@lefou Thanks for the feedback! I think that's what I proposed earlier.
Getting your feedback makes me think I should go forward with that.
I load the cached targets into a Map initially which would essentially achieve that.
If it's in the cache, then I fetch the artifacts associated with that target.
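The Map-based prefetch described above can be sketched like this. It is a minimal illustration: `CacheEntry`, the index shape, and the helper names are assumptions, and the actual index would come from one HTTP request to the cache server rather than an in-memory list.

```scala
// Hedged sketch: fetch the remote cache's index once up front, so asking
// "is this target cached?" becomes a local Map lookup instead of a network
// round trip per target. CacheEntry and the helpers are hypothetical.
final case class CacheEntry(targetHash: String, artifactUrl: String)

// In the real flow the entries would be the decoded body of a single
// index request to the cache server.
def buildIndex(entries: Seq[CacheEntry]): Map[String, CacheEntry] =
  entries.map(e => e.targetHash -> e).toMap

def lookup(index: Map[String, CacheEntry], hash: String): Option[CacheEntry] =
  index.get(hash) // Some(entry) => fetch artifacts; None => compute locally
```

Checking the whole build graph against such an index is then linear in the number of targets, with no per-target network cost.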
@psilospore I'm sorry, I must have somehow overlooked that.
Absolutely.
I think it would make sense to refetch the cache state in between, to profit from freshly inserted cached targets. Consider that another build may run in parallel, e.g. on another CI node, or a co-worker may have just checked out the same commit as you did. I tend to think we should even consider local caches too. I sometimes used …
Regarding caching dependencies resolved with coursier, I think we should disable caching those targets completely. The best way to cache those would be some kind of shared dist/proxy/mirror repository. I don't know if coursier already supports this, but Maven does, and most Linux distros also have a concept of mirror repositories.
Good idea and it would simplify stuff when I get back to it. Thanks!
I might need to follow up on that. I think I have an idea but not sure if I'm completely following along.
I'll have to research that.
We should probably only consider those targets for caching which have relative paths, relative to the project base dir. All other paths (either absolute, or relative but pointing outside of the project (out) dir) should be ignored and never cached. As a result, we will never be able to cache all targets, but on the other side we can cache some targets with relatively small effort.
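That rule could be checked with something like the sketch below, assuming the executor can enumerate the path strings a target's result refers to. The function name is hypothetical.

```scala
// Hedged sketch of the rule above: a target is cacheable only if every
// path its result records is relative and resolves inside the project
// base directory (no absolute paths, no ../ escapes).
import java.nio.file.{Path, Paths}

def isCacheable(paths: Seq[String], projectDir: Path): Boolean =
  paths.forall { p =>
    val candidate = Paths.get(p)
    !candidate.isAbsolute &&
      projectDir.resolve(candidate).normalize().startsWith(projectDir.normalize())
  }
```

So a result referring to `out/foo/dest/a.jar` would qualify, while one referring to `/home/alice/.cache/...` or `../elsewhere` would be skipped and always computed locally.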
Some follow-up discussions can be found here: #1400
I'm closing this for now. We can re-open it or create a new one at any time.
Hey @lihaoyi,
I spoke to you on gitter about remote caching maybe a month or so ago and here's my attempt.
I have a remote caching server implementation and a write up here: https://github.com/psilospore/mill-remote-cache-server
This was a fun exercise! Let me know how I could improve this.
In addition to the information found in the readme linked above, on the mill front I wrote code that archives the out folder into a .tar with ammonite.ops.%%.
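The archiving step mentioned in the description might look roughly like the sketch below. The original used ammonite.ops.%% to shell out; scala.sys.process is used here instead so the example is self-contained, and the paths and function name are illustrative, not the PR's actual code.

```scala
// Hedged sketch: compress the `out` folder into a tarball before uploading
// it to the remote cache server. Shells out to `tar` (assumed to be on
// PATH), much like ammonite.ops.%% would; returns the exit code.
import scala.sys.process._

def tarOutDir(outDir: String, archive: String): Int =
  Seq("tar", "czf", archive, "-C", outDir, ".").!
```

The `-C outDir .` part archives paths relative to the out folder, so the tarball unpacks cleanly into any other checkout's out directory.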