Reuse out folder among different machines/directories #2101

sequencer · 2022-10-30T17:04:14Z

I’m using mill to manage a relatively big system. One use case is forking the entire workspace, including 'out', sending it to other machines, executing something else there, then forking and sending again and again. This is used to distribute work to different nodes.
However, due to the absolute path in PathRef, out cannot be directly packaged and sent away.
Thus I’m proposing to make the path in PathRef relative to T.ctx.workspace, to reduce the dependency on the environment outside the build context.

lefou · 2022-10-30T17:15:48Z

We're open to discussion and development to support such scenarios. We discussed some aspects of it in the past and I'd like to reference it here FYI:

lefou · 2022-10-30T17:23:38Z

Although mill.api.PathRef is used in targets, it's not necessarily limited to them. It is also not limited to paths below T.workspace, so simply changing to a relative path may not work. I already wrote up an idea to let PathRefs support relative paths, which may work (see #1400 (comment)), but that was in the context of Mill actively distributing work to other nodes. This is a bit different from your use case.

sequencer · 2022-10-30T17:28:35Z

Yes, I checked that information before submitting this issue. Let me know what I can contribute to make this happen.

sequencer · 2022-10-30T17:32:05Z

Actually, regarding the issue of build script changes triggering re-evaluation, I wonder if it’s possible to manually give each Module a version to reduce this behavior, which would shift the rebuild burden to users.

CircuitCoder · 2022-10-30T17:34:20Z

Maybe another idea:

When walking the directory subtree rooted at p (the argument path in PathRef.apply), use each file's relative path to p to update the digest.

The reasoning is that if two directory trees have the same shape and the same content, we should consider them the "same" directory in the sense of caching. In this implementation, moving the entire subtree p to another location won't change its hash.
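A minimal sketch of this idea, assuming plain java.nio.file and SHA-256 rather than Mill's own digest code. The helper names relativeTreeDigest and sampleTree are hypothetical and only serve as illustration. The sketch ignores empty directories and file permissions.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import java.security.MessageDigest
import scala.jdk.CollectionConverters._

// Hypothetical helper, not Mill's PathRef.apply: digest a directory tree using
// only paths relative to `root`, so the absolute location never enters the hash.
def relativeTreeDigest(root: Path): String = {
  val md = MessageDigest.getInstance("SHA-256")
  val stream = Files.walk(root)
  try {
    stream.iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .map(p => (root.relativize(p).iterator().asScala.mkString("/"), p))
      .toSeq
      .sortBy(_._1) // fixed order, independent of file system listing order
      .foreach { case (rel, file) =>
        md.update(rel.getBytes(UTF_8))      // relative name
        md.update(0.toByte)                 // separator between name and content
        md.update(Files.readAllBytes(file)) // file content
      }
  } finally stream.close()
  md.digest().map(b => "%02x".format(b)).mkString
}

// Usage: two trees in different temp locations with identical shape and content.
def sampleTree(): Path = {
  val dir = Files.createTempDirectory("tree")
  Files.createDirectories(dir.resolve("src"))
  Files.write(dir.resolve("src").resolve("A.scala"), "object A".getBytes(UTF_8))
  dir
}
val a = sampleTree()
val b = sampleTree()
println(relativeTreeDigest(a) == relativeTreeDigest(b)) // true
```

Because only the relative names and the file bytes are fed into the digest, the two temp directories hash the same even though their absolute paths differ.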

sequencer · 2022-10-30T17:39:08Z

Another (maybe unrelated) question:
Is it possible to get the full reason why a target is re-evaluated? I thought caching should work when the input is the same, but in some complex cases I have no way of knowing why the input changed.

lefou · 2022-10-30T20:57:32Z

Well, currently PathRef is a kind of black box. If a single bit differs, it's completely different. We probably could model that differently (e.g. as a tree-like structure, so we also detect changes in parts of it or can re-use sub-parts), but I think it's probably not efficient enough, both speed and memory wise.

So splitting it structurally down to the smallest part might be overkill. But as @sequencer suggested, just keeping the info stable relative to T.workspace (if possible) might already be good enough, or at least a good start. We could also try to keep the path info separate from the content hash, so that we can prove whether two PathRefs have equal content even when the containing path is different. That's probably what @CircuitCoder meant? In Mill, we are interested in both pieces of information, but they need not be so closely coupled.

I think some experiments or creating a POC wouldn't hurt and could be fun. Go for it!

About the other issue, tracking the origin of a change: this should probably be discussed separately. But once we invent some tracking, the whole user experience might change, and I fear not for the better. It might result in a more complicated API for Input/Output types, for example. That's just a feeling though. Currently, hash-based change detection isn't easy to track (although the Mill evaluator knows exactly which tasks were out of date, see out/profile.json), but it's damn fast and easy to apply.

CircuitCoder · 2022-10-31T03:31:40Z

I'm trying to get minimal implementations of both approaches tested:

1. Add PathRef.rel(base: os.Path, path: os.RelPath, quick: Boolean) (CircuitCoder@37cafd7). Also, the default ScalaModule.allSourceFiles was changed to PathRef.rel for testing purposes. (CI)
2. Change PathRef.apply to calculate the digest with relative paths (CircuitCoder@986fa0c, CI)

It seems that the CI still needs a few hours to complete. Meanwhile, I'll try to check if these changes allow us to move cache directories.

UPDATE: a test failed because the repo name is different. Unfortunately, I already have a repo named mill, so cannot fork the repo using that name. I'll try to fork it into another organization and rerun CI.

lefou · 2022-10-31T07:54:23Z

> UPDATE: a test failed because the repo name is different. Unfortunately, I already have a repo named mill, so cannot fork the repo using that name. I'll try to fork it into another organization and rerun CI.

None of the build and test jobs should depend on the repo name. Only jobs that do some kind of release or publishing are bound to the repo name. You can also open a draft PR; the appropriate CI jobs will then run on the Mill repo.

lefou · 2022-10-31T08:07:58Z

Maybe it's necessary to have some predefined, named path anchors from which we can start digesting. workspace might be one of them. coursier-cache could be another one. (In general, I really like the idea of having an easy way to decouple the out directory from the project directory. E.g. in autotools/make land, you can build from another location. If we can accomplish something similar in Mill, we can easily build from read-only source trees or put all build output into some RAM-backed storage. Just thinking.) These anchors are probably a bit soft, and Mill decides their real path at startup time. It may also be necessary to ditch the quick optimization of PathRef, which we use primarily for coursier dependencies.

sequencer · 2022-10-31T08:54:38Z

What I have been dreaming of is that PathRef stores only the hash of a file, with a new API added to access the file system.

lefou · 2022-10-31T09:50:30Z

We probably need to try to resolve the PathRef against some predefined list of directories, so we can store that virtual name and a sub-path instead. An algorithm like this:

// pseudo code (outPath, cachePath and the PathRef construction are placeholders)

// defined at Mill startup time
val knownLocs: Seq[(String, os.Path)] = Seq(
  "out" -> outPath,
  "workspace" -> T.workspace,
  "coursier-cache" -> cachePath
)

// called at PathRef creation time
def createPathRef(path: os.Path): PathRef = {
  // the first location that contains the path wins
  knownLocs.find { case (_, prefix) => path.startsWith(prefix) } match {
    case Some((name, prefix)) => ??? // create a PathRef with sub-path and virtual base path
    case None => ??? // create a non-portable PathRef with absolute path
  }
}

PathRef comparison is then still based on the content hash and the resolved path.
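The pseudo code above could be fleshed out roughly as follows. This is only a sketch under stated assumptions: it uses java.nio.file instead of os-lib, and Ref, VirtualRef, AbsoluteRef, createRef and resolve are hypothetical names, not Mill API. The example paths are made up and assume Unix-style absolute paths.

```scala
import java.nio.file.{Path, Paths}
import scala.jdk.CollectionConverters._

// Hypothetical types for illustration; none of this is Mill's actual API.
sealed trait Ref
final case class VirtualRef(anchor: String, subPath: String) extends Ref // portable
final case class AbsoluteRef(path: Path) extends Ref                      // not portable

// Resolve `path` against the known locations. The first matching anchor wins,
// so more specific locations (out inside workspace) must be listed first.
def createRef(knownLocs: Seq[(String, Path)], path: Path): Ref = {
  val abs = path.toAbsolutePath.normalize()
  knownLocs.collectFirst {
    case (name, prefix) if abs.startsWith(prefix) =>
      VirtualRef(name, prefix.relativize(abs).iterator().asScala.mkString("/"))
  }.getOrElse(AbsoluteRef(abs))
}

// Turn a reference back into a real path using the anchors of the current machine.
def resolve(knownLocs: Seq[(String, Path)], ref: Ref): Option[Path] = ref match {
  case VirtualRef(anchor, sub) => knownLocs.find(_._1 == anchor).map(_._2.resolve(sub))
  case AbsoluteRef(p)          => Some(p)
}

// Usage: the same virtual reference resolves differently on two machines.
val machineA = Seq(
  "out"       -> Paths.get("/home/alice/proj/out").toAbsolutePath,
  "workspace" -> Paths.get("/home/alice/proj").toAbsolutePath
)
val machineB = Seq(
  "out"       -> Paths.get("/srv/build/out").toAbsolutePath,
  "workspace" -> Paths.get("/srv/build").toAbsolutePath
)
val srcRef = createRef(machineA, Paths.get("/home/alice/proj/main/src"))
println(srcRef)                    // VirtualRef(workspace,main/src)
println(resolve(machineB, srcRef)) // the same sub-path below machine B's workspace
```

Paths outside every known location fall back to AbsoluteRef, which matches the "non-portable" branch of the pseudo code.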

lefou · 2022-10-31T09:54:52Z

Just some mockup

$ mill show main.jar
"vref:843cb117:out:main/jar.dest/out.jar"
$ mill show main.sources
[
  "vref:2318a653:workspace:main/src"
]
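To make the mockup concrete, here is a small sketch of how such strings might be rendered and parsed. The vref:<sig>:<anchor>:<sub-path> layout is read off the mockup above; the VRef type and its methods are hypothetical and not part of Mill.

```scala
// Hypothetical model of the mocked-up format "vref:<sig>:<anchor>:<sub-path>".
final case class VRef(sig: String, anchor: String, subPath: String) {
  def render: String = s"vref:$sig:$anchor:$subPath"
}

object VRef {
  // Split into at most four fields, so a ':' inside the sub-path is preserved.
  def parse(s: String): Option[VRef] = s.split(":", 4) match {
    case Array("vref", sig, anchor, sub) => Some(VRef(sig, anchor, sub))
    case _                               => None
  }
}

// Usage with the strings from the mockup.
val jarRef = VRef.parse("vref:843cb117:out:main/jar.dest/out.jar")
println(jarRef.map(_.anchor))  // Some(out)
println(jarRef.map(_.subPath)) // Some(main/jar.dest/out.jar)
```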

lefou · 2022-11-07T12:15:04Z

I experimented with this a bit. In my first iteration, I made PathRef.sig independent of the associated path. I also implemented JSON pickling which takes some context into account, so the actual paths are relative to the specified context. Unfortunately, this alone will not automatically make the Mill cache distributable, but at least I learned some things.

I will open a PR with the decoupling of PathRef.sig from the path soon, as it is probably a nice feature to have without any disadvantages (I think). It will not contain the context path feature though. (Edit: this is the PR: #2106)

In addition to a PathRef, which transports the information of a path and a content signature, we probably also need more types. One is for a content tree (the .sig part of PathRef), whose hashCode does not change for different paths. With such a type we could refer to files/directories where we are only interested in the content and potentially the relative file names. This might then be the proper input for compilers. We need a good name though, e.g. TreeRef or ContentRef.

Additionally, we need a type to transport a path regardless of the content. This is almost a thin wrapper around os.Path, but we want to make it context aware. It's essentially a context plus a relative path. This one is needed to refer to actual files, while swapping the context depending on the actual runtime location. One usage might be the compiler analysis file of Zinc, which we currently hold in CompileResult.analysisFile. We need a good name for it too. ContextPath?

The current PathRef could then be constructed from these two, as it depends on the actual path (which should be aware of a context) and the content signature.

To make the Mill cache distributable, we also need to refactor some targets, e.g. a compile target should not depend on targets that return PathRefs but on ones that return ContentRefs instead. The same goes for the CompileResult. We probably need to review the whole scalalib architecture from this perspective.
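The split described above might look roughly like this. It is a sketch only: ContentRef and ContextPath are the candidate names from this comment, PathRefSketch is a hypothetical stand-in for a recomposed PathRef, and the fields and the Int signature are assumptions rather than a worked-out design.

```scala
import java.nio.file.{Path, Paths}

// Content only: equal for two trees with the same content, wherever they live.
final case class ContentRef(sig: Int)

// Location only: a named context plus a relative sub-path, resolved at runtime.
final case class ContextPath(context: String, subPath: String) {
  def resolve(contexts: Map[String, Path]): Option[Path] =
    contexts.get(context).map(_.resolve(subPath))
}

// Hypothetical recomposition of today's PathRef from the two parts.
final case class PathRefSketch(path: ContextPath, content: ContentRef)

// Usage: same content in two different locations.
val original = PathRefSketch(ContextPath("workspace", "main/src"), ContentRef(0x2318a653))
val copied   = PathRefSketch(ContextPath("out", "cache/main/src"), ContentRef(0x2318a653))

println(original.content == copied.content) // true
println(original == copied)                 // false
```

A target that only takes the ContentRef as input would see both values as equal, while code that needs the actual file can still resolve the ContextPath against the contexts of the current machine.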

lefou · 2022-11-07T12:37:45Z

SubPathRef might be a good name (for ContentRef).

lolgab · 2022-11-07T19:46:03Z

Are there cases where paths in the .json files in out are not subpaths of T.workspace?
Could we process all the os.Paths and assume them to be relative to T.workspace? Probably not, I'm just asking to validate this idea.

lefou · 2022-11-07T20:14:25Z

Currently, out is hard-coded, so your assumption is valid. But I'm planning to change that, or at least make it overridable via config or cli, to support more use cases like caching or memory-backed storage or read-only project directories.

lefou · 2022-11-07T20:20:43Z

@lolgab To your second question: We already store paths that are located outside of the T.workspace, e.g. coursier and ivy artifacts. Making these relative to T.workspace has more potential to cause harm than good.

lefou · 2022-11-22T07:12:57Z

PathRef.sig also stores information about permissions. As these are structured differently on Windows and Unix-based systems, we probably won't be able to share between different OSes without further changes.

lefou mentioned this issue Nov 28, 2022

Amend the way Mill invokes Zinc to enable benefits from remote caching #2153

Open

lefou mentioned this issue Sep 21, 2023

POC Remote Caching #2777

Draft

lihaoyi mentioned this issue Oct 4, 2024

Make out/ folder contents (more) reproducible and filesystem layout agnostic (1500USD bounty) #3660

Open

rahat2134 added a commit to rahat2134/mill that referenced this issue Oct 17, 2024

Implement reproducible out/ folder (com-lihaoyi#2101)

9ac205e

rahat2134 mentioned this issue Oct 17, 2024

[WIP]Implement reproducible out/ folder contents across different filesystem layouts #3765

Draft

jodersky mentioned this issue Dec 4, 2024

Standardize usage of machine-wide caches #3930

Open
