Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse out folder among different machines/directories #2101

Open
sequencer opened this issue Oct 30, 2022 · 19 comments
Open

Reuse out folder among different machines/directories #2101

sequencer opened this issue Oct 30, 2022 · 19 comments

Comments

@sequencer
Copy link
Contributor

I’m using mill to manage a relative big system. A usage is forking entire workspace , including 'out', sending to other machines, executed some thing else, fork and send again and again. This is used for distributing works to different nodes.
however due to the absolute path in PathRef, out cannot be directly packaged and sent away.
Thus I’m proposing to make the path in PathRef relative to T.ctx.workspace to reduce the dependency to environment out of build context.

@lefou
Copy link
Member

lefou commented Oct 30, 2022

We're open for discussion and development to support such scenarios. We discussed some aspects of it in the past and I'd like to reference it here FYI:

@lefou
Copy link
Member

lefou commented Oct 30, 2022

Although mill.api.PathRef is used in targets, it's not necessarily limited to it. Also, it is not limited to paths below T.workspace, so simply changing to a relative path may not work. I already wrote up an idea to let PathRefs support relative paths, which may work (see #1400 (comment)), but it was under the context, that we actively distribute work to other nodes. This is a bit different from your use case.

@sequencer
Copy link
Contributor Author

sequencer commented Oct 30, 2022

Yes I just checked those infos befor submitting this issue. Let me know what I can contribute to make this happening.

@sequencer
Copy link
Contributor Author

sequencer commented Oct 30, 2022

Actually, for the build script trigger reevaluating issue, I wonder if it’s possible to manually give each Module a version to reduce this behavior, which may give the chance to throw rebuild burden to users.

@CircuitCoder
Copy link

CircuitCoder commented Oct 30, 2022

Maybe another idea:

When walking the directory subtree rooted at p (the argument path in PathRef.apply), use each file's relative path to p to update the digest.

The reasoning is that if two directory tree has the same shape and the same content, we should consider them the "same" directory in the sense of caching. In this implementation, moving the entire subtree p to another location won't change its hash.

@sequencer
Copy link
Contributor Author

another(maybe unrelative) question is:
Is possible to get the full reason why a target is reevaluated, I thought when input is same caching should work, but have no idea to know why input is changed in some complex cases.

@lefou
Copy link
Member

lefou commented Oct 30, 2022

Well, currently PathRef is some kind of blackbox. Once there is one bit different, it's completely different. We probably could model that differently (e.g. a tree-like structure, so we also detect changes in parts of it or can re-use sub-parts) , but I think it's probably not efficient enough. Both, speed and memory wise.

So splitting it structurally on the smallest part might be overkill. But as @sequencer suggested, just keeping the info stable relative to T.workspace (if possible) might already be good enough, or at least a good start. We could also try to keep the path-info separated from the content hash, so that the proof whether two PathRefs are of equal content is possible, even when the containing path is different. That's probably what @CircuitCoder meant? In Mill, we are interested in both information, but not necessarily so closely coupled.

I think some experiments or creating a POC wouldn't hurt and could be fun. Got for it!

About the other issue, tracking the origin of a change, this should be probably discussed separately. But once, we invent some tracking, the whole user experience might change. And I fear not to the better. It might result in more complicated API for Input/Output types, for example. That's just a feeling though. Currently, hash-based change detection isn't easy to track (although Mill evaluator knows exactly which tasks were out of date, see out/profile.json), but it's damn fast and easy to apply.

@CircuitCoder
Copy link

CircuitCoder commented Oct 31, 2022

I'm trying to get minimal implementations of both approaches tested:

  • Add PathRef.rel(base: os.Path, path: os.RelPath, quick: boolean) (CircuitCoder@37cafd7). Also, the default ScalaModule.allSourceFiles was changed to PathRef.rel for testing purposes. (CI)
  • Change Path.apply into calculating digest with relative paths (CircuitCoder@986fa0c, CI)

It seems that the CI still needs a few hours to complete. Meanwhile, I'll try to check if these changes allow us to move cache directories.

UPDATE: a test failed because the repo name is different. Unfortunately, I already have a repo named mill, so cannot fork the repo using that name. I'll try to fork it into another organization and rerun CI.

@lefou
Copy link
Member

lefou commented Oct 31, 2022

UPDATE: a test failed because the repo name is different. Unfortunately, I already have a repo named mill, so cannot fork the repo using that name. I'll try to fork it into another organization and rerun CI.

All build and test jobs should not depend on the repo name. Only jobs, that in some way do release or publishing are bound to the repo name. You can also open a draft PRs, then the appropriate CI jobs will run on Mill repo.

@lefou
Copy link
Member

lefou commented Oct 31, 2022

Maybe, it's necessary to have some predefined, named path anchors, from which we can start digesting. workspace might be one of them. coursier-cache could be another one. (In general, I really like the idea of having some easy possibility to decouple the out directory from the project directory. E.g. In autotools/make land, you can build from another location. If we can accomplish something equal in Mill, we can easily build from read-only source trees or put all build output into some RAM-backed storage. Just thinking.) These anchors are probably a bit soft and Mill decides their real path at startup time. It can also be necessary, to ditch the quick optimization of PathRef, which we use for coursier dependencies primarily.

@sequencer
Copy link
Contributor Author

What I have been dreaming that PathRef can only store the hash of a file, and add an new API to access file system.

@lefou
Copy link
Member

lefou commented Oct 31, 2022

We probably need to try to resolve the PathRef against some predefined list of directories, so we can store that virtual name and a sub-path instead. Some alg like that:

// pseudo code

// defined at Mill startup time
val knownLocs: Seq[(PathName, Path)] = Seq(
  "out" -> outPath,
  "workspace" -> T.workspace,
  "coursier-cache" -> cachePath
)

// called at PathRef creation time
def createPathRef(path: Path): PathRef = {
  knownLocs.find { case (name, prefix) => path isSubPathOf prefix } match {
    case Some((name, prefix)) => // create a PathRef with sub-path and virtual base path
    case None => // create a non-portable PathRef with absolute path
  }
}

PathRef comparision is than still based on content hash and the resolved path.

@lefou
Copy link
Member

lefou commented Oct 31, 2022

Just some mockup

$ mill show main.jar
"vref:843cb117:out:main/jar.dest/out.jar"
$ mill show main.sources
[
  "vref:2318a653:workspace:main/src"
]

@lefou
Copy link
Member

lefou commented Nov 7, 2022

I experimented with this a bit. In my first iteration, I made the PathRef.sig independent of the associated path. I also implemented JSON pickling which takes some context into account, so the actual paths are relative to the specified context. Unfortunately, this alone will not automatically make the Mill cache distributabe, but at least I learned some things.

I will open a PR with the decoupling of PathRef.sig from the path soon, as it is probably a nice feature to have without any disadvantages (I think). It will not contain the context path feature though. (Edit: this is the PR: #2106)

In addition to a PathRef, which transports the information of a path and a content signature, we probably also need more types. One for a content tree (the .sig part of PathRef), but which hashCode does not change for different paths. With such a type we could refer to files/directories, where we are only interested in the content and potentially the relative file names. This might be proper input for compilers then. We need a good name though, e.g. TreeRef or ContentRef.

Additionally, we need an type to transport a path disregarding the content. This is almost a thin wrapper around os.Path, but we want to make it context aware. It's essentially a context plus a relative path. This one is needed to refer to actual files, but swap the context depending on the actual runtime location. One usage might be the compiler analysis file of Zinc, which we currently hold in CompileResult.analysisFile. We need a good name too. ContextPath?

The current PathRef could be then constructed from these two, as it depends on the actual path (which should be aware of a context) and the content signature.

To make the Mill cache distributable, we also need to refactor some targets, e.g. a compile target should not depend on targets that return PathRefs but ContentRefs instead. Same for the CompileResult. We probably need to review the whole scalalib architecture under this perspective.

@lefou
Copy link
Member

lefou commented Nov 7, 2022

SubPathRef might be a good name (for ContentRef).

@lolgab
Copy link
Member

lolgab commented Nov 7, 2022

Do exist cases when paths in the .json files in out are not subpaths of T.workspace ?
Could we process all the os.Paths and assume them to be relative to T.workspace? Probably not, asking just to validate this idea.

@lefou
Copy link
Member

lefou commented Nov 7, 2022

Currently, out is hard-coded, so your assumption is valid. But I'm planning to change that, or at least make it overridable via config or cli, to support more use cases like caching or memory-backed storage or read-only project directories.

@lefou
Copy link
Member

lefou commented Nov 7, 2022

@lolgab To your second question: We already store paths that are located outside of the T.workspace, e.g. coursier and ivy artifacts. Making these relative to T.workspace has more potential to cause harm than good.

@lefou
Copy link
Member

lefou commented Nov 22, 2022

PathRef.sig is also storing information about permissions. As these have different structure under Windows vs. Unix-based system, we probably won't be able to share between different OSes without further changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants