Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runfiles workspace directories use wrong workspace name if repository renaming is in effect #15029

Closed
phst opened this issue Mar 12, 2022 · 39 comments
Assignees
Labels
area-Bzlmod Bzlmod-specific PRs, issues, and feature requests P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug

Comments

@phst
Copy link
Contributor

phst commented Mar 12, 2022

Description of the problem / feature request:

If a workspace B depends on a workspace A, but uses the name C instead of A to name the workspace, the runfiles for workspace A will appear as C in the runfiles directory, breaking runfiles lookup for any library in workspace A.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  • /tmp/test/a/WORKSPACE:

    workspace(name = "a")
  • /tmp/test/a/BUILD:

    cc_library(
        name = "lib",
        srcs = ["lib.cc"],
        hdrs = ["lib.h"],
        data = ["data.txt"],
        visibility = ["//visibility:public"],
        deps = ["@bazel_tools//tools/cpp/runfiles"],
    )
  • /tmp/test/a/lib.h:

    #include <string>
    void Test(const std::string& argv0);
  • /tmp/test/a/lib.cc:

    #include "lib.h"
    
    #include <exception>
    #include <fstream>
    #include <iostream>
    #include <string>
    
    #include "tools/cpp/runfiles/runfiles.h"
    
    void Test(const std::string& argv0) {
      using bazel::tools::cpp::runfiles::Runfiles;
      const Runfiles* runfiles = Runfiles::Create(argv0);
      if (runfiles == nullptr) throw std::runtime_error("missing runfiles");
      std::string filename = runfiles->Rlocation("a/data.txt");
      if (filename.empty()) throw std::runtime_error("can't find runfile");
      std::ifstream stream(filename);
      stream.exceptions(std::ios::failbit | std::ios::badbit);
      if (!stream.is_open()) throw std::runtime_error("can't open runfile");
      std::cout.exceptions(std::ios::failbit | std::ios::badbit);
      std::cout << stream.rdbuf() << std::endl;
    }
  • /tmp/test/a/data.txt:

    hello world
    
  • /tmp/test/b/WORKSPACE:

    workspace(name = "b")
    
    local_repository(
        name = "c",  ## "a"
        path = "/tmp/test/a",
        repo_mapping = {"@c": "@a"},
    )
  • /tmp/test/b/BUILD:

    cc_binary(
        name = "bin",
        srcs = ["bin.cc"],
        deps = ["@c//:lib"],  ## "@a//:lib"
    )
  • /tmp/test/b/bin.cc:

    #include "lib.h"
    
    int main(int argc, char** argv) {
      Test(argc > 0 ? argv[0] : "");
    }

Then, cd /tmp/test/b && bazel run :bin will fail. In /tmp/test/b/bazel-bin/bin.runfiles there will be a subdirectory c (i.e. the name of the repository in b), but no subdirectory a (i.e. the actual name of the a repository).

The same works after renaming the local repository in b/WORKSPACE to a (as the comments indicate) and commenting out the repo_mapping attribute.

This is a problem because code in workspace A can't possible know that its clients import it under a different name. Therefore, code in workspace A can only refer to runfiles within its own workspace under the name that it has declared in its own WORKSPACE file (a in this example). This means that the runfiles actually have to be available under the a name, independent of any repository mappings. So Bazel should not only create a symlink c/data.txt in the runfiles directory, but also a/data.txt. More generally, any runfile should be available under both its local workspace name and any renamed/remapped workspace names. In case of name clashes, Bazel should signal an error, as it does for other name clashes.

What operating system are you running Bazel on?

macOS

What's the output of bazel info release?

release 5.0.0-homebrew

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

Irrelevant

Have you found anything relevant by searching the web?

There are probably many related discussions, but I haven't found a bug report about this specific issue itself. Probably the problem will become much worse with bzlmod due to its extensive use of repository renaming, and I'd consider it a blocker for enabling bzlmod by default.

@fmeum
Copy link
Collaborator

fmeum commented Mar 13, 2022

More generally, any runfile should be available under both its local workspace name and any renamed/remapped workspace names. In case of name clashes, Bazel should signal an error, as it does for other name clashes.

I fear this would run counter to the bzlmod principle that repositories introduced by transitive dependencies should not be able to break you. It may work well enough with the current WORKSPACE system and infrequent uses of repository mappings, but as soon as all repos are mapped, this will probably result in frequent breakages.

The way I have solved this problem so far has been to generate the string passed to rlocation using a Starlark rule, which can compute the correct path by combining ctx.workspace_name and short_path. This has the advantage that it works with Bazel as is, but the downside that it requires language-specific code generation and forces users to move their data dependencies to a separate target. I have implemented this approach for Java and C++ (see https://github.com/fmeum/rules_runfiles), but it wouldn't be difficult to extend to other languages. This does bring other benefits with it such as compile-time checking and IDE completions for rlocation paths.

@phst If you find the time to try out rules_runfiles, I would be very interested in your feedback on whether this approach is viable.

Alternatives that do not rely on code generation seem tricky: We could try to provide the complete repository mapping to the runfiles library similar to how this is done for the RUNFILES_MANIFEST. That's not as simple as it sounds though: At runtime, it's not possible to tell from which workspace a given rlocation call originates. The most common case will be that of the binary target itself, but this will not hold in general (a binary could depend on a library from a different repository that brings in runfiles).

@Wyverald I fully agree with @phst that this is one of the major problems to solve before the official release of bzmlod. Maybe we could hold a brainstorming meeting with interested parties at some point?

@Wyverald
Copy link
Member

Maybe we could hold a brainstorming meeting with interested parties at some point?

Yes, this would be great (I'd also like a chance to understand the problem better). cc @meteorcloudy

@meteorcloudy
Copy link
Member

This is indeed a very important problem to solve, @Wyverald can you schedule a meeting and add interested people to join the brainstorming?

@oquenchil oquenchil added type: support / not a bug (process) untriaged team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website labels Mar 16, 2022
@phst
Copy link
Contributor Author

phst commented Mar 17, 2022

More generally, any runfile should be available under both its local workspace name and any renamed/remapped workspace names. In case of name clashes, Bazel should signal an error, as it does for other name clashes.

I fear this would run counter to the bzlmod principle that repositories introduced by transitive dependencies should not be able to break you. It may work well enough with the current WORKSPACE system and infrequent uses of repository mappings, but as soon as all repos are mapped, this will probably result in frequent breakages.

That might be, though my gut feeling says that it might not be so bad in practice. Clashes only occur if the transitive workspace dependency set contains two workspaces with identical local names and both contain a library with an identically-named runfile and both libraries are linked into the same binary. I think at least the first two of these conditions are already pretty rare: people tend to use unambiguous local workspace names such as rules_python, and since languages such as C++ don't include workspace names in import filenames, package names also tend to be globally unique. But I agree that it's a problem.

The way I have solved this problem so far has been to generate the string passed to rlocation using a Starlark rule, which can compute the correct path by combining ctx.workspace_name and short_path. This has the advantage that it works with Bazel as is, but the downside that it requires language-specific code generation and forces users to move their data dependencies to a separate target. I have implemented this approach for Java and C++ (see https://github.com/fmeum/rules_runfiles), but it wouldn't be difficult to extend to other languages. This does bring other benefits with it such as compile-time checking and IDE completions for rlocation paths.

That is probably a better way forward, yes. However, I'm quite concerned about existing codebases. The runfiles libraries are old, officially supported, and widely used, and their behavior is at least semi-officially documented (https://docs.google.com/document/d/e/2PACX-1vSDIrFnFvEYhKsCMdGdD40wZRBX3m3aZ5HhVj4CtHPmiXKDCxioTUbYsDydjKtFDAzER5eg7OjJWs3V/pub). Besides, the "desired" behavior is even officially documented (https://docs.bazel.build/versions/5.0.0/skylark/lib/globals.html#workspace):

For example, if there is a runfile foo/bar in the local repository and the WORKSPACE file contains workspace(name = 'baz'), then the runfile will be available under mytarget.runfiles/baz/foo/bar.

Note that this refers explicitly to the local workspace name (as specified in WORKSPACE), not any remapped name.

I think before we can make such a significant change to the behavior of runfiles, we'd need to do at least the following:

@phst If you find the time to try out rules_runfiles, I would be very interested in your feedback on whether this approach is viable.

I don't doubt that it's viable; this bug is currently mostly theoretical, I don't have a concrete example (but I guess that's only the case because repository remappings are still rare).

@fmeum
Copy link
Collaborator

fmeum commented Mar 18, 2022

That might be, though my gut feeling says that it might not be so bad in practice. Clashes only occur if the transitive workspace dependency set contains two workspaces with identical local names and both contain a library with an identically-named runfile and both libraries are linked into the same binary. I think at least the first two of these conditions are already pretty rare: people tend to use unambiguous local workspace names such as rules_python, and since languages such as C++ don't include workspace names in import filenames, package names also tend to be globally unique.

That's a great point, which to me suggests a two-layered approach that could look as follows (this also summarized results of our discussion to some extent):

  1. Introduce a flag --legacy_best_effort_remap_runfiles with values on, off, and auto, where the latter evaluates to on with --enable_bzlmod and off otherwise. If the flag is effectively on, for each repository, the runfiles directory and manifest contain rlocation path mappings for both the canonical name of that repository as well as all names under which it is mapped by any repository (including itself). We could then use existing conflict checkers or add our own logic to provide users with an informative error message in case this leads to conflicts, e.g. "Both 'canonical_name_1andcanonical_name_2` are reference as 'mapped_name' in the build and contain a runfile at path 'path'." and suggest a way out.

  2. Serialize the entire repository mapping, which is essentially a function (canonical name of current repository, mapped name of a repository A) -> canonical name of A, to a file that can be looked up in the runfiles at a fixed rlocation path (e.g. REPOSITORY_MAPPING, which can't clash with other valid runfiles since it doesn't contain a /). This incidentally solves Bzlmod: stardoc is broken by repo mappings #14140. The format of this file could be canonically indented JSON, which would make it trivial to parse with and still easy to parse without a JSON parser.

  3. Offer new overloads of the rlocation functions in the various runfiles libraries that allow passing in the current repository name of the call site (which very importantly is not necessarily the repository of the *_binary in which the call site ends up). Roughly, I could see the following APIs being achievable, assuming we get some logic added to the individual rules:

C++: If every cc_* rule had an implicit local define -DBAZEL_CURRENT_REPOSITORY=<canonical repo name>, then Rlocation calls would only have to change from Rlocation("repo/path/to/pkg/file") to Rlocation(BAZEL_CURRENT_REPOSITORY, "repo/path/to/pkg/file"). A macro may even serve to make the change invisible to users.

Java: Assume every java_* rule contained an implicit source file as follows or, alternatively, a compile-time dependency on a neverlink library containing such a source file:

package com.google.devtools.build.runfiles;

public final class Bazel {
  public static final String CURRENT_REPOSITORY = "<canonical  repo name>";
  private Bazel();
}

Since the JLS mandates that primitive constants be inlined into class files, users could just call rlocation(Bazel.CURRENT_REPOSITORY, "repo/path/to/pkg/file") and the first argument would always have the correct value.

Python/Shell: I don't know these languages well enough to say what a good API could look like. Potentially, we may not need any user-facing changes though as it may be possible to extract the canonical name of the current repo from the path of the current source file at runtime.

  1. Eventually drop support for --legacy_best_effort_runfiles_mapping.

@meteorcloudy meteorcloudy added P2 We'll consider working on this in future. (Assignee optional) area-Bzlmod Bzlmod-specific PRs, issues, and feature requests type: bug and removed untriaged type: support / not a bug (process) labels Mar 22, 2022
@meteorcloudy meteorcloudy added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. and removed team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website labels Mar 22, 2022
@phst
Copy link
Contributor Author

phst commented Apr 18, 2022

In addition to #15029 (comment), I think we should have some way to access the current repository name that assumes as little as possible from the programming language. For example, have a rule that generates a source file that contains the repository name as a constant (or function returning a constant). Such a thing should be possible in even the most basic programming languages.

We should also document the expected behavior of runfiles and runfile APIs so that rule sets for new languages can behave accordingly.

We should then also deprecate the old variants of rlocation (or even the entire legacy runfiles API), since with repository mapping the one-argument functions don't follow the "easy to use correctly" principle any more.

Rlocation(BAZEL_CURRENT_REPOSITORY, "repo/path/to/pkg/file")

Do you mean Rlocation(BAZEL_CURRENT_REPOSITORY, "path/to/pkg/file")?

A macro may even serve to make the change invisible to users.

This is mostly personal style, but I'd say that would be too much magic. I think it's OK for users to have to explicitly select a repository name.

it may be possible to extract the canonical name of the current repo from the path of the current source file at runtime

This assumes that the current source file is always $RUNFILES_DIR/REPO_NAME/PACKAGE/file.py, right? Is that assumption always correct?

@fmeum
Copy link
Collaborator

fmeum commented Apr 19, 2022

Rlocation(BAZEL_CURRENT_REPOSITORY, "repo/path/to/pkg/file")

Do you mean Rlocation(BAZEL_CURRENT_REPOSITORY, "path/to/pkg/file")?

When a target in repo my_repo wants to get the path to the file path/to/pkg/file in the repo other_repo, then the Rlocation would need to know both my_repo (to look up the proper repo mapping) and other_repo/path/to/pkg/file (to apply the repo mapping to). We could of course split out the repository name of the rlocation path into a third argument, but that may actually make things more confusing.

A macro may even serve to make the change invisible to users.

This is mostly personal style, but I'd say that would be too much magic. I think it's OK for users to have to explicitly select a repository name.

I personally also see this is as being too magical. It would probably only work in C++ anyway and even there I don't know whether it couldn't break in unexpected ways.

it may be possible to extract the canonical name of the current repo from the path of the current source file at runtime

This assumes that the current source file is always $RUNFILES_DIR/REPO_NAME/PACKAGE/file.py, right? Is that assumption always correct?

I know next to nothing about the Python rules. @meteorcloudy Can you say whether this is a safe assumption? In general, what do you think would be the most idiomatic way to get the current repo name in Python?

@meteorcloudy
Copy link
Member

Can you say whether this is a safe assumption?

I guess that's true in most cases, but not sure how it should work if you use source file from another repository. Should the current repo be the repo of the target's package or the repo of the source file? Also it may not be easy to parse out the repo name from the file path, how do you determine which segment is runfiles_dir, repo name and package path.

I'm thinking about a general solution. Please tell me what you think.

In @bazel_tools//runfiles/runfiles_lib.bzl, we provide a macro for language xx to define a runfiles library target:

def define_xx_runfiles_lib():
  genrule(
    name = "gen_current_repo_name_source",
    cmd = "generate a source file that exports the current repo name",
    output = ["current_repo_name.xx"]
  )
  
  xx_library(
    name = "xx_runfiles_lib",
    srcs = ["current_repo_name.xx", "runfiles_lib_with_repo_mappings.xx"],
    deps = ["@bazel_tools//tools/xx/repo_mappings", "@bazel_tools//tools/xx/runfiles"],
  )

When I need to access runfiles, I call the macro to define a xx_runfiles_lib in my repository.
For example at <workspace root>/BUILD

load("@bazel_tools//runfiles/runfiles_lib.bzl", "define_xx_runfiles_lib")
define_xx_runfiles_lib()

Then for the target that needs to access runfiles, it depends on this runfiles library that's aware of its own repo, eg. foo/bar/BUILD

xx_binary(
   name = "bin",
   srcs = [...],
   deps = ["//:xx_runfiles_lib"],
)

Currently, a genrule can export RULEDIR, I think it's easy to also export CURRENT_REPO (repo of the package where the genrule is defined) as a predefined variable.

The downside is that you need to define (and build) the runfiles library in each repo, but you only need to define it once for each repo (that needs runfiles access).

@fmeum
Copy link
Collaborator

fmeum commented Apr 20, 2022

@meteorcloudy I am worried about what exactly you would put in current_repo_name.xx. The constant containing the canonical repository name defined in this file AFAIU has to satisfy the following two incompatible requirements:

  1. Its name has to be static (e.g., not depend on either the mapped or the canonical name of the current repo) so that other source files in the repository can reference it regardless of what the name of the current repository is.

  2. It has to be globally unique so that there are no conflicts when a target transitively depends on multiple define_xx_runfiles_lib targets from different repositories.

Of course, the define_xx_runfiles_lib macro could make the name of the constant configurable, but then we would be back to seeing things like IO_BAZEL_RULES_GO_REPOSITORY_NAME in various places in the code.

@meteorcloudy
Copy link
Member

@fmeum I think you're right, it's really hard to make my proposal work.

As for getting repo name from source file path in python, I did some test with https://github.com/meteorcloudy/my_tests/tree/master/python_deps_test
The result of __file__ in each file are:

$ ./bin
/Users/pcloudy/workspace/my_tests/python_deps_test/lib.py
The number is 42
/private/var/tmp/_bazel_pcloudy/b824dd8c9deead95cea17bd1c46be46d/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/bin.runfiles/fib_repo/fib.py
Fib(5) == 8
/private/var/tmp/_bazel_pcloudy/b824dd8c9deead95cea17bd1c46be46d/execroot/__main__/bazel-out/darwin_arm64-fastbuild/bin/bin.runfiles/__main__/bin.py

As you can see the weird one is lib.py which comes from the workspace root instead of the runfiles tree. I guess somehow the python start up script added the workspace root to PYTHONPATH. Otherwise, it seems we can parse the repo name correctly.

@Wyverald
Copy link
Member

@tetromino had some thoughts on this issue (cc @brandjon):

For languages that have a __FILE__ macro of some form (C++ and Python do), have the runfiles library take in this path as an argument from the calling library. Then the runfiles library can use that to extract the source file's repo, consult the repo mapping database file that bazel generates, etc.

For the remaining languages that don't have this, create it yourself by inventing a new placeholder syntax (%__FILE__%) and having the build rule for that language perform template substitution.

This avoids magic, and in theory allows runfiles to be resolved with respect to callstack frames other than the innermost caller (not that that'll ever be needed).

The main downside is that it changes the signature of the runfiles method, meaning you either have to do a migration and fully eliminate the old way, or else you tolerate both ways indefinitely (which is dangerous because the old way remains a trap that the code author won't notice but their users will).

@fmeum
Copy link
Collaborator

fmeum commented May 11, 2022

This sounds like a reasonable approach. Just a few points:

  • We should clarify which repo we use as the base for looking up the repo mapping: the one that contains the source file or the one that contains the target that references the source file. Are there any non-pathological situations where this might matter? My approach mentioned above would realize the latter, this approach the former.
  • Java has StackWalker and Go has runtime.Caller, so either language could even look up repo mappings with the same API as offered today. I would count that as a clear advantage of the file-based over the target-based lookup.

@Wyverald
Copy link
Member

Are there any non-pathological situations where this might matter?

I really hope not o,o I think file-based technically makes a bit more sense, but again I really hope nobody is using a source file from another repo while the target includes runfiles and the source file refers to such runfiles...

@sluongng
Copy link
Contributor

May I formally request that changes to runfiles library to go through the proposal process in https://github.com/bazelbuild/proposals? It might help tremendously for future references looking back as well as lay a good foundation for future documentation effort. (there is no doc for runfiles lib today)

@fmeum
Copy link
Collaborator

fmeum commented Jun 11, 2022

I have thought more about robust ways to implement repo mapping aware runfile lookups in Java and arrived at a code generation approach borrowing ideas from rules_runfiles. Approaches based on StackWalker potentially wouldn't work after e.g. ProGuard ran. On the other hand, Go's runtime.Caller should always work.

With the prototype at fmeum@fdbc854, Java libraries that directly depend on the runfiles library get access to a compile-time constant CURRENT_REPO_NAME containing the canonical name of their repository. There is no classpath or .class file overhead (the constant is inlined) and no increase in the number of generated actions for java_library targets that don't directly depend on the runfiles library.

@comius It would be very helpful to know whether you would find modifications to the Java builtin rules along these lines acceptable if they should turn out to be necessary to realize a user-friendly API for runfiles lookups. Based on your feedback on this prototype, I would think about more independent alternatives if necessary.

@comius
Copy link
Contributor

comius commented Jul 27, 2022

I took a quick look at the

prototype at fmeum@fdbc854
(and this is a quick reply -I don't know all the requirements/caveats about the problem you're solving).

You probably shouldn't add a provider, new source and action building it to a java_library that depend on runfiles. If you do, it looks like you're handling problem at the wrong location. Why should one java_libraries be suddenly different from others.

BuildInfo is doing a very similar thing on java_binary. It's creating a .properties file with information about repository (it can add git's commit id and it can detect it's changed). It should be easily extended to include additional data, like repo name you need.

See src/main/java/com/google/devtools/build/lib/rules/java/JavaBuildInfoFactory.java. ATM native impl. is used. @buildbreaker2021 is working on implementing it via default API (ctx.version_file, ctx.info_file) in Starlark rules.

I think you should be able to solve the problem this way (but I didn't read all the comments on the ticket in detail).

The BuildInfo generated .properties file is added java_binary's deploy jar (or to its runfiles). I think there are some differences, when you depend on another java_binary from deps or from data attribute. Those differences might help you handle your cases correctly.

@fmeum
Copy link
Collaborator

fmeum commented Jul 27, 2022

@comius The provider itself isn't strictly necessary - it is used only to keep the repository mapping manifest small (less rebuilds because irrelevant repositories change their repo mappings) and to only emit the new actions in java_library when needed.

I looked at JavaBuildInfoFactory.java and although I only have a basic understanding of what it does, I think that it can't solve the problem that this proposal handles. To answer your question:

Why should one java_libraries be suddenly different from others.

In order to know the correct repository mapping context to apply, we essentially have to tie every Java source file that accesses runfiles to the repository containing the java_library target that compiled the source file. I think that this fundamentally can't be solved by adding resources: How would source files brought in by different java_library targets know that they should look into different resources (or different keys in the same resource)?

As far as I can tell, there are only two approaches to distinguish source files:

  1. Use language features to access debug symbols and get the source file parse that way. Possible in Java, but fragile since end users may choose to strip debug symbols.
  2. Generate code that differs per java_library target, but can always be referenced in the same way - this is the approach taken by my prototype.

@comius
Copy link
Contributor

comius commented Oct 27, 2022

Hey all. I've finally got some tactile knowledge about repo remappings and how to use them in runfiles library.
i recently discovered there's a small custom runfiles library in kotlin_rules.

So at the moment we add an additional repo_mappings file as described in fmeum's comment: #15029 (comment), option #2.

Basically we have two maps:

repo_mappings: (canonical current repo, mapped A) -> canonical_A
runfiles_manifest: canonical_A/relative_path -> absolute_path

The problem I'm seeing choosing #2 is that, it immediately and completely breaks every runfiles library.
Requiring to use annotations also completely breaks all Java users of runfiles library.

Migrating runfiles doesn't seem such a big problem, except that the format is public and there are locations where people parse the files themselves. I consider the "data" functionality quite an important one, and we probably shouldn't be migrating all such java libraries.

If I understand correctly #2 was chosen, because it correctly handles possible conflicts. And I think those conflicts are also very rare. It seems we'll be putting long hours to migrate everything for a rare conflict.

Could we do something that provides a more "gradual breakdown", allowing for a more gradual migration?

I was thinking about something that's close to #1, but not exactly it.

Option #1+:
With --enable_bzlmod and when there are no conflicts, have runfiles_manifest exactly the same.
For all conflicting entries add them to runfiles_manifest with a canonical name of the repo where they are used, for example:

runfiles_manifest:
  mapped_A/unique_relative_path -> absolute_path_1
  canonical_current_repo_A/mapped_A/conflicting_relative_path -> absolute_path_1
  canonical_current_repo_B/mapped_A/conflicting_relative_path -> absolute_path_2

Old runfiles lib will work on a non-conflicting case and break on the conflicts. Monorepos will work as before, without any regression.

The change should be easy to implement here:

if (legacyExternalRunfiles) {
checker.put(manifest, getLegacyExternalPath(path), entry.getValue());
}
// Always add the non-legacy .runfiles/repo/whatever path.
checker.put(manifest, getExternalPath(path), entry.getValue());

We could slightly improve error handling by adding also something like:
mapped_A/conflicting_relative_path -> an_error_message. Which will probably result in "File not found: an_error_message".

My take on Java:

cc @cushon
Since I think "data" functionality is important, it should be more seamless to the users of Bazel.
I suspect that each Java library needs to be/become aware of the global name of the repository it lives in. I think the right location for this is META-INFO - and I think Bazel java builder can be relatively easily extended to add such information there.

What's more tricky is, that runfiles library, should obtain its callee's META-INFO. I think this might be possible in Java, slightly magic though. But it provides the right user experience - no java library using data needs to be modified.

@fmeum
Copy link
Collaborator

fmeum commented Oct 27, 2022

The problem I'm seeing choosing #2 is that, it immediately and completely breaks every runfiles library.

In fact, many runfiles libraries out there are already broken in one way or another. In the best case, they just don't work in some situations (on Windows, on Linux without a runfiles directory, ...), in the worst case they non-hermetically pick up runfiles from the output base. I do see value in having these converge towards a single, reliable implementation per language.

Monorepos will work as before, without any regression.

Monorepos using the official runfiles libraries will not have to make any changes as the main repo will be the default "current repo" context without having to pass anything manually.
Monorepos rolling their own runfiles library would have to extend it to be able to read the repo_mapping file or use the official runfiles libraries under the hood, but shouldn't have to perform any global changes across the codebase.

With --enable_bzlmod and when there are no conflicts, have runfiles_manifest exactly the same.

The problem I see is that it wouldn't be "the same", but more of it: If 10 Bazel modules depend on the protobuf module by 10 different names, the runfiles manifest and runfiles directory would need to include 10x as many entries for protobuf runfiles. This can be implemented, even as a fallback for the current design, but it won't come for free, especially with sandboxed execution and will affect everyone's build performance, regardless of whether they actively use runfiles lookups or not. Maybe we could think about offering it via a flag?
@lberki @Wyverald What do you think?

There is also the problem that most runfiles libraries today don't check whether the target file exists when using the runfiles directory as a source. Checking whether a given path conflicts or not would require introducing an existence check, which may also break existing users.

What's more tricky is, that runfiles library, should obtain its callee's META-INFO. I think this might be possible in Java, slightly magic though.

With the standard multi-jar model, this is possible: You can use StackWalker (Java 9+) or sun.reflect.Reflection.getCallerClass() (Java 8 only) to get the Class, then grab the code location from that and inspect the corresponding jar for its META-INF. But I think that this would break down with deploy jars or many other kinds of jar postprocessing and I haven't been able to think of magic that would make this case work - if it's possible, that would be great!

@comius
Copy link
Contributor

comius commented Oct 27, 2022

If 10 Bazel modules depend on the protobuf module by 10 different names, the runfiles manifest and runfiles directory would need to include 10x as many entries for protobuf runfiles.

I agree that's a "theoretical" worst case. Not also that you only get a conflict when both repo name and path matches and point to a different file. Do you have data what would happen in large bazel repos? Tensorflow would be an example, but it's not using bzlmod. I suspect maven dependencies might create conflicts or perhaps not.

Note also you don't need existence check in the case you use runfiles_manifest. The new library should first check for the key with canonical_current_repo_A/mapped_A/ and if it doesn't exist in the mapping for mapped_A.

that this would break down with deploy jars

Argh, you're right. Back to the drawing board. Annotations sounds like the best next option for now. We can't add a file next to each .java file with import Runfiles in it.

@fmeum
Copy link
Collaborator

fmeum commented Oct 27, 2022

I agree that's a "theoretical" worst case.

Even the not so bad average case of having every module depended on via its module name and the legacy repo name ("protobuf" vs "com_google_protobuf") would already double the number of external symlinks with sandboxed execution. Given that creating these symlinks is already a performance concern right now, a factor of 2 may actually be pretty bad.

Do you have data what would happen in large bazel repos? Tensorflow would be an example, but it's not using bzlmod.

That is actually a major problem: I don't know of anybody who already uses Bzlmod at any kind of non-trivial scale, yet we have to design its approach to runfiles discovery or else nobody will be able to use them. Not a great position to be in and very easy to make mistakes that will be hard to change when it has matured.

That is why, when in doubt, I prefer having users migrate on to some clearly defined API that we can change the implementation of in future versions of Bazel without breaking workflows again. If we continue to support runfiles lookups that don't go through the official runfiles libraries and even make the layout of the runfiles directory more complex, I'm worried that we might get ourselves into an even worse situation.

We should also keep in mind that migrating to Bzlmod isn't straightforward and definitely requires quite a bit of migration work, so having to migrate runfiles may not be the most time-consuming part.

Note also you don't need existence check in the case you use runfiles_manifest. The new library should first check for the key with canonical_current_repo_A/mapped_A/ and if it doesn't exist in the mapping for mapped_A.

Yes, that should work well. But in situations where the manifest isn't available, e.g. with remote execution, we would still need a strategy that works with just the runfiles directory. It's possible to do, but may again end up breaking someone (although most likely only a minority of users).

@comius
Copy link
Contributor

comius commented Oct 27, 2022

("protobuf" vs "com_google_protobuf") would already double the number of external symlinks with sandboxed execution.

The number of symlinks created by protobuf would double, right. Can we count how many there are on tensorflow pre--bzlmod perhaps? My guess is it's a small number.

You can improve for almost a factor of 2 by flipping --legacy_external_runfiles on the whole number of symlinks.

creating these symlinks is already a performance concern right now
Is there a link about this issue?

I prefer having users migrate on to some clearly defined API.

I agree. I think a more backwards compatible API I'm proposing is also clearly defined. And we get a longer and more subtle migration period. This means less unsatisfied/broken users.

If we continue to support runfiles lookups that don't go through the official runfiles libraries.

We can't really hide manifests or the implementation from the users. So there will always be un-official versions, even after currently envisioned migration.

having to migrate runfiles may not be the most time-consuming part

In my case with rules_kotlin, this seems to be the most time-consuming part + version skew problems. I need new Bazel release and I need to wait rules_kotlin release and then I need to wait protobuf release (which depends on rules_kotlin nowadays). Having backwards compatible runfiles, it would just work with current releases.

@fmeum
Copy link
Collaborator

fmeum commented Oct 27, 2022

The number of symlinks created by protobuf would double, right. Can we count how many there are on tensorflow pre--bzlmod perhaps? My guess is it's a small number.

That's a good idea, we'll see whether we can get some numbers and extrapolate from there.

creating these symlinks is already a performance concern right now

Is there a link about this issue?

It was just my perception given the investment in sandboxfs and the rumor I picked up that Google doesn't use sandboxed execution internally due to performance concerns. I certainly know less about this than you though :-)

I prefer having users migrate on to some clearly defined API.

I agree. I think a more backwards compatible API I'm proposing is also clearly defined. And we get a longer and more subtle migration period. This means less unsatisfied/broken users.

Sorry, I didn't mean to say that your proposal doesn't offer a clearly defined API! I'm just a bit worried about the API surface it introduces. "Put this kind of string and your repo into Rlocation and get an absolute path", while blunt, is something that allows for a lot of freedom for future development, including implementing the scheme you propose as a backend to ease migration. Starting with the scheme and having people actively depend on it during their Bzlmod migration would seem to be locking in more of the details of the runfiles tree.

If we continue to support runfiles lookups that don't go through the official runfiles libraries.

We can't really hide manifests or the implementation from the users. So there will always be un-official versions, even after currently envisioned migration.

Yes, I think the only way to have more convergence is to actually make using the runfiles libraries more beneficial for end users. @laszlocsomor had some ideas about this and @phst already implemented some of them for Go: For languages that support them (Go, Java), we could offer a file system view on the runfiles that offers the same functionality regardless of whether its backed by a manifest or a directory. This would allow users to perform operations such as listing directory contents and traversing the runfiles tree. I will look into this once the Bazel 6 release craze is over.

having to migrate runfiles may not be the most time-consuming part

In my case with rules_kotlin, this seems to be the most time-consuming part + version skew problems. I need new Bazel release and I need to wait rules_kotlin release and then I need to wait protobuf release (which depends on rules_kotlin nowadays). Having backwards compatible runfiles, it would just work with current releases.

Realistically speaking, with Bzlmod, you will have to wait for a good chunk of your transitive dependencies migrating anyway, simply because it requires much higher label "hygiene" than has been necessary so far. I agree that writing a new runfiles library for a ruleset is quite a bit of work, but it needs to be done only once and only by ruleset authors, not by end users. I am also working on a Starlark reference implementation to help third-party ruleset authors implement them.

@lberki
Copy link
Contributor

lberki commented Oct 27, 2022

@fmeum the rumor is half true: Google doesn't use sandboxed execution, but not due to performance concerns, but because most of our builds run remotely and thus a sandbox is not necessary. Or maybe you have better information than I do :)

@Wyverald has thought a lot about how runfiles could work sensibly with bzlmod, but he hasn't found any solution that is:

  • backwards compatible
  • doesn't require too much magic from the language
  • has decent performance

In fact, fulfilling just two of these tree is difficult, so I think his decision to go with the simplest solution that can be called "good enough" is sound.

Losing backwards compatibility this way is not great, but given that there are cases when the runfiles symlink tree does not even exist (--nobuild_runfile_links and Windows) I think requiring a runfiles library will eventually be inevitable, so we can just as well make that leap now and piggyback on the incompatible change called bzlmod. It also has the advantage that if someone very smart comes up with a better way to organize the symlink tree, it can be seamlessly changed as long as the runfiles library is used to access it.

(The astute observer will note that a runfiles tree is pretty much necessary for interpreted languages, e.g. Python. The game plan for that in turn is to make the runfiles tree not necessary. Somehow.)

@comius
Copy link
Contributor

comius commented Oct 27, 2022

@lberki
I think the proposal I've put forward is simpler, backwards compatible, has a decent performance in practice and requires the same amount of magic as the proposal that is being pursued at the moment.

Now, should we drop it without fairly evaluating it?

@meteorcloudy
Copy link
Member

I agree that's a "theoretical" worst case. Not also that you only get a conflict when both repo name and path matches and point to a different file. Do you have data what would happen in large bazel repos? Tensorflow would be an example, but it's not using bzlmod.

@comius
I do think your proposal will have significantly performance impact for TensorFlow because the python tests of tensorflow often have a lot of runfiles (all transitive python sources for the test will end up in the runfiles tree), creating the runfiles tree is especially slow on macOS and Windows. I believe another script language rules_nodejs could also depend on a large runfiles symlink tree. So duplicating the entire runfiles symlink tree is probably not a good idea.

@fmeum
Copy link
Collaborator

fmeum commented Oct 27, 2022

I don't actually see the two proposals as being exclusive and if we are willing to do the work, we could implement and benefit from both:

If we get the order of lookups in the Bazel-provided runfiles libraries right, they should remain drop-in compatible with a hypothetical --bzlmod_compatibility_runfiles flag that expands the runfiles tree in the way @comius proposes. Since the flag would be controlled by end users, rulesets would still be incentivized to migrate to the official runfiles libraries, ensuring optimal (future) efficiency and cross-platform support. At the same time, end users with complex projects could use --bzlmod_compatibility_runfiles as an escape hatch to keep their existing runfiles lookup routines functioning as well as possible while consciously accepting the performance price.

(Now that I am reading through this thread again, this is actually quite similar to #15029 (comment))

Given that this functionality would be mostly relevant for end users rather than rulesets (@comius, please correct me here if I should be misrepresenting your proposal), it wouldn't be an issue if we would introduce this flag only in 6.1.0, giving us more time to think this through and implement it.

@comius What do you think, could that be a reasonable path forward? If you are interested, we could schedule a short call and go over how this could be implemented and whether there are any unforeseen issues.

@lberki
Copy link
Contributor

lberki commented Oct 28, 2022

@comius sorry, I let my enthusiasm for cleaning up legacy behavior shine through.

I think that it has merit, modulo the performance concerns voiced by @meteorcloudy and the fact that then the there would be two ways the runfiles library would have to check potentially two entries to find a particular runfile.

An argument against --bzlmod_compatibility_runfiles is that we already have a number of flags controlling runfiles (--build_runfile_manifests, --build_runfile_links, --enable_runfiles, --legacy_external_runfiles, experimental_skip_runfile_manifests) and I don't think anyone has ever thought through all these combinations and then we'd be adding at least two more (--enable_bzlmod and --bzlmod_compatibility_runfiles) to the mix. It's not an absolute counterargument, but at least let's think through the interaction between all these and the proposed new ones.

Since then, another issue surfaced ( #16379 ) which will probably require another change to the runfiles infrastructure.

I see the following possible paths ahead:

  1. Incrementally try to fix each bug in the runfiles system (for example, the current one with @comius 's proposal if the performance concerns are not prohibitive) and document which set of flags one should use in which case (no bzlmod, bzlmod + performance does not matter, bzlmod + performance matters cross product Windows or not Windows) and ship enough of these flags in Bazel 6.0 for bzlmod to be viable, which in practice means a fix for this bug in particular, but nothing more
  2. Design a nice and new runfiles system that fulfills every requirement and deprecate most of the current menagerie of runfiles flags. This will probably be a very incompatible change and trying to ship it in Bazel 6.0 would not go well given the little time available we have before the release
  3. Fix this particular bug so that Bazel 6.0 can be shipped, then go on with (2). This would mean two changes in runfiles behavior (the one the fix to this bug will cause and the nice all-encompassing solution we'll eventually have) but it would probably be the least amount of disruption. If we go with this approach, we should choose the least disruptive approach for fixing this bug so as to minimize overall churn.

WDYT?

@fmeum
Copy link
Collaborator

fmeum commented Oct 28, 2022

I may be missing an interaction, but doesn't the set of flags you mentioned neatly divide into two parts, one of which we can mostly ignore?

  1. Flags controlling in which form the collection of runfiles is represented during execution and in the output tree, but not affecting the collection itself: --build_runfile_manifests, --build_runfile_links, --enable_runfiles, experimental_skip_runfile_manifests
  2. Flags controlling the collection of runfiles, but not how they are represented during execution: --legacy_external_runfiles, also --bzlmod_compatibility_runfiles if we choose to add it

--enable_bzlmod doesn't belong to either category as it has the side effect of emitting the repo mapping manifest, but neither affects the collection of runfiles nor the way they are represented during execution.

For the purpose of resolving this issue and #16379, we should be able to ignore all the flags in group 1 as long as we ensure that everything we come up works both with the directory and the manifest representation of runfiles. That would certainly reduce the cognitive overhead while thinking about solutions.

@lberki
Copy link
Contributor

lberki commented Oct 28, 2022

I am on board with ignoring flags in group (1) if we can manage to simplify our collective lives, but I don't know if the are orthogonal enough with the new repo mapping file to be able to do so. @Wyverald ? (if he doesn't chime in, let's ignore them because I think it's likely that we can do so(

re: --enable_bzlmod you are right in the sense that it doesn't change the way the runfiles tree and the manifest are generated per se, but it still affects what's actually in the runfiles tree. Otherwise this bug wouldn't exist, would it?

@fmeum
Copy link
Collaborator

fmeum commented Oct 28, 2022

If we settled on "nobody uses repository mappings with WORKSPACE", then yes, that would be true, but note that @phst's example that started this issue specifically doesn't use --enable_bzlmod.

Regarding #16379: Given that Bazel's configuration identifiers probably aren't something we want users to hardcode (which is the opposite of the situation for apparent repository names), I don't think #16379 can be solved without asking users to go through a helper such as $(rootpath), $(rlocationpath) or File.rlocation_path. If that is indeed true, then the way we solve it would be largely transparent to users and wouldn't really have to influence the decisions we make to solve the current issue.

@lberki
Copy link
Contributor

lberki commented Oct 28, 2022

Indeed! I was assuming @phst was using bzlmod and it's just that he formulated the bug with repo mappings to be minimal, but maybe not.

@Wyverald
Copy link
Member

I am on board with ignoring flags in group (1) if we can manage to simplify our collective lives, but I don't know if the are orthogonal enough with the new repo mapping file to be able to do so.

Pretty sure we can ignore those, as they are exercised by tests and the change to add the repo mapping manifest still passed those tests.

[this issue being about repo mappings, not exactly --enable_bzlmod]

While that's true, we've essentially treated all repo mappings bugs so far as "bzlmod bugs", if nothing else then because repo mappings basically did not work for anything past "translating labels in BUILD files". Pretty much any other usage you can think of revolving repo names, it was broken. So I'm okay with things staying broken but only being fixed by turning on --enable_bzlmod, if it makes our lives easier.

[@fmeum's idea about --bzlmod_compatibility_runfiles in 6.1.0]

I like this idea very much!


My take on this whole discussion remains that we should go for an eventually good API first, and worry about backwards compatibility second. Two main reasons: 1) like @lberki said --enable_bzlmod is a backwards compatibility breaker anyway, so this is a good opportunity to imagine what we should've designed without all the mistakes in the past; and 2) the backwards-compatible approach with runfiles is somewhat doomed anyway because it doesn't work on Windows.

@phst
Copy link
Contributor Author

phst commented Oct 30, 2022

Indeed! I was assuming @phst was using bzlmod and it's just that he formulated the bug with repo mappings to be minimal, but maybe not.

Your assumption is correct :)

@comius
Copy link
Contributor

comius commented Nov 3, 2022

I tested a simple implementation of my proposal in #15029 (comment), but it just doesn't work. The problem are functions Artifact.getRunfilesPath and Label.getWorkspaceName, which only return one path and can't do a fallback.

It looks to me that backwards-compatible approaches are now fully doomed.

@lberki
Copy link
Contributor

lberki commented Nov 4, 2022

@comius can you elaborate why it doesn't work?. I don't understand your terse sentence "only return one path and can't do a fallback"; AFAICT when the runfiles input manifest is written, we have both the repository mappings and the runfiles artifacts, which is all the data one needs to product a runfiles manifest, no matter what format one chooses.

(I still think that the approach you linked above wouldn't work, but for a different reason as discussed above, but let's explore the problem space before deciding)

@comius
Copy link
Contributor

comius commented Nov 4, 2022

@comius can you elaborate why it doesn't work?. I don't understand your terse sentence "only return one path and can't do a fallback"; AFAICT when the runfiles input manifest is written, we have both the repository mappings and the runfiles artifacts, which is all the data one needs to product a runfiles manifest, no matter what format one chooses.

The idea was to produce duplicated entries only on collision. The writing of runfiles manifest like this works. All the data is there.

The problem is that for example Artifact.getRunfilesPath is the way how Artifact expands on a command line. At the time it's called there is no information about conflicts and it can expand only into a single string. You get to choose to expand into a "no collision" path, which will break when there is a collision. Or you can expand into a "collision" path, which will break when there's no collision.

And you can't obtain the data about collisions, because it's only available at a particular binary.

@lberki
Copy link
Contributor

lberki commented Nov 4, 2022

Ah, got it. So it's not that writing the runfiles manifest doesn't work, it's that Bazel assumes that the runfiles path of an artifact is constant, which would not be under your proposal.

I could imagine having symlinks to an artifact both at its canonical location (which would be constant) and its "legacy" one (similarly to what @fmeum recommends in #15029 (comment)), but at that point, we are back to the question how many extra symlinks is acceptable.

@Wyverald
Copy link
Member

This has basically been fixed in 6.0 if you use the various runfiles libraries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-Bzlmod Bzlmod-specific PRs, issues, and feature requests P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug
Projects
None yet
Development

No branches or pull requests

8 participants