Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow http_archive directly in MODULE.bazel #17141

Closed
keith opened this issue Jan 4, 2023 · 25 comments
Closed

Allow http_archive directly in MODULE.bazel #17141

keith opened this issue Jan 4, 2023 · 25 comments
Labels
area-Bzlmod Bzlmod-specific PRs, issues, and feature requests P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: feature request

Comments

@keith
Copy link
Member

keith commented Jan 4, 2023

When attempting to build our codebase with bzlmod, one pattern I noticed came up a lot was one-off http_archives that download binaries or other source repos that don't need to be part of the bzlmod graph because they're either too simple, or one offs etc. In this case I often found myself creating a dead simple module extension that called the macro that we were previously calling from our WORKSPACE, which works fine. The 2 downsides I see to this are there's some unnecessary feeling overhead, and you have to mirror the names of the workspaces in the http_archive name, as well as in the use_repo call, and you have to keep those in sync as you add things.

I think one solution to this would be to allow http_archives in the MODULE.bazel, but that might have some other major downsides as well.

@ShreeM01 ShreeM01 added type: bug team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. untriaged area-Bzlmod Bzlmod-specific PRs, issues, and feature requests labels Jan 5, 2023
@fmeum
Copy link
Collaborator

fmeum commented Jan 5, 2023

I have a suggestion that could potentially handle both single http_archives and macros with many plain old repo rules with little overhead compared to WORKSPACE:

  1. When resolving a use_extension call to a module extension defined in a bzl file, in addition to exported module_extension objects, also match exported top-level defs without any parameters (maybe allow parameters with default values). Treat def my_macro() just like my_macro = module_extension(implementation = ...), i.e. a module extension without any tag classes. Use the function docstring to populate the doc parameter of module_extension. In addition, internally label this module extension as static.
  2. When use_repo is called without extra positional or keyword arguments, internally label the module extension usage as import-all. It is an error to call use_repo both with and without extra arguments on proxies for the same module extension.
  3. When computing the repository mapping of a module, for every module extension usage labeled as import-all, first check whether the module extension is static. If it isn't, fail with an error. If it is, eagerly evaluate it and add all repos generated by it to the repository mapping.

The important property of a static module extension is that since it doesn't have access to module_ctx, it can't depend on other module extensions or cause Skyframe restarts. This ensure that there is a consistent evaluation order (#17048 (comment)) and makes it so that eagerly evaluating the extension shouldn't be costly.

Assuming this can be made to work, using it would look as follows:

# lib.bzl

def my_macro():
    http_archive(name = "foo", ...)
    http_archive(name = "bar", ...)
    ...

# MODULE.bazel

deps = use_extension("//:lib.bzl", "my_macro")
use_repo(deps)
# or even
use_repo(use_extension("//:lib.bzl", "my_macro"))

It would allow for direct reuse of existing WORKSPACE macros. While direct invocations of repository rules such as http_archive in MODULE.bazel still wouldn't be possible, the only overhead involved would be to wrap them in a macro, which is familiar to users based on the current WORKSPACE setting.

If there are concerns about implicit import-all breaking builds due to changes in transitive dependencies, the following additional check could be added to 3.:

3*. If a module extension usage is labeled as import-all, check whether the module containing the MODULE.bazel file that is currently being evaluated is equal to the module that contains the module extension. If it isn't, fail.

@Wyverald Do you think that this could work with the current evaluation model?

@Wyverald
Copy link
Member

Wyverald commented Jan 5, 2023

The important property of a static module extension is that since it doesn't have access to module_ctx, it can't depend on other module extensions or cause Skyframe restarts.

I don't think this part is true -- depending on other module extensions is as simple as loading from a repo generated by another module extension. Granted, it's unlikely to happen, but we still need to be aware of the possibility and deal with it accordingly.

Compared to the duplication of repo names in use_repo, I'm less concerned about the overhead of adding a simple ext = module_extension(lambda ctx: macro()) line. The alternative of calling an exported function directly feels a bit magical to me (especially given that it doesn't theoretically prevent the macro from trigger an extension eval).


Thinking about it a bit more, any form of use_(all_)repo(s) will need to consider the following points:

  1. Computing the main repo mapping, which now happens before any analysis is performed, would potentially involve fetching source repos (unless 3* applies). This could be a performance bottleneck for cold builds and/or builds where internet is slow.
  2. We need to work out how conflicts should be resolved (what if two use_all_repos extensions generate a repo with the same name?). The answer could be to allow only one use_all_repos in one MODULE.bazel file (icky), or somehow let use_all_repos specify a unique prefix that gets prepended to all repos generated by the extension.

I'm still pondering whether bare http_archives could be allowed in MODULE.bazel files. It circumvents the complications around use_all_repos, but does not really fit what's currently in MODULE.bazel, and creates a second way of doing the same thing. And the logical next question would be "what about http_file, git_repository, etc..., and before long we're allowing arbitrary repos in MODULE.bazel files. And maybe that's fine, even??? More thinking time required :)

@fmeum
Copy link
Collaborator

fmeum commented Jan 6, 2023

I don't think this part is true -- depending on other module extensions is as simple as loading from a repo generated by another module extension. Granted, it's unlikely to happen, but we still need to be aware of the possibility and deal with it accordingly.

That's a great point, I didn't think of loads being able to do that at all.

Computing the main repo mapping, which now happens before any analysis is performed, would potentially involve fetching source repos (unless 3* applies). This could be a performance bottleneck for cold builds and/or builds where internet is slow.

Wouldn't that apply even with 3*? If module A uses its own extension, the .bzl file containing that extension will probably load repo rules from other modules, even if it is a very simple extension that doesn't use module_ctx. With use_all_repos, these loads would have to be evaluated before the repo mapping can be computed.

What I didn't really understand before I thought about this in more detail: the root cause of the performance problems seems to be that loads are untyped. Without knowing that the identifier my_repo_rule we load from another file is actually just a repo rule (and not e.g. a macro), we can't evaluate the current file at all without eagerly resolving that load. From this point of view, use_extension is a kind of typed load (promised to resolve to either an extension or an error) and thus can be evaluated lazily. I only just realized that this is why use_extension works so well and load doesn't.

This means that any use_all_repos approach that doesn't delegate to external tooling and thus requires eager evaluation of the extension will have to balance two competing factors:

  1. The extension needs to load its repo rules from somewhere. It will at least need access to other module repos to do that, but these repos can transitively load from module extensions.
  2. It's transitive loads should not trigger costly evaluations, which includes downloads.

I don't think these can be balanced reasonably at all since already direct loads from module repos trigger source archive downloads.

Compared to the duplication of repo names in use_repo, I'm less concerned about the overhead of adding a simple ext = module_extension(lambda ctx: macro()) line. The alternative of calling an exported function directly feels a bit magical to me (especially given that it doesn't theoretically prevent the macro from trigger an extension eval).

I just thought it would help users recognize the WORKSPACE idioms they are used to more easily in Bzlmod land. But I agree that part isn't too important and we can forget about it.

We need to work out how conflicts should be resolved (what if two use_all_repos extensions generate a repo with the same name?). The answer could be to allow only one use_all_repos in one MODULE.bazel file (icky), or somehow let use_all_repos specify a unique prefix that gets prepended to all repos generated by the extension.

For any kind of use_all_repos concept, assuming it is limited to extensions defined by the current module, I think we should just assume 3* and fail the build if there are collisions. It's bad if your direct dependencies or even your transitive dependencies break your build via repo name collisions, but if you explicitly use_all_repos two extensions defined in your own module and they both define the same repo, I would say it's on the user to clean this up.

I'm still pondering whether bare http_archives could be allowed in MODULE.bazel files. It circumvents the complications around use_all_repos, but does not really fit what's currently in MODULE.bazel, and creates a second way of doing the same thing. And the logical next question would be "what about http_file, git_repository, etc..., and before long we're allowing arbitrary repos in MODULE.bazel files. And maybe that's fine, even??? More thinking time required :)

Following the "typed loads are fine, untyped loads are bad" philosophy from above, couldn't we allow something like this in MODULE.bazel?

http_archive = use_repo_rule("@bazel_tools//tools/build_defs/repo:http.bzl", "http_file")
http_archive(
    name = "my_data",
    url = "https://company.com/data.tar.gz"
)

It should behave equivalently to

# extensions.bzl
def one_of_extension():
    http_archive(
        name = "my_data",
        url = "https://company.com/data.tar.gz"
    )

one_of_extension = module_extension(lambda ctx: one_of_extension())

# MODULE.bazel
one_of_extension = use_extension("//:extensions.bzl", "one_of_extension")
use_repo(one_of_extension, "my_data")

Since we know that use_repo_rule resolves to a repo rule, we can load it lazily and generate an appropriate canonical repository name (e.g. have it be treated as a repo defined by a single synthetic extension per module, which would error on name collisions). Since all repo rules specify their name, we do not need an explicit use_repo call.

If we tell users that use_repo_rule is to load single things and use_extension is to load multiple things using actual dependency resolution logic, even though one could load single things in two ways I don't think there would be much potential for confusion.

@Wyverald Wyverald added P2 We'll consider working on this in future. (Assignee optional) type: feature request and removed untriaged type: bug labels Jan 17, 2023
@fmeum
Copy link
Collaborator

fmeum commented Jan 24, 2023

Backed by experience with this approach in the context of rules_go and after a discussion with @Wyverald, I believe that the following API for http_archive and http_file can be realized with Bzlmod:

# MODULE.bazel
bazel_dep(name = "http", version = "1.0.0")
http.archive(name = "my_archive", url = "https://...", sha256 = "...")
http.file(name = "my_file", url = "https://...", sha256 = "...")
...
use_repo(http, "http")

# BUILD.bazel
load("@http//:def.bzl", "http")

java_binary(
    name = "main",
    ...
    data = [
        http.file("my_file"),
        http.archive("my_archive", "//path/to/pkg:target"),
    ],
    ....
)

The http macro takes care of preventing name collisions between Bazel modules and also enforces strict visibility: A module can only access the repos it declared. There is also only ever a single repo ("http") that needs to be listed in use_repo.

@keith @alexeagle @kormide What do you think about this? Do you consider it more usable than the handcrafted one-of module extension?

@kormide
Copy link

kormide commented Jan 24, 2023

@keith @alexeagle @kormide What do you think about this? Do you consider it more usable than the handcrafted one-of module extension?

This looks good to me. So then http is some sort of alias repo that gets created after processing all of the file/archive tag classes in the module extension?

It may need an optional argument in case they rename the repo in use_repo, right?

@fmeum
Copy link
Collaborator

fmeum commented Jan 24, 2023

This looks good to me. So then http is some sort of alias repo that gets created after processing all of the file/archive tag classes in the module extension?

The technique would be the same as for the go macro: The canonical repository name is resolved dynamically by using the Label constructor in the hub repo.

It may need an optional argument in case they rename the repo in use_repo, right?

Which repo? If users do use_repo(my_http = "http"), that just changes where they load the http macro from. The actual repos created by the extension are an implementation detail that users don't have to know.

@github-actions
Copy link

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 30 days. It will be closed in the next 7 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler".

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label Feb 24, 2023
@fmeum
Copy link
Collaborator

fmeum commented Feb 24, 2023

@bazelbuild/triage Not stale

@sgowroji sgowroji removed the stale Issues or PRs that are stale (no activity for 30 days) label Feb 24, 2023
@fmeum
Copy link
Collaborator

fmeum commented Apr 19, 2023

With 6a47481 merged, the following helper function should simplify the job of porting a WORKSPACE macro to Bzlmod by turning it into a module extension and using module_ctx.extension_metadata to let Bazel generate fixup commands for out-of-date use_repo calls. @Wyverald and I are looking for an accessible home for such a function.

This snippet currently requires using last_green, but the features will be in 6.2.0. The fixup commands require a version of buildozer including bazelbuild/buildtools@a3f457d.

def wrap_macro(macro, *, dev_dependency = False):
    def impl(module_ctx):
        macro()
        return module_ctx.extension_metadata(
            root_module_direct_deps = [] if dev_dependency else "all",
            root_module_direct_dev_deps = "all" if dev_dependency else [],
        )
    return module_extension(impl)

@fmeum
Copy link
Collaborator

fmeum commented Jun 27, 2023

Now that use_extension(..., isolate = True) exists, a reusable http extension with use_repo fixups could actually be implemented.

Starting from @SalmaSamy's #16000, one would need to:

  1. Add a check enforcing module_ctx.is_isolated and telling users to use isolate = True in an error message if they aren't. This ensures that different usages won't collide even if they use the same repo names.
  2. Return module_ctx.extension_metadata (see comment above) so that Bazel emits fixup commands for use_repo.

If anybody wants to give this a try, please let me know and I can offer support and a (pre-)review.

@Wyverald
Copy link
Member

Wyverald commented Jul 5, 2023

Fabian and I revisited this over chat. It appears to me that there's no major argument against something like use_repo_rule in MODULE.bazel that allows invocation of arbitrary repo rules. The semantics would be akin to having a one-off, WORKSPACE-macro-esque extension, except that the use_repo calls are automatically generated. Example:

# MODULE.bazel
http_archive = use_repo_rule("@bazel_tools//:http.bzl", "http_archive")
http_archive(name="foo", ...)
git_repository = use_repo_rule("@bazel_tools//:git.bzl", "git_repository")
git_repository(name="bar", ...)

This would be equivalent to today's:

# MODULE.bazel
one_off_ext = use_extension(":one_off.bzl", "one_off_ext")
use_repo(one_off_ext, "foo", "bar")

# one_off.bzl
load("@bazel_tools//:http.bzl", "http_archive")
load("@bazel_tools//:git.bzl", "git_repository")
def _one_off_ext_impl(ctx):
  http_archive(name="foo", ...)
  git_repository(name="bar", ...)
one_off_ext = module_extension(_one_off_ext_impl)

Off the top of my head, I can think of some (weak) arguments against this:

  1. it adds another way to do the same thing (counterargument: it's much more ergonomic)
  2. it adds to the confusion around the difference between modules and repos (counterargument: ehhh...???)
  3. it creates more potential for people to ask for loads in MODULE.bazel for repo rule arguments (counterargument: by that point, you can just move the complex logic into an extension)

So I'm inclined to just say "yes" to use_repo_rule.

@Wyverald
Copy link
Member

One thing I realized while implementing this: we need to be rather strict that the thing referred to by use_repo_rule is a repository_rule object, not a random callable -- which means that it cannot be a macro wrapper around a repo rule. This means that some popular repo rules might not be usable in this manner.

@fmeum
Copy link
Collaborator

fmeum commented Sep 23, 2023

One thing I realized while implementing this: we need to be rather strict that the thing referred to by use_repo_rule is a repository_rule object, not a random callable -- which means that it cannot be a macro wrapper around a repo rule.

Ignoring for the moment whether we should support this, what is the technical reason for this restriction? Couldn't we put arbitrary restrictions on the fake tag usage of the repo rule (e.g. enforce attr types only), let it generate arbitrarily many repos for which we then validate the final attribute values, turning them into RepoSpecs, and then only bring the one with the expected name into scope?

@Wyverald
Copy link
Member

Hmm... we could indeed do the thing you said. What I'm doing right now is just the straightforward way -- name a repo rule, call it, and we'll create a repo for you. If the thing can be a macro, then we'd need to additionally worry about name conflicts within a single call, and whether it generates a repo with the expected name, etc. That does indeed sound doable, just a bit more complicated.

I'm thinking that the majority use case of this would be http_archive, which doesn't have a macro wrapper anyway. Are there other expected usages of use_repo_rule?

Wyverald added a commit that referenced this issue Sep 25, 2023
This CL introduces a new directive in the MDOULE.bazel file, `use_repo_rule`, which is a convenient way to declare repos that are only visible to the current module. See #17141 (comment) for example usage.

Under the hood, this is implemented as an "innate" module extension for each module that uses `use_repo_rule`. The ID of this extension is a fake one: with the bzl file label of `@@name~version//:MODULE.bazel` and name of `__innate`. Each tag of this extension corresponds to a repo. The name of the tag is the string `${repo_rule_bzl_label}%${repo_rule_name}`, and the attributes of the tag are the attributes of the repo.

Fixes #17141.

PiperOrigin-RevId: 567722226
Change-Id: I16b8dc562dd20a676118867d831d3b2f5631f767
@fmeum
Copy link
Collaborator

fmeum commented Sep 26, 2023

I'm thinking that the majority use case of this would be http_archive, which doesn't have a macro wrapper anyway. Are there other expected usages of use_repo_rule?

Not that I am aware of, I think that supporting pure repo rules only is very reasonable - supporting http_archive, http_file and git_repository should be good enough.

Wyverald added a commit that referenced this issue Sep 27, 2023
This CL introduces a new directive in the MDOULE.bazel file, `use_repo_rule`, which is a convenient way to declare repos that are only visible to the current module. See #17141 (comment) for example usage.

Under the hood, this is implemented as an "innate" module extension for each module that uses `use_repo_rule`. The ID of this extension is a fake one: with the bzl file label of `@@name~version//:MODULE.bazel` and name of `__innate`. Each tag of this extension corresponds to a repo. The name of the tag is the string `${repo_rule_bzl_label}%${repo_rule_name}`, and the attributes of the tag are the attributes of the repo.

Fixes #17141.

PiperOrigin-RevId: 567722226
Change-Id: I16b8dc562dd20a676118867d831d3b2f5631f767
rrbutani added a commit to rrbutani/bazel-toolchain that referenced this issue Oct 22, 2023
rrbutani added a commit to rrbutani/bazel-toolchain that referenced this issue Oct 22, 2023
@farcube
Copy link

farcube commented Dec 12, 2023

http_file = use_repo_rule("@bazel_tools//:http.bzl", "http_file")

fails for me on 7.0.0 via:

ERROR: Analysis of target '//:mytarget' failed; build aborted: error loading '@@bazel_tools//:http.bzl' for repo rules, requested by @@myrule~override//:MODULE.bazel:13:10: Every .bzl file must have a corresponding package, but '@@bazel_tools//:http.bzl' does not have one. Please create a BUILD file in the same or any parent directory. Note that this BUILD file does not need to do anything except exist.

Shouldn't this case be covered now? Sorry if I misunderstood the point of use_repo_rule.

@meteorcloudy
Copy link
Member

http_file is located at @bazel_tools//tools/build_defs/repo:http.bzl, see https://bazel.build/rules/lib/repo/http

@cloudhan
Copy link

Here is the correct incarnation: http_archive = use_repo_rule("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

@cloudhan
Copy link

@Wyverald could use_repo_rule be placed in bazel_skylib to provide some compatibility for old LTS version?

@Wyverald
Copy link
Member

@Wyverald could use_repo_rule be placed in bazel_skylib to provide some compatibility for old LTS version?

No, as you can't load() things in MODULE.bazel files anyway, so having "compatibility helpers" in skylib doesn't help. But since use_repo_rule is really just syntactic sugar for a module extension, you could always just write the extension yourself if you need to support older LTS versions.

@cloudhan
Copy link

cloudhan commented Jan 17, 2024

But isn't skylib a bzlmod now? If we cannot load, we can just use_extension

bazel_dep(name = "bazel_skylib", version = "...")
compat = use_extension("@bazel_skylib//compat:extensions.bzl", "use_repo_rule")
compat.use_repo_rule(...)

right?

@lgalfaso
Copy link

# MODULE.bazel
http_archive = use_repo_rule("@bazel_tools//:http.bzl", "http_archive")
http_archive(name="foo", ...)
git_repository = use_repo_rule("@bazel_tools//:git.bzl", "git_repository")
git_repository(name="bar", ...)

There is one odd thing with this change that makes it not 100% equivalent. If you need to have any dependencies as data dependencies (say on foo), then they stop showing up at ${TEST_SRCDIR}/foo/..., and start showing up at ${TEST_SRCDIR}/_main~_repo_rules~foo/... that looks like exposing a little of the implementation details.

@meteorcloudy
Copy link
Member

Indeed, however the canonical repo name isn't any API users should depend on, you can access data files using runfiles library.

@lgalfaso
Copy link

Indeed, however the canonical repo name isn't any API users should depend on, you can access data files using runfiles library.

Just like many things with Bazel, things are not as simple, as sometimes what it says in one place is different than what it says somewhere else.

Say you have a dependency foo, the BUILD file at the root of the file is something along the lines of

filegroup(
    name = "bar",
    srcs = glob(["man/**/*.txt"]),
)

Say that @foo//:bar contains 1000s of files.

Then you have a cc_test that you would like to use the files from @foo//:bar

cc_test(
    name = "some_test",
    srcs = ["some_test.cpp"],
    data = ["@foo//:bar"],
    env = {
        "FOO_ROOT": # What do you put here to make a reference to the root directory of `@foo//:bar` (not the list of files, just the root)?
    },
)

Now, if you were to follow
https://bazel.build/concepts/dependencies#data-dependencies

These files are available using the relative path path/to/data/file. In tests, you can refer to these files by joining the paths of the test's source directory and the workspace-relative path, for example, ${TEST_SRCDIR}/workspace/path/to/data/file

then I would expect for ${TEST_SRCDIR}/foo/man to be reference to the man directory in @foo//:bar (for completeness, say that the workspace name of @foo is "foo").

And a third alternative is if you were to follow https://bazel.build/reference/be/c-cpp#cc_test.data where it states

Your C++ code can access these data files like so:

const std::string path = devtools_build::GetDataDependencyFilepath(
     "my/test/data/file");

Where is this library?

@meteorcloudy
Copy link
Member

Thanks for pointing those issues! I'm certainly not proud of our documentation, it looks like some docs are only meant for Blaze and should be updated for Bazel. Do you mind filing an issue for this problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-Bzlmod Bzlmod-specific PRs, issues, and feature requests P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: feature request
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

10 participants