--action_env ignored when `cfg = "exec"` is used #17062

Flamefire · 2022-12-21T13:59:01Z

Description of the bug:

We need to pass some environment variables like "$CPATH" to the compiler when building TensorFlow with Bazel. This is cumbersome itself and has led to hard-to-debug issues like #12059 in the past already.

Now we again see failures caused by action-env values not passed to the compiler invocation in TensorFlow 2.8.4 which I tracked down to tensorflow/tensorflow@07cbc7b which changes cfg = "host" to cfg = "exec"

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Build TensorFlow 2.8.4 with Bazel 4.2.2 passing --action-env=CPATH and observe that it is not passed to some compiler invocations resulting in e.g.:

In file included from bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/core/framework/dataset_options.pb.cc:4:
bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/core/framework/dataset_options.pb.h:10:10: fatal error: google/protobuf/port_def.inc: No such file or directory
   10 | #include <google/protobuf/port_def.inc>

So can you provide information on how to use --action-env (or similar) in such circumstances?

An explanation on what is actually being done with the change to "exec" from "host" would also be very welcome. In our case we are not cross-compiling so host, target and build machine are all the same.

I would clearly classify this behavior as a bug because the documentation states:

Specifies the set of environment variables available during the execution of all actions.

But obviously there are now actions where those are missing but they should be in "all actions"

Which operating system are you running Bazel on?

REHL 7

The text was updated successfully, but these errors were encountered:

"host" cfg is deprecated, and "exec" cfg saves ~10% compile time on kernels like polygamma. PiperOrigin-RevId: 404540569 Change-Id: I2a8be13501a42dc9ad7fbee0ac20b6bdf3124262

fmeum · 2022-12-21T15:23:16Z

Does passing in CPATH via --host_action_env work? The user manual you link to does seem to be outdated, the full command-line reference mentions that --action_env applies to the target configuration whereas --host_action_env applies to host and exec configuration.

Flamefire · 2022-12-21T16:22:15Z

Does passing in CPATH via --host_action_env work? The user manual you link to does seem to be outdated

Yes that seems to work. But I guess now I have to duplicate all action_env into host_action_env. We are using --distinct_host_configuration=false which looks like it should work. But now can't find it in the docs.

fmeum · 2022-12-21T16:25:56Z

@Flamefire --distinct_host_configuration is a no-op at HEAD. Whether to bring it back has already been discussed in the context of Tensorflow at some point, but I can't find the issue anymore. @meteorcloudy who I think was part of that discussion.

meteorcloudy · 2022-12-22T07:27:01Z

@fmeum I believe it's #14848 (comment)
/cc @gregestren who was also in that discussion.

Flamefire · 2022-12-24T16:19:25Z

Seems my comment didn't got saved so again the most important bits as multiple people spent days trying to figure out why TF started to fail to compile due to the wrong/unset env variables (I finally found the commit that broke/changed it by bisecting over 2000 commits):

We are in, what should be, the most simple situation: Build a software (TensorFlow) with Bazel on the machine it is supposed to run on. On other build systems it is a variation of configure && make && make install but for Bazel which by default deletes all environment variables we need to pass commandline options to have them set (this is about e.g. $CPATH and $LIBRARY_PATH)

Previously --action_env has mostly worked as it was documented as "environment variables available during the execution of all actions". Above I linked the documentation of Bazel 4.2.2 which is the version of Bazel I've been using. @fmeum Mentioned that this is wrong. As behavior of Bazel changes considerably between versions it would be good to have correct documentation for each specific version and I would count this at least as a bug report against the 4.2.2 documentation if the observed behavior is intended and hence the new documentation (which doesn't seem to be specific to a version?) is correct.

However even before the change in TF it was brittle as the used environment depended on many other factors, for example:

Use of tools vs exec_tools: Rename exec_tools to tools tensorflow/tensorflow#44901
Unset use_default_shell_env: Add use_default_shell_env = True to all ctx.actions.run_shell rules tensorflow/tensorflow#44549
Now the cfg attr

I don't fully understand the difference between "host" and "exec" configurations, they sound very similar. Also the observed behavior surprises me: --action_env applies to "target" configurations per the new docs but reverting the TF commit (so changing cfg = "exec" back to cfg = "host") makes the environment variables be passed. But with cfg = "exec" I need to set them via --host_action_env. Why is that? The last part makes sense per the new docs but why --action_env does apply to "host" cfgs isn't clear to me. Has this changed? Or is this the expected effect of --distinct_host_configuration=false? In which Bazel version will this be a no-op?

Next question is why use_default_shell_env isn't set by default? The name implies that.

And finally I'd like to ask what the "correct" way would be to build something with Bazel on the machine it is to be run i.e. a local, **non-**crosscompilation build.

Things that come to mind:

Keep/Restore the behavior of --action_env applying to all actions and introduce a separate --target_action_env similar to --host_action_env
Have a flag like --no-distinct-configurations so that "host", "target" and "exec" are all the same avoiding potential rebuilds and issues like we have seen with unexpected changes to action environments

I would also greatly appreciate an easy way to find out why a specific C++ file is compiled (or library linked) with specific env variables (i.e. in a specific configuration). E.g. it would help if --subcommands would show the configuration name of an action and there would be a way to query how & why a specific file is built. It was very confusing that a library was build with missing env variables but calling bazel build on the seemingly obvious target had the env variables and hence worked. Turned out that the library was build again as part of a dependency of a dependency of a tool which had somewhere cfg = "exec". But I haven't found a way in the documentation to find that from a source tree and/or with bazel. So the only way I found feasible is find the commit which changed the behavior and check those changes.

I hope that helps and Merry Christmas! 🎅

gregestren · 2023-01-04T18:56:56Z

Generally exec and host should be basically the same. I say basically because exec is intentionally supposed to generalize the concept of 'host. In particular, it's possible to have multiple execution architectures per build. exec supports this, while host doesn't.

I understand you're saying this doesn't matter for your use case, where all architectures are the same.

action_env should pass the same exact way to exec as it does to host. I suspect the reason you're not seeing that is --distinct_host_configuration=false, as you mention.

As @fmeum mentioned, --distinct_host_configuration=false is a no-op after a certain Bazel version (I'd have to check which). For any Bazel after that version I'd expect --action_env to behave equivalent for host and exec.

But if your baseline is an older version where --distinct_host_configuration=false works, that flag makes host an alias for the target configuration. So even though --action_env only applies to the target configuration, that kicks in here. We've never had a way to make exec an alias for the target configuration, hence the difference.

Most important, I think, is if you really need some equivalent of --distinct_host_configuration=false going forward. TensorFlow has argued for that kind of functionality. I think we could revive that by making an exec transition a no-op. And if it works for TensorFlow that's great. But I'd want to double check we're sure that the semantics that implies are okay. Particularly that any flags that should only apply to a host action (like, say, --compilation_mode=opt instead of --compilation_mode=dbg will not trigger in this mode. So there's some fundamental awkwardness there.

Flamefire · 2023-01-09T09:19:43Z

For any Bazel after that version I'd expect --action_env to behave equivalent for host and exec.

You mean "behave equivalent for" as "have no effect on", correct? As I understood --host_action_env must be used for host and exec.

Most important, I think, is if you really need some equivalent of --distinct_host_configuration=false going forward. TensorFlow has argued for that kind of functionality. I think we could revive that by making an exec transition a no-op.

A small step would be to have --action_env (i.e. no prefix like host_) pass those variables to literally *all actions. Or a flag with that effect, e.g. --all_action_env. Our current workaround is to duplicate --action_env to --host_action_env which makes the commandline huge and may at some point hit a limit.

What would help us even more is to have no such "transitions", i.e. keep all environment variables from the current environment for all actions. We have modules which on load set up paths in $CPATH, $PATH, $LD_LIBRARY_PATH etc. as there are multiple versions of the same software installed side-by-side so only one can be "active" at a time which means not having a central location for all software installations, i.e. not installing everything into /usr or similar.

Particularly that any flags that should only apply to a host action (like, say, --compilation_mode=opt instead of --compilation_mode=dbg will not trigger in this mode. So there's some fundamental awkwardness there.

I don't understand that part. Can you elaborate or provide a link to the docs what flags only apply to some actions and the implications?

gregestren · 2023-01-09T20:10:27Z

For any Bazel after that version I'd expect --action_env to behave equivalent for host and exec.

You mean "behave equivalent for" as "have no effect on", correct? As I understood --host_action_env must be used for host and exec.

Yes. According to my reading --action_env doesn't propagate to either the host or exec configs. While --host_action_env exclusively applies to the host / exec configs.

The only exception to this (the only place where host and exec could diverge) is when setting --distinct_host_configuration=false on a sufficiently old Bazel version.

A small step would be to have --action_env (i.e. no prefix like host_) pass those variables to literally *all actions. Or a flag with that effect, e.g. --all_action_env. Our current workaround is to duplicate --action_env to --host_action_env which makes the commandline huge and may at some point hit a limit.

What would help us even more is to have no such "transitions", i.e. keep all environment variables from the current environment for all actions. We have modules which on load set up paths in $CPATH, $PATH, $LD_LIBRARY_PATH etc. as there are multiple versions of the same software installed side-by-side so only one can be "active" at a time which means not having a central location for all software installations, i.e. not installing everything into /usr or similar.

Doing some spelunking, #4008 looks pretty relevant - an old bug that looks like it's asking for the same sort of thing. That was fixed by #12122 which led to the current --action_env / --host_action_env split. Which I understand you're saying is insufficient.

@meteorcloudy asked me at #12122 (comment) if we could just propagate --action-env to the host / exec config. I didn't give a direct answer because I don't consider myself an authority on the --action_env feature. I don't feel I fully understand the implications of such a change one way or another.

What we need is a clear owner of the --action_env feature who can evaluate that and recommend. @meteorcloudy any idea who that would be?

Particularly that any flags that should only apply to a host action (like, say, --compilation_mode=opt instead of --compilation_mode=dbg will not trigger in this mode. So there's some fundamental awkwardness there.

I don't understand that part. Can you elaborate or provide a link to the docs what flags only apply to some actions and the implications?

I don't think there's a clear API or documentation for this now. The authoritative source for now is looking at the code: https://cs.opensource.google/search?q=%22FragmentOptions%20getExec%22&sq=&ss=bazel.

For example https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/analysis/config/CoreOptions.java;l=985;drc=3d29b2e14f143cd56dfee4d49d991a6e330f3a85 has the line that propagates --host_action_env.

keith · 2023-01-09T21:44:16Z

I don't think we should make --action_env apply to the exec configuration by default. I think similar to this issue #13839 the inability to use flags that only target the target configuration causes issues.

fmeum · 2023-01-09T22:05:39Z

@Flamefire Just in case you didn't know: You can use --action_env=FOOBAR to use the value of FOOBAR from the environment, which does help with command line length limits.

I agree with Keith that --action_env should not automatically apply to the exec configuration. That would end up causing plenty of unexpected rebuilds where they are not expected (looking at you, protoc).

Flamefire · 2023-01-10T08:21:04Z

For example https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/analysis/config/CoreOptions.java;l=985;drc=3d29b2e14f143cd56dfee4d49d991a6e330f3a85 has the line that propagates --host_action_env.

As someone not involved in Bazel development that is hard to understand. There are 2 lines: actionEnvironment and hostActionEnvironment are both passed hostActionEnvironment. Now I'm non the wiser... Hence some documentation on those "contexts" and where and why and when envs apply. E.g. I never understood why use_default_action_env is not the default which has led to issues for us as variables were not forwarded and finding that this is what was missing was hard.

I don't think we should make --action_env apply to the exec configuration by default. I think similar to this issue #13839 the inability to use flags that only target the target configuration causes issues.

My suggestion was to have a uniform notation. E.g. there could be target_action_env and exec_action_env for specific options and action_env for "all" actions (as it was documented before and IMO makes sense naming-wise)

That would end up causing plenty of unexpected rebuilds where they are not expected (looking at you, protoc).

That would indeed be an issue. But is that situation different to passing all variables to action_env and host_action_env? Because that's what we do now and given the situation that there is/should be only a single configuration (build machine=target machine) I'd expect things to be build only once.

uri-canva · 2023-01-17T03:39:16Z

Doing some spelunking, #4008 looks pretty relevant - an old bug that looks like it's asking for the same sort of thing. That was fixed by #12122 which led to the current --action_env / --host_action_env split.

I'm quite confused by that sequence of events. #4008 is an issue that happens because a lot of rules in the ecosystem don't support --action_env because they're implemented with run_shell, which by default doesn't use the action env (use_default_shell_env = False). In my mind the fix for that would be for run_shell to use the action env by default, not to introduce a --host_action_env.

uri-canva · 2023-01-17T03:39:57Z

Maybe this issue is just another aspect of #3320?

oquenchil · 2023-01-20T11:19:49Z

It looks like people who are most familiar with this area do not think this is a good idea so closing for now but please feel free to continue the discussion.

Flamefire · 2023-01-20T11:59:56Z

I'm quite confused by that sequence of events. #4008 is an issue that happens because a lot of rules in the ecosystem don't support --action_env because they're implemented with run_shell, which by default doesn't use the action env (use_default_shell_env = False).

That's also what confused me: The option is named "use_default" and the default for this is "False"!

Maybe this issue is just another aspect of #3320?

Pretty much yes. It's at least another instance where env variables are not used in actions unexpectedly.

@fmeum @keith

I don't think we should make --action_env apply to the exec configuration by default. I think similar to this issue #13839 the inability to use flags that only target the target configuration causes issues.

I agree with Keith that --action_env should not automatically apply to the exec configuration. That would end up causing plenty of unexpected rebuilds where they are not expected (looking at you, protoc).

Yes maybe not "automatically apply". But there is a need for an option to pass env variables to all actions. This would even reduce the number of rebuilds because the environment is the same, wouldn't it?
It is just that --action_env would be the best fitting name for such an option but I'd also be happy with a new name just so we don't need to figure out again and again why a change in Bazel or TensorFlow now causes env variables to not be passed again which is very time consuming.

fmeum · 2023-01-20T12:48:49Z

This would even reduce the number of rebuilds because the environment is the same, wouldn't it?

I don't think it would as the target and exec configurations differ in other ways (on purpose) even if you align --action_env and --host_action.

@Flamefire I still think that the best solution for your use case is a --distinct_exec_configuration flag. With --nodistinct_exec_configuration, your --action_env would apply to all actions: It applies to all actions built in the target configuration, but with --nodistinct_exec_configuration, nothing ever transitions from the target to the exec configuration. That should resolve both issues you are experiencing:

No need to duplicate --action_env as --host_action_env.
No spurious rebuilds because compilers are built in the exec configuration.

What do you think about that?

Re use_default_shell_env: I personally think that --action_env isn't a good concept and more of an escape hatch as it potentially applies to all actions (bad for caching) but then again only to those that explicitly opt into it (bad when you really need it). Most custom actions inherently don't care about the environment (e.g. generating a Javadoc does not require CC or PATH) and there should be ways to more selectively inherit environment variables from the shell environment.

Flamefire · 2023-01-20T13:38:06Z

@Flamefire I still think that the best solution for your use case is a --distinct_exec_configuration flag. With --nodistinct_exec_configuration, your --action_env would apply to all actions: It applies to all actions built in the target configuration, but with --nodistinct_exec_configuration, nothing ever transitions from the target to the exec configuration. That should resolve both issues you are experiencing:

Just so I understand this correctly: There are 3 configurations: host, exec and target. I don't understand the distinction between host and exec but that is what caused the issue I reported here so there seem to be differences.
So if that --nodistinct_exec_configuration never causes any transition so "--action_env would apply to all actions" then this is what we need. I just want to make sure that a spurious "host" exec somewhere does not cause the same problem again.

Re use_default_shell_env: I personally think that --action_env isn't a good concept and more of an escape hatch as it potentially applies to all actions (bad for caching) but then again only to those that explicitly opt into it (bad when you really need it). Most custom actions inherently don't care about the environment (e.g. generating a Javadoc does not require CC or PATH) and there should be ways to more selectively inherit environment variables from the shell environment.

We are building TensorFlow using Bazel in a fresh folder, so no cache there, and want it to apply to all actions (or just don't clear the environment in the first place) because surprisingly many tools do care about the environment which we e.g. noticed by the change in TensorFlow from exec_tools to tools (or vice versa) where the already installed protobuf compiler failed to execute due to missing libraries (caused by unset LD_LIBRARY-PATH). So yes not all actions care about all environment variables but IMO Bazel as the build system and even build file authors cannot always know which are actually required. For example we have PYTHONNOUSERSITE=1 set in our action_env because user modules can cause sudden build failures when installing software for the cluster. And some (build) tools from TensorFlow are Python modules and hence potentially affected by that but might be called by binaries that then use a Python module and so on.

fmeum · 2023-01-20T14:16:45Z

There is no host configuration anymore, everything that used to be host is now exec. However, as part of that change, --distinct_host_configuration went away, which is why you saw a regression.

I agree that BUILD file authors can't know which env variables are required, which is precisely why I believe --action_env is generally the wrong tool: It puts the burden on the Bazel end user to maintain a global list of variables that actions forgot to set themselves. For very few, very generic variables such as PATH and LD_LIBRARY_PATH I can see that being reasonable, but PYTHONNOUSERSITE is definitely something the Python rules should set (or at least set if configured in a certain rules_python-specific way).

Flamefire · 2023-01-20T15:00:54Z

There is no host configuration anymore, everything that used to be host is now exec. However, as part of that change, --distinct_host_configuration went away, which is why you saw a regression.

I'm a bit confused here. In which way is there "no host configuration anymore"? Because I was using TensorFlow 2.8.4 with Bazel 4.2.2 which failed and after reverting tensorflow/tensorflow@07cbc7b, i.e. changing to exec="host" it worked again passing --action-env.

Or are you referring to some change in Bazel after 4.2.2 where using exec="host" would be an error? Or how was Bazel changed, i.e. what would happen now when using exec="host"?

If you could tell me which version introduced which change related to this it would really help our tooling around that. Thanks a lot!

fmeum · 2023-01-20T15:05:14Z

@gregestren would know best when exactly cfg = "host" became/becomes equivalent to cfg = "exec". That definitely wasn't the case in Bazel 4.

keith · 2023-01-20T16:50:10Z

I agree that the default for use_default_shell_env should be reconsidered, I'm not sure if there's an issue for that or not. I agree with fabian that having a --distinct_exec_configuration is most likely to serve your needs

uri-canva · 2023-01-21T01:26:31Z

It sounds like in general --action_env, --test_env and --repo_env should be avoided? Instead of them rules should receive input from the user using bazel native mechanisms such as build settings and attributes, and then set the appropriate environment when invoking executables based on them? Is this documented somewhere? We've been minimising our use of those flags for a while now but it would have been nice to have known this earlier in our bazel journey.

fmeum · 2023-01-21T08:38:36Z

--repo_env is quite useful as it still only applies to repo rules that declare a dependency on the particular environment variable and changes to the flag take effect immediately rather than after a server restart.

Having better doc on when to use either of these flags and when some other approach is better would be very useful. You could file a doc issue for that.

gregestren · 2023-01-24T00:01:33Z

@gregestren would know best when exactly cfg = "host" became/becomes equivalent to cfg = "exec". That definitely wasn't the case in Bazel 4.

I believe Bazel 6.0.

--distinct_exec_configurationis definitely an option. I'm not sure how widespread the need is, but it clear is in this case.

Recently I had some more out of the box ideas on how we might determine what makes up exec configurations. But that's probably a conversation for another place.

Flamefire referenced this issue in tensorflow/tensorflow Dec 21, 2022

Build tf_to_kernel in "exec" cfg.

07cbc7b

"host" cfg is deprecated, and "exec" cfg saves ~10% compile time on kernels like polygamma. PiperOrigin-RevId: 404540569 Change-Id: I2a8be13501a42dc9ad7fbee0ac20b6bdf3124262

Flamefire mentioned this issue Dec 21, 2022

--action_env with defined value is not working as expected #8578

Closed

sgowroji added team-Rules-CPP Issues for C++ rules type: support / not a bug (process) untriaged labels Dec 22, 2022

Flamefire mentioned this issue Jan 4, 2023

{lib}[foss/2022a] TensorFlow v2.9.1 w/ Python 3.10.4 + CUDA 11.7.0 easybuilders/easybuild-easyconfigs#16620

Closed

Flamefire mentioned this issue Jan 6, 2023

fix TensorFlow easyblock for new versions of Bazel & TensorFlow easybuilders/easybuild-easyblocks#2854

Merged

oquenchil closed this as completed Jan 20, 2023

oquenchil removed the untriaged label Jan 20, 2023

Inirit mentioned this issue Sep 23, 2023

Build on CI machine failing with error "Couldn't find a valid ICU package installed on the system" bazel-contrib/rules_dotnet#385

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

--action_env ignored when `cfg = "exec"` is used #17062

--action_env ignored when `cfg = "exec"` is used #17062

Flamefire commented Dec 21, 2022 •

edited

Loading

fmeum commented Dec 21, 2022

Flamefire commented Dec 21, 2022

fmeum commented Dec 21, 2022

meteorcloudy commented Dec 22, 2022

Flamefire commented Dec 24, 2022 •

edited

Loading

gregestren commented Jan 4, 2023

Flamefire commented Jan 9, 2023

gregestren commented Jan 9, 2023

keith commented Jan 9, 2023

fmeum commented Jan 9, 2023

Flamefire commented Jan 10, 2023

uri-canva commented Jan 17, 2023

uri-canva commented Jan 17, 2023

oquenchil commented Jan 20, 2023

Flamefire commented Jan 20, 2023

fmeum commented Jan 20, 2023

Flamefire commented Jan 20, 2023

fmeum commented Jan 20, 2023 •

edited

Loading

Flamefire commented Jan 20, 2023

fmeum commented Jan 20, 2023

keith commented Jan 20, 2023

uri-canva commented Jan 21, 2023

fmeum commented Jan 21, 2023

gregestren commented Jan 24, 2023

--action_env ignored when cfg = "exec" is used #17062

--action_env ignored when cfg = "exec" is used #17062

Comments

Flamefire commented Dec 21, 2022 • edited Loading

Description of the bug:

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Which operating system are you running Bazel on?

fmeum commented Dec 21, 2022

Flamefire commented Dec 21, 2022

fmeum commented Dec 21, 2022

meteorcloudy commented Dec 22, 2022

Flamefire commented Dec 24, 2022 • edited Loading

gregestren commented Jan 4, 2023

Flamefire commented Jan 9, 2023

gregestren commented Jan 9, 2023

keith commented Jan 9, 2023

fmeum commented Jan 9, 2023

Flamefire commented Jan 10, 2023

uri-canva commented Jan 17, 2023

uri-canva commented Jan 17, 2023

oquenchil commented Jan 20, 2023

Flamefire commented Jan 20, 2023

fmeum commented Jan 20, 2023

Flamefire commented Jan 20, 2023

fmeum commented Jan 20, 2023 • edited Loading

Flamefire commented Jan 20, 2023

fmeum commented Jan 20, 2023

keith commented Jan 20, 2023

uri-canva commented Jan 21, 2023

fmeum commented Jan 21, 2023

gregestren commented Jan 24, 2023

--action_env ignored when `cfg = "exec"` is used #17062

--action_env ignored when `cfg = "exec"` is used #17062

Flamefire commented Dec 21, 2022 •

edited

Loading

Flamefire commented Dec 24, 2022 •

edited

Loading

fmeum commented Jan 20, 2023 •

edited

Loading