Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Track empty directories in tree artifacts #16015

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Aug 1, 2022

Fixes #15789
Fixes #15901

@fmeum fmeum changed the title Track empty directories in tree artifacts [WIP] Track empty directories in tree artifacts Aug 1, 2022
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 1, 2022

@tjgq Could you take a quick glance and let me know whether you approve of the general approach (introducing TreeEmptyDirectoryArtifact to complement TreeFileArtifact as a subclass of the new TreeChildArtifact)? I would then add and fix more tests.

@tjgq
Copy link
Contributor

tjgq commented Aug 1, 2022

@alexjski Could you take a first pass since you're the expert on tree artifacts? Feel free to reassign back to me to shepherd it in if/once you're happy with the general approach.

@tjgq tjgq requested a review from alexjski August 1, 2022 19:52
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 1, 2022

Please note that there are also special cases I haven't dealt with yet outside of test code: For example, the Starlark DirectoryExpander should probably filter out TreeEmptyDirectoryArtifacts.

@alexjski
Copy link
Contributor

alexjski commented Aug 2, 2022

I have some questions and comments here:

  1. The presence of the empty directory in the expansion may be a little more complicated API. We include directories in there, but only sometimes, when they are empty. I get that this is really leaf-level FS objects, but I hope it is obvious that files-only is a cleaner semantics. You mentioned that you may not want to expose those in Starlark, my question would be how often do we actually want the directories from the expansion? If it is very rare, maybe having a separate method would make more sense (granted there are symlinks there already). The last thing is not a very serious suggestion, more like trying to understand the use cases when we want that.

    As for Starlark, if we want to go all out, we could just recommend users to check isDirectory on the incoming FileAPI. DirectoryExpander is not the only way to experience that in Starlark, there is just args.add_all([tree]) which would include whole expansion (I believe you filtered that one out for now).

  2. Do we collect empty directories within trees with remote strategy? It would not be ideal if we could have different results depending on which side of dynamic execution wins the race.

  3. What happens for template actions here--will we try to compile an empty directory?

  4. I would strongly recommend having the behavior of collecting empty directories under a flag--my gut feeling is that we would not want to use that internally for a while.

There also is a workaround for declaring empty directories within trees today. The capability is weaker since it only allows you to declare that at analysis time (as opposed to whatever the action does), but you can declare that an action returns multiple, nested tree artifacts, e.g.

top = ctx.actions.declare_directory('foo')
nested = ctx.actions.declare_directory('foo/empty')

return DefaultInfo(files=depset([top, nested]))

This will make Bazel aware of the inner directory and you will have an empty tree artifact for that. Addition/removal in such sense clearly affects the action itself, hence will be handled in incremental builds. Same goes for action cache.

Let me add @aiuto for the potential API change here and @coeuvre for remote execution discussion.

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 2, 2022

Let me first clarify the intention behind this PR: I view #15789 and #15901 as serious correctness issues in a supported feature that is enabled by default (directory outputs). The PR is meant to address these issues while preserving backwards compatibility, especially concerning API. I explicitly do not want to introduce new features.

  1. The presence of the empty directory in the expansion may be a little more complicated API. We include directories in there, but only sometimes, when they are empty. I get that this is really leaf-level FS objects, but I hope it is obvious that files-only is a cleaner semantics. You mentioned that you may not want to expose those in Starlark, my question would be how often do we actually want the directories from the expansion? If it is very rare, maybe having a separate method would make more sense (granted there are symlinks there already). The last thing is not a very serious suggestion, more like trying to understand the use cases when we want that.
    As for Starlark, if we want to go all out, we could just recommend users to check isDirectory on the incoming FileAPI. DirectoryExpander is not the only way to experience that in Starlark, there is just args.add_all([tree]) which would include whole expansion (I believe you filtered that one out for now).

My personal feeling is that you almost never want the empty directories in the Starlark-facing expansions - the directories are only there to ensure correct incrementality. Staying in line with this PR being a backwards compatible bug fix, I would vote for skipping over TreeEmptyDirectoryArtifacts both in CommandLine implementations and the DirectoryExpander API - if there should ever be a need to expand the empty directories in the future, this can still be introduced as new API, e.g., a new parameter on the DirectoryExpander methods, gated behind a flag.

  1. Do we collect empty directories within trees with remote strategy? It would not be ideal if we could have different results depending on which side of dynamic execution wins the race.

With this PR we do (although I haven't added tests for special remote functionality such as injected artifacts yet) and this is one of the core motivations for this PR - after all, both remote and disk cache are covered by the remote strategy logic. In fact, the status quo in Bazel is that local and remote execution differ in that the former sees the empty directories while the latter doesn't - a bug I intend to fix with this PR.

  1. What happens for template actions here--will we try to compile an empty directory?

I only found two instances of concrete template actions in the Bazel source code, so I just put the logic for skipping TreeEmptyDirectoryArtifact there. If you have more such actions internally, I could also move this logic to a higher level - I can't think of a case where creating actions for empty directories would ever be desired.

  1. I would strongly recommend having the behavior of collecting empty directories under a flag--my gut feeling is that we would not want to use that internally for a while.

Given that the aim of the PR is to fix incorrect behavior while preserving backwards compatibility, is a flag necessary? I would very much prefer Bazel's stable features to behave correctly by default. I will happily add more tests and ignore TreeEmptyDirectoryArtifacts in more places to make this possible.

In the worst case, couldn't we introduce a flag that defaults to true in Bazel but can be overridden internally at Google?

@tjgq
Copy link
Contributor

tjgq commented Aug 2, 2022

Let me first clarify the intention behind this PR: I view #15789 and #15901 as serious correctness issues in a supported feature that is enabled by default (directory outputs). The PR is meant to address these issues while preserving backwards compatibility, especially concerning API. I explicitly do not want to introduce new features.

IMO what's fundamentally at stake here is the semantics of a tree artifact: is it a representation of the full directory structure including empty directories, or merely of regular files (and symlinks) found within? The current implementation doesn't provide a conclusive answer because it's inconsistent: a freshly executed action is allowed to produce empty subdirectories, but retrieving an action result from a cache won't recreate them.

I should admit that, when I filed #15901 in response to #15789, I was a bit hasty to assume that we want the former interpretation for tree artifacts. It's possible that the latter interpretation is the one we prefer. If that were to be the case, then the way to fix the inconsistency would be to have Bazel delete empty subdirectories left behind by the execution of a tree-generating action.

@fmeum
Copy link
Collaborator Author

fmeum commented Aug 2, 2022

I agree that if we decide that TreeArtifacts are merely collections of files, we would either have to delete empty subdirectories after the tree has been generated or, more drastically, fail the build when we encounter an empty subdirectory.

I heavily lean towards not going down that route though for two reasons:

  1. The root directory of a TreeArtifacts is implicitly created for both action inputs and outputs - this has been the case with local and sandboxed strategies for quite some time and added for the remote strategy by myself in Unify sandbox/remote handling of empty TreeArtifact inputs #15276. This already contradicts the interpretation that a TreeArtifact is nothing more than a collection of files to some extent.
  2. More importantly, from the perspective of Bazel end users, a TreeArtifact is merely an implementation detail of a feature available as ctx.actions.declare_directory. Directories do have clear semantics and are not just a collection of files, but may also contain empty directories. I would argue that this already fixes the semantics of TreeArtifacts, regardless of which interpretation we as devs would prefer for implementation convenience.

@alexjski
Copy link
Contributor

alexjski commented Aug 2, 2022

I wonder if at this point it would not be productive to draft a doc to discuss the problem and alternatives considered (https://bazel.build/contribute/design-documents). In particular, I would expect that most actions don't care about empty directories (thinking of compilers etc), it would be interesting to know a case study where that is important.

Now, I understand the general idea here, I am personally not convinced that this PR won't result with getting empty directories in places unintentionally (e.g., I think it would make it to WorkRequest in current form). In fact, I think that it is in niche situations that we want those, so maybe expansion should not include the directories unless requested (e.g. for sandboxed/remote strategy).

As I said before, this feature would require us the ability to at least disable that internally, not sure if for Bazel we would not want to start with a feature switch either, but I am less invested in that, so I am indifferent, will defer that to @tjgq.

Couple alternatives/ideas to consider (those are not necessarily good ideas, more like loose thoughts):

  • As I mentioned before, nested tree artifacts allow us to declare a directory within tree--would be good to know whether that is sufficient (yes, that is a weird API and is inconvenient too).
  • Would people be weirded out that there are more things in the tree artifact than in the expansion? Maybe we could add a new expansion method for the API? Another option could be to carry a boolean with the tree artifact whether it skips empty directories (ctx.actions.declare_directory(name="foo", include_empty_directories=True)--that obviously makes sense only if we want to selectively apply that, it could be an alternative to a flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants