Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make "canonical" symlinks to cache materializing popular files in local hermetic process executions #8905

Conversation

cosmicexplorer
Copy link
Contributor

Problem

We create temp dirs willy-nilly when running hermetic process executions locally. Because it appears most operating systems do garbage collection of /tmp only on restart, developing on pants will occasionally bring up a "disk out of space" error that is only resolved by restarting (is there a ticket for this?). While we would like the operating system to do this job for us, there may also be some benefit in terms of performance if we an avoid writing out so many files from LMDB to a temp dir while pants runs locally.

Solution

  • Add a field require_real_files: bool = False to ExecuteProcessRequest, which preserves the existing behavior when True.
    • When False, and when --process-execution-local-symlink-optimization-threshold is nonzero, begin to start producing symlinks for files which have been materialized more times than the value of that threshold, when creating the sandbox in /tmp for a local hermetic process execution.
  • Create LocalFileMaterializationCache in materialization_cache.rs to accomplish the above, by writing "canonical" file materializations into ~/.cache/pants/lmdb_cache/file_materialization_cache, and writing symlinks pointing to files in that directory.
  • Have the file materialization cache persist across pants executions by writing its info to a cache-info.json file when Droped.

Result

This will hopefully help to reduce the incidence of disk out of space errors when developing on pants locally. This is expected to improve the performance of local v2 process executions, but we don't currently invoke any of those en masse in parallel yet. The native or jvm backends might be good things to convert to v2 next to perhaps take advantage of this optimization.

@cosmicexplorer cosmicexplorer force-pushed the canonical-symlinks-for-popular-files-in-local-hermetic-process-executions branch from 58f52c8 to c0ba551 Compare January 5, 2020 10:22
process_execution_local_symlink_optimization_threshold=100,
# The default is set to clean up cached materializations if they have not been accessed for 2
# days.
process_execution_local_symlink_ttl=172800.0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit clearer if you write this as 60 * 60 * 24 * 2. It would be even better if there's some standard python library that would let you write something human-readable like "2 days" and convert it into seconds that way, but I don't know of one offhand.

Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty scary to me, because absent any kind of real sandboxing, this means that it's easy for one action to overwrite a file in its working directory (which is otherwise a 100% safe operation), and poison future actions (because they'll see the output of that action, rather than their own declared input). Is there a strong motivation for this other than working around this issue that #8892 is also working around? If that's what we're mostly doing, I'd be very interested in the answers to #8892 (comment) before we push any further here.

(This is just a review of the high-level concept, I haven't looked closely at the code)

self
.load_file_bytes_with(
digest,
move |bytes| {
// TODO: Determine when this situation happens! If we materialize files into a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The store is used plenty of places other than for process execution. For instance, one may in a @console_rule use Workspace.materialize_directory.

We strongly assume that for process execution, because the caller has itself made an empty directory, we won't hit this codepath. I don't think it's worth adding an arg like error_if_exists to double-cover that case.

@cosmicexplorer
Copy link
Contributor Author

This is pretty scary to me, because absent any kind of real sandboxing, this means that it's easy for one action to overwrite a file in its working directory (which is otherwise a 100% safe operation), and poison future actions (because they'll see the output of that action, rather than their own declared input).

I hadn't realized this very clear flaw!

Is there a strong motivation for this...?

Nope! I thought it might be an interesting optimization for the cases where we run v2 rulesets against the pants repo.

I will close this PR as I had not considered file mutability, and I don't want to have to figure that out correctly.

@cosmicexplorer
Copy link
Contributor Author

Hm. I also didn't realize that by making such cached files read-only, we could avoid the issue presented here. This would then mean that tasks which need to mutate such files would simply have to use require_real_files=True on ExecuteProcessRequest.

Because of that, I'd like to re-open this, especially as it might still be a useful optimization (haven't figured out how to effectively benchmark it yet).

@cosmicexplorer cosmicexplorer reopened this Jan 8, 2020
add notes and (unused) local file materialization cache!

add methods to query and modify the cached file materializations

the code compiles now!

add is_executable to a FileMaterializationInput cache key

add require_real_files field to ExecuteProcessRequest

add a test for symlink creation via the file materialization cache!

add --process-execution-local-symlink-optimization-threshold

remove /**/ comments

use an alternative pattern instead of repeating a match case

persist the file materialization cache info across pants invocations

add a FIXME todo comment, and make the tests pass again!

move materialization_cache.rs to its own file!

convert the cache entries to contain a TimeSpec

add a global ttl option for the symlink materialization cache

delete materialized files past the ttl time (supposedly)

TODO: add testing for this feature!

use SystemTime instead of Duration for last_accessed

fix up cache dir gc logic

add testing for ser/de of the cache!
@cosmicexplorer cosmicexplorer force-pushed the canonical-symlinks-for-popular-files-in-local-hermetic-process-executions branch from c0ba551 to 79ffc30 Compare January 8, 2020 00:54
@illicitonion
Copy link
Contributor

Hm. I also didn't realize that by making such cached files read-only, we could avoid the issue presented here. This would then mean that tasks which need to mutate such files would simply have to use require_real_files=True on ExecuteProcessRequest.

I think I'd still want better sandboxing before doing this; nothing currently stops a rule doing chmod +w file, which, while slightly devious, still isn't a surprising thing for some tools to do without you realising :(

@hrfuller hrfuller removed their request for review March 12, 2020 06:04
@Eric-Arellano
Copy link
Contributor

Closing in favor of #9716, which works around this issue by allowing users to specify the tmpdir.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants