-
-
Notifications
You must be signed in to change notification settings - Fork 640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make "canonical" symlinks to cache materializing popular files in local hermetic process executions #8905
Conversation
58f52c8
to
c0ba551
Compare
process_execution_local_symlink_optimization_threshold=100, | ||
# The default is set to clean up cached materializations if they have not been accessed for 2 | ||
# days. | ||
process_execution_local_symlink_ttl=172800.0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit clearer if you write this as 60 * 60 * 24 * 2. It would be even better if there's some standard python library that would let you write something human-readable like "2 days" and convert it into seconds that way, but I don't know of one offhand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty scary to me, because absent any kind of real sandboxing, this means that it's easy for one action to overwrite a file in its working directory (which is otherwise a 100% safe operation), and poison future actions (because they'll see the output of that action, rather than their own declared input). Is there a strong motivation for this other than working around this issue that #8892 is also working around? If that's what we're mostly doing, I'd be very interested in the answers to #8892 (comment) before we push any further here.
(This is just a review of the high-level concept, I haven't looked closely at the code)
self | ||
.load_file_bytes_with( | ||
digest, | ||
move |bytes| { | ||
// TODO: Determine when this situation happens! If we materialize files into a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The store is used plenty of places other than for process execution. For instance, one may in a @console_rule
use Workspace.materialize_directory
.
We strongly assume that for process execution, because the caller has itself made an empty directory, we won't hit this codepath. I don't think it's worth adding an arg like error_if_exists
to double-cover that case.
I hadn't realized this very clear flaw!
Nope! I thought it might be an interesting optimization for the cases where we run v2 rulesets against the pants repo. I will close this PR as I had not considered file mutability, and I don't want to have to figure that out correctly. |
Hm. I also didn't realize that by making such cached files read-only, we could avoid the issue presented here. This would then mean that tasks which need to mutate such files would simply have to use Because of that, I'd like to re-open this, especially as it might still be a useful optimization (haven't figured out how to effectively benchmark it yet). |
add notes and (unused) local file materialization cache! add methods to query and modify the cached file materializations the code compiles now! add is_executable to a FileMaterializationInput cache key add require_real_files field to ExecuteProcessRequest add a test for symlink creation via the file materialization cache! add --process-execution-local-symlink-optimization-threshold remove /**/ comments use an alternative pattern instead of repeating a match case persist the file materialization cache info across pants invocations add a FIXME todo comment, and make the tests pass again! move materialization_cache.rs to its own file! convert the cache entries to contain a TimeSpec add a global ttl option for the symlink materialization cache delete materialized files past the ttl time (supposedly) TODO: add testing for this feature! use SystemTime instead of Duration for last_accessed fix up cache dir gc logic add testing for ser/de of the cache!
c0ba551
to
79ffc30
Compare
I think I'd still want better sandboxing before doing this; nothing currently stops a rule doing |
Closing in favor of #9716, which works around this issue by allowing users to specify the tmpdir. |
Problem
We create temp dirs willy-nilly when running hermetic process executions locally. Because it appears most operating systems do garbage collection of
/tmp
only on restart, developing on pants will occasionally bring up a "disk out of space" error that is only resolved by restarting (is there a ticket for this?). While we would like the operating system to do this job for us, there may also be some benefit in terms of performance if we an avoid writing out so many files from LMDB to a temp dir while pants runs locally.Solution
require_real_files: bool = False
toExecuteProcessRequest
, which preserves the existing behavior whenTrue
.False
, and when--process-execution-local-symlink-optimization-threshold
is nonzero, begin to start producing symlinks for files which have been materialized more times than the value of that threshold, when creating the sandbox in/tmp
for a local hermetic process execution.LocalFileMaterializationCache
inmaterialization_cache.rs
to accomplish the above, by writing "canonical" file materializations into~/.cache/pants/lmdb_cache/file_materialization_cache
, and writing symlinks pointing to files in that directory.cache-info.json
file whenDrop
ed.Result
This will hopefully help to reduce the incidence of disk out of space errors when developing on pants locally. This is expected to improve the performance of local v2 process executions, but we don't currently invoke any of those en masse in parallel yet. The native or jvm backends might be good things to convert to v2 next to perhaps take advantage of this optimization.