Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add caching to formatter #8089

Merged
merged 1 commit into from
Oct 23, 2023
Merged

Add caching to formatter #8089

merged 1 commit into from
Oct 23, 2023

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Oct 20, 2023

Summary

Implement caching for ruff format

The formatter caching is relatively simple. All we need to track is the formatted files. This PR extends our Cache implementation to instead track the changes to the cache instead of the new "file cache" contents (similar to an event store). Meaning, the cache now stores a changes collection instead of new_files and a Change either sets the formatted flag or updates the lint results.

Test Plan

Manual testing:

  • Checked a file with E701 (Multiple statements on same line): Diagnostic is shown
  • Formatted the code
  • Ran check: Diagnostic is gone (cache invalidation)

The speedup from the benchmarking shows that caching is working.

Benchmarking

Hot formatter cache, no changes

hyperfine  --warmup 10  \
                    "../ruff/target/release/ruff format . -s" \
                    "../ruff/ruff-main format ../airflow -s"
Benchmark 1: ../ruff/target/release/ruff format . -s
  Time (mean ± σ):     111.9 ms ±   3.6 ms    [User: 67.3 ms, System: 154.8 ms]
  Range (min … max):   104.1 ms … 121.2 ms    27 runs
 
Benchmark 2: ../ruff/ruff-main format ../airflow -s
  Time (mean ± σ):     312.7 ms ±   2.8 ms    [User: 1815.5 ms, System: 163.4 ms]
  Range (min … max):   308.8 ms … 316.7 ms    10 runs
 
Summary
  ../ruff/target/release/ruff format . -s ran
    2.80 ± 0.09 times faster than ../ruff/ruff-main format ../airflow -s

Hot formatter and linter cache, no changes

slightly slower for projects with many diagnostics

../ruff/target/release/ruff check . -s
hyperfine  --warmup 10  \
                    "../ruff/target/release/ruff format . -s" \
                    "../ruff/ruff-main format ../airflow -s"
Benchmark 1: ../ruff/target/release/ruff format . -s
  Time (mean ± σ):     118.4 ms ±   3.1 ms    [User: 76.7 ms, System: 160.9 ms]
  Range (min … max):   109.7 ms … 122.9 ms    24 runs
 
Benchmark 2: ../ruff/ruff-main format ../airflow -s
  Time (mean ± σ):     313.5 ms ±   5.5 ms    [User: 1819.3 ms, System: 163.6 ms]
  Range (min … max):   308.2 ms … 323.7 ms    10 runs
 
Summary
  ../ruff/target/release/ruff format . -s ran
    2.65 ± 0.08 times faster than ../ruff/ruff-main format ../airflow -s

No cache

hyperfine  --warmup 10  \
                    "../ruff/target/release/ruff format . -s --no-cache" \
                    "../ruff/ruff-main format ../airflow -s"
Benchmark 1: ../ruff/target/release/ruff format . -s --no-cache
  Time (mean ± σ):     321.4 ms ±   5.1 ms    [User: 1839.7 ms, System: 165.1 ms]
  Range (min … max):   317.7 ms … 332.9 ms    10 runs
 
Benchmark 2: ../ruff/ruff-main format ../airflow -s
  Time (mean ± σ):     320.9 ms ±  10.0 ms    [User: 1807.1 ms, System: 166.3 ms]
  Range (min … max):   310.9 ms … 343.8 ms    10 runs
 
Summary
  ../ruff/ruff-main format ../airflow -s ran
    1.00 ± 0.04 times faster than ../ruff/target/release/ruff format . -s --no-cache

Cold cache (slower)

hyperfine --prepare "rm -rf .ruff_cache" --warmup 10 \
                          "../ruff/target/release/ruff format . -s" \
                          "../ruff/ruff-main format ../airflow -s"
Benchmark 1: ../ruff/target/release/ruff format . -s
  Time (mean ± σ):     342.4 ms ±  13.1 ms    [User: 1842.5 ms, System: 234.0 ms]
  Range (min … max):   328.8 ms … 367.2 ms    10 runs
 
Benchmark 2: ../ruff/ruff-main format ../airflow -s
  Time (mean ± σ):     319.9 ms ±  16.0 ms    [User: 1816.3 ms, System: 166.7 ms]
  Range (min … max):   307.5 ms … 363.1 ms    10 runs
 
Summary
  ../ruff/ruff-main format ../airflow -s ran
    1.07 ± 0.07 times faster than ../ruff/target/release/ruff format . -s

Cold cache with changes (slower)

hyperfine --prepare "rm -rf .ruff_cache && git reset --hard -q" --warmup 10 \
                          "../ruff/target/release/ruff format . -s" \
                          "../ruff/ruff-main format ../airflow -s"
Benchmark 1: ../ruff/target/release/ruff format . -s
  Time (mean ± σ):     389.0 ms ±  12.0 ms    [User: 1763.3 ms, System: 354.7 ms]
  Range (min … max):   376.6 ms … 416.2 ms    10 runs
 
Benchmark 2: ../ruff/ruff-main format ../airflow -s
  Time (mean ± σ):     363.5 ms ±   4.7 ms    [User: 1732.4 ms, System: 297.8 ms]
  Range (min … max):   358.1 ms … 370.9 ms    10 runs
 
Summary
  ../ruff/ruff-main format ../airflow -s ran
    1.07 ± 0.04 times faster than ../ruff/target/release/ruff format . -s

Hot cache with changes (e.g. when switching between branches)

hyperfine --prepare "git reset --hard -q" --warmup 10 \
                          "../ruff/target/release/ruff format . -s" \
                          "../ruff/ruff-main format ../airflow -s"
Benchmark 1: ../ruff/target/release/ruff format . -s
  Time (mean ± σ):     220.1 ms ±   8.1 ms    [User: 653.7 ms, System: 323.0 ms]
  Range (min … max):   212.3 ms … 234.1 ms    10 runs
 
Benchmark 2: ../ruff/ruff-main format ../airflow -s
  Time (mean ± σ):     365.4 ms ±   6.2 ms    [User: 1722.6 ms, System: 294.4 ms]
  Range (min … max):   359.1 ms … 375.0 ms    10 runs
 
Summary
  ../ruff/target/release/ruff format . -s ran
    1.66 ± 0.07 times faster than ../ruff/ruff-main format ../airflow -s

Linting with caching

hyperfine --warmup 10 --setup "rm -rf ./.ruff_cache" \
                          "../ruff/target/release/ruff check -s -e ." \
                          "../ruff/ruff-main check -s -e ."
Benchmark 1: ../ruff/target/release/ruff check -s -e .
  Time (mean ± σ):      95.7 ms ±   1.4 ms    [User: 69.3 ms, System: 87.6 ms]
  Range (min … max):    92.1 ms … 101.0 ms    30 runs
 
Benchmark 2: ../ruff/ruff-main check -s -e .
  Time (mean ± σ):      95.8 ms ±   2.3 ms    [User: 68.9 ms, System: 87.5 ms]
  Range (min … max):    92.6 ms … 107.0 ms    30 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  ../ruff/target/release/ruff check -s -e . ran
    1.00 ± 0.03 times faster than ../ruff/ruff-main check -s -e .

@MichaReiser
Copy link
Member Author

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@@ -169,7 +203,7 @@ impl Cache {
///
/// This returns `None` if `key` differs from the cached key or if the
/// cache doesn't contain results for the file.
pub(crate) fn get<T: CacheKey>(&self, path: &RelativePath, key: &T) -> Option<&FileCache> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think supporting different cache keys for a single cache instance is a good idea. It can lead to key collisions and is prone to mixing different keys (which has the unexpected result that a stored value can not be retrieved).

E.g. I accidentally passed the hashed u64 instead of the FileCacheKey. Rust was happy because u64 implements CacheKey.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this should be generic on the instance, right? (Looks like it was made non-generic which is also fine with me.)

@MichaReiser MichaReiser added cli Related to the command-line interface formatter Related to the formatter labels Oct 20, 2023
Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good to me. Is it ready for review / approval?

@MichaReiser
Copy link
Member Author

This generally looks good to me. Is it ready for review / approval?

Not yet

@MichaReiser MichaReiser force-pushed the formatter-cache branch 5 times, most recently from 718c016 to cce999e Compare October 23, 2023 02:13
@MichaReiser MichaReiser marked this pull request as ready for review October 23, 2023 02:13
@github-actions
Copy link
Contributor

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Comment on lines +372 to +374
/// Path to the cache directory.
#[arg(long, env = "RUFF_CACHE_DIR", help_heading = "Miscellaneous")]
pub cache_dir: Option<PathBuf>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would need to have different environment variables (RUFF_CHECK_CACHE_DIR, RUFF_FORMAT_CACHE_DIR) since one could potentially provide different path to the check / format subcommand.

Or, could make this a global flag?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would wait until we have a specific request to store the caches in different locations. I also don't know how to support different cache locations when check lints and formats.

The alternative would be to have two different caches (the implementation I had before this morning) where format and lint store their results in different files. Different jobs could then cache the specific sub-directories only. However, the overall design of the Cache API felt more fragile and would probably require a more drastic refactor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary IMO.

Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

@@ -169,7 +203,7 @@ impl Cache {
///
/// This returns `None` if `key` differs from the cached key or if the
/// cache doesn't contain results for the file.
pub(crate) fn get<T: CacheKey>(&self, path: &RelativePath, key: &T) -> Option<&FileCache> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this should be generic on the instance, right? (Looks like it was made non-generic which is also fine with me.)

let start = Instant::now();
let (mut results, mut errors): (Vec<_>, Vec<_>) = paths
.into_par_iter()
.par_iter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change, then cloning the path below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package_roots borrows paths until the end of the function. That means we can no longer use into_iter. I don't expect this to be meaningful, considering that we only hit the clone part when an error occurred.

)
}) {
Ok(inner) => inner.map(|result| FormatPathResult {
path: resolved_file.into_path(),
path: resolved_file.path().to_path_buf(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we keep the par_iter, perhaps path.to_path_buf()?

@MichaReiser MichaReiser merged commit 6fc35dd into main Oct 23, 2023
16 checks passed
@MichaReiser MichaReiser deleted the formatter-cache branch October 23, 2023 08:43
@konstin
Copy link
Member

konstin commented Oct 23, 2023

Did you profile what the time is spent on with a hot cache? 100ms sounds a lot to me for just checking modification timestamps.

@MichaReiser
Copy link
Member Author

MichaReiser commented Oct 23, 2023

Did you profile what the time is spent on with a hot cache? 100ms sounds a lot to me for just checking modification timestamps.

I did not. I assume it's the python file discovery as well as detecting the python packages, loading of the cache etc. Linting without changes takes 96ms. So it's about the same.

I agree, that we could probably improve our caching overall.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cli Related to the command-line interface formatter Related to the formatter
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants