Skip to content

Commit

Permalink
docs: update multithreading and caching sections of PERFORMANCE.md
Browse files Browse the repository at this point in the history
[skip ci]
  • Loading branch information
jqnatividad committed Aug 18, 2024
1 parent 77adae8 commit 5e6bc45
Showing 1 changed file with 4 additions and 3 deletions.
7 changes: 4 additions & 3 deletions docs/PERFORMANCE.md
Original file line number Diff line number Diff line change
Expand Up @@ -117,7 +117,7 @@ The same is true with the write buffer (default: 256k) with the `QSV_WTR_BUFFER_

## Multithreading

Several commands support multithreading - `stats`, `frequency`, `schema`, `split` and `tojsonl` (when an index is available); `apply`, `applydp`, `dedup`, `diff`, `excel`, `extsort`, `joinp`, `snappy`, `sort`, `sqlp`, `to` and `validate` (no index required).
Several commands support multithreading - `count`, `stats`, `frequency`, `sample`, `schema`, `split` and `tojsonl` (when an index is available); `apply`, `applydp`, `dedup`, `diff`, `excel`, `extsort`, `joinp`, `snappy`, `sort`, `sqlp`, `to` and `validate` (no index required).

qsv will automatically spawn parallel jobs equal to the detected number of logical processors. Should you want to manually override this, use the `--jobs` command-line option or the `QSV_MAX_JOBS` environment variable.

Expand Down Expand Up @@ -156,10 +156,11 @@ The qsv binary was built to target the aarch64-apple-darwin platform (Apple Sili
qsv employs several caching strategies to improve performance:

* qsv has large read and write buffers to minimize disk I/O. The default read buffer size is 128k and the default write buffer size is 512k. These can be fine-tuned with the `QSV_RDR_BUFFER_CAPACITY` and `QSV_WTR_BUFFER_CAPACITY` environment variables.
* The `stats` command caches its results in both CSV and binary formats. It does this to avoid re-computing the same statistics when the same input file/parameters are used, but also, as statistics are used in several other commands (currently - `schema` and `tojsonl`, with [more commands using cached statistics in the future](https://github.com/jqnatividad/qsv/issues/898)).
The stats cache are automatically refreshed when the input file is modified the next time the `stats` command is run or when cache-aware commands attempt to use them. The stats cache is stored in the same directory as the input file. The stats cache files are named with the same file stem as the input file with the `stats.csv`, `stats.csv.json` and `stats.csv.bin` extensions. The CSV contains the cached stats, the JSON file contains metadata about how the stats were compiled, and the bin file is the binary encoded version of the stats that can be directly loaded into memory by other commands. The binary format is used by the `schema` and `tojsonl` commands and will only be generated when the `--stats-binout` option is set.
* The `stats` command caches its results in both CSV and JSONL formats. It does this to avoid re-computing the same statistics when the same input file/parameters are used, but also, as statistics are used in several other commands (currently - `frequency`, `schema` and `tojsonl`, with [more commands using cached statistics in the future](https://github.com/jqnatividad/qsv/issues/898)).
The stats cache are automatically refreshed when the input file is modified the next time the `stats` command is run or when cache-aware commands attempt to use them. The stats cache is stored in the same directory as the input file. The stats cache files are named with the same file stem as the input file with the `stats.csv`, `stats.csv.json` and `stats.csv.data.jsonl` extensions. The CSV contains the cached stats, the JSON file contains metadata about how the stats were compiled, and the JSONL file is the JSONL version of the stats that can be directly loaded into memory by other commands. The JSONL is used by the `frequency`, `schema` and `tojsonl` commands and will only be generated when the `--stats-jsonl` option is set.
* The `geocode` command [memoizes](https://en.wikipedia.org/wiki/Memoization) otherwise expensive geocoding operations and will report its cache hit rate. `geocode` memoization, however, is not persistent across sessions.
* The `fetch` and `fetchpost` commands also memoizes expensive REST API calls. When the `--redis` option is enabled, it effectively has a persistent cache as the default time-to-live (TTL) before a Redis cache entry is expired is 28 days and Redis entries are persisted across restarts. Redis cache settings can be fine-tuned with the `QSV_REDIS_CONNSTR`, `QSV_REDIS_TTL_SECONDS`, `QSV_REDIS_TTL_REFRESH` and `QSV_FP_REDIS_CONNSTR` environment variables.
* Alternatively, the `fetch` and `fetchpost` commands can use a local disk cache with the `--disk-cache` option. The default cache directory is `~/.qsv-cache/fetch`. The cache directory can be changed with the `QSV_CACHE_DIR` environment variable or the `--disk-cache-dir` command-line option. A disk cache entry is expired after 28 days by default, but this can be changed with the `QSV_DISK_CACHE_TTL_SECONDS` and `QSV_DISKCACHE_TTL_REFRESH` environment variables.
* The `luau` command caches lookup tables on disk using the QSV_CACHE_DIR environment variable and the `--cache-dir` command-line option. The default cache directory is `~/.qsv-cache`. The QSV_CACHE_DIR environment variable overrides the `--cache-dir` command-line option.

## SIMD-accelerated UTF-8 Validation for Performance
Expand Down

0 comments on commit 5e6bc45

Please sign in to comment.