feat: table data cache for object storage #9772

dantengsky · 2023-01-29T14:26:27Z

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

table raw data cache

caches the raw (compressed) column data of data blocks. currently, only disk-based cache storage is supported.

by default, it is disabled, to enable it:
- set table_data_cache_enabled to true in the query config file (or corresponding env var, command line arg)
- adjust table_disk_cache_max_size and table_disk_cache_root
metrics: cache_table_data_access_count, cache_table_data_hit_count, cache_table_data_miss_count

note that even if table_data_cache_enabled is set to true, disk cache will NOT take effect if storage type is set to fs, since caching block data of local fs in the local disk is ... usually not what we want.

cache will NOT be populated during data ingestion.

table data in-memory cache (experimental feature)

which caches deserialized column objects of a data block.

by default, it is disabled, to enable it:
- set table_data_cache_in_memory_max_size to some non-zero value
please use it with caution: the deserialized column objects may take a lot of memory. enable it only if the query nodes have plenty of memory, the working set fits into it, and the data access pattern will benefit from caching.

non-backward compatible config change:

several configuration entries are obsoleted.

when databend-query starts up, if any obsoleted configuration entry is used (command-line option, environment variable, or toml config file), the related migration suggestions are printed and the process exits, like this:

--------------------------------------------------------------
 *** table-disk-cache-mb-size *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : cache-disk-max-bytes
   alternative environment variable : CACHE_DISK_MAX_BYTES
            alternative toml config : 
                    [cache]
                    ...
                    data_cache_storage = "disk"
                    ...
                    [cache.disk]
                    max_bytes = [MAX_BYTES]  
                    ...
                  
 --------------------------------------------------------------


 --------------------------------------------------------------
 *** table-meta-cache-enabled *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : cache-enable-table-meta-caches
   alternative environment variable : CACHE_ENABLE_TABLE_META_CACHE
            alternative toml config : 
                    [cache]
                    table-meta-cache-enabled=[true|false]
                  
 --------------------------------------------------------------


 --------------------------------------------------------------
 *** table-cache-block-meta-count *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : N/A
   alternative environment variable : N/A
            alternative toml config : N/A
 --------------------------------------------------------------


 --------------------------------------------------------------
 *** table-memory-cache-mb-size *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : N/A
   alternative environment variable : N/A
            alternative toml config : N/A
 --------------------------------------------------------------


 --------------------------------------------------------------
 *** table-disk-cache-root *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : cache-disk-path
   alternative environment variable : CACHE_DISK_PATH
            alternative toml config : 
                    [cache]
                    ...
                    data_cache_storage = "disk"
                    ...
                    [cache.disk]
                    max_bytes = [MAX_BYTES]  
                    path = [PATH]
                    ...
                    
 --------------------------------------------------------------


 --------------------------------------------------------------
 *** table-cache-snapshot-count *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : cache-table-meta-snapshot-count
   alternative environment variable : CACHE_TABLE_META_SNAPSHOT_COUNT
            alternative toml config : 
                    [cache]
                    table_meta_snapshot_count = [COUNT]
                    
 --------------------------------------------------------------


 --------------------------------------------------------------
 *** table-cache-statistic-count *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : cache-table-meta-statistic-count
   alternative environment variable : CACHE_TABLE_META_STATISTIC_COUNT
            alternative toml config : 
                    [cache]
                    table_meta_statistic_count = [COUNT]
                    
 --------------------------------------------------------------


 --------------------------------------------------------------
 *** table-cache-segment-count *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : cache-table-meta-segment-count
   alternative environment variable : CACHE_TABLE_META_SEGMENT_COUNT
            alternative toml config : 
                    [cache]
                    table_meta_segment_count = [COUNT]
                    
 --------------------------------------------------------------


 --------------------------------------------------------------
 *** table-cache-bloom-index-meta-count *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : cache-table-bloom-index-meta-count
   alternative environment variable : CACHE_TABLE_BLOOM_INDEX_META_COUNT
            alternative toml config : 
                    [cache]
                    table_bloom_index_meta_count = [COUNT]
                    
 --------------------------------------------------------------


 --------------------------------------------------------------
 *** table-cache-bloom-index-filter-count *** is obsoleted : 
 --------------------------------------------------------------
   alternative command-line options : cache-table-bloom-index-filter-count
   alternative environment variable : CACHE_TABLE_BLOOM_INDEX_FILTER_COUNT
            alternative toml config : 
                    [cache]
                    table_bloom_index_filter_count = [COUNT]
                    
 --------------------------------------------------------------
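The startup check described above can be pictured as a lookup from obsoleted option names to their replacements. The following is only a minimal sketch, not the PR's code: `Migration`, `migration_for` and `check_obsoleted` are hypothetical names, and the table holds just a few of the entries copied from the output above.

```rust
/// Replacement names for one obsoleted option; `None` stands for "N/A" in the output above.
#[derive(Debug, PartialEq)]
struct Migration {
    cli: Option<&'static str>,
    env: Option<&'static str>,
}

/// Hypothetical lookup table, filled with a few entries copied from the output above.
fn migration_for(obsoleted: &str) -> Option<Migration> {
    let (cli, env) = match obsoleted {
        "table-disk-cache-mb-size" => (Some("cache-disk-max-bytes"), Some("CACHE_DISK_MAX_BYTES")),
        "table-disk-cache-root" => (Some("cache-disk-path"), Some("CACHE_DISK_PATH")),
        "table-cache-snapshot-count" => (
            Some("cache-table-meta-snapshot-count"),
            Some("CACHE_TABLE_META_SNAPSHOT_COUNT"),
        ),
        "table-memory-cache-mb-size" => (None, None),
        _ => return None, // not an obsoleted option
    };
    Some(Migration { cli, env })
}

/// Collects one suggestion per obsoleted option found in `args`.
/// A non-empty result means startup should print the suggestions and exit.
fn check_obsoleted(args: &[&str]) -> Vec<String> {
    args.iter()
        .filter_map(|arg| {
            let name = arg.trim_start_matches("--");
            // Accept both `--name value` and `--name=value`.
            let name = name.split('=').next().unwrap_or(name);
            migration_for(name).map(|m| {
                format!(
                    "*** {} *** is obsoleted : use {} (env {})",
                    name,
                    m.cli.unwrap_or("N/A"),
                    m.env.unwrap_or("N/A"),
                )
            })
        })
        .collect()
}
```

Calling `check_obsoleted` with the process arguments and exiting when the result is non-empty would reproduce the "show suggestions, then quit" behaviour; environment variables and toml keys would need the same treatment.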

some implementation details

bring back @PsiACE's DiskCache mod
- cache items are identified by the siphash (2-4, 128 bit) of cache key
- cache files are prefixed with the path of the first 3 common chars, e.g.
- crc32 checksum placed at the end of the file
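To make the file layout and the checksum trailer concrete, here is a rough sketch using only the standard library. It is an illustration under assumptions, not the DiskCache code: the helper names are hypothetical, `hex_key` stands for the hex-encoded 128-bit hash of the cache key (the hashing itself is not shown), the prefix is assumed to be a directory named after the first 3 characters of that key, and the little-endian byte order of the trailer is a guess.

```rust
use std::path::{Path, PathBuf};

/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the lowest bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Assumed layout: `<root>/<first 3 chars of key>/<key>`; `hex_key` must be ASCII hex.
fn cache_file_path(root: &Path, hex_key: &str) -> PathBuf {
    let prefix_len = hex_key.len().min(3);
    root.join(&hex_key[..prefix_len]).join(hex_key)
}

/// Appends the checksum of `payload` as a 4-byte trailer (little-endian is an assumption).
fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Returns the payload if the trailing checksum matches, `None` for short or corrupted files.
fn unseal(file: &[u8]) -> Option<&[u8]> {
    if file.len() < 4 {
        return None;
    }
    let (payload, trailer) = file.split_at(file.len() - 4);
    let stored = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if stored == crc32(payload) {
        Some(payload)
    } else {
        None
    }
}
```

Placing the checksum at the end means the payload can be streamed to disk first and the trailer appended last; a reader that finds a mismatch can treat the file as a cache miss.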
TableDataCache
consists of a LruDiskCache and a cache population worker.
- while serving get operations, LruDiskCache is used directly.
- while serving put operations, LruDiskCache is checked first (without accessing the disk). on a cache miss, the item is put into a bounded queue, or dropped if the queue is full. in a dedicated thread, the cache population worker takes items from the bounded queue, persists them to disk, and populates the cache.
setting table_data_cache_population_queue_size controls the max capacity of the bounded queue.

metrics:
- cache_table_data_population_pending_count shows the number of items pending in the bounded queue.
- cache_table_data_population_overflow_count shows the number of items that have been dropped.
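The put path and the population worker described above can be sketched with a bounded channel from the standard library. This is a simplified model under assumptions, not the TableDataCache implementation: `DataCache`, `Store` and `Item` are hypothetical names, an in-memory map stands in for LruDiskCache, and the two counters only mirror the idea behind the pending and overflow metrics.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::{Arc, Mutex};
use std::thread;

type Item = (String, Vec<u8>);
/// Stand-in for the on-disk LRU cache; a real one would write files and evict.
type Store = Arc<Mutex<HashMap<String, Vec<u8>>>>;

struct DataCache {
    store: Store,
    queue: SyncSender<Item>,
    pending: Arc<AtomicUsize>,  // items waiting in the bounded queue
    overflow: Arc<AtomicUsize>, // items dropped because the queue was full
}

impl DataCache {
    /// Returns the cache plus the receiving end that the worker will drain.
    fn new(queue_size: usize) -> (DataCache, Receiver<Item>) {
        let (tx, rx) = sync_channel(queue_size);
        let cache = DataCache {
            store: Arc::new(Mutex::new(HashMap::new())),
            queue: tx,
            pending: Arc::new(AtomicUsize::new(0)),
            overflow: Arc::new(AtomicUsize::new(0)),
        };
        (cache, rx)
    }

    /// Reads go straight to the cache.
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.store.lock().unwrap().get(key).cloned()
    }

    /// Never blocks the caller: enqueue on a miss, drop when the queue is full.
    /// Returns true if the item was queued.
    fn put(&self, key: String, value: Vec<u8>) -> bool {
        if self.store.lock().unwrap().contains_key(&key) {
            return false; // already cached
        }
        // Count the item before sending so the worker's decrement cannot underflow.
        self.pending.fetch_add(1, Ordering::Relaxed);
        match self.queue.try_send((key, value)) {
            Ok(()) => true,
            Err(err) => {
                self.pending.fetch_sub(1, Ordering::Relaxed);
                if let TrySendError::Full(_) = err {
                    self.overflow.fetch_add(1, Ordering::Relaxed);
                }
                false
            }
        }
    }

    /// Spawns the dedicated worker; it exits once every sender has been dropped.
    fn spawn_worker(&self, rx: Receiver<Item>) -> thread::JoinHandle<()> {
        let store = Arc::clone(&self.store);
        let pending = Arc::clone(&self.pending);
        thread::spawn(move || {
            for (key, value) in rx {
                store.lock().unwrap().insert(key, value); // "persist" the item
                pending.fetch_sub(1, Ordering::Relaxed);
            }
        })
    }
}
```

Dropping items when the queue is full trades cache coverage for latency: a query thread never waits on disk writes, and the overflow counter shows how often that trade was made.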
ColumnArrayCache
An LRU in-memory object cache, which caches Box<dyn Array>. ideally, caching BlockEntry is preferred, but that needs some further tweaks of the DataBlock structure (not only taking owned BlockEntrys but also shared ownership of BlockEntrys).
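As an illustration of a byte-bounded LRU object cache, here is a small sketch. It is not the ColumnArrayCache code: `ByteLru` is a hypothetical name, the value type is generic instead of a column array, values are wrapped in `Arc` so a hit hands out a cheap shared handle, and the caller supplies each entry's weight (for example the uncompressed byte size of the column).

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// LRU cache bounded by the sum of caller-supplied weights (bytes), not by item count.
struct ByteLru<V> {
    capacity_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, (Arc<V>, usize)>,
    order: VecDeque<String>, // front = least recently used
}

impl<V> ByteLru<V> {
    fn new(capacity_bytes: usize) -> Self {
        ByteLru {
            capacity_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Moves `key` to the most-recently-used position (linear scan, fine for a sketch).
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
    }

    /// Returns a shared handle to the cached object and marks it as recently used.
    fn get(&mut self, key: &str) -> Option<Arc<V>> {
        let hit = self.entries.get(key).map(|entry| Arc::clone(&entry.0));
        if hit.is_some() {
            self.touch(key);
        }
        hit
    }

    /// Inserts `value`, evicting least-recently-used entries until it fits.
    /// Objects larger than the whole cache are not cached at all.
    fn put(&mut self, key: String, value: Arc<V>, weight: usize) {
        if weight > self.capacity_bytes {
            return;
        }
        if let Some((_, old_weight)) = self.entries.remove(&key) {
            self.used_bytes -= old_weight;
            self.order.retain(|k| k != &key);
        }
        while self.used_bytes + weight > self.capacity_bytes {
            match self.order.pop_front() {
                Some(victim) => {
                    if let Some((_, w)) = self.entries.remove(&victim) {
                        self.used_bytes -= w;
                    }
                }
                None => break,
            }
        }
        self.used_bytes += weight;
        self.order.push_back(key.clone());
        self.entries.insert(key, (value, weight));
    }
}
```

Because the bound applies to the declared weights, the real memory footprint depends on how closely those weights track the heap size of the cached objects.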
BlockReader::merge_io_read
integrated with TableDataCache and ColumnArrayCache

Performance evaluation

ClickBench

1. table data cache enabled vs main branch default setting

Databend (pr, in_memory_data_cache on)
this pr, in-memory data cache is enabled, disk-based table data cache disabled

metrics:

cache_table_data_column_array_hit_count 134241
cache_table_data_column_array_access_count 389649
cache_table_data_column_array_miss_count 255408

hit rate: 134241 / 389649 ≈ 34.45%

memory:
after run.sh ended, top shows that the process RES is 18.9g

note that although table_data_cache_in_memory_max_size = 5368709120 (5G) is used in this scenario, cached objects are currently measured by the uncompressed byte size of the column.
Databend (pr, disk_data_cache on)
this pr, disk-based table data cache enabled (table_disk_cache_max_size set to 20Gb, population queue size set to 65535), in-memory data cache is disabled.

metrics:

cache_table_data_hit_count 310269
cache_table_data_access_count 389649
cache_table_data_miss_count 79380

hit rate: 310269 / 389649 ≈ 79.63%
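The hit rates in both scenarios follow directly from the three counters. A tiny sketch of the arithmetic (the function names are hypothetical helpers, not part of the PR):

```rust
/// Hit rate as a percentage; `None` when nothing has been accessed yet.
fn hit_rate_percent(hits: u64, accesses: u64) -> Option<f64> {
    if accesses == 0 {
        None
    } else {
        Some(hits as f64 * 100.0 / accesses as f64)
    }
}

/// The three counters are expected to satisfy hits + misses == accesses.
fn counters_consistent(hits: u64, misses: u64, accesses: u64) -> bool {
    hits + misses == accesses
}
```

With the numbers reported above, `hit_rate_percent(134241, 389649)` gives roughly 34.5 for the in-memory cache and `hit_rate_percent(310269, 389649)` roughly 79.6 for the disk cache; in both runs hits plus misses add up to the access count.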

note that this ec2 machine's memory is large enough to buffer all the cache files. although run.sh drops the os caches before the first run, the subsequent 2 runs are likely to read from the os pagecache without hitting the disk (with the exception that the execution of some queries involves non-deterministic partition accesses).
Databend (main, 1st round)
main branch(commit 9078cfd), default config

2. disable table cache vs main branch

Databend (main, nth round)
main branch, default config, 3 rounds
Databend (pr, no data cache, nth round)
both disk and in-memory caches are disabled (default setting of this pr), 3 rounds

this shows that, with the table data cache disabled, the performance is on par with the main branch

the raw data:
pr.html.gz

misc

to be improved

for raw data cache, the size of pending items should be measured by the bytes of pending data, not the number of them.
for object in-memory cache, BlockEntry is preferred, and the objects should be measured by the "heap size" of the object, not the uncompressed byte size

cluster mode

eval the cache performance in cluster mode, and verify that, with PartitionsShuffleKind::Mod, the distribution of table data cache is nearly balanced among the nodes of the cluster.
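One way to check the balance offline is to count how many partitions a modulo rule sends to each node. The sketch below assumes that the shuffle picks the node as hash(partition) % number of nodes; the hash function, the helper names and the use of the partition location as the hash input are illustrative assumptions, not taken from the PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assumed rule: node index = hash(partition location) % number of nodes.
/// Panics if `nodes` is zero.
fn node_for(partition: &str, nodes: usize) -> usize {
    assert!(nodes > 0, "a cluster needs at least one node");
    let mut hasher = DefaultHasher::new();
    partition.hash(&mut hasher);
    (hasher.finish() % nodes as u64) as usize
}

/// Counts how many partitions land on each node.
fn distribution(partitions: &[String], nodes: usize) -> Vec<usize> {
    let mut counts = vec![0; nodes];
    for p in partitions {
        counts[node_for(p, nodes)] += 1;
    }
    counts
}
```

Under such a rule a given partition always maps to the same node while the node count is unchanged, so the blocks cached on that node can be reused; comparing the entries of `distribution` for the partitions of a real table would show how even the spread is.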

disk cache (based on @PsiACE's previous work)
crc checksum, sip2-4 128 hash as the cache key, two-level cache file dir layout, etc
integrate disk cache with merge io reader
by default, only enable for async reading (since sync reading almost indicates we are working on local fs storage)
threshold of disk cache population
a bounded queue of pending cache items, with a dedicated thread that writes cache to disk
cache of dyn Array object
ideally, BlockEntry should be cached, but it seems that DataBlock or BlockEntry needs further tweaks.
prefix data cache dir with tenant id
refactor cache configuration
performance evaluation
some initial perf has been done, to be detailed

Closes #issue


src/common/cache/src/todo

scripts/ci/deploy/config/databend-query-node-2.toml

src/query/config/src/setting.rs

scripts/ci/deploy/config/databend-query-node-1.toml

src/query/config/src/config.rs

to avoid naming collision of `Setting` (with `common-settings::Settings`)

dantengsky · 2023-02-15T15:05:13Z

@lichuang need your help :) please take a look at the following methods, I am not sure if they are implemented correctly (or concisely)

https://github.com/datafuselabs/databend/blob/6c1d40469bd120dc6528b50ac1e587f54e5216e2/src/query/catalog/src/plan/projection.rs#L77-L108

https://github.com/datafuselabs/databend/blob/6c1d40469bd120dc6528b50ac1e587f54e5216e2/src/common/storage/src/column_node.rs#L104-L124

github-actions · 2023-02-15T16:22:38Z

Benchmark Results: https://repo.databend.rs/benchmark/clickbench/pr/9772/4185651992.html

src/query/config/src/config.rs

…om_index_caches -> enable_table_bloom_index_cache

rename _caches -> _cache

BohuTANG

Great, cache coming.

github-actions · 2023-02-16T02:59:33Z

Benchmark Results: https://repo.databend.rs/benchmark/clickbench/pr/9772/4190074849.html

BohuTANG reviewed Jan 30, 2023

src/common/cache/src/todo (outdated)

dantengsky added 7 commits February 1, 2023 19:25

bring back @PsiACE's disk cache

71d3030

tailor disk cache

a29c2c1

assembly data cache to async reader

c388108

add CacheItem type parameter

f09faa0

fix cache key to string, use 128 bit hash

32e4251

shrink lock scope of reading cached data

291fe74

fix: cache init

9ac4c3d

dantengsky force-pushed the feat-block-cache branch from 6eb948b to 4ab5881 Compare February 1, 2023 13:17

wip

118c8fa

dantengsky force-pushed the feat-block-cache branch from b89d146 to 118c8fa Compare February 1, 2023 13:19

dantengsky added 2 commits February 3, 2023 13:05

fix: evcition

b10afa7

Merge remote-tracking branch 'origin/main' into feat-block-cache

60517b1

vercel bot deployed to Preview February 3, 2023 05:30

BohuTANG mentioned this pull request Feb 3, 2023

Roadmap 2023 #9448

Open

9 tasks

add unit tests

c265c4d

BohuTANG mentioned this pull request Feb 4, 2023

Release proposal: Nightly v1.0 #9604

Closed

5 tasks

dantengsky added 7 commits February 6, 2023 13:22

add crc checksum

254f5a7

tiered disk cache

4b1ca8b

tuning and metrics

c05d3c7

add data cache related config

47cfc1d

fix ut

9f9bce8

fix ut it

e28564c

refactor: generic external cache type

bdca8a4

dantengsky changed the title ~~WIP: Feat block cache~~ Feat: block cache Feb 7, 2023

dantengsky changed the title ~~Feat: block cache~~ feat: table data cache Feb 7, 2023

mergify bot added the pr-feature label (this PR introduces a new feature to the codebase) Feb 7, 2023

dantengsky added 2 commits February 7, 2023 13:16

Merge remote-tracking branch 'origin/main' into feat-block-cache

945024c

remove debug env var

248b663