
Allow checkpoints to run while other connections are reading, and no longer block new connections while checkpointing #11918

Merged 77 commits into duckdb:main on May 3, 2024

Conversation

@Mytherin (Collaborator) commented on May 3, 2024

Partially fixes #9150

This PR reworks the locking around checkpointing. Previously, checkpointing required that the checkpointing thread was the only active thread. As a result, automatic checkpoints were blocked when other connections were querying the database, and a manual CHECKPOINT statement would throw an exception. While a checkpoint was running, new connections could not start transactions and would block until the checkpoint was completed. This introduced a number of issues:

  • As automatic checkpoints could not be triggered when other threads were querying the database, the optimistic writing optimization could not take place, and data would have to be written to the WAL file instead. This is significantly less efficient.
  • If automatic checkpoints were never run (because there was always concurrent activity), the WAL file would keep on growing uncontrollably as it could never be flushed.
  • While checkpointing or writing to the WAL, the transaction_lock was held, meaning new connections could not connect/start transactions until those operations were finished. As a result, writers could introduce significant latency for readers.

After this PR, checkpointing uses more granular locking. Automatic checkpoints can now run under the following conditions (sketched in code after this list):

  • There are no active write transactions
  • There are no active read transactions that depend on previously committed updates or catalog changes
  • If the transaction has performed updates (using the UPDATE statement) or dropped catalog entries (using the DROP statement), automatic checkpointing is only possible if there are no active read transactions
  • If the transaction has performed deletes (using the DELETE statement), vacuuming can only be performed if there are no active read transactions
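
In rough terms, this decision could look like the following sketch; the names and the bookkeeping of what the committing transaction has touched are illustrative assumptions, not DuckDB's actual code:

#include <cstddef>

// Illustrative sketch, not DuckDB's actual implementation.
struct CommittingTransactionInfo {
    bool performed_updates_or_drops; // UPDATE statements or dropped catalog entries
    bool performed_deletes;          // DELETE statements
};

bool CanAutoCheckpoint(size_t active_write_transactions, size_t active_read_transactions,
                       const CommittingTransactionInfo &info) {
    if (active_write_transactions > 0) {
        return false; // there may be no active write transactions
    }
    if (info.performed_updates_or_drops && active_read_transactions > 0) {
        return false; // updates/drops require that no read transactions are active
    }
    // note: readers that depend on previously committed updates or catalog
    // changes must also be absent; that check is omitted here for brevity
    return true;
}

bool CanVacuumDeletes(size_t active_read_transactions, const CommittingTransactionInfo &info) {
    // deletes can only be vacuumed when no read transactions are active
    return !info.performed_deletes || active_read_transactions == 0;
}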

Checkpoint Lock

After these changes, we can either have:

  • n active writing transactions
  • 1 checkpointing transaction

This is enforced using a shared/exclusive lock in the DuckTransactionManager called the checkpoint_lock. Writing threads grab a shared checkpoint lock, and checkpoints grab an exclusive checkpoint lock. This locking also means that, while a checkpoint is running, new writers are blocked until the checkpoint is finished.

Transactions are started as read transactions, and are upgraded to write transactions when they attempt to modify the database (through e.g. inserting data, updating, deleting, or performing an ALTER statement). There is a new callback added to transactions that is called when this happens:

virtual void Transaction::SetReadWrite();

For the DuckTransaction, this callback grabs a shared checkpoint lock.
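
A minimal sketch of this shared/exclusive scheme, using std::shared_mutex as a stand-in; the class and method names here are illustrative assumptions, not DuckDB's actual API:

#include <shared_mutex>

// Illustrative sketch of the checkpoint_lock in DuckTransactionManager.
class TransactionManagerSketch {
public:
    // Invoked when a read transaction performs its first write (the
    // SetReadWrite callback): any number of writers may hold the
    // checkpoint lock in shared mode at the same time.
    std::shared_lock<std::shared_mutex> LockForWrite() {
        return std::shared_lock<std::shared_mutex>(checkpoint_lock);
    }

    // Invoked by the checkpointer: taking the lock exclusively waits for
    // all active write transactions to finish, and blocks new writers
    // until the checkpoint completes.
    std::unique_lock<std::shared_mutex> LockForCheckpoint() {
        return std::unique_lock<std::shared_mutex>(checkpoint_lock);
    }

private:
    std::shared_mutex checkpoint_lock;
};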

Table Locks

Read transactions are not blocked by checkpointing - they can be started, committed, and can read data from the system while a checkpoint is running. However, since checkpoints restructure table data, it is not safe to read table data while a checkpoint is running. To prevent this from causing problems, this PR introduces table-level locks. Similar to the checkpoint locks, these locks are shared/exclusive locks. Threads that read from a table grab a shared lock, while the checkpoint grabs exclusive locks over tables while it is checkpointing.

This means that read transactions can be blocked by the checkpoint process when reading table data. In the future, we plan to make these locks more granular than table-level.
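
The same shared/exclusive pattern applied per table could look roughly as follows; this is an assumed sketch, not DuckDB's actual table classes:

#include <shared_mutex>
#include <vector>

// Illustrative sketch of table-level locks.
struct TableSketch {
    std::shared_mutex table_lock;
    // ... row groups, column data, ...
};

// Readers hold the table's lock in shared mode while scanning it.
void ScanTable(TableSketch &table) {
    std::shared_lock<std::shared_mutex> guard(table.table_lock);
    // ... read table data ...
}

// The checkpointer locks one table at a time in exclusive mode and
// releases the lock before moving on to the next table, so it never
// holds exclusive locks on several tables at once.
void CheckpointTables(std::vector<TableSketch *> &tables) {
    for (auto *table : tables) {
        std::unique_lock<std::shared_mutex> guard(table->table_lock);
        // ... rewrite this table's blocks ...
    }
}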

Partial Blocks

When doing a checkpoint, we gather small writes and co-locate them on the same block. This is particularly useful when writing small tables, as without the partial block manager we would need to dedicate one block per column.

Prior to this PR, we would also co-locate data from multiple tables on the same block, as the partial block manager was flushed only once, at the end of a checkpoint.

This is problematic for concurrent checkpointing, however. The partial block manager gathers (small) writes and writes them out once a block is filled. This means that a write to column A can trigger writes for other columns. With table-level locks this does not work nicely, since it would mean we must hold locks on all tables for the entire duration of the checkpoint. It could even cause deadlocks, as the checkpoint grabs exclusive locks on multiple tables while reading transactions grab shared locks on multiple tables.

To solve this problem, in this PR we modify the partial block manager to operate on the table level only. As a result, partial blocks will never be shared across tables. During checkpoints, we then only need to hold locks on individual tables, and can release locks on tables after we are done processing them.
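
The resulting flushing discipline can be sketched as follows; this is illustrative only, the real PartialBlockManager has a different interface and the block size is an assumption:

#include <cstdint>
#include <vector>

// Illustrative sketch of a per-table partial block manager.
class PartialBlockManagerSketch {
public:
    static constexpr uint64_t BLOCK_SIZE = 262144; // assumed block size

    // Buffer a small write (assumes data.size() <= BLOCK_SIZE);
    // when the current block would overflow, write it out first.
    void AppendSegment(const std::vector<uint8_t> &data) {
        if (buffer.size() + data.size() > BLOCK_SIZE) {
            FlushBlock();
        }
        buffer.insert(buffer.end(), data.begin(), data.end());
    }

    // Called after checkpointing a single table: flush whatever remains,
    // so that partial blocks are never shared across tables and the
    // table's lock can be released immediately afterwards.
    void FlushAfterTable() {
        if (!buffer.empty()) {
            FlushBlock();
        }
    }

private:
    void FlushBlock() {
        // ... write the buffered data out as one block in the file ...
        buffer.clear();
    }

    std::vector<uint8_t> buffer;
};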

Sequences

This PR includes a minor rework of sequences. Previously, sequences could be used by read-only transactions, and were handled outside of the regular UndoBuffer infrastructure. When sequences were used in read-only databases, the sequences would increment in value, but that increment would not be stored on disk.

This behavior is reworked in this PR: calling nextval on a sequence is now a write operation on the database the sequence lives in. This change is required because otherwise read-only transactions would need to e.g. append to the write-ahead log, which conflicts with checkpointing. In addition, another restriction is added in this PR: nextval can now only be called with a constant parameter (e.g. nextval('seq')). The following is no longer supported:

create sequence seq;
create table sequence_info(seq varchar);
insert into sequence_info values('seq');
select nextval(seq) from sequence_info;
-- Not implemented Error: currval/nextval requires a constant sequence - non-constant sequences are no longer supported

That is because allowing arbitrary sequences to be referenced at run-time causes a lot of headaches: for example, multiple threads can now mark transactions as write transactions concurrently, and we no longer know up-front when binding whether a query is going to be read-only or not (does it refer to a temporary sequence or a persistent one?).

sqllogictest skipif/onlyif

This PR also extends the skipif/onlyif decorators in the sqllogictest runner to operate on loop identifiers. This allows us to write tests where e.g. one set of threads performs one task while other threads perform another. The syntax is onlyif X=0 or skipif X=0 prior to a statement or query, e.g.:

onlyif i=0
statement ok
SELECT 42;

Below is a full example where we run a concurrentloop and filter on threadid, so that one thread appends data while the other threads read:

statement ok
CREATE TABLE integers(i INTEGER PRIMARY KEY)

statement ok
INSERT INTO integers SELECT * FROM range(10000);

concurrentloop threadid 0 20

loop i 0 20

onlyif threadid=0
statement ok
INSERT INTO integers SELECT * FROM range(10000 + ${i} * 100, 10100 + ${i} * 100);

endloop

loop i 0 100

skipif threadid=0
statement ok
SELECT * FROM integers WHERE i=${i} * (${threadid} * 300);

endloop

endloop

Experiments

Below is an experiment that shows the effect of this PR: one connection appends data to the lineitem table while another connection continuously runs queries (TPC-H Q1 in this instance).

Generate Data

import duckdb

duckdb.sql('CALL dbgen(sf=1)')
duckdb.sql("COPY lineitem TO 'lineitem.parquet'")
Run Queries

import duckdb
from threading import Thread
import os
import time

db_file = 'concurrent_tpch_test.db'
wal_file = db_file + '.wal'

con = duckdb.connect(db_file)
con.sql('CALL dbgen(sf=0)')

def append_to_lineitem(con):
    while True:
        con.sql("INSERT INTO lineitem FROM 'lineitem.parquet'")

def run_queries(con):
    while True:
        con.sql('PRAGMA tpch(1)').execute()

def print_file_size(path):
    # the WAL file may not exist yet when the measurement starts
    if not os.path.exists(path):
        print(f'File {path}: 0MB')
        return
    file_stats = os.stat(path)
    print(f'File {path}: {file_stats.st_size / (1024 * 1024)}MB')

def measure():
    for i in range(1000000):
        print(f'------{i}s--------')
        print_file_size(db_file)
        print_file_size(wal_file)
        time.sleep(1)


write_thread = Thread(target=append_to_lineitem, args=[con.cursor()])
read_thread = Thread(target=run_queries, args=[con.cursor()])
measure_thread = Thread(target=measure, args=[])

write_thread.start()
read_thread.start()
measure_thread.start()

write_thread.join()
read_thread.join()
measure_thread.join()

v0.10.2

Time (s)    Database Size    WAL Size
5s          476MB            2162MB
10s         951MB            4177MB
15s         1426MB           5819MB
20s         1774MB           7928MB
25s         2218MB           9737MB
30s         2573MB           11532MB

New

Time (s)    Database Size    WAL Size
5s          805MB            0MB
10s         1458MB           0MB
15s         2101MB           0MB
20s         2594MB           0MB
25s         3081MB           0MB
30s         3559MB           0MB

We can see that in the new version, because the reading thread no longer blocks optimistic writing and automatic checkpointing, no data is ever written to the WAL. Instead, data keeps being written directly into the database file, and the WAL is never utilized in this scenario.

@Mytherin (Collaborator, Author) commented on May 3, 2024

Nightly test failures are unrelated and should be picked up separately from this PR.

@Mytherin merged commit cb82ce9 into duckdb:main on May 3, 2024 (50 of 63 checks passed).

@suiluj commented on May 3, 2024

@Mytherin thanks a lot for your work! this is great news! :)
