
Allow checkpoints to run while other connections are reading, and no longer block new connections while checkpointing #11918

Merged 77 commits into duckdb:main on May 3, 2024

Conversation

@Mytherin (Collaborator) commented on May 3, 2024

Partially fixes #9150

This PR reworks the locking around checkpointing. Previously, checkpointing required that the checkpointing thread was the only active thread. As a result, automatic checkpoints were blocked when other connections were querying the database, and a manual CHECKPOINT statement would throw an exception. While a checkpoint was running, new connections could not start transactions and would block until the checkpoint was completed. This introduced a number of issues:

  • As automatic checkpoints could not be triggered when other threads were querying the database, the optimistic writing optimization could not take place, and data would have to be written to the WAL file instead. This is significantly less efficient.
  • If automatic checkpoints were never run (because there was always concurrent activity), the WAL file would keep on growing uncontrollably as it could never be flushed.
  • While checkpointing or writing to the WAL, the transaction_lock was held, meaning new connections could not connect/start transactions until those operations were finished. As a result, writers could introduce significant latency for readers.

After this PR, checkpointing uses more granular locking. Automatic checkpoints can now run under the following conditions (sketched in code after this list):

  • There are no active write transactions
  • There are no active read transactions that depend on previously committed updates or catalog changes
  • If the transaction has performed updates (using the UPDATE statement) or dropped catalog entries (using the DROP statement), automatic checkpointing is only possible if there are no active read transactions
  • If the transaction has performed deletes (using the DELETE statement), vacuuming can only be performed if there are no active read transactions
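
In rough terms, this decision could look like the following sketch; the names and the bookkeeping of what the committing transaction has touched are illustrative assumptions, not DuckDB's actual code:

#include <cstddef>

// Illustrative sketch, not DuckDB's actual implementation.
struct CommittingTransactionInfo {
    bool performed_updates_or_drops; // UPDATE statements or dropped catalog entries
    bool performed_deletes;          // DELETE statements
};

bool CanAutoCheckpoint(size_t active_write_transactions, size_t active_read_transactions,
                       const CommittingTransactionInfo &info) {
    if (active_write_transactions > 0) {
        return false; // there may be no active write transactions
    }
    if (info.performed_updates_or_drops && active_read_transactions > 0) {
        return false; // updates/drops require that no read transactions are active
    }
    // note: readers that depend on previously committed updates or catalog
    // changes must also be absent; that check is omitted here for brevity
    return true;
}

bool CanVacuumDeletes(size_t active_read_transactions, const CommittingTransactionInfo &info) {
    // deletes can only be vacuumed when no read transactions are active
    return !info.performed_deletes || active_read_transactions == 0;
}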

Checkpoint Lock

After these changes, we can either have:

  • n active writing transactions
  • 1 checkpointing transaction

This is enforced using a shared/exclusive lock in the DuckTransactionManager called the checkpoint_lock. Writing threads grab a shared checkpoint lock, and checkpoints grab an exclusive checkpoint lock. This locking also means that, while a checkpoint is running, new writers are blocked until the checkpoint is finished.

Transactions are started as read transactions, and are upgraded to write transactions when they attempt to modify the database (through e.g. inserting data, updating, deleting, or performing an ALTER statement). There is a new callback added to transactions that is called when this happens:

virtual void Transaction::SetReadWrite();

For the DuckTransaction, this callback grabs a shared checkpoint lock.
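
A minimal sketch of this shared/exclusive scheme, using std::shared_mutex as a stand-in; the class and method names here are illustrative assumptions, not DuckDB's actual API:

#include <shared_mutex>

// Illustrative sketch of the checkpoint_lock in DuckTransactionManager.
class TransactionManagerSketch {
public:
    // Invoked when a read transaction performs its first write (the
    // SetReadWrite callback): any number of writers may hold the
    // checkpoint lock in shared mode at the same time.
    std::shared_lock<std::shared_mutex> LockForWrite() {
        return std::shared_lock<std::shared_mutex>(checkpoint_lock);
    }

    // Invoked by the checkpointer: taking the lock exclusively waits for
    // all active write transactions to finish, and blocks new writers
    // until the checkpoint completes.
    std::unique_lock<std::shared_mutex> LockForCheckpoint() {
        return std::unique_lock<std::shared_mutex>(checkpoint_lock);
    }

private:
    std::shared_mutex checkpoint_lock;
};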

Table Locks

Read transactions are not blocked by checkpointing - they can be started, committed, and can read data from the system while a checkpoint is running. However, since checkpoints restructure table data, it is not safe to read table data while a checkpoint is running. To prevent this from causing problems, this PR introduces table-level locks. Similar to the checkpoint locks, these locks are shared/exclusive locks. Threads that read from a table grab a shared lock, while the checkpoint grabs exclusive locks over tables while it is checkpointing.

This means that read transactions can be blocked by the checkpoint process when reading table data. In the future, we plan to make these locks more granular than table-level.
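
The same shared/exclusive pattern applied per table could look roughly as follows; this is an assumed sketch, not DuckDB's actual table classes:

#include <shared_mutex>
#include <vector>

// Illustrative sketch of table-level locks.
struct TableSketch {
    std::shared_mutex table_lock;
    // ... row groups, column data, ...
};

// Readers hold the table's lock in shared mode while scanning it.
void ScanTable(TableSketch &table) {
    std::shared_lock<std::shared_mutex> guard(table.table_lock);
    // ... read table data ...
}

// The checkpointer locks one table at a time in exclusive mode and
// releases the lock before moving on to the next table, so it never
// holds exclusive locks on several tables at once.
void CheckpointTables(std::vector<TableSketch *> &tables) {
    for (auto *table : tables) {
        std::unique_lock<std::shared_mutex> guard(table->table_lock);
        // ... rewrite this table's blocks ...
    }
}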

Partial Blocks

When doing a checkpoint, we gather small writes and co-locate them on the same block. This is particularly useful when writing small tables, as without the partial block manager we would need to dedicate one block per column.

Prior to this PR, we would also co-locate data from multiple tables on the same block, as the partial block manager was flushed only once, at the end of a checkpoint.

This is problematic for concurrent checkpointing, however. The partial block manager gathers (small) writes and writes them out once a block is filled. This means that a write to column A can trigger writes for other columns. With table-level locks this does not work nicely, since it would mean we must hold locks on all tables for the entire duration of the checkpoint. It could even cause deadlocks, as the checkpoint grabs exclusive locks on multiple tables while reading transactions grab shared locks on multiple tables.

To solve this problem, in this PR we modify the partial block manager to operate on the table level only. As a result, partial blocks will never be shared across tables. During checkpoints, we then only need to hold locks on individual tables, and can release locks on tables after we are done processing them.
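
The resulting flushing discipline can be sketched as follows; this is illustrative only, the real PartialBlockManager has a different interface and the block size is an assumption:

#include <cstdint>
#include <vector>

// Illustrative sketch of a per-table partial block manager.
class PartialBlockManagerSketch {
public:
    static constexpr uint64_t BLOCK_SIZE = 262144; // assumed block size

    // Buffer a small write (assumes data.size() <= BLOCK_SIZE);
    // when the current block would overflow, write it out first.
    void AppendSegment(const std::vector<uint8_t> &data) {
        if (buffer.size() + data.size() > BLOCK_SIZE) {
            FlushBlock();
        }
        buffer.insert(buffer.end(), data.begin(), data.end());
    }

    // Called after checkpointing a single table: flush whatever remains,
    // so that partial blocks are never shared across tables and the
    // table's lock can be released immediately afterwards.
    void FlushAfterTable() {
        if (!buffer.empty()) {
            FlushBlock();
        }
    }

private:
    void FlushBlock() {
        // ... write the buffered data out as one block in the file ...
        buffer.clear();
    }

    std::vector<uint8_t> buffer;
};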

Sequences

This PR includes a minor rework of sequences. Previously, sequences could be used by read-only transactions, and were handled outside of the regular UndoBuffer infrastructure. When sequences were used in read-only databases, the sequences would increment in value, but that increment would not be stored on disk.

This behavior is reworked in this PR: calling nextval on a sequence is now a write operation on the database the sequence lives in. This change is required because otherwise read-only transactions would need to e.g. append to the write-ahead log, which conflicts with checkpointing. In addition, another restriction is added in this PR: nextval can now only be called with a constant parameter (e.g. nextval('seq')). The following is no longer supported:

create sequence seq;
create table sequence_info(seq varchar);
insert into sequence_info values('seq');
select nextval(seq) from sequence_info;
-- Not implemented Error: currval/nextval requires a constant sequence - non-constant sequences are no longer supported

That is because allowing arbitrary sequences to be referenced at run-time causes a lot of headaches: for example, multiple threads can now mark transactions as write transactions concurrently, and we no longer know up-front when binding whether a query is going to be read-only or not (does it refer to a temporary sequence or a persistent one?).

sqllogictest skipif/onlyif

This PR also extends the skipif/onlyif decorators in the sqllogictest runner to operate on loop identifiers. This allows us to write tests where e.g. one set of threads performs one task while other threads perform another. The syntax is onlyif X=0 or skipif X=0 prior to a statement or query, e.g.:

onlyif i=0
statement ok
SELECT 42;

Below is a full example where we run a concurrentloop and filter on threadid, so that one thread appends data while the other threads read:

statement ok
CREATE TABLE integers(i INTEGER PRIMARY KEY)

statement ok
INSERT INTO integers SELECT * FROM range(10000);

concurrentloop threadid 0 20

loop i 0 20

onlyif threadid=0
statement ok
INSERT INTO integers SELECT * FROM range(10000 + ${i} * 100, 10100 + ${i} * 100);

endloop

loop i 0 100

skipif threadid=0
statement ok
SELECT * FROM integers WHERE i=${i} * (${threadid} * 300);

endloop

endloop

Experiments

Below is an experiment that shows the effect of this PR: one connection appends data to the lineitem table while another connection continuously runs queries (TPC-H Q1 in this instance).

Generate Data

import duckdb

duckdb.sql('CALL dbgen(sf=1)')
duckdb.sql("COPY lineitem TO 'lineitem.parquet'")
Run Queries

import duckdb
from threading import Thread
import os
import time

db_file = 'concurrent_tpch_test.db'
wal_file = db_file + '.wal'

con = duckdb.connect(db_file)
con.sql('CALL dbgen(sf=0)')

def append_to_lineitem(con):
    while True:
        con.sql("INSERT INTO lineitem FROM 'lineitem.parquet'")

def run_queries(con):
    while True:
        con.sql('PRAGMA tpch(1)').execute()

def print_file_size(path):
    # the WAL file may not exist yet when the measurement starts
    if not os.path.exists(path):
        print(f'File {path}: 0MB')
        return
    file_stats = os.stat(path)
    print(f'File {path}: {file_stats.st_size / (1024 * 1024)}MB')

def measure():
    for i in range(1000000):
        print(f'------{i}s--------')
        print_file_size(db_file)
        print_file_size(wal_file)
        time.sleep(1)


write_thread = Thread(target=append_to_lineitem, args=[con.cursor()])
read_thread = Thread(target=run_queries, args=[con.cursor()])
measure_thread = Thread(target=measure, args=[])

write_thread.start()
read_thread.start()
measure_thread.start()

write_thread.join()
read_thread.join()
measure_thread.join()

v0.10.2

Time (s)    Database Size    WAL Size
5s          476MB            2162MB
10s         951MB            4177MB
15s         1426MB           5819MB
20s         1774MB           7928MB
25s         2218MB           9737MB
30s         2573MB           11532MB

New

Time (s)    Database Size    WAL Size
5s          805MB            0MB
10s         1458MB           0MB
15s         2101MB           0MB
20s         2594MB           0MB
25s         3081MB           0MB
30s         3559MB           0MB

We can see that in the new version, because the reading thread no longer blocks optimistic writing and automatic checkpointing, no data is ever written to the WAL. Instead, data keeps being written directly into the database file, and the WAL is never utilized in this scenario.

@Mytherin (Collaborator, Author) commented on May 3, 2024

Nightly test failures are unrelated and should be picked up separately from this PR.

@Mytherin merged commit cb82ce9 into duckdb:main on May 3, 2024 (50 of 63 checks passed).

@suiluj commented on May 3, 2024

@Mytherin thanks a lot for your work! this is great news! :)
