diff --git a/v20.2/admin-ui-debug-pages.md b/v20.2/admin-ui-debug-pages.md index 412f420b167..c1feb02f843 100644 --- a/v20.2/admin-ui-debug-pages.md +++ b/v20.2/admin-ui-debug-pages.md @@ -38,7 +38,7 @@ The **Even More Advanced Debugging** section of the page lists additional report Depending on your [access level](admin-ui-overview.html#admin-ui-access), the endpoints listed here provide access to: - [Log files](debug-and-error-logs.html#write-to-file) -- Secondary log files (e.g., RocksDB logs, [execution logs](query-behavior-troubleshooting.html#cluster-wide-execution-logs), [slow query logs](query-behavior-troubleshooting.html#using-the-slow-query-log), [authentication logs](query-behavior-troubleshooting.html#authentication-logs)) +- Secondary log files (e.g., storage engine logs, [execution logs](query-behavior-troubleshooting.html#cluster-wide-execution-logs), [slow query logs](query-behavior-troubleshooting.html#using-the-slow-query-log), [authentication logs](query-behavior-troubleshooting.html#authentication-logs)) - Node status - Hot ranges - Node-specific metrics diff --git a/v20.2/architecture/life-of-a-distributed-transaction.md b/v20.2/architecture/life-of-a-distributed-transaction.md index b537f28278d..1ecfecd7bca 100644 --- a/v20.2/architecture/life-of-a-distributed-transaction.md +++ b/v20.2/architecture/life-of-a-distributed-transaction.md @@ -1,6 +1,6 @@ --- title: Life of a Distributed Transaction -summary: This guide details the path through CockroachDB's architecture that a query takes, starting with a SQL client and progressing all the way to RocksDB (and then back out again). +summary: This guide details the path through CockroachDB's architecture that a query takes, starting with a SQL client and progressing all the way to the storage layer (and then back out again). 
toc: true --- @@ -117,9 +117,9 @@ If the write operation is valid according to the evaluator, the leaseholder send Importantly, this feature is entirely built for transactional optimization (known as [transaction pipelining](transaction-layer.html#transaction-pipelining)). There are no issues if an operation passes the evaluator but doesn't end up committing. -### Reads from RocksDB +### Reads from the storage layer -All operations (including writes) begin by reading from the local instance of RocksDB to check for write intents for the operation's key. We talk much more about [write intents in the transaction layer of the CockroachDB architecture](transaction-layer.html#write-intents), which is worth reading, but a simplified explanation is that these are provisional, uncommitted writes that express that some other concurrent transaction plans to write a value to the key. +All operations (including writes) begin by reading from the local instance of the [storage engine](storage-layer.html) to check for write intents for the operation's key. We talk much more about [write intents in the transaction layer of the CockroachDB architecture](transaction-layer.html#write-intents), which is worth reading, but a simplified explanation is that these are provisional, uncommitted writes that express that some other concurrent transaction plans to write a value to the key. What we detail below is a simplified version of the CockroachDB transaction model. For more detail, check out [the transaction architecture documentation](transaction-layer.html). @@ -128,7 +128,7 @@ What we detail below is a simplified version of the CockroachDB transaction mode If an operation encounters a write intent for a key, it attempts to "resolve" the write intent by checking the state of the write intent's transaction. If the write intent's transaction record is... 
- `COMMITTED`, this operation converts the write intent to a regular key-value pair, and then proceeds as if it had read that value instead of a write intent. -- `ABORTED`, this operation discards the write intent and reads the next-most-recent value from RocksDB. +- `ABORTED`, this operation discards the write intent and reads the next-most-recent value from the storage engine. - `PENDING`, the new transaction attempts to "push" the write intent's transaction by moving that transaction's timestamp forward (i.e., ahead of this transaction's timestamp); however, this only succeeds if the write intent's transaction has become inactive. - If the push succeeds, the operation continues. - If this push fails (which is the majority of the time), this transaction goes into the [`TxnWaitQueue`](https://www.cockroachlabs.com/docs/stable/architecture/transaction-layer.html#txnwaitqueue) on this node. The incoming transaction can only continue once the blocking transaction completes (i.e., commits or aborts). @@ -142,7 +142,7 @@ Check out our architecture documentation for more information about [CockroachDB #### Read Operations -If the read doesn't encounter a write intent and the key-value operation is meant to serve a read, it can simply use the value it read from the leaseholder's instance of RocksDB. This works because the leaseholder had to be part of the Raft consensus group for any writes to complete, meaning it must have the most up-to-date version of the range's data. +If the read doesn't encounter a write intent and the key-value operation is meant to serve a read, it can simply use the value it read from the leaseholder's instance of the storage engine. This works because the leaseholder had to be part of the Raft consensus group for any writes to complete, meaning it must have the most up-to-date version of the range's data. The leaseholder aggregates all read responses into a `BatchResponse` that will get returned to the gateway node's `DistSender`.
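The three intent-resolution branches described above (`COMMITTED`, `ABORTED`, `PENDING`) can be sketched as a toy Python function. All names here (`resolve_intent`, the `txn_records` map) are illustrative, not CockroachDB's actual API; the real logic lives in the KV layer and is far more involved.

```python
# Toy sketch of write-intent resolution, based on the three
# transaction-record states described above. Illustrative only.

def resolve_intent(intent, txn_records, storage):
    """Return the value a reader should see for the intent's key."""
    status = txn_records[intent["txn_id"]]
    if status == "COMMITTED":
        # Convert the intent to a regular key-value pair and read it.
        storage[intent["key"]] = intent["value"]
        return intent["value"]
    if status == "ABORTED":
        # Discard the intent; read the next-most-recent value.
        return storage.get(intent["key"])
    # PENDING: the reader tries to push the writer's timestamp and,
    # when the push fails, waits in the TxnWaitQueue instead.
    return "WAIT"

txn_records = {"t1": "COMMITTED", "t2": "ABORTED", "t3": "PENDING"}
storage = {"k": "old"}
print(resolve_intent({"txn_id": "t2", "key": "k", "value": "new"}, txn_records, storage))  # old
print(resolve_intent({"txn_id": "t1", "key": "k", "value": "new"}, txn_records, storage))  # new
```

Note that only the `COMMITTED` branch mutates storage; an aborted writer's intent simply disappears from the reader's point of view.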
@@ -158,21 +158,21 @@ The leaseholder then proposes these Raft operations to the Raft group leader. Th CockroachDB leverages Raft as its consensus protocol. If you aren't familiar with it, we recommend checking out the details about [how CockroachDB leverages Raft](https://www.cockroachlabs.com/docs/v2.1/architecture/replication-layer.html#raft), as well as [learning more about how the protocol works at large](http://thesecretlivesofdata.com/raft/). -In terms of executing transactions, the Raft leader receives proposed Raft commands from the leaseholder. Each Raft command is a write that is used to represent an atomic state change of the underlying key-value pairs stored in RocksDB. +In terms of executing transactions, the Raft leader receives proposed Raft commands from the leaseholder. Each Raft command is a write that is used to represent an atomic state change of the underlying key-value pairs stored in the storage engine. ### Consensus For each command the Raft leader receives, it proposes a vote to the other members of the Raft group. -Once the command achieves consensus (i.e., a majority of nodes including itself acknowledge the Raft command), it is committed to the Raft leader’s Raft log and written to RocksDB. At the same time, the Raft leader also sends a command to all other nodes to include the command in their Raft logs. +Once the command achieves consensus (i.e., a majority of nodes including itself acknowledge the Raft command), it is committed to the Raft leader’s Raft log and written to the storage engine. At the same time, the Raft leader also sends a command to all other nodes to include the command in their Raft logs. -Once the leader commits the Raft log entry, it’s considered committed. At this point the value is considered written, and if another operation comes in and performs a read on RocksDB for this key, they’ll encounter this value. +Once the leader commits the Raft log entry, it’s considered committed. 
At this point the value is considered written, and if another operation comes in and performs a read from the storage engine for this key, it will encounter this value. Note that this write operation creates a write intent; these writes will not be fully committed until the gateway node sets the transaction record's status to `COMMITTED`. ## On the way back up -Now that we have followed an operation all the way down from the SQL client to RocksDB, we can pretty quickly cover what happens on the way back up (i.e., when generating a response to the client). +Now that we have followed an operation all the way down from the SQL client to the storage engine, we can pretty quickly cover what happens on the way back up (i.e., when generating a response to the client). 1. Once the leaseholder applies a write to its Raft log, it sends a commit acknowledgment to the gateway node's `DistSender`, which was waiting for this signal (having already received the provisional acknowledgment from the leaseholder's evaluator). diff --git a/v20.2/architecture/replication-layer.md b/v20.2/architecture/replication-layer.md index 07d8f531323..77db35e3fd0 100644 --- a/v20.2/architecture/replication-layer.md +++ b/v20.2/architecture/replication-layer.md @@ -147,7 +147,7 @@ The replication layer sends `BatchResponses` back to the distribution layer's `D Committed Raft commands are written to the Raft log and ultimately stored on disk through the storage layer. -The leaseholder serves reads from its RocksDB instance, which is in the storage layer. +The leaseholder serves reads from the storage layer. ## What's next?
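The majority-acknowledgment rule from the consensus section above reduces to a simple count. This toy sketch (the `is_committed` helper is illustrative, not part of CockroachDB or Raft libraries) shows only the counting; real Raft also involves terms, logs, and leader election.

```python
# Toy sketch of the Raft commit rule described above: a command is
# committed once a majority of the group (leader included) acknowledges it.

def is_committed(acks, group_size):
    """acks: number of replicas (including the leader) that acknowledged."""
    return acks > group_size // 2

# A 3-replica range: the leader plus one follower form a majority.
print(is_committed(2, 3))  # True
print(is_committed(1, 3))  # False
```

This is why a 3-replica range can survive the loss of one node: two acknowledgments still clear the majority threshold.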
diff --git a/v20.2/architecture/storage-layer.md b/v20.2/architecture/storage-layer.md index 5d39cbe9d9c..ee4704ef1a0 100644 --- a/v20.2/architecture/storage-layer.md +++ b/v20.2/architecture/storage-layer.md @@ -15,7 +15,14 @@ If you haven't already, we recommend reading the [Architecture Overview](overvie Each CockroachDB node contains at least one `store`, specified when the node starts, which is where the `cockroach` process reads and writes its data on disk. -This data is stored as key-value pairs on disk using RocksDB, which is treated primarily as a black-box API. Internally, each store contains two instances of RocksDB: +This data is stored as key-value pairs on disk using the storage engine, which is treated primarily as a black-box API. + +New in v20.2: By default, [CockroachDB uses the Pebble storage engine](../cockroach-start.html#storage-engine), with RocksDB available as an option. Pebble is intended to be bidirectionally compatible with the RocksDB on-disk format, but differs in that it: + +- Is written in Go and implements a subset of RocksDB's large feature set. +- Contains optimizations that benefit CockroachDB. + +Internally, each store contains two instances of the storage engine: - One for storing temporary distributed SQL data - One for all other data on the node @@ -30,11 +37,24 @@ In relationship to other layers in CockroachDB, the storage layer: ## Components +### Pebble + +New in v20.2: CockroachDB uses [Pebble by default](../cockroach-start.html#storage-engine)––an embedded key-value store with a RocksDB-compatible API developed by Cockroach Labs––to read and write data to disk. You can find more information about it on the [Pebble GitHub page](https://github.com/cockroachdb/pebble) or in the blog post [Introducing Pebble: A RocksDB Inspired Key-Value Store Written in Go](https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/). 
+ +Pebble integrates well with CockroachDB for a number of reasons: + +- Key-value store, which makes mapping to our key-value layer simple +- Atomic write batches and snapshots, which give us a subset of transactions +- Is developed by Cockroach Labs engineers +- Contains optimizations that are not in RocksDB and are inspired by how CockroachDB uses the storage engine. For an example of such an optimization, see the blog post [Faster Bulk-Data Loading in CockroachDB](https://www.cockroachlabs.com/blog/bulk-data-import/). + +Efficient storage for the keys is guaranteed by the underlying Pebble engine by means of prefix compression. + ### RocksDB -CockroachDB uses RocksDB––an embedded key-value store––to read and write data to disk. You can find more information about it on the [RocksDB Basics GitHub page](https://github.com/facebook/rocksdb/wiki/RocksDB-Basics). +If you [choose it at node startup time (it's not the default)](../cockroach-start.html#storage-engine), CockroachDB uses RocksDB––an embedded key-value store––to read and write data to disk. You can find more information about it on the [RocksDB Basics GitHub page](https://github.com/facebook/rocksdb/wiki/RocksDB-Basics). -RocksDB integrates really well with CockroachDB for a number of reasons: +RocksDB integrates well with CockroachDB for a number of reasons: - Key-value store, which makes mapping to our key-value layer simple - Atomic write batches and snapshots, which give us a subset of transactions @@ -43,7 +63,7 @@ Efficient storage for the keys is guaranteed by the underlying RocksDB engine by ### MVCC
Much of this work is done by using [hybrid logical clock (HLC) timestamps](transaction-layer.html#time-and-hybrid-logical-clocks) to differentiate between versions of data, track commit timestamps, and identify a value's garbage collection expiration. All of this MVCC data is then stored in RocksDB. +CockroachDB relies heavily on [multi-version concurrency control (MVCC)](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) to process concurrent requests and guarantee consistency. Much of this work is done by using [hybrid logical clock (HLC) timestamps](transaction-layer.html#time-and-hybrid-logical-clocks) to differentiate between versions of data, track commit timestamps, and identify a value's garbage collection expiration. All of this MVCC data is then stored in Pebble. Despite being implemented in the storage layer, MVCC values are widely used to enforce consistency in the [transaction layer](transaction-layer.html). For example, CockroachDB maintains a [timestamp cache](transaction-layer.html#timestamp-cache), which stores the timestamp of the last time that the key was read. If a write operation occurs at a lower timestamp than the largest value in the read timestamp cache, it signifies there’s a potential anomaly and the transaction must be restarted at a later timestamp. diff --git a/v20.2/cluster-setup-troubleshooting.md b/v20.2/cluster-setup-troubleshooting.md index 2470d71ca81..b1bdff902a7 100644 --- a/v20.2/cluster-setup-troubleshooting.md +++ b/v20.2/cluster-setup-troubleshooting.md @@ -312,7 +312,7 @@ A CockroachDB node will grow to consume all of the memory allocated for its `cac CockroachDB memory usage has 3 components: - **Go allocated memory**: Memory allocated by the Go runtime to support query processing and various caches maintained in Go by CockroachDB. These caches are generally small in comparison to the RocksDB cache size. If Go allocated memory is larger than a few hundred megabytes, something concerning is going on. 
-- **CGo allocated memory**: Memory allocated by the C/C++ libraries linked into CockroachDB and primarily concerns RocksDB and the RocksDB block cache. This is the “cache” mentioned in the note above. The size of CGo allocated memory is usually very close to the configured cache size. +- **CGo allocated memory**: Memory allocated by the C/C++ libraries linked into CockroachDB, primarily for the block cache of the storage engine (this is true for both [Pebble (the default engine) and RocksDB](cockroach-start.html#storage-engine)). This is the “cache” mentioned in the note above. The size of CGo allocated memory is usually very close to the configured cache size. - **Overhead**: The process resident set size minus Go/CGo allocated memory. If Go allocated memory is larger than a few hundred megabytes, you might have encountered a memory leak. Go comes with a built-in heap profiler which is already enabled on your CockroachDB process. See this [excellent blog post](https://blog.golang.org/profiling-go-programs) on profiling Go programs.
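The MVCC scheme from the storage-layer section (timestamped versions per key, with reads returning the newest version at or below the read timestamp) can be sketched as a toy Python class. The `ToyMVCC` name and its methods are illustrative; this ignores write intents, garbage collection TTLs, and HLC details entirely.

```python
# Toy sketch of MVCC as described in the storage-layer section:
# each key maps to a list of timestamped versions, and a read at
# timestamp ts returns the newest version written at or before ts.
# Illustrative only; CockroachDB layers intents, GC, and HLC
# timestamps on top of this basic idea.

class ToyMVCC:
    def __init__(self):
        self.versions = {}  # key -> sorted list of (timestamp, value)

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))
        self.versions[key].sort()

    def read(self, key, ts):
        candidates = [(t, v) for t, v in self.versions.get(key, []) if t <= ts]
        return candidates[-1][1] if candidates else None

kv = ToyMVCC()
kv.write("a", "v1", ts=10)
kv.write("a", "v2", ts=20)
print(kv.read("a", ts=15))  # v1 (newest version at or below ts=15)
print(kv.read("a", ts=25))  # v2
```

Keeping old versions around is also what makes the timestamp-cache anomaly check possible: a write below an already-served read timestamp can be detected and the transaction restarted.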
diff --git a/v20.2/cockroach-debug-zip.md b/v20.2/cockroach-debug-zip.md index cef7ab52728..2bd3519b11e 100644 --- a/v20.2/cockroach-debug-zip.md +++ b/v20.2/cockroach-debug-zip.md @@ -9,7 +9,7 @@ key: debug-zip.html The `cockroach debug zip` [command](cockroach-commands.html) connects to your cluster and gathers information from each active node into a single file (inactive nodes are not included): - [Log files](debug-and-error-logs.html) -- Secondary log files (e.g., RocksDB logs, [execution logs](query-behavior-troubleshooting.html#cluster-wide-execution-logs), [slow query logs](query-behavior-troubleshooting.html#using-the-slow-query-log)) +- Secondary log files (e.g., storage engine logs, [execution logs](query-behavior-troubleshooting.html#cluster-wide-execution-logs), [slow query logs](query-behavior-troubleshooting.html#using-the-slow-query-log)) - Cluster events - Schema change events - Node liveness diff --git a/v20.2/cockroach-start.md b/v20.2/cockroach-start.md index 05e8791bfb2..73ce924df7c 100644 --- a/v20.2/cockroach-start.md +++ b/v20.2/cockroach-start.md @@ -127,9 +127,10 @@ The `--locality` flag accepts arbitrary key-value pairs that describe the locati Supported options: -- `default`: Checks which engine type was last used for this node's [store directory](#store) (Pebble or RocksDB), and uses that engine. If more than one store is specified, the previous engine type of the first store is used. If the check fails for any reason, or if the store directory does not exist yet, RocksDB is used. +- `pebble`: **Default** unless specified otherwise at node startup. Uses the [Pebble storage engine](https://github.com/cockroachdb/pebble). Pebble is intended to be bidirectionally compatible with the RocksDB on-disk format. Pebble differs from RocksDB in that it: + - Is written in Go and implements a subset of RocksDB's large feature set. + - Contains optimizations that benefit CockroachDB. 
- `rocksdb`: Uses the [RocksDB](https://rocksdb.org) storage engine. -- `pebble`: Uses the experimental [Pebble storage engine](https://github.com/cockroachdb/pebble). Pebble is intended to be bidirectionally compatible with the RocksDB on-disk format. Pebble differs from RocksDB in that it is written in Go and implements a subset of RocksDB's large feature set. #### Store diff --git a/v20.2/encryption.md b/v20.2/encryption.md index 1b61e68c4cf..7ff88c2ebea 100644 --- a/v20.2/encryption.md +++ b/v20.2/encryption.md @@ -33,7 +33,7 @@ Any new file created by the store uses the currently-active data key. All data k After startup, if the active data key is too old, CockroachDB generates a new data key and marks it as active, using it for all further encryption. -CockroachDB does not currently force re-encryption of older files but instead relies on normal RocksDB churn to slowly rewrite all data with the desired encryption. +CockroachDB does not currently force re-encryption of older files but instead relies on normal [storage engine](architecture/storage-layer.html) churn to slowly rewrite all data with the desired encryption. ### Rotating keys @@ -171,7 +171,7 @@ The report shows encryption status for all stores on the selected node, includin * Active data key information. * The fraction of files/bytes encrypted using the active data key. -CockroachDB relies on RocksDB compactions to write new files using the latest encryption key. It may take several days for all files to be replaced. Some files are only rewritten at startup, and some keep older copies around, requiring multiple restarts. You can force RocksDB compaction with the `cockroach debug compact` command (the node must first be [stopped](cockroach-quit.html)). +CockroachDB relies on [storage layer](architecture/storage-layer.html) compactions to write new files using the latest encryption key. It may take several days for all files to be replaced. 
Some files are only rewritten at startup, and some keep older copies around, requiring multiple restarts. You can force storage compaction with the `cockroach debug compact` command (the node must first be [stopped](cockroach-quit.html)). Information about keys is written to [the logs](debug-and-error-logs.html), including: diff --git a/v20.2/known-limitations.md b/v20.2/known-limitations.md index 71f7d624320..80316632167 100644 --- a/v20.2/known-limitations.md +++ b/v20.2/known-limitations.md @@ -194,7 +194,7 @@ Accessing the Admin UI for a secure cluster now requires login information (i.e. ### Large index keys can impair performance -The use of tables with very large primary or secondary index keys (>32KB) can result in excessive memory usage. Specifically, if the primary or secondary index key is larger than 32KB the default indexing scheme for RocksDB SSTables breaks down and causes the index to be excessively large. The index is pinned in memory by default for performance. +The use of tables with very large primary or secondary index keys (>32KB) can result in excessive memory usage. Specifically, if the primary or secondary index key is larger than 32KB, the default indexing scheme for [storage engine](cockroach-start.html#storage-engine) SSTables breaks down and causes the index to be excessively large. The index is pinned in memory by default for performance. To work around this issue, we recommend limiting the size of primary and secondary keys to 4KB, which you must account for manually. Note that most columns are 8B (exceptions being `STRING` and `JSON`), which still allows for very complex key structures.
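As a rough illustration of the prefix compression that the storage-layer section credits for efficient key storage, this toy sketch stores each key in a sorted run as a (shared-prefix length, suffix) pair. It is not Pebble's actual block encoding; the helper names are illustrative.

```python
# Toy sketch of prefix compression over sorted SSTable-style keys,
# as mentioned in the storage-layer section. Each key is stored as
# (length of prefix shared with the previous key, remaining suffix).
# Illustrative only; Pebble's real block format differs.

import os

def compress(sorted_keys):
    out, prev = [], ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([prev, key]))
        out.append((shared, key[shared:]))
        prev = key
    return out

def decompress(entries):
    keys, prev = [], ""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

keys = ["/table/1/a", "/table/1/b", "/table/2/a"]
compressed = compress(keys)
print(compressed)  # [(0, '/table/1/a'), (9, 'b'), (7, '2/a')]
assert decompress(compressed) == keys
```

Because adjacent sorted keys in the same table share long prefixes, most entries collapse to a short suffix, which is why keys stored this way stay compact on disk.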