
kv: consistent follower reads with leaseholder coordination #72593

Open
nvanbenschoten opened this issue Nov 10, 2021 · 3 comments
Labels
A-kv-replication: Relating to Raft, consensus, and coordination.
A-kv-transactions: Relating to MVCC and the transactional model.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
C-performance: Perf of queries or internals. Solution not expected to change functional behavior.
T-kv: KV Team

Comments


nvanbenschoten commented Nov 10, 2021

To date, follower reads (rfc/follower_reads.md, rfc/follower_reads_implementation.md, rfc/non_blocking_txns.md) have always been viewed first and foremost as a tool to minimize latency for read-only operations. By avoiding all communication with a range's leaseholder, follower reads can help a transaction avoid cross-region communication, dramatically reducing latency. However, in order to avoid any coordination with the leaseholder, follower reads trade off some utility — they either require reads to be stale or writes to be pushed into the future. This limits the places where they can be used.

This issue explores an extended form of "consistent" follower read that can be used in more situations than "stale" follower reads but still requires synchronous fixed-size (with respect to data accessed) communication with the range's leaseholder, negating what we have traditionally viewed as the primary benefit of follower reads. It also explores the secondary benefits of follower reads that remain even if the leaseholder helps coordinate the read off of a follower.

Motivations

  1. Network costs in public clouds are expensive. They are also asymmetric, with pricing dependent on the source and destination of data transfer. For example, EC2's data transfer pricing page shows that cross-region transfer costs between $0.01 and $0.02 per GB, cross-zone transfer costs $0.01 per GB, and intra-zone transfer is free. This asymmetric pricing provides a strong incentive to minimize the amount of data shipped across regions/zones, even if some communication across regions/zones is unavoidable. Many clients have a follower for a given range that is closer (in data transfer cost terms) than the leaseholder, which presents an opportunity for cost savings.

  2. Load-based splitting and rebalancing can help spread out well-distributed load across a cluster of nodes. However, they cannot spread out hotspots that cannot be split into different ranges. For read-heavy hotspots, serving reads from follower replicas can provide a form of load balancing. This is true even if the leaseholder is contacted at some point to facilitate the read from the follower, as long as the follower is the one performing the expensive portion of the read (e.g. reading from disk, sending the result set to the client over the network, etc.).

  3. (Stretch motivation) In the future, followers may store data in a different layout than leaseholders (e.g. column-oriented instead of row-oriented), which may be better suited for large analytical-style reads. The data organization would exchange write performance for read performance, so it would be more appropriate for a follower by virtue of the fact that followers can apply log entries at a slower cadence than leaseholders (e.g. batching 100s of entries to apply at a time). Serving reads from follower replicas would allow these read-optimized followers to be used even for consistent reads.

High-Level Overview

The key idea here is that even if the leaseholder is contacted during a read, it doesn't need to be the one to serve the read's results. Instead, it can be contacted to do some light bookkeeping and then offload the heavy lifting to a follower replica that may be a better candidate to serve the data back to the client.

For the sake of this issue, let's pretend we introduced a new request type called EstablishResolvedTimestamp (a sibling to the QueryResolvedTimestamp request).

In response to an EstablishResolvedTimestamp request, the leaseholder would concern itself with concurrency control and with determining how far the follower needs to catch up on its Raft log before its state machine contains a fully resolved view of the specified span x timestamp segment of "keyspacetime". Morally, the leaseholder would be in charge of creating a resolved timestamp over the given key span at the given timestamp. So the API would look something like this:

type EstablishResolvedTimestampRequest struct{ Transaction, Timestamp, Span }
type EstablishResolvedTimestampResponse struct{ LeaseAppliedIndex }
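
For concreteness, here is a slightly fleshed-out Go sketch of what these messages might carry. The stub types and field names below are illustrative assumptions for this issue, not an existing CockroachDB API (the real equivalents would be roachpb.Span, hlc.Timestamp, roachpb.Transaction, and so on):

```go
// Illustrative stand-ins for the real types (roachpb.Span, hlc.Timestamp,
// roachpb.Transaction), defined here only to keep the sketch self-contained.
type Span struct{ Key, EndKey []byte }

type Timestamp struct {
	WallTime int64
	Logical  int32
}

type Transaction struct{ ID []byte }

// Hypothetical shape of the proposed request/response; the field names are a
// sketch for this issue, not a committed API.
type EstablishResolvedTimestampRequest struct {
	Txn       *Transaction // needed for deadlock detection (see notes below)
	Timestamp Timestamp    // timestamp at which the span must be resolved
	Span      Span         // key span over which to establish the resolved timestamp
}

type EstablishResolvedTimestampResponse struct {
	// LeaseAppliedIndex is the index the follower must have applied before its
	// local state machine contains a fully resolved view of (Span, Timestamp).
	LeaseAppliedIndex uint64
}
```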

With this new API, followers can now be used to serve consistent follower reads. Any of the following approaches would work here, and each has its own benefits:

Follower-coordinated

  1. client issues scan/get to nearest follower replica with ts
  2. follower checks closed timestamp against ts, determines its closed timestamp is not high enough
  3. follower sends EstablishResolvedTimestampRequest to leaseholder
  4. leaseholder grabs latches, checks lock table, bumps timestamp cache over span, and notes current lease_applied_index
  5. leaseholder returns EstablishResolvedTimestampResponse with lease_applied_index
  6. follower waits to apply log entry with lease_applied_index >= one from response
  7. follower serves the read and returns the result to the client (a sketch of the leaseholder-side handling in steps 4 and 5 follows the benefits list below)

Benefits:

  • simpler
  • minimal API changes
  • can avoid leaseholder hop if closed timestamp happens to be high enough
  • can short-circuit leaseholder hop if closed timestamp increases while EstablishResolvedTimestamp outstanding, which can be helpful for stale reads as well
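
A rough sketch of what the leaseholder-side handling in steps 4 and 5 might look like, reusing the illustrative types above. The latch manager, lock table, and timestamp cache interfaces here are hypothetical stand-ins for the real concurrency-control machinery, not CockroachDB's actual interfaces:

```go
import (
	"context"
	"fmt"
)

// Hypothetical interfaces standing in for the leaseholder's latch manager,
// lock table, timestamp cache, and replica state.
type latchGuard interface{ Release() }

type latchManager interface {
	Acquire(ctx context.Context, sp Span, ts Timestamp) (latchGuard, error)
}

type lockTable interface {
	// FirstConflict returns a conflicting lock holder below ts, if any.
	FirstConflict(sp Span, ts Timestamp) *Transaction
}

type timestampCache interface{ Bump(sp Span, ts Timestamp) }

type leaseholderReplica struct {
	latches latchManager
	locks   lockTable
	tsCache timestampCache
	lai     func() uint64 // current lease applied index
}

func (r *leaseholderReplica) establishResolvedTimestamp(
	ctx context.Context, req EstablishResolvedTimestampRequest,
) (EstablishResolvedTimestampResponse, error) {
	// Serialize with in-flight writes below the request timestamp by grabbing
	// latches over the span.
	guard, err := r.latches.Acquire(ctx, req.Span, req.Timestamp)
	if err != nil {
		return EstablishResolvedTimestampResponse{}, err
	}
	defer guard.Release()

	// Check the lock table. A conflicting intent below the request timestamp
	// would have to be pushed or resolved first (req.Txn feeds deadlock
	// detection); this sketch simply returns an error instead.
	if conflict := r.locks.FirstConflict(req.Span, req.Timestamp); conflict != nil {
		return EstablishResolvedTimestampResponse{}, fmt.Errorf("conflicting txn %x", conflict.ID)
	}

	// Bump the timestamp cache over the span so no later write can invalidate
	// reads at or below req.Timestamp.
	r.tsCache.Bump(req.Span, req.Timestamp)

	// Note the current lease applied index: once a follower has applied up to
	// this index, its state machine reflects everything visible to the read.
	return EstablishResolvedTimestampResponse{LeaseAppliedIndex: r.lai()}, nil
}
```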

Client-coordinated

  1. client issues EstablishResolvedTimestampRequest to leaseholder
  2. leaseholder grabs latches, checks lock table, bumps timestamp cache over span, and notes current lease_applied_index
  3. leaseholder returns EstablishResolvedTimestampResponse with lease_applied_index
  4. client redirects to follower with lease_applied_index
  5. follower waits to apply log entry with lease_applied_index >= one from response
  6. follower serves the read and returns the result to the client (a sketch of this client-coordinated flow follows below)
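
A minimal sketch of the client-side half of this flow, again with hypothetical names (the real plumbing would live in the DistSender or similar). ReadRequest/ReadResponse are illustrative stand-ins for an ordinary Scan/Get batch:

```go
// Illustrative read request/response; stand-ins for an ordinary Scan/Get.
type ReadRequest struct {
	Span      Span
	Timestamp Timestamp
}

type ReadResponse struct{ Rows [][]byte }

// Hypothetical stand-in for the client's RPC plumbing to specific replicas.
type rangeClient interface {
	EstablishResolvedTimestamp(context.Context, EstablishResolvedTimestampRequest) (EstablishResolvedTimestampResponse, error)
	ReadFromFollower(ctx context.Context, read ReadRequest, minLAI uint64) (ReadResponse, error)
}

func clientCoordinatedRead(
	ctx context.Context, c rangeClient, txn *Transaction, read ReadRequest,
) (ReadResponse, error) {
	// Steps 1-3: ask the leaseholder to establish a resolved timestamp over
	// the read's span; it returns the lease applied index to wait for.
	resp, err := c.EstablishResolvedTimestamp(ctx, EstablishResolvedTimestampRequest{
		Txn: txn, Timestamp: read.Timestamp, Span: read.Span,
	})
	if err != nil {
		return ReadResponse{}, err
	}
	// Steps 4-6: redirect to the nearest follower, which waits to apply up to
	// resp.LeaseAppliedIndex before evaluating the read locally.
	return c.ReadFromFollower(ctx, read, resp.LeaseAppliedIndex)
}
```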

Extended client-coordinated

  1. client issues scan/get to leaseholder replica with some establish_resolved_timestamp_on_large_result flag
  2. leaseholder grabs latches, checks lock table, bumps timestamp cache over span, evaluates
  3. leaseholder determines if result is small or large. If small, return. If large, return lease_applied_index
  4. client redirects to follower with lease_applied_index
  5. follower waits to apply log entry with lease_applied_index >= one from response
  6. follower serves the read and returns the result to the client (a sketch of the follower-side wait, shared by all variants, follows the benefits list below)

Benefits:

  • lazy determination of single-hop vs. multi-hop, based on actual result size instead of guess
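
Common to all three variants is the follower-side wait: once the follower knows the lease applied index it must reach, it blocks until local application catches up and then evaluates the read. A minimal sketch, with hypothetical names and a channel-based wait standing in for whatever hook into Raft application the real implementation would use (this is also where the liveness and split/removal questions in the notes below would have to be handled):

```go
// Hypothetical view of a follower replica, reusing the illustrative types above.
type followerReplica interface {
	AppliedLeaseIndex() uint64            // lease applied index of the local state machine
	AppliedIndexChanged() <-chan struct{} // signalled whenever application advances
	EvaluateRead(context.Context, ReadRequest) (ReadResponse, error)
}

// Wait until the follower has applied at least minLAI, then serve the read
// from local storage. In practice this would hook into Raft application
// rather than loop on a channel, and would need to handle splits, replica
// removal, and waking a quiesced range.
func serveConsistentFollowerRead(
	ctx context.Context, f followerReplica, read ReadRequest, minLAI uint64,
) (ReadResponse, error) {
	for f.AppliedLeaseIndex() < minLAI {
		select {
		case <-ctx.Done():
			return ReadResponse{}, ctx.Err()
		case <-f.AppliedIndexChanged():
			// Re-check the applied index on each wakeup.
		}
	}
	// The local state machine now reflects every write that could be visible
	// at the read's timestamp; evaluate against local storage.
	return f.EvaluateRead(ctx, read)
}
```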

Additional unstructured notes:

- the Transaction is needed in EstablishResolvedTimestampRequest for deadlock detection
- if an EstablishResolvedTimestampRequest is scanning the entire range, it can also bump the closed timestamp
- the EstablishResolvedTimestampResponse could carry an observed timestamp to help avoid some uncertainty restarts
- uncertainty works as expected on follower
  - it *does not* need to resolve up to uncertainty interval
  - it knows that any causal predecessor will have been included in a log entry with <= lease_applied_index
- read-your-writes in read-write txn works as expected on follower
  - any intent writes will be flushed during pipeline stall on leaseholder and will have been included in a log entry with <= lease_applied_index
- how do limited scans play into this?
- how does an actor tail the log and wait for a lease_applied_index?
  - what if it sees a split? or a replica removal?
- when a follower is waiting to apply, is liveness guaranteed?
  - does it ever need to wake up the range from quiescence or ensure an active leaseholder?

Jira issue: CRDB-11223

Epic CRDB-14991

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. A-kv-transactions Relating to MVCC and the transactional model. labels Nov 10, 2021
@blathers-crl blathers-crl bot added the T-kv KV Team label Nov 10, 2021

ajwerner commented Dec 8, 2021

Do sql pods know their AZ and the AZ of sql nodes? I assume they know the latter, or at least have node descriptors which give them some info. Do we need to plumb more info into the sql pods to help facilitate this?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 16, 2021
This commit adds support for the `--locality` and `--max-offset` flags to the
`cockroach mt start-sql` command.

The first of these is important because tenant SQL pods should know where they
reside. This will be important in the future for multi-region serverless and
also for projects like cockroachdb#72593.

The second of these is important because the SQL pod's max-offset setting needs
to be the same as the host cluster's. If we want to be able to configure the
host cluster's maximum clock offset to some non-default value, we'll need SQL
pods to be configured identically.

Validation of plumbing:
```sh
./cockroach start-single-node --insecure --max-offset=250ms
./cockroach sql --insecure -e 'select crdb_internal.create_tenant(2)'

 # verify --max-offset

./cockroach mt start-sql --insecure --tenant-id=2 --sql-addr=:26258 --http-addr=:0
 # CRDB crashes with error "locally configured maximum clock offset (250ms) does not match that of node [::]:62744 (500ms)"

./cockroach mt start-sql --insecure --tenant-id=2 --sql-addr=:26258 --http-addr=:0 --max-offset=250ms
 # successful

 # verify --locality

./cockroach sql --insecure --port=26258 -e 'select gateway_region()'

ERROR: gateway_region(): no region set on the locality flag on this node

./cockroach mt start-sql --insecure --tenant-id=2 --sql-addr=:26258 --http-addr=:0 --max-offset=250ms --locality=region=us-east1

./cockroach sql --insecure --port=26258 -e 'select gateway_region()'

  gateway_region
------------------
  us-east1
```
craig bot pushed a commit that referenced this issue Dec 17, 2021
73500: kv,storage: persist gateway node id in transaction intents r=AlexTalks a=AlexTalks

This change augments the `TxnMeta` protobuf structure to include the
gateway node ID (responsible for initiating the transaction) when
serializing the intent.  By doing so, this commit enables the Contention
Event Store proposed in #71965, utilizing option 2.

Release note: None

73862: sql: add test asserting CREATE/USAGE on public schema r=otan a=rafiss

refs #70266

The public schema currently always has CREATE/USAGE privileges
for the public role. Add a test that confirms this.

Release note: None

73873: scdeps: tighten dependencies, log more side effects r=postamar a=postamar

This commit reworks the dependency injection for the event logger, among
other declarative schema changer dependencies. It also makes the test
dependencies more chatty in the side effects log.

Release note: None

73932: ui: select grants tab on table details page r=maryliag a=maryliag

Previously, when the grants view was selected on the Database
Details page, it was going to the Table Details with the Overview
tab selected.
With this commit, if the view mode selected is Grant, the grant
tab is selected on the Table Details page.

Fixes #68829

Release note: None

73943: cli: support --locality and --max-offset flags with sql tenant pods r=nvanbenschoten a=nvanbenschoten

This commit adds support for the `--locality` and `--max-offset` flags to the `cockroach mt start-sql` command.

The first of these is important because tenant SQL pods should know where they reside. This will be important in the future for multi-region serverless and also for projects like #72593.

The second of these is important because the SQL pod's max-offset setting needs to be the same as the host cluster's. If we want to be able to configure the host cluster's maximum clock offset to some non-default value, we'll need SQL pods to be configured identically.


73946: ccl/sqlproxyccl: fix TestWatchPods under stressrace r=jaylim-crl a=jaylim-crl

Fixes #69220.
Regression from #67452.

In #67452, we omitted DRAINING pods from the tenant directory. Whenever a pod
goes into the DRAINING state, the pod watcher needs time to update the
directory. Not waiting for that while calling EnsureTenantAddr produces a
stale result. This commit updates TestWatchPods by polling on EnsureTenantAddr
until the pod watcher updates the directory.

Release note: None

73954: sqlsmith: don't compare voids for joins r=rafiss a=otan

No comparison expr is defined on voids, so don't generate comparisons
for them.

Resolves #73901
Resolves #73898
Resolves #73777

Release note: None

Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Marius Posta <marius@cockroachlabs.com>
Co-authored-by: Marylia Gutierrez <marylia@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jay <jay@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
gustasva pushed a commit to gustasva/cockroach that referenced this issue Jan 4, 2022

github-actions bot commented Oct 2, 2023

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 16, 2023
@ajd12342's comment was marked as outdated.
