docs: add RFC to expose follower reads to clients #33474
Great RFC 👍
docs/RFCS/20181227_follower_reads_implementation.md, line 23 at r1 (raw file):
DistSender aware that follower reads are possible. Given the intention to make follower reads an enterprise feature, some of the complexity in this proposal stem from the need to inject behavior from CCL code.
nit: stems
docs/RFCS/20181227_follower_reads_implementation.md, line 60 at r1 (raw file):
seconds but could likely be lowered to 5-10 seconds (at some threshold it may potentially interfere with on-going transactions). This proposal seeks to acheive follower reads by employing stateless approximation of when a follower
nit: achieve
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
In order to ease the burden of the client determining an adequately old timestamp for use with an `AS OF SYSTEM TIME` query, this RFC introduces a new SQL syntax `AS OF RECENT SYSTEM TIME` which is effectively a syntactic
I wonder if there should be a way for a user to explicitly bound the amount of staleness in this syntax, independent of a cluster setting? I'm thinking something like `AS OF RECENT SYSTEM TIME AT MOST '30s' AGO`? As an application developer I feel like I might be concerned that some future change to the cluster settings will make my reads more stale than I had initially planned on.
docs/RFCS/20181227_follower_reads_implementation.md, line 100 at r1 (raw file):
This section will focus on the specific details of plumbing the functionality required to expose follower reads through the codebase. Because follwer reads
nit: follower
docs/RFCS/20181227_follower_reads_implementation.md, line 127 at r1 (raw file):
targetMultiple := FollowerReadsTargetMultiple.Get(&st.SV)
targetDuration := closedts.TargetDuration.Get(&st.SV)
return -1 * time.Duration(targetMultiple * float64(targetDuration))
nit: stray tab here
docs/RFCS/20181227_follower_reads_implementation.md, line 176 at r1 (raw file):
policy which selects the closest replica. Prior to this change the policy is used to statically construct an a oracle
nit: an Oracle
docs/RFCS/20181227_follower_reads_implementation.md, line 283 at r1 (raw file):
feel less egregious? ### How does this capability fit in to the SQL query planning?
I believe the work being done today around taking expected latencies into account is significantly less involved than what would be required to make use of follower reads: it's only planned to use this information for indexes which are pinned to a region using geopartitioning (i.e., it has no knowledge of ranges whatsoever). Additionally, making use of range-level information would be a big step up and would almost certainly have to happen post-absorption of the DistSQL physical planning into the optimizer.
But despite that it does seem to me that long-term the optimizer is where this kind of decision should be made, and we probably won't want to ship the decision of what replicas to read from off to an injected oracle. For instance, if the rest of the plan affects the choice of which replica would be best, like maybe we want to colocate reads from ranges we expect to join together, or something? I'm not sure. Now that I'm writing out examples they seem kind of contrived and I'm wondering if there are many real-world examples of such cases.
I don't know what kind of interface the optimizer will want when we cross that bridge, so for now I'd say that the interface discussed in this RFC is the right way to go.
Force-pushed from c8abc14 to 1df87ad.
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, justinj (Justin Jaffray) wrote…
I wonder if there should be a way for a user to explicitly bound the amount of staleness in this syntax, independent of a cluster setting? I'm thinking something like `AS OF RECENT SYSTEM TIME AT MOST '30s' AGO`? As an application developer I feel like I might be concerned that some future change to the cluster settings will make my reads more stale than I had initially planned on.
This smells a lot like `AS OF SYSTEM TIME '-30s'`. To me the `RECENT` syntax is expressly for the client to ask for follower reads. If they want to control the staleness we already have a syntax for that. Assuming the chosen time is adequately old, a follower read will happen.
docs/RFCS/20181227_follower_reads_implementation.md, line 283 at r1 (raw file):
Previously, justinj (Justin Jaffray) wrote…
I believe the work being done today around taking expected latencies into account is significantly less involved than what would be required to make use of follower reads: it's only planned to use this information for indexes which are pinned to a region using geopartitioning (i.e., it has no knowledge of ranges whatsoever). Additionally, making use of range-level information would be a big step up and would almost certainly have to happen post-absorption of the DistSQL physical planning into the optimizer.
But despite that it does seem to me that long-term the optimizer is where this kind of decision should be made, and we probably won't want to ship the decision of what replicas to read from off to an injected oracle. For instance, if the rest of the plan affects the choice of which replica would be best, like maybe we want to colocate reads from ranges we expect to join together, or something? I'm not sure. Now that I'm writing out examples they seem kind of contrived and I'm wondering if there are many real-world examples of such cases.
I don't know what kind of interface the optimizer will want when we cross that bridge, so for now I'd say that the interface discussed in this RFC is the right way to go.
Ack. Will update after some more feedback comes in
Looks good, and it's fantastic to see the prototype sitting alongside this.
docs/RFCS/20181227_follower_reads_implementation.md, line 1 at r1 (raw file):
nit: delete new line
docs/RFCS/20181227_follower_reads_implementation.md, line 7 at r1 (raw file):
- Authors: Andrew Werner
- RFC PR: (PR # after acceptance of initial draft)
- Cockroach Issue: (one or more # from the issue tracker)
#16593 and maybe others.
docs/RFCS/20181227_follower_reads_implementation.md, line 23 at r1 (raw file):
DistSender aware that follower reads are possible. Given the intention to make follower reads an enterprise feature, some of the complexity in this proposal stem from the need to inject behavior from CCL code.
s/stem/stems/
docs/RFCS/20181227_follower_reads_implementation.md, line 59 at r1 (raw file):
`kv.closed_timestamp.target_duration`. As of writing this value defaults to 30 seconds but could likely be lowered to 5-10 seconds (at some threshold it may potentially interfere with on-going transactions). This proposal seeks to
We had it all the way down at 5 seconds before, but saw test flakes because of it. We bumped it because we weren't making use of it, but I expect that we will drop the default to 5-10 seconds once this work goes in.
docs/RFCS/20181227_follower_reads_implementation.md, line 61 at r1 (raw file):
potentially interfere with on-going transactions). This proposal seeks to acheive follower reads by employing stateless approximation of when a follower read is possible by assuming that a read may be directed at to a follower if it
"at to a"
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r1 (raw file):
acheive follower reads by employing stateless approximation of when a follower read is possible by assuming that a read may be directed at to a follower if it occurs some multiple of the target duration which is controlled by a new cluster
occurs at some multiple?
docs/RFCS/20181227_follower_reads_implementation.md, line 68 at r1 (raw file):
read (which will happen seamlessly due to a NotLeaseHolderError). The `target_multiple` offers clients a tradeoff between staleness and likelihood of follower read failing.
You use the term "staleness" a few times here. It might be worth clarifying that you mean the "minimum staleness required for a follower read to be attempted".
docs/RFCS/20181227_follower_reads_implementation.md, line 83 at r1 (raw file):
The physical planning of SQL evaluation currently tries to send queries to be
s/queries/DistSQL processors/
s/evaluated/run/
docs/RFCS/20181227_follower_reads_implementation.md, line 91 at r1 (raw file):
other than READ_UNCOMMITTED or INCONSISTENT
This would just be `CONSISTENT`, but I think this is wrong. `READ_UNCOMMITTED` operations should also be evaluated on the leaseholder.
docs/RFCS/20181227_follower_reads_implementation.md, line 100 at r1 (raw file):
This section will focus on the specific details of plumbing the functionality required to expose follower reads through the codebase. Because follwer reads
"follwer"
docs/RFCS/20181227_follower_reads_implementation.md, line 136 at r1 (raw file):
the current target duration for closed timestamps is 30s, queries performed with `RECENT` should lag "real" time by roughly 1m30s. If we can lower the target duration to 10s which would lead to a 30s real time delay.
I wonder if we should be taking `kv.closed_timestamp.close_fraction` into account as well. `kv.closed_timestamp.target_duration * kv.closed_timestamp.close_fraction` should give us the period of closed timestamp updates, so multiplying that period by some configurable multiple and adding it to the closed timestamp target duration seems like the right approach to conservatively estimate the closed timestamp threshold of a follower.
docs/RFCS/20181227_follower_reads_implementation.md, line 138 at r1 (raw file):
target duration to 10s which would lead to a 30s real time delay. ### SQL Support For `AS OF RECENT SYSTEM TIME` Queries
Should we eagerly reject this grammar when an enterprise license is missing as well? That's probably a better experience than it just silently not hitting a follower read.
docs/RFCS/20181227_follower_reads_implementation.md, line 156 at r1 (raw file):
`init`. ### Abstract replica selection for SQL physical planning.
@andreimatei and @RaduBerinde should take a look at this.
The big thing that jumps out to me is that we have this logic living in two different places (which isn't new). As we add more complex routing policies, this becomes a bigger issue, especially when we want the decisions from the policies to line up (e.g. we'd hate to route a processor to a local node only to have its DistSender send requests to a remote node). I don't remember the history of these `leaseHolderOracle`s, but I wonder whether it's time to revisit whether they can be merged with the same kind of logic in `DistSender`.
docs/RFCS/20181227_follower_reads_implementation.md, line 247 at r1 (raw file):
One potential downside of this approach is that in an edge case it may have the potential to have a detrimentally affect cluster performance in the case of
"effect on cluster performance"
docs/RFCS/20181227_follower_reads_implementation.md, line 260 at r1 (raw file):
increasing the target multiple. The problem seems worse as the replication factor increases beyond 3 to numbers like 7 or 9. Furthermore even if the increased load during this bursty period does not meaningfully effect OLTP
"affect"
docs/RFCS/20181227_follower_reads_implementation.md, line 273 at r1 (raw file):
additionally require new mechanisms to mark as known to be safe for follower reads. Furthermore the state tracking may be prohibitively expensive on large clusters.
Another alternative would be to dynamically adjust the target multiple as part of a feedback loop with follower read hits and misses as an input. Ideally, this would allow us to adapt to changes to closed timestamp update latency.
docs/RFCS/20181227_follower_reads_implementation.md, line 7 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
#16593 and maybe others.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 23 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/stem/stems/
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 59 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We had it all the way down at 5 seconds before, but saw test flakes because of it. We bumped it because we weren't making use of it, but I expect that we will drop the default to 5-10 seconds once this work goes in.
Do we understand why it was leading to test flakes? Is there work to be done there?
docs/RFCS/20181227_follower_reads_implementation.md, line 61 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"at to a"
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
occurs at some multiple?
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 68 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
You use the term "staleness" a few times here. It might be worth clarifying that you mean the "minimum staleness required for a follower read to be attempted".
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 83 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/queries/DistSQL processors/
s/evaluated/run/
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 91 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
other than READ_UNCOMMITTED or INCONSISTENT
This would just be `CONSISTENT`, but I think this is wrong. `READ_UNCOMMITTED` operations should also be evaluated on the leaseholder.
You are correct. Updated
docs/RFCS/20181227_follower_reads_implementation.md, line 100 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"follwer"
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 136 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I wonder if we should be taking `kv.closed_timestamp.close_fraction` into account as well. `kv.closed_timestamp.target_duration * kv.closed_timestamp.close_fraction` should give us the period of closed timestamp updates, so multiplying that period by some configurable multiple and adding it to the closed timestamp target duration seems like the right approach to conservatively estimate the closed timestamp threshold of a follower.
Seems reasonable to me. If I understand correctly, currently I have:
`recent := kv.closed_timestamp.target_duration * kv.follower_reads.target_multiple`
and you're suggesting
`recent := kv.closed_timestamp.target_duration * (1 + kv.closed_timestamp.close_fraction * kv.follower_reads.target_multiple)`
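For concreteness, a minimal Go sketch of the computation being discussed here (the helper name and placement are hypothetical; the inputs correspond to the cluster settings named above):

```go
package kv // hypothetical placement

import "time"

// recentOffset returns the negative offset from the current time at which a
// read is conservatively expected to be servable by a follower, i.e.
// -target_duration * (1 + close_fraction * target_multiple).
func recentOffset(targetDuration time.Duration, closeFraction, targetMultiple float64) time.Duration {
	return -time.Duration(float64(targetDuration) * (1 + closeFraction*targetMultiple))
}
```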
docs/RFCS/20181227_follower_reads_implementation.md, line 138 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we eagerly reject this grammar when an enterprise license is missing as well? That's probably a better experience than it just silently not hitting a follower read.
Sure, I updated the doc and will update the prototype.
docs/RFCS/20181227_follower_reads_implementation.md, line 247 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"effect on cluster performance"
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 260 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"affect"
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 273 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Another alternative would be to dynamically adjust the target multiple as part of a feedback loop with follower read hits and misses as an input. Ideally, this would allow us to adapt to changes to closed timestamp update latency.
Added some language to this effect.
Another thing I'm thinking is that we should add some monitoring to track when closed timestamps lag, to understand if and when it's actually a problem.
Force-pushed from 1df87ad to 6fa126e.
docs/RFCS/20181227_follower_reads_implementation.md, line 138 at r1 (raw file):
Previously, ajwerner wrote…
Sure, I updated the doc and will update the prototype.
Updated the prototype too.
docs/RFCS/20181227_follower_reads_implementation.md, line 7 at r1 (raw file):
Previously, ajwerner wrote…
Done.
nit: by convention (and to ease `grep`) we place the `#` before the issue/pr number
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r2 (raw file):
In order to ease the burden of the client determining an adequately old timestamp for use with an `AS OF SYSTEM TIME` query, this RFC introduces a new SQL syntax `AS OF RECENT SYSTEM TIME` which is effectively a syntactic
To be honest I am not too fond of a syntax extension to achieve this behavior. This prevents clients from 1) programmatically choosing one behavior over another based on other settings or a computation and 2) inspecting the timestamp at which the reads will be done.
Have you thought about using a built-in function for this: `AS OF SYSTEM TIME historical_timestamp()`, where `historical_timestamp()` returns the highest timestamp for which crdb guarantees follower reads (if enabled)?
This way clients can also inspect the timestamp with `SELECT historical_timestamp()` etc.
(The name "historical_timestamp" may need to be iterated over a little bit. Perhaps "latest_timestamp_for_distributed_reads()" would be more accurate/descriptive.)
The builtin approach seems cleaner than the syntax change. TFTR!
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r2 (raw file):
Previously, knz (kena) wrote…
To be honest I am not too fond of a syntax extension to achieve this behavior. This prevents clients from 1) programmatically choosing one behavior over another based on other settings or a computation and 2) inspecting the timestamp at which the reads will be done.
Have you thought about using a built-in function for this: `AS OF SYSTEM TIME historical_timestamp()`, where `historical_timestamp()` returns the highest timestamp for which crdb guarantees follower reads (if enabled)?
This way clients can also inspect the timestamp with `SELECT historical_timestamp()` etc.
I'm not at all wedded to the syntax extension. The builtin approach makes sense. It should also make the CCL injection somewhat less nasty.
achieve follower reads by employing stateless approximation of when a follower read is possible by assuming that a read may be directed to a follower if it occurs at some multiple of the target duration which is controlled by a new cluster setting `kv.follower_reads.target_multiple`. While this may ultimately
Have you considered consulting the gateway node's closed timestamp subsystem for information about the remote node? For example, if you're on n1 and you're deciding whether you can serve a follower read from a replica on n2, you can look at the closed timestamp of n2 to get a good idea. I think this is superior to making this purely timing-based, in the sense that it has protection against "slow nodes" built in. With a purely timing-based approach, it's more likely that you'll have to add state to remember when a node rejected follower reads that were thought possible.
Sorry, I mean that if you have a range that's on n1, n2, n3 and the leader is, say, n1, then you consider n2 and n3 for serving a follower read for timestamps that have closed out at n1 (perhaps with some padding), the idea being that all that's missing for n2 and n3 to serve the follower read is their catching up on the Raft log. There are still plenty of ways to get this wrong (a follower could be waiting for a snapshot, etc) but overall it offers a little more protection than going blindly by target duration.
Interestingly, the way I started writing this RFC was with the stateless approach as the rejected alternative and a more stateful approach as the proposal. After talking to Nathan I was encouraged to flip it, if only because the proposed approach here is simpler to implement and can be nuanced later.
A problem as far as I can tell with relying on node level closed timestamp arises when there's a range that's on n2, n3, and n4 and the gateway node receiving the request is on n1. n1 won't have information on the state of the closed ts for the range in question and thus still falls back to having to guess. Using the closed timestamp tracking on the node level could enable the recent timestamp to be less conservative.
I'm curious how important it is that this feature seek to be as up to date as possible. I wonder if clients opting in to this functionality will make their decision based on whether the data is 10s vs 60s stale.
the idea being that all that's missing for n2 and n3 to serve the follower read is their catching up on the Raft log
When we hit this situation, the receiving replica won't know that the request had some reason to believe that a follower read was safe so this will still result in a NotLeaseHolderError. Are you suggesting that we should provide some mechanism to mark a request as expecting to be safe and then wait at the follower for it to catch up?
traffic, it may lead to potentially massively increased latency for queries which in the previous regime had been fast.
A more sophisticated mechanism which statefully tracks a closed timestamps on a
This is roughly where I was going above, except that I don't think you'd need to track individual replicas (which I agree is not a good thing to buy into). Looking at the node-wide closed timestamp alone should be sufficient, at least for now.
docs/RFCS/20181227_follower_reads_implementation.md, line 59 at r1 (raw file):
Previously, ajwerner wrote…
Do we understand why it was leading to test flakes? Is there work to be done there?
It was leading to flakes because tests weren't prepared to have their transactions' timestamps pushed forwards. The work there is just to fix the tests.
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, ajwerner wrote…
This smells a lot like `AS OF SYSTEM TIME '-30s'`. To me the `RECENT` syntax is expressly for the client to ask for follower reads. If they want to control the staleness we already have a syntax for that. Assuming the chosen time is adequately old, a follower read will happen.
I tend to agree with @ajwerner. If `RECENT` had an upper bound of `now()` given a sufficiently up-to-date follower and was determined dynamically by follower progress then something like `AS OF RECENT SYSTEM TIME AT MOST '5s' AGO` would make a lot of sense. However, since we need to hit the leaseholder for any query within the `kv.closed_timestamp.target_duration` to ensure RW conflicts are properly handled (given our current non-distributed timestamp cache*) and because `RECENT` isn't being determined based on follower progress, I don't see a huge reason to add in the syntax.
* The idea of permitting follower reads within the `kv.closed_timestamp.target_duration` while still handling RW conflicts correctly is a different discussion that would be really interesting to have at some point! I think a workable solution would look closer to quorum leases — where multiple replicas would hold leases at a given time, each would need to be a part of all replication quorums, and each would need to record reads in a local timestamp cache-like structure and be prepared to reject incoming proposals in the event of RW conflicts.
docs/RFCS/20181227_follower_reads_implementation.md, line 136 at r1 (raw file):
Previously, ajwerner wrote…
Seems reasonable to me. If I understand correctly, currently I have:
`recent := kv.closed_timestamp.target_duration * kv.follower_reads.target_multiple`
and you're suggesting
`recent := kv.closed_timestamp.target_duration * (1 + kv.closed_timestamp.close_fraction * kv.follower_reads.target_multiple)`
Yes, that's exactly right.
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r2 (raw file):
A problem as far as I can tell with relying on node level closed timestamp arises when there's a range that's on n2, n3, and n4 and the gateway node receiving the request is on n1. n1 won't have information on the state of the closed ts for the range in question and thus still falls back to having to guess. Using the closed timestamp tracking on the node level could enable the recent timestamp to be less conservative.
Yes, that was my understanding of the problem as well. It's likely that that's just because I don't have the details of the closed timestamp subsystem paged in. Does a node have enough information to make one of these kinds of decisions accurately (or close to accurately) when it doesn't hold a replica in the Range?
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I tend to agree with @ajwerner. If `RECENT` had an upper bound of `now()` given a sufficiently up-to-date follower and was determined dynamically by follower progress then something like `AS OF RECENT SYSTEM TIME AT MOST '5s' AGO` would make a lot of sense. However, since we need to hit the leaseholder for any query within the `kv.closed_timestamp.target_duration` to ensure RW conflicts are properly handled (given our current non-distributed timestamp cache*) and because `RECENT` isn't being determined based on follower progress, I don't see a huge reason to add in the syntax.
* The idea of permitting follower reads within the `kv.closed_timestamp.target_duration` while still handling RW conflicts correctly is a different discussion that would be really interesting to have at some point! I think a workable solution would look closer to quorum leases — where multiple replicas would hold leases at a given time, each would need to be a part of all replication quorums, and each would need to record reads in a local timestamp cache-like structure and be prepared to reject incoming proposals in the event of RW conflicts.
Do we have design partners for follower reads? This is exactly the sort of thing it's handy to get external feedback on.
(With my Foursquare hat on, I happen to find justin's suggestion of an optional `AT MOST <> AGO` suffix compelling. `AS OF SYSTEM TIME '-30s'` means you're always running 30s stale, whereas `AS OF RECENT SYSTEM TIME AT MOST '30s' AGO` is probably running more recent than that but you still don't have to worry about it being an hour stale.)
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
Do we have design partners for follower reads? This is exactly the sort of thing it's handy to get external feedback on.
(With my Foursquare hat on, I happen to find justin's suggestion of an optional
AT MOST <> AGO
suffix compelling.AS OF SYSTEM TIME '-30s'
means you're always running 30s stale, whereasAS OF RECENT SYSTEM TIME AT MOST '30s' AGO
is probably running more recent than that but you still don't have to worry about it being an hour stale.)
With the recent change to a function, this now seems potentially expressible in SQL with `min`.
I do think there's probably a conversation to be had about the implication of allowing the expr in AS OF SYSTEM TIME to be non-constant as this proposal does. Does that raise any red flags? Do we have a middle ground between constant-only and all expressions which would enable using built-ins but not reading from the database?
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r2 (raw file):
Previously, ajwerner wrote…
I'm not at all wedded to the syntax extension. The builtin approach makes sense. It should also make the CCL injection somewhat less nasty.
Done.
Force-pushed from 6fa126e to 1c91579.
I won't be able to give this a closer look this week, but I think I'm on the same page with you and Nathan. Let me know if you need my eyes on this again.
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r2 (raw file):
Yeah, this is a little tricky -- the way I was thinking about this is that if we're just talking about the closed timestamp (i.e. no MLAIs), then this is cheap enough to distribute throughout. But right now you either distribute nothing or everything, and while in a "random cluster" usually all nodes talk to all other nodes, this doesn't need to be true in general (and shouldn't be true in strictly partitioned environments, ideally).
Are you suggesting that we should provide some mechanism to mark a request as expecting to be safe and then wait at the follower for it to catch up?
No, I was suggesting adding some padding to the closed timestamp before deciding to serve the read and to hope that that will work well enough for the time being, but that may not be enough. Dealing with stragglers will need some state at DistSender as this becomes a problem with either approach.
I agree that there are enough concerns to KISS for now (i.e. go threshold based). It's straightforward enough to retrofit if a desire for using the information we have in principle becomes stronger.
Force-pushed from 4f646ff to 2e10d41.
docs/RFCS/20181227_follower_reads_implementation.md, line 59 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It was leading to flakes because tests weren't prepared to have their transactions' timestamps pushed forwards. The work there is just to fix the tests.
I'll create a separate issue for this to lower the time and track down the flakes.
docs/RFCS/20181227_follower_reads_implementation.md, line 136 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes, that's exactly right.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 265 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is roughly where I was going above, except that I don't think you'd need to track individual replicas (which I agree is not a good thing to buy into). Looking at the node-wide closed timestamp alone should be sufficient, at least for now.
Done.
@knz I've updated the RFC and prototype to use a function rather than a syntax change as suggested, PTAL. I called it `recent_timestamp`, which is certainly less descriptive than your suggestion, and am open to further iteration.
LGTM
docs/RFCS/20181227_follower_reads_implementation.md, line 156 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
@andreimatei and @RaduBerinde should take a look at this.
The big thing that jumps out to me is that we have this logic living in two different places (which isn't new). As we add more complex routing policies, this becomes a bigger issue, especially when we want the decisions from the policies to line up (e.g. we'd hate to route a processor to a local node only to have its DistSender send requests to a remote node). I don't remember the history of these `leaseHolderOracle`s, but I wonder whether it's time to revisit whether they can be merged with the same kind of logic in `DistSender`.
I agree that the more we can unify `DistSender` and DistSQL planning code, the better. Can the `DistSender` be made to use the oracle for determining the first replica it should send to? I don't know; all I can say is that I too encourage this exploration :)
If there's nothing better to be done at the moment, the interfaces described here don't seem too bad to me - it's not significantly more complexity and duplication than we already have :)
docs/RFCS/20181227_follower_reads_implementation.md, line 11 at r4 (raw file):
Follower reads are consistent reads at historical timestamps from follower replicas. They make the non-leader replicas in a range suitable sources for
you use "leader" and "follower" here, but this is all about leaseholder / non-leaseholder, not Raft leaders and followers, right?
docs/RFCS/20181227_follower_reads_implementation.md, line 65 at r4 (raw file):
example, if the `target_duration` is 30s and the `close_fraction` is 0.2 then the subsystem will try to keep the closed timestamp 30s behind real time and will try to update the value every 6s. This proposal seeks to achieve follower
So does that mean that the closed timestamp will lag by between 24s and 30s, or 30-36s? Can you clarify?
docs/RFCS/20181227_follower_reads_implementation.md, line 142 at r4 (raw file):
The setting represents the tradeoff between staleness of `RECENT` queries and
`RECENT`? Stale syntax?
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
### SQL `recent_timestamp()` builtin A new SQL builtin `recent_timestamp()` is added to call through to an injected
I think the naming of this function should be tied to "follower reads". How about `follower_lag_target()`. And one would use it with `select as of system time now() - follower_lag_target()`. In the future we might want different flavors of this too - e.g. timestamp at which the read is guaranteed to be served locally, most recent timestamp possible at which the read will be served locally.
Also I think the function should work just fine in non-licensed clusters. You don't want the app to crash when going from a licensed cluster to an unlicensed one (think dev testing). Just that there'd be no follower reads. And it'd be useful - in testing I can see the effects of stale reads on the user experience.
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
to send batch requests to current lease holders which may prevent reads from going to nearby follower replicas. In order to inform the DistSender that it can send a batch request to a follower we add a new global var in the kv package
I think it'd be really good if we can avoid having this function ptr be a global. This will be a testing nightmare, as these globals always are. Please find a way to scope the hook to a `Server`.
For example, we can have packages register "server startup hooks" (and the list of hooks itself can be a global slice), and then `Server.Start()` would call these hooks and pass itself as an argument. This has the advantage that once you have a `Server`, it's insulated from any changes to any globals.
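A minimal Go sketch of that hook-registration pattern (all names hypothetical):

```go
package server

// Hypothetical sketch: CCL packages append hooks to a package-level slice
// from their init() functions, and Server.Start() invokes each hook with the
// concrete Server, so the injected behavior ends up scoped to that Server
// rather than to free-floating globals.
type Server struct{ /* fields elided */ }

var startupHooks []func(*Server)

// RegisterStartupHook is intended to be called from CCL init() functions,
// which run before any Server is started.
func RegisterStartupHook(h func(*Server)) {
	startupHooks = append(startupHooks, h)
}

// Start runs the registered hooks, then proceeds with normal startup.
func (s *Server) Start() {
	for _, h := range startupHooks {
		h(s)
	}
	// ... remainder of server startup ...
}
```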
But actually, can we relitigate the decision to only have CCL code in separate packages? It's been known to cause avalanches of exports and headaches, and I'm not sure there's a good reason for it. We'd be much better off in my opinion (and it seems that Andrew agrees) if we had this code here, for example, be in the kv package guarded by a build tag. Is there any lawyerly reason not to do that? @bdarnell @danhhz
docs/RFCS/20181227_follower_reads_implementation.md, line 156 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I agree that the more we can unify `DistSender` and DistSQL planning code, the better. Can the `DistSender` be made to use the oracle for determining the first replica it should send to? I don't know; all I can say is that I too encourage this exploration :)
If there's nothing better to be done at the moment, the interfaces described here don't seem too bad to me - it's not significantly more complexity and duplication than we already have :)
I'll see what I can do.
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yeah, this is a little tricky -- the way I was thinking about this is that if we're just talking about the closed timestamp (i.e. no MLAIs), then this is cheap enough to distribute throughout. But right now you either distribute nothing or everything, and while in a "random cluster" usually all nodes talk to all other nodes, this doesn't need to be true in general (and shouldn't be true in strictly partitioned environments, ideally).
Are you suggesting that we should provide some mechanism to mark a request as expecting to be safe and then wait at the follower for it to catch up?
No, I was suggesting adding some padding to the closed timestamp before deciding to serve the read and to hope that that will work well enough for the time being, but that may not be enough. Dealing with stragglers will need some state at DistSender as this becomes a problem with either approach.
I agree that there are enough concerns to KISS for now (i.e. go threshold based). It's straightforward enough to retrofit if a desire for using the information we have in principle becomes stronger.
Ack.
docs/RFCS/20181227_follower_reads_implementation.md, line 11 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
you use "leader" and "follower" here, but this is all about leaseholder / non-leaseholder, not Raft leaders and followers, right?
You are correct. Updated.
docs/RFCS/20181227_follower_reads_implementation.md, line 65 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
So does that mean that the closed timestamp will lag by between 24s and 30s, or 30-36s? Can you clarify?
I will do some research and get back to you.
docs/RFCS/20181227_follower_reads_implementation.md, line 142 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
`RECENT`? Stale syntax?
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think the naming of this function should be tied to "follower reads". How about `follower_lag_target()`. And one would use it with `select as of system time now() - follower_lag_target()`. In the future we might want different flavors of this too - e.g. timestamp at which the read is guaranteed to be served locally, most recent timestamp possible at which the read will be served locally.
Also I think the function should work just fine in non-licensed clusters. You don't want the app to crash when going from a licensed cluster to an unlicensed one (think dev testing). Just that there'd be no follower reads. And it'd be useful - in testing I can see the effects of stale reads on the user experience.
This is a place where having a design partner would be nice. It might be worth pulling in some sales engineers and waiting for Andy to get back.
Force-pushed from 2e10d41 to eef11b6.
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, ajwerner wrote…
This is a place where having a design partner would be nice. It might be worth pulling in some sales engineers and to wait for Andy to get back.
FWIW, I wouldn't expect exceptional input from 3rd parties on the topic :)
Without trying to influence your thoughts on what makes the most sense, I think we should start by implementing the things that are the simplest and then add sugar on top as needed. And the way I see it, the simplest thing is to expose the notion of the target lag (as an `interval`) and let it be composed with `now()`.
If, say, we were to use a closed timestamp publicized by some other node, then that would make sense to be exposed by a function returning a `timestamptz`.
Force-pushed from eef11b6 to 9787909.
Force-pushed from 9787909 to 15f13c6.
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, ajwerner wrote…
With the recent change to a function, this now seems potentially expressible in SQL with `min`.
I do think there's probably a conversation to be had about the implication of allowing the expr in AS OF SYSTEM TIME to be non-constant as this proposal does. Does that raise any red flags? Do we have a middle ground between constant-only and all expressions which would enable using built-ins but not reading from the database?
I remember there was some reason we couldn't allow non-constant expressions involving `now()` in AOST (which is why we had to build in specific support for negative intervals). We may want to treat `recent_timestamp()` as a special case instead of opening the door to arbitrary expressions.
One caveat with non-constant expressions here is that if you prepare a statement that uses `AOST recent_timestamp()`, the timestamp should presumably get reevaluated every time the statement is executed (which may in turn invalidate any cached query plan).
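To make the caveat concrete, here is a hypothetical client-side sketch in Go (`recent_timestamp()` is the builtin proposed in this RFC; the table `t` and the helper are made up):

```go
package clientdemo // hypothetical

import "database/sql"

// readTwice prepares a statement once and executes it twice. Per the caveat
// above, recent_timestamp() should presumably be re-evaluated on each
// execution, so the two reads may observe different (but both recent)
// timestamps.
func readTwice(db *sql.DB) error {
	stmt, err := db.Prepare(
		`SELECT k, v FROM t AS OF SYSTEM TIME recent_timestamp()`)
	if err != nil {
		return err
	}
	defer stmt.Close()
	for i := 0; i < 2; i++ {
		rows, err := stmt.Query()
		if err != nil {
			return err
		}
		rows.Close()
	}
	return nil
}
```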
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
FWIW, I wouldn't expect exceptional input from 3rd parties on the topic :)
Without trying to influence your thoughts on what makes the most sense, I think we should start by implementing the things that are the simplest and then add sugar on top as needed. And the way I see it, the simplest thing is to expose the notion of the target lag (as an `interval`) and let it be composed with `now()`.
If, say, we were to use a closed timestamp publicized by some other node, then that would make sense to be exposed by a function returning a `timestamptz`.
+1 to calling this something like `follower_timestamp()`. The user doesn't want "recent" results; they get those by default. They want to be able to read from a follower.
Users may want to read from a follower for different reasons. One is lower latency for geo-distributed clusters, which is what this proposal addresses. But users also often want to read from a follower to avoid adding load to the leader (I'm not sure whether this is worthwhile in the CRDB model where every node is leader for some portion of the data, but it's a very common ask/expectation). We may want to think about exposing some way to read from a follower even if the leader is local.
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think it'd be really good if we can avoid having this function ptr be a global. This will be a testing nightmare, as these globals always are. Please find a way to scope the hook to a `Server`.
For example, we can have packages register "server startup hooks" (and the list of hooks itself can be a global slice), and then `Server.Start()` would call these hooks and pass itself as an argument. This has the advantage that once you have a `Server`, it's insulated from any changes to any globals.
But actually, can we relitigate the decision to only have CCL code in separate packages? It's been known to cause avalanches of exports and headaches, and I'm not sure there's a good reason for it. We'd be much better off in my opinion (and it seems that Andrew agrees) if we had this code here, for example, be in the kv package guarded by a build tag. Is there any lawyerly reason not to do that? @bdarnell @danhhz
I think it's important that we have a simple pattern for identifying all CCL files, so that pure-OSS distributions can discard them. That doesn't necessarily mean separate packages, but the idea was that using packages can make it easier to enforce that there are no references from apache code to CCL code (build tags also made a mess of the build cache in older versions of go, although that's better now). I think if we relax the current rule and start to mix CCL and non-CCL code in the same package we're going to have to do more testing of pure-OSS builds than we do now.
I also think that when the plumbing seems like a lot of trouble relative to the amount of code on the CCL side, there is little value in actually using the CCL and it's better to just leave it OSS.
docs/RFCS/20181227_follower_reads_implementation.md, line 111 at r5 (raw file):
follower replicas, it does not enable running entire (read-only) transactions at a single point in time and thus benefitting from the performance gains offerred by follower reads. This document proposes an extension to the `SET TRANSACTION`
Most parts of the SET TRANSACTION statement can also be applied to the BEGIN statement. Will that be possible here too?
docs/RFCS/20181227_follower_reads_implementation.md, line 123 at r5 (raw file):
BEGIN;
SET TRANSACTION AS OF SYSTEM TIME recent_timestamp(), PRIORITY HIGH;
I like this; it will also help applications that use ORMs that make it difficult to inject the AOST clause into a generated SELECT statement.
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
## Detailed Design ### The `kv.follower_reads.target_multiple` Cluster Setting
I don't really like this setting - the application developer will be thinking in terms of "I can tolerate staleness of up to X seconds", but using this setting requires a bunch of math with other values that are mostly opaque/hidden from the app (and which seem like implementation details subject to change in the future). I think that `recent_timestamp()` should probably be a black box for users who don't care too much about how old their results are, and users who care should be able to pass in their staleness bound directly (maybe add an optional argument: `recent_timestamp('10s')`).
docs/RFCS/20181227_follower_reads_implementation.md, line 354 at r5 (raw file):
detecting NotLeaseHolderErrors for queries which expected to hit followers. This could mitigate the flood of consistent reads in the face of lagging closed timestamps but would make the semantics of `recent_timestamp()` harder to
FWIW, this would make the semantics harder to understand, but based on my conversations with potential users of this feature it's what they generally want. Users who are interested in follower reads tend to have a very high upper bound for staleness but they don't want to fix their reads at that worst-case value. They'd like timestamp selection to be somehow dynamic.
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I remember there was some reason we couldn't allow non-constant expressions involving `now()` in AOST (which is why we had to build in specific support for negative intervals). We may want to treat `recent_timestamp()` as a special case instead of opening the door to arbitrary expressions.
One caveat with non-constant expressions here is that if you prepare a statement that uses `AOST recent_timestamp()`, the timestamp should presumably get reevaluated every time the statement is executed (which may in turn invalidate any cached query plan).
We need to be a bit careful to separate intent from mechanism.
Intent:
1. Does it matter for the MVP of this feature that the user be able to place bounds on staleness? All this discussion sounds like "I wish" from nathan and dan but I'd like this to be supported by our customer design partners. In general, any subtlety in the UX specs will require extra complexity in the implementation. I'd like to shed all non-required complexity from the MVP, otherwise the discussion (below) on mechanism will distract from the main task.
2. Once the UX requirements are known, the plan caching semantics need to be specified. If the UX requirement is that the new feature places a bound on staleness, then we must specify here in the RFC that the cached plan be invalidated adequately (an old plan can also be stale, as much as the data!). Once you spell out the staleness requirements, we can discuss the ways to achieve them.
3. That being said, a part of the intent will be to not stray too much from our txn model so as to not create a super long impl path to an MVP. A part of the txn model is that the ASOT expression be an expression that's a pure function of the txn start timestamp. So whatever you choose, it would be great to express the result of that expression as a (math) function of the start timestamp; that will make the reading much clearer.
(Note: I am discussing this in this way because the exchange above was like a tail wagging the dog, trying to invent deliverable requirements based on perceived notions about impl restrictions. Beware! And this also yields point 4:)
4. Regarding the name `recent_timestamp` vs `follower_timestamp`: obviously the name `follower_timestamp` is intimately linked to an implementation choice, and in general naming user-visible functions after the mechanism is poor design. I can make do with the idea that the word "recent" is inadequate, but "follower" is even more inadequate. That said, finding names is something we've done well in the past: once you have clear answers to points 1, 2, 3 above, a good name will fall out of that tree on its own.
Mechanism:
A) What ben says; we're not allowing arbitrary SQL expressions in the ASOT clause because doing so would be hard. We can't reuse the regular typing logic, no subqueries, no nothing in those expressions (for different reasons that will become clearer when you become acquainted with the code), so whatever you want to evaluate you have to do manually. Today the code is limited to supporting a few types of literals. So the idea to use an explicit SQL `min()` func app would mandate re-implementing an eval logic for that. That's A.1) cumbersome and A.2) a road I'm not keen to follow, because we don't exactly have bandwidth to support a second implementation of scalar evaluation.
B) It is possible however to make a special case for a built-in function (or several) without arguments, or one applied to literal values (e.g. you could make your `recent_timestamp` function take the bound, if desired by point 2 above, as an optional argument).
Design principle / best practices
X) Using scalar expression syntax (especially built-in functions) gives us more flexibility to grow and add other things alongside it later: we know from experience that adding keywords in the SQL grammar expecting to combine both sanity and future extensibility is like holding a shotgun towards your face and another towards your feet, pulling both triggers at once, and expecting something healthy to come out.
Y) Client code knows "well" how to assemble scalar expressions, typically less "well" how to assemble arbitrary pieces of syntax.
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
+1 to calling this something like `follower_timestamp()`. The user doesn't want "recent" results; they get those by default. They want to be able to read from a follower.
Users may want to read from a follower for different reasons. One is lower latency for geo-distributed clusters, which is what this proposal addresses. But users also often want to read from a follower to avoid adding load to the leader (I'm not sure whether this is worthwhile in the CRDB model where every node is leader for some portion of the data, but it's a very common ask/expectation). We may want to think about exposing some way to read from a follower even if the leader is local.
See my comment above - the word "follower" is implementation-specific and not a good choice.
Start tying down what you want to achieve, and discuss names only after that.
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I think it's important that we have a simple pattern for identifying all CCL files, so that pure-OSS distributions can discard them. That doesn't necessarily mean separate packages, but the idea was that using packages can make it easier to enforce that there are no references from apache code to CCL code (build tags also made a mess of the build cache in older versions of go, although that's better now). I think if we relax the current rule and start to mix CCL and non-CCL code in the same package we're going to have to do more testing of pure-OSS builds than we do now.
I also think that when the plumbing seems like a lot of trouble relative to the amount of code on the CCL side, there is little value in actually using the CCL and it's better to just leave it OSS.
I think separate packages are better for the ease of excising without pain.
However another general principle applies, which has been discussed in SQL circles but not yet ventilated to the rest of the org:
everything about planning should be core (non-CCL) and then we can restrict some parts of execution to CCL packages.
The planning structures should be core because the query planning has changed shape recently (CBO) in a way that makes it non-sensical (and outright insane) to try to split it in different packages.
(FYI we'll need to implement this change for all the other CCL statements, incl backup/restore/import/export, before they can be subsumed by the CBO. ASOT is constrained in the same way.)
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
Previously, knz (kena) wrote…
I think separate packages are better for the ease of excising without pain.
However another general principle applies, which has been discussed in SQL circles but not yet ventilated to the rest of the org:
everything about planning should be core (non-CCL) and then we can restrict some parts of execution to CCL packages.
The planning structures should be core because the query planning has changed shape recently (CBO) in a way that makes it nonsensical (and outright insane) to try to split it across different packages.
(FYI we'll need to implement this change for all the other CCL statements, incl backup/restore/import/export, before they can be subsumed by the CBO. ASOT is constrained in the same way.)
If we want to revisit how the various ccl hooks work, so they're not just globals, I'm all for that. I think the package split has worked quite well though. We tried build tags as one of the options when we were setting this up initially and it was obnoxious from an "editors and tooling" perspective, which made it a hassle to work on enterprise code.
The remaining comments boil down to the following points, which I've ordered according to my view of their relative importance for resolution. I'd like to tackle the first two issues before the second two. Maybe I should have kept these going inline below, but I felt like I needed to step back and take stock of the feedback. If it makes it easier to follow, feel free to jump back down to the inline discussion.

1. Timestamp selection:
   - As far as this proposal is concerned, picking a timestamp in SQL and determining whether a timestamp is safe to perform against followers are two sides of the same coin.
     - There exists some function `f()` which returns a timestamp that is intended to be safe for follower reads.
     - If `TxnTs < (f() + max clock offset)`, then distsqlplan and kv/DistSender ought to consider the txn's read requests as candidates for follower reads (see the sketch just after this list).
     - The big question of this proposal then is: what is `f()`?
     - If this basic design element is not agreed upon, the below questions are less relevant. I'm assuming that at least this facet of the problem is not controversial.
   - The current proposal is a static, stateless mechanism based solely on cluster settings.
     - + Simple to implement.
       - - So simple as to hardly justify the injection required to make a CCL package.
     - + Relatively straightforward to understand semantically.
     - - Can't guarantee that directing a request at a follower will be safe for performing a read.
     - - Exposes a difficult-to-use, implementation-specific cluster setting.
       - The design could be revised to hide this setting, but that further reduces client control over how this mechanism works.
       - If there's no cluster setting to control the fixed staleness of the timestamp, how do we pick a goldilocks value?
     - - Likely to be too conservative in the common case and still may not be conservative enough in other cases.
   - The other end of the spectrum would be a mechanism that explicitly tracks the closed timestamps of all stores on each node.
     - + Much higher confidence that reads will actually be safe to perform against a follower.
     - + Seems to better align with Ben's understanding of customer demand.
       - Customers may have some maximum tolerance for staleness which is quite high but would prefer a mechanism which dynamically chooses a timestamp?
     - - Introduces a failure mode when nodes start up and don't have a value?
       - Error?
       - Fall back to the stateless solution above?
     - - Much more involved from an implementation perspective.
       - Given that the data amounts to keeping a map of store to timestamp and updating it per store on the order of >1s, it feels morally like building such a system should not be an undue burden on the cluster.
     - + At least will have enough code to comfortably justify a CCL package.
     - - Would have somewhat harder-to-reason-about semantics with regard to staleness.
       - This solution may benefit greatly from a client-specified upper bound on staleness, though such an upper bound somewhat defeats the purpose of exact timestamp selection if we are hitting it with any frequency.
   - A middle ground might be a solution which dynamically tweaks the offset based on feedback. Imagine we allow nodes to dynamically alter the `target_multiple`, initially setting it to a relatively aggressive value (1?) and then increasing it on a per-node basis via a feedback mechanism whereby the DistSender records when requests it expects to be safe for follower reads return NotLeaseHolderError.
     - + Is dynamic with regard to the staleness of the timestamp based on cluster status, and thus aligns better with our understanding of customer expectations.
     - + Could be implemented with an easier-to-understand cluster setting about max staleness.
     - + Is simpler to implement than a stateful timestamp-tracking mechanism.
     - - Is not able to make strong claims about whether a timestamp will be safe for a read.
     - - Is probably the least easy to understand with regard to semantics.
   - Are there other approaches I'm not thinking of?
   - Would it be okay to pursue a simpler approach at first and then nuance it?
     - Would that imply taking extra care not to expose implementation-specific settings/details?

2. What sorts of expressions are valid for AOST? What are the plan caching semantics?
   - Currently AOST supports just constant expressions.
     - Though it's worth noting that they can be relative offsets which are computed relative to the statement time.
   - Let's work under the assumption that in addition to the current constant expressions we'll accept at least a scalar function call of no arguments which operates solely on the statement timestamp.
   - This should have the same planning implications as today's '-2m' AOST queries.
     - Do we do these correctly today?
     - Is there something I don't understand that would make a builtin which is pure w.r.t. statement time different from '-2m'?
   - Let's table the idea that you can use arbitrary function expressions such as `min`, for the reasons Raphael mentioned. If we need an upper bound, we can implement that via an argument to our function.
   - If we do the right thing for planning with prepared statements which include `AS OF SYSTEM TIME '-2m'`, it ought to be trivial to do the right thing with a builtin which is a pure function of the statement timestamp.

3. What do we call this function which will return a timestamp?
   - This is an important thing to get right but something I'd like to push to the very end, because it's something of a leaf concern that will be easier to discuss when all the other details have been figured out (Raphael's point).

4. Is there a better mechanism for CCL injection?
   - This is important, but whatever the answer is, it shouldn't be too hard to accommodate.
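To ground point 1, here is a minimal runnable sketch of the stateless candidacy check as stated above. All names and values are hypothetical stand-ins (`targetDuration`, `targetMultiple`, `maxClockOffset`, `canUseFollowerReads`), not the actual CockroachDB API:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the cluster settings discussed above.
const (
	targetDuration = 30 * time.Second       // closed timestamp target duration
	targetMultiple = 3.0                    // multiple applied as a safety margin
	maxClockOffset = 500 * time.Millisecond // maximum tolerated clock offset
)

// followerReadOffset plays the role of f() relative to statement time:
// how far behind "now" a timestamp must be before we consider serving
// it from a follower.
func followerReadOffset() time.Duration {
	return -time.Duration(targetMultiple * float64(targetDuration))
}

// canUseFollowerReads applies the candidacy condition from point 1:
// TxnTs < (f() + max clock offset).
func canUseFollowerReads(now, txnTs time.Time) bool {
	threshold := now.Add(followerReadOffset()).Add(maxClockOffset)
	return txnTs.Before(threshold)
}

func main() {
	now := time.Now()
	fmt.Println(canUseFollowerReads(now, now.Add(-2*time.Minute)))  // true: old enough
	fmt.Println(canUseFollowerReads(now, now.Add(-10*time.Second))) // false: too recent
}
```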
Reviewable status: complete! 0 of 0 LGTMs obtained
docs/RFCS/20181227_follower_reads_implementation.md, line 111 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Most parts of the SET TRANSACTION statement can also be applied to the BEGIN statement. Will that be possible here too?
Sure, I see no reason not to. Will update.
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I don't really like this setting - the application developer will be thinking in terms of "I can tolerate staleness of up to X seconds", but using this setting requires a bunch of math with other values that are mostly opaque/hidden from the app (and which seem like implementation details subject to change in the future). I think that `recent_timestamp()` should probably be a black box for users who don't care too much about how old their results are, and users who care should be able to pass in their staleness bound directly (maybe add an optional argument: `recent_timestamp('10s')`).
Noted. See my high level comment.
I met with Ben this morning primarily to discuss timestamp selection. The conclusion is that, while less than perfect, the proposed function, which produces a timestamp that lags the statement time by an interval derived from cluster settings, is a reasonable path forward for an initial implementation. The intention is to hide the implementation details of exactly how this timestamp selection works so that over time we can make the mechanism more sophisticated. That resolves point 1 above.
As for point 3, Ben expressed that `recent` carries no useful meaning to the user, and that because the intention of this function is so directly related to the desire to perform follower reads, the name ought to be obviously relatable to the feature.
I'm still working to better understand point 2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @justinj, @nvanbenschoten, and @tbg)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @justinj, @knz, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, knz (kena) wrote…
We need to be a bit careful to separate intent from mechanism.
Intent:
1. Does it matter for the MVP of this feature that the user be able to place bounds on staleness? All this discussion sounds like "I wish" from Nathan and Dan, but I'd like this to be supported by our customer design partners. In general, any subtlety in the UX specs will require extra complexity in the implementation. I'd like to shed all non-required complexity from the MVP, otherwise the discussion (below) on mechanism will distract from the main task.
2. Once the UX requirements are known, the plan caching semantics need to be specified. If the UX requirement is that the new feature places a bound on staleness, then we must specify here in the RFC that the cached plan be invalidated adequately (an old plan can also be stale, as much as the data!). Once you spell out the staleness requirements, we can discuss the ways to achieve them.
3. That being said, a part of the intent will be to not stray too far from our txn model so as to not create a super long impl path to an MVP. A part of the txn model is that the ASOT expression be a pure function of the txn start timestamp. So whatever you choose, it would be great to express the result of that expression as a (math) function of the start timestamp; that will make the reading much clearer.
(Note: I am discussing it this way because the exchange above was like a tail wagging the dog, trying to invent deliverable requirements based on perceived notions about impl restrictions. Beware! And this yields also point 4:)
4. Regarding the name `recent_timestamp` vs `follower_timestamp`: obviously the name `follower_timestamp` is intimately linked to an implementation choice, and in general naming user-visible functions after the mechanism is poor design. I can make do with the idea that the word "recent" is inadequate, but "follower" is even more inadequate. That said, finding names is something we've done well in the past: once you have clear answers to points 1, 2, and 3 above, a good name will fall out of that tree on its own.
Mechanism:
A) What Ben says: we're not allowing arbitrary SQL expressions in the ASOT clause because doing so would be hard. We can't reuse the regular typing logic, no subqueries, no nothing in those expressions (for different reasons that will become clearer when you become acquainted with the code), so whatever you want to evaluate you have to do manually. Today the code is limited to supporting a few types of literals. So the idea to use an explicit SQL `min()` function application would mandate re-implementing eval logic for that. That's A.1) cumbersome, A.2) a road I'm not keen to follow, because we don't exactly have bandwidth to support a second implementation of scalar evaluation.
B) It is possible however to make a special case for a built-in function (or several) without arguments, or one applied to literal values (e.g. you could make your `recent_timestamp` function take the bound, if desired by point 2 above, as an optional argument).
Design principles / best practices:
X) Using scalar expression syntax (especially built-in functions) gives us more flexibility to grow and add other things alongside it later: we know from experience that adding keywords to the SQL grammar while expecting to combine both sanity and future extensibility is like holding a shotgun to your face and another to your feet, pulling both triggers at once, and expecting something healthy to come out.
Y) Client code knows "well" how to assemble scalar expressions, and typically less "well" how to assemble arbitrary pieces of syntax.
See below comment. The resolution thus far has been that we'll remove the setting and the implementation-specific documentation for `follower_read_timestamp()`. This is indeed a feature-specific function, but it's not necessarily an implementation-specific function. The intention is that the user should expect this function to provide as up-to-date a timestamp as the system thinks can be served by a follower read. Obviously in this initial implementation that guess is conservative, but the function leaves the door open to future implementations making it smarter.
The document now proposes that the expression in the AS OF clause is either a constant or an invocation of just that function, to keep the implementation straightforward.
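To illustrate those semantics from the client side, here is a hedged sketch of a single follower-read query in Go using `database/sql` with the `lib/pq` driver. The connection string and the `accounts` table are made up for illustration; the `follower_read_timestamp()` builtin and the `AS OF SYSTEM TIME` clause are as described in this thread:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver commonly used with CockroachDB
)

func main() {
	// Hypothetical connection string; adjust for a real cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Per the proposal, the AS OF clause accepts either a constant
	// expression or an invocation of follower_read_timestamp().
	rows, err := db.Query(
		`SELECT id, balance FROM accounts AS OF SYSTEM TIME follower_read_timestamp()`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var balance float64
		if err := rows.Scan(&id, &balance); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, balance)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```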
docs/RFCS/20181227_follower_reads_implementation.md, line 156 at r1 (raw file):
Previously, ajwerner wrote…
I'll see what I can do.
I didn't do much :/
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, knz (kena) wrote…
See my comment above - the word `follower` is implementation-specific and not a good choice. Start tying down what you want to achieve, and discuss names only after that.
After some discussions with Ben, the idea here is that, regardless of the exact mechanism by which the system chooses this timestamp, the intention of using this function is that it is likely to produce a timestamp which the system will deem safe for a follower read. That does not make the function implementation-specific, but it certainly makes it feature-specific. This function should be used exclusively by clients hoping to perform reads against followers. Given the feedback regarding not overly tying the functionality to
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
If we want to revisit how the various ccl hooks work, so they're not just globals, I'm all for that. I think the package split has worked quite well though. We tried build tags as one of the options when we were setting this up initially and it was obnoxious from an "editors and tooling" perspective, which made it a hassle to work on enterprise code.
Given the lack of desire to reopen that can of worms, I'm moving forward with the global.
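For reference, a minimal sketch of what that global-hook injection pattern looks like, with all names and the threshold purely hypothetical (the real hook takes cluster, settings, and batch request, per the eventual implementation):

```go
package main

import "fmt"

// canUseFollowerReads is a package-level hook, mirroring the global
// injection pattern discussed above: OSS code holds a conservative
// default, and CCL code replaces it at init time.
var canUseFollowerReads = func(txnAgeSeconds float64) bool {
	return false // pure-OSS build: follower reads disabled
}

// injectFollowerReadsCCL stands in for the CCL package's init hook.
// The 90-second threshold is an illustrative value, not the real one.
func injectFollowerReadsCCL() {
	canUseFollowerReads = func(txnAgeSeconds float64) bool {
		return txnAgeSeconds > 90
	}
}

func main() {
	fmt.Println(canUseFollowerReads(120)) // false: hook not yet injected
	injectFollowerReadsCCL()
	fmt.Println(canUseFollowerReads(120)) // true: CCL behavior injected
}
```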
docs/RFCS/20181227_follower_reads_implementation.md, line 111 at r5 (raw file):
Previously, ajwerner wrote…
Sure, I see no reason not to. Will update.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 123 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I like this; it will also help applications that use ORMs that make it difficult to inject the AOST clause into a generated SELECT statement.
Done.
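A hedged sketch of that ORM-friendly, transaction-scoped form, under the same assumptions as the earlier example (hypothetical connection details and table): the AOST clause is attached once via `SET TRANSACTION`, so ORM-generated statements need no changes:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	// Per the proposal, AS OF SYSTEM TIME can apply to the whole
	// transaction, so generated SELECTs inherit it unchanged.
	if _, err := tx.Exec(
		`SET TRANSACTION AS OF SYSTEM TIME follower_read_timestamp()`); err != nil {
		log.Fatal(err)
	}

	var n int
	// A query an ORM might generate, now served at the follower-read timestamp.
	if err := tx.QueryRow(`SELECT count(*) FROM accounts`).Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Println("rows:", n)

	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```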
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
Previously, ajwerner wrote…
Noted. See my high level comment.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 354 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
FWIW, this would make the semantics harder to understand, but based on my conversations with potential users of this feature it's what they generally want. Users who are interested in follower reads tend to have a very high upper bound for staleness but they don't want to fix their reads at that worst-case value. They'd like timestamp selection to be somehow dynamic.
Discussed offline that while something more dynamic is probably desirable, for the first pass we're going to do the simpler thing.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @justinj, @knz, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, ajwerner wrote…
After some discussions with Ben, the idea here is that, regardless of the exact mechanism by which the system chooses this timestamp, the intention of using this function is that it is likely to produce a timestamp which the system will deem safe for a follower read. That does not make the function implementation-specific, but it certainly makes it feature-specific. This function should be used exclusively by clients hoping to perform reads against followers. Given the feedback regarding not overly tying the functionality to
+1 on `follower`. That's what we're going to call the feature, so it doesn't seem like a problem to me.
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
Previously, ajwerner wrote…
Done.
What is your stance on no setting vs. a hidden setting?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @justinj, @knz, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
+1 on `follower`. That's what we're going to call the feature, so it doesn't seem like a problem to me.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What is your stance on no setting vs. a hidden setting?
I'm for a hidden setting. I added that to the doc. It makes sense that 1) we'd want to play with this while testing out the feature and 2) when it's in the field there might be some case where we're happy it's there.
bors r+
33474: docs: add RFC to expose follower reads to clients r=ajwerner a=ajwerner

Relates to #16593.

Release note: None

34399: storage: fix NPE while printing trivial truncateDecision r=bdarnell a=tbg

Fixes #34398.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build succeeded
From a process perspective, there's a shortcoming here. The latest iteration was less than 24 hours ago, and I haven't had the chance to get an additional turn at reviewing this before it was merged. That's not what we usually do with RFCs.
Reviewed 1 of 1 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @bdarnell, @justinj, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, ajwerner wrote…
See below comment. The resolution thus far has been that we'll remove the setting and the implementation-specific documentation for `follower_read_timestamp()`. This is indeed a feature-specific function, but it's not necessarily an implementation-specific function. The intention is that the user should expect this function to provide as up-to-date a timestamp as the system thinks can be served by a follower read. Obviously in this initial implementation that guess is conservative, but the function leaves the door open to future implementations making it smarter.
The document now proposes that the expression in the AS OF clause is either a constant or an invocation of just that function, to keep the implementation straightforward.
You're missing that the word "follower" is an artifact of our choice to use Raft (or consensus in general).
What the feature really is about is breaking the master/slave (or primary/secondary) boundary for scans and actually delivering on our promise of "multi-active".
We're not doing "follower reads" here (that's a Raft-specific word). We're doing safe distributed reads from secondaries. This is why, if I had the time to iterate on this after your latest batch of changes, I'd have been driven by your other changes/explanations to the natural good name for this: `safe_distributed_read_timestamp()`.
I apologize for being hasty. I was having a somewhat difficult time balancing the pressure to get this work in. Even more than that, I think I personally underplayed the importance of the naming here and felt that there was support from Ben and Nathan for "follower". As I expressed before, I don't have strong conviction on the naming here. I'm happy to keep this open and to make further revisions to this RFC.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @bdarnell, @justinj, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, knz (kena) wrote…
You're missing that the word "follower" is an artifact of our choice to use Raft (or consensus in general).
What the feature really is about is breaking the master/slave (or primary/secondary) boundary for scans and actually delivering on our promise of "multi-active".
We're not doing "follower reads" here (that's a Raft-specific word). We're doing safe distributed reads from secondaries. This is why, if I had the time to iterate on this after your latest batch of changes, I'd have been driven by your other changes/explanations to the natural good name for this: `safe_distributed_read_timestamp()`.
I don’t disagree with anything you’ve said here. My sense is that it’ll be easiest to discuss this feature if the terminology aligns with how we market this functionality. My current understanding is that the product folks think of this functionality as “follower reads” and will market it as such and hence the chosen name.
In Ceph, for example, the name for a somewhat analogous feature is "balanced reads", though it additionally implies a round-robin replica selection policy and furthermore assumes a single-data-center deployment, and thus is not a great fit for us.
Maybe we should rebrand this feature as "Distributed Reads", in which case we absolutely should rename this function to match. Probably my biggest blind spot in this discussion is a general lack of conviction regarding the naming here. I apologize for not giving enough time and space to let that discussion run its course. I'm happy to further revise this doc in a follow-up PR.
In all fairness, this is a discussion where a PM should have been involved in the loop where we balance the engineering aspects with the productization aspects. As a general rule, when a feature like this has strong UX visibility in this way, it's OK in the RFC to say "the initial implementation will use the name
33478: sql,kv,followerreadsccl: enable follower reads for historical queries r=ajwerner a=ajwerner

Follower reads are reads which can be served from any replica as opposed to just the current lease holder. The foundation for this change was laid with the work to introduce closed timestamps and to support follower reads at the replica level. This change adds the required support to the sql and kv layers and additionally exposes a new syntax to ease client adoption of the functionality.

The change adds the followerreadsccl package with logic to check when follower reads are safe and to inject the functionality so that it can be packaged as an enterprise feature. It modifies `AS OF SYSTEM TIME` semantics to allow for the evaluation of a new builtin tentatively called `follower_read_timestamp()` in addition to constant expressions. This new builtin ensures that an enterprise license exists and then returns a time that can likely be used to read from a follower. The change abstracts (and renames to the more appropriate replicaoracle) the existing leaseHolderOracle in the distsqlplan package to allow a follower-read-aware policy to be injected. Lastly, the change adds to kv a site to inject a function for checking if follower reads are safe and allowed given a cluster, settings, and batch request.

This change includes a high-level roachtest which validates the observable behavior of performing follower reads by examining latencies for reads in a geo-replicated setting.

Implements #33474
Fixes #16593

Release note (enterprise change): Add support for performing sufficiently old historical reads against closest replicas rather than leaseholders. A new builtin function `follower_read_timestamp()` can be used with `AS OF SYSTEM TIME` clauses to generate a timestamp which is likely to be safe for reads from a follower.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Relates to #16593.
Release note: None