docs: add RFC to expose follower reads to clients #33474
Great RFC 👍
docs/RFCS/20181227_follower_reads_implementation.md, line 23 at r1 (raw file):
DistSender aware that follower reads are possible. Given the intention to make follower reads an enterprise feature, some of the complexity in this proposal stem from the need to inject behavior from CCL code.
nit: stems
docs/RFCS/20181227_follower_reads_implementation.md, line 60 at r1 (raw file):
seconds but could likely be lowered to 5-10 seconds (at some threshold it may potentially interfere with on-going transactions). This proposal seeks to acheive follower reads by employing stateless approximation of when a follower
nit: achieve
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
In order to ease the burden of the client determining an adequately old timestamp for use with an `AS OF SYSTEM TIME` query, this RFC introduces a new SQL syntax `AS OF RECENT SYSTEM TIME` which is effectively a syntactic
I wonder if there should be a way for a user to explicitly bound the amount of staleness in this syntax, independent of a cluster setting? I'm thinking something like `AS OF RECENT SYSTEM TIME AT MOST '30s' AGO`? As an application developer I feel like I might be concerned that some future change to the cluster settings will make my reads more stale than I had initially planned on.
docs/RFCS/20181227_follower_reads_implementation.md, line 100 at r1 (raw file):
This section will focus on the specific details of plumbing the functionality required to expose follower reads through the codebase. Because follwer reads
nit: follower
docs/RFCS/20181227_follower_reads_implementation.md, line 127 at r1 (raw file):
targetMultiple := FollowerReadsTargetMultiple.Get(&st.SV)
targetDuration := closedts.TargetDuration.Get(&st.SV)
return -1 * time.Duration(targetMultiple * float64(targetDuration))
nit: stray tab here
docs/RFCS/20181227_follower_reads_implementation.md, line 176 at r1 (raw file):
policy which selects the closest replica. Prior to this change the policy is used to statically construct an a oracle
nit: an Oracle
docs/RFCS/20181227_follower_reads_implementation.md, line 283 at r1 (raw file):
feel less egregious? ### How does this capability fit in to the SQL query planning?
I believe the work being done today around taking expected latencies into account is significantly less involved than what would be required to make use of follower reads: it's only planned to use this information for indexes which are pinned to a region using geopartitioning (i.e., it has no knowledge of ranges whatsoever). Additionally, making use of range-level information would be a big step up and would almost certainly have to happen post-absorption of the DistSQL physical planning into the optimizer.
But despite that it does seem to me that long-term the optimizer is where this kind of decision should be made, and we probably won't want to ship the decision of what replicas to read from off to an injected oracle. For instance, if the rest of the plan affects the choice of which replica would be best, like maybe we want to colocate reads from ranges we expect to join together, or something? I'm not sure. Now that I'm writing out examples they seem kind of contrived and I'm wondering if there are many real-world examples of such cases.
I don't know what kind of interface the optimizer will want when we cross that bridge, so for now I'd say that the interface discussed in this RFC is the right way to go.
Force-pushed from c8abc14 to 1df87ad.
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, justinj (Justin Jaffray) wrote…
I wonder if there should be a way for a user to explicitly bound the amount of staleness in this syntax, independent of a cluster setting? I'm thinking something like `AS OF RECENT SYSTEM TIME AT MOST '30s' AGO`? As an application developer I feel like I might be concerned that some future change to the cluster settings will make my reads more stale than I had initially planned on.
This smells a lot like `AS OF SYSTEM TIME '-30s'`. To me the `RECENT` syntax is expressly for the client to ask for follower reads. If they want to control the staleness we already have a syntax for that. Assuming the chosen time is adequately old, a follower read will happen.
docs/RFCS/20181227_follower_reads_implementation.md, line 283 at r1 (raw file):
Previously, justinj (Justin Jaffray) wrote…
I believe the work being done today around taking expected latencies into account is significantly less involved than what would be required to make use of follower reads: it's only planned to use this information for indexes which are pinned to a region using geopartitioning (i.e., it has no knowledge of ranges whatsoever). Additionally, making use of range-level information would be a big step up and would almost certainly have to happen post-absorption of the DistSQL physical planning into the optimizer.
But despite that it does seem to me that long-term the optimizer is where this kind of decision should be made, and we probably won't want to ship the decision of what replicas to read from off to an injected oracle. For instance, if the rest of the plan affects the choice of which replica would be best, like maybe we want to colocate reads from ranges we expect to join together, or something? I'm not sure. Now that I'm writing out examples they seem kind of contrived and I'm wondering if there are many real-world examples of such cases.
I don't know what kind of interface the optimizer will want when we cross that bridge, so for now I'd say that the interface discussed in this RFC is the right way to go.
Ack. Will update after some more feedback comes in
Looks good, and it's fantastic to see the prototype sitting alongside this.
docs/RFCS/20181227_follower_reads_implementation.md, line 1 at r1 (raw file):
nit: delete new line
docs/RFCS/20181227_follower_reads_implementation.md, line 7 at r1 (raw file):
- Authors: Andrew Werner
- RFC PR: (PR # after acceptance of initial draft)
- Cockroach Issue: (one or more # from the issue tracker)
#16593 and maybe others.
docs/RFCS/20181227_follower_reads_implementation.md, line 23 at r1 (raw file):
DistSender aware that follower reads are possible. Given the intention to make follower reads an enterprise feature, some of the complexity in this proposal stem from the need to inject behavior from CCL code.
s/stem/stems/
docs/RFCS/20181227_follower_reads_implementation.md, line 59 at r1 (raw file):
`kv.closed_timestamp.target_duration`. As of writing this value defaults to 30 seconds but could likely be lowered to 5-10 seconds (at some threshold it may potentially interfere with on-going transactions). This proposal seeks to
We had it all the way down at 5 seconds before, but saw test flakes because of it. We bumped it because we weren't making use of it, but I expect that we will drop the default to 5-10 seconds once this work goes in.
docs/RFCS/20181227_follower_reads_implementation.md, line 61 at r1 (raw file):
potentially interfere with on-going transactions). This proposal seeks to acheive follower reads by employing stateless approximation of when a follower read is possible by assuming that a read may be directed at to a follower if it
"at to a"
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r1 (raw file):
acheive follower reads by employing stateless approximation of when a follower read is possible by assuming that a read may be directed at to a follower if it occurs some multiple of the target duration which is controlled by a new cluster
occurs at some multiple?
docs/RFCS/20181227_follower_reads_implementation.md, line 68 at r1 (raw file):
read (which will happen seamlessly due to a NotLeaseHolderError). The `target_multiple` offers clients a tradeoff between staleness and likelihood of follower read failing.
You use the term "staleness" a few times here. It might be worth clarifying that you mean the "minimum staleness required for a follower read to be attempted".
docs/RFCS/20181227_follower_reads_implementation.md, line 83 at r1 (raw file):
The physical planning of SQL evaluation currently tries to send queries to be
s/queries/DistSQL processors/
s/evaluated/run/
docs/RFCS/20181227_follower_reads_implementation.md, line 91 at r1 (raw file):
other than READ_UNCOMMITTED or INCONSISTENT
This would just be `CONSISTENT`, but I think this is wrong. `READ_UNCOMMITTED` operations should also be evaluated on the leaseholder.
docs/RFCS/20181227_follower_reads_implementation.md, line 100 at r1 (raw file):
This section will focus on the specific details of plumbing the functionality required to expose follower reads through the codebase. Because follwer reads
"follwer"
docs/RFCS/20181227_follower_reads_implementation.md, line 136 at r1 (raw file):
the current target duration for closed timestamps is 30s, queries performed with `RECENT` should lag "real" time by roughly 1m30s. If we can lower the target duration to 10s which would lead to a 30s real time delay.
I wonder if we should be taking `kv.closed_timestamp.close_fraction` into account as well. `kv.closed_timestamp.target_duration * kv.closed_timestamp.close_fraction` should give us the period of closed timestamp updates, so multiplying that period by some configurable multiple and adding it to the closed timestamp target duration seems like the right approach to conservatively estimate the closed timestamp threshold of a follower.
docs/RFCS/20181227_follower_reads_implementation.md, line 138 at r1 (raw file):
target duration to 10s which would lead to a 30s real time delay. ### SQL Support For `AS OF RECENT SYSTEM TIME` Queries
Should we eagerly reject this grammar when an enterprise license is missing as well? That's probably a better experience than it just silently not hitting a follower read.
docs/RFCS/20181227_follower_reads_implementation.md, line 156 at r1 (raw file):
`init`. ### Abstract replica selection for SQL physical planning.
@andreimatei and @RaduBerinde should take a look at this.
The big thing that jumps out to me is that we have this logic living in two different places (which isn't new). As we add more complex routing policies, this becomes a bigger issue, especially when we want the decisions from the policies to line up (e.g. we'd hate to route a processor to a local node only to have its DistSender send requests to a remote node). I don't remember the history of these `leaseHolderOracle`s, but I wonder whether it's time to revisit whether they can be merged with the same kind of logic in `DistSender`.
docs/RFCS/20181227_follower_reads_implementation.md, line 247 at r1 (raw file):
One potential downside of this approach is that in an edge case it may have the potential to have a detrimentally affect cluster performance in the case of
"effect on cluster performance"
docs/RFCS/20181227_follower_reads_implementation.md, line 260 at r1 (raw file):
increasing the target multiple. The problem seems worse as the replication factor increases beyond 3 to numbers like 7 or 9. Furthermore even if the increased load during this bursty period does not meaningfully effect OLTP
"affect"
docs/RFCS/20181227_follower_reads_implementation.md, line 273 at r1 (raw file):
additionally require new mechanisms to mark as known to be safe for follower reads. Furthermore the state tracking may be prohibitively expensive on large clusters.
Another alternative would be to dynamically adjust the target multiple as part of a feedback loop with follower read hits and misses as an input. Ideally, this would allow us to adapt to changes to closed timestamp update latency.
docs/RFCS/20181227_follower_reads_implementation.md, line 7 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
#16593 and maybe others.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 23 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/stem/stems/
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 59 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We had it all the way down at 5 seconds before, but saw test flakes because of it. We bumped it because we weren't making use of it, but I expect that we will drop the default to 5-10 seconds once this work goes in.
Do we understand why it was leading to test flakes? Is there work to be done there?
docs/RFCS/20181227_follower_reads_implementation.md, line 61 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"at to a"
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
occurs at some multiple?
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 68 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
You use the term "staleness" a few times here. It might be worth clarifying that you mean the "minimum staleness required for a follower read to be attempted".
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 83 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/queries/DistSQL processors/
s/evaluated/run/
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 91 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
other than READ_UNCOMMITTED or INCONSISTENT
This would just be `CONSISTENT`, but I think this is wrong. `READ_UNCOMMITTED` operations should also be evaluated on the leaseholder.
You are correct. Updated
docs/RFCS/20181227_follower_reads_implementation.md, line 100 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"follwer"
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 136 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I wonder if we should be taking `kv.closed_timestamp.close_fraction` into account as well. `kv.closed_timestamp.target_duration * kv.closed_timestamp.close_fraction` should give us the period of closed timestamp updates, so multiplying that period by some configurable multiple and adding it to the closed timestamp target duration seems like the right approach to conservatively estimate the closed timestamp threshold of a follower.
Seems reasonable to me. If I understand correctly, currently I have:
`recent := kv.closed_timestamp.target_duration * kv.follower_reads.target_multiple`
and you're suggesting
`recent := kv.closed_timestamp.target_duration * (1 + kv.closed_timestamp.close_fraction * kv.follower_reads.target_multiple)`
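For concreteness, a minimal Go sketch of the computation being discussed here (the helper name and placement are hypothetical; the inputs correspond to the cluster settings named above):

```go
package kv // hypothetical placement

import "time"

// recentOffset returns the negative offset from the current time at which a
// read is conservatively expected to be servable by a follower, i.e.
// -target_duration * (1 + close_fraction * target_multiple).
func recentOffset(targetDuration time.Duration, closeFraction, targetMultiple float64) time.Duration {
	return -time.Duration(float64(targetDuration) * (1 + closeFraction*targetMultiple))
}
```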
docs/RFCS/20181227_follower_reads_implementation.md, line 138 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we eagerly reject this grammar when an enterprise license is missing as well? That's probably a better experience than it just silently not hitting a follower read.
Sure, I updated the doc and will update the prototype.
docs/RFCS/20181227_follower_reads_implementation.md, line 247 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"effect on cluster performance"
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 260 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"affect"
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 273 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Another alternative would be to dynamically adjust the target multiple as part of a feedback loop with follower read hits and misses as an input. Ideally, this would allow us to adapt to changes to closed timestamp update latency.
Added some language to this effect.
Another thing I'm thinking is that we should add some monitoring to track when closed timestamps lag, to understand if and when it's actually a problem.
Force-pushed from 1df87ad to 6fa126e.
docs/RFCS/20181227_follower_reads_implementation.md, line 138 at r1 (raw file):
Previously, ajwerner wrote…
Sure, I updated the doc and will update the prototype.
Updated the prototype too.
docs/RFCS/20181227_follower_reads_implementation.md, line 7 at r1 (raw file):
Previously, ajwerner wrote…
Done.
nit: by convention (and to ease `grep`) we place the `#` before the issue/pr number
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r2 (raw file):
In order to ease the burden of the client determining an adequately old timestamp for use with an `AS OF SYSTEM TIME` query, this RFC introduces a new SQL syntax `AS OF RECENT SYSTEM TIME` which is effectively a syntactic
To be honest I am not too fond of a syntax extension to achieve this behavior. This prevents clients from 1) programmatically choosing one behavior over another based on other settings or a computation and 2) inspecting the timestamp at which the reads will be done.
Have you thought about using a built-in function for this: `AS OF SYSTEM TIME historical_timestamp()`, where `historical_timestamp()` returns the highest timestamp for which crdb guarantees follower reads (if enabled)?
This way clients can also inspect the timestamp with `SELECT historical_timestamp()` etc.
(The name "historical_timestamp" may need to be iterated over a little bit. Perhaps "latest_timestamp_for_distributed_reads()" would be more accurate/descriptive.)
The builtin approach seems cleaner than the syntax change. TFTR!
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r2 (raw file):
Previously, knz (kena) wrote…
To be honest I am not too fond of a syntax extension to achieve this behavior. This prevents clients from 1) programmatically choosing one behavior over another based on other settings or a computation and 2) inspecting the timestamp at which the reads will be done.
Have you thought about using a built-in function for this: `AS OF SYSTEM TIME historical_timestamp()`, where `historical_timestamp()` returns the highest timestamp for which crdb guarantees follower reads (if enabled)?
This way clients can also inspect the timestamp with `SELECT historical_timestamp()` etc.
I'm not at all wedded to the syntax extension. The builtin approach makes sense. It should also make the CCL injection somewhat less nasty.
achieve follower reads by employing stateless approximation of when a follower read is possible by assuming that a read may be directed to a follower if it occurs at some multiple of the target duration which is controlled by a new cluster setting `kv.follower_reads.target_multiple`. While this may ultimately
Have you considered consulting the gateway node's closed timestamp subsystem for information about the remote node? For example, if you're on n1 and you're deciding whether you can serve a follower read from a replica on n2, you can look at the closed timestamp of n2 to get a good idea. I think this is superior to making this purely timing-based, in the sense that it has protection against "slow nodes" built in. With a purely timing-based approach, it's more likely that you'll have to add state to remember when a node rejected follower reads that were thought possible.
Sorry, I mean that if you have a range that's on n1, n2, n3 and the leader is, say, n1, then you consider n2 and n3 for serving a follower read for timestamps that have closed out at n1 (perhaps with some padding), the idea being that all that's missing for n2 and n3 to serve the follower read is their catching up on the Raft log. There are still plenty of ways to get this wrong (a follower could be waiting for a snapshot, etc) but overall it offers a little more protection than going blindly by target duration.
Interestingly, the way I started writing this RFC was with the stateless approach as the rejected alternative and a more stateful approach as the proposal. After talking to Nathan I was encouraged to flip it, if only because the proposed approach here is simpler to implement and can be nuanced later.
A problem as far as I can tell with relying on node level closed timestamp arises when there's a range that's on n2, n3, and n4 and the gateway node receiving the request is on n1. n1 won't have information on the state of the closed ts for the range in question and thus still falls back to having to guess. Using the closed timestamp tracking on the node level could enable the recent timestamp to be less conservative.
I'm curious how important it is that this feature seek to be as up to date as possible. I wonder if clients opting in to this functionality will make their decision based on whether the data is 10s vs 60s stale.
the idea being that all that's missing for n2 and n3 to serve the follower read is their catching up on the Raft log
When we hit this situation, the receiving replica won't know that the request had some reason to believe that a follower read was safe so this will still result in a NotLeaseHolderError. Are you suggesting that we should provide some mechanism to mark a request as expecting to be safe and then wait at the follower for it to catch up?
traffic, it may lead to potentially massively increased latency for queries which in the previous regime had been fast.
A more sophisticated mechanism which statefully tracks a closed timestamps on a
This is roughly where I was going above, except that I don't think you'd need to track individual replicas (which I agree is not a good thing to buy into). Looking at the node-wide closed timestamp alone should be sufficient, at least for now.
docs/RFCS/20181227_follower_reads_implementation.md, line 59 at r1 (raw file):
Previously, ajwerner wrote…
Do we understand why it was leading to test flakes? Is there work to be done there?
It was leading to flakes because tests weren't prepared to have their transactions' timestamps pushed forwards. The work there is just to fix the tests.
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, ajwerner wrote…
This smells a lot like `AS OF SYSTEM TIME '-30s'`. To me the `RECENT` syntax is expressly for the client to ask for follower reads. If they want to control the staleness we already have a syntax for that. Assuming the chosen time is adequately old, a follower read will happen.
I tend to agree with @ajwerner. If `RECENT` had an upper bound of `now()` given a sufficiently up-to-date follower and was determined dynamically by follower progress then something like `AS OF RECENT SYSTEM TIME AT MOST '5s' AGO` would make a lot of sense. However, since we need to hit the leaseholder for any query within the `kv.closed_timestamp.target_duration` to ensure RW conflicts are properly handled (given our current non-distributed timestamp cache*) and because `RECENT` isn't being determined based on follower progress, I don't see a huge reason to add in the syntax.
* The idea of permitting follower reads within the `kv.closed_timestamp.target_duration` while still handling RW conflicts correctly is a different discussion that would be really interesting to have at some point! I think a workable solution would look closer to quorum leases — where multiple replicas would hold leases at a given time, each would need to be a part of all replication quorums, and each would need to record reads in a local timestamp cache-like structure and be prepared to reject incoming proposals in the event of RW conflicts.
docs/RFCS/20181227_follower_reads_implementation.md, line 136 at r1 (raw file):
Previously, ajwerner wrote…
Seems reasonable to me. If I understand correctly, currently I have:
`recent := kv.closed_timestamp.target_duration * kv.follower_reads.target_multiple`
and you're suggesting
`recent := kv.closed_timestamp.target_duration * (1 + kv.closed_timestamp.close_fraction * kv.follower_reads.target_multiple)`
Yes, that's exactly right.
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r2 (raw file):
A problem as far as I can tell with relying on node level closed timestamp arises when there's a range that's on n2, n3, and n4 and the gateway node receiving the request is on n1. n1 won't have information on the state of the closed ts for the range in question and thus still falls back to having to guess. Using the closed timestamp tracking on the node level could enable the recent timestamp to be less conservative.
Yes, that was my understanding of the problem as well. It's likely that that's just because I don't have the details of the closed timestamp subsystem paged in. Does a node have enough information to make one of these kinds of decisions accurately (or close to accurately) when it doesn't hold a replica in the Range?
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I tend to agree with @ajwerner. If `RECENT` had an upper bound of `now()` given a sufficiently up-to-date follower and was determined dynamically by follower progress then something like `AS OF RECENT SYSTEM TIME AT MOST '5s' AGO` would make a lot of sense. However, since we need to hit the leaseholder for any query within the `kv.closed_timestamp.target_duration` to ensure RW conflicts are properly handled (given our current non-distributed timestamp cache*) and because `RECENT` isn't being determined based on follower progress, I don't see a huge reason to add in the syntax.
* The idea of permitting follower reads within the `kv.closed_timestamp.target_duration` while still handling RW conflicts correctly is a different discussion that would be really interesting to have at some point! I think a workable solution would look closer to quorum leases — where multiple replicas would hold leases at a given time, each would need to be a part of all replication quorums, and each would need to record reads in a local timestamp cache-like structure and be prepared to reject incoming proposals in the event of RW conflicts.
Do we have design partners for follower reads? This is exactly the sort of thing it's handy to get external feedback on.
(With my Foursquare hat on, I happen to find justin's suggestion of an optional `AT MOST <> AGO` suffix compelling. `AS OF SYSTEM TIME '-30s'` means you're always running 30s stale, whereas `AS OF RECENT SYSTEM TIME AT MOST '30s' AGO` is probably running more recent than that but you still don't have to worry about it being an hour stale.)
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
Do we have design partners for follower reads? This is exactly the sort of thing it's handy to get external feedback on.
(With my Foursquare hat on, I happen to find justin's suggestion of an optional
AT MOST <> AGO
suffix compelling.AS OF SYSTEM TIME '-30s'
means you're always running 30s stale, whereasAS OF RECENT SYSTEM TIME AT MOST '30s' AGO
is probably running more recent than that but you still don't have to worry about it being an hour stale.)
With the recent change to a function, this now seems potentially expressible in SQL with `min`.
I do think there's probably a conversation to be had about the implication of allowing the expr in AS OF SYSTEM TIME to be non-constant as this proposal does. Does that raise any red flags? Do we have a middle ground between constant-only and all expressions which would enable using built-ins but not reading from the database?
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r2 (raw file):
Previously, ajwerner wrote…
I'm not at all wedded to the syntax extension. The builtin approach makes sense. It should also make the CCL injection somewhat less nasty.
Done.
Force-pushed from 6fa126e to 1c91579.
I won't be able to give this a closer look this week, but I think I'm on the same page with you and Nathan. Let me know if you need my eyes on this again.
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r2 (raw file):
Yeah, this is a little tricky -- the way I was thinking about this is that if we're just talking about the closed timestamp (i.e. no MLAIs), then this is cheap enough to distribute throughout. But right now you either distribute nothing or everything, and while in a "random cluster" usually all nodes talk to all other nodes, this doesn't need to be true in general (and shouldn't be true in strictly partitioned environments, ideally).
Are you suggesting that we should provide some mechanism to mark a request as expecting to be safe and then wait at the follower for it to catch up?
No, I was suggesting adding some padding to the closed timestamp before deciding to serve the read and to hope that that will work well enough for the time being, but that may not be enough. Dealing with stragglers will need some state at DistSender as this becomes a problem with either approach.
I agree that there are enough concerns to KISS for now (i.e. go threshold based). It's straightforward enough to retrofit if a desire for using the information we have in principle becomes stronger.
Force-pushed from 4f646ff to 2e10d41.
docs/RFCS/20181227_follower_reads_implementation.md, line 59 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It was leading to flakes because tests weren't prepared to have their transactions' timestamps pushed forwards. The work there is just to fix the tests.
I'll create a separate issue for this to lower the time and track down the flakes.
docs/RFCS/20181227_follower_reads_implementation.md, line 136 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes, that's exactly right.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 265 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is roughly where I was going above, except that I don't think you'd need to track individual replicas (which I agree is not a good thing to buy into). Looking at the node-wide closed timestamp alone should be sufficient, at least for now.
Done.
@knz I've updated the RFC and prototype to use a function rather than a syntax change as suggested, PTAL. I called it `recent_timestamp`, which is certainly less descriptive than your suggestion, and am open to further iteration.
LGTM
docs/RFCS/20181227_follower_reads_implementation.md, line 156 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
@andreimatei and @RaduBerinde should take a look at this.
The big thing that jumps out to me is that we have this logic living in two different places (which isn't new). As we add more complex routing policies, this becomes a bigger issue, especially when we want the decisions from the policies to line up (e.g. we'd hate to route a processor to a local node only to have its DistSender send requests to a remote node). I don't remember the history of these `leaseHolderOracle`s, but I wonder whether it's time to revisit whether they can be merged with the same kind of logic in `DistSender`.
I agree that the more we can unify `DistSender` and DistSQL planning code, the better. Can the `DistSender` be made to use the oracle for determining the first replica it should send to? I don't know; all I can say is that I too encourage this exploration :)
If there's nothing better to be done at the moment, the interfaces described here don't seem too bad to me - it's not significantly more complexity and duplication than we already have :)
docs/RFCS/20181227_follower_reads_implementation.md, line 11 at r4 (raw file):
Follower reads are consistent reads at historical timestamps from follower replicas. They make the non-leader replicas in a range suitable sources for
you use "leader" and "follower" here, but this is all about leaseholder / non-leaseholder, not Raft leaders and followers, right?
docs/RFCS/20181227_follower_reads_implementation.md, line 65 at r4 (raw file):
example, if the `target_duration` is 30s and the `close_fraction` is 0.2 then the subsystem will try to keep the closed timestamp 30s behind real time and will try to update the value every 6s. This proposal seeks to achieve follower
So does that mean that the closed timestamp will lag by between 24s and 30s, or 30-36s? Can you clarify?
docs/RFCS/20181227_follower_reads_implementation.md, line 142 at r4 (raw file):
The setting represents the tradeoff between staleness of `RECENT` queries and
`RECENT`? Stale syntax?
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
### SQL `recent_timestamp()` builtin A new SQL builtin `recent_timestamp()` is added to call through to an injected
I think the naming of this function should be tied to "follower reads". How about `follower_lag_target()`. And one would use it with `select as of system time now() - follower_lag_target()`. In the future we might want different flavors of this too - e.g. timestamp at which the read is guaranteed to be served locally, most recent timestamp possible at which the read will be served locally.
Also I think the function should work just fine in non-licensed clusters. You don't want the app to crash when going from a licensed cluster to an unlicensed one (think dev testing). Just that there'd be no follower reads. And it'd be useful - in testing I can see the effects of stale reads on the user experience.
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
to send batch requests to current lease holders which may prevent reads from going to nearby follower replicas. In order to inform the DistSender that it can send a batch request to a follower we add a new global var in the kv package
I think it'd be really good if we can avoid having this function ptr be a global. This will be a testing nightmare, as these globals always are. Please find a way to scope the hook to a `Server`.
For example, we can have packages register "server startup hooks" (and the list of hooks itself can be a global slice), and then `Server.Start()` would call these hooks and pass itself as an argument. This has the advantage that once you have a `Server`, it's insulated from any changes to any globals.
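A minimal Go sketch of that hook-registration pattern (all names hypothetical):

```go
package server

// Hypothetical sketch: CCL packages append hooks to a package-level slice
// from their init() functions, and Server.Start() invokes each hook with the
// concrete Server, so the injected behavior ends up scoped to that Server
// rather than to free-floating globals.
type Server struct{ /* fields elided */ }

var startupHooks []func(*Server)

// RegisterStartupHook is intended to be called from CCL init() functions,
// which run before any Server is started.
func RegisterStartupHook(h func(*Server)) {
	startupHooks = append(startupHooks, h)
}

// Start runs the registered hooks, then proceeds with normal startup.
func (s *Server) Start() {
	for _, h := range startupHooks {
		h(s)
	}
	// ... remainder of server startup ...
}
```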
But actually, can we relitigate the decision to only have CCL code in separate packages? It's been known to cause avalanches of exports and headaches, and I'm not sure there's a good reason for it. We'd be much better off in my opinion (and it seems that Andrew agrees) if we had this code here, for example, be in the kv package guarded by a build tag. Is there any lawyerly reason not to do that? @bdarnell @danhhz
docs/RFCS/20181227_follower_reads_implementation.md, line 156 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I agree that the more we can unify `DistSender` and DistSQL planning code, the better. Can the `DistSender` be made to use the oracle for determining the first replica it should send to? I don't know; all I can say is that I too encourage this exploration :)
If there's nothing better to be done at the moment, the interfaces described here don't seem too bad to me - it's not significantly more complexity and duplication than we already have :)
I'll see what I can do.
docs/RFCS/20181227_follower_reads_implementation.md, line 62 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yeah, this is a little tricky -- the way I was thinking about this is that if we're just talking about the closed timestamp (i.e. no MLAIs), then this is cheap enough to distribute throughout. But right now you either distribute nothing or everything, and while in a "random cluster" usually all nodes talk to all other nodes, this doesn't need to be true in general (and shouldn't be true in strictly partitioned environments, ideally).
Are you suggesting that we should provide some mechanism to mark a request as expecting to be safe and then wait at the follower for it to catch up?
No, I was suggesting adding some padding to the closed timestamp before deciding to serve the read and to hope that that will work well enough for the time being, but that may not be enough. Dealing with stragglers will need some state at DistSender as this becomes a problem with either approach.
I agree that there are enough concerns to KISS for now (i.e. go threshold based). It's straightforward enough to retrofit if a desire for using the information we have in principle becomes stronger.
Ack.
docs/RFCS/20181227_follower_reads_implementation.md, line 11 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
you use "leader" and "follower" here, but this is all about leaseholder / non-leaseholder, not Raft leaders and followers, right?
You are correct. Updated.
docs/RFCS/20181227_follower_reads_implementation.md, line 65 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
So does that mean that the closed timestamp will lag by between 24s and 30s, or 30-36s? Can you clarify?
I will do some research and get back to you.
docs/RFCS/20181227_follower_reads_implementation.md, line 142 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
`RECENT`? Stale syntax?
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think the naming of this function should be tied to "follower reads". How about `follower_lag_target()`. And one would use it with `select as of system time now() - follower_lag_target()`. In the future we might want different flavors of this too - e.g. timestamp at which the read is guaranteed to be served locally, most recent timestamp possible at which the read will be served locally.
Also I think the function should work just fine in non-licensed clusters. You don't want the app to crash when going from a licensed cluster to an unlicensed one (think dev testing). Just that there'd be no follower reads. And it'd be useful - in testing I can see the effects of stale reads on the user experience.
This is a place where having a design partner would be nice. It might be worth pulling in some sales engineers and waiting for Andy to get back.
Force-pushed from 2e10d41 to eef11b6.
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, ajwerner wrote…
This is a place where having a design partner would be nice. It might be worth pulling in some sales engineers and to wait for Andy to get back.
FWIW, I wouldn't expect exceptional input from 3rd parties on the topic :)
Without trying to influence your thoughts on what makes the most sense, I think we should start by implementing the things that are the simplest and then add sugar on top as needed. And the way I see it, the simplest thing is to expose the notion of the target lag (as an `interval`) and let it be composed with `now()`.
If, say, we were to use a closed timestamp publicized by some other node, then that would make sense to be exposed by a function returning a `timestamptz`.
Force-pushed from eef11b6 to 9787909.
Force-pushed from 9787909 to 15f13c6.
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, ajwerner wrote…
With the recent change to a function, this now seems potentially expressible in SQL with `min`.
I do think there's probably a conversation to be had about the implication of allowing the expr in AS OF SYSTEM TIME to be non-constant as this proposal does. Does that raise any red flags? Do we have a middle ground between constant-only and all expressions which would enable using built-ins but not reading from the database?
I remember there was some reason we couldn't allow non-constant expressions involving `now()` in AOST (which is why we had to build in specific support for negative intervals). We may want to treat `recent_timestamp()` as a special case instead of opening the door to arbitrary expressions.
One caveat with non-constant expressions here is that if you prepare a statement that uses `AOST recent_timestamp()`, the timestamp should presumably get reevaluated every time the statement is executed (which may in turn invalidate any cached query plan).
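To make the caveat concrete, here is a hypothetical client-side sketch in Go (`recent_timestamp()` is the builtin proposed in this RFC; the table `t` and the helper are made up):

```go
package clientdemo // hypothetical

import "database/sql"

// readTwice prepares a statement once and executes it twice. Per the caveat
// above, recent_timestamp() should presumably be re-evaluated on each
// execution, so the two reads may observe different (but both recent)
// timestamps.
func readTwice(db *sql.DB) error {
	stmt, err := db.Prepare(
		`SELECT k, v FROM t AS OF SYSTEM TIME recent_timestamp()`)
	if err != nil {
		return err
	}
	defer stmt.Close()
	for i := 0; i < 2; i++ {
		rows, err := stmt.Query()
		if err != nil {
			return err
		}
		rows.Close()
	}
	return nil
}
```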
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
FWIW, I wouldn't expect exceptional input from 3rd parties on the topic :)
Without trying to influence your thoughts on what makes the most sense, I think we should start by implementing the things that are the simplest and then add sugar on top as needed. And the way I see it, the simplest thing is to expose the notion of the target lag (as an `interval`) and let it be composed with `now()`.
If, say, we were to use a closed timestamp publicized by some other node, then that would make sense to be exposed by a function returning a `timestamptz`.
+1 to calling this something like `follower_timestamp()`. The user doesn't want "recent" results; they get those by default. They want to be able to read from a follower.
Users may want to read from a follower for different reasons. One is lower latency for geo-distributed clusters, which is what this proposal addresses. But users also often want to read from a follower to avoid adding load to the leader (I'm not sure whether this is worthwhile in the CRDB model where every node is leader for some portion of the data, but it's a very common ask/expectation). We may want to think about exposing some way to read from a follower even if the leader is local.
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think it'd be really good if we can avoid having this function ptr be a global. This will be a testing nightmare, as these globals always are. Please find a way to scope the hook to a `Server`.
For example, we can have packages register "server startup hooks" (and the list of hooks itself can be a global slice), and then `Server.Start()` would call these hooks and pass itself as an argument. This has the advantage that once you have a `Server`, it's insulated from any changes to any globals.
But actually, can we relitigate the decision to only have CCL code in separate packages? It's been known to cause avalanches of exports and headaches, and I'm not sure there's a good reason for it. We'd be much better off in my opinion (and it seems that Andrew agrees) if we had this code here, for example, be in the kv package guarded by a build tag. Is there any lawyerly reason not to do that? @bdarnell @danhhz
I think it's important that we have a simple pattern for identifying all CCL files, so that pure-OSS distributions can discard them. That doesn't necessarily mean separate packages, but the idea was that using packages can make it easier to enforce that there are no references from apache code to CCL code (build tags also made a mess of the build cache in older versions of go, although that's better now). I think if we relax the current rule and start to mix CCL and non-CCL code in the same package we're going to have to do more testing of pure-OSS builds than we do now.
I also think that when the plumbing seems like a lot of trouble relative to the amount of code on the CCL side, there is little value in actually using the CCL and it's better to just leave it OSS.
docs/RFCS/20181227_follower_reads_implementation.md, line 111 at r5 (raw file):
follower replicas, it does not enable running entire (read-only) transactions at a single point in time and thus benefitting from the performance gains offerred by follower reads. This document proposes an extension to the `SET TRANSACTION`
Most parts of the SET TRANSACTION statement can also be applied to the BEGIN statement. Will that be possible here too?
docs/RFCS/20181227_follower_reads_implementation.md, line 123 at r5 (raw file):
BEGIN;
SET TRANSACTION AS OF SYSTEM TIME recent_timestamp(), PRIORITY HIGH;
I like this; it will also help applications that use ORMs that make it difficult to inject the AOST clause into a generated SELECT statement.
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
## Detailed Design ### The `kv.follower_reads.target_multiple` Cluster Setting
I don't really like this setting - the application developer will be thinking in terms of "I can tolerate staleness of up to X seconds", but using this setting requires a bunch of math with other values that are mostly opaque/hidden from the app (and which seem like implementation details subject to change in the future). I think that `recent_timestamp()` should probably be a black box for users who don't care too much about how old their results are, and users who care should be able to pass in their staleness bound directly (maybe add an optional argument: `recent_timestamp('10s')`).
docs/RFCS/20181227_follower_reads_implementation.md, line 354 at r5 (raw file):
detecting NotLeaseHolderErrors for queries which expected to hit followers. This could mitigate the flood of consistent reads in the face of lagging closed timestamps but would make the semantics of `recent_timestamp()` harder to
FWIW, this would make the semantics harder to understand, but based on my conversations with potential users of this feature it's what they generally want. Users who are interested in follower reads tend to have a very high upper bound for staleness but they don't want to fix their reads at that worst-case value. They'd like timestamp selection to be somehow dynamic.
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I remember there was some reason we couldn't allow non-constant expressions involving `now()` in AOST (which is why we had to build in specific support for negative intervals). We may want to treat `recent_timestamp()` as a special case instead of opening the door to arbitrary expressions.
One caveat with non-constant expressions here is that if you prepare a statement that uses `AOST recent_timestamp()`, the timestamp should presumably get reevaluated every time the statement is executed (which may in turn invalidate any cached query plan).
We need to be a bit careful to separate intent from mechanism.
Intent:
1. Does it matter for the MVP of this feature that the user be able to place bounds on staleness? All this discussion sounds like "I wish" from nathan and dan but I'd like this to be supported by our customer design partners. In general, any subtlety in the UX specs will require extra complexity in the implementation. I'd like to shed all non-required complexity from the MVP, otherwise the discussion (below) on mechanism will distract from the main task.
2. Once the UX requirements are known, the plan caching semantics need to be specified. If the UX requirement is that the new feature places a bound on staleness, then we must specify here in the RFC that the cached plan be invalidated adequately (an old plan can also be stale, as much as the data!). Once you spell out the staleness requirements, we can discuss the ways to achieve them.
3. That being said, a part of the intent will be to not stray too much from our txn model so as to not create a super long impl path to an MVP. A part of the txn model is that the ASOT expression be an expression that's a pure function of the txn start timestamp. So whatever you choose, it would be great to express the result of that expression as a (math) function of the start timestamp; that will make the reading much clearer.
(Note: I am discussing this in this way because the exchange above was like a tail wagging the dog, trying to invent deliverable requirements based on perceived notions about impl restrictions. Beware! And this also yields point 4:)
4. Regarding the name `recent_timestamp` vs `follower_timestamp`: obviously the name `follower_timestamp` is intimately linked to an implementation choice, and in general naming user-visible functions after the mechanism is poor design. I can make do with the idea that the word "recent" is inadequate, but "follower" is even more inadequate. That said, finding names is something we've done well in the past: once you have clear answers to points 1, 2, 3 above, a good name will fall out of that tree on its own.
Mechanism:
A) What ben says; we're not allowing arbitrary SQL expressions in the ASOT clause because doing so would be hard. We can't reuse the regular typing logic, no subqueries, no nothing in those expressions (for different reasons that will become clearer when you become acquainted with the code), so whatever you want to evaluate you have to do manually. Today the code is limited to supporting a few types of literals. So the idea to use an explicit SQL `min()` func app would mandate re-implementing an eval logic for that. That's A.1) cumbersome and A.2) a road I'm not keen to follow, because we don't exactly have bandwidth to support a second implementation of scalar evaluation.
B) It is possible however to make a special case for a built-in function (or several) without arguments, or one applied to literal values (e.g. you could make your `recent_timestamp` function take the bound, if desired by point 2 above, as an optional argument).
Design principle / best practices
X) Using scalar expression syntax (especially built-in functions) gives us more flexibility to grow and add other things alongside it later: we know from experience that adding keywords in the SQL grammar expecting to combine both sanity and future extensibility is like holding a shotgun towards your face and another towards your feet, pulling both triggers at once, and expecting something healthy to come out.
Y) Client code knows "well" how to assemble scalar expressions, typically less "well" how to assemble arbitrary pieces of syntax.
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
+1 to calling this something like `follower_timestamp()`. The user doesn't want "recent" results; they get those by default. They want to be able to read from a follower.
Users may want to read from a follower for different reasons. One is lower latency for geo-distributed clusters, which is what this proposal addresses. But users also often want to read from a follower to avoid adding load to the leader (I'm not sure whether this is worthwhile in the CRDB model where every node is leader for some portion of the data, but it's a very common ask/expectation). We may want to think about exposing some way to read from a follower even if the leader is local.
See my comment above - the word "follower" is implementation-specific and not a good choice.
Start tying down what you want to achieve, and discuss names only after that.
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I think it's important that we have a simple pattern for identifying all CCL files, so that pure-OSS distributions can discard them. That doesn't necessarily mean separate packages, but the idea was that using packages can make it easier to enforce that there are no references from apache code to CCL code (build tags also made a mess of the build cache in older versions of go, although that's better now). I think if we relax the current rule and start to mix CCL and non-CCL code in the same package we're going to have to do more testing of pure-OSS builds than we do now.
I also think that when the plumbing seems like a lot of trouble relative to the amount of code on the CCL side, there is little value in actually using the CCL and it's better to just leave it OSS.
I think separate packages are better for the ease of excising without pain.
However another general principle applies, which has been discussed in SQL circles but not yet ventilated to the rest of the org:
everything about planning should be core (non-CCL) and then we can restrict some parts of execution to CCL packages.
The planning structures should be core because the query planning has changed shape recently (CBO) in a way that makes it non-sensical (and outright insane) to try to split it in different packages.
(FYI we'll need to implement this change for all the other CCL statements, incl backup/restore/import/export, before they can be subsumed by the CBO. ASOT is constrained in the same way.)
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
Previously, knz (kena) wrote…
I think separate packages are better for the ease of excising without pain.
However another general principle applies, which has been discussed in SQL circles but not yet ventilated to the rest of the org:
everything about planning should be core (non-CCL) and then we can restrict some parts of execution to CCL packages.
The planning structures should be core because the query planning has changed shape recently (CBO) in a way that makes it nonsensical (and outright insane) to try to split it across different packages.
(FYI we'll need to implement this change for all the other CCL statements, incl backup/restore/import/export, before they can be subsumed by the CBO. ASOT is constrained in the same way.)
If we want to revisit how the various ccl hooks work, so they're not just globals, I'm all for that. I think the package split has worked quite well though. We tried build tags as one of the options when we were setting this up initially and it was obnoxious from an "editors and tooling" perspective, which made it a hassle to work on enterprise code.
The remaining comments boil down to the following points, which I've ordered according to my view of their relative importance for resolution. I'd like to tackle the first two issues before the second two. Maybe I should have kept these going inline below, but I felt like I needed to step back and take stock of the feedback. If it makes it easier to follow, feel free to jump back down to the inline discussion.

1. Timestamp selection:
   - As far as this proposal is concerned, picking a timestamp in SQL and determining whether a timestamp is safe to perform against followers are two sides of the same coin.
     - There exists some function `f()` which returns a timestamp that is intended to be safe for follower reads.
     - If `TxnTs < (f() + max clock offset)`, then distsqlplan and kv/DistSender ought to consider the txn's read requests as candidates for follower reads (see the sketch just after this list).
     - The big question of this proposal then is: what is `f()`?
     - If this basic design element is not agreed upon, the below questions are less relevant. I'm assuming that at least this facet of the problem is not controversial.
   - The current proposal is a static, stateless mechanism based solely on cluster settings.
     - + Simple to implement.
       - - So simple as to hardly justify the injection required to make a CCL package.
     - + Relatively straightforward to understand semantically.
     - - Can't guarantee that directing a request at a follower will be safe for performing a read.
     - - Exposes a difficult-to-use, implementation-specific cluster setting.
       - The design could be revised to hide this setting, but that further reduces client control over how this mechanism works.
       - If there's no cluster setting to control the fixed staleness of the timestamp, how do we pick a goldilocks value?
     - - Likely to be too conservative in the common case and still may not be conservative enough in other cases.
   - The other end of the spectrum would be a mechanism that explicitly tracks the closed timestamps of all stores on each node.
     - + Much higher confidence that reads will actually be safe to perform against a follower.
     - + Seems to better align with Ben's understanding of customer demand.
       - Customers may have some maximum tolerance for staleness which is quite high but would prefer a mechanism which dynamically chooses a timestamp?
     - - Introduces a failure mode when nodes start up and don't have a value?
       - Error?
       - Fall back to the stateless solution above?
     - - Much more involved from an implementation perspective.
       - Given that the data amounts to keeping a map of store to timestamp and updating it per store on the order of >1s, it feels morally like building such a system should not be an undue burden on the cluster.
     - + At least will have enough code to comfortably justify a CCL package.
     - - Would have somewhat harder-to-reason-about semantics with regard to staleness.
       - This solution may benefit greatly from a client-specified upper bound on staleness, though such an upper bound somewhat defeats the purpose of exact timestamp selection if we are hitting it with any frequency.
   - A middle ground might be a solution which dynamically tweaks the offset based on feedback. Imagine we allow nodes to dynamically alter the `target_multiple`, initially setting it to a relatively aggressive value (1?) and then increasing it on a per-node basis via a feedback mechanism whereby the DistSender records when requests it expects to be safe for follower reads return NotLeaseHolderError.
     - + Is dynamic with regard to the staleness of the timestamp based on cluster status, and thus aligns better with our understanding of customer expectations.
     - + Could be implemented with an easier-to-understand cluster setting about max staleness.
     - + Is simpler to implement than a stateful timestamp-tracking mechanism.
     - - Is not able to make strong claims about whether a timestamp will be safe for a read.
     - - Is probably the least easy to understand with regard to semantics.
   - Are there other approaches I'm not thinking of?
   - Would it be okay to pursue a simpler approach at first and then nuance it?
     - Would that imply taking extra care not to expose implementation-specific settings/details?

2. What sorts of expressions are valid for AOST? What are the plan caching semantics?
   - Currently AOST supports just constant expressions.
     - Though it's worth noting that they can be relative offsets which are computed relative to the statement time.
   - Let's work under the assumption that in addition to the current constant expressions we'll accept at least a scalar function call of no arguments which operates solely on the statement timestamp.
   - This should have the same planning implications as today's '-2m' AOST queries.
     - Do we do these correctly today?
     - Is there something I don't understand that would make a builtin which is pure w.r.t. statement time different from '-2m'?
   - Let's table the idea that you can use arbitrary function expressions such as `min`, for the reasons Raphael mentioned. If we need an upper bound, we can implement that via an argument to our function.
   - If we do the right thing for planning with prepared statements which include `AS OF SYSTEM TIME '-2m'`, it ought to be trivial to do the right thing with a builtin which is a pure function of the statement timestamp.

3. What do we call this function which will return a timestamp?
   - This is an important thing to get right but something I'd like to push to the very end, because it's something of a leaf concern that will be easier to discuss when all the other details have been figured out (Raphael's point).

4. Is there a better mechanism for CCL injection?
   - This is important, but whatever the answer is, it shouldn't be too hard to accommodate.
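To ground point 1, here is a minimal runnable sketch of the stateless candidacy check as stated above. All names and values are hypothetical stand-ins (`targetDuration`, `targetMultiple`, `maxClockOffset`, `canUseFollowerReads`), not the actual CockroachDB API:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the cluster settings discussed above.
const (
	targetDuration = 30 * time.Second       // closed timestamp target duration
	targetMultiple = 3.0                    // multiple applied as a safety margin
	maxClockOffset = 500 * time.Millisecond // maximum tolerated clock offset
)

// followerReadOffset plays the role of f() relative to statement time:
// how far behind "now" a timestamp must be before we consider serving
// it from a follower.
func followerReadOffset() time.Duration {
	return -time.Duration(targetMultiple * float64(targetDuration))
}

// canUseFollowerReads applies the candidacy condition from point 1:
// TxnTs < (f() + max clock offset).
func canUseFollowerReads(now, txnTs time.Time) bool {
	threshold := now.Add(followerReadOffset()).Add(maxClockOffset)
	return txnTs.Before(threshold)
}

func main() {
	now := time.Now()
	fmt.Println(canUseFollowerReads(now, now.Add(-2*time.Minute)))  // true: old enough
	fmt.Println(canUseFollowerReads(now, now.Add(-10*time.Second))) // false: too recent
}
```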
Reviewable status: complete! 0 of 0 LGTMs obtained
docs/RFCS/20181227_follower_reads_implementation.md, line 111 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Most parts of the SET TRANSACTION statement can also be applied to the BEGIN statement. Will that be possible here too?
Sure, I see no reason not to. Will update.
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I don't really like this setting - the application developer will be thinking in terms of "I can tolerate staleness of up to X seconds", but using this setting requires a bunch of math with other values that are mostly opaque/hidden from the app (and which seem like implementation details subject to change in the future). I think that `recent_timestamp()` should probably be a black box for users who don't care too much about how old their results are, and users who care should be able to pass in their staleness bound directly (maybe add an optional argument: `recent_timestamp('10s')`).
Noted. See my high level comment.
I met with Ben this morning primarily to discuss timestamp selection. The conclusion is that, while less than perfect, the proposed function, which produces a timestamp that lags the statement time by an interval derived from cluster settings, is a reasonable path forward for an initial implementation. The intention is to hide the implementation details of exactly how this timestamp selection works so that over time we can make the mechanism more sophisticated. That resolves point 1 above.
As for point 3, Ben expressed that `recent` carries no useful meaning to the user, and that because the intention of this function is so directly related to the desire to perform follower reads, the name ought to be obviously relatable to the feature.
I'm still working to better understand point 2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @justinj, @nvanbenschoten, and @tbg)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @justinj, @knz, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, knz (kena) wrote…
We need to be a bit careful to separate intent from mechanism.
Intent:
1. Does it matter for the MVP of this feature that the user be able to place bounds on staleness? All this discussion sounds like "I wish" from Nathan and Dan, but I'd like this to be supported by our customer design partners. In general, any subtlety in the UX specs will require extra complexity in the implementation. I'd like to shed all non-required complexity from the MVP, otherwise the discussion (below) on mechanism will distract from the main task.
2. Once the UX requirements are known, the plan caching semantics need to be specified. If the UX requirement is that the new feature places a bound on staleness, then we must specify here in the RFC that the cached plan be invalidated adequately (an old plan can also be stale, as much as the data!). Once you spell out the staleness requirements, we can discuss the ways to achieve them.
3. That being said, a part of the intent will be to not stray too far from our txn model so as to not create a super long impl path to an MVP. A part of the txn model is that the ASOT expression be a pure function of the txn start timestamp. So whatever you choose, it would be great to express the result of that expression as a (math) function of the start timestamp; that will make the reading much clearer.
(Note: I am discussing it this way because the exchange above was like a tail wagging the dog, trying to invent deliverable requirements based on perceived notions about impl restrictions. Beware! And this yields also point 4:)
4. Regarding the name `recent_timestamp` vs `follower_timestamp`: obviously the name `follower_timestamp` is intimately linked to an implementation choice, and in general naming user-visible functions after the mechanism is poor design. I can make do with the idea that the word "recent" is inadequate, but "follower" is even more inadequate. That said, finding names is something we've done well in the past: once you have clear answers to points 1, 2, and 3 above, a good name will fall out of that tree on its own.
Mechanism:
A) What Ben says: we're not allowing arbitrary SQL expressions in the ASOT clause because doing so would be hard. We can't reuse the regular typing logic, no subqueries, no nothing in those expressions (for different reasons that will become clearer when you become acquainted with the code), so whatever you want to evaluate you have to do manually. Today the code is limited to supporting a few types of literals. So the idea to use an explicit SQL `min()` function application would mandate re-implementing eval logic for that. That's A.1) cumbersome, A.2) a road I'm not keen to follow, because we don't exactly have bandwidth to support a second implementation of scalar evaluation.
B) It is possible however to make a special case for a built-in function (or several) without arguments, or one applied to literal values (e.g. you could make your `recent_timestamp` function take the bound, if desired by point 2 above, as an optional argument).
Design principles / best practices:
X) Using scalar expression syntax (especially built-in functions) gives us more flexibility to grow and add other things alongside it later: we know from experience that adding keywords to the SQL grammar while expecting to combine both sanity and future extensibility is like holding a shotgun to your face and another to your feet, pulling both triggers at once, and expecting something healthy to come out.
Y) Client code knows "well" how to assemble scalar expressions, and typically less "well" how to assemble arbitrary pieces of syntax.
See below comment. The resolution thus far has been that we'll remove the setting and the implementation-specific documentation for `follower_read_timestamp()`. This is indeed a feature-specific function, but it's not necessarily an implementation-specific function. The intention is that the user should expect this function to provide as up-to-date a timestamp as the system thinks can be served by a follower read. Obviously in this initial implementation that guess is conservative, but the function leaves the door open to future implementations making it smarter.
The document now proposes that the expression in the AS OF clause is either a constant or an invocation of just that function, to keep the implementation straightforward.
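To illustrate those semantics from the client side, here is a hedged sketch of a single follower-read query in Go using `database/sql` with the `lib/pq` driver. The connection string and the `accounts` table are made up for illustration; the `follower_read_timestamp()` builtin and the `AS OF SYSTEM TIME` clause are as described in this thread:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver commonly used with CockroachDB
)

func main() {
	// Hypothetical connection string; adjust for a real cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Per the proposal, the AS OF clause accepts either a constant
	// expression or an invocation of follower_read_timestamp().
	rows, err := db.Query(
		`SELECT id, balance FROM accounts AS OF SYSTEM TIME follower_read_timestamp()`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var balance float64
		if err := rows.Scan(&id, &balance); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, balance)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```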
docs/RFCS/20181227_follower_reads_implementation.md, line 156 at r1 (raw file):
Previously, ajwerner wrote…
I'll see what I can do.
I didn't do much :/
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, knz (kena) wrote…
See my comment above - the word `follower` is implementation-specific and not a good choice. Start tying down what you want to achieve, and discuss names only after that.
After some discussions with Ben, the idea here is that, regardless of the exact mechanism by which the system chooses this timestamp, the intention of using this function is that it is likely to produce a timestamp which the system will deem safe for a follower read. That does not make the function implementation-specific, but it certainly makes it feature-specific. This function should be used exclusively by clients hoping to perform reads against followers. Given the feedback regarding not overly tying the functionality to
docs/RFCS/20181227_follower_reads_implementation.md, line 233 at r4 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
If we want to revisit how the various ccl hooks work, so they're not just globals, I'm all for that. I think the package split has worked quite well though. We tried build tags as one of the options when we were setting this up initially and it was obnoxious from an "editors and tooling" perspective, which made it a hassle to work on enterprise code.
Given the lack of desire to reopen that can of worms, I'm moving forward with the global.
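For reference, a minimal sketch of what that global-hook injection pattern looks like, with all names and the threshold purely hypothetical (the real hook takes cluster, settings, and batch request, per the eventual implementation):

```go
package main

import "fmt"

// canUseFollowerReads is a package-level hook, mirroring the global
// injection pattern discussed above: OSS code holds a conservative
// default, and CCL code replaces it at init time.
var canUseFollowerReads = func(txnAgeSeconds float64) bool {
	return false // pure-OSS build: follower reads disabled
}

// injectFollowerReadsCCL stands in for the CCL package's init hook.
// The 90-second threshold is an illustrative value, not the real one.
func injectFollowerReadsCCL() {
	canUseFollowerReads = func(txnAgeSeconds float64) bool {
		return txnAgeSeconds > 90
	}
}

func main() {
	fmt.Println(canUseFollowerReads(120)) // false: hook not yet injected
	injectFollowerReadsCCL()
	fmt.Println(canUseFollowerReads(120)) // true: CCL behavior injected
}
```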
docs/RFCS/20181227_follower_reads_implementation.md, line 111 at r5 (raw file):
Previously, ajwerner wrote…
Sure, I see no reason not to. Will update.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 123 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I like this; it will also help applications that use ORMs that make it difficult to inject the AOST clause into a generated SELECT statement.
Done.
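A hedged sketch of that ORM-friendly, transaction-scoped form, under the same assumptions as the earlier example (hypothetical connection details and table): the AOST clause is attached once via `SET TRANSACTION`, so ORM-generated statements need no changes:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	// Per the proposal, AS OF SYSTEM TIME can apply to the whole
	// transaction, so generated SELECTs inherit it unchanged.
	if _, err := tx.Exec(
		`SET TRANSACTION AS OF SYSTEM TIME follower_read_timestamp()`); err != nil {
		log.Fatal(err)
	}

	var n int
	// A query an ORM might generate, now served at the follower-read timestamp.
	if err := tx.QueryRow(`SELECT count(*) FROM accounts`).Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Println("rows:", n)

	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```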
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
Previously, ajwerner wrote…
Noted. See my high level comment.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 354 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
FWIW, this would make the semantics harder to understand, but based on my conversations with potential users of this feature it's what they generally want. Users who are interested in follower reads tend to have a very high upper bound for staleness but they don't want to fix their reads at that worst-case value. They'd like timestamp selection to be somehow dynamic.
Discussed offline that while something more dynamic is probably desirable, for the first pass we're going to do the simpler thing.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @justinj, @knz, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, ajwerner wrote…
After some discussions with Ben, the idea here is that, regardless of the exact mechanism by which the system chooses this timestamp, the intention of using this function is that it is likely to produce a timestamp which the system will deem safe for a follower read. That does not make the function implementation-specific, but it certainly makes it feature-specific. This function should be used exclusively by clients hoping to perform reads against followers. Given the feedback regarding not overly tying the functionality to
+1 on `follower`. That's what we're going to call the feature, so it doesn't seem like a problem to me.
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
Previously, ajwerner wrote…
Done.
What is your stance on no setting vs. a hidden setting?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @justinj, @knz, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 152 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
+1 on `follower`. That's what we're going to call the feature, so it doesn't seem like a problem to me.
Done.
docs/RFCS/20181227_follower_reads_implementation.md, line 145 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What is your stance on no setting vs. a hidden setting?
I'm for a hidden setting. I added that to the doc. It makes sense that 1) we'd want to play with this while testing out the feature and 2) when it's in the field there might be some case where we're happy it's there.
bors r+
33474: docs: add RFC to expose follower reads to clients r=ajwerner a=ajwerner

Relates to #16593.

Release note: None

34399: storage: fix NPE while printing trivial truncateDecision r=bdarnell a=tbg

Fixes #34398.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build succeeded
From a process perspective, there's a shortcoming here. The latest iteration was less than 24 hours ago, and I haven't had the chance to get an additional turn at reviewing this before it was merged. That's not what we usually do with RFCs.
Reviewed 1 of 1 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @bdarnell, @justinj, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, ajwerner wrote…
See below comment. The resolution thus far has been that we'll remove the setting and the implementation-specific documentation for `follower_read_timestamp()`. This is indeed a feature-specific function, but it's not necessarily an implementation-specific function. The intention is that the user should expect this function to provide as up-to-date a timestamp as the system thinks can be served by a follower read. Obviously in this initial implementation that guess is conservative, but the function leaves the door open to future implementations making it smarter.
The document now proposes that the expression in the AS OF clause is either a constant or an invocation of just that function, to keep the implementation straightforward.
You're missing that the word "follower" is an artifact of our choice to use Raft (or consensus in general).
What the feature really is about is breaking the master/slave (or primary/secondary) boundary for scans and actually delivering on our promise of "multi-active".
We're not doing "follower reads" here (that's a Raft-specific word). We're doing safe distributed reads from secondaries. This is why, if I had the time to iterate on this after your latest batch of changes, I'd have been driven by your other changes/explanations to the natural good name for this: `safe_distributed_read_timestamp()`.
I apologize for being hasty. I was having a somewhat difficult time balancing the pressure to get this work in. Even more than that, I think I personally underplayed the importance of the naming here and felt that there was support from Ben and Nathan for "follower". As I expressed before, I don't have strong conviction on the naming here. I'm happy to keep this open and to make further revisions to this RFC.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @bdarnell, @justinj, @nvanbenschoten, and @tbg)
docs/RFCS/20181227_follower_reads_implementation.md, line 72 at r1 (raw file):
Previously, knz (kena) wrote…
You're missing that the word "follower" is an artifact of our choice to use Raft (or consensus in general).
What the feature really is about is breaking the master/slave (or primary/secondary) boundary for scans and actually delivering on our promise of "multi-active".
We're not doing "follower reads" here (that's a Raft-specific word). We're doing safe distributed reads from secondaries. This is why, if I had the time to iterate on this after your latest batch of changes, I'd have been driven by your other changes/explanations to the natural good name for this: `safe_distributed_read_timestamp()`.
I don’t disagree with anything you’ve said here. My sense is that it’ll be easiest to discuss this feature if the terminology aligns with how we market this functionality. My current understanding is that the product folks think of this functionality as “follower reads” and will market it as such and hence the chosen name.
In Ceph, for example, the name for a somewhat analogous feature is "balanced reads", though it additionally implies a round-robin replica selection policy and furthermore assumes a single-data-center deployment, and thus is not a great fit for us.
Maybe we should rebrand this feature as "Distributed Reads", in which case we absolutely should rename this function to match. Probably my biggest blind spot in this discussion is a general lack of conviction regarding the naming here. I apologize for not giving enough time and space to let that discussion run its course. I'm happy to further revise this doc in a follow-up PR.
In all fairness, this is a discussion where a PM should have been involved in the loop where we balance the engineering aspects with the productization aspects. As a general rule, when a feature like this has strong UX visibility in this way, it's OK in the RFC to say "the initial implementation will use the name
33478: sql,kv,followerreadsccl: enable follower reads for historical queries r=ajwerner a=ajwerner

Follower reads are reads which can be served from any replica as opposed to just the current lease holder. The foundation for this change was laid with the work to introduce closed timestamps and to support follower reads at the replica level. This change adds the required support to the sql and kv layers and additionally exposes a new syntax to ease client adoption of the functionality.

The change adds the followerreadsccl package with logic to check when follower reads are safe and to inject the functionality so that it can be packaged as an enterprise feature. It modifies `AS OF SYSTEM TIME` semantics to allow for the evaluation of a new builtin tentatively called `follower_read_timestamp()` in addition to constant expressions. This new builtin ensures that an enterprise license exists and then returns a time that can likely be used to read from a follower. The change abstracts (and renames to the more appropriate replicaoracle) the existing leaseHolderOracle in the distsqlplan package to allow a follower-read-aware policy to be injected. Lastly, the change adds to kv a site to inject a function for checking if follower reads are safe and allowed given a cluster, settings, and batch request.

This change includes a high-level roachtest which validates the observable behavior of performing follower reads by examining latencies for reads in a geo-replicated setting.

Implements #33474
Fixes #16593

Release note (enterprise change): Add support for performing sufficiently old historical reads against closest replicas rather than leaseholders. A new builtin function `follower_read_timestamp()` can be used with `AS OF SYSTEM TIME` clauses to generate a timestamp which is likely to be safe for reads from a follower.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Relates to #16593.
Release note: None