sql: use AOST -10s when building crdb_internal.table_row_statistics #60953
Conversation
Note: this change deviates slightly from the suggestions by @ajwerner. I don't have very good intuition about transaction contention, so this change may be suboptimal. But the reason I did it this way is that I am conscious of the memory overhead required for histograms, and generating all of the histograms up front seems like it could expand the memory utilization more than necessary. But maybe I'm just being overly cautious. I also don't think we need to add all the stats in the same transaction, which is why I split it up into several smaller transactions. But maybe you disagree, @RaduBerinde. Curious to hear your thoughts.
What would be even better than this would be to accumulate all of the writes into a single batch and commit that batch atomically. That would incur zero contention in most cases and a minimal amount in cases where the relevant stats rows spanned ranges. Is it ever the case that the sketches sum up to megabytes?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)
I suppose maybe that's hard.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)
I think it's rare that histograms would be multiple megabytes, but if a table were very wide with a lot of columns and indexes, it could definitely happen. When you say "accumulate all of the writes into a single batch", do you mean write a multi-row insert statement? If so, I can give that a try...
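For concreteness, here's a minimal Go sketch of the multi-row-insert idea being asked about. The table and column names are placeholders rather than the real system.table_statistics schema; the point is only to show how N per-row statements could collapse into one statement with numbered placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// buildMultiRowInsert builds a single multi-row INSERT so that all of a
// table's stats rows are written in one statement rather than one statement
// per row. Placeholders are numbered $1, $2, ... across all rows.
func buildMultiRowInsert(table string, cols []string, numRows int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "INSERT INTO %s (%s) VALUES ", table, strings.Join(cols, ", "))
	arg := 1
	for i := 0; i < numRows; i++ {
		if i > 0 {
			b.WriteString(", ")
		}
		b.WriteByte('(')
		for j := range cols {
			if j > 0 {
				b.WriteString(", ")
			}
			fmt.Fprintf(&b, "$%d", arg)
			arg++
		}
		b.WriteByte(')')
	}
	return b.String()
}

func main() {
	// Illustrative column names, not the actual schema.
	fmt.Println(buildMultiRowInsert("system.table_statistics",
		[]string{`"tableID"`, `"statisticID"`, "histogram"}, 2))
}
```

All the writes then carry in one statement, which is what lets them be batched (and potentially committed via 1PC when they land in one range).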
I do feel it's unfortunate that SQL is sometimes a bad match for the desired KV API semantics. If we didn't need to support the conversational semantics exactly, we'd have more freedom to group things into batches.
Anyway, this certainly is an improvement. Better yet would be to group the delete and insert into a single statement so that we could read and then broadcast the write with a commit. As it stands, it'll have ~8 WAN global round trips per column in the worst case:
Delete reads from leaseholder (1)
Delete writes (1 to lease holder, replication pipelined)
Insert writes (1 to lease holder, replication pipelined)
Commit (1 to leaseholders, 1 to replicate, 1 to make explicitly committed, 1 to replicate, 1 to resolve)
Here the contention footprint is ~7 global RTTs.
I think the best we can hope to do with a statement per column is:
Delete reads from leaseholder (1 to leaseholders)
Delete and insert writes get propagated with a commit (1 to leaseholders, 1 to replicate, 1 to make explicitly committed, 1 to resolve)
Here the contention footprint is ~5 global RTTs.
However, if in the above case all of the writes and deletes fall in the same range (very likely), then it's 1PC and then the reads are really just blocked waiting on replication so 1 global RTT.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)
I don't know the internals of how a CTE that does the delete and insert in the same statement would get planned. I doubt it will be a single batch. If it were, that would be amazing.
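For reference, the delete-plus-insert-in-one-statement shape under discussion would look roughly like the string below. This is a sketch only: the predicate and column list are placeholders, and (as noted) whether the optimizer would plan it as a single batch is an open question. Expressed as a small Go program so the statement text is easy to inspect:

```go
package main

import "fmt"

// buildUpsertCTE returns the statement shape under discussion: a
// data-modifying CTE that deletes a table's old stats rows and inserts the
// new ones in a single statement. Predicate and columns are illustrative.
func buildUpsertCTE() string {
	return `WITH deleted AS (
  DELETE FROM system.table_statistics WHERE "tableID" = $1 RETURNING "tableID"
)
INSERT INTO system.table_statistics ("tableID", "statisticID", name, "rowCount")
VALUES ($1, $2, $3, $4)`
}

func main() {
	fmt.Println(buildUpsertCTE())
}
```

If this did plan as one batch, the read and both writes could share a single commit, which is the best case sketched above.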
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)
Given all of this data should be co-located, I think the min user priority thing should work pretty well. Transactions encountering the intents should be able to abort the writing transaction without leaving the local node.
On the whole, I'm coming around to this being just totally fine and a good change so long as you can tolerate the loss of atomicity.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)
Do we have reason to believe that splitting up the txn is preferable?
One thing to keep in mind is that we'll need to change the gossip mechanism to a range feed soon. The range feed "listens" to changes on the system table. At that point, having a single txn would be better, otherwise we would end up refreshing the stats multiple times for the same table.
Reviewable status: complete! 0 of 0 LGTMs obtained
That gossip thing seems like a big deal. The issue here is about reducing the contention footprint of these transactions. If the table has many columns, we're in a global cluster, and we set priority low, then a loop querying the stats from the leaseholder's region would effectively prevent the stats from ever committing. Breaking the transaction into pieces at least bounds much more tightly how long that footprint lasts. It may still be multiple global RTTs, as noted above.
I do have livelock worries with PRIORITY LOW. Everything is tradeoffs, and in this case I'm not sure what's best.
Reviewable status: complete! 0 of 0 LGTMs obtained
Doesn't the priority increment each time a transaction is pushed?
I'm pretty sure it never changes class. That incrementing priority is about avoiding starvation and providing some amount of fairness among transactions of the same priority class.
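As an illustration of the distinction drawn here (a toy model, not CockroachDB's actual transaction priority code): a push can ratchet the numeric priority within a class for fairness, but it never moves the transaction to a different class.

```go
package main

import "fmt"

// PriorityClass models the coarse priority classes (low/normal/high).
type PriorityClass int

const (
	ClassLow PriorityClass = iota
	ClassNormal
	ClassHigh
)

// TxnPriority is a toy model: a fixed class plus a numeric priority that
// can be ratcheted up within the class.
type TxnPriority struct {
	Class PriorityClass
	Value int
}

// OnPush bumps the numeric priority so a repeatedly pushed transaction is
// not starved forever, but deliberately leaves the class unchanged.
func (p *TxnPriority) OnPush(pusherValue int) {
	if pusherValue+1 > p.Value {
		p.Value = pusherValue + 1
	}
}

func main() {
	p := TxnPriority{Class: ClassLow, Value: 1}
	p.OnPush(5)
	fmt.Println(p.Class == ClassLow, p.Value) // class unchanged, value bumped
}
```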
Force-pushed from 5f9c855 to 74b5c84.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)
pkg/sql/crdb_internal.go, line 415 at r1 (raw file):
// querying the system.table_statistics table when building
// crdb_internal.table_row_statistics. It is mutable for testing.
var StatsAsOfTime = 10 * time.Second
a non-public cluster setting would both make the testing injection more elegant and generally just seem better to me.
pkg/sql/crdb_internal.go, line 418 at r1 (raw file):
var crdbInternalTablesTableLastStats = virtualSchemaTable{
	comment: "the latest stats for all tables accessible by current user in current database (KV scan)",
nit: I do think it's worth a comment here too that it is stale
Force-pushed from 9332695 to 6d0bec6.
I think this is RFAL - thanks!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/sql/crdb_internal.go, line 415 at r1 (raw file):
Previously, ajwerner wrote…
a non-public cluster setting would both make the testing injection more elegant and generally just seem better to me.
Done.
pkg/sql/crdb_internal.go, line 418 at r1 (raw file):
Previously, ajwerner wrote…
nit: I do think it's worth a comment here too that it is stale
Done.
Force-pushed from 6d0bec6 to 6cfcecc.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @rytaft)
pkg/sql/catalog/descs/collection.go, line 325 at r2 (raw file):
Quoted 15 lines of code…
if descID == descpb.InvalidID {
	var found bool
	var err error
	found, descID, err = catalogkv.LookupObjectID(ctx, txn, codec, parentID, parentSchemaID, name)
	if err != nil || !found {
		return found, nil, err
	}
}
// Always pick up a mutable copy so it can be cached.
desc, err = catalogkv.GetAnyDescriptorByID(ctx, txn, codec, descID, catalogkv.Mutable)
if err != nil {
	return false, nil, err
} else if desc == nil {
	// Having done the namespace lookup, the descriptor must exist.
	return false, nil, errors.NewAssertionErrorWithWrappedErrf(
gosh this is gross :(
I guess we just pushed this grossness down here. Can you do something like:
isSystemDescriptor := descID != descpb.InvalidID
if !isSystemDescriptor {
	var found bool
	var err error
	// ....
}
// ..
// Always pick up a mutable copy so it can be cached.
desc, err = catalogkv.GetAnyDescriptorByID(ctx, txn, codec, descID, catalogkv.Mutable)
if err != nil {
	return false, nil, err
} else if desc == nil && isSystemDescriptor {
	// Say things about how this can happen during startup because we're not
	// actually looking up the system descriptor IDs in KV.
	// return something wrapping ErrDescriptorNotFound
} else if desc == nil {
	// Having done the namespace lookup, the descriptor must exist.
	// Assertion failure
}
This commit updates the query used to build the virtual table crdb_internal.table_row_statistics so that it is always run at AS OF SYSTEM TIME '-10s'. This will reduce contention on the system.table_statistics table and avoid delays when running SHOW TABLES, which queries crdb_internal.table_row_statistics.

Informs cockroachdb#58189

Release note (performance improvement): Updated the query used to build the virtual table crdb_internal.table_row_statistics so that it is always run at AS OF SYSTEM TIME '-10s'. This should reduce contention on the table and improve performance for transactions that rely on crdb_internal.table_row_statistics, such as SHOW TABLES.
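The core of the change can be sketched as follows. This is a simplified illustration: the real query, the cluster-setting plumbing, and the column list live in pkg/sql/crdb_internal.go, and the statsAsOfTime variable here merely mirrors the idea of a mutable 10-second staleness interval.

```go
package main

import (
	"fmt"
	"time"
)

// statsAsOfTime mirrors the PR's overridable staleness interval for reading
// system.table_statistics (10s by default).
var statsAsOfTime = 10 * time.Second

// buildStatsQuery appends an AS OF SYSTEM TIME clause so this read runs at a
// slightly stale timestamp and does not contend with concurrent writes to
// system.table_statistics. The selected columns are illustrative.
func buildStatsQuery() string {
	return fmt.Sprintf(
		`SELECT "tableID", max("rowCount") FROM system.table_statistics AS OF SYSTEM TIME '-%s' GROUP BY "tableID"`,
		statsAsOfTime)
}

func main() {
	fmt.Println(buildStatsQuery())
}
```

Because the read happens at a fixed timestamp in the past, it never sits inside the contention footprint of the transactions that refresh the stats.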
Force-pushed from 6cfcecc to 9c24267.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/sql/catalog/descs/collection.go, line 325 at r2 (raw file):
Previously, ajwerner wrote…
if descID == descpb.InvalidID {
	var found bool
	var err error
	found, descID, err = catalogkv.LookupObjectID(ctx, txn, codec, parentID, parentSchemaID, name)
	if err != nil || !found {
		return found, nil, err
	}
}
// Always pick up a mutable copy so it can be cached.
desc, err = catalogkv.GetAnyDescriptorByID(ctx, txn, codec, descID, catalogkv.Mutable)
if err != nil {
	return false, nil, err
} else if desc == nil {
	// Having done the namespace lookup, the descriptor must exist.
	return false, nil, errors.NewAssertionErrorWithWrappedErrf(
gosh this is gross :(
I guess we just pushed this grossness down here. Can you do something like:
isSystemDescriptor := descID != descpb.InvalidID
if !isSystemDescriptor {
	var found bool
	var err error
	// ....
}
// ..
// Always pick up a mutable copy so it can be cached.
desc, err = catalogkv.GetAnyDescriptorByID(ctx, txn, codec, descID, catalogkv.Mutable)
if err != nil {
	return false, nil, err
} else if desc == nil && isSystemDescriptor {
	// Say things about how this can happen during startup because we're not
	// actually looking up the system descriptor IDs in KV.
	// return something wrapping ErrDescriptorNotFound
} else if desc == nil {
	// Having done the namespace lookup, the descriptor must exist.
	// Assertion failure
}
Done. Thanks for the code snippet!
Reviewed 5 of 6 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained
TFTR! bors r+
Reviewable status: complete! 2 of 0 LGTMs obtained
Build failed (retrying...)
Build succeeded
61215: sql: deflake TestPGPreparedQuery r=RaduBerinde,rytaft a=rafiss

A recent change to stats in SHOW TABLES made this test flaky. Stats are now read with AOST -10s by default, but for tests like this we need to modify this. See #60953 and #61191

Release justification: test-only change
Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>