rfc: temporary tables support #41482
Conversation
(force-pushed from 9a9fbb1 to 579a393)
Very nice work! A few nits from me, but on the high level I think it makes sense (although I'm not very familiar with that part of the codebase).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @solongordon)
docs/RFCS/20191009_temp_tables.md, line 37 at r1 (raw file):
the session it was created in.
- TTs can depend on other TTs using foreign keys, and PTs can depend on other PTs, but it's not possible to refer to a PT from a TT or vice-versa.
Why is it not ok for a TT to refer to a PT? Is this what Postgres does?
docs/RFCS/20191009_temp_tables.md, line 63 at r1 (raw file):
> conversely, the same mechanism is always used when the CREATE statement targets a temporary schema,
> regardless of whether the TEMP keyword is specified or not.
- SELECT * FROM t is equivalent to SELECT * FROM pg_temp_1231231312.t
nit: queries need highlighting.
docs/RFCS/20191009_temp_tables.md, line 167 at r1 (raw file):
- For all existing tables, the schemaID field is prefilled with 0 to scope them under `public`.
### Session Scoped Namespace
nit: "Session Scoped Namespace" is repeated as the header for the second time - seems a little weird.
docs/RFCS/20191009_temp_tables.md, line 217 at r1 (raw file):
> is the table descriptor. Instead of persisting table descriptors, we could simply store temporary tabled descriptors in the
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @solongordon)
docs/RFCS/20191009_temp_tables.md, line 217 at r1 (raw file):
Previously, yuzefovich wrote…
The lost comment was: nit: s/tabled/table/
.
(force-pushed from 579a393 to 47e2301)
Thanks for having a look! :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @solongordon and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 37 at r1 (raw file):
Previously, yuzefovich wrote…
Why is it not ok for a TT to refer to a PT? Is this what Postgres does?
Yeah, cross-referencing between TTs and PTs is not allowed by Postgres.
docs/RFCS/20191009_temp_tables.md, line 63 at r1 (raw file):
Previously, yuzefovich wrote…
[nit]: queries need highlighting.
Done.
docs/RFCS/20191009_temp_tables.md, line 167 at r1 (raw file):
Scoped
It was actually supposed to be "Session Scoped Deletion".
docs/RFCS/20191009_temp_tables.md, line 217 at r1 (raw file):
Previously, yuzefovich wrote…
The lost comment was: [nit]:
s/tabled/table/
.
Done.
Reviewed 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @knz, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 37 at r2 (raw file):
the session it was created in.
- TTs can depend on other TTs using foreign keys, and PTs can depend on other PTs, but it's not possible to refer to a PT from a TT or vice-versa.
Maybe add some note here about interleaving (a crdb-specific feature).
docs/RFCS/20191009_temp_tables.md, line 43 at r2 (raw file):
created in a special session-specific temporary schema. Additionally, TTs are exposed in information_schema and pg_catalog like regular tables, with their
nit: `information_schema` and `pg_catalog` (with backticks for the identifiers)
docs/RFCS/20191009_temp_tables.md, line 53 at r2 (raw file):
There is just one temporary schema defined per database and per session. Its name is auto-generated based on the session ID. For example, session with ID 1231231312 will have "pg_temp_1231231312" as its temporary schema name.
ditto backticks
docs/RFCS/20191009_temp_tables.md, line 59 at r2 (raw file):
- `CREATE TEMP TABLE t(x INT)` is equivalent to `CREATE TEMP TABLE pg_temp_1231231312.t(x INT)` and also `CREATE TABLE pg_temp_1231231312.t(x INT)` and also `CREATE TABLE pg_temp.t(x INT)`
> Note that the last two equivalences are a reminder that the TEMP keyword is merely syntactic sugar
Maybe indent this to the left under the bullet point instead of using the block quote syntax.
docs/RFCS/20191009_temp_tables.md, line 67 at r2 (raw file):
The temporary schema, when needed the first time, gets auto-created in the current database as defined by the `database` session variable (and the head of search_path). If a client session
nit: `search_path` with backticks
docs/RFCS/20191009_temp_tables.md, line 102 at r2 (raw file):
Ensuring no foreign key cross referencing is allowed between temporary/persistent tables should not be that hard to solve -- a boolean check at the point of establishment should suffice.
Please detail where the check should be implemented.
docs/RFCS/20191009_temp_tables.md, line 104 at r2 (raw file):
that hard to solve -- a boolean check at the point of establishment should suffice. This requires adding an additional boolean flag to TableDescriptors that is set to true if the table is temporary.
Are you sure we need it in the table descriptor itself, maybe in the MutableTableDescriptor is sufficient (only schema changes need to know the difference I think?)
docs/RFCS/20191009_temp_tables.md, line 113 at r2 (raw file):
temporary tables. Tables under this schema can only be accessed from the session that created the schema. Currently, CockroachDB only supports the `public` physical schema. As part of this task, CRDB should be extended to support other physical schemas (`pg_temp_<session id>`) under which temporary tables can live.
Also there should be one such temp schema per database I think?
docs/RFCS/20191009_temp_tables.md, line 122 at r2 (raw file):
### Workflow
Every session starts with 4 schemas (`public`, `crdb_internal`, `information_schema`, `pg_catalog`).
session+database
docs/RFCS/20191009_temp_tables.md, line 143 at r2 (raw file):
| parentID | name               | Id | parentSchemaID |
|----------|--------------------|----|----------------|
| 1        | pg_temp_1231231312 | 52 | 0              |
I suppose we could use parentSchemaID = -1 to distinguish rows in system.namespace that designate schemas, from those that designate tables with `public` as parent schema.
docs/RFCS/20191009_temp_tables.md, line 144 at r2 (raw file):
| parentID | name               | Id | parentSchemaID |
|----------|--------------------|----|----------------|
| 1        | pg_temp_1231231312 | 52 | 0              |
| 1        | rides              | 53 | 52             |
Also you'll need a sub-section to explain what will happen with SHOW TABLES (and other places) in mixed-version configs (19.2 + 20.1).
Perhaps the new feature can only be enabled when the cluster version has been bumped to 20.1 because of this.
docs/RFCS/20191009_temp_tables.md, line 150 at r2 (raw file):
This mapping can never change during the course of a session because the schema can not be renamed or dropped. Even if all temporary tables for a session are manually dropped, the schema is not. Thus, this cache is always consistent for a particular session.
... and does not need an invalidation protocol.
docs/RFCS/20191009_temp_tables.md, line 169 at r2 (raw file):
- Non-qualified names get looked up in the order specified by search_path, with the same special case as PostgreSQL:
  - If search_path mentions pg_catalog explicitly, search_path is used as-is
nit: backticks everywhere relevant
docs/RFCS/20191009_temp_tables.md, line 192 at r2 (raw file):
- Name resolution for databases changes to (0, 0, Database name) -> DatabaseID.
- Name resolution for schemas is introduced, as (Parent ID, 0, Schema Name) -> SchemaID.
- Name resolution for tables changes to (ParentID, SchemaID, Table Name) -> TableDescriptorID
Here you have another reason to gate the feature behind a cluster version: you'll need to change the PK of system.namespace, and that means that all pre-20.1 nodes will not be able to use it. I think that's fine but it needs to be called out.
docs/RFCS/20191009_temp_tables.md, line 198 at r2 (raw file):
- If a session tries to access a temporary table owned by another session, this can be caught during name resolution as the schema name is constructed using the session. A session is only allowed to access `pg_temp_<session_id>` and`public `physical schemas.
nit: quotes
docs/RFCS/20191009_temp_tables.md, line 205 at r2 (raw file):
#### Migration:
- For every DatabaseID that exists in the system.namespace table, a `public` schema is added by
Add a note that the migration only occurs when the cluster version is bumped to 20.1 or later.
docs/RFCS/20191009_temp_tables.md, line 212 at r2 (raw file):
There could be cases where a session terminates and is unable to perform clean up, for example when a node goes down. We can not rely on a session to ensure that hanging data/table descriptors are removed. Instead, we use a daemon process to perform cleanup.
What is the frequency of execution?
Nice job!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @knz, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 71 at r3 (raw file):
same name gets created in the new database. The temporary schema is thus defined per-database and it is thus possible to have identically named temporary tables in different databases in the same session.
[nit] Mention that this is consistent with Postgres (these semantics seem a little strange when reading about them for the first time).
docs/RFCS/20191009_temp_tables.md, line 101 at r3 (raw file):
# Reference-level explanation Ensuring no foreign key cross referencing is allowed between temporary/persistent tables should not be
[nit] these paragraphs are kind of random, maybe they should be moved after the major challenges part, as bullet points.
docs/RFCS/20191009_temp_tables.md, line 215 at r3 (raw file):
removed. Instead, we use a daemon process to perform cleanup. A background process finds all active sessions and filters through the system.namespace table to find
Sounds like this should use the jobs infrastructure. Each temp schema would trigger a job, and there is some functionality for jobs being taken over by other nodes when a node fails.
docs/RFCS/20191009_temp_tables.md, line 283 at r3 (raw file):
#### Q1. Do we need to efficiently allocate temporary table IDs?
Currently, we do not keep track of which table IDs are in use and which ones have been deleted.
I don't see why this would be a problem (except maybe convenience during debugging), but if it is we could generate IDs for temp tables using a separate counter, and always set the high bit for these - this way "small" IDs will be regular tables and "large" IDs will be temp tables.
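To make that concrete, here's a minimal sketch of the separate-counter idea (hypothetical names, not CRDB APIs): temp-table IDs come from their own counter and always carry the high bit, so the two ID ranges can never collide.

```go
package idalloc

// tempIDBit is the high bit suggested above for temp tables; "small" IDs
// below it remain reserved for regular tables.
const tempIDBit = uint64(1) << 63

// MakeTempTableID tags a value drawn from the separate temp-table counter.
func MakeTempTableID(counter uint64) uint64 { return counter | tempIDBit }

// IsTempTableID reports whether an ID came from the temp-table counter.
func IsTempTableID(id uint64) bool { return id&tempIDBit != 0 }
```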
(force-pushed from 47e2301 to 2da199d)
Thanks for the feedback! I've worked through most of the comments, will address the ones I haven't in my next push.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @knz, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 53 at r2 (raw file):
Previously, knz (kena) wrote…
ditto backticks
Done.
docs/RFCS/20191009_temp_tables.md, line 59 at r2 (raw file):
Previously, knz (kena) wrote…
Maybe indent this to the left under the bullet point instead of using the block quote syntax.
Done.
docs/RFCS/20191009_temp_tables.md, line 102 at r2 (raw file):
Previously, knz (kena) wrote…
Please detail where the check should be implemented.
Done.
docs/RFCS/20191009_temp_tables.md, line 104 at r2 (raw file):
Previously, knz (kena) wrote…
Are you sure we need it in the table descriptor itself, maybe in the MutableTableDescriptor is sufficient (only schema changes need to know the difference I think?)
I could be wrong, but when create table writes the mutable descriptor it is creating, it goes through the `WriteNewDescToBatch` call which only persists the TableDescriptor part of the MutableTableDescriptor. If we do not add the persistenceStatus to the table descriptor, when we hit the KV layer for name resolution during the FK resolution process, we will not be able to ensure there isn't a cross-reference.
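As a hedged sketch of how that check could look once the flag is persisted (the `Temporary` field and the types here are stand-ins, not CRDB's actual descriptor definitions), FK establishment just compares the two sides:

```go
package sqlbase

import "errors"

// TableDescriptor is a stand-in for the persisted descriptor, carrying the
// proposed persistence flag.
type TableDescriptor struct {
	Name      string
	Temporary bool
}

// validateFKPersistence rejects cross-references between temporary and
// persistent tables at the point the foreign key is established.
func validateFKPersistence(origin, referenced *TableDescriptor) error {
	if origin.Temporary != referenced.Temporary {
		return errors.New("foreign key references between temporary and persistent tables are not allowed")
	}
	return nil
}
```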
docs/RFCS/20191009_temp_tables.md, line 113 at r2 (raw file):
Previously, knz (kena) wrote…
Also there should be one such temp schema per database I think?
Done.
docs/RFCS/20191009_temp_tables.md, line 122 at r2 (raw file):
Previously, knz (kena) wrote…
session+database
Done.
docs/RFCS/20191009_temp_tables.md, line 143 at r2 (raw file):
Previously, knz (kena) wrote…
I suppose we could use parentSchemaID = -1 to distinguish rows in system.namespace that designate schemas, from those that designate tables with `public` as parent schema.
Updated the example to use 29 instead of 0 as the `public` schema ID. This achieves the same goal but keeps the ID reservation consistent with our current system imo.
docs/RFCS/20191009_temp_tables.md, line 144 at r2 (raw file):
Previously, knz (kena) wrote…
Also you'll need a sub-section to explain what will happen with SHOW TABLES (and other places) in mixed-version configs (19.2 + 20.1).
Perhaps the new feature can only be enabled when the cluster version has been bumped to 20.1 because of this.
If the feature is gated behind cluster version 20.1, SHOW TABLES should not change right?
docs/RFCS/20191009_temp_tables.md, line 150 at r2 (raw file):
Previously, knz (kena) wrote…
... and does not need an invalidation protocol.
Done.
docs/RFCS/20191009_temp_tables.md, line 192 at r2 (raw file):
Previously, knz (kena) wrote…
Here you have another reason to gate the feature behind a cluster version: you'll need to change the PK of system.namespace, and that means that all pre-20.1 nodes will not be able to use it. I think that's fine but it needs to be called out.
I've added this to the Summary section.
docs/RFCS/20191009_temp_tables.md, line 205 at r2 (raw file):
Previously, knz (kena) wrote…
Add a note that the migration only occurs when the cluster version is bumped to 20.1 or later.
Wouldn't the migration need to be done when the cluster version is bumped to 20.1 or later, but the feature itself be gated until the cluster version is bumped?
docs/RFCS/20191009_temp_tables.md, line 212 at r2 (raw file):
Previously, knz (kena) wrote…
What is the frequency of execution?
Not quite sure what is a reasonable frequency. I'm putting this as an unresolved question for now.
I'm also ironing out the details of possibly using the jobs framework to do deletion. If we have a job per temp schema created, there might be an opportunity to use an exponentially growing wait time -- longer running sessions would not be checked as frequently. Not sure if this is overkill though.
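For illustration, that exponentially growing wait could look roughly like the following (`sessionAlive` and `dropSchema` are hypothetical hooks; the one-minute base and one-hour cap are made-up numbers):

```go
package cleanup

import (
	"context"
	"time"
)

// watchTempSchema is a sketch of a per-temp-schema job: it polls session
// liveness, doubling its wait up to a cap so long-running sessions are
// checked less and less frequently.
func watchTempSchema(ctx context.Context, sessionAlive func() bool, dropSchema func()) {
	wait, maxWait := time.Minute, time.Hour
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(wait):
		}
		if !sessionAlive() {
			dropSchema()
			return
		}
		if wait *= 2; wait > maxWait {
			wait = maxWait
		}
	}
}
```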
docs/RFCS/20191009_temp_tables.md, line 215 at r3 (raw file):
Previously, RaduBerinde wrote…
Sounds like this should use the jobs infrastructure. Each temp schema would trigger a job, and there is some functionality for jobs being taken over by other nodes when a node fails.
I like this idea! I'll iron out the details and update the RFC.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @knz, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 111 at r4 (raw file):
temporary tables. Tables under this schema can only be accessed from the session that created the schema. Currently, CockroachDB only supports the `public` physical schema. As part of this task, CRDB should be extended to support other physical schemas (`pg_temp_<session id>`) under which temporary tables can live.
Supporting additional physical schemas is its own beneficial project. Is there a plan to do that as well? Should that be a separate project that is done as a prerequisite to temporary tables?
docs/RFCS/20191009_temp_tables.md, line 199 at r4 (raw file):
To remedy this:
- 29 (the next available unreserved ID) is now reserved for `PublicSchemaID`.
- System.namespace mapping is changed to (ParentDatabaseID, ParentSchemaID, ObjectName) -> ObjectID
@knz I thought we discussed this in the past and concluded that the existing `system.namespace` table schema was sufficient to support physical schemas. Am I misremembering? The current `system.namespace` schema is `(ParentID, Name) -> ObjectID`. `ParentID` does not have to be a database ID, though the current lookup rules expect this. I'd have to work out the details again, but I was imagining physical schemas would be implemented by having `(DatabaseID, Name) -> SchemaID` and a second lookup from `(SchemaID, Name) -> TableID`. You would know if the second lookup is necessary because looking up `SchemaID` in `system.descriptors` would give you a `SchemaDescriptor` and not a `TableDescriptor`.
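A sketch of that two-step flow under stand-in types (`lookupNS` and `isSchemaDesc` are hypothetical): the descriptor kind of the first lookup's result decides whether a second namespace hop is needed.

```go
package resolve

type ID uint64

var (
	lookupNS     func(parent ID, name string) (ID, bool) // system.namespace read
	isSchemaDesc func(id ID) bool                        // does system.descriptors hold a SchemaDescriptor?
)

// resolveTable resolves name under dbID; if name turns out to denote a
// physical schema, table is resolved inside it with a second lookup.
func resolveTable(dbID ID, name, table string) (ID, bool) {
	id, ok := lookupNS(dbID, name)
	if !ok {
		return 0, false
	}
	if isSchemaDesc(id) {
		return lookupNS(id, table) // extra hop for physical schemas
	}
	return id, true // a table directly under the public schema
}
```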
docs/RFCS/20191009_temp_tables.md, line 297 at r4 (raw file):
A table ID that has been deleted creates a “hole” in the ID range. As temporary tables are created and deleted significantly more than regular tables, this problem will be exacerbated. Does this need to be solved? Are there any obvious downsides to having large numbers for tableIDs?
Holes in the ID space should not be a problem modulo bugs. I don't specifically expect bugs in this area, but it would be good to test what happens when there is a large hole.
One problem with large numbers of table IDs is that it puts more pressure on the "system config" range which is gossiped. This is a known deficiency that would be good to fix independently of the temporary tables. I don't have a clear sense for whether it is necessary to fix as a prerequisite to temporary tables.
Reviewed 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @knz, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 104 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I could be wrong, but when create table writes the mutable descriptor it is creating, it goes through the `WriteNewDescToBatch` call which only persists the TableDescriptor part of the MutableTableDescriptor. If we do not add the persistenceStatus to the table descriptor, when we hit the KV layer for name resolution during the FK resolution process, we will not be able to ensure there isn't a cross-reference.
Please detail this reasoning in the RFC.
docs/RFCS/20191009_temp_tables.md, line 144 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
If the feature is gated behind cluster version 20.1, SHOW TABLES should not change right?
Yes but this needs to be called out.
Also while I am thinking about this, please add a paragraph to explain how SHOW SCHEMAS and SHOW TABLES and pg_namespace and information_schema.xxx need to be modified as a result of these changes.
docs/RFCS/20191009_temp_tables.md, line 205 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Wouldn't the migration need to be done when the cluster version is bumped to 20.1 or later, but the feature itself be gated until the cluster version is bumped?
Good question. What do you think?
docs/RFCS/20191009_temp_tables.md, line 111 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Supporting additional physical schemas is its own beneficial project. Is there a plan to do that as well? Should that be a separate project that is done as a prerequisite to temporary tables?
This is a separate project #26443 and #30916. Arul is well-aware of that other project and effort is made to account for it. However it is out of scope. Generalized user-defined schemas miss two nice properties:
- the mapping of name <-> ID for pg_temp temp schemas is constant throughout a session, and thus can be cached without invalidation. User-defined schemas would need to think about leasing and caching and cache expiration/invalidation.
- temp schemas have no special privileges. User-defined schemas have customizable privileges. Privilege management would also need to think about leasing and caching, and likely a revamp of our privilege system (which is also planned independently).
The idea for now is to be mindful of #26443 and #30916 in the temp table work but not expand the scope to go "all the way" on additional physical schemas.
docs/RFCS/20191009_temp_tables.md, line 199 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
@knz I thought we discussed this in the past and concluded that the existing `system.namespace` table schema was sufficient to support physical schemas. Am I misremembering? The current `system.namespace` schema is `(ParentID, Name) -> ObjectID`. `ParentID` does not have to be a database ID, though the current lookup rules expect this. I'd have to work out the details again, but I was imagining physical schemas would be implemented by having `(DatabaseID, Name) -> SchemaID` and a second lookup from `(SchemaID, Name) -> TableID`. You would know if the second lookup is necessary because looking up `SchemaID` in `system.descriptors` would give you a `SchemaDescriptor` and not a `TableDescriptor`.
- the status quo today is to eschew SchemaDescriptors entirely (they are not needed for the functionality). This does not detract from the rest of your point though; we could have IDs in system.namespace that do not map to descriptors.
- I agree that purely functionally the current structure would be sufficient.

I think the reasoning here is to:
- avoid two lookups in system.namespace every time a name gets resolved, in the common case of lookups in the public schema.
- avoid a complete rewrite of system.namespace when the feature gets activated (we'd need to reparent every table to a newly-inserted public schema).

I would vaguely support an initial implementation that uses just the existing 3 columns, and measure the perf cost with that, to decide whether we want to go for the more complex 4-column approach. However the work will likely need help from more expert folk to iron out the migration process if we keep just 3 columns.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @knz, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 104 at r2 (raw file):
Previously, knz (kena) wrote…
Please detail this reasoning in the RFC.
Done.
docs/RFCS/20191009_temp_tables.md, line 205 at r2 (raw file):
Previously, knz (kena) wrote…
Good question. What do you think?
I think I mistyped above. I meant: migration when the node's binary version is bumped to 20.1 or later, but the feature itself gated until the cluster version is bumped.
docs/RFCS/20191009_temp_tables.md, line 111 at r4 (raw file):
Previously, knz (kena) wrote…
This is a separate project #26443 and #30916. Arul is well-aware of that other project and effort is made to account for it. However it is out of scope. Generalized user-defined schemas miss two nice properties:
- the mapping of name <-> ID for pg_temp temp schemas is constant throughout a session, and thus can be cached without invalidation. User-defined schemas would need to think about leasing and caching and cache expiration/invalidation.
- temp schemas have no special privileges. User-defined schemas have customizable privileges. Privilege management would also need to think about leasing and caching, and likely a revamp of our privilege system (which is also planned independently).
The idea for now is to be mindful of #26443 and #30916 in the temp table work but not expand the scope to go "all the way" on additional physical schemas.
I would also add that the work to go "all the way" for user defined schemas is significantly more than what temporary tables requires. Conversely, if/when user defined schemas are implemented in a different way, the work required to migrate temporary tables will not be very significant.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @knz, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 111 at r4 (raw file):
Previously, knz (kena) wrote…
This is a separate project #26443 and #30916. Arul is well-aware of that other project and effort is made to account for it. However it is out of scope. Generalized user-defined schemas miss two nice properties:
- the mapping of name <-> ID for pg_temp temp schemas is constant throughout a session, and thus can be cached without invalidation. User-defined schemas would need to think about leasing and caching and cache expiration/invalidation.
- temp schemas have no special privileges. User-defined schemas have customizable privileges. Privilege management would also need to think about leasing and caching, and likely a revamp of our privilege system (which is also planned independently).
The idea for now is to be mindful of #26443 and #30916 in the temp table work but not expand the scope to go "all the way" on additional physical schemas.
Ack. Fair response, I just wanted to make sure it was considered.
docs/RFCS/20191009_temp_tables.md, line 199 at r4 (raw file):
- avoid two lookups in system.namespace every time a name gets resolved, in the common case of lookups in the public schema.
Are the `system.namespace` lookups expensive? If they are, I think they could be sped up significantly by changing the caching of `system.namespace` to be something more than just the raw keys/values that is done now.

Adding support for physical schemas wouldn't require changing the existing support of the `public` schema. Specifically, `(DatabaseID, Name) -> TableID` could indicate that table `Name` is present in the `public` schema, while `(DatabaseID, Name) -> SchemaID` would indicate that `Name` refers to a physical schema and an additional lookup is needed for the table name. Lookups for tables in the `public` schema would remain as they are today. Lookups for tables in physical schemas would require an extra lookup in `system.namespace`.
- avoid a complete rewrite of system.namespace when the feature gets activated (we'd need to reparent every table to a newly-inserted public schema)

Isn't this what is being proposed in this RFC? Or am I misunderstanding what is involved in changing the schema of `system.namespace`?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 205 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I think I mistyped above. I meant migration when node version is bumped to 20.1 or later, but the feature itself be gated until the cluster version is bumped.
What do you need/want me to say?
docs/RFCS/20191009_temp_tables.md, line 199 at r4 (raw file):
Adding support for physical schemas wouldn't require changing the existing support of public schema [...]
It's useful to rephrase this paragraph in the context of the algorithm for name resolution.
The general structure of a name is x.y.z. The algorithm is like this:
- if all 3 are specified, look up db x, then schema y with proper db parent, then table z with proper schema parent
- if only 2 are specified, look up schema y first in current db. If that fails, look up db y and assume z is in public schema.
- if only 1 is specified, look up table z according to search_path in current db.

I understand your proposal as follows:
- if all 3 are specified, and y is "public", look up db x then table z with db x as parent. If y is another schema, look up db x then schema y in db then table z in schema.
- if only 2 are specified, and y is "public", look up table z with current db as parent. Otherwise, do as previously.
- if only 1 is specified, look up table z according to search_path in current db; with special case when finding "public" in search_path

Overall you're suggesting to convert the current 3-level lookup algorithm (table to schema, then schema to db), which currently uses 3 data structures / APIs, into just two levels, where the first level has both the db name and schema name as input arguments. This would "bind" the db and schema together for the purpose of lookups.
It feels a bit ad-hoc but since we're never going to look at more than 3 levels anyway it could be a useful simplification.
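For instance, the two-part (y.z) case of this algorithm might look like the following sketch (`lookupDB`, `lookupSchema`, and `lookupTable` are hypothetical stand-ins for the bound lookups, not CRDB functions):

```go
package nameres

type ID uint64

var (
	lookupDB     func(name string) (ID, bool)
	lookupSchema func(db ID, name string) (ID, bool)
	lookupTable  func(parent ID, name string) (ID, bool)
)

// resolveTwoPart resolves y.z: first y as a schema in the current database,
// then y as a database with z assumed to live in its public schema.
func resolveTwoPart(curDB ID, y, z string) (ID, bool) {
	if sc, ok := lookupSchema(curDB, y); ok {
		return lookupTable(sc, z)
	}
	if db, ok := lookupDB(y); ok {
		return lookupTable(db, z)
	}
	return 0, false
}
```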
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 199 at r4 (raw file):
Isn't this what is being proposed in this RFC? Or am I misunderstanding what is involved in changing the schema of system.namespace?
The RFC proposes to add a new column to system.namespace. For all the existing rows that column will default to 0, which we'll interpret as the ID of the public schema.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 199 at r4 (raw file):
Previously, knz (kena) wrote…
Isn't this what is being proposed in this RFC? Or am I misunderstanding what is involved in changing the schema of system.namespace?
The RFC proposes to add a new column to system.namespace. For all the existing rows that column will default to 0, which we'll interpret as the ID of the public schema.
I dislike making "public" special with a 2-level lookup while other schemas use a 3-level lookup (except during a transitional period). This seems like an invitation for bugs or perf problems that we never notice (because we just use the `public` schema everywhere) but users who are in the habit of using different schemas do.
(I don't have a strong opinion on whether it's better to go to a 3-level lookup for all schemas including public, or add a column to system.namespace).
docs/RFCS/20191009_temp_tables.md, line 264 at r4 (raw file):
#### Some Key Observations:
1. Temporary tables will never be accessed concurrently.
2. We do not need to pay the replication + deletion overhead for temporary table descriptors for no
There's also gossip overhead: Whenever system.namespace changes, the whole thing gets gossiped out to all the nodes. This could get very expensive, and is completely unnecessary. This may be enough reason to store temp table state (only enough metadata for GC) somewhere other than system.namespace.
docs/RFCS/20191009_temp_tables.md, line 297 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Holes in the ID space should not be a problem modulo bugs. I don't specifically expect bugs in this area, but it would be good to test what happens when there is a large hole.
One problem with large numbers of table IDs is that it puts more pressure on the "system config" range which is gossiped. This is a known deficiency that would be good to fix independently of the temporary tables. I don't have a clear sense for whether it is necessary to fix as a prerequisite to temporary tables.
Accelerating the increase of table IDs would be a (mild?) performance penalty - table IDs appear all over the place in key prefixes, so we become less efficient when table IDs need multiple encoded bytes. It might be nice to use a separate sequence with the high bit set for temp tables and reserve smaller IDs for permanent tables.
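To illustrate the size difference with a generic uvarint (CRDB's actual key encoding differs, but the effect is similar): a small table ID fits in one byte, while a high-bit temp-table ID needs the maximum width.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	buf := make([]byte, binary.MaxVarintLen64)
	fmt.Println(binary.PutUvarint(buf, 53))               // small table ID: 1 byte
	fmt.Println(binary.PutUvarint(buf, uint64(1)<<63|53)) // high-bit temp ID: 10 bytes
}
```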
Whether or not it's shared with permanent tables, a global space for temp table IDs is unfortunate. In MySQL (years ago), on-disk temp table creation needed a write lock on schema metadata (and did some fairly expensive operations while holding it), which blocked all other transactions that needed read locks on the same structure. Contention on this lock got so bad that we had to ban temporary tables at a previous job, unless they could be guaranteed to fit in memory. (this was a bigger problem for MySQL because MySQL uses temp tables internally when it does things like an on-disk sort, so this ban extended to queries that did anything like that)
MySQL stores temp tables in memory until they reach 16MB by default; it doesn't need to get expensive locks until it has to go to disk. It would be nice if we could do something similar although it's going to be harder for us with the way key prefixes are baked in everywhere. Maybe we could allocate an ID, use it in memory, and then return that ID to a freelist if it never had to spill to disk. But that still involves a lot of layer-violating magic to be able to use in-memory storage for anything.
We can at least allocate batches of temp table IDs so that workloads that make heavy use of temp tables don't have to keep contending on this key.
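A sketch of such batching (`incrementCounter` stands in for a KV increment on the shared ID-generator key): one contended write hands back a whole block of IDs that the node then serves locally, so contention drops by the batch size.

```go
package idalloc

import "sync"

// BatchAllocator hands out IDs from locally cached blocks, touching the
// shared counter only once per batch. All names here are illustrative.
type BatchAllocator struct {
	mu               sync.Mutex
	next, end        uint64
	batchSize        uint64
	incrementCounter func(delta uint64) (newValue uint64) // e.g. a KV Increment
}

// Next returns the next ID, fetching a fresh block when the current one is
// exhausted.
func (a *BatchAllocator) Next() uint64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.next == a.end {
		a.end = a.incrementCounter(a.batchSize)
		a.next = a.end - a.batchSize
	}
	id := a.next
	a.next++
	return id
}
```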
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 264 at r4 (raw file):
Whenever system.namespace changes, the whole thing gets gossiped out to all the nodes. This could get very expensive
Oh, good catch.
@arulajmani: this may be the one argument that will push you to say
- store the schema IDs in a separate table from system.namespace for the purpose of collecting the info for GC, to avoid the gossip update overhead
- keep the mapping of temp schema ID -> table IDs in RAM, just on the gateway node (the distsql execution doesn't need to map table IDs back to names), and have name resolution access this data structure before/separately from system.namespace
- temp schemas do not get stored in system.namespace, and we side-step that discussion entirely
If you were to go in this direction then the TT work would become much more disconnected from #30916 than you initially anticipated. Perhaps that's all right.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 205 at r2 (raw file):
Previously, knz (kena) wrote…
What do you need/want me to say?
You're correct that there are 4 phases:
- all nodes running at binary version X, cluster logical version X
- some nodes running at binver X, some others at Y, clusver X
- all nodes running at binver Y, clusver X
- all nodes running at binver Y, clusver Y
Your section on "how to introduce this feature in existing clusters" (which you still need to add at some point) would need to detail what happens at steps 2, 3, 4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 264 at r4 (raw file):
Previously, knz (kena) wrote…
Whenever system.namespace changes, the whole thing gets gossiped out to all the nodes. This could get very expensive
Oh, good catch.
@arulajmani: this may be the one argument that will push you to say
- store the schema IDs in a separate table from system.namespace for the purpose of collecting the info for GC, to avoid the gossip update overhead
- keep the mapping of temp schema ID -> table IDs in RAM, just on the gateway node (the distsql execution doesn't need to map table IDs back to names), and have name resolution access this data structure before/separately from system.namespace
- temp schemas do not get stored in system.namespace, and we side-step that discussion entirely
If you were to go in this direction then the TT work would become much more disconnected from #30916 than you initially anticipated. Perhaps that's all right.
OTOH - just discussed with @bdarnell offline - it may also be that gossiping system.namespace is not necessary any more (like it was when CockroachDB was first implemented).
If that is the case (we need to investigate further) maybe your original proposal remains valid.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @bdarnell, @knz, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 199 at r4 (raw file):
The RFC proposes to add a new column to system.namespace. For all the existing rows that column will default to 0, which we'll interpret as the ID of the public schema.
`system.namespace` is special in how it is read, though. At least it used to be. I'm not sure if that has changed, but there used to be special decoding of its keys and values.
I dislike making "public" special with a 2-level lookup while other schemas use a 3-level lookup (except during a transitional period). This seems like an invitation for bugs or perf problems that we never notice (because we just use the public schema everywhere) but users who are in the habit of using different schemas do.
Ack. I also dislike that. It is probably best if everything becomes a 3-level lookup. My main point was that sticking with the current schema gives us a bit more flexibility in how this transition is done.
(I don't have a strong opinion on whether it's better to go to a 3-level lookup for all schemas including public, or add a column to system.namespace).
I chatted with @knz about this in person and the trade-off is about the difficulty of performing a schema change on `system.namespace`. Doing that scares me a bit, though that might just be FUD. Definitely worth exploring the actual complexity vs listening to my fears.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @bdarnell, @knz, @petermattis, @RaduBerinde, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 44 at r4 (raw file):
if/when CRDB supports user-defined schemas
perhaps throw in a link: #26443
docs/RFCS/20191009_temp_tables.md, line 63 at r4 (raw file):
- `CREATE TEMP TABLE t(x INT)` is equivalent to `CREATE TEMP TABLE pg_temp_1231231312.t(x INT)` and also `CREATE TABLE pg_temp_1231231312.t(x INT)` and also `CREATE TABLE pg_temp.t(x INT)`
what happens if you try to create some other schema that starts with `pg_temp` manually? I guess we don't need to worry about it since we don't support user-defined schemas right now, but it looks like that would be an error in Postgres.
docs/RFCS/20191009_temp_tables.md, line 217 at r4 (raw file):
- For every DatabaseID that exists in the system.namespace table, a `public` schema is added by adding an entry (DatabaseID, 0, public) -> 29.
- For all existing tables, the schemaID field is prefilled with 29 to scope them under `public`.
Shouldn't some of the tables belong to other schemas, i.e., crdb_internal, information_schema, or pg_catalog?
(force-pushed from 2da199d to 2a6ec46)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @bdarnell, @knz, @petermattis, @RaduBerinde, @rafiss, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 144 at r2 (raw file):
Previously, knz (kena) wrote…
Yes but this needs to be called out.
Also while I am thinking about this, please add a paragraph to explain how SHOW SCHEMAS and SHOW TABLES and pg_namespace and information_schema.xxx need to be modified as a result of these changes.
Added a section about metadata query semantics and how they should be implemented.
docs/RFCS/20191009_temp_tables.md, line 44 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
if/when CRDB supports user-defined schemas
perhaps throw in a link: #26443
Done.
docs/RFCS/20191009_temp_tables.md, line 63 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
- `CREATE TEMP TABLE t(x INT)` is equivalent to `CREATE TEMP TABLE pg_temp_1231231312.t(x INT)` and also `CREATE TABLE pg_temp_1231231312.t(x INT)` and also `CREATE TABLE pg_temp.t(x INT)`
what happens if you try to create some other schema that starts with `pg_temp` manually? I guess we don't need to worry about it since we don't support user-defined schemas right now, but it looks like that would be an error in Postgres.
Postgres does not allow creation of user-defined schemas that start with "pg_", so we shouldn't run into any problems with compatibility.
docs/RFCS/20191009_temp_tables.md, line 199 at r4 (raw file):
system.namespace is special in how it is read, though. At least it used to be. I'm not sure if that has changed, but there used to be special decoding of its keys and values.
It seems like the key is constructed exactly how encoding.md describes key construction and the value is put using the regular way. I'm not sure if this means that we no longer read `system.namespace` in a special way, but I think adding another primary key column should be doable.
difficulty of performing a schema change on system.namespace
Talked to @knz offline and came up with an approach that could solve this issue + @bdarnell's concern about `system.namespace` being gossiped + @knz's comments about the migration.
Considering we need to preserve the older `system.namespace` table for multi-version clusters, we create a second, empty `system.namespace` table whose ID is not in the gossip range. This new table will be filled from the older `system.namespace` (by adding the `PublicSchemaID` = 29) only when the cluster version is bumped to 20.1.
All name resolutions will follow two possible code paths -- the new `system.namespace` will be tried first, to see if an object name can be resolved. If that doesn't work, the old `system.namespace` will be tried. Eventually, we can get rid of the old `system.namespace`.
This relies on `system.namespace` not being gossiped. I'm still investigating this, but the initial signs look promising.
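A minimal sketch of that dual-path resolution, with hypothetical helpers: the new namespace table (with a schema column) is consulted first, falling back to the old one, which only ever held public-schema objects.

```go
package resolve

import "context"

const publicSchemaID = 29

var (
	lookupNewNamespace func(ctx context.Context, db, schema uint32, name string) (uint64, bool)
	lookupOldNamespace func(ctx context.Context, db uint32, name string) (uint64, bool)
)

// lookupObjectID tries the new namespace table first, then falls back to the
// deprecated one; only public-schema objects can exist there.
func lookupObjectID(ctx context.Context, db, schema uint32, name string) (uint64, bool) {
	if id, ok := lookupNewNamespace(ctx, db, schema, name); ok {
		return id, true
	}
	if schema == publicSchemaID {
		return lookupOldNamespace(ctx, db, name)
	}
	return 0, false
}
```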
docs/RFCS/20191009_temp_tables.md, line 217 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
Shouldn't some of the tables belong to other schemas, i.e., crdb_internal, information_schema, or pg_catalog?
All the other schemas are "virtual" schemas, so they aren't really stored in the system.namespace table. The virtual tables do not have physical table descriptors, so they aren't stored in system.namespace.
docs/RFCS/20191009_temp_tables.md, line 264 at r4 (raw file):
Previously, knz (kena) wrote…
OTOH - just discussed with @bdarnell offline - it may also be that gossiping system.namespace is not necessary any more (like it was when CockroachDB was first implemented).
If that is the case (we need to investigate further) maybe your original proposal remains valid.
Investigating this, initial signs look positive :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @bdarnell, @knz, @petermattis, @RaduBerinde, @rafiss, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 199 at r4 (raw file):
This relies on system.namespace not being gossiped. I'm still investigating this, but the initial signs look promising.
This plan sounds reasonable to me, though it needs to be validated against the true authority (the code).
One thing the `(DatabaseID, SchemaID, TableName)` approach loses out on is that we can't rename schemas from one database to another without performing O(n) operations. I suspect schema renames across databases isn't allowed by Postgres. Can you double check that? It is worth calling this limitation out in this RFC.
O(n) with a low constant though, it's only an operation on system.namespace.
Correct. But more to the point, it's not a thing that we know clients often do. So even if it was possible it would be fine to offer O(n).
From my understanding, the in-memory version can (and should) be built on top of the initial implementation proposed in the RFC
I see it all the other way around. It seems to me that the implementation of in-memory temp tables wouldn't have much in common with the implementation of on-disk temp tables. It further seems to me that the implementation of in-mem (nay, node-local - they can still be on disk and be gigabytes in size) is both much easier and much more useful (in fact I think you always want node-local and ~never distributed temp tables).
What exactly is this compatibility that keeps being referenced? The text says we won't support transaction-scoped TTs. Where's the precious compatibility there? With more ephemeral storage, we could support that, for example.
Using normal tables allows these tables to be big. But I think the only way in which these temp tables could actually be big is when created with `create table as`, which uses a bulk row creation mechanism. All other writes have to go through one session, so you can't actually get such a table to be too big. So, is `create table TT as ...` really that important, when used in conjunction with large data sets? Does anyone really want to use temp tables this way? I feel like if I'm gonna be using a lot of space, I wouldn't do it through a temp table... (and if it is a temp table, I definitely wouldn't want replication for it). Is it important enough to let it wag the dog (as I see it)? Apart from this, I don't see what other limitations node-local tables would have. There was a mention in the text of some interaction with sequences that I don't understand very well, but I'm guessing it could be solved.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @bdarnell, @nvanbenschoten, @petermattis, @RaduBerinde, @rafiss, @solongordon, and @yuzefovich)
I agree with @knz's sentiment that we ought to aim first for compatibility and second for performance. I think we should satisfy all usages of temp tables, even if they're slow, and optimize later as needed. This avoids premature optimization - if nobody needed fast temporary tables, we could stop after the initial implementation. On the flip side, if we built the fast but limited version first, and someone needed the on-disk implementation, they'd be out of luck until we did a large amount of extra work.

I believe a typical use case for an on-disk temporary table is producing a large amount of data that you'll use again later in a data pipeline. I've personally never encountered one that needs to store data in memory for performance, so it appears that there's at least some amount of diversity to use cases here.

I'd also like to suggest that the temporary tables could be immediately zone-configured to be RF=1 and pinned to the leaseholder. That would accomplish node-local storage without having to jump through many new implementation hoops.
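For concreteness, one way to express that zone-config idea from the creating session, assuming CockroachDB's `CONFIGURE ZONE` syntax (the table name and connection string are illustrative; pinning to the leaseholder's locality would additionally need a `constraints` clause):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver commonly used with CRDB
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Drop the replication factor to 1 right after creating the temp table.
	if _, err := db.Exec(
		`ALTER TABLE pg_temp_1231231312.t CONFIGURE ZONE USING num_replicas = 1`,
	); err != nil {
		log.Fatal(err)
	}
}
```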
And there's another aspect of compatibility - having our TTs be usable with the same frequency as they are in Postgres. With implications on, for example, gossip and range splits, would that be true for our tables? It seems to me it'd be too easy to destabilize a cluster with them.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @bdarnell, @nvanbenschoten, @petermattis, @RaduBerinde, @rafiss, @solongordon, and @yuzefovich)
Dude you should really spend some time studying Postgres, there are really wonders in there. Among other things:
So separating the implementation would require duplicating:
In comparison, the solution I've outlined above is doing this at a separate layer: we create a regular in-memory store with a special attribute, then we create TTs as regular tables but with a zone config that pins them to that store with replication factor 1. Everything else in SQL remains the same. What's not to like?
Also, Andrei, the point of disabling gossip updates for certain DDL is a valid optimization; this was discussed before and is independent of temp tables.
Andrei and I have been discussing this offline.
I think it's fair to say the RFC should report on that part of the discussion of alternatives, since the discussion has happened in the past. The various comments from today should provide you some pro/con arguments to fill in that section.
Andrei also reminded me that there are a number of questions that remain open on the RFC and these will need some attention too.
My recommendation would be to have a recap session with @solongordon when he's back, to list all the open topics of discussion, sort them by order of priority, group related topics together, then attempt to "close the open loops".
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @bdarnell, @nvanbenschoten, @petermattis, @RaduBerinde, @rafiss, @solongordon, and @yuzefovich)
Another detail: it's not just the gossiping on creation of these tables that's problematic, I think. It's also simply writing to `system.descriptor` that I think is unfortunate. By being in the "system config range", this table is limited to a single range and bundled with a bunch of other tables. So when it hits 64MB (or maybe 128MB), everything stops working because writes start being refused until GC. The TTL on this table is the default 25h (90k seconds), and this is one table that I don't think we can reduce the TTL on (cause AS OF SYSTEM TIME queries on any other table need to read historical descriptors). So, assuming that a descriptor has... 300 bytes... we get 64000000/300/90000 = 2.3 temp table creations per second (and half for any index added after the fact).
Perhaps if we stop gossiping all updates to this table, we can also take it out of the system config range, but that's probably another involved migration.
If `system.descriptor` would be allowed to split, then the next problem with it is that it's not partitioned - so writes to it pay inter-region latencies.
Reviewed 1 of 1 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @bdarnell, @nvanbenschoten, @petermattis, @RaduBerinde, @rafiss, @solongordon, and @yuzefovich)
This RFC proposes to add support for temporary tables.

Release note: None
(force-pushed from ada1d78 to 64790f9)
Updated the RFC to capture the node-local approach that was discussed earlier today as an alternative. Also added @knz's followup suggestion as an alternative that can be built after the v1 approach described by the RFC.
Closed some of the other open loops as well. Will get back to other open discussions in the next few days, and try to get them resolved. :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @arulajmani, @bdarnell, @knz, @nvanbenschoten, @petermattis, @RaduBerinde, @rafiss, @solongordon, and @yuzefovich)
docs/RFCS/20191009_temp_tables.md, line 112 at r6 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
The MySQL semantics are bad
Or, the MySQL semantics are evidence that there is little demand for (session-scoped) introspection on temporary tables.
I think there's actually more of a need for a global view of temporary tables than a session-scoped one. Temporary tables take up space on disk and admins need to be able to see how much data is consumed by temporary tables (and which users/sessions own them?). Maybe that's something other than pg_catalog/information_schema, but it's worth considering in the design of our introspection semantics here.
This makes sense, I hadn't accounted for the need of having a global view for temp tables -- I was only considering session scoped introspection from the point of view of ORMs.
I think it makes sense to tweak the PG semantics to be more consistent (information_schema.tables includes other sessions' temporary tables) and provide a global view of TTs. I'm updating the RFC to reflect this.
docs/RFCS/20191009_temp_tables.md, line 285 at r7 (raw file):
Previously, knz (kena) wrote…
Can you explain this magic ID 29 a little bit better. Why do you share the same public schema ID betwen all dbs? This needs to be explained.
Also if you do so, you'll need to combine the schema ID with the db ID to generate the namespace OID in pg_catalog. I don't recall if your PR does this already? In any case this also needs to be explained.
The introspection semantics aren't complete yet. They'll be addressed in a subsequent PR, and I'll add the namespace OID generation to PG catalog there (as part of saving the schemaID in the table descriptor).
docs/RFCS/20191009_temp_tables.md, line 310 at r7 (raw file):
Previously, knz (kena) wrote…
There's something in-between: a node has observed the version to be 19.2, then while the version is being updated to 20.1 it proceeds to perform a namespace update. The result is a possible write to the old system.namespace concurrent with a version bump, and it's possible for it to occur just after the version bump.
I think your algorithm below recovers from this, but I am not 100% sure how. Can you make a section with a step-by-step explanation of this particular case?
If I'm understanding this case correctly, this sounds like a special case of the window I describe above -- an object is written to the old system.namespace after all the entries have been copied over to the new system.namespace, and the node is unaware such an entry exists.
I'll expand a bit more on the copy over + the window case below, let me know if that/this case requires more detailing.
docs/RFCS/20191009_temp_tables.md, line 343 at r7 (raw file):
Previously, knz (kena) wrote…
The title of this sub-sub-section is called "second copy over". This suggests there is a process that kicks in to "finish up" the work of copying the entries. Where is this process described?
Currently specified as a 20.2 migration. This can definitely be optimized by doing a 2 step cluster bump/having post version bump migrations. Both of these methods will require more investigation, which I can explore if identified as a priority.
@andreimatei there's been discussion to un-gossip these tables. That said, the problem exists also for regular (non-TT) workloads that create/drop many tables, and we've started to notice that with customer deployments. Therefore, this concern is really orthogonal to this RFC.
@andreimatei, @ajwerner, and I spent some time talking about this over lunch. Those two had spent the morning brainstorming this with @arulajmani. Here's a new proposal that we'd like to get @knz's and @bdarnell's opinion on. To start, let's talk about the properties we need and those we don't need out of these temporary tables. We need:
We don't need:
These requirements make it clear why the node-local approach is appealing. Keeping the data local immediately gives us data/resource isolation and easy lifetime management of these temp tables. However, I don't think it's going far enough. The obvious question that falls out of this is why we need to involve the KV layer in this at all. Other than the single requirement about a "mutable data sink", we have something that fits these requirements a lot closer already: virtual tables.

Virtual tables are treated by the SQL layer as an alternative data source. The SQL layer knows how to properly plan and execute these data sources as an alternative to accessing KV data. They are materialized on-demand, but this isn't a hard requirement. They could just as easily be written to some queryable node-local storage engine. I think we should consider pushing harder on our virtual table abstraction. Getting the interface right would be a bit of work, but it has a lot of benefits.

First off, it gives us everything we need from the list above while avoiding unnecessary (and possibly destabilizing) work like replication and distribution. It also avoids polluting the global descriptor namespace and all of the fallout (i.e. gossip) that would come from this. It avoids the issues that would come from incorrectly localizing the temp tables and the complexities of leasing and data cleanup on gateway failure. Finally, it provides exactly the right resource isolation model. One can imagine moving to a gateway-only process-per-session model and wrapping each process in a cgroup. Keeping temp-table data local would allow all the usual isolation primitives to apply without any extra work.

Furthermore, there is plenty of precedent for this. PostgreSQL has built its entire SQL layer around pluggable data sources, which is essentially what we're proposing here. This indicates that any work we do to make the abstraction around data sources more concrete would be beneficial outside of just temp tables. We're also going to have to make virtual tables mutable eventually anyway for PG compat, so again, this work would not be wasted.
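As a rough sketch of the interface being proposed here (every name is invented; this is not an existing CockroachDB API):

```go
package datasource

import "context"

// Datum is a placeholder for a SQL value; Row is one row of a table.
type Datum interface{}
type Row []Datum

// RowIterator streams rows out of a source.
type RowIterator interface {
	Next(ctx context.Context) (Row, bool, error)
	Close()
}

// DataSource is the abstraction sketched in the comment above: KV-backed
// tables, virtual tables, and node-local temp tables would all implement it,
// and the SQL layer would plan against it rather than against KV directly.
type DataSource interface {
	// Scan returns an iterator over the source's rows.
	Scan(ctx context.Context) (RowIterator, error)
	// Insert makes the source a "mutable data sink", the one requirement
	// today's read-only virtual tables do not meet.
	Insert(ctx context.Context, rows []Row) error
}
```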
I'll think about this more but I think you forgot two points in your list of requirements:
- TTs absolutely need txn atomicity, insofar as mutations on TTs are subject to txn rollbacks. That includes savepoint rollbacks and thus seqnum ranges.
- TTs need to support multiple indexes and index selection
- open question about AS OF SYSTEM TIME (ASOT) in combination with TTs
I like some of these suggestions a lot, but if you're going to allow schema changes on TTs, assumptions about who reads and writes to them and where may need to be re-examined. (In either case, we should probably explicitly talk about schema changes a bit more in the RFC.)
OK so I've thought about this a bit more. I like your proposal a lot, and it would bear two different projects as fruit.
But to start, I really think this is a separate feature set from what Arul was tasked to solve (more on this below). From a process perspective I think this discussion should be pulled into a separate issue.
Some more thoughts about the two projects born out of your thought process.
One is the data source abstraction. You are right that it would do us well to peek at pg's foreign data wrapper interface.
FYI this interface is not just data access: it can also define index access methods (so as to provide various indexing implementations) as well as a transactional API, to ensure ACID guarantees are preserved for txns that span multiple FDWs.
I can see several ways this would help us, not least by facilitating row-oriented and column-oriented data side by side. We need that.
I do also think however that this notion of data source is not directly required for TTs; in fact, it is orthogonal.
The second project born from your thoughts is to have multiple ways to provision read-only db objects within a txn. This would enable us to have our current virtual tables and read-only global tables under a common abstraction. That is good too. I think however that this will be restricted to read-only objects that already exist prior to the txn, hence again unrelated to TTs. The reason I think this is that any attempt to bring mutability to that abstraction will make it as complex as the FDW interface discussed above and inadequate for global read-only tables.
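For illustration, that richer FDW-style surface could extend the DataSource sketch above along those two axes, index access methods and a transactional API; again, all names are invented, and Row/RowIterator are as defined in the earlier sketch (same hypothetical package):

```go
package datasource

import "context"

// Span is a placeholder key range for the sketch.
type Span struct{ Start, End []byte }

// TransactionalDataSource adds the two FDW-like capabilities discussed here:
// pluggable index access (enabling index selection) and transaction hooks
// (so mutations on a source commit or roll back with the enclosing SQL txn).
type TransactionalDataSource interface {
	DataSource

	// ScanIndex exposes an alternative index access method over key spans.
	ScanIndex(ctx context.Context, index string, spans []Span) (RowIterator, error)

	// Transaction hooks, driven by the SQL executor.
	Begin(ctx context.Context) error
	Commit(ctx context.Context) error
	Rollback(ctx context.Context) error
}
```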
Finally, coming back to this RFC and the idea of a separate project. The RFC as it stands proposes a feature (together with an implementation) that satisfies the roadmap item: provide pg-compatible TTs that make the ORMs happy.
The direction you outline in your comment, and that drove your discussion at lunch today, is born from a separate user story unrelated to making ORMs happy, and is instead about creating some new SQL feature and possibly business value (= innovation). The various restrictions that come with that hypothetical feature would need an analysis of how ready and able users are to adopt this new thing in their apps. I think we should definitely do that analysis, but again it takes us further away from our 20.1 roadmap.
Also I think the new kind of table needs a name that is precise and descriptive. "Temp tables" does not work to describe what you propose.
We can call them "private side data tables" residing in a "side data store". "Side" to emphasize they are not subject to transactional semantics and that effects are not rolled back with txns.
I think there might have been a misunderstanding here - in Nathan's proposal these temp tables would still have transactional semantics: you can roll back a transaction, you can have savepoints. This would be achieved somehow by taking a RocksDB-level savepoint at the beginning of a txn. Or something more general - a single txn can be all in-memory, so we can do whatever we want.
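A minimal sketch of that in-memory bookkeeping, using an undo log over a plain map; a real implementation would live at the storage layer, and the names here are invented:

```go
package temptable

// undoRecord remembers a key's prior state so a rollback can restore it.
type undoRecord struct {
	key, prev string
	existed   bool
}

// store sketches savepoint bookkeeping for a node-local temp table: data
// lives in a map, and the undo log's length doubles as a savepoint token,
// which also gives nested savepoints for free.
type store struct {
	data map[string]string
	undo []undoRecord
}

func newStore() *store { return &store{data: map[string]string{}} }

func (s *store) Put(k, v string) {
	prev, ok := s.data[k]
	s.undo = append(s.undo, undoRecord{k, prev, ok})
	s.data[k] = v
}

// Savepoint captures the current undo-log position.
func (s *store) Savepoint() int { return len(s.undo) }

// RollbackTo replays the undo log backwards to the savepoint, e.g.:
//
//	sp := s.Savepoint()
//	s.Put("k", "v2")
//	s.RollbackTo(sp) // "k" is back to its pre-savepoint value
func (s *store) RollbackTo(sp int) {
	for i := len(s.undo) - 1; i >= sp; i-- {
		r := s.undo[i]
		if r.existed {
			s.data[r.key] = r.prev
		} else {
			delete(s.data, r.key)
		}
	}
	s.undo = s.undo[:sp]
}
```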
The intention here is to provide temp tables that don't have too many restrictions, beyond the storage restriction to one node. Don't know about ASOT.
Can we clarify more about the use cases? Like, isn't the fact that one probably can't create more than one or two of these tables under the current proposal completely breaking compatibility? Don't we expect people to use this feature frequently? Side note: you've brought up index selection several times. I was trawling the Postgres docs and they say that statistics on temp tables are not gathered automatically; you have to run `ANALYZE` manually.
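For reference, the manual-statistics workflow against vanilla Postgres looks roughly like this; the table, data, and connection string are made up, and the pinned connection matters because temp tables are per-session:

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres",
		"postgresql://localhost:5432/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Temp tables are session-scoped, so pin a single connection:
	// database/sql otherwise hands statements to arbitrary pooled conns.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	stmts := []string{
		`CREATE TEMP TABLE scratch (id INT PRIMARY KEY, v TEXT)`,
		`INSERT INTO scratch SELECT i, i::TEXT FROM generate_series(1, 10000) AS g(i)`,
		// Autovacuum cannot reach temp tables, so gather statistics
		// manually before counting on good index selection.
		`ANALYZE scratch`,
	}
	for _, s := range stmts {
		if _, err := conn.ExecContext(ctx, s); err != nil {
			log.Fatal(err)
		}
	}
}
```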
Andrei, you're right that a discussion of use cases should be repeated here. In fact we can ask @awoods187 to share with us what is already mentioned in the airtable entry, to frame this discussion. But beyond that: different sets of requirements drive different strategic choices.
I like Nathan's line of thinking because it bears multiple interesting and innovative projects as fruit, even if we can't found TTs on top of that. Foremost, I like it because Nathan makes a constructive proposal with both a set of requirements and an outline of an implementation strategy. The other line of thinking ("let's make TTs that satisfy ORM requirements but without the various costs proposed") does not yet have a satisfying list of requirements combined with a realistic outline of the work to be done. I still doubt it yields a better approach than taking Arul's work and then improving it in various ways: 1) towards plan D in the RFC, then 2) by un-gossiping descriptors, which has been discussed before.
I wasn't proposing eliminating rollbacks and savepoints. That can all be pretty trivially implemented with some in-memory bookkeeping. That's what I was (admittedly vaguely) getting at with "we'd probably have to create an abstraction around SQL's notion of a We could have a
let's merge this as is
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @arulajmani, @bdarnell, @knz, @nvanbenschoten, @petermattis, @RaduBerinde, @rafiss, @solongordon, and @yuzefovich)
bors r+
Build failed
bors r+
41482: rfc: temporary tables support r=solongordon a=arulajmani

This RFC proposes to add support for temporary tables.

RFC text: https://github.com/arulajmani/cockroach/blob/temp-tables-rfc/docs/RFCS/20191009_temp_tables.md

Release note: None

41989: rfcs: resurrect RFC for altering primary key r=solongordon a=solongordon

A year-and-a-half ago, David Taylor wrote an RFC for how to support primary key changes, but that work was deprioritized while the RFC was still a draft. Now we are ready to perform that work, so Rohan and I are bringing that RFC back from the dead. The core approach remains the same, but we've filled in a few TODOs and added more implementation details, syntax examples, and follow-up work ideas. The original RFC got a few comments before it was shelved (see #25208), but we intend to go through a more thorough review process this time around.

Release note: None

43655: protectedts/ptcache: implement the Cache to watch protectedts state r=nvanbenschoten a=ajwerner

The Cache periodically polls the protectedts state. The Cache polls the meta row from each node at a period defined by a cluster setting. Clients can toggle a manual refresh.

Release note: None.

Co-authored-by: Arul Ajmani <arula@cockroachlabs.com>
Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Build succeeded. And happy new year from bors! 🎉