Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: User-defined SQL schemas #30916

Merged
merged 1 commit into from
May 1, 2020
Merged

Conversation

knz
Copy link
Contributor

@knz knz commented Oct 3, 2018

Informs #30916.
Release note: None

@knz knz requested a review from a team as a code owner October 3, 2018 12:28
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A very comprehensive RFC!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

(I like the idea to avoid descriptors because that would ask complex
questions about leases and the allocation of ranges.)

Can you elaborate on this concern? I was imagining that stored schemas would be implemented with some sort of SchemaDescriptor and would be present in the system.namespace table very much like you describe in this section. How would any of these alternatives affect the allocation of ranges (or range IDs which you allude to elsewhere)? Are the leases you're concerned about table leases? We don't currently have leases on databases (unless something changed that I'm not aware of).

system.namespace was initially designed to support arbitrary hierarchy similar to a filesystem, but we limited the initial lookup support to only 2 levels (database, table). I think you might be aware of this.

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can you elaborate on this concern? I was imagining that stored schemas would be implemented with some sort of SchemaDescriptor and would be present in the system.namespace table very much like you describe in this section. How would any of these alternatives affect the allocation of ranges (or range IDs which you allude to elsewhere)? Are the leases you're concerned about table leases? We don't currently have leases on databases (unless something changed that I'm not aware of).

system.namespace was initially designed to support arbitrary hierarchy similar to a filesystem, but we limited the initial lookup support to only 2 levels (database, table). I think you might be aware of this.

  • about ranges: currently every descriptor ID, be it db or table, causes a range to be allocated.

  • about leases: the fact there are no leases on dbs currently is a bug and we are lucky our users do not notice because traffic on db descriptors is low.

    In contrast once schemas are available there will be many more create/drop traffic on schemas, also inside transactions, and users will be much more sensitive to the lack of proper transactional semantics on those. I don't want to be casual with them as we've been with databases.

  • about system.namespace and hierarchy. We're making a SQL database with at most 2-3 levels. The full generality of system.namespace was premature design IMHO. Also even if we wanted to implement arbitrary deep objects (say, if/when we want a filesystem) this schema is poor because it requires recursive queries to normalize/resolve paths. There are better approaches to storing filesystem paths, for example that used in ReiserFS or ZFS, and if/when we want to do that we should get inspiration from there.

Until then I would like to avoid considering this theoretical situation in this discussion, and instead capitalize/utilize the fact we need at most 2-3 levels to optimize/simplify things.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

about ranges: currently every descriptor ID, be it db or table, causes a range to be allocated.

Right, but I don't think the addition of schemas will cause this problem to get much worse. We should fix this for databases.

In contrast once schemas are available there will be many more create/drop traffic on schemas, also inside transactions, and users will be much more sensitive to the lack of proper transactional semantics on those. I don't want to be casual with them as we've been with databases.

Hmm, ok. Can you elaborate on these concerns a bit more? I'm not up to speed on what problems the lack of database schemas is causing. Also, cc @vivekmenezes who has been thinking about "schema leases" for 2.2. I'm not quite sure what level of "schema" he is thinking of.

Until then I would like to avoid considering this theoretical situation in this discussion, and instead capitalize/utilize the fact we need at most 2-3 levels to optimize/simplify things.

Ack. I didn't mean to suggest that we should support more levels than database / schema / table.

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

Right, but I don't think the addition of schemas will cause this problem to get much worse. We should fix this for databases.

"Much" perhaps not, but worse definitely. In test frameworks, schemas get created at a similar rate as tables. So we'd double the traffic on identifiers. This alone actually would make alternative C below more attractive.

Hmm, ok. Can you elaborate on these concerns a bit more? I'm not up to speed on what problems the lack of database schemas is causing.

I'll look into it, I can't recall the specific problematic SQL patterns out of my head. However perhaps @vivekmenezes knows them by heart?

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

"Much" perhaps not, but worse definitely. In test frameworks, schemas get created at a similar rate as tables. So we'd double the traffic on identifiers. This alone actually would make alternative C below more attractive.

I'm going to file an issue about this. I wouldn't want these spurious empty ranges to be a reason to prefer one of these alternatives over another. I believe it shouldn't be too difficult to fix. For the purposes of this RFC, I think it should be a minor consideration (though it is great that you pointed it out).

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

I'm going to file an issue about this.

Note that the issue here is independent of SchemaDescriptors. Simply allocating IDs from the same namespace as table IDs and database IDs (as is being suggested in this alternative) will cause the problem.

Copy link
Contributor

@vilterp vilterp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm going to file an issue about this.

Note that the issue here is independent of SchemaDescriptors. Simply allocating IDs from the same namespace as table IDs and database IDs (as is being suggested in this alternative) will cause the problem.

issue about databases having empty ranges: #24847

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 122 at r1 (raw file):

- using a simple 1-part name, e.g. `users`, both the "current database"
  (`SET database` / `USER`) and the *current schema* (`SET

s/USER/USE/


docs/RFCS/20181003_schemas.md, line 224 at r1 (raw file):

  inherit directly from database to table even with schemas defined.

  The ability to customize zones per schema would be a feature that

I think this is worth doing at the same time that we introduce user-defined schemas.


docs/RFCS/20181003_schemas.md, line 362 at r1 (raw file):

latency.

### Alternative B: schemas with descriptors

I prefer option B.

I dislike option A because it creates such a difference between databases (which have descriptors) and schemas (which have some other to-be-determined place to store the things that are currently in the DB descriptor). Option A would be more appealing to me if we planned to move away from database descriptors too and make them use the same system as for schemas.

Option C is acceptable, but it feels hackier to me than option A or B. I think the concerns about database descriptor leases are more severe for option C (for option B, operations on schemas would not be substantially more common or more complex than operations on databases are today, so it doesn't make things much worse, but in option C the kinds of changes we'd make to database descriptors become more complex). Vivek's schema leases should take care of this no matter which alternative we choose here, though.

@knz
Copy link
Contributor Author

knz commented Oct 5, 2018

operations on schemas would not be substantially more common or more complex than operations on databases are today,

that's not quite corrects -- we've found that test frameworks create/destroy schemas at a similar rate as tables, not databases.

Copy link
Contributor

@vivekmenezes vivekmenezes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 31 at r1 (raw file):

A. compatibility with PostgreSQL -- ORMs and client apps expect this to
   work.
B. simplify the management of multi-user or multi-app clusters.

I'm not convinced that B is important at this stage. Is A a priority?


docs/RFCS/20181003_schemas.md, line 356 at r1 (raw file):

(for example, privileges and grants for creating new tables) would be
stored in a separate system table. (to be defined by further experimentation)

i think privileges should be kept out of descriptors. See #2939

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 31 at r1 (raw file):

Previously, vivekmenezes wrote…

I'm not convinced that B is important at this stage. Is A a priority?

Robert has pressed for A, it is found to be a blocker in the sales pipeline

Lakshmi has pressed for B, it is found to be demand from managed users

The motivation for this work (and the fact it was put on the 2.2 roadmap) comes from them. We can ask Andy Woods for clarification.


docs/RFCS/20181003_schemas.md, line 356 at r1 (raw file):

Previously, vivekmenezes wrote…

i think privileges should be kept out of descriptors. See #2939

I tend to agree -- although that seems to be an orthogonal concern to this RFC. As long as table/db descs are storing descriptors, any new schema descriptors can do the same; only when we actively work to remove privs from db/table descs, should we remove them from schema descs too. The design for the two things ought to be consistent.

I will reword this to express this more clearly.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

Previously, vilterp (Pete Vilter) wrote…

issue about databases having empty ranges: #24847

FYI, databases and views will no longer create empty ranges. So the "allocation of ranges" part here is now FUD. 😃

Copy link
Contributor

@vilterp vilterp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

FYI, databases and views will no longer create empty ranges. So the "allocation of ranges" part here is now FUD. 😃

Saw that. Nice work @ridwanmsharif!

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


docs/RFCS/20181003_schemas.md, line 122 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/USER/USE/

Done.


docs/RFCS/20181003_schemas.md, line 224 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think this is worth doing at the same time that we introduce user-defined schemas.

Done.


docs/RFCS/20181003_schemas.md, line 280 at r1 (raw file):

Previously, vilterp (Pete Vilter) wrote…

Saw that. Nice work @ridwanmsharif!

Done.


docs/RFCS/20181003_schemas.md, line 356 at r1 (raw file):

Previously, knz (kena) wrote…

I tend to agree -- although that seems to be an orthogonal concern to this RFC. As long as table/db descs are storing descriptors, any new schema descriptors can do the same; only when we actively work to remove privs from db/table descs, should we remove them from schema descs too. The design for the two things ought to be consistent.

I will reword this to express this more clearly.

Done.


docs/RFCS/20181003_schemas.md, line 362 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I prefer option B.

I dislike option A because it creates such a difference between databases (which have descriptors) and schemas (which have some other to-be-determined place to store the things that are currently in the DB descriptor). Option A would be more appealing to me if we planned to move away from database descriptors too and make them use the same system as for schemas.

Option C is acceptable, but it feels hackier to me than option A or B. I think the concerns about database descriptor leases are more severe for option C (for option B, operations on schemas would not be substantially more common or more complex than operations on databases are today, so it doesn't make things much worse, but in option C the kinds of changes we'd make to database descriptors become more complex). Vivek's schema leases should take care of this no matter which alternative we choose here, though.

I have addded your reaction to the rationale section below.

@knz knz force-pushed the 20181003-rfc-schemas branch from 4f558b8 to 9e80d86 Compare January 3, 2019 16:16
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed/studied this with @bobvauwter today. Came up with new alternative D; also found pitfalls due to mixed-version clusters with the other alternatives and the new one. Came up with a test scenario to implement as foundation for the evaluation of the various alternatives.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@Disturbing
Copy link

Waiting for this one to go live. Any ETA?

@knz
Copy link
Contributor Author

knz commented Mar 25, 2019 via email

@jordanlewis
Copy link
Member

@Disturbing, out of curiosity, what's your use case? In CockroachDB, there's little difference between a CREATE DATABASE statement and what a CREATE SCHEMA would eventually do. Could you make do with a DATABASE?

@Disturbing
Copy link

Im working on using hasura's graphql engine @ global scale. Last mentioned, there's an error here: hasura/graphql-engine#411 (comment)

I can ask their team to change the setup if you have a recommendation.

I personally haven't ran the startup just yet, but I should be able to find some time to repo this error if it still exists or not.

@jordanlewis
Copy link
Member

@Disturbing my recommendation would be to use databases instead of schemas and see whether that satisfies your use case.

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019
@knz knz force-pushed the 20181003-rfc-schemas branch from 9e80d86 to 1128801 Compare October 11, 2019 15:12
@knz
Copy link
Contributor Author

knz commented Oct 11, 2019

Updated RFC with new version numbers, also minor simplification informed by #41482

@otan
Copy link
Contributor

otan commented Mar 17, 2020

hey @knz is this still up to date given the temp table work? i'm not sure what the follow up is given that schemas have an id in the system.namespace table, but there is no metadata work. are we looking to do something along the lines of B or C?

furthermore, there is a performance issue we need to solve with table resolution given #44802

@ajwerner
Copy link
Contributor

I like B best, but as you note, we'll need a story around caching and retrieval. I would like to move databases and schemas to be leased resources like tables. There are lots of details to be worked out but in my mind:

A database version will change when adding, deleting, or renaming a child. Same for a schema.

We'll also need a migration to move the tables to a public schema.

There's lots of details to work out here. This is all part of planning work for the upcoming sql schema team.

@knz
Copy link
Contributor Author

knz commented Mar 18, 2020

This draft RFC has not been updated since Arul chose his scheme for the new system.namespace. Now that this choice is made, some of the discussion in the RFC can be collapsed.

is this a discussion we want to pick back up now?

@rohany
Copy link
Contributor

rohany commented Mar 18, 2020

It looks like user defined schemas will be an item for the sql features team in 20.2, so its time for us to start thinking about this again.

@ajwerner
Copy link
Contributor

It looks like user defined schemas will be an item for the sql features team in 20.2, so its time for us to start thinking about this again.

Yep, also on the schema teams plate. The tentative breakdown is that name resolution and the associated leasing falls on the schema team. Let's make sure to coordinate.

@thoszhang
Copy link
Contributor

For anyone following this issue, there's an updated RFC for user-defined schemas here: #48276

@thoszhang thoszhang force-pushed the 20181003-rfc-schemas branch from 1128801 to 88e35a3 Compare May 1, 2020 14:41
@thoszhang
Copy link
Contributor

At knz's suggestion I am marking this version of the RFC as "superseded" and merging it.

bors r+

@blathers-crl
Copy link

blathers-crl bot commented May 1, 2020

❌ The GitHub CI (Cockroach) build has failed on 88e35a37.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@craig
Copy link
Contributor

craig bot commented May 1, 2020

Build succeeded

@craig craig bot merged commit d8b16b5 into cockroachdb:master May 1, 2020
craig bot pushed a commit that referenced this pull request May 16, 2020
48276: rfcs: user-defined schemas r=lucy-zhang a=lucy-zhang

#30916 is @knz's original RFC for user-defined schemas. I created a mostly-new one because our support for schemas has improved significantly over the last few releases (in part motivated by temp tables), and now many of the open questions in the previous RFC have been resolved. This new version attempts to pinpoint the challenges we have yet to deal with. A few sections of this are taken more or less directly from the previous RFC, where applicable.

Closes #47148.

Release note: None

Co-authored-by: Lucy Zhang <lucy-zhang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
X-noremind Bots won't notify about PRs with X-noremind
Projects
None yet
Development

Successfully merging this pull request may close these issues.