Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SHOW CREATE: have an option to dump all tables' create statements #53488

Closed
mwang1026 opened this issue Aug 26, 2020 · 26 comments · Fixed by #60539
Closed

SHOW CREATE: have an option to dump all tables' create statements #53488

mwang1026 opened this issue Aug 26, 2020 · 26 comments · Fixed by #60539
Assignees
Labels
A-disaster-recovery C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery

Comments

@mwang1026
Copy link

mwang1026 commented Aug 26, 2020

With the deprecation of dump, on of the popular use cases was the ability to see all schemas for a database in one place to review schema. In lieu of that, we want to create an alternative solution for that use case.

One suggestion was a SHOW CREATE type command which would print the create statements for all database, etc. to address that particular use case.

Jira issue: CRDB-13989

@mwang1026 mwang1026 added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-disaster-recovery labels Aug 26, 2020
@keith-mcclellan
Copy link
Contributor

Why is dump being deprecated? This is news to the SE team and it's a widely used feature in the field for migrations.

@mwang1026
Copy link
Author

What do you mean by migrations? And what part of dump is used?

I don't want us to get caught up in being tied to a solution especially if we can make a better one. This issue is an attempt to give an alternative to the "see all me schemas" use cases such as auditing schemas.

If you can point out a use case that this doesn't cover in lieu of dump we're all ears.

@rafiss
Copy link
Collaborator

rafiss commented Nov 16, 2020

One idea: could such a command be composed by doing something in combination with SHOW TABLES? Even if so, we could still make an alias for that.

@CyborgMaster
Copy link

CyborgMaster commented Nov 16, 2020

One of the use cases that we use this for as well is to have a point in time checkpoint to create a dev database. We currently do a dump --dump-mode=schema on every commit, and then where a developer needs to spin up a new local db (due to a new machine or just blowing away his local db because he borked it), we just load that dump. It would be great if the format used by any new schema dump tool was loadable on an empty db to recreate the schema.

@CyborgMaster
Copy link

The other alternative is to re-run the migrations since the beginning of time, which we do save in source control, but since cockroach takes a little which to run migrations this can take several minutes, while just loading the schema dump is only a few seconds.

@Freeaqingme
Copy link

Freeaqingme commented Nov 28, 2020

I second the request. I'm also used to dump the database structure in every Git commit. This allows to easily review any migrations that were introduced. Especially for early projects where migrations are plentiful.

On a related note, the 'cockroach dump' command allows to dump to stdout. Even though that makes no sense for large online enterprise databases, it is really perfect for a (small) dev environment, especially when running in docker. I don't think the new BACKUP functionality provides for that.

@kernfeld-cockroach
Copy link
Contributor

Cockroach Cloud currently uses buffalo's pop ORM, which currently relies on cockroach dump. In order for Cockroach Cloud to upgrade to CRDB v21.x, we'd need to arrange an update to pop so that it no longer uses cockroach dump. The particular error that I saw was when running migrations, although it's possible that cockroach dump is also called in other code paths.

❯ soda migrate -e test-base
v5.0.11
[POP] 2020/12/10 11:58:58 info - Migrations already up to date, nothing to apply
[POP] 2020/12/10 11:58:58 info - 0.0311 seconds
ERROR: unknown command "dump" for "cockroach"
Did you mean this?
	demo
Failed running "cockroach"
[POP] 2020/12/10 11:58:58 warn - Migrator: unable to dump schema: exit status 1

@rafiss
Copy link
Collaborator

rafiss commented Dec 10, 2020

The solution here will probably involve extracting out the getDumpMetadata function that's in dump right now and moving it somewhere else.

cockroach/pkg/cli/dump.go

Lines 518 to 521 in 92d9495

// getDumpMetadata retrieves the table information for the specified table(s).
// It also retrieves the cluster timestamp at which the metadata was
// retrieved.
func getDumpMetadata(

@vy-ton
Copy link
Contributor

vy-ton commented Jan 5, 2021

@vy-ton investigate what the SQL statement should be

@rafiss
Copy link
Collaborator

rafiss commented Jan 12, 2021

@RichardJCai will pick this up

@vy-ton
Copy link
Contributor

vy-ton commented Jan 14, 2021

@CyborgMaster @Freeaqingme @kernfeld-cockroach cockroach dump --dump-mode=schema only outputs tables. To recreate a database structure, would you also want a command that outputs all database objects (sequences, tables, types, user-defined schemas, and views)?

@CyborgMaster
Copy link

The tables are the most important, but yes, a command that outputs the sql to fully recreate all the objects in the db schema would be ideal!

@kernfeld-cockroach
Copy link
Contributor

For the buffalo use case, I think it's intended to just print out the schema of every table for the user's benefit; it doesn't need to print objects. I think the format should be pretty flexible, since it looks like it's more about helping the user understand the state of the schema, rather than producing an output that will be fed into some other tool.

@vy-ton
Copy link
Contributor

vy-ton commented Jan 19, 2021

Proposed SHOW command

SHOW CREATE TABLES
SHOW CREATE TABLES <table> <table>

@RichardJCai
Copy link
Contributor

Proposed SHOW command

SHOW CREATE TABLES
SHOW CREATE TABLES <table> <table>

Turns out SHOW CREATE TABLES is ambiguous grammar (TABLES is a valid table name so it would conflict with the existing SHOW CREATE grammar.

SHOW CREATE ALL TABLES would for example would be a valid option.

@vy-ton
Copy link
Contributor

vy-ton commented Jan 21, 2021

@RichardJCai A few questions

  1. For the use case to replace cockroach dump <database> <table> <table...> --dump-mode=schema, could we make SHOW CREATE output the flat structure when multiple tables are supplied? For example, SHOW CREATE orders packages
  2. In the future, would SHOW CREATE ALL be valid grammar. This would support outputting all database objects not just tables.

If yes to both questions above, I think SHOW CREATE ALL TABLES could work.

@CyborgMaster
Copy link

Which ever way we go, I think it would be nice for the output to have the Foreign keys inside the create table statements. That does require the dump tool to topologically sort the create statements by foreign key dependencies to ensure they are created in the right order, but it has the advantage of making the output easier to visually inspect to understand the schema along with being faster to boostrap an empty database as the alter statements to add the foreign keys are a lot slower than creating them along with the table.

@kernfeld-cockroach
Copy link
Contributor

topologically sort the create statements by foreign key dependencies

That is a pretty cool idea! One weird edge case to this is that CockroachDB allows cycles of foreign key constraints, so I think it's not always possible to topologically sort tables by foreign key dependencies. We'd need some kind of way to handle this case.

craig bot pushed a commit that referenced this issue Feb 11, 2021
60154: sql: add crdb_internal.show_create_all_tables builtin r=rafiss a=RichardJCai

sql: add crdb_internal.show_create_all_tables builtin

Making this a PR first before continuing with #53488, we can alias this builtin with SHOW CREATE ALL TABLES.

Example output:
```
query T
SELECT crdb_internal.show_create_all_tables('d')
----
CREATE TABLE public.parent (
  x INT8 NULL,
  y INT8 NULL,
  z INT8 NULL,
  UNIQUE INDEX parent_x_y_z_key (x ASC, y ASC, z ASC),
  UNIQUE INDEX parent_x_key (x ASC),
  FAMILY f1 (x, y, z, rowid)
);
CREATE TABLE public.full_test (
  x INT8 NULL,
  y INT8 NULL,
  z INT8 NULL,
  UNIQUE INDEX full_test_x_key (x ASC),
  FAMILY f1 (x, y, z, rowid)
);
CREATE SEQUENCE public.s MINVALUE 1 MAXVALUE 9223372036854775807 INCREMENT 1 START 1;
CREATE VIEW public.vx ("?column?") AS SELECT 1;
ALTER TABLE public.full_test ADD CONSTRAINT fk_x_ref_parent FOREIGN KEY (x, y, z) REFERENCES public.parent(x, y, z) MATCH FULL ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE public.full_test ADD CONSTRAINT test_fk FOREIGN KEY (x) REFERENCES public.parent(x) ON DELETE CASCADE;
-- Validate foreign key constraints. These can fail if there was unvalidated data during the SHOW CREATE ALL TABLES
ALTER TABLE public.full_test VALIDATE CONSTRAINT fk_x_ref_parent;
ALTER TABLE public.full_test VALIDATE CONSTRAINT test_fk;
```

Release note (sql change): crdb_internal.show_create_all_tables is
a new builtin that takes in a database name (string) and returns
a flat log of all the CREATE TABLE statements in the database followed
by alter statements to add constraints. The output can be used
to recreate a database. This builtin was added to replace old dump logic.

60469: opt: prefer sorting fewer columns r=RaduBerinde a=RaduBerinde

Currently, if we have to sort results and project a new column, there
is no cost difference between the two orders and we happen to prefer
the sort on top. It is preferable to sort before adding new columns to
avoid storing the extra value in memory or on disk.

This change improves the sort costing by adding a cost proportional to
the total number of values.

Fixes #32952.

Co-authored-by: richardjcai <caioftherichard@gmail.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
@craig craig bot closed this as completed in aa8f949 Feb 16, 2021
@knz
Copy link
Contributor

knz commented Feb 17, 2021

The current implementation has a bit too many shortcomings. We'll need to rework that.

@knz knz reopened this Feb 17, 2021
@knz
Copy link
Contributor

knz commented Feb 17, 2021

I'm going to comment here on the implementation of #60154, which is the root of the problem. Once that problem has been addressed, it's going to be easy to rework #60539 accordingly.

So here's the shortcoming: the proposed built-in function returns just a single scalar value and composes its result with string concatenation.

  • there is no memory accounting
  • if there are many tables, the result will be too large and crash certain drivers (the pg protocol was not designed for payloads larger than a few hundred kilobytes)
  • the result is accumulated in RAM throughout the operation, which in turn will cause the query to balloon in proportion to the number of tables

The fundamental design mistake here is that the built-in should not have been a scalar.

At the very least it should have been a set-returning function, with one row per CREATE or ALTER statement. So the behavior of this function should be to stream out one SQL statement at a time, without keeping more than 1 output statement in RAM at a time.

Meanwhile, I appreciate that this is not trivial because we need to order the tables in dependency order.

The ordering of the table is currently done by performing 1) a query to get all descriptors 2) another, separate subsequent query to extract the dependency links 3) an in-memory sort.

This is too much work and again causes the in-RAM state to balloon unconstrained. The proper approach is to perform this sort using SQL, using a join between the descriptor table and itself, with a predicate (probably using a JSON operator) on the backward dependency. We have the pb_to_json built-ins for this purpose nowadays. The reason why this works wrt memory usage is that a SQL-level sort properly uses disk spilling.

(To ensure the query is fully streamable, it's pretty important to not generate all the create statements upfront, i.e. you can't rely on the existing crdb_internal.create_statements vtable. Instead you'd need to define two scalar built-ins, one show_create_statement(descid) and show_alter_statements(descid) that produce the create and alter statement separately for just 1 table. Then you'd apply these built-in functions on the ordered result of the self-join.)

Once you have a crdb_internal.show_create_all_tables() SRF, you can define SHOW CREATE ALL TABLES as SELECT create_statement FROM crdb_internal.show_create_all_tables()

Also to get the streaming behavior right you'll also be depending on the new streaming interface for the internal executor that @yuzefovich has produced. That dependency, in turn, makes this work non back-portable.

@RichardJCai
Copy link
Contributor

I'm going to comment here on the implementation of #60154, which is the root of the problem. Once that problem has been addressed, it's going to be easy to rework #60539 accordingly.

So here's the shortcoming: the proposed built-in function returns just a single scalar value and composes its result with string concatenation.

  • there is no memory accounting
  • if there are many tables, the result will be too large and crash certain drivers (the pg protocol was not designed for payloads larger than a few hundred kilobytes)
  • the result is accumulated in RAM throughout the operation, which in turn will cause the query to balloon in proportion to the number of tables

The fundamental design mistake here is that the built-in should not have been a scalar.

At the very least it should have been a set-returning function, with one row per CREATE or ALTER statement. So the behavior of this function should be to stream out one SQL statement at a time, without keeping more than 1 output statement in RAM at a time.

Meanwhile, I appreciate that this is not trivial because we need to order the tables in dependency order.

The ordering of the table is currently done by performing 1) a query to get all descriptors 2) another, separate subsequent query to extract the dependency links 3) an in-memory sort.

This is too much work and again causes the in-RAM state to balloon unconstrained. The proper approach is to perform this sort using SQL, using a join between the descriptor table and itself, with a predicate (probably using a JSON operator) on the backward dependency. We have the pb_to_json built-ins for this purpose nowadays. The reason why this works wrt memory usage is that a SQL-level sort properly uses disk spilling.

(To ensure the query is fully streamable, it's pretty important to not generate all the create statements upfront, i.e. you can't rely on the existing crdb_internal.create_statements vtable. Instead you'd need to define two scalar built-ins, one show_create_statement(descid) and show_alter_statements(descid) that produce the create and alter statement separately for just 1 table. Then you'd apply these built-in functions on the ordered result of the self-join.)

Once you have a crdb_internal.show_create_all_tables() SRF, you can define SHOW CREATE ALL TABLES as SELECT create_statement FROM crdb_internal.show_create_all_tables()

Also to get the streaming behavior right you'll also be depending on the new streaming interface for the internal executor that @yuzefovich has produced. That dependency, in turn, makes this work non back-portable.

Noted, will work on fix asap.

@CyborgMaster
Copy link

That is a pretty cool idea! One weird edge case to this is that CockroachDB allows cycles of foreign key constraints, so I think it's not always possible to topologically sort tables by foreign key dependencies. We'd need some kind of way to handle this case.

I think in the ideal case what this would do is find all of the non-cyclical dependencies and include the foreign keys in the create statements, and then in the case of cycles arbitrarily break the cycle with an after the fact alter statement. I recognize that of course this is a non-trival amount of work.

@rafiss
Copy link
Collaborator

rafiss commented Mar 9, 2021

The new syntax SHOW CREATE ALL TABLES will be available in v21.1. It will output one row for each statement to set up the schema of a database.

21.1 will also keep the cockroach dump CLI command with support ONLY for --dump-mode=schema. This gives tools that rely on cockroach dump --dump-mode=schema until the 21.2 release (scheduled for Fall 2021) to switch away from using dump.

@rafiss rafiss closed this as completed Mar 9, 2021
@RichardJCai
Copy link
Contributor

Also just want to add that SHOW CREATE ALL TABLES only shows tables/views/sequences. Please leave any notes / open a new issue if there is a need to show types and schemas as well.

@bikbah
Copy link

bikbah commented Mar 19, 2022

SHOW CREATE ALL TABLES gives DDL for creating tables, but misses user defined types (enums).
is there any option to get DDL for creating all user defined types too?

@edigaryev
Copy link

is there any option to get DDL for creating all user defined types too?

There's SHOW CREATE ALL TYPES.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery
Projects
None yet
Development

Successfully merging a pull request may close this issue.