Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: make view descriptors depend on table/columns by IDs #15388

Closed
wants to merge 2 commits into from

Conversation

knz
Copy link
Contributor

@knz knz commented Apr 26, 2017

(Needed for #12611 and #13968.)

This patch does what it says on the label: the descriptor still
contains a valid SQL query, but with all table names rewritten to use
numeric table references. These are hidden initially, until
the referenced tables gets a new schema.

For example:

 CREATE TABLE kv(k INT PRIMARY KEY, v INT);
 CREATE VIEW vx AS SELECT v AS x FROM kv;
 SHOW CREATE VIEW vx;
 +------+-----------------------------------------+
 | View |           CreateView                    |
 +------+-----------------------------------------+
 | vx   | CREATE VIEW vx AS SELECT v AS x FROM kv |
 +------+-----------------------------------------+
 (1 row)
 ALTER TABLE kv ADD COLUMN d INT;
 SHOW CREATE VIEW vx;
 +------+--------------------------------------------------------------+
 | View |                          CreateView                          |
 +------+--------------------------------------------------------------+
 | vx   | CREATE VIEW vx AS SELECT v AS x FROM [64(1, 2) AS kv (k, v)] |
 +------+--------------------------------------------------------------+
 (1 row)

A side effect of this patch is that view definitions can now use star
expansion, i.e. CREATE VIEW kv_alias AS SELECT * FROM kv is now
possible and valid. (And there was much rejoicing.)

Another side effect, visible in the example above, is that SHOW CREATE VIEW now reveals the structural dependency (showing both the
ID and the original table name) when the table's name or column
definitions change.

This may or may not be desirable from a UX perspective, however anyone
wishing to improve upon this will take note that if the column list in
the table descriptor was altered after the view was created (e.g. to
add new columns, rename columns or remove columns not depended on by
the view), there is no valid way to print out the view definition
using valid SQL but without using the table reference syntax. Consider
the example above: suppose column "v" was renamed to "w"; trying to
print out as create view vx as select v as x from kv would be
invalid because then column "v" would have disappeared; then suppose
"v" was renamed to "w" and a new unrelated column "v" was added,
trying create view vx as select v as x from kv as kv(v) would be
invalid as well because the new column v is unrelated to the one the
view was intended to depend on. Due to these obstacles, it is
advisable to let the numeric table reference show up in the output of
SHOW CREATE VIEW and document this behavior.

Although it would be correct to do so for newly-created clusters, this
patch does not lift the restriction on ALTER that prevents it from
renaming tables or columns depended on by views. This change would be
correct on new clusters because once the guarantee is enforced that
all view descriptors depend on table/columns by IDs, renaming becomes
possible (renaming preserves the ID); however, for previously created
clusters there may exist already some view descriptors containing
by-name table references, and allowing those to be renamed would be
unsound. The proper approach is to completement this patch by a
cluster migration which rewrites all existing view descriptors. This
work is left to a subsequent commit.

A last improvement brought by this PR is that column renames are now
properly visible in pg_catalog:

> CREATE VIEW v (x) AS SELECT k FROM kv
> SELECT definition FROM pg_catalog.pg_views WHERE viewname = 'v'
SELECT k AS x FROM (SELECT k FROM kv)

cc @a-robinson

@knz knz requested a review from RaduBerinde April 26, 2017 22:51
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz force-pushed the structural-views branch from 2b1eb2e to a2038c2 Compare April 26, 2017 23:02
@RaduBerinde
Copy link
Member

Great stuff!

How would CREATE VIEW vx AS SELECT a AS x FROM kv AS foo(a,b); show up? CREATE VIEW vs AS SELECT a AS x FROM [13(1,2) AS kv(k,v)] AS foo(a,b) ? And if we rename kv(k,v) to bar(x,y) how would it show up?

One thing I don't quite understand is: if I'm understanding the change correctly, we look up the current table and column names when we print out the expression. Why wouldn't showing that (without the numeric references) be correct? In your example, suppose column "v" was renamed to "w", we we would display it with the new column name "w".


Review status: 0 of 16 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/sql/testdata/logic_test/views, line 504 at r1 (raw file):

query II
SELECT * FROM s1 ORDER BY 1
----

Can we have some tests that rename tables/columns and then do SHOW CREATE VIEW again?


Comments from Reviewable

@knz knz force-pushed the structural-views branch from a2038c2 to 3cc6930 Compare April 26, 2017 23:13
@knz
Copy link
Contributor Author

knz commented Apr 26, 2017

root@:26257/t> CREATE VIEW vx AS SELECT a AS x FROM kv AS foo(a,b);
CREATE VIEW
root@:26257/t> SHOW CREATE VIEW vx;
+------+----------------------------------------------------------------------------+
| View |                                 CreateView                                 |
+------+----------------------------------------------------------------------------+
| vx   | CREATE VIEW vx AS SELECT a AS x FROM [64(1, 2) AS kv (k, v)] AS foo (a, b) |
+------+----------------------------------------------------------------------------+
(1 row)

Renames are not yet activated, as per the commit message (I haven't implemented the necessary migration yet). However, hypothetically:

root@:26257/t> CREATE VIEW vx AS SELECT a AS x FROM kv AS foo(a,b);
CREATE VIEW
root@:26257/t> ALTER TABLE kv ALTER COLUMN k RENAME TO x;
ALTER TABLE
root@:26257/t> ALTER TABLE kv ALTER COLUMN v RENAME TO y;
ALTER TABLE
root@:26257/t> ALTER TABLE kv RENAME TO bar;
ALTER TABLE
root@:26257/t> SHOW CREATE VIEW vx;
+------+----------------------------------------------------------------------------+
| View |                                 CreateView                                 |
+------+----------------------------------------------------------------------------+
| vx   | CREATE VIEW vx AS SELECT a AS x FROM [64(1, 2) AS kv (k, v)] AS foo (a, b) |
+------+----------------------------------------------------------------------------+
(1 row)

The "AS kv" clause is embedded into the view descriptor, for good reason: every potential other reference to the names "kv", "k" or "v" in the rest of the query cannot be (easily) rewritten.

To answer your other points:

if I'm understanding the change correctly, we look up the current table and column names when we print out the expression.

No that's not correct; the names are embedded alongside the IDs during view creation. The code for SHOW CREATE VIEW is unchanged, and is dumb (it simply prints out whatever it finds in the descriptor as-is).

[Assuming SHOW CREATE could look up the "new" name from a numeric table reference] Why wouldn't showing that (without the numeric references) be correct?

That's because we do not have valid SQL syntax to display a non-numeric-reference table where the set of columns is different.

Again the example from initially (hypothetically):

CREATE TABLE kv(k, v)
CREATE VIEW kv_alias AS SELECT * FROM kv;
ALTER TABLE kv ALTER COLUMN v RENAME TO w;
ALTER TABLE kv ADD COLUMN v INT;
SHOW CREATE VIEW kv_alias;

What do you expect to see?

  • CREATE VIEW kv_alias AS SELECT * FROM kv -> incorrect, v has changed definition
  • CREATE VIEW kv_alias AS SELECT * FROM kv AS kv(k, v) -> incorrect, mismatch in the number of columns (kv has 3 columns now)

Maybe you're thinking about constructing an ad-hoc subquery CREATE VIEW kv_alias AS SELECT * FROM (SELECT k, w AS v FROM kv), but that would cause the show create view / create view not to roundtrip properly (we check for that in multiple places)

@RaduBerinde
Copy link
Member

I see. My thinking was that we could just replace the IDs with the current names, and we can do that even if we have to keep the AS old_name(old_cols,..) along with it. But now I understand that this notation not only allows use of IDs, but also allows projecting to a subset of columns (this is quite a departure from what FROM <table> can normally do).

This projection is only useful if we want to leave any * as is instead of pre-expanding it. Frankly I think things would be clearer if we just pre-expand it. If we did that, we could always just use the current table name and rename it with AS as necessary. So in your last example, I would expect CREATE VIEW kv_alias AS SELECT k,v FROM kv AS kv(k,v,unused1).

The ad-hoc subquery to do the projection would work too I guess, as long as we only include it when a projection is necessary. There wouldn't be any round-tripping issues unless there's a schema change in-between.

@RaduBerinde
Copy link
Member

The code :lgtm:


Review status: 0 of 16 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/sql/data_source.go, line 443 at r2 (raw file):

// renameSource applies an AS clause to a data source.
func renameSource(
	src planDataSource, as parser.AliasClause, includeHidden bool,

is an AliasClause that includes hidden columns valid? Do we tolerate both in the syntax, depending on the number of columns?


Comments from Reviewable

@knz
Copy link
Contributor Author

knz commented Apr 27, 2017

In a regular AS clause (from t AS x(y,z)) hidden columns are never taken into account; in a numeric table reference (from [123 as x(y,z)]), hidden columns are always taken into account.

I will look into doing the other thing you mentioned earlier.

@knz knz added the docs-todo label Apr 27, 2017
@a-robinson
Copy link
Contributor

Thanks for doing this, @knz! Let me know if you want a full review from me, otherwise I'm going to trust that @RaduBerinde found everything I would have and more.


Review status: 0 of 16 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@knz knz added this to the 1.1 milestone May 2, 2017
@knz
Copy link
Contributor Author

knz commented May 2, 2017

Bumping to 1.1, too much cleanup to do here for today

@knz
Copy link
Contributor Author

knz commented May 12, 2017

I'm kinda bummed about this PR, there's an unforeseen hurdle: when restoring a backup that was made prior to this change, the descriptor must be rewritten but it's not easy to handle this in the restore context.

A good way forward would be to introduce a new FormatVersion value for table descriptors, which is one beyond the current one, and indicates the descriptor is using structural dependencies. This way the backup code can mostly ignore the problem.

(It also solves the other problem which this PR had from the beginning: the lack of a migration story that would enable ALTER RENAME.)

However doing so requires introducing a version upgrade path that doesn't exist currently: one that looks up other descriptors in the process of updating the current one. As the descriptor upgrade can occur in the lease code, I am very concerned that there's some kind of care to be taken that the table descriptors that will be used during the rename are "atomically considered" with the lease being taken on the upgraded descriptor. I do not know how to do this.

@jordanlewis do you think we could sit together at some point and look at this together? I believe you are more familiar with this code.

@knz knz force-pushed the structural-views branch 2 times, most recently from b2377fe to 90370ee Compare May 12, 2017 19:18
@knz
Copy link
Contributor Author

knz commented Jul 7, 2017

@cuongdo I would like to ressucitate this PR, but I fear it needs concerted attention from people comfortable with both the lease code, descriptor code and perhaps backup code.

@knz
Copy link
Contributor Author

knz commented Jul 28, 2017

An update on this PR:

  • with suggestions from @danhhz I was able to ensure descriptors are properly reassigned to new IDs upon restore
  • this PR still misses a general migration step to update "old format" view queries in existing descriptors - I will do this next. Two separate migrations needs to exist, both at cluster start-up and during restore (restoring an old-format desc onto a new-format db)

@knz
Copy link
Contributor Author

knz commented Jul 28, 2017

Some tests failing due to base changes. Will also fix next.

@knz
Copy link
Contributor Author

knz commented Jul 30, 2017

One step forward - the only remaining broken test is because of #17306, which I will fix shortly.

knz added 2 commits August 2, 2017 00:22
Prior to this patch, view dependency analysis during CREATE VIEW was
broken because it would only check dependencies after plan
optimization, i.e. possibly after some dependency were lost. For
example, with the queries:

```sql
CREATE VIEW v AS SELECT k FROM (SELECT k,v FROM kv) -- loses dependency on kv.v
CREATE VIEW v AS SELECT k,v FROM kv WHERE FALSE -- loses dependency on kv
```

This patch addresses the issue as follows:

- the dependencies are now collected during the initial construction
  of the query plan, before any optimization are applied. This way we
  ensure the dependency tracking is complete.
- a migration is implemented that will fix any view descriptor and
  corresponding dependency information that was populated prior to
  this fix.
This patch does what it says on the label: the descriptor still
contains a valid SQL query, but with all table names rewritten to use
numeric table references. These are hidden initially, until
the referenced tables gets a new schema.

For example:
```
 CREATE TABLE kv(k INT PRIMARY KEY, v INT);
 CREATE VIEW vx AS SELECT v AS x FROM kv;
 SHOW CREATE VIEW vx;
 +------+-----------------------------------------+
 | View |           CreateView                    |
 +------+-----------------------------------------+
 | vx   | CREATE VIEW vx AS SELECT v AS x FROM kv |
 +------+-----------------------------------------+
 (1 row)
 ALTER TABLE kv ADD COLUMN d INT;
 SHOW CREATE VIEW vx;
 +------+--------------------------------------------------------------+
 | View |                          CreateView                          |
 +------+--------------------------------------------------------------+
 | vx   | CREATE VIEW vx AS SELECT v AS x FROM [64(1, 2) AS kv (k, v)] |
 +------+--------------------------------------------------------------+
 (1 row)
```

A side effect of this patch is that view definitions can now use star
expansion, i.e. `CREATE VIEW kv_alias AS SELECT * FROM kv` is now
possible and valid. (And there was much rejoicing.)

Another side effect, visible in the example above, is that `SHOW
CREATE VIEW` now reveals the structural dependency (showing both the
ID and the original table name) when the table's name or column
definitions change.

This may or may not be desirable from a UX perspective, however anyone
wishing to improve upon this will take note that if the column list in
the table descriptor was altered after the view was created (e.g. to
add new columns, rename columns or remove columns not depended on by
the view), there is no valid way to print out the view definition
using valid SQL but without using the table reference syntax. Consider
the example above: suppose column "v" was renamed to "w"; trying to
print out as `create view vx as select v as x from kv` would be
invalid because then column "v" would have disappeared; then suppose
"v" was renamed to "w" and a new unrelated column "v" was added,
trying `create view vx as select v as x from kv as kv(v)` would be
invalid as well because the new column `v` is unrelated to the one the
view was intended to depend on. Due to these obstacles, it is
advisable to let the numeric table reference show up in the output of
`SHOW CREATE VIEW` and document this behavior.

Although it would be correct to do so for newly-created clusters, this
patch does *not* lift the restriction on ALTER that prevents it from
renaming tables or columns depended on by views. This change would be
correct on new clusters because once the guarantee is enforced that
all view descriptors depend on table/columns by IDs, renaming becomes
possible (renaming preserves the ID); however, for previously created
clusters there may exist already some view descriptors containing
by-name table references, and allowing those to be renamed would be
unsound. The proper approach is to completement this patch by a
cluster migration which rewrites all existing view descriptors. This
work is left to a subsequent commit.

A last improvement brought by this PR is that column renames are now
properly visible in pg_catalog:

```sql
> CREATE VIEW v (x) AS SELECT k FROM kv
> SELECT definition FROM pg_catalog.pg_views WHERE viewname = 'v'
SELECT k AS x FROM (SELECT k FROM kv)
```
@knz knz force-pushed the structural-views branch from d9271e2 to ba28022 Compare August 2, 2017 00:39
@knz knz requested a review from a team as a code owner August 2, 2017 00:39
@knz
Copy link
Contributor Author

knz commented Aug 2, 2017

cc @danhhz can you check I didn't do anything foolish in restore.go specifically?

@danhhz
Copy link
Contributor

danhhz commented Aug 2, 2017

I only looked at restore.go. Left one comment


Review status: 0 of 44 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/ccl/sqlccl/restore.go, line 311 at r4 (raw file):

			}
		}
		if hasRestoredDep {

since we error in the second branch, this is equivalent to if len(table.DependsOn) > 0, right? I find that a little clearer

In the event that you strongly disagree and want to keep it like this, then please rename hasRestoredDep to something with View in it

This all also really wants a backup/restore test, but I'm fine leaving that for a followup given how big this PR is already


Comments from Reviewable

@knz
Copy link
Contributor Author

knz commented Aug 4, 2017

@RaduBerinde @jordanlewis I am considering closing this PR in favor of another solution I discovered this week:

instead of rewriting the view query as done in this PR, instead,

  1. extend instead the view descriptor with a new field "Apparent Schema" containing a mapping of (db name, table name) to (table ID, list of visible column IDs+Names, list of visible indexes+names), initialized at the moment the view is created
  2. extend the parameters of getDataSource() with a similar mapping
  3. populate the parameters of getDataSource() for further recursion each time a view descriptor is opened by getViewSource
  4. make getDataSource()/getTablePlan() first look up table names using the new mapping before going to the table cache / KV.

The advantage is that the view query would stay identical which would improve UX when looking at the view query after new columns have been added to the underlying views/tables.

What do you think?

@RaduBerinde
Copy link
Member

I agree that would better from a UX perspective. I don't know if there are any subtleties involved implementation-wise, I'll let @jordanlewis chime in.

@knz
Copy link
Contributor Author

knz commented Aug 8, 2017

Closing in favor of #17501.

@knz knz closed this Aug 8, 2017
@knz knz deleted the structural-views branch April 27, 2018 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants