sql: (temporary) Hang when dropping unique index created after ALTER PRIMARY KEY #45150

Closed
andy-kimball opened this issue Feb 18, 2020 · 8 comments

@andy-kimball (Contributor)

Run the following SQL script:

```
SET experimental_enable_primary_key_changes = true;
CREATE TABLE tdup (x INT PRIMARY KEY, y INT NOT NULL, z INT NOT NULL);
INSERT INTO tdup VALUES (1, 1, 1);
ALTER TABLE tdup ALTER PRIMARY KEY USING COLUMNS (y, z);
DROP INDEX tdup_x_key CASCADE;
```

EXPECTED: This should switch the primary key to columns (y, z) and drop the unique index synthesized by the ALTER PRIMARY KEY code.

ACTUAL: The DROP INDEX statement hangs.

NOTE: This happens with and without the #45093 fix.
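
For anyone reproducing this, the index that ALTER PRIMARY KEY synthesizes (named `tdup_x_key` in the script above) can be inspected before the DROP INDEX. A minimal sketch using standard CockroachDB introspection statements:

```
-- Show the secondary index synthesized to preserve uniqueness of the
-- old primary key column x after the primary key change.
SHOW INDEXES FROM tdup;
-- Or view the full table definition, including the synthesized index.
SHOW CREATE TABLE tdup;
```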

@andy-kimball (Contributor, Author)

Hm, it looks like after > 1 minute, it did complete. Is that expected?

@rohany (Contributor) commented Feb 18, 2020

Yes, this is expected. The current job framework is unable to spawn child jobs from a given job. To get around this, the primary key change process leaves the index-drop jobs for the async schema changer to pick up and run. Until the async schema changer picks up those jobs, mutations queued behind them have to wait. The interval for these schema changers is 1 minute plus or minus some jitter, so this seems to be working as expected.

Lucy and I have been thinking about how to get around this blemish though.
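
For readers following along, the queued cleanup work can be observed from another session while the DROP INDEX statement appears hung. A minimal sketch; SHOW JOBS is standard CockroachDB syntax, and the status filter values are the usual ones for queued and running jobs:

```
-- Run from a separate session while DROP INDEX is blocked.
SELECT job_id, job_type, description, status
FROM [SHOW JOBS]
WHERE job_type = 'SCHEMA CHANGE'
  AND status IN ('pending', 'running');
```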

@jordanlewis changed the title from "sql: Hang when dropping unique index created after ALTER PRIMARY KEY" to "sql: (temporary) Hang when dropping unique index created after ALTER PRIMARY KEY" on Mar 3, 2020
@jordanlewis (Member)

When @otan's NOTICE feature gets in, let's add a NOTICE saying that the schema change will take a minute to kick off, as a mitigation if we can't fix the issue completely.

@thoszhang (Contributor)

I think if we implement the same pattern as in #46929 (use CreateStartableJobWithTxn and then start the job after we publish the table descriptor), we should at least be able to have the index-drop job start immediately. That would decrease the amount of time we're hanging.

@thoszhang (Contributor)

#47624 implements what I described above.

craig bot pushed a commit that referenced this issue Apr 22, 2020
47624: sql: start index drop job for primary key changes immediately r=lucy-zhang a=lucy-zhang

Previously, the job to drop old indexes after a primary key change was
created at the end of the schema change and left for the job registry to
adopt at some later time. This PR switches the job creation over to
`CreateStartableJobWithTxn`, which allows for starting the job
immediately after the transaction which creates the job is committed.

This eliminates one source of waiting time before other schema changes,
which are queued behind the mutation to drop the indexes, can run. Note,
however, that successive schema changes don't start running immediately
after the mutations ahead of them in the queue are all cleared; if
they've ever hit a retry error due to not being first in line, then they
will have to wait to be re-adopted by the job registry. This is why the
existing primary key change tests still need to have the adopt interval
set to a low value in order to finish quickly.

Addresses #45150.

Release note (performance improvement): The cleanup job which runs after
a primary key change to remove old indexes, which blocks other schema
changes from running, now starts immediately after the primary key swap
is complete. This reduces the amount of waiting time before subsequent
schema changes can run.

47748: cli/interactive_tests: deflake test_demo_partitioning r=rohany a=knz

Fixes #47747

The test is meant to tolerate a failure in the licensing server at the
beginning.

This is handled correctly; however, the check does not work if the `spawn`
statement itself fails.

This patch corrects the flake by avoiding `spawn` and using a Bourne
shell to spawn the `cockroach` processes instead.

Release note: None

Co-authored-by: Lucy Zhang <lucy-zhang@users.noreply.github.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@thoszhang (Contributor)

#47624 improves the situation, but I think to really consider this problem solved, we should

  1. wait to return results to the client until dropping the indexes is completed, to make the user experience more predictable (a client-side workaround is sketched after this list); and
  2. reduce the amount of redundant waiting we do for schema change jobs (sql: schema changes can be very slow #47790 covers this).
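
Until (1) is implemented, a client that wants predictable behavior can block on the cleanup jobs explicitly. A hedged sketch, assuming SHOW JOBS WHEN COMPLETE (existing CockroachDB syntax that waits for the listed jobs to reach a terminal state) is available in the version in use:

```
-- Block until all currently pending/running schema change jobs,
-- including the index-drop cleanup job, have finished.
SHOW JOBS WHEN COMPLETE
  SELECT job_id FROM [SHOW JOBS]
  WHERE job_type = 'SCHEMA CHANGE' AND status IN ('pending', 'running');
```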

craig bot pushed a commit that referenced this issue May 9, 2020
48045: sql: enable indexing and ordering on arrays of orderable and indexable types r=rohany a=rohany

Fixes #17154.
Fixes #35707.

This PR enables arrays to be ordered and indexed by
introducing an ordered key encoding for arrays.
Once this exists, the rest of the SQL infrastructure
is ready to handle indexing and ordering on arrays.

To encode an array of elements `ARRAY[a, b]`,
we create the following encoding.

Let `AM` = a marker byte for arrays and let `AT` be a terminator byte.

`enc(ARRAY[a, b]) = [AM, enc(a), enc(b), AT]`

The key is that the terminator is less than the element marker.
This allows for the "prefix matching" style comparison that
arrays support.

Release note (sql change): This PR adds support for indexing
and ordering of arrays of indexable and orderable inner types.
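
As an illustration of the prefix-style comparison this encoding enables (a hypothetical example, not taken from the PR; it assumes an INT[] column can now be indexed and ordered on, which is what the change allows):

```
CREATE TABLE arrs (id INT PRIMARY KEY, a INT[]);
CREATE INDEX ON arrs (a);  -- arrays are now indexable
INSERT INTO arrs VALUES (1, ARRAY[1]), (2, ARRAY[1, 2]), (3, ARRAY[2]);
-- Returns ARRAY[1], ARRAY[1,2], ARRAY[2]: the terminator written after
-- enc(1) sorts before any element encoding, so a prefix array orders
-- immediately before its extensions.
SELECT a FROM arrs ORDER BY a;
```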


48608: schemachange: speed up slow schema changes r=spaskob a=spaskob

Touches #45150.
Fixes #47607.
Touches #47790.

Release note (performance improvement):
Before this change, a simple schema change could take 30s or more.
The reason was that if the schema change was not first in line in the
table mutation queue, it would return a retriable error and the jobs
framework would re-adopt and run it later. The problem is that the job
adoption loop runs only every 30s.

To repro run this for some time:
```
cockroach sql --insecure --watch 1s -e 'drop table if exists users cascade; create table users (id uuid not null, name varchar(255) not null, email varchar(255) not null, password varchar(255) not null, remember_token varchar(100) null, created_at timestamp(0) without time zone null, updated_at timestamp(0) without time zone null, deleted_at timestamp(0) without time zone null); alter table users add primary key (id); alter table users add constraint users_email_unique unique (email);'
```

Instead of returning on retriable errors, we now retry with exponential
backoff in the schema change code. This pattern of handling retriable
errors in client job code is encouraged over relying on the registry,
because the latter leads to slowness and to more complicated test
fixtures that hack at the internals of the job registry.

Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Spas Bojanov <pachob@gmail.com>
@spaskob (Contributor) commented May 9, 2020

I just merged #48608, which may further improve performance for this case.

@ajwerner (Contributor)

This all happens quite quickly now. I'm closing this.
