Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[schema] Updating the tables schema #5449

Conversation

john-bodley
Copy link
Member

@john-bodley john-bodley commented Jul 20, 2018

We've noticed a number of anomalies in our database caused by ill-defined forms and/or table schema definitions. This PR resolves a number of issues related to the tables table including:

  • Modifies the uniqueness logic defined by PR Adding YAML Import-Export for Datasources to CLI #3978 (which was never created in a migration) to include the schema name (previously it was only based on the database/table). This is now handled by a SQLAlchemy event listener as many dialects permit multiple NULL values for columns than can contain NULL.
  • Removes the pre-add for checking for uniqueness. This is no longer required as the uniqueness constraint defines the same logic. This also only worked when adding a table and thus it was possible to have duplicates names if someone edited the record.
  • Adds form validation to ensure that the schema name is not part of the table name.
  • Ensures the table_name column is non-nullable.

Screen Shot 2019-03-22 at 5 09 58 PM

Note this migration will fail if the table_name column is NULL. One must manually fix these records as programmatically trying to remedy these invalid records is difficult as the intent is unclear and the tables may function (from a query standpoint) if SQL is provided. The following query determines which records are problematic:

SELECT 
    *
FROM 
   tables
WHERE 
   table_name IS NULL

Finally this migration will fail if the tables table is corrupt in terms of the uniqueness. One must manually consolidate duplicate or non-valid records given there's no way of programmatically removing invalid records. The following query determines whether there are duplicates:

SELECT 
    database_id,
    `schema`, 
    table_name,
    COUNT(1)
FROM 
    tables
GROUP BY 
    1, 2, 3
HAVING 
    COUNT(1) > 1
ORDER BY 
    COUNT(1) DESC

Note this PR is gated by #5445 and #7084 which ensure that empty strings associated with form-data wont persist in the database and is necessary for ensuring that the relevant entries are non-NULL.

to: @fabianmenges @graceguo-supercat @michellethomas @mistercrunch @timifasubaa

@john-bodley john-bodley force-pushed the john-bodley-tables-uniqueness-constraint branch 3 times, most recently from b0f8589 to 6de006c Compare July 21, 2018 06:00
@codecov-io
Copy link

codecov-io commented Jul 21, 2018

Codecov Report

Merging #5449 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5449      +/-   ##
==========================================
+ Coverage   59.11%   59.13%   +0.01%     
==========================================
  Files         372      372              
  Lines       23756    23754       -2     
  Branches     2758     2758              
==========================================
+ Hits        14044    14046       +2     
+ Misses       9697     9693       -4     
  Partials       15       15
Impacted Files Coverage Δ
superset/connectors/sqla/models.py 78.01% <100%> (ø) ⬆️
superset/connectors/sqla/views.py 73.78% <100%> (+3.31%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d5443e...08b8821. Read the comment docs.

@john-bodley john-bodley force-pushed the john-bodley-tables-uniqueness-constraint branch from 6de006c to 08b8821 Compare July 21, 2018 17:06

table_name = Column(String(250))
table_name = Column(String(127))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will cause problems. When you create new "Tables" in SQL Lab it will autogenerate very long table names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabianmenges unless the database or tab names are significantly long I suspect that this shouldn't generate very long table names.

@@ -243,17 +244,13 @@ class TableModelView(DatasourceModelView, DeleteMixin, YamlExportMixin): # noqa
'is_sqllab_view': _('SQL Lab View'),
'template_params': _('Template parameters'),
}
validators_columns = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about SQL Lab tables?

Copy link
Member Author

@john-bodley john-bodley Jul 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabianmenges could you be more specific? It potentially seems wrong to use a period in a datasource name. Based on the logic for auto-generating the SQL Lab table name it should only contain a period if the username, database name, or tab name contain a period.

@youngyjd
Copy link
Contributor

youngyjd commented Sep 27, 2018

quick question: when will this PR be merged? Saw this PR has been open for more than 2 months.

@john-bodley
Copy link
Member Author

@youngyjd there's still responses I'm waiting on and it's also gated by #5445.

@mistercrunch
Copy link
Member

I made the point in #6718 that we may want to allow multiple datasources that point to the same physical table in the database.

The concept of ownership effectively allows a user to "landgrab" a datasource and control it. It's also possible that an employe that has left the company is the owner of a datasource, then you have to ask an admin to grant you ownership of that table... Or if I want to add a metric to a table, I have to ask the owner to make me an owner as well. Also if a table has many owners, it's likely that one person will break someone else's chart or dashboard.

On the other hand, having many datasources around the rides or booking physical table is also confusing. Tradeoffs. I think I could be convinced either way...

Copy link

@agrawaldevesh agrawaldevesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at my PR #6718 which made me look at yours :-)


# Add the missing uniqueness constraint.
with op.batch_alter_table('tables', naming_convention=conv) as batch_op:
batch_op.create_unique_constraint(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that when I did this op.batch_alter_table alone: It didn't really remove the existing unique constraint. Second, the '.schema' on the table showed that a new unique constraint wasn't really created. Usually alembic creates a temp table and copies the old table to it but it didn't do so in this case. So I am curious if you actually did confirm that this unique constraint is effectively getting created for sqlite ?

Also, are you aware of this very old diff: 15b67b2 (from the days of caravel) that already actually adds the schema you are shooting for.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agrawaldevesh this may because that in some deployments there may have been a uniqueness constraint on the table_name. I'll update the migration logic.

batch_op.drop_constraint(
generic_find_uq_constraint_name(
'tables',
{'database_id', 'schema', 'table_name'},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the issues with this schema uniqueness is the pesky 'sql' column on the table: Effectively we would ideally like to have two different views of the same table, both with different sql but with the same <database, schema, table_name>.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue here is that if there were two different views of the same table how would one differentiate between them as datasources. Note in a database construct view names need to be unique.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean real 'SQL views'. I meant basically having the same table added twice, but with different 'sql' like so:

Table foo with sql: 'select foo.a, foo.b, count(1) from foo group by a, b'.
And another with Table foo with sql: 'select foo.*, some_function() from foo'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally they feel like different views and thus should be represented as different entities in the database table. If they're both called foo it's not apparent to the user which one the datasource is referring to and thus this would lead to confusion.

@john-bodley john-bodley added the risk:db-migration PRs that require a DB migration label Mar 22, 2019
@john-bodley john-bodley force-pushed the john-bodley-tables-uniqueness-constraint branch 4 times, most recently from c303533 to af3fcfe Compare March 25, 2019 18:10
@stale
Copy link

stale bot commented May 24, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label May 24, 2019
@stale stale bot closed this May 31, 2019
@john-bodley john-bodley reopened this May 31, 2019
@stale stale bot removed the inactive Inactive for >= 30 days label May 31, 2019
@mistercrunch mistercrunch added the .pinned Draws attention label Jun 1, 2019
@john-bodley
Copy link
Member Author

Sadly yes. I think we never resolved how to implement the ideal solution for the various Superset metadata databases.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

6 similar comments
@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

github-actions bot commented Apr 9, 2021

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

3 similar comments
@github-actions
Copy link
Contributor

github-actions bot commented May 3, 2021

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

github-actions bot commented Jun 8, 2021

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@junlincc junlincc closed this Jun 8, 2021
@john-bodley john-bodley reopened this Jun 11, 2021
@nytai
Copy link
Member

nytai commented Jun 11, 2021

i'm pretty surprised it hasn't fallen more out of sync :)

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

7 similar comments
@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Copy link
Contributor

⚠️ @john-bodley Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@john-bodley
Copy link
Member Author

Closing in favor of #15909.

@john-bodley john-bodley deleted the john-bodley-tables-uniqueness-constraint branch July 27, 2021 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
.pinned Draws attention risk:db-migration PRs that require a DB migration size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.