Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(sql): deprecate and remove hierarchical usage of schema #8655

Merged
merged 10 commits into from
Mar 25, 2024

Conversation

gforsyth
Copy link
Member

@gforsyth gforsyth commented Mar 14, 2024

No more schema

This is part of a larger refactor where we are removing all usage of the
word schema in its hierarchical sense.

Ibis will adhere to the following convention moving forward:

  • schema: a mapping of column names to datatypes
  • database: a collection of tables
  • catalog: a collection of databases

There are a bunch of commits here that I'm currently keeping separate so I can add fixup changes to the appropriate backend as needed.
I plan to squash all of these into their corresponding top-level commits before we rebase.
I've listed out the commit messages below for those top-level commits (each one corresponding to the changes to, e..g list_tables or create_schema, etc.).

refactor(list_tables): deprecate schema keyword

This change to list_tables is not a breaking change. schema is
deprecated and will warn on usage, but all existing code should remain
functional.

refactor(ddl): deprecate all *_schema methods

BREAKING CHANGE: We are removing the word schema in its hierarchical
sense. We use database to mean a collection of tables. The behavior of
all *_database methods now applies only to collections of tables and
never to collections of database (formerly schema)

  • CanListDatabases abstract methods now all refer to
    collections of tables.
  • CanCreateDatabases abstract methods now all refer to
    collections of tables.
  • list_databases now takes a kwarg catalog.
  • create_database now takes a kwarg catalog.
  • drop_database now takes a kwarg catalog.
  • current_database now refers to the current collection of tables.
  • CanCreateSchema is deprecated and create_schema, drop_schema,
    list_schemas, and current_schema are deprecated and redirect to the
    corresponding method/property ending in database.
  • We add a CanListCatalog and CanCreateCatalog that can list and
    create collections of database, respectively.
    The new methods are list_catalogs, create_catalog, drop_catalog,
  • There is a new current_catalog property.

refactor(get_schema): remove schema kwarg, add catalog, kw-only

This is not a breaking change as get_schema was introduced during
the-epic-split refactor and has not yet been released.

refactor(table): deprecate schema

schema kwarg is deprecated in all Backend.table usage. Wherever it
was present, we now have handling to warn if schema is passed, and to
encourage users to switch usage to the tuple of (catalog, database).

refactor(ops): use catalog and database kwargs for Namespace op

This is an internals only change, but changes database to catalog
and changes schema to database for our internal Namespace op.

refactor(table ddl): remove hierarchical schema from *_table methods

The schema kwarg here was introduced during the-epic-split, so not deprecating.

refactor(sql): deprecate schema in _view ddl methods

Deprecate the use of schema as a hierarchical term in all *_view DDL
methods.
Added a helper mixin to try to avoid continued repetition of the warning
code and the repeated parsing of the database + schema combinations.

refactor(sql): deprecate schema kwarg in insert

schema as a hierarchical term is deprecated in insert.
It will be removed Ibis 10.0.

refactor(ddl): deprecate schema keyword in truncate_table

A few followups

Not for this PR, which is already too big

  • I missed insert, need to do deprecate schema there, too. done
  • remove schema kwarg from connect and do_connect (although I'm not sure this is necessary)
  • make most/all kwargs for DDL methods kw-only -- I haven't done that here as it would introduce more breaking changes, but maybe before 10.0 when we also remove schema?
  • refactor some of the common code paths I've added here around handling various combinations of deprecated and non-deprecated arguments (I've started to do this but I haven't gone back and added it everywhere I added it.

Fixes #8477
Fixes #8380
Fixes #8491

xref: #8253
xref: #6821
xref: #6489

@gforsyth gforsyth changed the title No more schema refactor(sql): deprecate and remove hierarchical usage of schema Mar 14, 2024
Copy link
Contributor

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

ibis/util.py Outdated
Comment on lines 732 to 742
def _warn_schema():
warn_deprecated(
name="schema",
as_of="9.0",
removed_in="10.0",
instead="Use the `database` kwarg with one of the following patterns:"
'\ndatabase="database"'
'\ndatabase=("catalog", "database")'
'\ndatabase="catalog.database"',
# TODO: add option for namespace object
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to move this into the deprecation class helper mixin, but after I squash down everything else because that rebase might be a bit gnarly.

Comment on lines +265 to +270
dbs = "SHOW SCHEMAS"

with self._safe_raw_sql(dbs) as cur:
databases = list(map(itemgetter(0), cur))

return self._filter_with_like(databases, like)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risingwave doesn't support information_schema.schemata but this works.

Comment on lines +354 to +372
# Include temporary tables only if no database has been explicitly specified
# to avoid temp tables showing up in all calls to `list_tables`
if db == "public":
out += self._fetch_temp_tables()

return self._filter_with_like(map(itemgetter(0), out), like)

def list_databases(self, like=None) -> list[str]:
def _fetch_temp_tables(self):
# postgres temporary tables are stored in a separate schema
# so we need to independently grab them and return them along with
# the existing results

sql = (
sg.select("table_name")
.from_(sg.table("tables", db="information_schema"))
.distinct()
.where(C.table_type.eq("LOCAL TEMPORARY"))
.sql(self.dialect)
)

with self._safe_raw_sql(sql) as cur:
out = cur.fetchall()

return out
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both risingwave and postgres were returning all tables in all databases (formerly schema) when any list_tables call was made.

I fixed that to only return tables in self.current_database but that leaves out temp tables, so I added this helper to also return temp tables (which live in their own set of postgres-schema that have unpredictable names).

Comment on lines +315 to +317
raise ValueError(
"Using both the `schema` and `database` kwargs is not supported. "
"`schema` is deprecated and will be removed in Ibis 10.0"
"\nUse the `database` kwarg with one of the following patterns:"
'\ndatabase="database"'
'\ndatabase=("catalog", "database")'
'\ndatabase="catalog.database"',
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the odd duck out, as we had a schema kwarg in list_tables but no database kwarg, so this isn't a breaking change.

@gforsyth gforsyth force-pushed the no_more_schema branch 2 times, most recently from b55b64f to ab2d621 Compare March 14, 2024 18:48
@gforsyth gforsyth added the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Mar 14, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Mar 14, 2024
@gforsyth gforsyth added the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Mar 14, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Mar 14, 2024
@gforsyth gforsyth force-pushed the no_more_schema branch 3 times, most recently from 9d86b41 to 80e225f Compare March 15, 2024 19:51
@gforsyth
Copy link
Member Author

This is ready for a look (I think).

I've left the various backend-specific commits unsquashed for the moment as I think that might make it easier to review, but if folks would rather all the groups of changes were visible together I can squash it down.

Upon writing that, I think I'm going to squash it, should make reviewing the blocks of changes a bit easier.

This is part of a larger refactor where we are removing all usage of the
word `schema` in its hierarchical sense.

Ibis will adhere to the following convention moving forward:
* `schema`: a mapping of column names to datatypes
* `database`: a collection of tables
* `catalog`: a collection of databases

These terms are mapped accordingly onto the specific backend
terminology.

This change to `list_tables` is not a breaking change. `schema` is
deprecated and will warn on usage, but all existing code should remain
functional.

chore(schema): add mixin to help with schema deprecation warnings

refactor(trino): deprecate schema in list_tables

refactor(postgres): deprecate schema in list_tables

refactor(oracle): deprecate schema in list_tables

refactor(bigquery): deprecate schema in list_tables

refactor(mysql): deprecate schema in list_tables

refactor(duckdb): deprecate schema in list_tables

refactor(mssql): deprecate schema in list_tables

refactor(snowflake): deprecate schema in list_tables

docs(list_tables): add backend-specific docstrings for list_tables

fix(postgres): handle temp tables in list_tables
BREAKING CHANGE: We are removing the word `schema` in its hierarchical
sense. We use `database` to mean a collection of tables. The behavior of
all `*_database` methods now applies only to collections of tables and
never to collections of `database` (formerly `schema`)
* `CanListDatabases` abstract methods now all refer to
collections of tables.
* `CanCreateDatabases` abstract methods now all refer to
collections of tables.
* `list_databases` now takes a kwarg `catalog`.
* `create_database` now takes a kwarg `catalog`.
* `drop_database` now takes a kwarg `catalog`.
* `current_database` now refers to the current collection of tables.
* `CanCreateSchema` is deprecated and `create_schema`, `drop_schema`,
  `list_schemas`, and `current_schema` are deprecated and redirect to the
  corresponding method/property ending in `database`.
* We add a `CanListCatalog` and `CanCreateCatalog` that can list and
   create collections of `database`, respectively.
   The new methods are `list_catalogs`, `create_catalog`, `drop_catalog`,
* There is a new `current_catalog` property.

refactor(duckdb): catalog database switchover

refactor(snowflake): catalog database switchover

refactor(trino): catalog database switchover

refactor(postgres): catalog database switchover

refactor(oracle): catalog database switchover

refactor(mssql): catalog database switchover

refactor(bigquery): catalog database switchover

refactor(clickhouse): catalog database switchover

refactor(datafusion): catalog database switchover

refactor(mysql): catalog database switchover

refactor(exasol): catalog database switchover

refactor(risingwave): catalog database switchover

test(list_databases): add test for list_database contents
This is not a breaking change as `get_schema` was introduced during
the-epic-split refactor and has not yet been released.
`schema` kwarg is deprecated in all `Backend.table` usage. Wherever it
was present, we now have handling to warn if `schema` is passed, and to
encourage users to switch usage to the tuple of `(catalog, database)`.
This is an internals only change, but changes `database` to `catalog`
and changes `schema` to `database` for our internal `Namespace` op.
The `schema` kwarg here was introduced during the-epic-split, so not deprecating.
Deprecate the use of `schema` as a hierarchical term in all `*_view` DDL
methods.
Added a helper mixin to try to avoid continued repetition of the warning
code and the repeated parsing of the `database` + `schema` combinations.

refactor(pyspark): remove schema kwarg from *_view

This would be a breaking change but currently this functionality is
broken anyway.
In any case, PySpark only has support for a single level of table
hierarchy, so it should just be `database` as an option here.

refactor(sqlite): deprecate schema in `*_view` ddl

refactor(bigquery): deprecate schema in `*_view` ddl methods
`schema` as a hierarchical term is deprecated in `insert`.
It will be removed Ibis 10.0.

refactor(bigquery): deprecate `schema` kwarg in insert

refactor(snowflake): deprecate `schema` kwarg in insert
@@ -309,8 +345,31 @@ def list_tables(
with self._safe_raw_sql(sql) as cur:
out = cur.fetchall()

# Include temporary tables only if no database has been explicitly specified
# to avoid temp tables showing up in all calls to `list_tables`
if db == "public":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, this is a bit of an edge case.

On the one hand, temporary tables are always visible to the user who created them. On the other, the intent of asking for tables in a specific schema would seem to preclude showing them.

What you've done here seems fine to me.

Comment on lines +458 to +513
def _to_catalog_db_tuple(self, table_loc: sge.Table):
if table_loc is None or table_loc == (None, None):
return None, None

if (sg_cat := table_loc.args["catalog"]) is not None:
sg_cat.args["quoted"] = False
sg_cat = sg_cat.sql(self.name)
if (sg_db := table_loc.args["db"]) is not None:
sg_db.args["quoted"] = False
sg_db = sg_db.sql(self.name)

return sg_cat, sg_db

def _to_sqlglot_table(self, database):
if database is None:
return None
elif isinstance(database, tuple):
if len(database) > 2:
raise ValueError(
"Only database hierarchies of two or fewer levels are supported."
"\nYou can specify ('catalog', 'database')."
)
elif len(database) == 2:
catalog, database = database
elif len(database) == 1:
database = database[0]
catalog = None
else:
raise ValueError(
f"Malformed database tuple {database} provided"
"\nPlease specify one of:"
'\n("catalog", "database")'
'\n("database",)'
)
database = sg.exp.Table(
catalog=sg.to_identifier(catalog, quoted=self.compiler.quoted),
db=sg.to_identifier(database, quoted=self.compiler.quoted),
)
elif isinstance(database, str):
# There is no definition of a sqlglot catalog.database hierarchy outside
# of the standard table expression.
# sqlglot parsing of the string will assume that it's a Table
# so we unpack the arguments into a new sqlglot object, switching
# table (this) -> database (db) and database (db) -> catalog
table = sg.parse_one(database, into=sg.exp.Table, dialect=self.dialect)
if table.args["catalog"] is not None:
raise exc.IbisInputError(
f"Overspecified table hierarchy provided: `{table.sql(self.dialect)}`"
)
catalog = table.args["db"]
db = table.args["this"]
database = sg.exp.Table(catalog=catalog, db=db)
else:
raise ValueError("oops")

return database
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry you had to write this :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, me too, but I think we'll get some mileage out of it. This can also help with future usage patterns like con.table("catalog.database.table")

Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Epic work, thanks for tackling this.

@cpcloud cpcloud added this to the 9.0 milestone Mar 23, 2024
@cpcloud cpcloud added refactor Issues or PRs related to refactoring the codebase sql Backends that generate SQL labels Mar 23, 2024
@gforsyth gforsyth linked an issue Mar 25, 2024 that may be closed by this pull request
1 task
Copy link
Member

@kszucs kszucs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! :shipit:

@gforsyth gforsyth merged commit edde6a3 into ibis-project:main Mar 25, 2024
96 checks passed
@gforsyth gforsyth deleted the no_more_schema branch March 25, 2024 18:11
@gforsyth
Copy link
Member Author

xref #6331

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Issues or PRs related to refactoring the codebase sql Backends that generate SQL
Projects
None yet
3 participants