Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(language): proposal to remove all usage of schema to refer to hierarchy #8477

Closed
gforsyth opened this issue Feb 27, 2024 · 4 comments · Fixed by #8655
Closed

refactor(language): proposal to remove all usage of schema to refer to hierarchy #8477

gforsyth opened this issue Feb 27, 2024 · 4 comments · Fixed by #8655
Assignees
Labels
backends Issues related to all backends refactor Issues or PRs related to refactoring the codebase
Milestone

Comments

@gforsyth
Copy link
Member

Schema as a word in databases is fraught with dual meanings and is
generally horrible. We do not use it consistently in Ibis, even
sometimes using different meanings for the same method but on different
backends.

I propose a complete elimination of the hierarchical "schema" that Postgres and others have saddled us with.

Glossary

For the purposes of this issue and also moving forward in Ibis:

  • database: a collection of tables
  • catalog: a collection of databases
  • schema: a mapping of column names to dtypes and NOTHING ELSE

The database kwarg in Ibis

Post refactor, database can always be one of:

  • string of "database"
  • dotted string of "catalog.database" (although this might error if you
    pass a "catalog.database" to a backend that only has one level of
    hierarchy)
  • tuple of ("catalog", "database")
  • ibis namespace object

Places where we currently use hierarchical schema and removal proposal

Backend.list_tables

Current: list_tables that only takes schema as kwarg

  • mysql
  • postgres
  • oracle

Future:

These 3 are relatively easy, we can deprecate schema and have it warn
and add database

Current: list_tables that takes both database and schema as kwargs

  • trino
  • bigquery
  • duckdb
  • snowflake
  • mssql
  1. Current behavior:

    • user only passes schema, we assume current database
    • user only passes database, we assume current schema (duckdb,
      but this seems like a bug and no one would do this)
    • user only passes database, we assume information_schema as
      schema (mssql)
    • user only passes database, we error (trino, bigquery, snowflake)

Future:

  • if user only passes deprecated schema, warn, treat as new database

  • if user only passes database, treat as new database This is
    technically a hard break that we can't warn users about (in code).
    This would only impact (if anyone) users of the mssql backend and maybe DuckDB (but I doubt it)

  • if user passes both old database and old schema, warn, treat as
    "catalog.database"

Backend.table

Current: Backend.table that takes schema and database

  • Default SQL backend behavior
  • bigquery currently errors if database only

Future:

Deprecate schema, warn if schema is passed, make kw-only

In ibis 8, we were creating an ops.Namespace, passing that to
_get_sqla_table and only using the schema attribute of
ops.Namespace so I don't think anyone was using only database in a
functional way even if it wasn't explicitly erroring there.

(sqlalchemy only accepts a schema kwarg when defining a
sqlalchemy table, this is one of the reasons for the current mess.)

While this is a breaking change, I don't believe it will break anyone's
code without warning.

Current: Backend.table that takes schema as mapping

  • polars takes a _schema kwarg that is unused
  • pandas
  • dask

For dask and pandas you can use schema to override the mapping
schema used to create a table, sort of similar to using ibis.table?

Future:

Deprecate schema keyword, offer no replacement for this weird and
inconsistent functionality.

Possible: add the database kwarg for API consistency and no-op if it
gets used (or error?)

Backend.get_schema

This is new since TES and we can rename it to get_database (or
something else)

Backend.list_schemas

We deprecate this, point users at list_databases

Backend.list_databases

Current: was undefined or was returning nothing (because backend has no catalog support)

Future:

This will now return database And we add a catalog kwarg so you can
list databases in a given catalog

Current: returns catalog

Future:

This will just break. This is unfortunate, but I'm cautiously optimistic
that no one is using this programmatically.

Other changes and possibly additional steps in later versions

  • Add Backend.list_catalogs which behaves like the old version of
    list_databases

After we remove schema (next major version after deprecating),
we can consider adding an additional catalog kwarg to several of the
above methods. We will still continue to allow all the database
behaviors listed above and we add error handling for if someone
specifies catalog and also provides a dotted path as database (we
have this now for overspecifying database.schema).

@contang0
Copy link

Agreed! I find it confusing when the word "schema" is used in place of "catalog."

@cpcloud
Copy link
Member

cpcloud commented Feb 28, 2024

Solid proposal, +1 from me. IMO we should get this in for 9.0 and introduce the warnings in 9.0 and fully remove (and thus break) the API in 10.0.

This fits clearly under the backend API stability efforts for this year.

@gforsyth gforsyth added this to the 9.0 milestone Feb 28, 2024
@gforsyth gforsyth self-assigned this Mar 7, 2024
@gforsyth gforsyth added refactor Issues or PRs related to refactoring the codebase backends Issues related to all backends labels Mar 7, 2024
@gforsyth gforsyth moved this from backlog to cooking in Ibis planning and roadmap Mar 7, 2024
@NickCrews
Copy link
Contributor

This is a great change. I have been confused by this in the past.

detail: should we support table name specs in symmetry with your proposal for databases? eg

  • backend.table("mycatalog.mydatabase.mytable")
  • backend.table(("mycatalog", "mydatabase", "mytable"))
  • backend.table(("mydatabase", "mytable"), catalog="mycatalog")
  • etc

@gforsyth
Copy link
Member Author

Hey @NickCrews -- yes, I think we should. I'm almost (?) ready to put up a draft PR with a large number of deprecations and changes. Right now, to try to provide some warning and reasonable warnings, I'm largely "only" removing schema (and not also adding "catalog"). My target state is to add catalog, but the bookkeeping around possible argument combinations when we have one deprecated but supported kwarg and two other valid kwargs is horrible.
So my hope is to land a PR deprecating and/or removing all usage of hierarchical schema for 9.0, and then have catalog as an option everywhere that makes sense for 10.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backends Issues related to all backends refactor Issues or PRs related to refactoring the codebase
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants