Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(duckdb): fix create_table() in databases with spaces in the name #9817

Merged
merged 1 commit into from
Aug 26, 2024

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Aug 11, 2024

If you tried

import ibis

conn = ibis.duckdb.connect("with space.duckdb")
conn.create_table("foo", {"a": [0,1,2]})

then you would get the DDL CREATE TABLE with space.main."foo" ("a" BIGINT). Now, the database name is quoted.

I went through the create_table() method in all the backends and looked for where we forgot to use "quoted", but I almost definitely missed some cases.

I need to add a test for this, but I would love your help with this:

  • should it add/modify a create_table() test to check for a database with spaces in the name?
  • should it go somewhere near other tests that check for spaces? I think we have those?
  • or should we go larger, and modify the con fixture to sometimes be a database with spaces in the name?

@gforsyth
Copy link
Member

There's always things to fix involving quoting, but I think this should be scoped as per the title, to connecting to database files that have spaces in the name.

I don't think the clickhouse change is relevant here. This is going to have to be a backend-specific test, I think, since creating database files is going to be inherently backend specific.

@NickCrews
Copy link
Contributor Author

NickCrews commented Aug 19, 2024

@gforsyth thanks for the reply. Sorry, I think I need a little more handholding :) Can you point me to which files the test should be added to?

I'm thinking something like

@pytest.mark.paramaterize(
    "database",
    [
        "with spaces",
        ("spaced catalog", "spaced database"),
        ("spaced catalog.spaced database"),
    ]
)
@pytest.mark.never(
    ["clickhouse", ], reason="doesn't have single database files"
)
def test_create_table_quoting(conn, database):
    t = conn.create_table("foo", {"a": [0,1,2]}, database=database)
    result = set(conn.execute(t.foo))
    assert result == {1,2,3}

?

Your clickhouse comment went over my head, I'm not that familiar with SQL engines so the distinction between a "database file", "catalog", and "database" are fuzzy for me.

@gforsyth
Copy link
Member

gforsyth commented Aug 19, 2024

Hey @NickCrews -- sure, it's fuzzy for me, too, because there's limited consensus on what any of these words mean. Here goes:

database: a collection of abstract table objects (Postgres and others call this schema, which is a terrible idea)
catalog: a collection of database

database file: this one is even less standard, but I meant for backends like duckdb and sqlite, which have a single on-disk file that stores all of the data.

Now, for the fix you're proposing database file only applies to DuckDB, since DuckDB uses the name of the file (up until the first . as the name of the default catalog)

[ins] In [14]: con = ibis.duckdb.connect("test.db")

[ins] In [15]: con.list_catalogs()
Out[15]: ['system', 'temp', 'test']

[ins] In [16]: con = ibis.duckdb.connect("test.ddb.db")

[ins] In [17]: con.list_catalogs()
Out[17]: ['system', 'temp', 'test']

In some backends (uhh, datafusion, mssql, trino (probably?)) you could test this by first creating a catalog (with a space in the name), then creating a database in that catalog, then creating a table, ensuring you can grab a reference to the table, then cleaning it all up.

But you can't do that with DuckDB because you can't create a catalog from an existing session (well, there might be a way, but let's leave that alone), because it's determined by the file name (whether that file exists already or not).

So you can't use the con or the backend fixture, because it's going to give you a DuckDB connection that's already set up and already pointing at :memory:.

Instead, I recommend putting a test in ibis/backends/duckdb/tests/test_client.py that only parametrizes over the filenames you want to test (which will translate to catalog names), and then make sure you can create a table.

So something like

@pytest.mark.paramaterize(
    "database_file",
    [
        "with spaces.ddb",
        "space catalog.duckdb.db",
    ]
)
def test_create_table_quoting(database_file):
    conn = ibis.duckdb.connect(database_file)
    t = conn.create_table("foo", {"a": [0,1,2]})
    result = set(conn.execute(t.foo))
    assert result == {1,2,3}

That is going to leave a file on-disk, though, so you should probably use the pytest tmpdir fixture to make sure they're being created in an ephemeral place.

And to my point about Clickhouse, this test can only run on DuckDB, because it's the only one that uses the file-name to generate the catalog name.

If this is a problem with other backends, there are few-enough backends that allow creating catalogs at all that they are probably better handled individually, with backend-specific tests.

…ces in the name

If you tried

```python
import ibis

conn = ibis.duckdb.connect("with space.duckdb")
conn.create_table("foo", {"a": [0,1,2]})
```

then you would get the DDL `CREATE TABLE with space.main."foo" ("a" BIGINT)`. Now, the database name is quoted.
@NickCrews
Copy link
Contributor Author

got it, thanks @gforsyth , that makes sense. The duckdb test was quite easy to implement. Ibis's terminology for the namespace hierarchy is unoverloaded and quite clear, I think it is good.

I left in the clickhouse implementation fix, even though I didn't add a test for it. Let me know if you want me to remove it (though that seems silly to me, this new version looks strictly better and I highly doubt is hiding anything)

Side note: I don't like how I only added this test to duckdb, but I wouldn't want each test_client.py in each backend to implement it separately either. Too easy for the semantics to drift between them. Ideally there was some central test_catalog_can_have_spaces(), where you could see all the backends that we had marked as notimpl, never, etc, and then this dispatched out to individual backends to actually implement the test.

@gforsyth
Copy link
Member

Side note: I don't like how I only added this test to duckdb, but I wouldn't want each test_client.py in each backend to implement it separately either. Too easy for the semantics to drift between them. Ideally there was some central test_catalog_can_have_spaces(), where you could see all the backends that we had marked as notimpl, never, etc, and then this dispatched out to individual backends to actually implement the test.

It has to live in the duckdb-specific test file to test DuckDB -- the others could maybe be handled by creating a catalog in the schema/*.sql files, but some things are just inherently backend-specific and you're going to lose your mind if you try to create an abstraction that covers both (e.g.) adding an additional Iceberg catalog to PySpark and creating a file on-disk for DuckDB.

Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@gforsyth gforsyth merged commit 9da3c9f into ibis-project:main Aug 26, 2024
81 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants