check if table exist #406

Closed
djouallah opened this issue Feb 10, 2024 · 6 comments · Fixed by #415

@djouallah

Feature Request / Improvement

It would be nice to have an API to quickly check whether a table exists, or alternatively to create a table only if it does not already exist.
Currently I am doing this:

try:
    table = catalog.create_table("aemo.price", schema=price.schema)
except:
    pass
table = catalog.load_table("aemo.price")
table.append(price)

@Fokko
Contributor

Fokko commented Feb 13, 2024

I'm hesitant about adding this. I'd rather add the CREATE OR REPLACE semantic.

The following logic will avoid fetching the table when not needed:

try:
    table = catalog.create_table("aemo.price",schema=price.schema)
except:
    table = catalog.load_table("aemo.price")
table.append(price)

I think in your case a CREATE OR REPLACE is more feasible. Otherwise you might append the data twice, right?

@djouallah
Author

djouallah commented Feb 13, 2024

I don't want to replace the table. If the table exists, leave it as it is; otherwise create a new one, but never replace it.

Actually, this is what I use with Delta:

if spark.catalog.tableExists("scada"):

@Fokko
Contributor

Fokko commented Feb 13, 2024

I like the table_exists method 👍

@sungwy
Collaborator

sungwy commented Feb 13, 2024

I think the table_exists function that @djouallah proposed and the PR that @hussein-awala is working on to support CREATE TABLE IF NOT EXISTS serve different purposes, and I think we should support both in PyIceberg:

table_exists:

  • important if we just want to check that a table exists in a namespace. I'd argue this is the same as calling list_tables and checking if the table exists in the returned list, and hence isn't as critical to implement as 'CREATE TABLE IF NOT EXISTS'
  • It is, however, very simple to implement, so we could just support it

CREATE TABLE IF NOT EXISTS

  • allows users to deploy an idempotent table creation statement into Production, so that the same code can be run to first create a table, and then ignore the creation of the table henceforth without requiring a code change.
  • This semantic is different from running table_exists and then invoking create_table sequentially, because CREATE TABLE IF NOT EXISTS is a single call to the catalog. In table_exists + create_table, the two calls are made separately, so a concurrent process could create the table in between them. create_table would then fail even though table_exists returned False for the calling process.
  • An alternative is just to ask users to try and catch TableAlreadyExistsError in their code when calling create_table
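
The race described in the second bullet can be reproduced in a few lines. This is only a sketch: ToyCatalog and the locally defined TableAlreadyExistsError are hypothetical stand-ins for a real catalog and its error type, and the concurrent writer is simulated by a direct call between the check and the create.

```python
class TableAlreadyExistsError(Exception):
    """Local stand-in for the catalog's 'already exists' error."""


class ToyCatalog:
    """Hypothetical in-memory catalog, used only for illustration."""

    def __init__(self):
        self._tables = {}

    def table_exists(self, identifier):
        return identifier in self._tables

    def create_table(self, identifier, schema):
        if identifier in self._tables:
            raise TableAlreadyExistsError(identifier)
        self._tables[identifier] = {"schema": schema}
        return self._tables[identifier]


catalog = ToyCatalog()
schema = ["settlement_date", "price"]  # placeholder schema

# Step 1: this process checks and sees no table.
seen_before_create = catalog.table_exists("aemo.price")

# Step 2: a concurrent process creates the table in the meantime
# (simulated here by a direct call).
catalog.create_table("aemo.price", schema)

# Step 3: this process acts on its now-stale check, and the create fails.
try:
    catalog.create_table("aemo.price", schema)
    create_failed = False
except TableAlreadyExistsError:
    create_failed = True

print(seen_before_create, create_failed)
```

The check returned False, yet the create still failed, which is why a check followed by a create is not equivalent to a single idempotent call.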

@Gowthami03B
Contributor

Can I take a stab at the table_exists method proposed here? @Fokko @djouallah @syun64

@Fokko
Contributor

Fokko commented Feb 16, 2024

@Gowthami03B That would be great! 👍

One note here:

> important if we just want to check that a table exists in a namespace. I'd argue this is the same as calling list_tables and checking if the table exists in the returned list, and hence isn't as critical to implement as 'CREATE TABLE IF NOT EXISTS'

I think it would make more sense to do an actual load_table instead of calling list_tables, mostly because there is a discussion on the REST spec to add pagination. Calling list_tables would then result in many consecutive requests to build up the list, which is not very performant. With load_table we do load the table metadata, but I think that's okay.
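
A minimal sketch of that approach, assuming the catalog raises a "no such table" error when the load fails. The NoSuchTableError class and ToyCatalog below are local stand-ins written for this example, and table_exists is a hypothetical free function, not an existing API:

```python
class NoSuchTableError(Exception):
    """Local stand-in for the catalog's 'table not found' error."""


class ToyCatalog:
    """Hypothetical in-memory catalog, used only for illustration."""

    def __init__(self, tables):
        self._tables = dict(tables)

    def load_table(self, identifier):
        try:
            return self._tables[identifier]
        except KeyError:
            raise NoSuchTableError(identifier) from None


def table_exists(catalog, identifier):
    """Return True if a single load_table call succeeds, False if the table is missing."""
    try:
        catalog.load_table(identifier)
    except NoSuchTableError:
        return False
    return True


catalog = ToyCatalog({"aemo.price": {"schema": ["settlement_date", "price"]}})
print(table_exists(catalog, "aemo.price"))
print(table_exists(catalog, "aemo.missing"))
```

Only the "not found" error is caught, so other failures (for example a connection problem) still propagate instead of being reported as a missing table.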

Thanks @syun64 for summarizing the options.

> An alternative is just to ask users to try and catch TableAlreadyExistsError in their code when calling create_table

I believe this is the most Pythonic way of doing it, but I agree that we could mirror the SQL CREATE TABLE IF NOT EXISTS. How about adding this to the Catalog(ABC) itself? That way we don't have to add this logic to each of the implementations:

catalog = load_catalog('default')
catalog.create_table_if_not_exists('schema.table', schema=...)

Thoughts?
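
One way the base-class idea could look, as a sketch only. The Catalog class here is a simplified stand-in rather than the real PyIceberg class, DictCatalog is a hypothetical implementation, and the error type is defined locally for the example:

```python
from abc import ABC, abstractmethod


class TableAlreadyExistsError(Exception):
    """Local stand-in for the catalog's 'already exists' error."""


class Catalog(ABC):
    """Simplified stand-in for an abstract catalog base class."""

    @abstractmethod
    def create_table(self, identifier, schema):
        ...

    @abstractmethod
    def load_table(self, identifier):
        ...

    def create_table_if_not_exists(self, identifier, schema):
        # Implemented once on the base class, so every concrete
        # catalog inherits it without extra code.
        try:
            return self.create_table(identifier, schema)
        except TableAlreadyExistsError:
            return self.load_table(identifier)


class DictCatalog(Catalog):
    """Hypothetical concrete catalog backed by a dict."""

    def __init__(self):
        self._tables = {}

    def create_table(self, identifier, schema):
        if identifier in self._tables:
            raise TableAlreadyExistsError(identifier)
        self._tables[identifier] = {"schema": schema}
        return self._tables[identifier]

    def load_table(self, identifier):
        return self._tables[identifier]


catalog = DictCatalog()
created = catalog.create_table_if_not_exists("schema.table", schema=["id"])
again = catalog.create_table_if_not_exists("schema.table", schema=["id"])
print(created is again)
```

The second call falls back to load_table and returns the existing table, so running the same code repeatedly is safe.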

4 participants