feat: add catalog & database support to create_table #9038
Comments
Resolves #9038. Adds support for specifying the `catalog` in various `pyspark` calls. BREAKING CHANGE: Arguments to `create_database`, `drop_database`, and `get_schema` are now keyword-only except for the `name` args. Calls to these functions that have relied on positional argument ordering need to be updated.
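For illustration, a minimal sketch of the new calling convention; the connection setup and the exact keyword names for `create_database`/`drop_database` beyond `name` are assumptions based on the description above, not verbatim from this PR:

```python
import ibis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
con = ibis.pyspark.connect(session=spark)

# Previously extra arguments could be passed positionally, e.g.
#   con.create_database("dart_extensions", "comms_media_dev")
# After this change everything other than `name` must be passed by keyword
# (the `catalog` keyword below is an assumption based on the PR description):
con.create_database("dart_extensions", catalog="comms_media_dev")
schema = con.get_schema(
    "raw_media_meas_campaign_info",
    catalog="comms_media_dev",
    database="dart_extensions",
)
con.drop_database("dart_extensions", catalog="comms_media_dev")
```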
@gforsyth I'm still trying to debug this myself to understand but wanted to post here as well. Let me know if I should open a new issue instead of posting on the closed one. The fix doesn't seem to be working on my end if I'm using it correctly. I double checked my Ibis version to make sure I'm on the right one, I installed w/
I tested w/ the following code and got an error that made me think it was trying to split a string into a tuple, but I had already provided a tuple:
So I tried providing the catalog and database as a string with a dot separator, and the error looks similar to the one I got when I opened the issue initially. It seems like it accepted my catalog argument, but dropped my database argument and substituted
Hey @mark-druffel -- I don't have multiple catalogs set up, so it's very possible I missed something. That first error you got is because the call needs to look like `ispark.create_table(name="raw_media_meas_campaign_info", obj=df, database=("comms_media_dev", "dart_extensions"), overwrite=True)`. That said, the second way should be equivalent to the tuple way, so something is a bit wrong. I will try to figure out what's going sideways.
That makes sense, I just tried that and the error looks the same as the second attempt from above. Please let me know if there's anything I can do to help, and thanks so much for your quick response!
Question that might seem a bit odd, but does the table show up in the appropriate place in spite of the error message? `ispark.list_tables(database=("comms_media_dev", "dart_extensions"))`
Ok, I think I have a fix for this. This is a horrible bit of bookkeeping. For context, this is what is happening: we set the catalog using a context manager, and we set the database also using a context manager. Currently what is happening to you is this weird edge case: we set the catalog to the new one, set the database to the new one, then we write the table, great! Now we try to change the catalog and database back in reverse order and... the database gets set back before the catalog does, which is where things fall over. So I think what we need to do instead is set the catalog first, on the way in and on the way out. It would be really great if Spark would allow for setting both of these values at the same time, but that is apparently not a thing.
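For anyone following along, here's a minimal sketch of the nested-context-manager bookkeeping described above (illustrative only, not the actual ibis implementation; catalog and database names are taken from this issue):

```python
from contextlib import contextmanager
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

@contextmanager
def active_catalog(catalog: str):
    # remember the current catalog, switch to the requested one, restore on exit
    prev = spark.catalog.currentCatalog()
    spark.catalog.setCurrentCatalog(catalog)
    try:
        yield
    finally:
        spark.catalog.setCurrentCatalog(prev)

@contextmanager
def active_database(db: str):
    # remember the current database, switch to the requested one, restore on exit
    prev = spark.catalog.currentDatabase()
    spark.catalog.setCurrentDatabase(db)
    try:
        yield
    finally:
        spark.catalog.setCurrentDatabase(prev)

# Nesting the two means teardown runs in reverse order: the database is
# restored first, while the *new* catalog is still active, and only then is
# the catalog switched back -- that's the edge case described above.
with active_catalog("comms_media_dev"):       # e.g. coming from "hive_metastore"
    with active_database("dart_extensions"):  # e.g. coming from "default"
        ...  # write the table here
```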
^ Sorry, disregard -- I didn't see your last comment pop through. Yea, Spark not allowing both at the same time is really annoying imho
If you want to try out that PR, @mark-druffel, that would be a huge help until I can get a much more complicated pyspark testing setup put together. |
Sorry for the delay, databricks takes forever to start... Now it says the schema can't be found, but I provided an
Tables: ['ibis_read_parquet_472gdhsajjakrgoq2mzf7ffz7u', 'ibis_read_parquet_73xgg7oaunet5oyv5rmderp7wa', 'ibis_read_parquet_dzbw5jngqngsxpg6ug7u266w2i', 'ibis_read_parquet_g2kop6usdncf3k67qgk4i7igpi', 'ibis_read_parquet_j6q3xnj7uzcg5ecfsmdty6l4xa', 'ibis_read_parquet_wifrr4hijbevvdhhlv5kivn2ey', 'ibis_read_parquet_xzhqoneiorfqhfiqdk7nqmpe4u', 'raw_media_meas_offer_info', 'raw_target_history', 'standardized_media_meas_campaign_info', 'standardized_media_meas_offer_info', 'standardized_target_history']
Current Catalog: hive_metastore
Current Database: default
[[SCHEMA_NOT_FOUND](https://docs.microsoft.com/azure/databricks/error-messages/error-classes#schema_not_found)] The schema `dart_extensions` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog.
To tolerate the error on drop use DROP SCHEMA IF EXISTS. SQLSTATE: 42704
File <command-4437199335976496>, line 10
8 print(f"Current Catalog: {ispark._session.catalog.currentCatalog()}\n")
9 print(f"Current Database: {ispark._session.catalog.currentDatabase()}\n")
---> 10 ispark.create_table(name = "raw_media_meas_campaign_info", obj = df, database=('comms_media_dev','dart_extensions'), overwrite=True, format = "delta")
11 print(f"Current Catalog: {ispark._session.catalog.currentCatalog()}\n")
12 print(f"Current Database: {ispark._session.catalog.currentDatabase()}\n")
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10d3656f-1fae-4528-917f-49d0869552d4/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:532, in Backend.create_table(self, name, obj, schema, database, temp, overwrite, format)
529 else:
530 raise com.IbisError("The schema or obj parameter is required")
--> 532 return self.table(name, database=db)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10d3656f-1fae-4528-917f-49d0869552d4/lib/python3.10/site-packages/ibis/backends/sql/__init__.py:137, in SQLBackend.table(self, name, schema, database)
134 catalog = table_loc.catalog or None
135 database = table_loc.db or None
--> 137 table_schema = self.get_schema(name, catalog=catalog, database=database)
138 return ops.DatabaseTable(
139 name,
140 schema=table_schema,
141 source=self,
142 namespace=ops.Namespace(catalog=catalog, database=database),
143 ).to_expr()
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10d3656f-1fae-4528-917f-49d0869552d4/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:459, in Backend.get_schema(self, table_name, catalog, database)
457 table_loc = self._to_sqlglot_table((catalog, database))
458 catalog, db = self._to_catalog_db_tuple(table_loc)
--> 459 with self._active_catalog_database(catalog, db):
460 df = self._session.table(table_name)
461 struct = PySparkType.to_ibis(df.schema)
File /usr/lib/python3.10/contextlib.py:135, in _GeneratorContextManager.__enter__(self)
133 del self.args, self.kwds, self.func
134 try:
--> 135 return next(self.gen)
136 except StopIteration:
137 raise RuntimeError("generator didn't yield") from None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-10d3656f-1fae-4528-917f-49d0869552d4/lib/python3.10/site-packages/ibis/backends/pyspark/__init__.py:254, in Backend._active_catalog_database(self, catalog, db)
252 if not PYSPARK_LT_34 and catalog is not None:
253 self._session.catalog.setCurrentCatalog(catalog)
--> 254 self._session.catalog.setCurrentDatabase(db)
255 yield
256 finally:
File /databricks/spark/python/pyspark/instrumentation_utils.py:47, in _wrap_function.<locals>.wrapper(*args, **kwargs)
45 start = time.perf_counter()
46 try:
---> 47 res = func(*args, **kwargs)
48 logger.log_success(
49 module_name, class_name, function_name, time.perf_counter() - start, signature
50 )
51 return res
File /databricks/spark/python/pyspark/sql/catalog.py:193, in Catalog.setCurrentDatabase(self, dbName)
183 def setCurrentDatabase(self, dbName: str) -> None:
184 """
185 Sets the current default database in this session.
186
(...)
191 >>> spark.catalog.setCurrentDatabase("default")
192 """
--> 193 return self._jcatalog.setCurrentDatabase(dbName)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
1349 command = proto.CALL_COMMAND_NAME +\
1350 self.command_header +\
1351 args_command +\
1352 proto.END_COMMAND_PART
1354 answer = self.gateway_client.send_command(command)
--> 1355 return_value = get_return_value(
1356 answer, self.gateway_client, self.target_id, self.name)
1358 for temp_arg in temp_args:
1359 if hasattr(temp_arg, "_detach"):
File /databricks/spark/python/pyspark/errors/exceptions/captured.py:230, in capture_sql_exception.<locals>.deco(*a, **kw)
226 converted = convert_exception(e.java_exception)
227 if not isinstance(converted, UnknownException):
228 # Hide where the exception came from that shows a non-Pythonic
229 # JVM exception message.
--> 230 raise converted from None
231 else:
232 raise
Hey @mark-druffel -- EDIT: let's continue over in #9067 where I'm trying to fix this
Hey @mark-druffel -- we've merged in my fixes from #9067 so hopefully |
What happened?
TLDR
I'm wondering if it's intended that the `database` argument in `create_table` works differently than the one in `drop_table`? `create_table` only accepts a `str` and `drop_table` accepts a `tuple`. If I set the catalog and database via pyspark, `create_table` works as expected, but I can't figure out a way to do so in my `create_table` call, so I had to do it through the pyspark session directly. I can drop a table without accessing the pyspark session; see the Additional Details below.
Additional Details
To drop my table I can just specify the catalog and database in my call:
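For illustration, the drop call looks something like this, with `ispark` being the ibis pyspark connection used throughout this issue (setup is a sketch, not the original code):

```python
import ibis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ispark = ibis.pyspark.connect(session=spark)

# Dropping works when the catalog and database are passed as a tuple:
ispark.drop_table(
    "raw_media_meas_campaign_info",
    database=("comms_media_dev", "dart_extensions"),
)
```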
Trying the same approach with `create_table` fails:
I also tried with dot separator:
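For illustration, the two failing calls look like this; the argument values are taken from the traceback quoted in the comments above, reusing the `spark`/`ispark` setup from the previous sketch, and `df` is a stand-in for the source dataframe:

```python
# Stand-in dataframe; the real `df` isn't shown in this issue.
df = spark.createDataFrame([("campaign_1", 1)], ["campaign_id", "value"])

# Attempt 1: catalog and database as a tuple -- errors after writing the table
ispark.create_table(
    name="raw_media_meas_campaign_info",
    obj=df,
    database=("comms_media_dev", "dart_extensions"),
    overwrite=True,
)

# Attempt 2: catalog and database as a dot-separated string -- should be
# equivalent, but fails in a similar way
ispark.create_table(
    name="raw_media_meas_campaign_info",
    obj=df,
    database="comms_media_dev.dart_extensions",
    overwrite=True,
)
```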
I then tried to set the catalog and provide the database name and got a permissions error. Looking through the error, it looks like `create_table` didn't pass the `database` argument because the database was set to default (i.e. `comms_media_dev.default`):
If I set the catalog and database via pyspark, `create_table` works as expected:
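For illustration, that workaround looks something like this (the exact original code isn't shown; it sets the catalog and database on the underlying Spark session before calling `create_table`):

```python
# Workaround: point the underlying Spark session at the right catalog and
# database first, then create the table without a `database` argument.
ispark._session.catalog.setCurrentCatalog("comms_media_dev")
ispark._session.catalog.setCurrentDatabase("dart_extensions")
ispark.create_table(
    name="raw_media_meas_campaign_info",
    obj=df,
    overwrite=True,
)
```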
What version of ibis are you using?
https://github.com/ibis-project/ibis.git@93552812ee9e8e0e3397bc226cc20c381fcd173b
What backend(s) are you using, if any?
pyspark
Relevant log output
No response
Code of Conduct