feat: add databricks support #9248

Open
1 task done
ruiyang2015 opened this issue May 24, 2024 · 8 comments · May be fixed by #10223

Comments

@ruiyang2015

Which new backend would you like to see in Ibis?

I would like to see databricks backend support.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@ruiyang2015 added the "feature" and "new backend" labels May 24, 2024
@cpcloud
Member

cpcloud commented May 29, 2024

@ruiyang2015 Can you clarify a bit what "databricks" means here? Is that databricks cloud, databricks connect, or something else?

@ruiyang2015
Author

> @ruiyang2015 Can you clarify a bit what "databricks" means here? Is that databricks cloud, databricks connect, or something else?

We use databricks connect and databricks SQL endpoints.

@cpcloud
Member

cpcloud commented Aug 2, 2024

@Kilo59 Thanks for the links! The databricks DB-API looks pretty solid. They even support fetching query results as arrow tables.
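
For readers following along, here is a minimal sketch of what fetching results as an Arrow table through the DB-API could look like with the databricks-sql-connector package. The environment variable names and the helper names `connect_from_env` and `fetch_arrow` are hypothetical, chosen only for illustration; this is not how an Ibis backend would necessarily be implemented.

```python
import os
from contextlib import closing


def connect_from_env():
    """Open a Databricks SQL connection from (hypothetical) environment variables."""
    # Imported lazily so the rest of the sketch works without the package installed.
    from databricks import sql

    return sql.connect(
        server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
        http_path=os.environ["DATABRICKS_HTTP_PATH"],
        access_token=os.environ["DATABRICKS_TOKEN"],
    )


def fetch_arrow(connection, query):
    """Run `query` on a DB-API connection and return the full result via fetchall_arrow()."""
    # closing() guarantees the cursor is closed even if execute() raises.
    with closing(connection.cursor()) as cursor:
        cursor.execute(query)
        return cursor.fetchall_arrow()
```

With real credentials in place, usage would be along the lines of `fetch_arrow(connect_from_env(), "SELECT 1 AS x")`, which returns a `pyarrow.Table`.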

@techdebtcreator

+1 for this since my company mainly uses Databricks SQL Warehouse via the databricks-sql-connector package. Unless I missed something, I'm currently only able to connect to Databricks clusters (through the Ibis PySpark backend in conjunction with the databricks-connect package).

@nrlugg

nrlugg commented Aug 26, 2024

For reference, there is also the databricks-sdk package, which could be used to query Databricks SQL Warehouses through its statement_execution submodule.

This package is particularly interesting because, if you use format=Format.ARROW_STREAM and disposition=Disposition.EXTERNAL_LINKS, it allows streaming chunks of serialized arrow tables (i.e., the arrow IPC format). This could potentially be used to process tables that are larger than memory, and the chunks could be read using async or multi-threading to stream the data faster.
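
A rough sketch of that chunked flow, under a few assumptions: the statement finishes within the default wait timeout (so the first response already carries a result), each external link exposes `external_link` and `next_chunk_index`, and the presigned URLs can be downloaded without extra headers. The helper names `iter_external_links`, `iter_record_batches`, and `stream_query` are hypothetical.

```python
import urllib.request


def iter_external_links(statement_execution, statement_id, first_result):
    """Yield the presigned URL of every result chunk, following next_chunk_index."""
    result = first_result
    while result is not None:
        links = result.external_links or []
        for link in links:
            yield link.external_link
        # Assumption: the last link of a response points at the next chunk, if any.
        next_index = links[-1].next_chunk_index if links else None
        if next_index is None:
            return
        result = statement_execution.get_statement_result_chunk_n(
            statement_id, next_index
        )


def iter_record_batches(url):
    """Download one chunk and yield its Arrow record batches."""
    import pyarrow.ipc as ipc  # lazy import: only needed when reading data

    with urllib.request.urlopen(url) as response:
        reader = ipc.open_stream(response.read())
    for batch in reader:
        yield batch


def stream_query(statement, warehouse_id):
    """Execute `statement` and yield record batches one chunk at a time."""
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.sql import Disposition, Format

    client = WorkspaceClient()  # picks up credentials from the environment
    response = client.statement_execution.execute_statement(
        statement=statement,
        warehouse_id=warehouse_id,
        format=Format.ARROW_STREAM,
        disposition=Disposition.EXTERNAL_LINKS,
    )
    # Assumption: response.result is populated because the statement already finished.
    for url in iter_external_links(
        client.statement_execution, response.statement_id, response.result
    ):
        yield from iter_record_batches(url)
```

Only one chunk is held in memory at a time here; the async or multi-threaded variant would fetch several URLs from `iter_external_links` concurrently instead.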

Also, the results of the executed query are stored temporarily in cloud storage, which means the URLs of executed query chunks could be cached and reused without having to execute the query again (if the query doesn't change).
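
To make the caching idea concrete, here is a small standard-library sketch. It keys on the exact SQL text and assumes the caller knows when the chunk URLs expire (results are only stored temporarily). `ChunkLinkCache`, `links_for`, and the `execute` callback are hypothetical names, not part of any Databricks package.

```python
import hashlib
import time


class ChunkLinkCache:
    """Hypothetical in-memory cache mapping SQL text to (chunk URLs, expiry time)."""

    def __init__(self, clock=time.time):
        self._entries = {}
        self._clock = clock  # injectable so expiry can be tested deterministically

    @staticmethod
    def _key(sql):
        # Key on the exact query text: any change to the query is a cache miss.
        return hashlib.sha256(sql.encode("utf-8")).hexdigest()

    def get(self, sql):
        key = self._key(sql)
        entry = self._entries.get(key)
        if entry is None:
            return None
        urls, expires_at = entry
        if self._clock() >= expires_at:
            # The temporary result files are assumed gone; drop the stale entry.
            del self._entries[key]
            return None
        return urls

    def put(self, sql, urls, expires_at):
        self._entries[self._key(sql)] = (list(urls), expires_at)


def links_for(sql, cache, execute):
    """Return cached chunk URLs, calling execute(sql) -> (urls, expires_at) on a miss."""
    urls = cache.get(sql)
    if urls is None:
        urls, expires_at = execute(sql)
        cache.put(sql, urls, expires_at)
    return urls
```

Note that a cache like this only helps if the underlying tables have not changed between calls; it says nothing about result freshness.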

Also also, pyarrow is not pinned to any particular version (unlike databricks-sql-connector, where it is), which makes dependency management less restrictive.

@hershelm

+1 would be great to see native support for databricks

anyone in the thread may find this blog post useful: https://posit.co/blog/databases-with-posit/

see the section on "Databricks" and the "Python" tab

@cpcloud cpcloud added this to the 10.0 milestone Sep 20, 2024
@cpcloud
Member

cpcloud commented Sep 20, 2024

We will be tackling this for the 10.0 release, stay tuned!

@cpcloud cpcloud self-assigned this Sep 20, 2024
@cpcloud cpcloud linked a pull request Sep 25, 2024 that will close this issue