Add ability to read-write to SQL databases. #4928
Conversation
The documentation is not available anymore as the PR was closed or merged.
Ah, CI runs with
Force-pushed from 56b4630 to 5747ad6
wow this is super cool!
Nice thank you !
Also feel free to add a section in the documentation about this amazing feature !
I think you can add it to loading.mdx, right after the Parquet section (rendered here):
datasets/docs/source/loading.mdx, lines 180 to 184 in 5389985:

```
### Parquet

Parquet files are stored in a columnar format, unlike row-based files like a CSV. Large datasets may be stored in a Parquet file because it is more efficient and faster at returning your query.

To load a Parquet file:
```
```python
sql_file_reader = pd.read_sql(
    f"SELECT * FROM `{self.config.table_name}`", conn, **self.config.read_sql_kwargs
)
```
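For context, the call above can be exercised end to end. Here is a minimal sketch against an in-memory SQLite database; the `data` table and its columns are illustrative stand-ins for `self.config.table_name`:

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory database to read from (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, text TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [(1, "foo"), (2, "bar")])
conn.commit()

# Mirrors the builder's call: read the whole table into a DataFrame.
table_name = "data"  # stands in for self.config.table_name
df = pd.read_sql(f"SELECT * FROM `{table_name}`", conn)
print(len(df))  # 2
```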
Do you know if this loads the full database into memory before returning chunks from it?
From what I see here it doesn't load everything: it moves the DB cursor forward by chunksize rows and yields the batch. So it should Just Work ™️.
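The chunked behavior pandas provides can be checked quickly (a sketch with illustrative table and row data, not the builder's actual code):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER)")
conn.executemany("INSERT INTO data VALUES (?)", [(i,) for i in range(10)])
conn.commit()

# With chunksize set, read_sql returns an iterator of DataFrames instead of
# one DataFrame; each step advances the DB cursor by `chunksize` rows.
chunks = pd.read_sql("SELECT * FROM data", conn, chunksize=4)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [4, 4, 2]
```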
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@lhoestq I'm getting error in integration tests, not sure if it's related to my PR. Any help would be appreciated :)
I just relaunched the tests, it should be fixed now
Thanks a lot for working on this! I have some concerns with the current design:
All this makes me think we shouldn't expose this builder as a packaged module and, instead, limit the API to
WDYT?
Hi @mariosasko, thank you for your review. I agree with your points. I will have time to work on this over the weekend. Please let me know what you think if I do the following:
Cheers!
Perhaps after we merge #4957 (Done!), you can subclass
@Dref360 I've made final changes/refinements to align the SQL API with Pandas/Dask. Let me know what you think.
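For reference, the pandas calls this API is aligned with can be sketched as an in-memory SQLite round trip (table and column names here are illustrative, not the datasets code itself):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Write: DataFrame.to_sql, the model for Dataset.to_sql.
df = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
df.to_sql("data", conn, index=False)

# Read back with a query, the model for Dataset.from_sql.
out = pd.read_sql("SELECT id, text FROM data WHERE id > 1", conn)
print(len(out))  # 2
```

Note that with a raw `sqlite3` connection only SQL queries work; passing a bare table name to `pd.read_sql` requires an SQLAlchemy connectable.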
Thank you so much! I was missing a lot of things, sorry about that.
Wow super impressive, thanks @Dref360 and @mariosasko !
It looks super nice, I just left some minor comments
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
I think we can merge if the tests pass. One last thing I would like to get your opinion on - currently, if SQLAlchemy is not installed, the missing dependency error will be thrown inside
Is sqlalchemy always required for pd.read_sql? If so, I think we can raise the error on our side.
In our case, it's always required as we only support database URIs.
Yes, it will remain optional for datasets but will be required for building the docs (as is
Ok I see! Sounds good :)
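The "raise the error on our side" idea can be sketched with a stdlib check that runs before any pandas call. The `require_sqlalchemy` name and the message text are hypothetical, not the actual datasets implementation:

```python
import importlib.util


def require_sqlalchemy() -> None:
    """Fail fast with a clear message if sqlalchemy is missing."""
    if importlib.util.find_spec("sqlalchemy") is None:
        raise ImportError(
            "Using a database URI requires sqlalchemy: pip install sqlalchemy"
        )


# Called at builder-construction time, before pd.read_sql ever runs,
# so users see a clear error instead of one raised deep inside pandas.
try:
    require_sqlalchemy()
    print("sqlalchemy available")
except ImportError as e:
    print(e)
```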
Thanks!
````diff
@@ -196,6 +196,24 @@ To load remote Parquet files via HTTP, pass the URLs instead:
 >>> wiki = load_dataset("parquet", data_files=data_files, split="train")
 ```
+
+### SQL
+
+Read database contents with [`Dataset.from_sql`]. Both table names and queries are supported.
````
Something like that?

```diff
-Read database contents with [`Dataset.from_sql`]. Both table names and queries are supported.
+Read database contents with [`Dataset.from_sql`]. Both table names and queries are supported.
+Requires [`sqlalchemy`](https://www.sqlalchemy.org/).
```
I decided to add a tip to the `from_sql` docstring instead, but thanks anyway :).
Fixes #3094
Add ability to read/write to SQLite files and also read from any SQL database supported by SQLAlchemy.
I didn't add SQLAlchemy as a dependency, as it is fairly big, and it remains optional.
I also recorded a Loom to showcase the feature.
https://www.loom.com/share/f0e602c2de8a46f58bca4b43333d541f