Add ability to read-write to SQL databases. #4928
Conversation
The documentation is not available anymore as the PR was closed or merged.
Ah, CI runs with
Force-pushed from 56b4630 to 5747ad6
wow this is super cool!
Nice thank you !
Also feel free to add a section in the documentation about this amazing feature !
I think you can add it to loading.mdx, right after the Parquet section (rendered here):
datasets/docs/source/loading.mdx, lines 180 to 184 in 5389985:

```
### Parquet

Parquet files are stored in a columnar format, unlike row-based files like a CSV. Large datasets may be stored in a Parquet file because it is more efficient and faster at returning your query.

To load a Parquet file:
```
```python
sql_file_reader = pd.read_sql(
    f"SELECT * FROM `{self.config.table_name}`", conn, **self.config.read_sql_kwargs
)
```
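For context, the call above can be exercised end to end. Here is a minimal sketch against an in-memory SQLite database; the `data` table and its columns are illustrative stand-ins for `self.config.table_name`:

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory database to read from (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, text TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [(1, "foo"), (2, "bar")])
conn.commit()

# Mirrors the builder's call: read the whole table into a DataFrame.
table_name = "data"  # stands in for self.config.table_name
df = pd.read_sql(f"SELECT * FROM `{table_name}`", conn)
print(len(df))  # 2
```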
Do you know if this loads the full database into memory before returning chunks from it?
From what I see here it doesn't load everything: it moves the DB cursor forward by chunksize rows and yields the batch. So it should Just Work ™️.
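The chunked behavior pandas provides can be checked quickly (a sketch with illustrative table and row data, not the builder's actual code):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER)")
conn.executemany("INSERT INTO data VALUES (?)", [(i,) for i in range(10)])
conn.commit()

# With chunksize set, read_sql returns an iterator of DataFrames instead of
# one DataFrame; each step advances the DB cursor by `chunksize` rows.
chunks = pd.read_sql("SELECT * FROM data", conn, chunksize=4)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [4, 4, 2]
```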
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@lhoestq I'm getting error in integration tests, not sure if it's related to my PR. Any help would be appreciated :)
I just relaunched the tests, it should be fixed now
Thanks a lot for working on this! I have some concerns with the current design:
All this makes me think we shouldn't expose this builder as a packaged module and, instead, limit the API to
WDYT?
Hi @mariosasko, thank you for your review. I agree with your points. I will have time to work on this over the weekend. Please let me know what you think if I do the following:
Cheers!
Perhaps after we merge #4957 (Done!), you can subclass
@Dref360 I've made final changes/refinements to align the SQL API with Pandas/Dask. Let me know what you think.
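For reference, the pandas calls this API is aligned with can be sketched as an in-memory SQLite round trip (table and column names here are illustrative, not the datasets code itself):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Write: DataFrame.to_sql, the model for Dataset.to_sql.
df = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
df.to_sql("data", conn, index=False)

# Read back with a query, the model for Dataset.from_sql.
out = pd.read_sql("SELECT id, text FROM data WHERE id > 1", conn)
print(len(out))  # 2
```

Note that with a raw `sqlite3` connection only SQL queries work; passing a bare table name to `pd.read_sql` requires an SQLAlchemy connectable.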
Thank you so much! I was missing a lot of things, sorry about that.
Wow super impressive, thanks @Dref360 and @mariosasko !
It looks super nice, I just left some minor comments
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
I think we can merge if the tests pass. One last thing I would like to get your opinion on - currently, if SQLAlchemy is not installed, the missing dependency error will be thrown inside
Is sqlalchemy always required for pd.read_sql? If so, I think we can raise the error on our side.
In our case, it's always required as we only support database URIs.
Yes, it will remain optional for datasets but will be required for building the docs (as is
Ok I see! Sounds good :)
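The "raise the error on our side" idea can be sketched with a stdlib check that runs before any pandas call. The `require_sqlalchemy` name and the message text are hypothetical, not the actual datasets implementation:

```python
import importlib.util


def require_sqlalchemy() -> None:
    """Fail fast with a clear message if sqlalchemy is missing."""
    if importlib.util.find_spec("sqlalchemy") is None:
        raise ImportError(
            "Using a database URI requires sqlalchemy: pip install sqlalchemy"
        )


# Called at builder-construction time, before pd.read_sql ever runs,
# so users see a clear error instead of one raised deep inside pandas.
try:
    require_sqlalchemy()
    print("sqlalchemy available")
except ImportError as e:
    print(e)
```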
Thanks!
````diff
@@ -196,6 +196,24 @@ To load remote Parquet files via HTTP, pass the URLs instead:
 >>> wiki = load_dataset("parquet", data_files=data_files, split="train")
 ```
+
+### SQL
+
+Read database contents with [`Dataset.from_sql`]. Both table names and queries are supported.
````
Something like that?

```diff
-Read database contents with [`Dataset.from_sql`]. Both table names and queries are supported.
+Read database contents with [`Dataset.from_sql`]. Both table names and queries are supported.
+Requires [`sqlalchemy`](https://www.sqlalchemy.org/).
```
I decided to add a tip to the `from_sql` docstring instead, but thanks anyway :).
Fixes #3094
Add ability to read/write to SQLite files and also read from any SQL database supported by SQLAlchemy.
I didn't add SQLAlchemy as a dependency, as it is fairly big, and it remains optional.
I also recorded a Loom to showcase the feature.
https://www.loom.com/share/f0e602c2de8a46f58bca4b43333d541f