
feat: support read_parquet for backend with no native support #9744

Open · wants to merge 31 commits into main

Conversation

jitingxu1 (Contributor)

Description of changes

Support read_parquet for backends that lack their own native implementation (DuckDB, for example, already has one). This implementation leverages the PyArrow read_table function.

If a backend does not have its own version, it will fall back on this pyarrow implementation.
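
Roughly, the fallback behaves like the sketch below (a minimal sketch for illustration, not the exact code in the PR; create_table, table, and util.gen_name are existing backend helpers):

import pyarrow.parquet as pq

from ibis import util

@util.experimental
def read_parquet(self, path, table_name=None, **kwargs):
    # Read the Parquet data into a pyarrow.Table, then register it with the backend.
    table = pq.read_table(path, **kwargs)
    table_name = table_name or util.gen_name("read_parquet")
    self.create_table(table_name, table)
    return self.table(table_name)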

Issues closed

This addresses part of issue #9448; the remaining tasks for that issue will be completed and submitted separately.

cpcloud (Member) commented Aug 1, 2024:

@jitingxu1 This PR has a lot of failures. Can you take a look so we can decide how to move forward?

table = con.read_parquet(tmp_path / f"*.{ext}")
if con.name == "clickhouse":
    # clickhouse does not support read directory
    table = con.read_parquet(tmp_path / f"*.{ext}")
Member:

This doesn't seem like the right approach. You're changing what's being tested. Why can't you leave this code unchanged here?

jitingxu1 (Contributor Author) commented Aug 2, 2024:

The pyarrow read_table cannot read things like tmp_path/*.parquet, and clickhouse cannot read a directory.

We have three kinds of read_parquet:

  • Backends using pyarrow read_table can read a single file or a directory, but not a glob pattern without specifying the filesystem.
  • DuckDB and some others accept all three formats.
  • ClickHouse does not accept a directory.

Maybe we could add a step before read_table to convert path/*.parquet --> path, so then it accepts all three.
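
For instance, something along these lines before calling read_table (a hypothetical pre-processing step, not code from the PR):

from pathlib import Path

# Hypothetical: collapse a glob like path/*.parquet back to its parent directory,
# which pyarrow's read_table can read without a filesystem object.
if "*" in path:
    path = str(Path(path).parent)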

Member:

The test is whether the backend can read a glob of parquet files; the answer to that seems to be "no", so it should be marked as notyet.
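
For reference, marking it might look something like this (a hedged sketch; pytest.mark.notyet is the marker used in the Ibis test suite, and the test name and reason text here are illustrative):

import pytest

@pytest.mark.notyet(
    ["clickhouse"],
    reason="clickhouse cannot read a glob of parquet files",
)
def test_read_parquet_glob(con, tmp_path, ext):
    table = con.read_parquet(tmp_path / f"*.{ext}")
    assert table.count().execute() > 0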

ibis/backends/tests/test_register.py (outdated review thread, resolved)
jitingxu1 (Contributor Author):

Rewrote it to support URLs as input:

  • regular URLs: https, ftp, and so on
  • fsspec-compatible URLs: s3, gcs
  • local files
    • supports: single file, directory, glob patterns

@cpcloud, ready for review again. Thanks!

ibis/backends/__init__.py (outdated review thread, resolved)
jitingxu1 requested a review from cpcloud, August 21, 2024 16:43
jitingxu1 (Contributor Author):

Increased the test coverage. @cpcloud

ibis/backends/__init__.py (outdated review thread, resolved)
        self.create_table(table_name, table)
        return self.table(table_name)

    def _get_pyarrow_table_from_path(self, path: str | Path, **kwargs) -> pa.Table:
cpcloud (Member) commented Aug 23, 2024:

Why can't the implementation of this just be:

return pq.read_table(path, **kwargs)

Did you try that already?

jitingxu1 (Contributor Author):

I tried that in my first commit; it cannot handle all the cases, such as glob patterns and Parquet files hosted on some URIs (e.g. HTTPS, SFTP).

PyArrow natively implements the following filesystem subclasses:

  • Local FS (LocalFileSystem)
  • S3 (S3FileSystem)
  • Google Cloud Storage File System (GcsFileSystem)
  • Hadoop Distributed File System (HDFS) (HadoopFileSystem)

jitingxu1 (Contributor Author):

@cpcloud does this make sense to you?

jitingxu1 requested a review from cpcloud, August 27, 2024 00:09
ibis/backends/__init__.py (outdated review thread, resolved)
jitingxu1 (Contributor Author) commented Sep 19, 2024:

Hi @cpcloud,

I got several timeout errors in the CI for this PR. Is there something we need to fix in another PR?

FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_74[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_72[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_94[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_92[trino]@tpcds - Failed: Timeout >90.0s
FAILED ibis/backends/tests/tpc/ds/test_queries.py::test_07[trino]@tpcds - Failed: Timeout >90.0s

Is this related to the Trino setup in the CI?

github-actions bot added the tests label (Issues or PRs related to tests), Sep 20, 2024
jitingxu1 (Contributor Author):

Hi @cpcloud

In this PR, the Trino/Impala read_parquet test reads about 7300 rows from functional_alltypes.parquet (it seems that inserting more than 10k rows causes a performance issue). I suspect this impacts Trino database performance and causes the timeout errors from my previous comment in other tests, since test_read_parquet reads a large Parquet file. Does this make sense?

Should I skip Trino and Impala in this test too, or do you have a better way to handle this?


gforsyth (Member):

I'm going to try something here to see if I can isolate which test is leaving us in a (sometimes) broken state only on the nix osx runs.

gforsyth (Member):

Ok, that's one pass for the nix mac osx job. I'm going to cycle it a few times to make sure.

gforsyth (Member) left a comment:

Most of my comments are inline -- I think we should start off offering this with a simplified implementation. It's nice to offer users options, but I think we should be measured in how much we enable directly.

Comment on lines 1297 to 1299
When reading data from cloud storage (such as Amazon S3 or Google Cloud Storage),
credentials can be provided via the `filesystem` argument by creating an appropriate
filesystem object (e.g., `pyarrow.fs.S3FileSystem`).
Member:

This is true; additionally, pq.read_table also supports the standard AWS auth patterns (environment variables, AWS SSO credentials, or instance credentials).
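
For example (a sketch using the public pyarrow.fs API; the bucket, key, and credential values are placeholders):

import pyarrow.fs as pafs
import pyarrow.parquet as pq

# Explicit credentials via a filesystem object passed to read_table.
s3 = pafs.S3FileSystem(access_key="...", secret_key="...", region="us-east-1")
table = pq.read_table("my-bucket/path/to/data.parquet", filesystem=s3)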

Comment on lines 1371 to 1395
import pyarrow.parquet as pq

path = str(path)
# handle url
if util.is_url(path):
    import fsspec

    credentials = kwargs.pop("credentials", {})
    with fsspec.open(path, **credentials) as f:
        with BytesIO(f.read()) as reader:
            return pq.read_table(reader)

# handle fsspec compatible url
if util.is_fsspec_url(path):
    return pq.read_table(path, **kwargs)

# Handle local file paths or patterns
paths = glob.glob(path)
if not paths:
    raise ValueError(f"No files found matching pattern: {path!r}")
elif len(paths) == 1:
    paths = paths[0]

return pq.read_table(paths, **kwargs)

Member:

I think we should reconsider handling all of these cases -- this sort of branching logic means that when a user reports an error, we'll have any number of possible culprits to consider, and it makes it harder to debug for everyone.

I think (and I could be wrong) that nearly all of these cases are covered by pq.read_table by itself, and that's much easier to document and debug.

read_table also has support for being passed an fsspec object, so if someone needs to read from a hypertext url, they can use fsspec as a shim for that. (This is something we can add a note about in the docstring).
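
Something along these lines works as the shim (illustrative only; the URL is a placeholder):

import fsspec
import pyarrow.parquet as pq

# fsspec handles protocols pyarrow's filesystems don't cover natively (e.g. HTTP),
# and read_table accepts the resulting file-like object.
with fsspec.open("https://example.com/data.parquet") as f:
    table = pq.read_table(f)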

jitingxu1 (Contributor Author):

pq.read_table can handle most of the cases. I will simplify the logic and see how many cases are covered. Thanks for your suggestion.

path = str(path)
# handle url
if util.is_url(path):
    import fsspec
Member:

fsspec is not a dependency of Ibis (it's in our test suite) so this would need extra import handling if we leave it in (but see my other comments)

gforsyth (Member):


Ok, skipping the mocked URL test on DuckDB seems to have resolved the nested transaction failures on the nix osx CI job

jitingxu1 (Contributor Author):


Thank you so much.

    @util.experimental
    def read_parquet(
        self, path: str | Path | BytesIO, table_name: str | None = None, **kwargs: Any
    ) -> ir.Table:
jitingxu1 (Contributor Author):

Instead of BytesIO, I could pass the fsspec object; it could be an HTTPFile if we pass an HTTP URL. I'm not sure what the best way is to handle the type of path.

@gforsyth, any suggestion?

Member:

I think fsspec is a good option.

"""Register a parquet file as a table in the current backend.

This function reads a Parquet file and registers it as a table in the current
backend. Note that for Impala and Trino backends, the performance
Member:

Suggested change:
- backend. Note that for Impala and Trino backends, the performance
+ backend. Note that for the Impala and Trino backends, the performance


table_name = table_name or util.gen_name("read_parquet")
paths = list(glob.glob(str(path)))
if paths:
Member:

I would add a comment here indicating that this is to help with reading from remote file locations

else:
    table = pq.read_table(path, **kwargs)

self.create_table(table_name, table)
Member:

Similar to the read_csv PR, this should probably be a memtable so we don't create a persistent table by default
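
For instance, the tail of the method could become something like this (a sketch of that suggestion; it assumes the path, kwargs, and table_name variables from the snippet above):

import ibis

table = pq.read_table(path, **kwargs)
# Return an in-memory table expression instead of persisting a table in the backend.
return ibis.memtable(table, name=table_name)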

Labels: tests (Issues or PRs related to tests)
Projects: none yet
4 participants