
feat(flink): add read_***() for Flink backend #7777

Conversation

@mfatihaktas (Contributor) commented Dec 15, 2023

Adds the following functions in ibis/backends/flink/__init__.py:

  • register()
  • read_file()
  • read_parquet()
  • read_csv()
  • read_json()

Adding these functions allows several tests in test_param.py and test_register.py to pass (a usage sketch follows below).

Edit: Removed register() per this comment.
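For context, a minimal usage sketch of the new readers, assuming an existing pyflink TableEnvironment table_env, the common ibis read_*(path, table_name=None) signature, and placeholder paths and table names:

    import ibis

    # Connect the ibis Flink backend to an existing pyflink TableEnvironment.
    con = ibis.flink.connect(table_env)

    # Sketch of the newly added readers; paths and table names are placeholders.
    t_parquet = con.read_parquet("data.parquet", table_name="t_parquet")
    t_csv = con.read_csv("data.csv", table_name="t_csv")
    t_json = con.read_json("data.json", table_name="t_json")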

@mfatihaktas force-pushed the flink-deep-dive-on-test_json-2 branch 5 times, most recently from ca1ec35 to 76ee200, on December 16, 2023 at 11:07
@cpcloud (Member) commented Dec 16, 2023

Let's avoid adding register for now. It's a bit crufty and from a time when we thought we might be able to unify all the read_* methods under a single read(...) function.

@mfatihaktas (Contributor, Author) replied:
> Let's avoid adding register for now. It's a bit crufty and from a time when we thought we might be able to unify all the read_* methods under a single read(...) function.

Thanks for the note, removed register().

@mfatihaktas changed the title from feat(flink): add register() and read_***() for Flink backend to feat(flink): add dread_***() for Flink backend on Dec 18, 2023
@mfatihaktas changed the title from feat(flink): add dread_***() for Flink backend to feat(flink): add read_***() for Flink backend on Dec 18, 2023
@mfatihaktas marked this pull request as ready for review on December 18, 2023 at 22:11
@mfatihaktas force-pushed the flink-deep-dive-on-test_json-2 branch from 8cf8ed5 to 601cd13 on January 3, 2024 at 18:01
Review thread on ibis/backends/flink/__init__.py (outdated):
        ir.Table
            The just-registered table
        """
        obj = self._get_dataframe_from_path(path)
A project member commented:
Does Flink natively support loading directly from files? If so, I think we should use that.

If it doesn't, I'm not sure how I feel about automatically loading files using pandas and forwarding them that way. We do something similar for local in-memory backends like duckdb/pandas, where file paths are unambiguously local. But for a potentially distributed system like Flink, automatically using a local file reader may behave unexpectedly with path names, and it may also be inefficient.

If we do decide to handle file reading manually with an in-memory reader here, we should use the readers provided by pyarrow directly, not pandas. They have built-in support for directory datasets and better match the behavior of the other backends.
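For reference, a minimal sketch of the pyarrow-based approach being suggested, assuming a locally accessible path (the path is a placeholder); pyarrow.dataset handles a single file and a directory of files uniformly:

    import pyarrow.dataset as ds

    # Works for a single Parquet file or a directory of Parquet files;
    # "path/to/data" is a placeholder.
    dataset = ds.dataset("path/to/data", format="parquet")
    table = dataset.to_table()  # materialize as an in-memory pyarrow.Table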

@mfatihaktas (Contributor, Author) replied:

Thanks, I agree that reading the file while creating the table is not ideal. Flink does support creating tables with the filesystem connector. This, however, requires specifying the schema explicitly, as create_table() also enforces:

    def create_table(
        self,
        name: str,
        obj: pd.DataFrame | pa.Table | ir.Table | None = None,
        *,
        schema: sch.Schema | None = None,
        database: str | None = None,
        catalog: str | None = None,
        tbl_properties: dict | None = None,
        watermark: Watermark | None = None,
        temp: bool = False,
        overwrite: bool = False,
    ) -> ir.Table:
        ...
        if obj is None and schema is None:
            raise exc.IbisError("`schema` or `obj` is required")
        ...

So I think we have two options:

  1. Add a required schema argument to read_***(). This would deviate from the interface the other backends expose.
  2. "Read" the schema from the file with pyarrow and feed it into create_table() with the filesystem connector (a rough sketch follows after this comment). This has the same issue you raised about accessing files on a distributed system.

Which one do you think makes more sense? We could also implement both: let the user specify the schema, and, if it is not given, construct the schema from the file, raising an error if the file cannot be accessed.
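A rough sketch of option 2, assuming the path is locally accessible (the path is a placeholder; converting the pyarrow schema into the form create_table() expects is omitted):

    import pyarrow.dataset as ds

    # Infer the schema without loading the data; this raises an error if
    # the path cannot be accessed, covering the failure case above.
    dataset = ds.dataset("path/to/data", format="parquet")
    inferred_schema = dataset.schema
    # inferred_schema would then be translated to an ibis schema and passed
    # to create_table(..., schema=...) along with the filesystem connector.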

@mfatihaktas force-pushed the flink-deep-dive-on-test_json-2 branch from 601cd13 to 0e69c18 on January 3, 2024 at 22:39
@cpcloud deleted the branch ibis-project:master on January 4, 2024 at 10:43
@cpcloud closed this on Jan 4, 2024
@cpcloud (Member) commented Jan 4, 2024

@mfatihaktas Apologies again for the churn, can you reopen this PR against main?

@mfatihaktas (Contributor, Author) replied:
> @mfatihaktas Apologies again for the churn, can you reopen this PR against main?

Reopened: #7908

cpcloud pushed a commit that referenced this pull request Jan 21, 2024