
feat(pyspark): implement new experimental read/write directory methods #9272

Merged 12 commits into ibis-project:main (Jul 1, 2024)

Conversation

chloeh13q (Contributor)

Description of changes

Implements new experimental read/write directory methods in the PySpark backend to support streaming reads and writes.

Issues closed

#8984
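
For context, here is a rough usage sketch of what these directory-based methods look like from the user side. The method names (`read_parquet_dir` / `to_parquet_dir`), the `mode` connection option, and the parameters below are assumptions inferred from the PR title and description, and may differ from the merged API.

```python
# Hypothetical usage sketch only: method names, the `mode` connection option,
# and parameters are assumptions, not the confirmed final API.
import ibis

# Connect to the PySpark backend (a streaming-capable session is assumed here).
con = ibis.pyspark.connect(mode="streaming")

# Read an entire directory of Parquet files as one table.
events = con.read_parquet_dir("/data/events", table_name="events")

# Write a derived result back out as a directory of Parquet files.
expr = events.filter(events["value"] > 0)
con.to_parquet_dir(expr, path="/data/filtered_events")
```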

@chloeh13q force-pushed the feat/spark-new-read-write branch from b008b1f to c9ef6df (May 30, 2024 18:13)
@chloeh13q (Contributor, Author):

xref: ibis-project/testing-data#9

@chloeh13q marked this pull request as ready for review (May 30, 2024 18:42)
@chloeh13q requested a review from jcrist (May 30, 2024 18:42)
@@ -360,6 +325,7 @@ def con(data_dir, tmp_path_factory, worker_id):
@pytest.fixture(scope="session")
def con_streaming(data_dir, tmp_path_factory, worker_id):
    backend_test = TestConfForStreaming.load_data(data_dir, tmp_path_factory, worker_id)
    backend_test._load_data()
Member:

Not 100% sure why this is needed; let me investigate a bit what's going on.

chloeh13q (Contributor, Author):

thanks!

chloeh13q (Contributor, Author):

I figured out what the problem is! The PySpark backend loads data statefully, and the streaming conf ended up reusing the same temp directory as the batch conf. That directory was already populated by the time the streaming conf tried to load data, so the streaming conf skipped data loading. Because tmpdir is passed in as a fixture, I added a line of code that changes the directory naming for the streaming conf.
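
For readers following along, here is a minimal self-contained sketch of that idea. The class name `FakeBackendTest`, the `LOADED` marker, and the `pyspark-streaming` directory name are illustrative stand-ins, not the actual ibis test classes.

```python
import pytest


class FakeBackendTest:
    """Illustrative stand-in for the real backend test conf (names assumed)."""

    def __init__(self, tmpdir):
        self.tmpdir = tmpdir

    def _load_data(self):
        marker = self.tmpdir / "LOADED"
        if marker.exists():
            # Stateful load: an already-populated directory means loading is
            # skipped, which is what bit the streaming conf when it shared the
            # batch conf's directory.
            return
        marker.write_text("ok")


@pytest.fixture(scope="session")
def con_streaming(tmp_path_factory):
    # Giving the streaming conf its own directory name keeps it from reusing
    # the batch conf's already-populated tmpdir and skipping the data load.
    tmpdir = tmp_path_factory.mktemp("pyspark-streaming")
    backend_test = FakeBackendTest(tmpdir)
    backend_test._load_data()
    return backend_test
```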

@chloeh13q force-pushed the feat/spark-new-read-write branch from 76f613b to 206cd9c (June 3, 2024 18:12)
@cpcloud added the pyspark (The Apache PySpark backend) label (Jun 3, 2024)
@cpcloud added this to the 9.1 milestone (Jun 3, 2024)
@cpcloud force-pushed the feat/spark-new-read-write branch from 206cd9c to 3093145 (June 3, 2024 19:41)
@chloeh13q force-pushed the feat/spark-new-read-write branch from d51bd42 to f933779 (June 3, 2024 20:34)
@cpcloud modified the milestones: 9.1, 9.2 (Jun 13, 2024)
@chloeh13q force-pushed the feat/spark-new-read-write branch from 85317b3 to 4f03594 (June 27, 2024 19:00)
@chloeh13q force-pushed the feat/spark-new-read-write branch from 3237871 to 6bc9364 (June 27, 2024 19:45)
@chloeh13q requested a review from cpcloud (June 27, 2024 20:13)
@cpcloud (Member) left a comment:
Let's merge this! Since these methods are marked experimental, we're free to break them across non-major versions based on user feedback/bugs/etc.

@cpcloud enabled auto-merge (squash) (July 1, 2024 12:33)
@cpcloud merged commit adade5e into ibis-project:main (Jul 1, 2024)
76 checks passed
Labels: pyspark (The Apache PySpark backend)
Projects: None yet
2 participants