
feat(pyspark): implement new experimental read/write directory methods #9272

Merged 12 commits into ibis-project:main (Jul 1, 2024)

Conversation

chloeh13q (Contributor)

Description of changes

Implements new experimental read/write directory methods in the PySpark backend to support streaming reads and writes.

Issues closed

#8984
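
For context, here is a rough usage sketch of what these directory-based methods look like from the user side. The method names (`read_parquet_dir` / `to_parquet_dir`), the `mode` connection option, and the parameters below are assumptions inferred from the PR title and description, and may differ from the merged API.

```python
# Hypothetical usage sketch only: method names, the `mode` connection option,
# and parameters are assumptions, not the confirmed final API.
import ibis

# Connect to the PySpark backend (a streaming-capable session is assumed here).
con = ibis.pyspark.connect(mode="streaming")

# Read an entire directory of Parquet files as one table.
events = con.read_parquet_dir("/data/events", table_name="events")

# Write a derived result back out as a directory of Parquet files.
expr = events.filter(events["value"] > 0)
con.to_parquet_dir(expr, path="/data/filtered_events")
```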

@chloeh13q force-pushed the feat/spark-new-read-write branch from b008b1f to c9ef6df (May 30, 2024 18:13)
@chloeh13q (Contributor, Author):

xref: ibis-project/testing-data#9

@chloeh13q marked this pull request as ready for review (May 30, 2024 18:42)
@chloeh13q requested a review from jcrist (May 30, 2024 18:42)
@@ -360,6 +325,7 @@ def con(data_dir, tmp_path_factory, worker_id):
@pytest.fixture(scope="session")
def con_streaming(data_dir, tmp_path_factory, worker_id):
    backend_test = TestConfForStreaming.load_data(data_dir, tmp_path_factory, worker_id)
    backend_test._load_data()
Member:

Not 100% sure why this is needed; let me investigate a bit what's going on.

chloeh13q (Contributor, Author):

thanks!

chloeh13q (Contributor, Author):

I figured out what the problem is! The PySpark backend loads data statefully, and the streaming conf ended up reusing the same temp directory as the batch conf. That directory was already populated by the time the streaming conf tried to load data, so the streaming conf skipped data loading. Because tmpdir is passed in as a fixture, I added a line of code that changes the directory naming for the streaming conf.
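
For readers following along, here is a minimal self-contained sketch of that idea. The class name `FakeBackendTest`, the `LOADED` marker, and the `pyspark-streaming` directory name are illustrative stand-ins, not the actual ibis test classes.

```python
import pytest


class FakeBackendTest:
    """Illustrative stand-in for the real backend test conf (names assumed)."""

    def __init__(self, tmpdir):
        self.tmpdir = tmpdir

    def _load_data(self):
        marker = self.tmpdir / "LOADED"
        if marker.exists():
            # Stateful load: an already-populated directory means loading is
            # skipped, which is what bit the streaming conf when it shared the
            # batch conf's directory.
            return
        marker.write_text("ok")


@pytest.fixture(scope="session")
def con_streaming(tmp_path_factory):
    # Giving the streaming conf its own directory name keeps it from reusing
    # the batch conf's already-populated tmpdir and skipping the data load.
    tmpdir = tmp_path_factory.mktemp("pyspark-streaming")
    backend_test = FakeBackendTest(tmpdir)
    backend_test._load_data()
    return backend_test
```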

@chloeh13q force-pushed the feat/spark-new-read-write branch from 76f613b to 206cd9c (June 3, 2024 18:12)
@cpcloud added the pyspark (The Apache PySpark backend) label (Jun 3, 2024)
@cpcloud added this to the 9.1 milestone (Jun 3, 2024)
@cpcloud force-pushed the feat/spark-new-read-write branch from 206cd9c to 3093145 (June 3, 2024 19:41)
@chloeh13q force-pushed the feat/spark-new-read-write branch from d51bd42 to f933779 (June 3, 2024 20:34)
@cpcloud modified the milestones: 9.1, 9.2 (Jun 13, 2024)
@chloeh13q force-pushed the feat/spark-new-read-write branch from 85317b3 to 4f03594 (June 27, 2024 19:00)
@chloeh13q force-pushed the feat/spark-new-read-write branch from 3237871 to 6bc9364 (June 27, 2024 19:45)
@chloeh13q requested a review from cpcloud (June 27, 2024 20:13)
@cpcloud (Member) left a comment:
Let's merge this! Since these methods are marked experimental, we're free to break them across non-major versions based on user feedback/bugs/etc.

@cpcloud enabled auto-merge (squash) (July 1, 2024 12:33)
@cpcloud merged commit adade5e into ibis-project:main (Jul 1, 2024)
76 checks passed
Labels: pyspark (The Apache PySpark backend)
Projects: None yet
2 participants