[SPARK-48802][SS][FOLLOWUP] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed #47195

Conversation

@ragnarok56 (Contributor) commented on Jul 3, 2024

What changes were proposed in this pull request?

This is a followup to a bug identified in #45362. When `maxCachedFiles` is set to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924), subsequent batches of files were skipped due to a logic error: the carried-forward `unreadFiles` array was only null checked, so an empty array was still treated as a valid cache. This update adds a check that `unreadFiles` is also non-empty as a guard condition, preventing batches from executing with no files, and ensures that `unreadFiles` is only set if a) there are files remaining in the listing and b) `maxCachedFiles` is greater than 0.
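For illustration, here is a minimal Scala sketch of the guard condition described above. It is not the actual FileStreamSource code: the names `unreadFiles` and `maxCachedFiles` come from this description, while the surrounding structure, `listAllFiles`, and `maxFilesPerTrigger` are assumed for the example.

```scala
// Simplified sketch, not the real FileStreamSource implementation.
object CachedListingSketch {
  // Files carried over from the previous listing; null when nothing is cached.
  private var unreadFiles: Array[String] = null

  // Stand-in for the (expensive) full directory listing.
  private def listAllFiles(): Array[String] = Array("f1", "f2", "f3")

  def nextBatch(maxFilesPerTrigger: Int, maxCachedFiles: Int): Array[String] = {
    // Before the fix this was only `unreadFiles != null`, so an empty cached
    // array produced a batch with zero files instead of triggering a relisting.
    val allFiles =
      if (unreadFiles != null && unreadFiles.nonEmpty) unreadFiles
      else listAllFiles()

    val (batch, remaining) = allFiles.splitAt(maxFilesPerTrigger)

    // Only cache leftovers when (a) some files remain and (b) caching is enabled.
    unreadFiles =
      if (remaining.nonEmpty && maxCachedFiles > 0) remaining.take(maxCachedFiles)
      else null

    batch
  }
}
```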

Why are the changes needed?

Setting the `maxCachedFiles` configuration to 0 inadvertently caused every other batch to contain 0 files, which is unexpected behavior for users.

Does this PR introduce any user-facing change?

It fixes the case where users want to always perform a full listing of files for each batch by setting `maxCachedFiles` to 0.
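For reference, a hedged usage sketch (assuming `maxCachedFiles` is passed as a FileStreamSource option, as introduced by SPARK-44924; the format, path, and the `maxFilesPerTrigger` value are placeholders):

```scala
// Force a full directory listing on every micro-batch by disabling the
// cached-listing optimization. Path and format are placeholders.
val stream = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", "10")
  .option("maxCachedFiles", "0") // 0 = never carry unread files to the next batch
  .load("/path/to/input")
```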

How was this patch tested?

A new test was added to verify that setting `maxCachedFiles` to 0 performs a full file listing for each batch.

Was this patch authored or co-authored using generative AI tooling?

No

this creates a scenario where a batch that processes files will clear out the cached files as per the config setting, but inadvertently causes the following batch to execute with a cached files list of 0 instead of listing files again
fix scalastyle (again)
@HyukjinKwon (Member) commented

Mind filing a JIRA? See also https://spark.apache.org/contributing.html

@ragnarok56 ragnarok56 changed the title FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed [SPARK-48802][SS] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed Jul 4, 2024
@ragnarok56 (Contributor, Author) commented

Created https://issues.apache.org/jira/browse/SPARK-48802 for this as a bug

@ragnarok56 ragnarok56 changed the title [SPARK-48802][SS] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed [SPARK-48802][SS][FOLLOWUP] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed Jul 4, 2024
@HeartSaVioR (Contributor) left a comment


+1

@HeartSaVioR (Contributor) commented

Thanks! Merging to master.

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024

Closes apache#47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase.

Lead-authored-by: ragnarok56 <kevin.nacios@gmail.com>
Co-authored-by: Kevin Nacios <kevin.nacios@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024