-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48802][SS][FOLLOWUP] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed #47195
Closed
ragnarok56
wants to merge
5
commits into
apache:master
from
ragnarok56:filestreamsource-maxcachedfiles-edgecase
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
this creates a scenario where a batch that processes files will clear out the cached files as per the config setting but inadvertantly causes the following batch to execute with a cached files list of 0 instead of listing files again
Mind filing a JiRA? See also https://spark.apache.org/contributing.html |
ragnarok56
changed the title
FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed
[SPARK-48802][SS] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed
Jul 4, 2024
Created https://issues.apache.org/jira/browse/SPARK-48802 for this as a bug |
ragnarok56
changed the title
[SPARK-48802][SS] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed
[SPARK-48802][SS][FOLLOWUP] FileStreamSource maxCachedFiles set to 0 causes batch with no files to be processed
Jul 4, 2024
HeartSaVioR
approved these changes
Jul 8, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thanks! Merging to master. |
ericm-db
pushed a commit
to ericm-db/spark
that referenced
this pull request
Jul 10, 2024
…causes batch with no files to be processed ### What changes were proposed in this pull request? This is a followup to a bug identified from apache#45362. When setting `maxCachedFiles` to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924) subsequent batches of files would be skipped due to a logic error that carried forward an empty array of `unreadFiles` which was only being null checked. This update includes additional checks to verify that `unreadFiles` is also non-empty as a guard condition to prevent batches executing with no files, as well as checks to ensure that `unreadFiles` is only set if a) there are files remaining in the listing and b) `maxCachedFiles` is greater than 0 ### Why are the changes needed? Setting the `maxCachedFiles` configuration to 0 would inadvertently cause every other batch to contain 0 files, which is an unexpected behavior for users. ### Does this PR introduce _any_ user-facing change? Fixes the case where users may want to always perform a full listing of files each batch by setting `maxCachedFiles` to 0 ### How was this patch tested? New test added to verify `maxCachedFiles` set to 0 would perform a file listing each batch ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase. Lead-authored-by: ragnarok56 <kevin.nacios@gmail.com> Co-authored-by: Kevin Nacios <kevin.nacios@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
jingz-db
pushed a commit
to jingz-db/spark
that referenced
this pull request
Jul 22, 2024
…causes batch with no files to be processed ### What changes were proposed in this pull request? This is a followup to a bug identified from apache#45362. When setting `maxCachedFiles` to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924) subsequent batches of files would be skipped due to a logic error that carried forward an empty array of `unreadFiles` which was only being null checked. This update includes additional checks to verify that `unreadFiles` is also non-empty as a guard condition to prevent batches executing with no files, as well as checks to ensure that `unreadFiles` is only set if a) there are files remaining in the listing and b) `maxCachedFiles` is greater than 0 ### Why are the changes needed? Setting the `maxCachedFiles` configuration to 0 would inadvertently cause every other batch to contain 0 files, which is an unexpected behavior for users. ### Does this PR introduce _any_ user-facing change? Fixes the case where users may want to always perform a full listing of files each batch by setting `maxCachedFiles` to 0 ### How was this patch tested? New test added to verify `maxCachedFiles` set to 0 would perform a file listing each batch ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase. Lead-authored-by: ragnarok56 <kevin.nacios@gmail.com> Co-authored-by: Kevin Nacios <kevin.nacios@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
attilapiros
pushed a commit
to attilapiros/spark
that referenced
this pull request
Oct 4, 2024
…causes batch with no files to be processed ### What changes were proposed in this pull request? This is a followup to a bug identified from apache#45362. When setting `maxCachedFiles` to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924) subsequent batches of files would be skipped due to a logic error that carried forward an empty array of `unreadFiles` which was only being null checked. This update includes additional checks to verify that `unreadFiles` is also non-empty as a guard condition to prevent batches executing with no files, as well as checks to ensure that `unreadFiles` is only set if a) there are files remaining in the listing and b) `maxCachedFiles` is greater than 0 ### Why are the changes needed? Setting the `maxCachedFiles` configuration to 0 would inadvertently cause every other batch to contain 0 files, which is an unexpected behavior for users. ### Does this PR introduce _any_ user-facing change? Fixes the case where users may want to always perform a full listing of files each batch by setting `maxCachedFiles` to 0 ### How was this patch tested? New test added to verify `maxCachedFiles` set to 0 would perform a file listing each batch ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase. Lead-authored-by: ragnarok56 <kevin.nacios@gmail.com> Co-authored-by: Kevin Nacios <kevin.nacios@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This is a followup to a bug identified from #45362. When setting
maxCachedFiles
to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924) subsequent batches of files would be skipped due to a logic error that carried forward an empty array ofunreadFiles
which was only being null checked. This update includes additional checks to verify thatunreadFiles
is also non-empty as a guard condition to prevent batches executing with no files, as well as checks to ensure thatunreadFiles
is only set if a) there are files remaining in the listing and b)maxCachedFiles
is greater than 0Why are the changes needed?
Setting the
maxCachedFiles
configuration to 0 would inadvertently cause every other batch to contain 0 files, which is an unexpected behavior for users.Does this PR introduce any user-facing change?
Fixes the case where users may want to always perform a full listing of files each batch by setting
maxCachedFiles
to 0How was this patch tested?
New test added to verify
maxCachedFiles
set to 0 would perform a file listing each batchWas this patch authored or co-authored using generative AI tooling?
No