[SPARK-28869][CORE] Roll over event log files #25670
Conversation
Fixing it would be simple - clone the properties and include into
cc @felixcheung (as shepherd of SPARK-28594), @vanzin, @squito, @gengliangwang, @dongjoon-hyun. Also cc @Ngone51, who might be interested in this.
Test build #110065 has finished for PR 25670 at commit
Test build #110068 has finished for PR 25670 at commit
This is also occurring with the current master branch. You can reproduce it with the query below in spark-shell, while continuously pushing records to topic1 and topic2.
Test build #110073 has finished for PR 25670 at commit
Is there a separate JIRA on this?
Yes, filed https://issues.apache.org/jira/browse/SPARK-28967 and submitted a patch: #25672
Force-pushed from 4bb9de0 to 082cf16.
Test build #110141 has finished for PR 25670 at commit
Test build #110157 has finished for PR 25670 at commit
Force-pushed from 72a6253 to 9b4d53d.
Test build #110202 has finished for PR 25670 at commit
retest this, please
Test build #110208 has finished for PR 25670 at commit
First failure: known flaky test, SPARK-26989 (submitted a patch #25706). Second failure: SparkContext is leaked in a test and makes other tests fail. One of the stack traces follows:
retest this, please
Test build #110220 has finished for PR 25670 at commit
I'd like to put in some effort if that accelerates reviewing. Would it help reviewing if I split this into two parts - writer (EventLoggingListener) / reader (SHS)?
Btw, I'll start working on the next item, which doesn't depend on this patch. Maybe I'll split it into smaller parts than I planned. Some parts may be unused until the following part comes.
retest this, please
Test build #110704 has finished for PR 25670 at commit
UT failure: SPARK-29104 - not relevant.
retest this, please
FYI, #25811 is submitted to cover supporting snapshot/restore of KVStore.
Test build #110730 has finished for PR 25670 at commit
retest this, please
Test build #110733 has finished for PR 25670 at commit
Test build #112065 has finished for PR 25670 at commit
```scala
writer.foreach { w =>
  val currentLen = countingOutputStream.get.getBytesWritten
  if (currentLen + eventJson.length > eventFileMaxLength) {
    rollEventLogFile()
```
There's an edge case where a single event may be larger than the configured limit. (currentLen would be 0 in that case.) That's bad configuration, but in that case you should probably log something and not roll the file. Or maybe add a lower limit to the configuration so that the app doesn't even start or something.
Nice finding! Maybe having a lower limit of 10 MiB seems OK - I wouldn't imagine the size of a single event being bigger than 10 MiB. Will make a change.
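The decision discussed above can be sketched as follows. This is an illustrative sketch, not the PR's actual code; the names (`RollingSketch`, `shouldRoll`, `validateMaxFileSize`) are hypothetical. It captures the two safeguards from the exchange: don't roll an empty file when a single event exceeds the limit, and enforce a 10 MiB floor on the configured maximum file size.

```scala
// Illustrative sketch of the roll-over decision (hypothetical names, not the
// PR's actual API). Two safeguards from the review discussion:
// 1) an empty file never rolls, even for a single oversized event, and
// 2) the configured maximum file size has a 10 MiB floor.
object RollingSketch {
  val MinFileSizeBytes: Long = 10L * 1024 * 1024 // 10 MiB, per the suggestion

  // Roll only if the file already has content and the next event would
  // push it past the limit.
  def shouldRoll(currentLen: Long, eventLen: Long, maxFileSize: Long): Boolean =
    currentLen > 0 && currentLen + eventLen > maxFileSize

  // Mirror of the checkValue guard: reject configurations below the floor.
  def validateMaxFileSize(v: Long): Unit =
    require(v >= MinFileSizeBytes,
      s"Max file size of event log should be at least $MinFileSizeBytes bytes.")
}
```

With this shape, a misconfigured tiny limit fails fast at configuration time rather than causing a roll on nearly every event.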
Test build #112134 has finished for PR 25670 at commit
Just a small nit.
```scala
.doc("The max size of event log file to be rolled over.")
.bytesConf(ByteUnit.BYTE)
.checkValue(_ >= (1024 * 1024 * 10), "Max file size of event log should be configured to" +
```
`ByteUnit.MiB.toBytes(10)`
Looks ok, will merge tomorrow if no one else comments.
Test build #112190 has finished for PR 25670 at commit
Merging to master.
Thanks all for reviewing and merging!
@HeartSaVioR I've just gone through this PR and plan to join later development. You can ping me if this feature goes forward. One question I have now: do I see correctly that the actual implementation measures the event size before compression? If yes, maybe my suggestion can be considered. Namely, I see two possibilities to overcome this, but in both cases the basic idea is the same. The Dstream variable can be wrapped with
I've tested the second approach with lz4, lzf, snappy and zstd, and only lz4 didn't flush the buffer immediately. Of course this doesn't mean the second approach is advised; just wanted to give more info...
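The idea above - measuring the size after compression - can be sketched with a simple byte-counting wrapper around the compressed stream. This is an illustrative sketch using only plain JDK streams, not Spark's actual classes; as noted in the comment, how promptly the count reflects written events depends on the codec's buffering behavior.

```scala
import java.io.{FilterOutputStream, OutputStream}

// Illustrative sketch: wrap the underlying (compressed) stream with a
// byte-counting wrapper so the measured length reflects post-compression size.
// Per the comment above, the accuracy of the count at any instant depends on
// the codec's flushing (lz4 buffered; lzf/snappy/zstd flushed immediately).
class CountingOutputStream(underlying: OutputStream)
    extends FilterOutputStream(underlying) {
  private var count: Long = 0L

  def getBytesWritten: Long = count

  override def write(b: Int): Unit = {
    underlying.write(b)
    count += 1
  }

  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    underlying.write(b, off, len)
    count += len
  }
}
```

Placing the counter below the compression codec (rather than above it) is what makes the measured length match the bytes actually landing in the file.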
```scala
private[spark] val EVENT_LOG_ROLLING_MAX_FILE_SIZE =
  ConfigBuilder("spark.eventLog.rolling.maxFileSize")
    .doc("The max size of event log file to be rolled over.")
```
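For reference, enabling the feature with the merged configuration names might look like the following in `spark-defaults.conf`; the keys are the ones discussed in this thread, while the values are illustrative:

```properties
# Illustrative values - tune maxFileSize to your workload (floor is 10 MiB)
spark.eventLog.enabled              true
spark.eventLog.rolling.enabled      true
spark.eventLog.rolling.maxFileSize  128m
```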
Sorry for leaving a comment late like this, but it would have been better to say this configuration is only effective when `spark.eventLog.rolling.enabled` is enabled.
I think there are counterexamples among Spark configurations which rely on the fact that once there's a `*.enabled` configuration, the others are effective only when it is enabled. Even if we only check the SHS configurations, `spark.history.fs.cleaner.*`, `spark.history.kerberos.*`, and `spark.history.ui.acls.*` fall into this case.
Hm, I tend to disagree with omitting such dependent configurations in their documentation. Can we add and link related configurations in the documentation?
Sorry. Looks like we'll have to agree to disagree then. No one has the privilege to make someone do work under his/her authorship which he/she disagrees with - it would end up putting the wrong authorship on a commit.
@HeartSaVioR, it had to be reviewed. I just happened to review and leave some comments late. Logically, if that's not documented, how do users know which configuration is effective when? At least I had to read the code to confirm.
Also, I am trying to make sure we're on the same page so I won't have to leave this comment again, since you are a regular contributor. I don't think it is a good pattern to not document the relationship between configurations. I am going to send an email to the dev list.
@HyukjinKwon
Thanks for initiating the thread on the dev mailing list. I'm following up on the thread and will be back once we reach some sort of consensus.
… for rolling event log

### What changes were proposed in this pull request?
This patch addresses the post-hoc review comment linked here - #25670 (comment)

### Why are the changes needed?
We would like to explicitly document the direct relationship before we finish up structuring of configurations.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
N/A

Closes #27576 from HeartSaVioR/SPARK-28869-FOLLOWUP-doc.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Closes apache#25670 from HeartSaVioR/SPARK-28869.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
This PR aims to enable `spark.eventLog.rolling.enabled` by default for Apache Spark 4.0.0.

### Why are the changes needed?
Since Apache Spark 3.0.0, we have been using event log rolling not only for **long-running jobs**, but also for **some failed jobs** to archive the partial event logs incrementally.
- #25670

### Does this PR introduce _any_ user-facing change?
- No, because `spark.eventLog.enabled` is disabled by default.
- For users with `spark.eventLog.enabled=true`, yes, the `spark-events` directory will have different layouts. However, all 3.3+ Spark History Servers can read both old and new event logs. I believe that event log users are already using this configuration to avoid the loss of event logs for long-running jobs and some failed jobs.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43638 from dongjoon-hyun/SPARK-45771.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This patch is a part of SPARK-28594 and design doc for SPARK-28594 is linked here: https://docs.google.com/document/d/12bdCC4nA58uveRxpeo8k7kGOI2NRTXmXyBOweSi4YcY/edit?usp=sharing
This patch proposes adding a new feature to event logging: rolling event log files at a configured file size.

Previously event logging is done with a single file and the related codebase (`EventLoggingListener`/`FsHistoryProvider`) is tightly coupled with it. This patch adds a layer on both the reader (`EventLogFileReader`) and the writer (`EventLogFileWriter`) to decouple implementation details between "handling events" and "how to read/write events from/to file".

This patch adds two properties, `spark.eventLog.rollLog` and `spark.eventLog.rollLog.maxFileSize`, which provide configurable behavior of rolling the log. The feature is disabled by default, as we only expect huge event logs for huge/long-running applications. For other cases a single event log file would be sufficient and still simpler.

Why are the changes needed?
This is a part of SPARK-28594, which addresses the event log growing infinitely for long-running applications.
This patch itself also provides an option for the situation where the event log file gets huge and consumes storage. End users may give up replaying their events and want to delete the event log file, but given the application is still running and writing the file, it's not safe to delete the file. End users will be able to delete some of the old files after applying rolling event logs.
Does this PR introduce any user-facing change?
No, as the new feature is turned off by default.
How was this patch tested?
Added unit tests, as well as basic manual tests.
Basic manual tests - ran SHS, ran a structured streaming query with rolling event log enabled, and verified split files are generated and SHS can load them, handling app status as incomplete/complete.