[CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application #3026
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3026      +/-   ##
==========================================
- Coverage   32.99%   32.66%   -0.33%
==========================================
  Files         331      336       +5
  Lines       19842    20042     +200
  Branches     1787     1792       +5
==========================================
- Hits         6545     6544       -1
- Misses      12933    13134     +201
  Partials      364      364
View full report in Codecov by Sentry.
…ckCount metric to record the total and fallback count of application
Ping @turboFei, @codenohup, @reswqa, @FMX.
Thanks for raising this patch.
I only have some comments about the unnecessary proto field order change.
Since 0.6.0 has not been released yet, I am using the main branch for our company's deployment, so it would be better to keep the proto field order here. Thanks.
master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/MetaHandler.java
LGTM, thanks
client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleMaster.java
LGTM.
@@ -118,10 +123,24 @@ public <K, V, C> ShuffleHandle registerShuffle(
    String appId = SparkUtils.appUniqueId(dependency.rdd().context());
    initializeLifecycleManager(appId);

    if (!registeredApps.contains(appId)) {
Can a SparkContext, ShuffleManager, or lifecycleManager have multiple apps?
What changes were proposed in this pull request?
Introduce the ApplicationTotalCount and ApplicationFallbackCount metrics to record the total and fallback counts of applications.
Why are the changes needed?
There is currently no metric that records the total count of applications running with Celeborn shuffle and the engine's built-in shuffle, or the fallback count of applications. Meanwhile, the fallback of Flink shuffle is based on job granularity rather than shuffle granularity.
Follow up #3012 (comment).
Does this PR introduce any user-facing change?
No.
How was this patch tested?
DefaultMetaSystemSuiteJ#testShuffleAndApplicationCountWithFallback
RatisMasterStatusSystemSuiteJ#testShuffleAndApplicationCountWithFallback
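For readers following the discussion, the counting semantics described in this pull request might be sketched as below. This is only an illustration, not the code in the patch: the class `ApplicationCounter` and its methods are hypothetical, and it assumes that each application is counted once in the total no matter how many shuffles it registers, and counted once as a fallback the first time any of its shuffles falls back to the engine's built-in shuffle.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper for illustration only; not part of Celeborn. */
public class ApplicationCounter {
  // Application ids that have already been counted in the total.
  private final Set<String> seenApps = ConcurrentHashMap.newKeySet();
  // Application ids that have already been counted as fallback.
  private final Set<String> fallbackApps = ConcurrentHashMap.newKeySet();

  private final AtomicLong applicationTotalCount = new AtomicLong();
  private final AtomicLong applicationFallbackCount = new AtomicLong();

  /**
   * Records one shuffle registration for the given application.
   * Set.add returns true only for the first insertion of an id,
   * so each application contributes at most once to each counter.
   */
  public void record(String appId, boolean fallback) {
    if (seenApps.add(appId)) {
      applicationTotalCount.incrementAndGet();
    }
    if (fallback && fallbackApps.add(appId)) {
      applicationFallbackCount.incrementAndGet();
    }
  }

  public long total() {
    return applicationTotalCount.get();
  }

  public long fallback() {
    return applicationFallbackCount.get();
  }
}
```

Using `Set.add` rather than a separate `contains` check followed by `add` keeps the check-and-insert atomic, which matters if several shuffles of the same application register concurrently.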