-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18364][YARN] Expose metrics for YarnShuffleService #22485
Conversation
Registers the shuffle server's metrics with the Hadoop Node Manager's DefaultMetricsSystem. Test metric collector gets right converted calls camel-case shuffleService Pass scalastyle Reformat and organize imports With import order specified at http://spark.apache.org/contributing.html
ok to test |
Test build #96357 has finished for PR 22485 at commit
|
Sorry for not following through on getting this into Apache. FWIW, it's been in the Palantir fork of Spark for over a year: https://github.com/palantir/spark/search?q=SPARK-18364&unscoped_q=SPARK-18364 |
@ash211 thanks for commenting, looks like you added palantir#236 on top of what is here. looks like it gets snapshot of the timer values to return some of the raw values and percentiles rather then just some rates, is that correct? |
@@ -168,6 +170,15 @@ protected void serviceInit(Configuration conf) throws Exception { | |||
TransportConf transportConf = new TransportConf("shuffle", new HadoopConfigProvider(conf)); | |||
blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile); | |||
|
|||
// register metrics on the block handler into the Node Manager's metrics system. | |||
YarnShuffleServiceMetrics serviceMetrics = | |||
new YarnShuffleServiceMetrics(blockHandler.getAllMetrics()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation should be 2 spaces, fix here and below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
"blockTransferRateBytes")) { | ||
test(s"$testname - collector receives correct types") { | ||
val builder = mock(classOf[MetricsRecordBuilder]) | ||
when(builder.addCounter(any(), anyLong())).thenReturn(builder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where these when's are used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these metrics are Timer or Meter so its called in YarnShuffleServiceMetrics.collectMetrics
.
https://github.com/apache/spark/pull/22485/files#diff-04b0327ae6bce4204c03442b0e6d8718R63
// this metric writes only one gauge to the collector | ||
test("registeredExecutorsSize - collector receives correct types") { | ||
val builder = mock(classOf[MetricsRecordBuilder]) | ||
when(builder.addCounter(any(), anyLong())).thenReturn(builder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't seem to use these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted this wasnt used.
new YarnShuffleServiceMetrics(blockHandler.getAllMetrics()); | ||
|
||
MetricsSystemImpl metricsSystem = (MetricsSystemImpl) DefaultMetricsSystem.instance(); | ||
metricsSystem.register( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ash211 you are calling registerSource here instead of register which required the reflection was there some reason you were doing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
register
method only did some fancy stuff on metrics name, so I didnt understand the motivation to use registerSource
. Also I think Ash only follows notifications on JIRA ticket so I notified him there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay.
My explanation for using reflection was here: https://github.com/palantir/spark/pull/149/files#r107770857 -- basically registerSource
isn't accessible from here.
As to why use registerSource
vs register
, I don't think I knew about register
at the time. Looking at https://github.com/apache/hadoop/blame/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/impl/MetricsSystemImpl.java registerSource
ends up being called both ways.
If you can get the right behavior using the accessible method rather than doing reflection, I'd say we should go with that way.
cc @robert3005 @mccheah fysa for potential future merge conflict
when(builder.addCounter(any(), anyLong())).thenReturn(builder) | ||
when(builder.addGauge(any(), anyDouble())).thenReturn(builder) | ||
|
||
YarnShuffleServiceMetrics.collectMetric(builder, testname, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are calling a private function here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
when(builder.addCounter(any(), anyLong())).thenReturn(builder) | ||
when(builder.addGauge(any(), anyDouble())).thenReturn(builder) | ||
|
||
YarnShuffleServiceMetrics.collectMetric(builder, "registeredExecutorsSize", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are calling a private function here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
ok to test |
@tgravescs thanks for review, I made corrections for your notes and probably need some permissions to retest. Could you please test this PR again? |
ok to test |
test this please |
@@ -56,7 +56,7 @@ public void getMetrics(MetricsCollector collector, boolean all) { | |||
/** | |||
* The metric types used in {@link ExternalShuffleBlockHandler.ShuffleMetrics} | |||
*/ | |||
private static void collectMetric(MetricsRecordBuilder metricsRecordBuilder, String name, Metric metric) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add a comment to the java "Exposed for testing."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
import org.apache.hadoop.metrics2.MetricsCollector; | ||
import org.apache.hadoop.metrics2.MetricsInfo; | ||
import org.apache.hadoop.metrics2.MetricsRecordBuilder; | ||
import org.apache.hadoop.metrics2.MetricsSource; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a newline between hadoop and spark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark import removed because of checkstyle wont allow to have import in javadoc when you have {@link className
} syntax. I think checkstyle rules should be corrected.
Test build #96508 has finished for PR 22485 at commit
|
import org.apache.hadoop.metrics2.MetricsSource; | ||
import org.apache.spark.network.shuffle.ExternalShuffleBlockHandler; | ||
|
||
import java.util.Map; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: java includes should go before the org.apache ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
*/ | ||
@Override | ||
public void getMetrics(MetricsCollector collector, boolean all) { | ||
MetricsRecordBuilder metricsRecordBuilder = collector.addRecord("shuffleService"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should rename this to be sparkShuffleService in order to differentiate from other shuffle servers that could be on the nodemanager (MR/tez).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
|
||
/** | ||
* The metric types used in {@link ExternalShuffleBlockHandler.ShuffleMetrics} | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment to javadoc that its visible for testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
few minor nits that I missed in my first pass, otherwise looks good. |
test this please |
Test build #96516 has finished for PR 22485 at commit
|
Test build #96547 has finished for PR 22485 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, one last minor comment, otherwise lgtm, I'll be offline til next Monday so will commit then if no one else gets to first.
|
||
MetricsSystemImpl metricsSystem = (MetricsSystemImpl) DefaultMetricsSystem.instance(); | ||
metricsSystem.register( | ||
"shuffleService", "Metrics on the Spark Shuffle Service", serviceMetrics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably best to make this sparkshuffleservice as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
95856e9
to
e259b6f
Compare
Test build #96552 has finished for PR 22485 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
merged to master, thanks @mareksimunek |
…gistered connections as well as the number of active connections to YARN Shuffle Service Recently, the ability to expose the metrics for YARN Shuffle Service was added as part of [SPARK-18364](apache#22485). We need to add some metrics to be able to determine the number of active connections as well as open connections to the external shuffle service to benchmark network and connection issues on large cluster environments. Added two more shuffle server metrics for Spark Yarn shuffle service: numRegisteredConnections which indicate the number of registered connections to the shuffle service and numActiveConnections which indicate the number of active connections to the shuffle service at any given point in time. If these metrics are outputted to a file, we get something like this: 1533674653489 default.shuffleService: Hostname=server1.abc.com, openBlockRequestLatencyMillis_count=729, openBlockRequestLatencyMillis_rate15=0.7110833548897356, openBlockRequestLatencyMillis_rate5=1.657808981793011, openBlockRequestLatencyMillis_rate1=2.2404486061620474, openBlockRequestLatencyMillis_rateMean=0.9242558551196706, numRegisteredConnections=35, blockTransferRateBytes_count=2635880512, blockTransferRateBytes_rate15=2578547.6094160094, blockTransferRateBytes_rate5=6048721.726302424, blockTransferRateBytes_rate1=8548922.518223226, blockTransferRateBytes_rateMean=3341878.633637769, registeredExecutorsSize=5, registerExecutorRequestLatencyMillis_count=5, registerExecutorRequestLatencyMillis_rate15=0.0027973949328659836, registerExecutorRequestLatencyMillis_rate5=0.0021278007987206426, registerExecutorRequestLatencyMillis_rate1=2.8270296777387467E-6, registerExecutorRequestLatencyMillis_rateMean=0.006339206380043053, numActiveConnections=35 Closes apache#22498 from pgandhi999/SPARK-18364. Authored-by: pgandhi <pgandhi@oath.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request? This PR is follow-up of closed apache#17401 which only ended due to of inactivity, but its still nice feature to have. Given review by jerryshao taken in consideration and edited: - VisibleForTesting deleted because of dependency conflicts - removed unnecessary reflection for `MetricsSystemImpl` - added more available types for gauge ## How was this patch tested? Manual deploy of new yarn-shuffle jar into a Node Manager and verifying that the metrics appear in the Node Manager-standard location. This is JMX with an query endpoint running on `hostname:port` Resulting metrics look like this: ``` curl -sk -XGET hostname:port | grep -v '#' | grep 'shuffleService' hadoop_nodemanager_openblockrequestlatencymillis_rate15{name="shuffleService",} 0.31428910657834713 hadoop_nodemanager_blocktransferratebytes_rate15{name="shuffleService",} 566144.9983653595 hadoop_nodemanager_blocktransferratebytes_ratemean{name="shuffleService",} 2464409.9678099006 hadoop_nodemanager_openblockrequestlatencymillis_rate1{name="shuffleService",} 1.2893844732240272 hadoop_nodemanager_registeredexecutorssize{name="shuffleService",} 2.0 hadoop_nodemanager_openblockrequestlatencymillis_ratemean{name="shuffleService",} 1.255574678369966 hadoop_nodemanager_openblockrequestlatencymillis_count{name="shuffleService",} 315.0 hadoop_nodemanager_openblockrequestlatencymillis_rate5{name="shuffleService",} 0.7661929192569739 hadoop_nodemanager_registerexecutorrequestlatencymillis_ratemean{name="shuffleService",} 0.0 hadoop_nodemanager_registerexecutorrequestlatencymillis_count{name="shuffleService",} 0.0 hadoop_nodemanager_registerexecutorrequestlatencymillis_rate1{name="shuffleService",} 0.0 hadoop_nodemanager_registerexecutorrequestlatencymillis_rate5{name="shuffleService",} 0.0 hadoop_nodemanager_blocktransferratebytes_count{name="shuffleService",} 6.18271213E8 hadoop_nodemanager_registerexecutorrequestlatencymillis_rate15{name="shuffleService",} 0.0 hadoop_nodemanager_blocktransferratebytes_rate5{name="shuffleService",} 1154114.4881816586 hadoop_nodemanager_blocktransferratebytes_rate1{name="shuffleService",} 574745.0749848988 ``` Closes apache#22485 from mareksimunek/SPARK-18364. Lead-authored-by: marek.simunek <marek.simunek@firma.seznam.cz> Co-authored-by: Andrew Ash <andrew@andrewash.com> Signed-off-by: Thomas Graves <tgraves@apache.org>
…gistered connections as well as the number of active connections to YARN Shuffle Service Recently, the ability to expose the metrics for YARN Shuffle Service was added as part of [SPARK-18364](apache#22485). We need to add some metrics to be able to determine the number of active connections as well as open connections to the external shuffle service to benchmark network and connection issues on large cluster environments. Added two more shuffle server metrics for Spark Yarn shuffle service: numRegisteredConnections which indicate the number of registered connections to the shuffle service and numActiveConnections which indicate the number of active connections to the shuffle service at any given point in time. If these metrics are outputted to a file, we get something like this: 1533674653489 default.shuffleService: Hostname=server1.abc.com, openBlockRequestLatencyMillis_count=729, openBlockRequestLatencyMillis_rate15=0.7110833548897356, openBlockRequestLatencyMillis_rate5=1.657808981793011, openBlockRequestLatencyMillis_rate1=2.2404486061620474, openBlockRequestLatencyMillis_rateMean=0.9242558551196706, numRegisteredConnections=35, blockTransferRateBytes_count=2635880512, blockTransferRateBytes_rate15=2578547.6094160094, blockTransferRateBytes_rate5=6048721.726302424, blockTransferRateBytes_rate1=8548922.518223226, blockTransferRateBytes_rateMean=3341878.633637769, registeredExecutorsSize=5, registerExecutorRequestLatencyMillis_count=5, registerExecutorRequestLatencyMillis_rate15=0.0027973949328659836, registerExecutorRequestLatencyMillis_rate5=0.0021278007987206426, registerExecutorRequestLatencyMillis_rate1=2.8270296777387467E-6, registerExecutorRequestLatencyMillis_rateMean=0.006339206380043053, numActiveConnections=35 Closes apache#22498 from pgandhi999/SPARK-18364. Authored-by: pgandhi <pgandhi@oath.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request? This PR is follow-up of closed apache#17401 which only ended due to of inactivity, but its still nice feature to have. Given review by jerryshao taken in consideration and edited: - VisibleForTesting deleted because of dependency conflicts - removed unnecessary reflection for `MetricsSystemImpl` - added more available types for gauge ## How was this patch tested? Manual deploy of new yarn-shuffle jar into a Node Manager and verifying that the metrics appear in the Node Manager-standard location. This is JMX with an query endpoint running on `hostname:port` Resulting metrics look like this: ``` curl -sk -XGET hostname:port | grep -v '#' | grep 'shuffleService' hadoop_nodemanager_openblockrequestlatencymillis_rate15{name="shuffleService",} 0.31428910657834713 hadoop_nodemanager_blocktransferratebytes_rate15{name="shuffleService",} 566144.9983653595 hadoop_nodemanager_blocktransferratebytes_ratemean{name="shuffleService",} 2464409.9678099006 hadoop_nodemanager_openblockrequestlatencymillis_rate1{name="shuffleService",} 1.2893844732240272 hadoop_nodemanager_registeredexecutorssize{name="shuffleService",} 2.0 hadoop_nodemanager_openblockrequestlatencymillis_ratemean{name="shuffleService",} 1.255574678369966 hadoop_nodemanager_openblockrequestlatencymillis_count{name="shuffleService",} 315.0 hadoop_nodemanager_openblockrequestlatencymillis_rate5{name="shuffleService",} 0.7661929192569739 hadoop_nodemanager_registerexecutorrequestlatencymillis_ratemean{name="shuffleService",} 0.0 hadoop_nodemanager_registerexecutorrequestlatencymillis_count{name="shuffleService",} 0.0 hadoop_nodemanager_registerexecutorrequestlatencymillis_rate1{name="shuffleService",} 0.0 hadoop_nodemanager_registerexecutorrequestlatencymillis_rate5{name="shuffleService",} 0.0 hadoop_nodemanager_blocktransferratebytes_count{name="shuffleService",} 6.18271213E8 hadoop_nodemanager_registerexecutorrequestlatencymillis_rate15{name="shuffleService",} 0.0 hadoop_nodemanager_blocktransferratebytes_rate5{name="shuffleService",} 1154114.4881816586 hadoop_nodemanager_blocktransferratebytes_rate1{name="shuffleService",} 574745.0749848988 ``` Closes apache#22485 from mareksimunek/SPARK-18364. Lead-authored-by: marek.simunek <marek.simunek@firma.seznam.cz> Co-authored-by: Andrew Ash <andrew@andrewash.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit a802c69)
…gistered connections as well as the number of active connections to YARN Shuffle Service Recently, the ability to expose the metrics for YARN Shuffle Service was added as part of [SPARK-18364](apache#22485). We need to add some metrics to be able to determine the number of active connections as well as open connections to the external shuffle service to benchmark network and connection issues on large cluster environments. Added two more shuffle server metrics for Spark Yarn shuffle service: numRegisteredConnections which indicate the number of registered connections to the shuffle service and numActiveConnections which indicate the number of active connections to the shuffle service at any given point in time. If these metrics are outputted to a file, we get something like this: 1533674653489 default.shuffleService: Hostname=server1.abc.com, openBlockRequestLatencyMillis_count=729, openBlockRequestLatencyMillis_rate15=0.7110833548897356, openBlockRequestLatencyMillis_rate5=1.657808981793011, openBlockRequestLatencyMillis_rate1=2.2404486061620474, openBlockRequestLatencyMillis_rateMean=0.9242558551196706, numRegisteredConnections=35, blockTransferRateBytes_count=2635880512, blockTransferRateBytes_rate15=2578547.6094160094, blockTransferRateBytes_rate5=6048721.726302424, blockTransferRateBytes_rate1=8548922.518223226, blockTransferRateBytes_rateMean=3341878.633637769, registeredExecutorsSize=5, registerExecutorRequestLatencyMillis_count=5, registerExecutorRequestLatencyMillis_rate15=0.0027973949328659836, registerExecutorRequestLatencyMillis_rate5=0.0021278007987206426, registerExecutorRequestLatencyMillis_rate1=2.8270296777387467E-6, registerExecutorRequestLatencyMillis_rateMean=0.006339206380043053, numActiveConnections=35 Closes apache#22498 from pgandhi999/SPARK-18364. Authored-by: pgandhi <pgandhi@oath.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 8dd29fe)
This PR is follow-up of closed apache#17401 which only ended due to of inactivity, but its still nice feature to have. Given review by jerryshao taken in consideration and edited: - VisibleForTesting deleted because of dependency conflicts - removed unnecessary reflection for `MetricsSystemImpl` - added more available types for gauge Manual deploy of new yarn-shuffle jar into a Node Manager and verifying that the metrics appear in the Node Manager-standard location. This is JMX with an query endpoint running on `hostname:port` Resulting metrics look like this: ``` curl -sk -XGET hostname:port | grep -v '#' | grep 'shuffleService' hadoop_nodemanager_openblockrequestlatencymillis_rate15{name="shuffleService",} 0.31428910657834713 hadoop_nodemanager_blocktransferratebytes_rate15{name="shuffleService",} 566144.9983653595 hadoop_nodemanager_blocktransferratebytes_ratemean{name="shuffleService",} 2464409.9678099006 hadoop_nodemanager_openblockrequestlatencymillis_rate1{name="shuffleService",} 1.2893844732240272 hadoop_nodemanager_registeredexecutorssize{name="shuffleService",} 2.0 hadoop_nodemanager_openblockrequestlatencymillis_ratemean{name="shuffleService",} 1.255574678369966 hadoop_nodemanager_openblockrequestlatencymillis_count{name="shuffleService",} 315.0 hadoop_nodemanager_openblockrequestlatencymillis_rate5{name="shuffleService",} 0.7661929192569739 hadoop_nodemanager_registerexecutorrequestlatencymillis_ratemean{name="shuffleService",} 0.0 hadoop_nodemanager_registerexecutorrequestlatencymillis_count{name="shuffleService",} 0.0 hadoop_nodemanager_registerexecutorrequestlatencymillis_rate1{name="shuffleService",} 0.0 hadoop_nodemanager_registerexecutorrequestlatencymillis_rate5{name="shuffleService",} 0.0 hadoop_nodemanager_blocktransferratebytes_count{name="shuffleService",} 6.18271213E8 hadoop_nodemanager_registerexecutorrequestlatencymillis_rate15{name="shuffleService",} 0.0 hadoop_nodemanager_blocktransferratebytes_rate5{name="shuffleService",} 1154114.4881816586 hadoop_nodemanager_blocktransferratebytes_rate1{name="shuffleService",} 574745.0749848988 ``` Closes apache#22485 from mareksimunek/SPARK-18364. Lead-authored-by: marek.simunek <marek.simunek@firma.seznam.cz> Co-authored-by: Andrew Ash <andrew@andrewash.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit a802c69) (cherry picked from commit d1bb3ea1eed2ea037d97ca9ac95b514ecaf6d50a) Change-Id: Ia7ed71f9de822e7c968f7618eb0f1b86e5d49912 (cherry picked from commit af2e4cd6f754222888272de16ff4c3df88c2c38b)
What changes were proposed in this pull request?
This PR is follow-up of closed #17401 which only ended due to of inactivity, but its still nice feature to have.
Given review by @jerryshao taken in consideration and edited:
MetricsSystemImpl
How was this patch tested?
Manual deploy of new yarn-shuffle jar into a Node Manager and verifying that the metrics appear in the Node Manager-standard location. This is JMX with an query endpoint running on
hostname:port
Resulting metrics look like this: