adding JVM GC Time and Executor CPU time heuristic #279
Conversation
hadoop-1 does not have JobStatus.getFinishTime(). This causes dr-elephant to hang. Set the start time to be same as finish time for h1 jobs. For consistency, reverted to the old method of scraping the job tracker url so that we get only start time, and set the finish time to be equal to start time for retired jobs as well. RB=417975 BUGS=HADOOP-8640 R=fli,mwagner A=fli
RB=417448 BUGS=HADOOP-8648 R=fli A=fli
…ff 51 reducers instead of 50
…increasing mapred.min.split.size for too many mappers, NOT mapred.max.split.size
…n Help topics page
…name RB=468832 BUGS=HADOOP-10405 R=fli A=fli,ahsu
…JobType; 2. use only application id in all data holders and use job id while saving to DB; 3. Change configuration styles of heuristics, fetchers and jobtypes into separate files; 4. Make those configuration files providable via other locations so that we can seamlessly change configuration without releasing a new deployment zip. 5. Remove HadoopVersion classic, yarn, using 1 or 2 instead.
* Fix SparkMetricsAggregator to not produce negative ResourceUsage
…hese objects when we implement our own parser (linkedin#248)
* We have been ignoring Failed Tasks in calculation of resource usage. This handles that. * Fixes Exception heuristic which was supposed to give the stacktrace.
executorRunTime: Long,
name: String
): StageDataImpl = new StageDataImpl(
status: StageStatus,
Fix the indentation.
it("has the total executor cpu time") {
  evaluator.executorCpuTime should be(18600000)
}
it("has ascending severity for ratio of JVM GC time to executor cpu time") {
Please maintain spacing after each assertion.
@@ -116,6 +129,15 @@ object StagesHeuristic {
    critical = Duration("60min").toMillis,
    ascending = true
  )

/** The severity thresholds for the ratio of JVM GC Time and executor CPU Time */
Can you add a comment on what basis these thresholds were set? If they are experimental, please mention that in the comments and note that they may change in the future.
These are experimental, so are likely to change as we evaluate the results on more Spark applications.
stageDatas: Seq[StageDataImpl],
appConfigurationProperties: Map[String, String]
): SparkApplicationData = {
stageDatas: Seq[StageDataImpl],
Fix indentation
resultDetails = resultDetails :+ new HeuristicResultDetails("Note", "The ratio of JVM GC Time and executor Time is above normal, we recommend to increase the executor memory")
}
if (evaluator.severityTimeD.getValue != Severity.NONE.getValue) {
  resultDetails = resultDetails :+ new HeuristicResultDetails("Note", "The ratio of JVM GC Time and executor Time is below normal, we recommend to decrease the executor memory")
Can you clarify this behavior in the comments?
@@ -36,7 +36,8 @@ import org.apache.spark.status.api.v1.StageStatus
 * each stage.
 */
class StagesHeuristic(private val heuristicConfigurationData: HeuristicConfigurationData)
extends Heuristic[SparkApplicationData] {
  extends Heuristic[SparkApplicationData] {
Is the change in indentation intentional?
@@ -196,15 +233,44 @@ object StagesHeuristic {
  val averageExecutorRuntime = stageData.executorRunTime / executorInstances
  (averageExecutorRuntime, stageRuntimeMillisSeverityThresholds.severityOf(averageExecutorRuntime))
}

private def getTimeValues(stageDatas: Seq[StageData]): (Long, Long) = {
Would it be possible to add some comments for the function?
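For illustration, a hedged sketch of what a documented `getTimeValues` could look like. The case classes below are simplified stand-ins for the Spark status-API types quoted in this review (only the fields used here are modeled), not the real definitions:

```scala
// Simplified stand-ins for the Spark status API types; only the fields
// referenced in this review's snippets are modeled.
final case class TaskMetrics(jvmGcTime: Long, executorCpuTime: Long)
final case class TaskData(taskMetrics: Option[TaskMetrics])
final case class StageData(tasks: Option[Map[Long, TaskData]])

/** Sums JVM GC time and executor CPU time (both in milliseconds) over
  * every task of every stage; tasks with no metrics contribute zero. */
def getTimeValues(stageDatas: Seq[StageData]): (Long, Long) = {
  var jvmGcTimeTotal = 0L
  var executorCpuTimeTotal = 0L
  for {
    stageData <- stageDatas
    taskData  <- stageData.tasks.getOrElse(Map.empty[Long, TaskData]).values
    metrics   <- taskData.taskMetrics
  } {
    jvmGcTimeTotal += metrics.jvmGcTime
    executorCpuTimeTotal += metrics.executorCpuTime
  }
  (jvmGcTimeTotal, executorCpuTimeTotal)
}
```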
@@ -179,6 +179,7 @@ trait TaskData{
trait TaskMetrics{
Would it be useful to have totalCpuTime in ExecutorSummary (it would be fewer calls to get the info)? It seems good to have at the stage level, so that users know which stage is problematic.
Force-pushed from fb80795 to 5e060d1.
I think it would be good if you could provide a brief description of this PR. It helps in the review.
Thanks for adding StageStatus. The new changes look good.
@@ -0,0 +1,17 @@
package com.linkedin.drelephant.spark.fetchers.statusapiv1;
Adding a Java class in a Scala package looks weird. Either create a Scala class here or create a separate package for this class.
In the Apache Spark documentation, StageStatus is a .java class; hence, to keep things uniform, I will create a new package for this class.
Exception.ignoring(classOf[NoSuchElementException]) {
  stageDatas.foreach((stageData: StageData) => {
    stageData.tasks.get.values.foreach((taskData: TaskData) => {
      jvmGcTimeTotal += taskData.taskMetrics.getOrElse(taskMetricsDummy).jvmGcTime
I don't think it's correct to add GC time for each task to get a total GC time. JVM GC time is a global entity: if multiple tasks are running in parallel, they will have the same GC time, and adding them up will be overcounting.
var (jvmTime, executorCpuTime) = getTimeValues(stageDatas)

var ratio: Double = {
  ratio = jvmTime.toDouble / executorCpuTime.toDouble
JVM GC time is in wall clock, and executor CPU time is in CPU clock. I don't know if dividing them is right to do here. Instead of using executor CPU time, we should use executor duration.
shuffleWriteMetrics = None)).get
// ignoring the exception as there are cases when there is no task data; in such cases, 0 is taken as the default value
Exception.ignoring(classOf[NoSuchElementException]) {
  stageDatas.foreach((stageData: StageData) => {
I don't think it makes sense to sum these ratios over stages; we should sum them over executors.
Why is this heuristic part of the stage heuristic? Since JVM GC is an executor property, I think it makes more sense to see this as an executor heuristic.
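As a rough illustration of what the reviewers are suggesting, the ratio could be computed from executor-level aggregates. `ExecutorSummary` below is a minimal stand-in for the Spark REST API type (the real one does expose `totalGCTime` and `totalDuration`); the aggregation strategy is a sketch of the suggestion, not the code in this PR:

```scala
// Minimal stand-in for Spark's v1 ExecutorSummary; both fields are
// wall-clock milliseconds, so their ratio is well-defined.
final case class ExecutorSummary(totalGCTime: Long, totalDuration: Long)

// Ratio of time spent in GC to total executor run time, aggregated over
// executors rather than over per-task metrics (summing per-task GC times
// overcounts GC for tasks that run in parallel inside one JVM).
def gcRatio(executorSummaries: Seq[ExecutorSummary]): Double = {
  val gcTime   = executorSummaries.map(_.totalGCTime).sum
  val duration = executorSummaries.map(_.totalDuration).sum
  if (duration == 0L) 0.0 else gcTime.toDouble / duration.toDouble
}
```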
Force-pushed from e3fa67b to 34ffbe2.
The ratio of jvmGcTime to executorCpuTime is checked to see if GC is taking too much time (providing more memory could help) or too little time (memory may be over-provisioned and can be reduced).
It's recommended to increase the executor memory if too much time is being spent in GC:
Low: avg (jvmGcTime / executorCpuTime) >= .08
Moderate: avg (jvmGcTime / executorCpuTime) >= .1
Severe: avg (jvmGcTime / executorCpuTime) >= .15
Critical: avg (jvmGcTime / executorCpuTime) >= .2
It's recommended to decrease the executor memory if too little time is being spent in GC:
Low: avg (jvmGcTime / executorCpuTime) < .05
Moderate: avg (jvmGcTime / executorCpuTime) < .04
Severe: avg (jvmGcTime / executorCpuTime) < .03
Critical: avg (jvmGcTime / executorCpuTime) < .01
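Under these (experimental) thresholds, the mapping from ratio to severity can be sketched as below. The real implementation would go through Dr. Elephant's SeverityThresholds machinery rather than hard-coded branches, and the cutoff values may change as noted above:

```scala
// Ascending check: too much GC relative to executor CPU time suggests
// increasing executor memory. Thresholds are the experimental values
// from the PR description.
def gcIncreaseSeverity(ratio: Double): String =
  if      (ratio >= 0.20) "CRITICAL"
  else if (ratio >= 0.15) "SEVERE"
  else if (ratio >= 0.10) "MODERATE"
  else if (ratio >= 0.08) "LOW"
  else                    "NONE"

// Descending check: very little GC suggests memory is over-provisioned
// and executor memory can be decreased.
def gcDecreaseSeverity(ratio: Double): String =
  if      (ratio < 0.01) "CRITICAL"
  else if (ratio < 0.03) "SEVERE"
  else if (ratio < 0.04) "MODERATE"
  else if (ratio < 0.05) "LOW"
  else                   "NONE"
```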