
Spark Heuristics for Dr. Elephant #324

Merged
merged 14 commits into from
Feb 5, 2018

Conversation

skakker
Contributor

@skakker skakker commented Jan 25, 2018

This PR adds new Spark heuristics to Dr. Elephant:
Executor JVM Memory Heuristic: checks whether the allocated executor memory is much more than the peak JVM used memory.
Executor Peak Unified Memory: checks whether the memory allocated for the Unified Memory region is much more than the memory actually used by it.
Executor Spill: checks whether executor memory is spilled and gives suggestions.
Executor GC: checks how much time is spent in GC compared to the total running time of the job.
Driver: checks driver configurations, driver GC time, and JVM used memory.
Stages With Failed Tasks: checks for stages with failed tasks and examines the error messages of the failed tasks.
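As a rough illustration of how these ratio-based heuristics grade jobs: a GC-time check of this kind can map the GC-to-runtime ratio onto a severity level using ascending thresholds such as the `gc_severity_threshold` values quoted later in this thread (0.08, 0.09, 0.1, 0.15). This is a minimal sketch, not the PR's actual code; the object and method names are invented for illustration.

```scala
// Sketch (not Dr. Elephant's actual implementation): map the ratio of GC
// time to total run time onto a severity level, one level per threshold
// crossed, using the threshold style shown in this PR's params.
object GcSeveritySketch {
  // Hypothetical ascending thresholds: low, moderate, severe, critical.
  val gcThresholds: Seq[Double] = Seq(0.08, 0.09, 0.10, 0.15)

  // Returns 0 (none) through 4 (critical).
  def gcSeverity(totalGcTimeMs: Long, totalRunTimeMs: Long): Int =
    if (totalRunTimeMs <= 0) 0
    else {
      val ratio = totalGcTimeMs.toDouble / totalRunTimeMs
      gcThresholds.count(ratio >= _)
    }
}
```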

@skakker skakker changed the base branch from master to customSHSWork January 25, 2018 10:55
<!--<params>
<driver_peak_jvm_memory_threshold>1.25,1.5,2,3</driver_peak_jvm_memory_threshold>
<gc_severity_threshold>0.08,0.09,0.1,0.15</gc_severity_threshold>
<peak_unified_memory_threshold>0.7,0.6,0.4,0.2</peak_unified_memory_threshold>


We probably don't need to check unified memory for the driver -- it looks like the driver can have non-zero storage memory, but there isn't a separate spark.memory.fraction or spark.memory.storageFraction for the driver. There are usually more executors, so it makes sense to tune for the executors.
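For context on why there is no driver-specific fraction: in Spark 2.x the unified (execution + storage) region is sized from the JVM heap, a fixed 300 MB reserved-memory cut, and `spark.memory.fraction` (default 0.6). A minimal sketch of that sizing, with invented names, assuming Spark 2.x defaults:

```scala
// Sketch of how Spark 2.x sizes the unified memory region for a JVM
// (illustrative names; not Dr. Elephant or Spark source code).
object UnifiedMemorySketch {
  val ReservedBytes: Long = 300L * 1024 * 1024 // Spark's reserved memory
  val DefaultMemoryFraction: Double = 0.6      // spark.memory.fraction default

  // Unified (execution + storage) region size for a given heap.
  def unifiedRegionBytes(heapBytes: Long,
                         memoryFraction: Double = DefaultMemoryFraction): Long =
    ((heapBytes - ReservedBytes) * memoryFraction).toLong
}
```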

import com.linkedin.drelephant.spark.fetchers.statusapiv1.ExecutorSummary

/**
* A heuristic based on the driver's configurations and memory usage. It checks whether the specified configuration values are within the threshold range. It also analyses the peak JVM memory used, unified memory, and time spent in GC by the job.


Could this comment be broken into multiple lines to make it easier to read? Is there a max character limit for Dr. Elephant code? Same for the following code.


override def getHeuristicConfData(): HeuristicConfigurationData = heuristicConfigurationData

lazy val peakUnifiedMemoryThresholdString: String = heuristicConfigurationData.getParamMap.get(PEAK_UNIFIED_MEMORY_THRESHOLD_KEY)


As noted earlier, let's skip the check for driver unified memory.

if (peakUnifiedMemoryExecutorSeverity.getValue > severityPeakUnifiedMemoryVariable.getValue) {
severityPeakUnifiedMemoryVariable = peakUnifiedMemoryExecutorSeverity

lazy val severity: Severity = if (sparkExecutorMemory <= MemoryFormatUtils.stringToBytes("2G")) {


This should use the constant declared for JVM used memory.

DEFAULT_MAX_EXECUTOR_PEAK_JVM_USED_MEMORY_THRESHOLDS
} else {
SeverityThresholds.parse(jvmUsedMemoryHeuristic.executorPeakJvmMemoryThresholdString.split(",").map(_.toDouble * (maxExecutorPeakJvmUsedMemory + reservedMemory)).mkString(","), ascending = false).getOrElse(DEFAULT_MAX_EXECUTOR_PEAK_JVM_USED_MEMORY_THRESHOLDS)
}

val MAX_DRIVER_PEAK_JVM_USED_MEMORY_THRESHOLDS : SeverityThresholds = if(jvmUsedMemoryHeuristic.driverPeakJvmMemoryThresholdString == null) {
DEFAULT_MAX_DRIVER_PEAK_JVM_USED_MEMORY_THRESHOLDS
lazy val severity = if (sparkExecutorMemory <= MemoryFormatUtils.stringToBytes("2G")) {


Please declare "2G" as a constant, and add some comments.
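The suggestion above could look roughly like this sketch; `ExecutorMemoryConstants` and its member names are hypothetical, chosen only to show the literal being named and documented:

```scala
// Sketch of the reviewer's suggestion: name the "2G" boundary instead of
// repeating the string literal, so the small-executor special case is
// self-documenting. Names here are illustrative, not from the PR.
object ExecutorMemoryConstants {
  // Executors at or below this heap size fall back to the default thresholds.
  val SMALL_EXECUTOR_MEMORY_THRESHOLD: Long = 2L * 1024 * 1024 * 1024 // "2G"

  def isSmallExecutor(sparkExecutorMemoryBytes: Long): Boolean =
    sparkExecutorMemoryBytes <= SMALL_EXECUTOR_MEMORY_THRESHOLD
}
```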

import scala.collection.JavaConverters
import scala.concurrent.duration.Duration

class DriverHeuristicTest extends FunSpec with Matchers {


Please add some comments.

<p>This is a heuristic for checking whether driver is well tuned and the configurations are set to a good value.</p>
<p>It checks the following properties</p>
<h4>Driver Max Peak JVM Used Memory</h4>
<p>This is to analyse whether the driver memory is set to a good value. To avoid wasted memory, it checks if the peak JVM used memory by the driver is reasonably close to the blocked driver memory which is specified in spark.driver.memory. If the peak JVM memory is much smaller, then the driver memory should be reduced.</p>
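One way such a check could be implemented (an illustrative sketch, not the PR's code) is to grade the allocated-to-peak ratio against multipliers like the `driver_peak_jvm_memory_threshold` values (1.25, 1.5, 2, 3) quoted earlier in this thread:

```scala
// Sketch: flag wasted driver memory when the peak JVM used memory is far
// below the allocated spark.driver.memory. Function name and threshold
// handling are hypothetical; multipliers echo this PR's params.
def driverMemorySeverity(allocatedBytes: Long, peakJvmUsedBytes: Long): Int = {
  val thresholds = Seq(1.25, 1.5, 2.0, 3.0) // allocated-to-peak ratio cutoffs
  if (peakJvmUsedBytes <= 0) 0
  else thresholds.count(m => allocatedBytes.toDouble / peakJvmUsedBytes >= m)
}
```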


Change "This is to analyse" to "This analyses"

@@ -192,7 +192,12 @@ object StagesHeuristic {
}

private def averageExecutorRuntimeAndSeverityOf(stageData: StageData): (Long, Severity) = {
val averageExecutorRuntime = stageData.executorRunTime / executorInstances
val allTasks : Int = if((stageData.numActiveTasks + stageData.numCompleteTasks + stageData.numFailedTasks) == 0) {


Another option would be max((stageData.numActiveTasks + stageData.numCompleteTasks + stageData.numFailedTasks), 1)
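The `max(...)` suggestion avoids a divide-by-zero when a stage reports no tasks. A minimal sketch, with a hypothetical `StageCounts` case class standing in for `StageData`:

```scala
// Sketch of the reviewer's suggestion: floor the task count at 1 so that
// per-task averages never divide by zero. StageCounts is a stand-in for
// the StageData fields used in the diff above.
case class StageCounts(numActiveTasks: Int, numCompleteTasks: Int, numFailedTasks: Int)

def allTasks(stage: StageCounts): Int =
  math.max(stage.numActiveTasks + stage.numCompleteTasks + stage.numFailedTasks, 1)
```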

}

//peakJvmMemory calculations


Extra newline.

}

//Gc Calculations


Extra newline.

* A heuristic based on the driver's configurations and memory usage. It checks whether the specified configuration values are within the threshold range. It also analyses the peak JVM memory used, unified memory, and time spent in GC by the job.
*
*/
class DriverHeuristic(private val heuristicConfigurationData: HeuristicConfigurationData)


This heuristic contains multiple checks. Is it possible to add some information about which checks are problematic? Another option could be to split it into separate heuristics.

@@ -105,18 +105,24 @@ class ConfigurationHeuristic(private val heuristicConfigurationData: HeuristicCo
result.addResultDetail(SPARK_SHUFFLE_SERVICE_ENABLED, formatProperty(evaluator.isShuffleServiceEnabled.map(_.toString)),
"Spark shuffle service is not enabled.")
}
if (evaluator.severityMinExecutors == Severity.CRITICAL) {
if (evaluator.severityMinExecutors != Severity.NONE) {
result.addResultDetail("Minimum Executors", "The minimum executors for Dynamic Allocation should be <=1. Please change it in the " + SPARK_DYNAMIC_ALLOCATION_MIN_EXECUTORS + " field.")


Please change to "should be = 1". We don't want users to set min executors to 0 or negative values. Also, please use constant THRESHOLD_MIN_EXECUTORS instead of hard-coding the 1, in case of changes later.
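Constant-based messages of the kind suggested here (and in the matching comment below about the 900 maximum) could look roughly like this sketch; the `THRESHOLD_MIN_EXECUTORS` / `THRESHOLD_MAX_EXECUTORS` names come from the review, while the object name and config-key parameter are hypothetical:

```scala
// Sketch: build the user-facing messages from named threshold constants
// instead of hard-coding 1 and 900, so future changes stay in one place.
object ExecutorThresholdMessages {
  val THRESHOLD_MIN_EXECUTORS = 1
  val THRESHOLD_MAX_EXECUTORS = 900

  def minExecutorsMessage(configKey: String): String =
    s"The minimum executors for Dynamic Allocation should be = $THRESHOLD_MIN_EXECUTORS. " +
      s"Please change it in the $configKey field."

  def maxExecutorsMessage(configKey: String): String =
    s"The maximum executors for Dynamic Allocation should be <= $THRESHOLD_MAX_EXECUTORS. " +
      s"Please change it in the $configKey field."
}
```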

result.addResultDetail("Minimum Executors", "The minimum executors for Dynamic Allocation should be <=1. Please change it in the " + SPARK_DYNAMIC_ALLOCATION_MIN_EXECUTORS + " field.")
}
if (evaluator.severityMaxExecutors == Severity.CRITICAL) {
if (evaluator.severityMaxExecutors != Severity.NONE) {
result.addResultDetail("Maximum Executors", "The maximum executors for Dynamic Allocation should be <=900. Please change it in the " + SPARK_DYNAMIC_ALLOCATION_MAX_EXECUTORS + " field.")


Please use constant THRESHOLD_MAX_EXECUTORS instead of hard-coding 900.

result.addResultDetail("Executor Overhead Memory", "Please do not specify excessive amount of overhead memory for Executors. Change it in the field " + SPARK_YARN_EXECUTOR_MEMORY_OVERHEAD)
}
if(evaluator.severityExecutorCores != Severity.NONE) {
result.addResultDetail("Executor cores", "The number of executor cores should be <=4. Please change it in the field " + SPARK_EXECUTOR_CORES_KEY)


Could the comment for executor cores also use a constant, or take the value from the configurations?

<p>It checks the following properties</p>
<h4>Driver Max Peak JVM Used Memory</h4>
<p>This is to analyse whether the driver memory is set to a good value. To avoid wasted memory, it checks if the peak JVM used memory by the driver is reasonably close to the blocked driver memory which is specified in spark.driver.memory. If the peak JVM memory is much smaller, then the driver memory should be reduced.</p>
<h4>Driver Max Unified Memory</h4>
<p>If the driver's Peak Unified Memory Consumption is much smaller than the allocated Unified Memory space, then we recommend decreasing the allocated Unified Memory Region. You can try decreasing <i>spark.driver.memory</i> which would decrease the unified memory space.</p>


What is "blocked driver memory"? Could this just be "driver memory"?

@@ -18,3 +18,4 @@ <h4>Executor Max Peak JVM Used Memory</h4>
<p>This is to analyse whether the executor memory is set to a good value. To avoid wasted memory, it checks if the peak JVM used memory by the executor is reasonably close to the blocked executor memory which is specified in spark.executor.memory. If the peak JVM memory is much smaller, then the executor memory should be reduced.</p>
<h4>Driver Max Peak JVM Used Memory</h4>
<p>This is to analyse whether the driver memory is set to a good value. To avoid wasted memory, it checks if the peak JVM used memory by the driver is reasonably close to the blocked driver memory which is specified in spark.driver.memory. If the peak JVM memory is much smaller, then the driver memory should be reduced.</p>


Similarly, there is a reference to "blocked driver memory". Could you please explain blocked memory, or change to "driver memory"?

@@ -18,3 +18,4 @@ <h4>Executor Max Peak JVM Used Memory</h4>
<p>This is to analyse whether the executor memory is set to a good value. To avoid wasted memory, it checks if the peak JVM used memory by the executor is reasonably close to the blocked executor memory which is specified in spark.executor.memory. If the peak JVM memory is much smaller, then the executor memory should be reduced.</p>


This has a reference to "blocked executor memory." Could you please explain blocked memory, or change to "executor memory"?

@skakker skakker changed the title Custom shs work Spark Heuristics for Dr. Elephant Feb 5, 2018
@skakker
Contributor Author

skakker commented Feb 5, 2018

The comments have been addressed.

@akshayrai
Contributor

Thanks for adding the tests @skakker.

@edwinalu, I hope you have finished the review. I am merging these changes.

@akshayrai akshayrai merged commit f41b136 into linkedin:customSHSWork Feb 5, 2018
akshayrai pushed a commit that referenced this pull request Feb 21, 2018
Separated out Driver checks into a separate Driver Metrics heuristic: Checks driver configurations, driver GC time and JVM used memory.
akshayrai pushed a commit that referenced this pull request Feb 27, 2018
akshayrai pushed a commit that referenced this pull request Mar 6, 2018
arpang pushed a commit to arpang/dr-elephant that referenced this pull request Mar 14, 2018
akshayrai pushed a commit that referenced this pull request Mar 19, 2018
akshayrai pushed a commit that referenced this pull request Mar 19, 2018
akshayrai pushed a commit that referenced this pull request Mar 30, 2018
akshayrai pushed a commit that referenced this pull request Apr 6, 2018
akshayrai pushed a commit that referenced this pull request May 21, 2018
pralabhkumar pushed a commit to pralabhkumar/dr-elephant that referenced this pull request Aug 31, 2018
varunsaxena pushed a commit that referenced this pull request Oct 16, 2018