Spark metrics aggregator fix #237
Conversation
Sometimes the endTime in the applicationInfo is 0, most likely because the ApplicationEnd event was never fired. This fixes the MetricsAggregator to handle that case by setting the duration to 0. Setting an appropriate endTime in the fetcher will be handled in a later PR.
I think this PR is also related to another issue, #227. That issue is about collecting data from the SHS only once an application has completed; fixing it would also address this problem correctly. I'm not sure that setting the aggregated value to ZERO is the right solution.
@shkhrgpt @superbobry As I mentioned in the description, this doesn't attempt to fix the data issue for incomplete jobs. It is only a temporary fix so that those jobs get analyzed as well, instead of their records being skipped due to a database error on the negative value. As I said in the description, the real fix would require a separate PR. It looks like @superbobry is working on fixing it for the REST fetcher; if that works out, we can do the same for the FS fetcher.
@shankar37 As a temporary fix, I think it's fine to set aggregated values to ZERO if the application completion time is negative. However, I think there may be a logical bug in the implementation: we only set resource used to ZERO when […]. An easy implementation would be to exit the method after line 53 if […]
Fixed. Take a look.
case false => 0.0
if( applicationDurationMillis < 0) {
  logger.warn(s"applicationDurationMillis is negative. Skipping Metrics Aggregation:${applicationDurationMillis}")
} else {
Maybe you can remove the else statement if you put a return statement in the if block?
Not sure what you mean. The function returns Unit.
Oh damn. My mistake, I got back to the Java world. Ignore it please.
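For readers following along, the early-return shape discussed here can be sketched as follows. This is a hypothetical, simplified stand-in for the aggregator method (names like `AggregatorGuardSketch` and `aggregate` are illustrative, not the actual Dr. Elephant code), showing how a `return` in the guard removes the need for the `else` block in a `Unit`-style flow:

```scala
// Minimal sketch, assuming a method that can bail out early on bad input.
// Not the real SparkMetricsAggregator; names are hypothetical.
object AggregatorGuardSketch {
  // Returns None when aggregation is skipped, Some(durationSeconds) otherwise.
  def aggregate(applicationDurationMillis: Long): Option[Double] = {
    if (applicationDurationMillis < 0) {
      // In the real code this would be logger.warn(...)
      println(s"applicationDurationMillis is negative. Skipping Metrics Aggregation:$applicationDurationMillis")
      return None // early return replaces wrapping the remainder in an else
    }
    // The rest of the method stays un-nested.
    Some(applicationDurationMillis / 1000.0)
  }
}
```

As the thread concludes, either shape works in Scala; the early return just keeps the happy path at the top indentation level.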
case false => 0.0
}
//allocated is the total used resource from the cluster.
if (resourcesAllocatedForUse.isValidLong && resourcesAllocatedForUse >= 0) {
I think we shouldn't check resourcesAllocatedForUse >= 0, because it's better to get an error if it's less than ZERO than to silently ignore it.
Agree. Done.
if (resourcesAllocatedForUse.isValidLong && resourcesAllocatedForUse >= 0) {
  hadoopAggregatedData.setResourceUsed(resourcesAllocatedForUse.toLong)
}
if( resourcesWastedMBSeconds >= 0.0) {
I think we shouldn't check resourcesWastedMBSeconds >= 0, because it's better to get an error if it's less than ZERO than to silently ignore it.
Agree. Done.
//allocated is the total used resource from the cluster.
if (resourcesAllocatedForUse.isValidLong) {
  hadoopAggregatedData.setResourceUsed(resourcesAllocatedForUse.toLong)
} else {
Why did we remove this logging line?
By mistake. Added it back.
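The `isValidLong` guard quoted above can be illustrated with a small self-contained sketch. The helper name and the logging-on-overflow behavior are hypothetical stand-ins for the aggregator's logic (setting the value only when it fits in a `Long`, and warning otherwise, per the restored logging line):

```scala
// Sketch of the isValidLong guard, assuming resourcesAllocatedForUse is a
// scala.math.BigDecimal. Names are illustrative, not the actual code.
object ResourceUsedSketch {
  // Returns Some(value) only when the aggregate fits in a Long;
  // logs and returns None otherwise.
  def resourceUsed(resourcesAllocatedForUse: BigDecimal): Option[Long] =
    if (resourcesAllocatedForUse.isValidLong) {
      Some(resourcesAllocatedForUse.toLong)
    } else {
      // Stand-in for the logging line discussed above.
      println(s"resourcesAllocatedForUse does not fit in a Long: $resourcesAllocatedForUse")
      None
    }
}
```

`isValidLong` is the standard `scala.math.BigDecimal` check for whether the value round-trips through `Long` without loss, which is why the separate `>= 0` check could be dropped.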
LGTM.
Thanks for the review, @shkhrgpt.
* Fix SparkMetricsAggregator to not produce negative ResourceUsage

Sometimes the endTime is 0 in the applicationInfo, most likely because the ApplicationEnd event was never fired. Fix the MetricsAggregator to handle this by setting the duration to 0 in those cases. Setting an appropriate endTime in the fetcher will be handled in a later PR.
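The endTime handling described in this commit message can be sketched as follows. The function name and signature are hypothetical; the point is that when ApplicationEnd never fires, endTime stays 0 and `endTime - startTime` would go negative, so the duration is forced to 0 instead:

```scala
// Hedged sketch of the zero-endTime handling, with illustrative names.
object DurationSketch {
  // endTime == 0 means ApplicationEnd was never fired; clamp duration to 0
  // rather than producing a large negative value.
  def applicationDurationMillis(startTime: Long, endTime: Long): Long =
    if (endTime == 0L) 0L else endTime - startTime
}
```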