Spark metrics aggregator fix #237
Conversation
Sometimes the endTime in the applicationInfo is 0, most likely because the ApplicationEnd event was never fired. This fixes the MetricsAggregator to handle that case by setting the duration to 0. Setting an appropriate endTime in the fetcher will be handled in a later PR.
I think this PR is also related to another issue, #227. That issue is about collecting data from the SHS only once an application has completed; fixing it would also address this problem correctly. I'm not sure that setting the aggregated value to ZERO is the right solution.
@shkhrgpt @superbobry As I mentioned in the description, this doesn't attempt to fix the data issue for incomplete jobs. It is only a temporary fix so that those jobs get analyzed as well, instead of their records being skipped due to a database error on the negative value. As I said in the description, the real fix would require a separate PR. It looks like @superbobry is working on fixing it for the REST fetcher; if that works out, we can do the same for the FS fetcher.
@shankar37 As a temporary fix, I think it's fine to set aggregated values to ZERO if the application completion time is negative. However, I think there may be a logical bug in the implementation: we only set resource used to ZERO when […]. An easy implementation would be to exit the method after line 53 if […]
Fixed. Take a look.
case false => 0.0
if( applicationDurationMillis < 0) {
  logger.warn(s"applicationDurationMillis is negative. Skipping Metrics Aggregation:${applicationDurationMillis}")
} else {
Maybe you can remove the else statement if you put a return statement in the if block?
Not sure what you mean. The function returns Unit.
Oh damn. My mistake, I got back to the Java world. Ignore it please.
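For readers following along, the early-return shape discussed here can be sketched as follows. This is a hypothetical, simplified stand-in for the aggregator method (names like `AggregatorGuardSketch` and `aggregate` are illustrative, not the actual Dr. Elephant code), showing how a `return` in the guard removes the need for the `else` block in a `Unit`-style flow:

```scala
// Minimal sketch, assuming a method that can bail out early on bad input.
// Not the real SparkMetricsAggregator; names are hypothetical.
object AggregatorGuardSketch {
  // Returns None when aggregation is skipped, Some(durationSeconds) otherwise.
  def aggregate(applicationDurationMillis: Long): Option[Double] = {
    if (applicationDurationMillis < 0) {
      // In the real code this would be logger.warn(...)
      println(s"applicationDurationMillis is negative. Skipping Metrics Aggregation:$applicationDurationMillis")
      return None // early return replaces wrapping the remainder in an else
    }
    // The rest of the method stays un-nested.
    Some(applicationDurationMillis / 1000.0)
  }
}
```

As the thread concludes, either shape works in Scala; the early return just keeps the happy path at the top indentation level.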
case false => 0.0
}
//allocated is the total used resource from the cluster.
if (resourcesAllocatedForUse.isValidLong && resourcesAllocatedForUse >= 0) {
I think we shouldn't check resourcesAllocatedForUse >= 0, because it's better to get an error if it's less than ZERO than to silently ignore it.
Agree. Done.
if (resourcesAllocatedForUse.isValidLong && resourcesAllocatedForUse >= 0) {
  hadoopAggregatedData.setResourceUsed(resourcesAllocatedForUse.toLong)
}
if( resourcesWastedMBSeconds >= 0.0) {
I think we shouldn't check resourcesWastedMBSeconds >= 0, because it's better to get an error if it's less than ZERO than to silently ignore it.
Agree. Done.
//allocated is the total used resource from the cluster.
if (resourcesAllocatedForUse.isValidLong) {
  hadoopAggregatedData.setResourceUsed(resourcesAllocatedForUse.toLong)
} else {
Why did we remove this logging line?
By mistake. Added it back.
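The `isValidLong` guard quoted above can be illustrated with a small self-contained sketch. The helper name and the logging-on-overflow behavior are hypothetical stand-ins for the aggregator's logic (setting the value only when it fits in a `Long`, and warning otherwise, per the restored logging line):

```scala
// Sketch of the isValidLong guard, assuming resourcesAllocatedForUse is a
// scala.math.BigDecimal. Names are illustrative, not the actual code.
object ResourceUsedSketch {
  // Returns Some(value) only when the aggregate fits in a Long;
  // logs and returns None otherwise.
  def resourceUsed(resourcesAllocatedForUse: BigDecimal): Option[Long] =
    if (resourcesAllocatedForUse.isValidLong) {
      Some(resourcesAllocatedForUse.toLong)
    } else {
      // Stand-in for the logging line discussed above.
      println(s"resourcesAllocatedForUse does not fit in a Long: $resourcesAllocatedForUse")
      None
    }
}
```

`isValidLong` is the standard `scala.math.BigDecimal` check for whether the value round-trips through `Long` without loss, which is why the separate `>= 0` check could be dropped.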
LGTM.
Thanks for the review, @shkhrgpt.
* Fix SparkMetricsAggregator to not produce negative ResourceUsage

Sometimes the endTime is 0 in the applicationInfo, most likely because the ApplicationEnd event was never fired. Fix the MetricsAggregator to handle this by setting the duration to 0 in those cases. Setting an appropriate endTime in the fetcher will be handled in a later PR.
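The endTime handling described in this commit message can be sketched as follows. The function name and signature are hypothetical; the point is that when ApplicationEnd never fires, endTime stays 0 and `endTime - startTime` would go negative, so the duration is forced to 0 instead:

```scala
// Hedged sketch of the zero-endTime handling, with illustrative names.
object DurationSketch {
  // endTime == 0 means ApplicationEnd was never fired; clamp duration to 0
  // rather than producing a large negative value.
  def applicationDurationMillis(startTime: Long, endTime: Long): Long =
    if (endTime == 0L) 0L else endTime - startTime
}
```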