
Spark metrics aggregator fix #237

Merged 4 commits into linkedin:master on Apr 18, 2017

Conversation

shankar37
Contributor

Sometimes the endtime is 0 in the applicationInfo, most likely because the ApplicationEnd event was never fired. This fixes MetricsAggregator to handle that case by setting the duration to 0. We still need to handle this in the fetcher so it sets an appropriate endtime; that will come in a later PR.
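
For illustration, a minimal sketch of the guard described above (this is not the actual SparkMetricsAggregator code; startTimeMillis and endTimeMillis are assumed names for the values read from the applicationInfo):

// Minimal sketch only; field names are assumptions, not the real aggregator API.
def applicationDurationMillis(startTimeMillis: Long, endTimeMillis: Long): Long =
  // If the ApplicationEnd event never fired, the recorded end time is 0,
  // which would otherwise produce a large negative duration.
  if (endTimeMillis <= 0L) 0L else endTimeMillis - startTimeMillis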

shankar added 2 commits April 14, 2017 18:39
@shkhrgpt
Contributor

I think this PR is also related to another issue, #227. That issue is about collecting data from SHS for an application only once the application has completed. Fixing that issue would also address this problem correctly. I'm not sure that making the aggregated value ZERO is the right solution.

@shankar37
Contributor Author

@shkhrgpt @superbobry As I mentioned in the description, this doesn't attempt to fix the data issue for incomplete jobs. It is only a temporary fix so those jobs get analyzed as well, instead of their records being skipped because of a database error on the negative value. As I said in the description, the real fix would require a separate PR. It looks like @superbobry is working on fixing it for the Rest Fetcher; if that works out, we can do the same for the FS Fetcher.

@shkhrgpt
Contributor

@shankar37 As a temporary fix, I think it's fine to make the aggregated values ZERO when the application completion time is negative.

However, I think there may be a logical bug in the implementation. It only sets resources used to ZERO when applicationDurationMillis is negative; since resourcesWastedMBSeconds also depends on resourcesAllocatedForUse, resources wasted should be set to ZERO in that case as well.

An easy implementation would be to exit the method after line 53 if applicationDurationMillis is negative.
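
A rough sketch of that early exit (the method signature here is an assumption for illustration, not the exact aggregator code, and println stands in for the logger used in the PR):

// Illustrative sketch of the suggested early exit; names are assumptions.
def aggregateMetrics(applicationDurationMillis: Long): Unit = {
  if (applicationDurationMillis < 0) {
    // Bail out before computing either value, so neither resources used nor
    // resources wasted is derived from a bogus duration.
    println(s"applicationDurationMillis is negative. Skipping metrics aggregation: $applicationDurationMillis")
    return
  }
  // ... compute resourcesAllocatedForUse and resourcesWastedMBSeconds here ...
}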

@shankar37
Contributor Author

Fixed. Take a look.

case false => 0.0
if (applicationDurationMillis < 0) {
  logger.warn(s"applicationDurationMillis is negative. Skipping Metrics Aggregation: ${applicationDurationMillis}")
} else {
Contributor
Maybe you can remove the else statement if you put a return statement in the if block?

Contributor (Author)

Not sure what you mean. The function returns Unit.

Contributor
Oh damn, my mistake: I slipped back into the Java world. Please ignore it.

case false => 0.0
}
//allocated is the total used resource from the cluster.
if (resourcesAllocatedForUse.isValidLong && resourcesAllocatedForUse >= 0) {
Contributor
I think we shouldn't check resourcesAllocatedForUse >= 0 because it's good to get an error if it's less than ZERO, rather than ignoring it.

Contributor (Author)
Agree. Done.

if (resourcesAllocatedForUse.isValidLong && resourcesAllocatedForUse >= 0) {
  hadoopAggregatedData.setResourceUsed(resourcesAllocatedForUse.toLong)
}
if (resourcesWastedMBSeconds >= 0.0) {
Contributor
I think we shouldn't check resourcesWastedMBSeconds >= 0 because it's good to get an error if it's less than ZERO, rather than ignoring it.

Contributor (Author)
Agree. Done.

// allocated is the total used resource from the cluster.
if (resourcesAllocatedForUse.isValidLong) {
  hadoopAggregatedData.setResourceUsed(resourcesAllocatedForUse.toLong)
} else {
Contributor
Why did we remove this logging line?

Contributor (Author)
By mistake. Added it back.

@shkhrgpt
Contributor

LGTM.
Thanks for addressing my comments.

@akshayrai akshayrai merged commit b7e04ab into linkedin:master Apr 18, 2017
@akshayrai
Contributor

Thanks for the review @shkhrgpt.

@shankar37 shankar37 deleted the spark-metrics-aggregator-fix branch April 19, 2017 09:20
skakker pushed a commit to skakker/dr-elephant that referenced this pull request Dec 14, 2017
* Fix SparkMetricsAggregator to not produce negative ResourceUsage