Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unit to RuntimeMetric #17737

Merged
merged 1 commit into from
May 18, 2022
Merged

Add unit to RuntimeMetric #17737

merged 1 commit into from
May 18, 2022

Conversation

zacw7
Copy link
Member

@zacw7 zacw7 commented May 6, 2022

Resolves #17633
Add unit to RuntimeMetric:

"runtimeStats": {
  "getTableHandleTimeNanos": {
    "name": "getTableHandleTimeNanos",
    "unit": "NANO",
    "sum": 42914,
    "count": 1,
    "max": 42914,
    "min": 42914
  },
  "S1-storageReadDataBytes": {
    "name": "S1-storageReadDataBytes",
    "unit": "BYTE",
    "...": "..."
  },
  "...": "..."
}

Allowed units so far (aligned with Velox):

NONE, NANO, BYTE
== RELEASE NOTES ==

General Changes
* Add the support for setting units of runtime metrics.

@zacw7 zacw7 requested a review from a team as a code owner May 6, 2022 15:36
@aweisberg
Copy link
Contributor

Is this going to make verifier not work still? I am asking because we rely on it for release verification.

@zacw7
Copy link
Member Author

zacw7 commented May 6, 2022

Is this going to make verifier not work still? I am asking because we rely on it for release verification.

Yes :( But we are not planning to land this in the current release since it's too close to the release cut. Will try to figure it out how to handle the verifier failure next week.

@highker highker requested a review from kewang1024 May 8, 2022 08:04
Copy link
Collaborator

@kewang1024 kewang1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explain the motivation behind the change from map to list and what's the benefit ?

@zacw7 zacw7 force-pushed the metric-unit branch 2 times, most recently from 560c459 to c81d92a Compare May 12, 2022 20:35
@zacw7 zacw7 requested a review from kewang1024 May 12, 2022 22:08
@ThriftEnum
public enum RuntimeUnit
{
NONE(0), NANO(1), BYTE(2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can we have one more TIME to represent how many times
  2. Can we remove NONE, why do we need it?

Copy link
Member Author

@zacw7 zacw7 May 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have one more TIME to represent how many times

You mean use TIME for the counting metrics like driverCountPerTask or fragmentResultCacheHitCount? I think it might be a little be confusing. TIME looks like a time unit - like seconds/minutes. Also the current setup is aligned with Velox. I think it's good for start.

Can we remove NONE, why do we need it?

I would suggest keeping it:

  1. For compatibility reasons, we are introducing this unit to runtime metrics and there are many metrics haven't been set up with units yet, we would like those metrics to have a default fall-back value NONE.
  2. For some metrics like optimizedWithMaterializedView ( 1 means the query is rewritten to read from available materialized views to improve the performance), it does make much sense to set a unit to them, so let's keep it as NONE.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, can we

  1. Link the Velox file in the description as well since those two have to be compatible with each other
  2. Can we add an issue or task as follow up to slowly deprecate the NONE and adopt meaningful units
  3. As for TIME, I understand the confusion (I would add a comment to avoid confusion), but also hard to think of any other names :) , feel free to change to a name more appropriate, let's include this unit as an option in the follow up task/issue as well

@@ -30,6 +31,7 @@
public class RuntimeMetric
{
private final String name;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the name here since it's duplicate information?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kinda cautious to remove the existing field from the protocol. Some components might rely on it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the Velox definition doesn't have it, so to make it more consistent, can we at least check what components are relying on it and would it be easy to remove?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this is another breaking change. The main reason we choose to keep the map structure instead of converting into array is that we cannot remove the existing information.

@JsonProperty("sum") long sum,
@JsonProperty("count") long count,
@JsonProperty("max") long max,
@JsonProperty("min") long min)
{
this(name);
this(name, unit);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the older version would pass in null as the unit, would it throw exceptions because of the requireNonNull check in construction function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be checked inside this(name, unit).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and it throws exception, so wouldn't the new version of presto that has this change be incompatible with older version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got your point. I've tested it, it works fine using the old presto cli.

@zacw7
Copy link
Member Author

zacw7 commented May 13, 2022

Hey @kewang1024, thanks for reviewing the PR. I've addressed most of the comments and removed the error-prone methods like addMetricValue(String name, long value) so unit must be specified. For now, most of the metrics are just set with NONE. There is another work planned for Eric to assign the right units to the metrics and to make better use of the units.

@zacw7 zacw7 requested a review from kewang1024 May 13, 2022 20:35
@kewang1024
Copy link
Collaborator

Hey @kewang1024, thanks for reviewing the PR. I've addressed most of the comments and removed the error-prone methods like addMetricValue(String name, long value) so unit must be specified. For now, most of the metrics are just set with NONE. There is another work planned for Eric to assign the right units to the metrics and to make better use of the units.

Thanks! For the work planned, can we create another task or issue on github for better tracking?

@zacw7 zacw7 removed the request for review from achopragh May 16, 2022 17:23
@zacw7
Copy link
Member Author

zacw7 commented May 16, 2022

Hey @kewang1024, thanks for reviewing the PR. I've addressed most of the comments and removed the error-prone methods like addMetricValue(String name, long value) so unit must be specified. For now, most of the metrics are just set with NONE. There is another work planned for Eric to assign the right units to the metrics and to make better use of the units.

Thanks! For the work planned, can we create another task or issue on github for better tracking?

#17770

Copy link
Collaborator

@kewang1024 kewang1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a second look

  1. The reason why it adds a new field but it's not breaking when interacting with old version presto client is that the data is sent one-way from coordinator to client (thus adding RuntimeUnit and having its non-null check is fine)

  2. Why we can't remove redundant name from RuntimeMetric right now is presto client is reading this information
    StatusPrinter: stats.getRuntimeStats().getMetrics().values().stream().sorted(Comparator.comparing(RuntimeMetric::getName))

    However, this would cause a potential bug when integrating with Velox since Velox doesn't have name in RuntimeMetric
    So when data is collected from c++ worker to java coordinator, name would be missing from RuntimeMetric

    Suggestion: 1) make changes to StatusPrinter to use name from the map instead of from RuntimeMetric 2) remove name from RuntimeMetric to be consistent with Velox

    I'm fine with either doing it in this PR or in the future PR (create an issue for it). Thoughts? @yuanzhanhku

@kewang1024 kewang1024 requested review from highker and yuanzhanhku and removed request for highker May 17, 2022 06:56
@@ -85,6 +90,13 @@ public String getName()
return name;
}

@JsonProperty
@ThriftField(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason to reuse an existing thrift id?
When adding a new field in Thrift, we should use a new id. Otherwise, existing serialized Thrift struct cannot be parsed. This will also cause compatibility issue if the coordinator and the worker run different versions.

@@ -82,7 +83,7 @@ public Optional<Long> getFileSize()
public void incrementCounter(String name, long value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to add unit to this function and migrate existing usages.
This can be done in a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Created an issue to keep track of it - #17777

mergedRuntimeStats.addMetricValueIgnoreZero(RuntimeMetricName.TASK_BLOCKED_TIME_NANOS, taskStats.getTotalBlockedTimeInNanos());
mergedRuntimeStats.addMetricValue(DRIVER_COUNT_PER_TASK, NONE, taskStats.getTotalDrivers());
mergedRuntimeStats.addMetricValue(TASK_ELAPSED_TIME_NANOS, NONE, taskStats.getElapsedTimeInNanos());
mergedRuntimeStats.addMetricValueIgnoreZero(TASK_QUEUED_TIME_NANOS, NONE, taskStats.getQueuedTimeInNanos());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to change the units in a separate PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@zacw7 zacw7 requested a review from yuanzhanhku May 17, 2022 22:46
@kewang1024 kewang1024 requested a review from highker May 17, 2022 23:58
@highker highker merged commit 705889d into prestodb:master May 18, 2022
@zacw7 zacw7 deleted the metric-unit branch May 18, 2022 18:50
@rschlussel
Copy link
Contributor

@zacw7 this is causing failures for verifier. What's the plan here?

@zacw7
Copy link
Member Author

zacw7 commented Jun 3, 2022

@zacw7 this is causing failures for verifier. What's the plan here?

It seems it only affects Sapphire verifier, let me test if relaxing the null check can prevent the error.

@zacw7
Copy link
Member Author

zacw7 commented Jun 4, 2022

@zacw7 this is causing failures for verifier. What's the plan here?

Fixed in #17840

@highker highker mentioned this pull request Jul 6, 2022
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add unit to RuntimeMetric
6 participants