-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add unit to RuntimeMetric #17737
Add unit to RuntimeMetric #17737
Conversation
Is this going to make verifier not work still? I am asking because we rely on it for release verification. |
Yes :( But we are not planning to land this in the current release since it's too close to the release cut. Will try to figure it out how to handle the verifier failure next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we explain the motivation behind the change from map to list and what's the benefit ?
presto-common/src/main/java/com/facebook/presto/common/RuntimeMetric.java
Outdated
Show resolved
Hide resolved
presto-common/src/main/java/com/facebook/presto/common/RuntimeMetricKey.java
Outdated
Show resolved
Hide resolved
presto-common/src/main/java/com/facebook/presto/common/RuntimeMetricKey.java
Outdated
Show resolved
Hide resolved
560c459
to
c81d92a
Compare
@ThriftEnum | ||
public enum RuntimeUnit | ||
{ | ||
NONE(0), NANO(1), BYTE(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can we have one more
TIME
to represent how many times - Can we remove
NONE
, why do we need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have one more TIME to represent how many times
You mean use TIME
for the counting metrics like driverCountPerTask
or fragmentResultCacheHitCount
? I think it might be a little be confusing. TIME
looks like a time unit - like seconds/minutes. Also the current setup is aligned with Velox. I think it's good for start.
Can we remove NONE, why do we need it?
I would suggest keeping it:
- For compatibility reasons, we are introducing this unit to runtime metrics and there are many metrics haven't been set up with units yet, we would like those metrics to have a default fall-back value
NONE
. - For some metrics like
optimizedWithMaterializedView
( 1 means the query is rewritten to read from available materialized views to improve the performance), it does make much sense to set a unit to them, so let's keep it asNONE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, can we
- Link the Velox file in the description as well since those two have to be compatible with each other
- Can we add an issue or task as follow up to slowly deprecate the NONE and adopt meaningful units
- As for
TIME
, I understand the confusion (I would add a comment to avoid confusion), but also hard to think of any other names :) , feel free to change to a name more appropriate, let's include this unit as an option in the follow up task/issue as well
presto-common/src/main/java/com/facebook/presto/common/RuntimeStats.java
Outdated
Show resolved
Hide resolved
presto-common/src/main/java/com/facebook/presto/common/RuntimeStats.java
Outdated
Show resolved
Hide resolved
@@ -30,6 +31,7 @@ | |||
public class RuntimeMetric | |||
{ | |||
private final String name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need the name
here since it's duplicate information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kinda cautious to remove the existing field from the protocol. Some components might rely on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the Velox definition doesn't have it, so to make it more consistent, can we at least check what components are relying on it and would it be easy to remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid this is another breaking change. The main reason we choose to keep the map structure instead of converting into array is that we cannot remove the existing information.
@JsonProperty("sum") long sum, | ||
@JsonProperty("count") long count, | ||
@JsonProperty("max") long max, | ||
@JsonProperty("min") long min) | ||
{ | ||
this(name); | ||
this(name, unit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the older version would pass in null as the unit
, would it throw exceptions because of the requireNonNull
check in construction function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be checked inside this(name, unit)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and it throws exception, so wouldn't the new version of presto that has this change be incompatible with older version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got your point. I've tested it, it works fine using the old presto cli.
Hey @kewang1024, thanks for reviewing the PR. I've addressed most of the comments and removed the error-prone methods like |
Thanks! For the work planned, can we create another task or issue on github for better tracking? |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a second look
-
The reason why it adds a new field but it's not breaking when interacting with old version presto client is that the data is sent one-way from coordinator to client (thus adding
RuntimeUnit
and having its non-null check is fine) -
Why we can't remove redundant
name
fromRuntimeMetric
right now is presto client is reading this information
StatusPrinter: stats.getRuntimeStats().getMetrics().values().stream().sorted(Comparator.comparing(RuntimeMetric::getName))
However, this would cause a potential bug when integrating with Velox since Velox doesn't have
name
inRuntimeMetric
So when data is collected from c++ worker to java coordinator,name
would be missing fromRuntimeMetric
Suggestion: 1) make changes to
StatusPrinter
to usename
from the map instead of fromRuntimeMetric
2) removename
fromRuntimeMetric
to be consistent with VeloxI'm fine with either doing it in this PR or in the future PR (create an issue for it). Thoughts? @yuanzhanhku
@@ -85,6 +90,13 @@ public String getName() | |||
return name; | |||
} | |||
|
|||
@JsonProperty | |||
@ThriftField(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason to reuse an existing thrift id?
When adding a new field in Thrift, we should use a new id. Otherwise, existing serialized Thrift struct cannot be parsed. This will also cause compatibility issue if the coordinator and the worker run different versions.
@@ -82,7 +83,7 @@ public Optional<Long> getFileSize() | |||
public void incrementCounter(String name, long value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to add unit
to this function and migrate existing usages.
This can be done in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Created an issue to keep track of it - #17777
mergedRuntimeStats.addMetricValueIgnoreZero(RuntimeMetricName.TASK_BLOCKED_TIME_NANOS, taskStats.getTotalBlockedTimeInNanos()); | ||
mergedRuntimeStats.addMetricValue(DRIVER_COUNT_PER_TASK, NONE, taskStats.getTotalDrivers()); | ||
mergedRuntimeStats.addMetricValue(TASK_ELAPSED_TIME_NANOS, NONE, taskStats.getElapsedTimeInNanos()); | ||
mergedRuntimeStats.addMetricValueIgnoreZero(TASK_QUEUED_TIME_NANOS, NONE, taskStats.getQueuedTimeInNanos()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning to change the units in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
@zacw7 this is causing failures for verifier. What's the plan here? |
It seems it only affects Sapphire verifier, let me test if relaxing the null check can prevent the error. |
Resolves #17633
Add unit to RuntimeMetric:
Allowed units so far (aligned with Velox):