apis: add MetricPrediction crd #1875
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## main #1875 +/- ##
==========================================
+ Coverage 67.23% 67.54% +0.30%
==========================================
Files 410 413 +3
Lines 45662 46072 +410
==========================================
+ Hits 30702 31120 +418
+ Misses 12742 12696 -46
- Partials 2218 2256 +38
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
typo:
Add some user stories to help understand how the API is used.
Updated with more user stories.
/lgtm
// API version of the referent
APIVersion string `json:"apiVersion,omitempty"`
// Hierarchy indicates the hierarchy of the target for profiling
Hierarchy ProfileHierarchy `json:"hierarchy,omitempty"`
The `WorkloadRef` makes sense, but the field `Hierarchy` doesn't look connected to the workload reference; is the definition really appropriate here? I noticed that `PodSelectorRef` has the same situation, so is `Hierarchy` itself a field that needs to be described independently?
The `Hierarchy` field is related to `Workload`, which means the metric prediction is container- or pod-level, and that is only effective for K8s workloads.
Metric Prediction can also work for other types of workloads beyond K8s, such as a FaaS job. Although the workload is not defined in K8s, the metric is recorded in Prometheus format:
function_cpu_usage{service="service-word-count", function="f-map-name", slice="slice-0"} 1.2
function_cpu_usage{service="service-word-count", function="f-map-name", slice="slice-1"} 1.3
function_cpu_usage{service="service-word-count", function="f-reduce-name", slice="all"} 2
This means a workload called `service-word-count`, which consists of two jobs (map and reduce).
For the resource recommendation scenario, we want to profile `service-word-count/f-map-name` and `service-word-count/f-reduce-name`.
This workload can be defined as `AnalysisTargetPrometheusLabelGroup`, and we should support `service/function` as the key for aggregation in `PrometheusLabelGroup`, which does not need the `Hierarchy` field.
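The label-key aggregation described above can be sketched in Go. This is a hypothetical illustration of grouping Prometheus samples by a `service/function` key, not the actual `PrometheusLabelGroup` implementation; the `Sample` type and `groupByKeys` helper are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// Sample is a simplified Prometheus sample: a label set plus a value.
type Sample struct {
	Labels map[string]string
	Value  float64
}

// groupByKeys aggregates samples by the given label keys (e.g. "service",
// "function"), summing values across the remaining labels (e.g. "slice").
func groupByKeys(samples []Sample, keys []string) map[string]float64 {
	out := map[string]float64{}
	for _, s := range samples {
		parts := make([]string, 0, len(keys))
		for _, k := range keys {
			parts = append(parts, s.Labels[k])
		}
		out[strings.Join(parts, "/")] += s.Value
	}
	return out
}

func main() {
	samples := []Sample{
		{Labels: map[string]string{"service": "service-word-count", "function": "f-map-name", "slice": "slice-0"}, Value: 1.2},
		{Labels: map[string]string{"service": "service-word-count", "function": "f-map-name", "slice": "slice-1"}, Value: 1.3},
		{Labels: map[string]string{"service": "service-word-count", "function": "f-reduce-name", "slice": "all"}, Value: 2},
	}
	agg := groupByKeys(samples, []string{"service", "function"})
	fmt.Printf("%.1f\n", agg["service-word-count/f-map-name"])    // 2.5
	fmt.Printf("%.1f\n", agg["service-word-count/f-reduce-name"]) // 2.0
}
```

With this shape, the profiling targets `service-word-count/f-map-name` and `service-word-count/f-reduce-name` fall out of the aggregation key directly, with no per-container hierarchy involved.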
This information looks important and should be added to the proposal.
We can add more explanation when we support `PrometheusLabelGroup`.
In the case where there is another layer in the usage scenario mentioned earlier, does MetricPrediction need to be a CRD?
// PrometheusMetric defines the prometheus metric to be analyzed
type PrometheusMetric struct {
	// Resource defines the key of resource to be analyzed
	Resource v1.ResourceName `json:"name,omitempty"`
I wonder if the name `resource` is suitable here. For example, is CPI a `resource`?
Maybe just define it as `name` here.
@hormes The Recommendation controller in Koordinator does not need to create a MetricPrediction CR in the APIServer, which means the MetricPrediction is a
A MetricPrediction CR will be created in the following scenarios:
First we will support the usage scenario, and the development will take two steps:
Signed-off-by: 佑祎 <zzw261520@alibaba-inc.com>
New changes are detected. LGTM label has been removed.
// Source defines the source of metric, which can be metric server or prometheus
Source MetricSourceType `json:"source"`
// MetricServer defines the metric server source, which is effective when source is metric server
MetricServer *MetricServerSource `json:"metricServer,omitempty"`
Suggested change:
- MetricServer *MetricServerSource `json:"metricServer,omitempty"`
+ MetricsAPI *MetricsAPIMetricSource `json:"metricsAPI,omitempty"`
/hold until we have implemented the first user story.
This issue has been automatically marked as stale because it has not had recent activity.
This issue has been automatically closed because it has not had recent activity.
Ⅰ. Describe what this PR does
Define the metric prediction CRD for recommendation and prediction.
The following yaml defines a MetricPrediction called metricprediction-sample.
The metricprediction-sample spec claims that it needs the resource prediction for a workload.
The metricprediction-sample status returns the CPU and memory profiling result for all containers of the nginx workload.
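The referenced yaml is not reproduced in this excerpt, so here is a minimal Go sketch of the shape it describes: a spec that targets a workload, and a status with per-container profiling results. All field and type names are illustrative assumptions for this sketch, not the merged API.

```go
package main

import "fmt"

// WorkloadRef identifies the workload to profile (hypothetical, simplified).
type WorkloadRef struct {
	APIVersion string // API version of the referent, e.g. "apps/v1"
	Kind       string // e.g. "Deployment"
	Name       string // e.g. "nginx"
}

// ContainerPrediction holds the profiled resources of one container.
type ContainerPrediction struct {
	ContainerName string
	CPU           string // profiled CPU, e.g. "500m"
	Memory        string // profiled memory, e.g. "512Mi"
}

// MetricPredictionSketch mirrors the spec/status split described above;
// the real CRD in this PR carries more fields (profilers, metric sources).
type MetricPredictionSketch struct {
	Target  WorkloadRef           // spec: which workload needs prediction
	Results []ContainerPrediction // status: per-container profiling result
}

func main() {
	mp := MetricPredictionSketch{
		Target: WorkloadRef{APIVersion: "apps/v1", Kind: "Deployment", Name: "nginx"},
		Results: []ContainerPrediction{
			{ContainerName: "nginx", CPU: "500m", Memory: "512Mi"},
		},
	}
	fmt.Println(mp.Target.Name, mp.Results[0].CPU)
}
```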
Ⅱ. Does this pull request fix one issue?
More info can be found in #1880.
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
Integrating with the Metric Prediction Framework
The Metric Prediction Framework is a kind of "deep module", providing algorithms and prediction models in the backend. There could be multiple profilers built with Metric Prediction as a foundation. Here are some scenarios showing how to use the framework.
The spec of Recommendation declares that it needs the recommended resources (CPU and memory) for a deployment named nginx-sample, and the recommendResources in status shows the result for each container.
The recommendation is calculated with a quantile value of history metrics. Using Metric Prediction as the profiling model, the requirement of recommendation-sample can be expressed in MetricPrediction.
For different kinds of workloads, the recommendation can select a specified quantile value from MetricPrediction, for example p95 for Deployment and average for Job, then increase it with a 10–15% margin for safety.
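The quantile-plus-margin rule can be sketched as follows. The function names and the nearest-rank quantile method are assumptions for illustration, not Koordinator's implementation; only the p95/mean choice and the safety margin come from the text above.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the nearest-rank quantile of the samples, e.g. q=0.95
// for a Deployment; a Job would instead use the mean of the samples.
func quantile(samples []float64, q float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	idx := int(math.Ceil(q*float64(len(s)))) - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}

// recommend takes the chosen quantile of history metrics and adds a
// safety margin (the text above suggests 10-15%).
func recommend(samples []float64, q, margin float64) float64 {
	return quantile(samples, q) * (1 + margin)
}

func main() {
	// Hypothetical CPU-usage history (in cores) for one container.
	history := []float64{0.2, 0.3, 0.25, 0.9, 0.4}
	fmt.Printf("%.3f cores\n", recommend(history, 0.95, 0.15))
}
```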
Pod orchestration on a node varies over time, and each pod has its own cycle of resource usage. The NodeQoS CR below describes the usage prediction according to workload metric prediction based on time series.
The usageOverTime result in NodeQoS is aggregated from the MetricPrediction of all workloads currently running on the node, so that the descheduler can check whether any nodes will be overloaded in the near future and rebalance some pods to other nodes.
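The node-level aggregation can be sketched as a sum of the per-workload predicted series at each future timestep. The names here are illustrative, not the NodeQoS API.

```go
package main

import "fmt"

// aggregateNodeUsage sums the predicted usage series of every workload on
// a node, producing the node-level usage-over-time prediction that a
// descheduler could check against capacity.
func aggregateNodeUsage(workloadUsage map[string][]float64, steps int) []float64 {
	total := make([]float64, steps)
	for _, series := range workloadUsage {
		for i := 0; i < steps && i < len(series); i++ {
			total[i] += series[i]
		}
	}
	return total
}

func main() {
	// Hypothetical predicted CPU usage (cores) per future timestep.
	perWorkload := map[string][]float64{
		"nginx": {0.5, 0.7, 0.6},
		"batch": {0.2, 0.1, 0.9},
	}
	node := aggregateNodeUsage(perWorkload, 3)
	fmt.Println(node) // node-level predicted usage per timestep
}
```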
Pods may suffer interference during runtime due to resource contention on the node, which can be analyzed through CPI, PSI, CPU schedule latency, etc. Specify an algorithm such as OCSVM in MetricPrediction, and the model will be available in status.
The Interference Manager will parse and send the corresponding model of each workload to koordlet. koordlet will execute QoS strategies once it finds that some pod is an outlier according to recent metrics.
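To illustrate only the control flow (model parameters in, recent metrics in, outlier decision out), here is a deliberately simplified stand-in that uses a z-score threshold instead of the OCSVM model mentioned above; it is not the real koordlet strategy.

```go
package main

import (
	"fmt"
	"math"
)

// isOutlier flags the latest sample (e.g. a CPI reading) as an outlier if
// it deviates from the baseline history by more than zThreshold standard
// deviations. A real deployment would evaluate the model shipped in the
// MetricPrediction status instead of this z-score stand-in.
func isOutlier(history []float64, latest, zThreshold float64) bool {
	var sum, sqSum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))
	for _, v := range history {
		sqSum += (v - mean) * (v - mean)
	}
	std := math.Sqrt(sqSum / float64(len(history)))
	if std == 0 {
		return latest != mean
	}
	return math.Abs(latest-mean)/std > zThreshold
}

func main() {
	// Hypothetical CPI history for one pod, plus one suspicious reading.
	cpiHistory := []float64{1.0, 1.1, 0.9, 1.05, 0.95}
	fmt.Println(isOutlier(cpiHistory, 2.5, 3)) // far above the baseline
}
```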
V. Checklist
make test