[Discussion] Metrics API design. #2

Open
kevinten10 opened this issue Sep 9, 2021 · 7 comments

@kevinten10
Member

kevinten10 commented Sep 9, 2021

Goal

Design a Metrics API for application-level indicator monitoring.

Progress

We can start from the references below and define a first version of the API.

Reference

dapr/dapr#2817
mosn/layotto#90
dapr/dapr#2988
dapr/dapr#100
dapr/dapr#3449
dapr/dapr#3455
dapr/dapr#3549
mosn/layotto#214

@JasmineJ1230

In most business scenarios, Event Logs, Digital Indexes and Action Execution Sequences are widely used in application monitoring. I think we should provide good support for these metric forms.

Here are some ideas for these functions.
Just a very simple sketch~ I hope the roughly defined APIs can express my understanding and assumptions of this function module.

1. Events

An Event Log marks the occurrence of a specified situation, which is often tied to an alarm.
An event does not need to hold much information. It should be light and simple. We can build an event from a specified name, which is unique within the current application, plus an optional short description.
Perhaps something like this...

service Runtime {
  // Log an event.
  rpc OnEvent(OnEventRequest) returns (google.protobuf.Empty) {}
}

message Event {
    required string event_name = 1;
    optional string description = 2;
    required int64 timestamp = 3;
}
message OnEventRequest {
    required string app_id = 1;
    required Event event = 2;
}

Alarms can be set for specified events. An email reminder is the most common way to handle an alarm. Users can also define their own handlers as an enhancement if necessary.

service Runtime {
  // Create an alarm for the specified event.
  rpc CreateEventAlarm(CreateEventAlarmRequest) returns (CreateEventAlarmResponse) {}

  // Delete an alarm from the specified event.
  rpc DeleteEventAlarm(DeleteEventAlarmRequest) returns (google.protobuf.Empty) {}
}

message Alarm {
    optional string alarm_name = 1;
    repeated string handlers = 2;
}
message EventAlarm {
    optional Alarm alarm = 1;
    optional string event_name = 2;
}
message CreateEventAlarmRequest {
    required string app_id = 1;
    required string event_name = 2;
    required string alarm_name = 3;
    repeated string handlers = 4;
}
message CreateEventAlarmResponse {
    required string app_id = 1;
    optional EventAlarm event_alarm = 2;
}

message DeleteEventAlarmRequest {
    required string app_id = 1;
    required string event_name = 2;
    required string alarm_name = 3;
}
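
To make the event and alarm flow above more concrete, here is a minimal in-memory Java sketch. It is only an illustration under assumptions: the class name EventAlarmRegistry is hypothetical, handlers are plain callbacks rather than the string handler names in the proto, and app_id and timestamp are left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for OnEvent / CreateEventAlarm / DeleteEventAlarm. */
final class EventAlarmRegistry {
    // event_name -> (alarm_name -> handler). Assumption: a handler is a callback
    // that receives the event description.
    private final Map<String, Map<String, Consumer<String>>> alarms = new HashMap<>();

    void createEventAlarm(String eventName, String alarmName, Consumer<String> handler) {
        alarms.computeIfAbsent(eventName, k -> new HashMap<>()).put(alarmName, handler);
    }

    void deleteEventAlarm(String eventName, String alarmName) {
        Map<String, Consumer<String>> byName = alarms.get(eventName);
        if (byName != null) {
            byName.remove(alarmName);
        }
    }

    /** Records an event and returns how many alarm handlers were invoked. */
    int onEvent(String eventName, String description) {
        Map<String, Consumer<String>> byName = alarms.get(eventName);
        if (byName == null) {
            return 0;
        }
        // Copy first so a handler may add or remove alarms without breaking iteration.
        List<Consumer<String>> handlers = new ArrayList<>(byName.values());
        for (Consumer<String> handler : handlers) {
            handler.accept(description);
        }
        return handlers.size();
    }
}
```

An event with no alarm attached is simply recorded with zero handlers invoked, which keeps OnEvent cheap for the common case.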

2. Digital Index.

A Digital Index describes how the performance of an application changes over a period of time. Index data can be processed in different ways and serves well for further data analysis.

service Runtime {
  // Create (register) an index.
  rpc CreateIndex(CreateIndexRequest) returns (CreateIndexResponse) {}

  // Publish a data point for an index.
  rpc PublishIndexData(PublishIndexDataRequest) returns (google.protobuf.Empty) {}
}

message Index {
    optional string index_name = 1;
    optional string data_type = 2;
    repeated string processors = 3;
}
message CreateIndexRequest {
    required string app_id = 1;
    required string index_name = 2;
    required string data_type = 3;
    repeated string processors = 4;
}
message CreateIndexResponse {
    required string app_id = 1;
    optional Index index = 2;
}

message PublishIndexDataRequest {
    required string app_id = 1;
    required string index_name = 2;
    required string value = 3;
    required int64 timestamp = 4;
}

Alarms can also be set on an index and triggered when the index reaches a specified value.

service Runtime {
  // Create an alarm for the specified index.
  rpc CreateIndexAlarm(CreateIndexAlarmRequest) returns (CreateIndexAlarmResponse) {}

  // Publish a data point for an index (same RPC as in the previous block).
  rpc PublishIndexData(PublishIndexDataRequest) returns (google.protobuf.Empty) {}

  // Delete an alarm from the specified index.
  rpc DeleteIndexAlarm(DeleteIndexAlarmRequest) returns (google.protobuf.Empty) {}
}

message IndexAlarm {
   optional string index_name = 1;
   optional Alarm alarm = 2;
   // perhaps regular expression? or use structures with some pre-defined enums.
   optional string rule = 3;
}

message CreateIndexAlarmRequest {
    required string app_id = 1;
    required string index_name = 2;
    required string alarm_name = 3;
    repeated string handlers = 4;
    required string rule = 5;
}

message CreateIndexAlarmResponse {
    required string app_id = 1;
    optional IndexAlarm index_alarm = 2;
}

message DeleteIndexAlarmRequest {
    required string app_id = 1;
    required string index_name = 2;
    required string alarm_name = 3;
}
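
The rule field is still open (string expression vs. structure with enums). As one possible reading of the "structures with some pre-defined enums" option, here is a rough Java sketch. Everything in it is an assumption for illustration: the names IndexAlarmRule and IndexMonitor are hypothetical, only two comparison operators exist, and the string value is assumed to carry a numeric data_type.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical structured form of IndexAlarm.rule: a comparison operator plus a threshold. */
final class IndexAlarmRule {
    enum Op { GREATER_THAN, LESS_THAN }

    private final Op op;
    private final double threshold;

    IndexAlarmRule(Op op, double threshold) {
        this.op = op;
        this.threshold = threshold;
    }

    boolean matches(double value) {
        return op == Op.GREATER_THAN ? value > threshold : value < threshold;
    }
}

/** Hypothetical in-memory stand-in for CreateIndexAlarm / DeleteIndexAlarm / PublishIndexData. */
final class IndexMonitor {
    // index_name -> (alarm_name -> rule); LinkedHashMap keeps alarms in creation order.
    private final Map<String, Map<String, IndexAlarmRule>> alarms = new HashMap<>();

    void createIndexAlarm(String indexName, String alarmName, IndexAlarmRule rule) {
        alarms.computeIfAbsent(indexName, k -> new LinkedHashMap<>()).put(alarmName, rule);
    }

    void deleteIndexAlarm(String indexName, String alarmName) {
        Map<String, IndexAlarmRule> byName = alarms.get(indexName);
        if (byName != null) {
            byName.remove(alarmName);
        }
    }

    /** Accepts a data point and returns the names of the alarms it triggers. */
    List<String> publishIndexData(String indexName, String value) {
        // Assumption: the index has a numeric data_type, so the string value parses as a double.
        double numeric = Double.parseDouble(value);
        List<String> triggered = new ArrayList<>();
        Map<String, IndexAlarmRule> byName = alarms.get(indexName);
        if (byName != null) {
            for (Map.Entry<String, IndexAlarmRule> entry : byName.entrySet()) {
                if (entry.getValue().matches(numeric)) {
                    triggered.add(entry.getKey());
                }
            }
        }
        return triggered;
    }
}
```

A structured rule like this is easier to validate when the alarm is created than a free-form string, at the cost of being less flexible.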

Although all of these functions should be customizable, we can also provide shortcuts for common metric attributes, especially system indicators such as memory usage and CPU usage.

3. Action Execution Sequence.

An Action Execution Sequence records in detail how a function was performed, which is useful for troubleshooting. It might be the most difficult form of metric logging.
To tie the actions together, we need a unique id for the current sequence, and every request in the sequence must carry that same id.

service Runtime {
  // Start a transaction and get its unique transaction id.
  rpc StartTransaction(StartTransactionRequest) returns (StartTransactionAlarmResponse) {}

  // Record an action in a transaction.
  rpc RecordAction(RecordActionRequest) returns (google.protobuf.Empty) {}
}
message StartTransactionRequest {
    required string app_id = 1;
    required string transaction_name = 2;
}

message StartTransactionAlarmResponse {
    required string app_id = 1;
    optional string transaction_id = 2;
    optional string transaction_name = 3;
}

message RecordActionRequest {
    required string app_id = 1;
    required string transaction_id = 2;
    required string action_name = 3;
    // map fields cannot carry a required/optional label.
    map<string, string> action_details = 4;
    optional int64 timestamp = 5;
}
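
To show how the shared sequence id ties actions together, here is a small in-memory Java sketch. It is an assumption-laden illustration, not a proposed implementation: the class name ActionSequenceRecorder is hypothetical, the id is a random UUID, and app_id, action_details and timestamp are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical in-memory stand-in for StartTransaction / RecordAction. */
final class ActionSequenceRecorder {
    // transaction_id -> ordered list of action names recorded under that id.
    private final Map<String, List<String>> actionsByTransaction = new HashMap<>();

    /**
     * Mirrors StartTransaction: returns a fresh transaction id.
     * transactionName is accepted for parity with the proto but not stored in this sketch.
     */
    String startTransaction(String transactionName) {
        String transactionId = UUID.randomUUID().toString();
        actionsByTransaction.put(transactionId, new ArrayList<>());
        return transactionId;
    }

    /** Mirrors RecordAction: appends an action to an already started transaction. */
    void recordAction(String transactionId, String actionName) {
        List<String> actions = actionsByTransaction.get(transactionId);
        if (actions == null) {
            throw new IllegalArgumentException("unknown transaction_id: " + transactionId);
        }
        actions.add(actionName);
    }

    /** Returns the actions recorded so far, in order; empty for an unknown id. */
    List<String> actionsOf(String transactionId) {
        List<String> actions = actionsByTransaction.get(transactionId);
        if (actions == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(actions);
    }
}
```

Rejecting an unknown transaction_id in recordAction is a design choice of this sketch; a real runtime might prefer to accept such actions and create the sequence lazily.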

Let's discuss the API design further. Looking forward to your reply~

@kevinten10
Member Author

cool! perfect

Please give me a moment to let me understand your design.

@kevinten10
Member Author

kevinten10 commented Sep 27, 2021

@JasmineJ1230 Can you pack these definitions into one proto file?

If you have time, can you provide java implementations of these interfaces?

@JasmineJ1230

JasmineJ1230 commented Sep 29, 2021

@JasmineJ1230 Can you pack these definitions into one proto file

If you have time, can you provide java implementations of these interfaces?

OK~ I will make a more complete API design during the National Day holiday. (Perhaps 10/1~3?)

Let's discuss the API definition in more detail after that. I will contact you when there is any progress~

Also, once the API design is complete, I don't think it will be difficult to provide the Java implementations.

@kevinten10
Member Author

I put your definition above here: https://github.com/reactivegroup/cloud-runtimes-jvm/blob/feature/metrics/spec/proto.runtime.v1/Metrics.proto

You can also submit a proposal to Layotto directly.

@pinxiong
Contributor

We'd better refer to OpenTelemetry, which is already the publicly accepted standard for monitoring, tracing, and metrics.

github: https://github.com/open-telemetry

@JasmineJ1230

I have studied OpenTelemetry and finished some demos and experiments.
OpenTelemetry defines a reasonable set of models and APIs for telemetry data, and it has been widely recognized by mainstream cloud vendors. If we follow its specifications and develop with its API and SDK, we can save a lot of migration cost later on.
However, OpenTelemetry is still under development, and some features are immature or incomplete. Maybe we can discuss how and where to use OpenTelemetry so that we follow the mainstream standard while keeping our project stable.

I put my learning report in #5, which can serve as a reference.
