Feature: add meta service network metrics and http health checker #6071

Merged
merged 2 commits into from
Jun 20, 2022
19 changes: 17 additions & 2 deletions docs/doc/50-manage/00-metasrv/50-metasrv-metrics.md
@@ -21,6 +21,7 @@ These metrics describe the status of the `metasrv`. All these metrics are prefix
| ----------------- | ------------------------------------------------- | ------- |
| current_leader_id | Current leader id of cluster, 0 means no leader. | IntGauge |
| is_leader | Whether or not this node is current leader. | Gauge |
| node_is_health | Whether or not this node is healthy. | IntGauge |
| leader_changes | Number of leader changes seen. | Counter |
| applying_snapshot | Whether or not statemachine is applying snapshot. | Gauge |
| proposals_applied | Total number of consensus proposals applied. | Gauge |
@@ -32,6 +33,8 @@ These metrics describe the status of the `metasrv`. All these metrics are prefix

`is_leader` indicates whether this `metasrv` is currently the leader of the cluster, and `leader_changes` shows the total number of leader changes since the node started. Frequent leader changes hurt the performance of `metasrv` and signal that the cluster is unstable.

`node_is_health` is 1 if and only if the node state is `Follower` or `Leader`; otherwise it is 0.
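
The rule for `node_is_health` can be sketched as a small pure function. The `NodeState` enum below is a hypothetical stand-in for the raft node state: only `Follower` and `Leader` are named in this document, and the remaining variants are assumptions made for illustration, so this is not the actual `metasrv` code.

```rust
/// Hypothetical stand-in for the raft node state. Only `Follower` and
/// `Leader` appear in the source; the other variants are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeState {
    Learner,
    Follower,
    Candidate,
    Leader,
    Shutdown,
}

/// Returns the gauge value for `node_is_health`:
/// 1 for `Follower` or `Leader`, 0 for every other state.
pub fn node_is_health_value(state: NodeState) -> i64 {
    match state {
        NodeState::Follower | NodeState::Leader => 1,
        _ => 0,
    }
}
```

Under these assumptions a node that is still a candidate or learner reports 0, which is what lets a health check treat it as not ready.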

`proposals_applied` records the total number of applied write requests.

`proposals_pending` indicates how many proposals are currently queued to commit. A rising number of pending proposals suggests either high client load or that the member cannot commit proposals.
@@ -40,9 +43,9 @@ These metrics describe the status of the `metasrv`. All these metrics are prefix

`watchers` shows the total number of currently active watchers.

-### Network
+### Raft Network

-These metrics describe the network status of the `metasrv`. All these metrics are prefixed with `metasrv_network_`.
+These metrics describe the network status of raft nodes in the `metasrv`. All these metrics are prefixed with `metasrv_raft_network_`.

| Name | Description | Labels | Type |
| ----------------------- | ------------------------------------------------- | --------------------------------- | ------------- |
@@ -71,3 +74,15 @@ These metrics describe the network status of the `metasrv`. All these metrics ar
`snapshot_recv_success` and `snapshot_recv_failures` indicate the number of successful and failed snapshot receives. `snapshot_recv_inflights` indicates the number of snapshots currently being received: it is incremented by one each time a snapshot receive starts and decremented by one when the receive is done.

`snapshot_recv_seconds` indicates the latency distribution of snapshot receives.

### Meta Network

These metrics describe the network status of meta service in the `metasrv`. All these metrics are prefixed with `metasrv_meta_network_`.

| Name | Description | Type |
| ---------------- | ------------------------------------------------------ | ---------- |
| meta_sent_bytes | Total number of bytes sent to meta gRPC clients. | IntCounter |
| meta_recv_bytes | Total number of bytes received from meta gRPC clients. | IntCounter |
| meta_inflights | Number of meta gRPC requests currently in flight. | IntGauge |
| meta_req_success | Total number of successful requests from meta gRPC clients. | IntCounter |
| meta_req_failed | Total number of failed requests from meta gRPC clients. | IntCounter |
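
To make the difference between the counter and gauge rows concrete, here is a minimal sketch of how the five metrics in the table could be tracked with plain atomics. The `MetaNetworkMetrics` struct and its method names are hypothetical; the real `metasrv` metrics module is not shown in this diff and may be built quite differently.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical in-memory stand-in for the five meta network metrics.
/// Field names mirror the table; the struct itself is an assumption.
#[derive(Debug, Default)]
pub struct MetaNetworkMetrics {
    pub meta_sent_bytes: AtomicU64,
    pub meta_recv_bytes: AtomicU64,
    pub meta_inflights: AtomicI64,
    pub meta_req_success: AtomicU64,
    pub meta_req_failed: AtomicU64,
}

impl MetaNetworkMetrics {
    /// Counter: only grows, by the encoded size of a received request.
    pub fn incr_recv_bytes(&self, bytes: u64) {
        self.meta_recv_bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Counter: only grows, by the encoded size of a sent reply.
    pub fn incr_sent_bytes(&self, bytes: u64) {
        self.meta_sent_bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Gauge: moves both ways, +1 when a request starts, -1 when it ends.
    pub fn add_inflights(&self, delta: i64) {
        self.meta_inflights.fetch_add(delta, Ordering::Relaxed);
    }

    /// Exactly one of the two result counters is bumped per request.
    pub fn incr_request_result(&self, success: bool) {
        let counter = if success {
            &self.meta_req_success
        } else {
            &self.meta_req_failed
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }
}
```

The signed `AtomicI64` for `meta_inflights` reflects that an `IntGauge` can go down as well as up, while the `IntCounter` rows are modelled as unsigned values that never decrease.
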
35 changes: 34 additions & 1 deletion metasrv/src/api/grpc/grpc_service.rs
@@ -49,6 +49,9 @@ use tonic::Streaming;
use crate::executor::ActionHandler;
use crate::meta_service::meta_service_impl::GrpcStream;
use crate::meta_service::MetaNode;
use crate::metrics::add_meta_metrics_meta_request_inflights;
use crate::metrics::incr_meta_metrics_meta_recv_bytes;
use crate::metrics::incr_meta_metrics_meta_sent_bytes;
use crate::version::from_digit_ver;
use crate::version::to_digit_ver;
use crate::version::METASRV_SEMVER;
@@ -147,22 +150,41 @@ impl MetaService for MetaServiceImpl {
self.check_token(request.metadata())?;
common_tracing::extract_remote_span_as_parent(&request);

incr_meta_metrics_meta_recv_bytes(request.get_ref().encoded_len() as u64);

let action: MetaGrpcWriteReq = request.try_into()?;

add_meta_metrics_meta_request_inflights(1);

tracing::info!("Receive write_action: {:?}", action);

let body = self.action_handler.execute_write(action).await;

add_meta_metrics_meta_request_inflights(-1);

incr_meta_metrics_meta_sent_bytes(body.encoded_len() as u64);

Ok(Response::new(body))
}

async fn read_msg(&self, request: Request<RaftRequest>) -> Result<Response<RaftReply>, Status> {
self.check_token(request.metadata())?;
common_tracing::extract_remote_span_as_parent(&request);

incr_meta_metrics_meta_recv_bytes(request.get_ref().encoded_len() as u64);

let action: MetaGrpcReadReq = request.try_into()?;

add_meta_metrics_meta_request_inflights(1);

tracing::info!("Receive read_action: {:?}", action);

let res = self.action_handler.execute_read(action).await;

add_meta_metrics_meta_request_inflights(-1);

incr_meta_metrics_meta_sent_bytes(res.encoded_len() as u64);

Ok(Response::new(res))
}

@@ -210,13 +232,20 @@ impl MetaService for MetaServiceImpl {
request: Request<TxnRequest>,
) -> Result<Response<TxnReply>, Status> {
self.check_token(request.metadata())?;
incr_meta_metrics_meta_recv_bytes(request.get_ref().encoded_len() as u64);
add_meta_metrics_meta_request_inflights(1);

common_tracing::extract_remote_span_as_parent(&request);

let request = request.into_inner();

tracing::info!("Receive txn_request: {:?}", request);

let body = self.action_handler.execute_txn(request).await;
add_meta_metrics_meta_request_inflights(-1);

incr_meta_metrics_meta_sent_bytes(body.encoded_len() as u64);

Ok(Response::new(body))
}

@@ -228,7 +257,11 @@ impl MetaService for MetaServiceImpl {
let members = meta_node.get_meta_addrs().await.map_err(|e| {
Status::internal(format!("Cannot get metasrv member list, error: {:?}", e))
})?;
-Ok(Response::new(MemberListReply { data: members }))

+let resp = MemberListReply { data: members };
+incr_meta_metrics_meta_sent_bytes(resp.encoded_len() as u64);

+Ok(Response::new(resp))
}
}
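
The handlers in this file pair `add_meta_metrics_meta_request_inflights(1)` with a matching `-1` around the awaited call. As an illustration only, and not something this PR does, the same pairing can be expressed as a drop guard, which also decrements when a handler returns early. `InflightGuard` and `handle` are hypothetical names, and the gauge is a plain `AtomicI64` rather than the real metrics type.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical guard: increments the gauge when created and
/// decrements it again when dropped.
pub struct InflightGuard<'a> {
    gauge: &'a AtomicI64,
}

impl<'a> InflightGuard<'a> {
    pub fn new(gauge: &'a AtomicI64) -> Self {
        gauge.fetch_add(1, Ordering::Relaxed);
        Self { gauge }
    }
}

impl Drop for InflightGuard<'_> {
    fn drop(&mut self) {
        self.gauge.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Hypothetical handler body. The guard lives for the whole call, so the
/// gauge is decremented on both the success path and the early return.
pub fn handle(gauge: &AtomicI64, fail: bool) -> Result<i64, String> {
    let _guard = InflightGuard::new(gauge);
    if fail {
        return Err("request failed".to_string());
    }
    // Report the gauge as seen while this request is in flight.
    Ok(gauge.load(Ordering::Relaxed))
}
```

The trade-off is a little indirection in exchange for not having to remember the decrement on every exit path.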

18 changes: 17 additions & 1 deletion metasrv/src/api/http_service.rs
@@ -20,13 +20,18 @@ use common_base::base::Stoppable;
use common_exception::Result;
use common_tracing::tracing;
use poem::get;
use poem::http::StatusCode;
use poem::listener::RustlsConfig;
use poem::web::Json;
use poem::Endpoint;
use poem::EndpointExt;
use poem::IntoResponse;
use poem::Response;
use poem::Route;

use crate::configs::Config;
use crate::meta_service::MetaNode;
use crate::metrics::get_meta_metrics_node_is_health;

pub struct HttpService {
cfg: Config,
@@ -46,7 +51,7 @@ impl HttpService {
fn build_router(&self) -> impl Endpoint {
#[cfg_attr(not(feature = "memory-profiling"), allow(unused_mut))]
let mut route = Route::new()
-.at("/v1/health", get(super::http::v1::health::health_handler))
+.at("/v1/health", get(health_handler))
.at("/v1/config", get(super::http::v1::config::config_handler))
.at(
"/v1/cluster/nodes",
@@ -129,3 +134,14 @@ impl Stoppable for HttpService {
self.shutdown_handler.stop(force).await
}
}

#[poem::handler]
pub async fn health_handler() -> Response {
if !get_meta_metrics_node_is_health() {
return StatusCode::SERVICE_UNAVAILABLE.into_response();
}
Json(super::http::v1::health::HealthCheckResponse {
status: super::http::v1::health::HealthCheckStatus::Pass,
})
.into_response()
}
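
The decision made by `health_handler` can be isolated from the web framework. The sketch below assumes the health state is held in a shared boolean flag and that the healthy JSON response is served with status 200. `HealthFlag` and `health_status_code` are hypothetical names; the real `get_meta_metrics_node_is_health` is not shown in this diff.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the state behind
/// `set_meta_metrics_node_is_health` / `get_meta_metrics_node_is_health`.
pub struct HealthFlag(AtomicBool);

impl HealthFlag {
    /// A node starts out unhealthy until the raft state says otherwise.
    pub const fn new() -> Self {
        HealthFlag(AtomicBool::new(false))
    }

    pub fn set(&self, healthy: bool) {
        self.0.store(healthy, Ordering::Relaxed);
    }

    pub fn get(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}

/// HTTP status the health endpoint would answer with:
/// 200 when the node is healthy, 503 (Service Unavailable) otherwise.
pub fn health_status_code(flag: &HealthFlag) -> u16 {
    if flag.get() {
        200
    } else {
        503
    }
}
```

Returning 503 rather than a 200 with a failure body lets load balancers and orchestrators act on the status code alone.
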
11 changes: 10 additions & 1 deletion metasrv/src/executor/action_handler.rs
@@ -24,6 +24,7 @@ use common_meta_types::TxnReply;
use common_meta_types::TxnRequest;

use crate::meta_service::MetaNode;
use crate::metrics::incr_meta_metrics_meta_request_result;

pub struct ActionHandler {
/// The raft-based meta data entry.
@@ -48,6 +49,7 @@ impl ActionHandler {
match action {
MetaGrpcWriteReq::UpsertKV(a) => {
let r = self.meta_node.upsert_kv(a).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
}
@@ -59,25 +61,32 @@ impl ActionHandler {
match action {
MetaGrpcReadReq::GetKV(a) => {
let r = self.meta_node.get_kv(&a.key).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
MetaGrpcReadReq::MGetKV(a) => {
let r = self.meta_node.mget_kv(&a.keys).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
MetaGrpcReadReq::ListKV(a) => {
let r = self.meta_node.prefix_list_kv(&a.prefix).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
MetaGrpcReadReq::PrefixListKV(a) => {
let r = self.meta_node.prefix_list_kv(&a.0).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
}
}

pub async fn execute_txn(&self, req: TxnRequest) -> TxnReply {
-match self.meta_node.transaction(req).await {
+let ret = self.meta_node.transaction(req).await;
+incr_meta_metrics_meta_request_result(ret.is_ok());

+match ret {
Ok(resp) => resp,
Err(err) => TxnReply {
success: false,
7 changes: 7 additions & 0 deletions metasrv/src/meta_service/raftmeta.rs
@@ -53,6 +53,7 @@ use openraft::Config;
use openraft::Raft;
use openraft::RaftMetrics;
use openraft::SnapshotPolicy;
use openraft::State;
use tonic::Status;

use crate::meta_service::meta_leader::MetaLeader;
@@ -63,6 +64,7 @@ use crate::metrics::incr_meta_metrics_leader_change;
use crate::metrics::incr_meta_metrics_read_failed;
use crate::metrics::set_meta_metrics_current_leader;
use crate::metrics::set_meta_metrics_is_leader;
use crate::metrics::set_meta_metrics_node_is_health;
use crate::metrics::set_meta_metrics_proposals_applied;
use crate::network::Network;
use crate::store::MetaRaftStore;
@@ -379,6 +381,11 @@ impl MetaNode {
};
if changed.is_ok() {
let mm = metrics_rx.borrow().clone();

set_meta_metrics_node_is_health(
mm.state == State::Follower || mm.state == State::Leader,
);

if let Some(cur) = mm.current_leader {
// if current leader has changed?
if let Some(leader) = current_leader {