Merge pull request #6071 from lichuang/metasrv_metrics
Feature: add meta service network metrics and http health checker
BohuTANG authored Jun 20, 2022
2 parents 88e56ef + d11f0ba commit 9ca506f
Showing 8 changed files with 204 additions and 19 deletions.
19 changes: 17 additions & 2 deletions docs/doc/50-manage/00-metasrv/50-metasrv-metrics.md
@@ -21,6 +21,7 @@ These metrics describe the status of the `metasrv`. All these metrics are prefix
| ----------------- | ------------------------------------------------- | ------- |
| current_leader_id | Current leader id of cluster, 0 means no leader. | IntGauge |
| is_leader | Whether or not this node is the current leader. | Gauge |
| node_is_health | Whether or not this node is healthy. | IntGauge |
| leader_changes | Number of leader changes seen. | Counter |
| applying_snapshot | Whether or not statemachine is applying snapshot. | Gauge |
| proposals_applied | Total number of consensus proposals applied. | Gauge |
@@ -32,6 +33,8 @@ These metrics describe the status of the `metasrv`. All these metrics are prefix

`is_leader` indicates whether this `metasrv` is currently the leader of the cluster, and `leader_changes` shows the total number of leader changes since start. Frequent leader changes hurt the performance of `metasrv` and also signal that the cluster is unstable.

`node_is_health` is 1 if and only if the node state is `Follower` or `Leader`; otherwise it is 0.
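
The rule above can be sketched in a few lines of Rust. This is only an illustration, not the code from this commit: `NodeState` is a hypothetical stand-in for the raft state type, and the `Learner` and `Candidate` variants are assumptions; the text only names `Follower` and `Leader`.

```rust
/// Hypothetical stand-in for the raft node state.
/// Only `Follower` and `Leader` are named in the documentation;
/// the other variants are assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeState {
    Learner,
    Follower,
    Candidate,
    Leader,
}

/// Returns the gauge value for `node_is_health`:
/// 1 for `Follower` or `Leader`, 0 for every other state.
fn node_is_health(state: NodeState) -> i64 {
    match state {
        NodeState::Follower | NodeState::Leader => 1,
        _ => 0,
    }
}
```

Under these assumptions, a node that is still campaigning or only learning the log reports 0, so a health probe based on this gauge would fail until the node settles into a voting role.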

`proposals_applied` records the total number of applied write requests.

`proposals_pending` indicates how many proposals are currently queued to commit. A rising number of pending proposals suggests that there is a high client load or that the member cannot commit proposals.
@@ -40,9 +43,9 @@ These metrics describe the status of the `metasrv`. All these metrics are prefix

`watchers` shows the total number of currently active watchers.

### Network
### Raft Network

These metrics describe the network status of the `metasrv`. All these metrics are prefixed with `metasrv_network_`.
These metrics describe the network status of raft nodes in the `metasrv`. All these metrics are prefixed with `metasrv_raft_network_`.

| Name | Description | Labels | Type |
| ----------------------- | ------------------------------------------------- | --------------------------------- | ------------- |
@@ -71,3 +74,15 @@ These metrics describe the network status of the `metasrv`. All these metrics ar
`snapshot_recv_success` and `snapshot_recv_failures` indicate the number of successful and failed snapshot receives. `snapshot_recv_inflights` indicates the number of snapshots currently being received: it is incremented by one each time a snapshot receive starts and decremented by one when the receive is done.

`snapshot_recv_seconds` indicates the latency distribution of snapshot receives.

### Meta Network

These metrics describe the network status of the meta service in the `metasrv`. All these metrics are prefixed with `metasrv_meta_network_`.

| Name | Description | Type |
| ---------------- | ------------------------------------------------------ | ---------- |
| meta_sent_bytes | Total number of bytes sent to meta grpc clients. | IntCounter |
| meta_recv_bytes | Total number of bytes received from meta grpc clients. | IntCounter |
| meta_inflights | Number of currently inflight meta grpc requests. | IntGauge |
| meta_req_success | Total number of successful requests from meta grpc clients. | IntCounter |
| meta_req_failed | Total number of failed requests from meta grpc clients. | IntCounter |
35 changes: 34 additions & 1 deletion metasrv/src/api/grpc/grpc_service.rs
@@ -49,6 +49,9 @@ use tonic::Streaming;
use crate::executor::ActionHandler;
use crate::meta_service::meta_service_impl::GrpcStream;
use crate::meta_service::MetaNode;
use crate::metrics::add_meta_metrics_meta_request_inflights;
use crate::metrics::incr_meta_metrics_meta_recv_bytes;
use crate::metrics::incr_meta_metrics_meta_sent_bytes;
use crate::version::from_digit_ver;
use crate::version::to_digit_ver;
use crate::version::METASRV_SEMVER;
@@ -147,22 +150,41 @@ impl MetaService for MetaServiceImpl {
self.check_token(request.metadata())?;
common_tracing::extract_remote_span_as_parent(&request);

incr_meta_metrics_meta_recv_bytes(request.get_ref().encoded_len() as u64);

let action: MetaGrpcWriteReq = request.try_into()?;

add_meta_metrics_meta_request_inflights(1);

tracing::info!("Receive write_action: {:?}", action);

let body = self.action_handler.execute_write(action).await;

add_meta_metrics_meta_request_inflights(-1);

incr_meta_metrics_meta_sent_bytes(body.encoded_len() as u64);

Ok(Response::new(body))
}

async fn read_msg(&self, request: Request<RaftRequest>) -> Result<Response<RaftReply>, Status> {
self.check_token(request.metadata())?;
common_tracing::extract_remote_span_as_parent(&request);

incr_meta_metrics_meta_recv_bytes(request.get_ref().encoded_len() as u64);

let action: MetaGrpcReadReq = request.try_into()?;

add_meta_metrics_meta_request_inflights(1);

tracing::info!("Receive read_action: {:?}", action);

let res = self.action_handler.execute_read(action).await;

add_meta_metrics_meta_request_inflights(-1);

incr_meta_metrics_meta_sent_bytes(res.encoded_len() as u64);

Ok(Response::new(res))
}

@@ -210,13 +232,20 @@ impl MetaService for MetaServiceImpl {
request: Request<TxnRequest>,
) -> Result<Response<TxnReply>, Status> {
self.check_token(request.metadata())?;
incr_meta_metrics_meta_recv_bytes(request.get_ref().encoded_len() as u64);
add_meta_metrics_meta_request_inflights(1);

common_tracing::extract_remote_span_as_parent(&request);

let request = request.into_inner();

tracing::info!("Receive txn_request: {:?}", request);

let body = self.action_handler.execute_txn(request).await;
add_meta_metrics_meta_request_inflights(-1);

incr_meta_metrics_meta_sent_bytes(body.encoded_len() as u64);

Ok(Response::new(body))
}

@@ -228,7 +257,11 @@ impl MetaService for MetaServiceImpl {
let members = meta_node.get_meta_addrs().await.map_err(|e| {
Status::internal(format!("Cannot get metasrv member list, error: {:?}", e))
})?;
Ok(Response::new(MemberListReply { data: members }))

let resp = MemberListReply { data: members };
incr_meta_metrics_meta_sent_bytes(resp.encoded_len() as u64);

Ok(Response::new(resp))
}
}
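
The handlers above pair a `+1` and a `-1` on the inflight gauge around the awaited call. As a design note, one way to keep such a pair balanced on every exit path is a guard that decrements on drop. The following is a minimal std-only sketch under stated assumptions: `INFLIGHTS`, `InflightGuard`, `inflights` and `handle` are hypothetical names, and a plain atomic stands in for the real gauge in `crate::metrics`, whose implementation is not shown in this diff.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical stand-in for the inflight gauge.
static INFLIGHTS: AtomicI64 = AtomicI64::new(0);

/// Increments the gauge on creation and decrements it when dropped.
struct InflightGuard;

impl InflightGuard {
    fn new() -> Self {
        INFLIGHTS.fetch_add(1, Ordering::Relaxed);
        InflightGuard
    }
}

impl Drop for InflightGuard {
    fn drop(&mut self) {
        INFLIGHTS.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Reads the current gauge value.
fn inflights() -> i64 {
    INFLIGHTS.load(Ordering::Relaxed)
}

/// Hypothetical handler body: the guard is released on both the
/// early-return path and the normal path.
fn handle(fail: bool) -> Result<&'static str, &'static str> {
    let _guard = InflightGuard::new();
    if fail {
        return Err("request failed");
    }
    Ok("done")
}
```

With this shape the gauge returns to its previous value whether the handler returns early, returns normally, or is dropped part-way through.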

18 changes: 17 additions & 1 deletion metasrv/src/api/http_service.rs
@@ -20,13 +20,18 @@ use common_base::base::Stoppable;
use common_exception::Result;
use common_tracing::tracing;
use poem::get;
use poem::http::StatusCode;
use poem::listener::RustlsConfig;
use poem::web::Json;
use poem::Endpoint;
use poem::EndpointExt;
use poem::IntoResponse;
use poem::Response;
use poem::Route;

use crate::configs::Config;
use crate::meta_service::MetaNode;
use crate::metrics::get_meta_metrics_node_is_health;

pub struct HttpService {
cfg: Config,
@@ -46,7 +51,7 @@ impl HttpService {
fn build_router(&self) -> impl Endpoint {
#[cfg_attr(not(feature = "memory-profiling"), allow(unused_mut))]
let mut route = Route::new()
.at("/v1/health", get(super::http::v1::health::health_handler))
.at("/v1/health", get(health_handler))
.at("/v1/config", get(super::http::v1::config::config_handler))
.at(
"/v1/cluster/nodes",
@@ -129,3 +134,14 @@ impl Stoppable for HttpService {
self.shutdown_handler.stop(force).await
}
}

#[poem::handler]
pub async fn health_handler() -> Response {
    if !get_meta_metrics_node_is_health() {
        return StatusCode::SERVICE_UNAVAILABLE.into_response();
    }
    Json(super::http::v1::health::HealthCheckResponse {
        status: super::http::v1::health::HealthCheckStatus::Pass,
    })
    .into_response()
}
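
The handler above reads a health flag that the raft metrics loop sets elsewhere in this commit. The real `set_meta_metrics_node_is_health` and `get_meta_metrics_node_is_health` live in `crate::metrics` and are not shown here, so the sketch below only assumes one plausible shape: a shared atomic flag, with hypothetical function names, mapped to the HTTP status codes 200 and 503.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical shared flag; assumed to start as "not healthy"
/// until the metrics loop reports a state.
static NODE_IS_HEALTH: AtomicBool = AtomicBool::new(false);

/// Called by the (assumed) metrics loop when the node state changes.
fn set_node_is_health(healthy: bool) {
    NODE_IS_HEALTH.store(healthy, Ordering::Relaxed);
}

/// Called by the health endpoint.
fn get_node_is_health() -> bool {
    NODE_IS_HEALTH.load(Ordering::Relaxed)
}

/// Maps the flag to an HTTP status code:
/// 200 when healthy, 503 (Service Unavailable) otherwise.
fn health_status_code() -> u16 {
    if get_node_is_health() {
        200
    } else {
        503
    }
}
```

A load balancer or orchestrator probing `/v1/health` would then see 503 until the node reports a healthy state.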
11 changes: 10 additions & 1 deletion metasrv/src/executor/action_handler.rs
@@ -24,6 +24,7 @@ use common_meta_types::TxnReply;
use common_meta_types::TxnRequest;

use crate::meta_service::MetaNode;
use crate::metrics::incr_meta_metrics_meta_request_result;

pub struct ActionHandler {
/// The raft-based meta data entry.
@@ -48,6 +49,7 @@ impl ActionHandler {
match action {
MetaGrpcWriteReq::UpsertKV(a) => {
let r = self.meta_node.upsert_kv(a).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
}
@@ -59,25 +61,32 @@ impl ActionHandler {
match action {
MetaGrpcReadReq::GetKV(a) => {
let r = self.meta_node.get_kv(&a.key).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
MetaGrpcReadReq::MGetKV(a) => {
let r = self.meta_node.mget_kv(&a.keys).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
MetaGrpcReadReq::ListKV(a) => {
let r = self.meta_node.prefix_list_kv(&a.prefix).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
MetaGrpcReadReq::PrefixListKV(a) => {
let r = self.meta_node.prefix_list_kv(&a.0).await;
incr_meta_metrics_meta_request_result(r.is_ok());
RaftReply::from(r)
}
}
}

pub async fn execute_txn(&self, req: TxnRequest) -> TxnReply {
match self.meta_node.transaction(req).await {
let ret = self.meta_node.transaction(req).await;
incr_meta_metrics_meta_request_result(ret.is_ok());

match ret {
Ok(resp) => resp,
Err(err) => TxnReply {
success: false,
7 changes: 7 additions & 0 deletions metasrv/src/meta_service/raftmeta.rs
@@ -53,6 +53,7 @@ use openraft::Config;
use openraft::Raft;
use openraft::RaftMetrics;
use openraft::SnapshotPolicy;
use openraft::State;
use tonic::Status;

use crate::meta_service::meta_leader::MetaLeader;
@@ -63,6 +64,7 @@ use crate::metrics::incr_meta_metrics_leader_change;
use crate::metrics::incr_meta_metrics_read_failed;
use crate::metrics::set_meta_metrics_current_leader;
use crate::metrics::set_meta_metrics_is_leader;
use crate::metrics::set_meta_metrics_node_is_health;
use crate::metrics::set_meta_metrics_proposals_applied;
use crate::network::Network;
use crate::store::MetaRaftStore;
@@ -379,6 +381,11 @@ impl MetaNode {
};
if changed.is_ok() {
let mm = metrics_rx.borrow().clone();

set_meta_metrics_node_is_health(
mm.state == State::Follower || mm.state == State::Leader,
);

if let Some(cur) = mm.current_leader {
// if current leader has changed?
if let Some(leader) = current_leader {

1 comment on commit 9ca506f

@vercel vercel bot commented on 9ca506f Jun 20, 2022

Successfully deployed to the following URLs:

databend – ./

databend.rs
databend-git-main-databend.vercel.app
databend-databend.vercel.app
databend.vercel.app
