Health check all of the 3rd party services #1469

Closed
xmonader opened this issue Nov 26, 2023 · 8 comments

@xmonader
Contributor

xmonader commented Nov 26, 2023

To increase user satisfaction, we should implement basic health checks that periodically check the performance of all third-party services we rely on, such as RMB, TFchain, GridProxy, GraphQL, and others. These checks should run at least once a minute (the interval can be revisited). Any detected performance degradation should be displayed promptly in the user interface and trigger notifications to alert users and system administrators. By proactively monitoring and addressing performance issues, we can minimize disruptions and provide a seamless user experience.

Note: alerting the system administrators will be addressed in a separate issue.
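
As a rough illustration of the idea above, here is a minimal sketch of a periodic check that records latency and flags degradation. All names (`Probe`, `probeOnce`, `startHealthChecks`) and both thresholds are assumptions made for this example, not part of the project.

```typescript
// Hypothetical types for illustration; none of these names come from the project.
type Probe = () => Promise<void>; // resolves if the service answered, rejects otherwise
type Health = "ok" | "degraded" | "down";

interface ProbeResult {
  service: string;
  health: Health;
  latencyMs: number;
}

// Run one probe and classify the outcome by how long it took.
async function probeOnce(service: string, probe: Probe, slowAfterMs: number): Promise<ProbeResult> {
  const started = Date.now();
  try {
    await probe();
    const latencyMs = Date.now() - started;
    return { service, health: latencyMs > slowAfterMs ? "degraded" : "ok", latencyMs };
  } catch {
    return { service, health: "down", latencyMs: Date.now() - started };
  }
}

// Probe every service each `intervalMs` and hand the results to `report`
// (for example a UI store). Returns a function that stops the schedule.
function startHealthChecks(
  probes: Record<string, Probe>,
  report: (results: ProbeResult[]) => void,
  intervalMs = 60_000, // assumed default: once a minute
  slowAfterMs = 2_000, // assumed threshold for "degraded"
): () => void {
  const tick = async () => {
    const results = await Promise.all(
      Object.entries(probes).map(([name, probe]) => probeOnce(name, probe, slowAfterMs)),
    );
    report(results);
  };
  void tick(); // first round immediately
  const timer = setInterval(() => void tick(), intervalMs);
  return () => clearInterval(timer);
}
```

The `report` callback is where a UI banner or a notification could be wired in; the returned function lets the caller stop the checks when the page or process shuts down.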

@xmonader xmonader added this to 3.14.x Nov 26, 2023
@xmonader xmonader added this to the 2.4.0 milestone Nov 26, 2023
@xmonader xmonader added the type_feature New feature or request label Dec 11, 2023
@AlaaElattar AlaaElattar removed this from 3.14.x Jan 17, 2024
@AlaaElattar AlaaElattar modified the milestones: 2.4.0, 2.3.0 Jan 17, 2024
@AlaaElattar AlaaElattar moved this to Accepted in 3.13.x Jan 17, 2024
@0oM4R 0oM4R moved this from Accepted to In Progress in 3.13.x Jan 18, 2024
@0oM4R
Contributor

0oM4R commented Jan 18, 2024

Work completed:

Created a monitoring package with a health check class for each service, used to ping it.

Investigation:

We now have a class for each service, but I am not yet sure how instances of those classes should be created. A single method that pings a URL is not enough: some services are checked through an HTTP request, some through the Polkadot version method, and RMB through its ping method. I will handle those implementations later.
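
One way to handle services that are pinged differently is a shared interface with one implementation per transport. The sketch below is only an assumption about how that could look: `HealthCheck`, `HttpHealthCheck`, and `CallbackHealthCheck` are hypothetical names, and the chain and RMB probes are left to the caller so that no client library API is guessed at.

```typescript
// Hypothetical interface; the issue only says each service gets its own class.
interface HealthCheck {
  readonly name: string;
  isAlive(): Promise<boolean>;
}

// Minimal shape of the response we rely on, so the check is not tied to one HTTP client.
type HttpGet = (url: string) => Promise<{ ok: boolean }>;

// Services reachable over plain HTTP only need a GET request.
class HttpHealthCheck implements HealthCheck {
  constructor(
    readonly name: string,
    private readonly url: string,
    // Defaults to the global fetch available in Node 18+ and browsers.
    private readonly get: HttpGet = (u) => fetch(u),
  ) {}

  async isAlive(): Promise<boolean> {
    try {
      return (await this.get(this.url)).ok;
    } catch {
      return false; // network errors count as "not alive"
    }
  }
}

// Services that need a protocol client (chain RPC, RMB) wrap whatever call
// that client offers; the probe itself is supplied by the caller.
class CallbackHealthCheck implements HealthCheck {
  constructor(
    readonly name: string,
    private readonly probe: () => Promise<unknown>,
  ) {}

  async isAlive(): Promise<boolean> {
    try {
      await this.probe();
      return true;
    } catch {
      return false;
    }
  }
}
```

With this split, the code that loops over services only depends on `isAlive()` and does not need to know which transport each check uses.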

@0oM4R 0oM4R mentioned this issue Jan 18, 2024
5 tasks
@0oM4R
Contributor

0oM4R commented Jan 21, 2024

Work completed:
Created an interface and several classes that implement it and carry their own internal logic.
Currently, we can run this script to ping the provided services:

import { servicesLiveChecker } from "../src/index";
async function HealthCheck() {
  try {
    console.log(await servicesLiveChecker("fakeURL", "", "wss://tfchain.dev.grid.tf/ws", "wss://relay.dev.grid.tf"));
    process.exit(0);
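  } catch (err) {
    // Report the failure and exit with a non-zero code so open connections do not keep the process alive.
    console.error(err);
    process.exit(1);
  }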
}

HealthCheck();

and the output:

2024-01-21 19:35:09        API/INIT: RPC methods not decorated: transaction_unstable_submitAndWatch, transaction_unstable_unwatch
2024-01-21 19:35:09        API/INIT: Not decorating runtime apis without matching versions: TransactionPaymentApi/4 (1 known), Metadata/2 (1 known)
2024-01-21 19:35:10        API/INIT: RPC methods not decorated: transaction_unstable_submitAndWatch, transaction_unstable_unwatch
2024-01-21 19:35:10        API/INIT: Not decorating runtime apis without matching versions: TransactionPaymentApi/4 (1 known), Metadata/2 (1 known)
disconnecting
{ GraphQl: 'Down', TFChain: 'Alive', RMB: 'Alive' }

I will enhance the code and provide a way to check the liveness of a custom service if needed.
Some enhancements still to do:

  • pass service URLs as an object;
  • provide retries;
  • stream errors.
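
The three enhancements listed above could fit together roughly as follows. This is a sketch under assumptions: `ServiceUrls`, `Ping`, `ErrorSink`, `pingWithRetries`, and `checkAll` are hypothetical names, and the actual ping is injected rather than implemented here.

```typescript
// Hypothetical shapes for illustration; the real package may differ.
type ServiceUrls = {
  graphQL?: string;
  gridProxy?: string;
  tfChain?: string;
  rmb?: string;
};
type Ping = (url: string) => Promise<void>; // rejects when the service is unreachable
type ErrorSink = (service: string, attempt: number, error: unknown) => void;

// Try `ping` up to `retries` times, reporting each failure as it happens.
async function pingWithRetries(
  service: string,
  url: string,
  ping: Ping,
  retries: number,
  onError: ErrorSink,
): Promise<boolean> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await ping(url);
      return true;
    } catch (error) {
      onError(service, attempt, error); // stream the error instead of swallowing it
    }
  }
  return false;
}

// Check every URL present in the object; keys without a URL are skipped.
async function checkAll(
  urls: ServiceUrls,
  ping: Ping,
  retries = 2,
  onError: ErrorSink = () => {},
): Promise<Record<string, "Alive" | "Down">> {
  const status: Record<string, "Alive" | "Down"> = {};
  for (const [service, url] of Object.entries(urls)) {
    if (!url) continue;
    status[service] = (await pingWithRetries(service, url, ping, retries, onError)) ? "Alive" : "Down";
  }
  return status;
}
```

Passing `onError` as a callback keeps the sketch free of any event library; an event emitter or a log stream could be plugged in at the same place.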

@xmonader
Contributor Author

It should be more like

async  () => {
   svc1 = ...
   svc2 = ...
   svc3 = ...
   svcs = [ svc1, ... ]
   then loop on each and call .isHealthy function 

}
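
Spelled out in TypeScript, the shape suggested in the pseudocode above might look like this. The `Service` interface and `checkServices` are hypothetical names used only to make the sketch self-contained.

```typescript
// Hypothetical interface: each service object exposes a name and isHealthy().
interface Service {
  name: string;
  isHealthy(): Promise<boolean>;
}

// Loop over the services and collect one boolean per service name.
async function checkServices(services: Service[]): Promise<Map<string, boolean>> {
  const report = new Map<string, boolean>();
  for (const svc of services) {
    try {
      report.set(svc.name, await svc.isHealthy());
    } catch {
      report.set(svc.name, false); // a throwing check is treated as unhealthy
    }
  }
  return report;
}
```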

@0oM4R
Contributor

0oM4R commented Jan 21, 2024

The script mentioned above is only an abstract usage example.
The current version of servicesLiveChecker is shown below.

Note: as I mentioned, I will enhance the code by passing an object and looping through its keys.

export async function servicesLiveChecker(
  GraphQlURL?: string,
  GridProxyURL?: string,
  TFChainURL?: string,
  RMBrelayURL?: string,
) {
  // Maps each checked service name to "Alive" or "Down".
  const LIVENESS: Record<string, string> = {};

  // Each check receives the URL of its own service; services without a URL are skipped.
  if (GraphQlURL) {
    LIVENESS["GraphQl"] = (await HealthChecker(new GraphQlHealthCheck(GraphQlURL))) ? "Alive" : "Down";
  }
  if (GridProxyURL) {
    LIVENESS["GridProxy"] = (await HealthChecker(new GridProxyHealthCheck(GridProxyURL))) ? "Alive" : "Down";
  }
  if (TFChainURL) {
    LIVENESS["TFChain"] = (await HealthChecker(new TFChainHealthCheck(TFChainURL))) ? "Alive" : "Down";
  }
  // The RMB check needs both the relay URL and the chain URL.
  if (RMBrelayURL && TFChainURL) {
    LIVENESS["RMB"] = (await HealthChecker(new RMBHealthCheck(RMBrelayURL, TFChainURL, "sr25519"))) ? "Alive" : "Down";
  }
  return LIVENESS;
}

// Retries the check up to `retries` times, then disconnects if the checker holds a connection.
async function HealthChecker(checker: ILivenessChecker, retries = 2) {
  let alive = false;
  while (!alive && retries > 0) {
    alive = await checker.LiveChecker();
    retries--;
  }
  if ("disconnectHandler" in checker) checker.disconnectHandler();
  return alive;
}
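
One detail in the retry helper above is that the disconnect step is skipped if a check throws. A possible variation, sketched here with hypothetical names (`Disconnectable`, `isDisconnectable`, `checkAndRelease`) and an assumed shape for `ILivenessChecker`, uses `try`/`finally` so the connection is always released:

```typescript
// Assumed shapes mirroring the names used in the snippet above.
interface ILivenessChecker {
  LiveChecker(): Promise<boolean>;
}
interface Disconnectable {
  disconnectHandler(): Promise<void> | void;
}

// Type guard so the compiler knows the optional handler is callable.
function isDisconnectable(x: object): x is Disconnectable {
  return "disconnectHandler" in x && typeof (x as Disconnectable).disconnectHandler === "function";
}

// Retry the check, and release the connection no matter how the check ends.
async function checkAndRelease(checker: ILivenessChecker, retries = 2): Promise<boolean> {
  try {
    for (let i = 0; i < retries; i++) {
      if (await checker.LiveChecker()) return true;
    }
    return false;
  } finally {
    // Runs on success, on failure, and when LiveChecker throws.
    if (isDisconnectable(checker)) await checker.disconnectHandler();
  }
}
```

In this sketch an error thrown by `LiveChecker` still propagates to the caller after the disconnect; whether it should instead count as "Down" is a design choice left open here.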

@0oM4R
Contributor

0oM4R commented Jan 21, 2024

It should be more like

async  () => {
   svc1 = ...
   svc2 = ...
   svc3 = ...
   svcs = [ svc1, ... ]
   then loop on each and call .isHealthy function 

}

Updated version of the monitoring function:

const HEALTH_CHECK_INTERVAL = 5000;
const MAX_RETRIES = 2;

export async function TFServicesLiveMonitor(services: TFServices, interval = HEALTH_CHECK_INTERVAL): Promise<void> {
  const serviceArray: IServiceMonitor[] = initializeServices(services);

  // Keep monitoring as long as there is at least one service to check.
  while (serviceArray.length) {
    for (const service of serviceArray) {
      await liveChecker(service);
    }
    await new Promise(resolve => setTimeout(resolve, interval));
  }
}

export function initializeServices(services: TFServices): IServiceMonitor[] {
  const serviceArray: IServiceMonitor[] = [];

  for (const serviceName in services) {
    switch (serviceName) {
      case "graphQL":
        serviceArray.push(new GraphQlMonitor(services.graphQL.LivenessURL));
        break;
      case "gridProxy":
        serviceArray.push(new GridProxyMonitor(services.gridProxy.LivenessURL));
        break;
      case "tfChain":
        serviceArray.push(new TFChainMonitor(services.tfChain.LivenessURL));
        break;
      case "rmb":
        serviceArray.push(new RMBMonitor(services.rmb.LivenessURL, services?.tfChain.LivenessURL, "sr25519"));
        break;
      default:
        console.warn(`Unknown service: ${serviceName}`);
        break;
    }
  }

  return serviceArray;
}

// Retries the liveness check up to `retries` times and reports the final status.
async function liveChecker(monitor: IServiceMonitor, retries = MAX_RETRIES): Promise<ServiceStatus> {
  let alive = false;
  while (!alive && retries > 0) {
    alive = await monitor.LiveChecker();
    retries--;
  }
  return { [monitor.ServiceName]: alive ? "Alive" : "Down" };
}

I am now updating the code so that it can report both whether a service is alive and whether it is healthy.
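
The thread does not define how "alive" and "healthy" differ, so the following is only one possible reading: "alive" means the service answers at all, and "healthy" means it also passes a deeper check. `ServiceMonitor`, `ServiceState`, `safely`, and `inspect` are hypothetical names for this sketch.

```typescript
// Hypothetical interface: the thread does not define what "healthy" means.
interface ServiceMonitor {
  serviceName: string;
  isAlive(): Promise<boolean>; // cheap reachability check
  isHealthy(): Promise<boolean>; // deeper check, only meaningful when alive
}

type ServiceState = "Healthy" | "Alive" | "Down";

// Treat a rejected check the same as a negative answer.
async function safely(check: () => Promise<boolean>): Promise<boolean> {
  try {
    return await check();
  } catch {
    return false;
  }
}

// Report "Down" if unreachable, "Alive" if reachable but failing the deeper check,
// and "Healthy" if both checks pass.
async function inspect(monitor: ServiceMonitor): Promise<Record<string, ServiceState>> {
  if (!(await safely(() => monitor.isAlive()))) {
    return { [monitor.serviceName]: "Down" };
  }
  const healthy = await safely(() => monitor.isHealthy());
  return { [monitor.serviceName]: healthy ? "Healthy" : "Alive" };
}
```

Skipping the deeper check when the service is unreachable avoids a second timeout on a service that is already known to be down.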

@AlaaElattar AlaaElattar moved this from In Progress to Pending review in 3.13.x Jan 22, 2024
@0oM4R 0oM4R moved this from Pending review to In Progress in 3.13.x Jan 29, 2024
@0oM4R
Contributor

0oM4R commented Jan 29, 2024

Work completed:

  • resolved the PR comments; only a few items are left.

Work in Progress (WIP):

  • refactor the events handler

Investigation and Solution:

  • will add the service monitor class and expose it for use in other clients
  • add Jest tests

@0oM4R 0oM4R moved this from In Progress to Pending review in 3.13.x Jan 30, 2024
@0oM4R 0oM4R moved this from Pending review to Accepted in 3.13.x Feb 5, 2024
@0oM4R 0oM4R moved this from Accepted to In Progress in 3.13.x Feb 5, 2024
@0oM4R
Contributor

0oM4R commented Feb 5, 2024

Work completed:
resolved some comments on the PR

@0oM4R 0oM4R moved this from In Progress to Pending review in 3.13.x Feb 5, 2024
@0oM4R 0oM4R moved this from Pending review to In Progress in 3.13.x Feb 12, 2024
@0oM4R 0oM4R moved this from In Progress to In Verification in 3.13.x Feb 12, 2024
@A-Harby
Contributor

A-Harby commented Feb 21, 2024

Verified, Dev branch.

The script worked fine and reported the correct status (the proxy URL was a fake one):
[image]

With the correct URLs:
[image]

TC2620 - Monitoring Script

@A-Harby A-Harby moved this from In Verification to Done in 3.13.x Feb 21, 2024