Port the metrics db out of Explorer infrastructure and into N1 #11156

Open
gmilescu opened this issue Apr 25, 2024 · 6 comments

@gmilescu

No description provided.

@Trisfald
Contributor

Trisfald commented May 3, 2024

After an initial investigation, I propose the following plan to port the telemetry infrastructure into N1.

First milestone - feature parity:

  • Set up a Postgres database in N1
    • Understand and clean up the schema used in the explorer database (ref)
    • Understand what data is needed and what is not (data sent by nodes)
    • Understand the telemetry protocol details (format of the data, frequency, load, how it is persisted and queried)
    • Create the DB instance in GCP (with high availability). One instance serves both mainnet and testnet, each with its own database
  • Create the middle layer service to receive telemetry data
    • Handling schema creation
    • Expose an HTTP endpoint accepting POST requests with JSON payload
    • Write the data received into the database
    • Metrics and health checks
    • Packaging and CI
    • Deploy multiple replicas for high availability
  • Set up alerting and monitoring for the new infrastructure
  • Configure the new Postgres datasource in N1 Grafana (docs)
  • Make a copy of the dashboard(s) and use the new datasource from NearOne
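
The middle layer service described in the first milestone (schema creation, an HTTP endpoint accepting POST requests with JSON, database writes, a health check) could look roughly like the sketch below. This is an illustration only, not the service that was built: the `/telemetry` and `/healthz` paths, the `node_id` field, and the table layout are hypothetical, and SQLite stands in for Postgres so the example runs with the Python standard library alone.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical table layout; the real schema derives from the explorer database.
SCHEMA = """
CREATE TABLE IF NOT EXISTS telemetry (
    node_id     TEXT PRIMARY KEY,
    payload     TEXT NOT NULL,
    received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def init_db(conn):
    """Create the table at startup (the "handling schema creation" step)."""
    conn.execute(SCHEMA)
    conn.commit()


def handle(conn, method, path, body):
    """Route one request and return (status_code, response_dict)."""
    if method == "GET" and path == "/healthz":
        conn.execute("SELECT 1")  # cheap check that the database answers
        return 200, {"status": "ok"}
    if method == "POST" and path == "/telemetry":
        try:
            data = json.loads(body)
            node_id = data["node_id"]  # assumed field name
        except (ValueError, KeyError, TypeError):
            return 400, {"error": "expected a JSON object with node_id"}
        if not isinstance(node_id, str):
            return 400, {"error": "node_id must be a string"}
        # Keep only the latest report per node.
        conn.execute(
            "INSERT OR REPLACE INTO telemetry (node_id, payload) VALUES (?, ?)",
            (node_id, json.dumps(data)),
        )
        conn.commit()
        return 200, {"status": "stored"}
    return 404, {"error": "not found"}


def make_handler(conn):
    """Build a request handler class bound to one database connection."""

    class Handler(BaseHTTPRequestHandler):
        def _reply(self, method):
            length = int(self.headers.get("Content-Length") or 0)
            body = self.rfile.read(length) if length else b""
            status, payload = handle(conn, method, self.path, body)
            out = json.dumps(payload).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(out)))
            self.end_headers()
            self.wfile.write(out)

        def do_GET(self):
            self._reply("GET")

        def do_POST(self):
            self._reply("POST")

    return Handler


def serve(conn, port=8080):
    """Run a single-threaded server until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(conn)).serve_forever()
```

Keeping the routing in a plain `handle` function separates request handling from the HTTP plumbing, so it can be exercised without opening a socket. To try it locally, call `init_db` on a `sqlite3.connect(":memory:")` connection and pass that connection to `serve`.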

Second milestone - migration:

  • Make sure the new solution works and is stable
  • Ask Pagoda to redirect telemetry traffic to the new service
  • Update dashboards
  • Add the new telemetry endpoint to the node's config
  • (next release) Deprecate and remove the old telemetry endpoint

Third milestone - nice to have:

  • Autodetect the chain ID
    • TBD if we want to do this
    • Modify the telemetry protocol to include the chain ID
    • Switch to a single endpoint and use chain ID to distinguish between mainnet and testnet
  • Remove Pagoda telemetry forwarding
    • Update the node config to use the new telemetry endpoint. Use a single endpoint if chain detection has been implemented; otherwise keep separate mainnet and testnet endpoints.
    • Later, ask Pagoda to remove the HTTP forwarding once no new requests hit the old endpoints.
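
As a rough illustration of the chain ID idea in the third milestone, a single endpoint could choose the target database from a field in the payload and fall back to the per-chain endpoint for nodes that do not send it yet. The `chain_id` field name and the database names below are assumptions; the plan does not specify them.

```python
# Hypothetical database names; per the plan, one Postgres instance hosts both.
DATABASES = {"mainnet": "telemetry_mainnet", "testnet": "telemetry_testnet"}


def pick_database(payload, path_chain=None):
    """Return the database name for one telemetry payload.

    Prefers the chain ID carried in the payload (assumed field "chain_id").
    Falls back to the chain implied by a per-chain endpoint (path_chain).
    Raises ValueError if neither names a known chain.
    """
    chain = payload.get("chain_id") or path_chain
    if not isinstance(chain, str) or chain not in DATABASES:
        raise ValueError(f"unknown or missing chain ID: {chain!r}")
    return DATABASES[chain]
```

With this shape, nodes that already send the chain ID can move to the single endpoint while older nodes keep using the per-chain ones.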

@Trisfald
Contributor

Update as of Fri May 10:

  • Created a page in Outline for this project: link.
  • Fleshed out the implementation plan and the design.
  • Progress made on the data model: understanding how telemetry data is stored and queried, which data is useful and which isn't, clearing up unimplemented and legacy fields.
  • Started working on the thin HTTP service layer.

@Trisfald
Contributor

Fri 17 May Update:

  • Provisioned the database on GCP
  • Completed the work on the HTTP server
  • Deployed the necessary resources on GCP, including monitoring
  • Ran a synthetic load test and a functional test with a local validator

@Trisfald
Contributor

Fri May 24 update:

  • Set up the new telemetry subdomain: https://telemetry.nearone.org
  • Tested with testnet canaries
  • Added the new datasources to Grafana
  • Created a dashboard with the new datasources

Some changes to the overall plan:

  • Pagoda will simply redirect telemetry traffic to N1 (probably next week)

@Trisfald
Contributor

Redirecting telemetry traffic from Pagoda to N1 is non-trivial, so I'm falling back to keeping both endpoints for a while and rotating out the Pagoda endpoint in the future.
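
Keeping both endpoints means each report goes to two services, and a failure of one should not block the other. The sketch below shows that fan-out under stated assumptions: it is written in Python purely for illustration and is not the node's actual reporting code, and `post_json` and `report` are hypothetical helper names.

```python
import json
import urllib.request


def post_json(url, payload, timeout=5.0):
    """POST payload as JSON; raises on network errors and HTTP error statuses."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()


def report(payload, endpoints, send=post_json):
    """Send one report to every endpoint and return {url: succeeded}."""
    results = {}
    for url in endpoints:
        try:
            send(url, payload)
            results[url] = True
        except Exception:
            # One failing endpoint must not prevent delivery to the others.
            results[url] = False
    return results
```

Rotating out the old endpoint later then amounts to dropping its URL from the `endpoints` list.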

github-merge-queue bot pushed a commit that referenced this issue May 29, 2024
…r telemetry (#11407)

Add a new telemetry endpoint hosted on N1 alongside the old telemetry
endpoint hosted by Pagoda and based on the legacy explorer. Part of
#11156.

EDIT: adding the new endpoint without replacing the old one
@Trisfald
Contributor

The new telemetry endpoint is active and nodes are sending traffic to it. Also, automatic chain detection has been implemented.
