
Add machine info to telemetry. Closes #404 (#411)

Merged: 16 commits into main from add/machine-stats on May 3, 2024
Conversation

@juliangruber (Member) commented Apr 16, 2024

Closes #404

I didn't see a good way to test free disk space (it depends on mount points etc.), but that's not relevant for Station modules at the moment anyway.
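For context, a minimal sketch of what collecting machine info and recording it as a telemetry point could look like, assuming node:os and the @influxdata/influxdb-client package. The URL, org, bucket, measurement and field names below are placeholder assumptions, not the exact code in this PR:

```js
// Sketch only: gather basic machine info with node:os and write it to InfluxDB.
// Bucket, measurement and field names below are illustrative assumptions.
import os from 'node:os'
import { InfluxDB, Point } from '@influxdata/influxdb-client'

const influx = new InfluxDB({
  url: 'https://influxdb.example.com', // placeholder URL
  token: process.env.INFLUXDB_TOKEN
})
const writeApi = influx.getWriteApi('my-org', 'station-machines')

const point = new Point('machine')
  .tag('platform', os.platform())
  .tag('arch', os.arch())
  .intField('cpus', os.cpus().length)
  .intField('total_memory', os.totalmem())
  .intField('free_memory', os.freemem())

writeApi.writePoint(point)
await writeApi.close() // flush buffered points
```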

@bajtos (Member) left a comment

@juliangruber could you please coordinate with @PatrickNercessian? I think he will want to use this data for the public dashboard too.

@juliangruber (Member, Author)

> @juliangruber could you please coordinate with @PatrickNercessian? I think he will want to use this data for the public dashboard too.

Good idea! @PatrickNercessian what do you think about these machine metrics and how they are stored?

@PatrickNercessian (Contributor)

> @juliangruber could you please coordinate with @PatrickNercessian? I think he will want to use this data for the public dashboard too.

> Good idea! @PatrickNercessian what do you think about these machine metrics and how they are stored?

The original plan was to use Rust's built-in libraries for device info like this in Zinnia, and pass it to the modules. Do we prefer to check this inside Core and pass it to Zinnia, which then passes it to the modules, similar to station_id?

If so, should we instead generate this info outside of the telemetry context, so we can pass it to Zinnia easily, and then just pass it to the telemetry functions?

@bajtos (Member) commented Apr 18, 2024

I am strongly against passing machine info from Station Core to Zinnia and Station Modules.

As I understand the proposal in this pull request, it records machine telemetry directly from Station Core to our InfluxDB, as an alternative design/solution to what we envisioned with Zinnia & Station Modules.

If we land this PR, we will have telemetry points about Station machines written to InfluxDB and stored in a bucket with a 30-day retention (see #404).

How much work is it to ingest this telemetry from InfluxDB into the Station public dashboard you are building, @PatrickNercessian? I think this will require a new component, different from what we have discussed so far, that periodically reads data from InfluxDB and writes aggregated metrics to a place where they can be persisted forever and exposed to the Dashboard UI.

@juliangruber (Member, Author)

I agree with @bajtos, I don't see the benefit in passing this machine data to the module source. Every Station should submit this, and modules don't need to be aware of it. Also, this is not per-module information. Or am I missing why it's important for the module to read this?

@juliangruber (Member, Author)

I think your plan might have been to submit this machine info with every measurement that the module submits, right? Then indeed it would need to be passed to the module.

@bajtos (Member) commented Apr 18, 2024

BTW, in the longer term, I'd like to replace the process-uuid value with stationId introduced by #406, sign the payload using the private key and include the signature as another field in the telemetry point. This allows consumers to verify that the correct Station Core instance submitted the machine spec, making it more difficult for potential attackers to submit a fake machine spec on behalf of a victim.

By including stationId in the machine spec, we can also link measurements committed to Spark or Voyager to the specs of the machine that performed those measurements.

I am just sharing my thoughts and visions. This is not blocking progress on this pull request; we can land it without stationId.
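To illustrate the signing idea above, here is a purely hypothetical sketch (not code from this PR) of signing the machine-spec payload with an Ed25519 key via node:crypto; the key handling and field names are assumptions:

```js
// Hypothetical sketch: sign the machine-spec payload and attach the signature
// as another field, so consumers can verify which Station submitted it.
import { generateKeyPairSync, sign, verify } from 'node:crypto'

// In reality the Station would reuse the key pair behind its stationId (#406);
// generating one here only keeps the example self-contained.
const { publicKey, privateKey } = generateKeyPairSync('ed25519')

const spec = { stationId: '<station-id>', platform: 'linux', cpus: 8 }
const payload = Buffer.from(JSON.stringify(spec))

// Ed25519 takes no separate digest algorithm, hence the `null` argument
const signature = sign(null, payload, privateKey).toString('base64')

// A consumer holding the Station's public key can check the origin
const ok = verify(null, payload, publicKey, Buffer.from(signature, 'base64'))
console.log({ signature, ok }) // ok === true
```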

@juliangruber marked this pull request as ready for review April 18, 2024 12:21
@juliangruber requested a review from bajtos April 18, 2024 12:21
@PatrickNercessian (Contributor)

Ah, I didn't realize this was writing to our centralized InfluxDB; I assumed it was for something local to each Core deployment...

I haven't worked with InfluxDB before, so the reliability of my effort estimate is somewhat limited. That said, I can think of a few routes we could go.

A) In spark-evaluate, for every stationId within every round (we don't need to do this for every measurement), we check InfluxDB for the latest machine info and store it in the (stationId, day) row of daily_node_metrics along with the rest of the columns. Whether this makes sense may depend on how long round evaluations currently take and what the latency of querying InfluxDB for this information would be.
B) In some new continuously polling service, we check InfluxDB for new machine information and store it in a new Postgres table. The daily_node_metrics table then references that table via a foreign key.

There are probably other ways we could do it too; these are just my first thoughts.
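For a rough idea of the read side either route would need, a sketch using @influxdata/influxdb-client with assumed bucket and measurement names (not existing code in our services) might look like this, fetching the latest machine-info point per series with a Flux query:

```js
// Sketch: read the most recent machine-info point per series from InfluxDB.
// Bucket and measurement names are assumptions for illustration.
import { InfluxDB } from '@influxdata/influxdb-client'

const influx = new InfluxDB({
  url: 'https://influxdb.example.com', // placeholder URL
  token: process.env.INFLUXDB_READ_TOKEN
})
const queryApi = influx.getQueryApi('my-org')

const fluxQuery = `
  from(bucket: "station-machines")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "machine")
    |> last()
`
// collectRows() buffers the full result set into an array of row objects
const rows = await queryApi.collectRows(fluxQuery)
console.log(rows)
```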

@juliangruber (Member, Author)

@PatrickNercessian we discussed the requirements for this and realized that storing in Influx is fine, since we are mostly interested in the current state of the network, not historic data, and that is easy and cheap to query from Influx.

@PatrickNercessian (Contributor)

> @PatrickNercessian we discussed the requirements for this and realized that storing in Influx is fine, since we are mostly interested in the current state of the network, not historic data, and that is easy and cheap to query from Influx.

Sounds good. Do our centralized services have permission to read from InfluxDB? I don't see any existing code about it.

Also, any thoughts on the above two ideas about high-level implementation for this querying?

@bajtos (Member) left a comment

LGTM

@juliangruber (Member, Author)

> Sounds good. Do our centralized services have permission to read from InfluxDB? I don't see any existing code about it.

Yes! With an access token, it's readable from the public internet. I believe there were some snippets reading from it; currently our Grafana instance is the main consumer.

> Also, any thoughts on the above two ideas about high-level implementation for this querying?

I think it would be nicer for spark-stats to periodically query this information. It can also listen to the smart contract for round-start events. Otherwise we'd be adding a task to spark-evaluate that has nothing to do with evaluating the work done.
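A minimal sketch of that periodic-poll idea (both helpers are placeholders, not existing spark-stats functions):

```js
// Hypothetical polling loop: refresh machine info from InfluxDB on a timer
// instead of coupling the work to spark-evaluate's round handling.
// Both helpers are placeholder stubs, not existing spark-stats functions.
const queryLatestMachineInfo = async () => [] // e.g. the Flux query shown earlier
const updateMachineStats = async (machines) => {} // e.g. upsert into Postgres

const POLL_INTERVAL_MS = 10 * 60 * 1000 // e.g. every 10 minutes

setInterval(async () => {
  try {
    const machines = await queryLatestMachineInfo()
    await updateMachineStats(machines)
  } catch (err) {
    console.error('Failed to refresh machine stats', err)
  }
}, POLL_INTERVAL_MS)
```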

@juliangruber merged commit 699dc39 into main May 3, 2024
15 checks passed
@juliangruber deleted the add/machine-stats branch May 3, 2024 11:31