
Add machine info to telemetry. Closes #404 (#411)

Merged: 16 commits into main from add/machine-stats on May 3, 2024
Conversation

@juliangruber (Member) commented Apr 16, 2024

Closes #404

I didn't see a good way to test free disk space (it depends on mount points etc.), but that's not relevant for Station modules at the moment anyway.
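For context, a minimal sketch of what collecting machine info and recording it as a telemetry point could look like, assuming node:os and the @influxdata/influxdb-client package. The URL, org, bucket, measurement and field names below are placeholder assumptions, not the exact code in this PR:

```js
// Sketch only: gather basic machine info with node:os and write it to InfluxDB.
// Bucket, measurement and field names below are illustrative assumptions.
import os from 'node:os'
import { InfluxDB, Point } from '@influxdata/influxdb-client'

const influx = new InfluxDB({
  url: 'https://influxdb.example.com', // placeholder URL
  token: process.env.INFLUXDB_TOKEN
})
const writeApi = influx.getWriteApi('my-org', 'station-machines')

const point = new Point('machine')
  .tag('platform', os.platform())
  .tag('arch', os.arch())
  .intField('cpus', os.cpus().length)
  .intField('total_memory', os.totalmem())
  .intField('free_memory', os.freemem())

writeApi.writePoint(point)
await writeApi.close() // flush buffered points
```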

@bajtos (Member) left a comment

@juliangruber could you please coordinate with @PatrickNercessian? I think he will want to use this data for the public dashboard too.

@juliangruber (Member, Author)

> @juliangruber could you please coordinate with @PatrickNercessian? I think he will want to use this data for the public dashboard too.

Good idea! @PatrickNercessian what do you think about these machine metrics and how they are stored?

@PatrickNercessian (Contributor)

> @juliangruber could you please coordinate with @PatrickNercessian? I think he will want to use this data for the public dashboard too.

> Good idea! @PatrickNercessian what do you think about these machine metrics and how they are stored?

The original plan was to use Rust's built-in libraries for device info like this in Zinnia, and pass it to the modules. Do we prefer to check this inside Core and pass it to Zinnia, which then passes it to the modules, similar to station_id?

If so, should we instead generate this info outside of the telemetry context, so we can pass it to Zinnia easily, and then just pass it to the telemetry functions?

@bajtos (Member) commented Apr 18, 2024

I am strongly against passing machine info from Station Core to Zinnia and Station Modules.

As I understand the proposal in this pull request, it records machine telemetry directly from Station Core to our InfluxDB, as an alternative design/solution to what we envisioned with Zinnia & Station Modules.

If we land this PR, we will have telemetry points about Station machines written to InfluxDB and stored in a bucket with a 30-day retention (see #404).

How much work is it to ingest this telemetry from InfluxDB into the Station public dashboard you are building, @PatrickNercessian? I think this will require a new component, different from what we have discussed so far, that periodically reads data from InfluxDB and writes aggregated metrics to a place where they can be persisted forever and exposed to the Dashboard UI.

@juliangruber (Member, Author)

I agree with @bajtos, I don't see the benefit in passing this machine data to the module source. Every Station should submit this, and modules don't need to be aware of it. Also, this is not per-module information. Or am I missing why it's important for the module to read this?

@juliangruber (Member, Author)

I think your plan might have been to submit this machine info with every measurement that the module submits, right? Then indeed it would need to be passed to the module.

@bajtos (Member) commented Apr 18, 2024

BTW, in the longer term, I'd like to replace the process-uuid value with stationId introduced by #406, sign the payload using the private key and include the signature as another field in the telemetry point. This allows consumers to verify that the correct Station Core instance submitted the machine spec, making it more difficult for potential attackers to submit a fake machine spec on behalf of a victim.

By including stationId in the machine spec, we can also link measurements committed to Spark or Voyager to the specs of the machine that performed those measurements.

I am just sharing my thoughts and visions. This is not blocking progress on this pull request; we can land it without stationId.
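To illustrate the signing idea above, here is a purely hypothetical sketch (not code from this PR) of signing the machine-spec payload with an Ed25519 key via node:crypto; the key handling and field names are assumptions:

```js
// Hypothetical sketch: sign the machine-spec payload and attach the signature
// as another field, so consumers can verify which Station submitted it.
import { generateKeyPairSync, sign, verify } from 'node:crypto'

// In reality the Station would reuse the key pair behind its stationId (#406);
// generating one here only keeps the example self-contained.
const { publicKey, privateKey } = generateKeyPairSync('ed25519')

const spec = { stationId: '<station-id>', platform: 'linux', cpus: 8 }
const payload = Buffer.from(JSON.stringify(spec))

// Ed25519 takes no separate digest algorithm, hence the `null` argument
const signature = sign(null, payload, privateKey).toString('base64')

// A consumer holding the Station's public key can check the origin
const ok = verify(null, payload, publicKey, Buffer.from(signature, 'base64'))
console.log({ signature, ok }) // ok === true
```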

@juliangruber marked this pull request as ready for review April 18, 2024 12:21
@juliangruber requested a review from bajtos April 18, 2024 12:21
@PatrickNercessian (Contributor)

Ah, I didn't realize this was writing to our centralized InfluxDB; I assumed it was for something local to each Core deployment...

I haven't worked with InfluxDB before, so the reliability of my effort estimate is somewhat limited. That said, I can think of a few routes we could go.

A) In spark-evaluate, for every stationId within every round (we don't need to do this for every measurement), we check InfluxDB for the latest machine info and store it in the (stationId, day) row of daily_node_metrics along with the rest of the columns. Whether this makes sense may depend on how long round evaluations currently take and what the latency of querying InfluxDB for this information would be.
B) In some new continuously polling service, we check InfluxDB for new machine information and store it in a new Postgres table. The daily_node_metrics table then references that table via a foreign key.

There are probably other ways we could do it too; these are just my first thoughts.
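For a rough idea of the read side either route would need, a sketch using @influxdata/influxdb-client with assumed bucket and measurement names (not existing code in our services) might look like this, fetching the latest machine-info point per series with a Flux query:

```js
// Sketch: read the most recent machine-info point per series from InfluxDB.
// Bucket and measurement names are assumptions for illustration.
import { InfluxDB } from '@influxdata/influxdb-client'

const influx = new InfluxDB({
  url: 'https://influxdb.example.com', // placeholder URL
  token: process.env.INFLUXDB_READ_TOKEN
})
const queryApi = influx.getQueryApi('my-org')

const fluxQuery = `
  from(bucket: "station-machines")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "machine")
    |> last()
`
// collectRows() buffers the full result set into an array of row objects
const rows = await queryApi.collectRows(fluxQuery)
console.log(rows)
```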

@juliangruber (Member, Author)

@PatrickNercessian we discussed the requirements for this and realized that storing in Influx is fine, since we are mostly interested in the current state of the network, not historic data, and that is easy and cheap to query from Influx.

@PatrickNercessian (Contributor)

> @PatrickNercessian we discussed the requirements for this and realized that storing in Influx is fine, since we are mostly interested in the current state of the network, not historic data, and that is easy and cheap to query from Influx.

Sounds good. Do our centralized services have permission to read from InfluxDB? I don't see any existing code about it.

Also, any thoughts on the above two ideas about high-level implementation for this querying?

@bajtos (Member) left a comment

LGTM

@juliangruber (Member, Author)

> Sounds good. Do our centralized services have permission to read from InfluxDB? I don't see any existing code about it.

Yes! With an access token, it's readable from the public internet. I believe there were some snippets reading from it; currently our Grafana instance is the main consumer.

> Also, any thoughts on the above two ideas about high-level implementation for this querying?

I think it would be nicer for spark-stats to periodically query this information. It can also listen to the smart contract for round-start events. Otherwise we'd be adding a task to spark-evaluate that has nothing to do with evaluating the work done.
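A minimal sketch of that periodic-poll idea (both helpers are placeholders, not existing spark-stats functions):

```js
// Hypothetical polling loop: refresh machine info from InfluxDB on a timer
// instead of coupling the work to spark-evaluate's round handling.
// Both helpers are placeholder stubs, not existing spark-stats functions.
const queryLatestMachineInfo = async () => [] // e.g. the Flux query shown earlier
const updateMachineStats = async (machines) => {} // e.g. upsert into Postgres

const POLL_INTERVAL_MS = 10 * 60 * 1000 // e.g. every 10 minutes

setInterval(async () => {
  try {
    const machines = await queryLatestMachineInfo()
    await updateMachineStats(machines)
  } catch (err) {
    console.error('Failed to refresh machine stats', err)
  }
}, POLL_INTERVAL_MS)
```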

@juliangruber merged commit 699dc39 into main May 3, 2024
15 checks passed
@juliangruber deleted the add/machine-stats branch May 3, 2024 11:31