OpenTelemetry metrics & Miner Index fix for document max-size limit & new deal-watcher #811

jsign · 2021-03-25T13:16:25Z

This PR includes multiple improvements:

The Miner Index was refactored, and new flags/envs control whether the index is refreshed on Powergate start, the level of parallelization used to speed up rebuilding the index, and the frequency of index refresh.
Any go-datastore implementation is now wrapped with a customized go-ds-measure to publish Prometheus metrics for Get, Put, Query, Commit, and the other usual go-datastore operations. We forked the original repo to add Txn-based metrics.
OpenCensus was removed and replaced with OpenTelemetry.
The miner index is no longer saved as a single complete snapshot; it is now saved per miner. This fixes a limitation of the MongoDB-backed go-datastore, which can't store documents bigger than 16MB.
A new DealWatcher component was created to provide a push-based way to detect deal state changes as soon as they happen. The existing polling is kept because we aren't totally sure how reliable the new mechanism is. In fact, the push-based approach was the original design, and we switched to polling because long-running websocket connections of go-jsonrpc for the Lotus client were unreliable. Keeping both gives us the benefits of instant updates, and if the watcher breaks for some reason, we still poll once in a while (apart from trying to recover the broken watcher).
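
To make the "push plus polling fallback" idea concrete, here is a minimal sketch using only the standard library. It isn't the DealWatcher code from this PR; `nextTrigger`, `runDemo` and both channels are hypothetical stand-ins for the watcher signal and the poll ticker.

```go
package main

import (
	"fmt"
	"time"
)

// nextTrigger blocks until either the push-based watcher signals a change or
// the fallback poll timer fires, and reports which one woke us up.
// Both channels are hypothetical stand-ins, not the real Powergate types.
func nextTrigger(watcherUpdates <-chan struct{}, pollTick <-chan time.Time) string {
	select {
	case <-watcherUpdates:
		return "push"
	case <-pollTick:
		return "poll"
	}
}

// runDemo simulates one push notification followed by one poll tick.
func runDemo() (string, string) {
	watcherUpdates := make(chan struct{}, 1)
	pollTick := make(chan time.Time, 1)

	// A healthy watcher delivers the update right away.
	watcherUpdates <- struct{}{}
	first := nextTrigger(watcherUpdates, pollTick)

	// If the watcher goes silent, the periodic poll still wakes us up.
	pollTick <- time.Now()
	second := nextTrigger(watcherUpdates, pollTick)

	return first, second
}

func main() {
	first, second := runDemo()
	fmt.Println(first, second)
}
```

In a real loop, `pollTick` would typically be the `C` channel of a `time.Ticker`, and either trigger would lead to the same "re-check the deal state" code path.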

A new set of application metrics is exposed:

General metrics about the build:
- Build date
- Git summary
- Git branch
- Git commit
Datastore:
- Latencies of all the go-datastore operations mentioned above.
Deal watcher metrics:
- How many deal updates we get from the chain
- How many times the watcher got in bad health and had to be re-created. This helps to understand how reliable this new mechanism is.
Deal module:
- A counter of how many deals are being actively tracked (in-progress)
- Same as above but for retrievals
- A counter of the number of finalized deals, tagged with success or error.
- A counter of the number of bytes of finalized deals, tagged with success or error.
Cold-storage:
- A gauge to understand the data pre-processing queue (deal size + CommP calculation), tagged with "waiting" and "in-progress". (Remember that we rate-limit by default to 2 "in-progress" since it's a heavy process to avoid killing the Lotus node)
Scheduler:
- A counter to show the Jobs in final status, each tagged with "success", "failed", "canceled".
- A gauge to show Jobs in "queued" and "executing" status.
Indices:
- Ask and Miner index progress gauge (0%/100%).
Wallet module:
- A counter of the number of created wallet addresses
- A histogram of wallet transfer amounts.

I believe there's a lot of room to keep thinking about whether these are the right metrics, or whether other ones would be useful to gather. The main point here was to kick things off; we can keep making incremental improvements now that we have things wired.

Here's a quick screenshot of the Grafana dashboard. It has some rough edges, but we'll eventually also publish it to Grafana dashboards:

While doing this work, some secondary repos were forked+edited and created:

https://github.com/textileio/go-ds-measure/: Forked from original and added txn related metrics.
https://github.com/textileio/go-metrics-opentelemetry: New repo that provides an implementation of http://github.com/ipfs/go-metrics-interface, which is the interface used in go-ds-measure (since it uses a generic interface for metrics collection).

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

jsign · 2021-03-25T13:17:36Z

api/client/utils_test.go

+		IndexMinersRefreshOnStart:     false,
+		IndexMinersOnChainMaxParallel: 1,
+		IndexMinersOnChainFrequency:   time.Minute,


New knobs via env/flags.
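
As a rough illustration of how these three knobs could be parsed, here is a minimal sketch using only the standard `flag` package. The flag names are hypothetical (the real flag/env names aren't shown in this diff), and the defaults simply copy the test config values above, which may differ from the daemon defaults.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// indexConfig mirrors the three fields shown in the diff above.
type indexConfig struct {
	IndexMinersRefreshOnStart     bool
	IndexMinersOnChainMaxParallel int
	IndexMinersOnChainFrequency   time.Duration
}

// parseIndexFlags parses hypothetical flag names into an indexConfig.
func parseIndexFlags(args []string) (indexConfig, error) {
	var c indexConfig
	fs := flag.NewFlagSet("index-flags-sketch", flag.ContinueOnError)
	fs.BoolVar(&c.IndexMinersRefreshOnStart, "index-miners-refresh-on-start", false,
		"refresh the miner index when the daemon starts")
	fs.IntVar(&c.IndexMinersOnChainMaxParallel, "index-miners-onchain-max-parallel", 1,
		"max parallelism while rebuilding the on-chain index")
	fs.DurationVar(&c.IndexMinersOnChainFrequency, "index-miners-onchain-frequency", time.Minute,
		"how often the on-chain index is refreshed")
	if err := fs.Parse(args); err != nil {
		return indexConfig{}, err
	}
	// Zero or negative parallelism would stall the rebuild, so reject it early.
	if c.IndexMinersOnChainMaxParallel < 1 {
		return indexConfig{}, fmt.Errorf("max parallel must be >= 1, got %d", c.IndexMinersOnChainMaxParallel)
	}
	return c, nil
}

func main() {
	c, err := parseIndexFlags([]string{
		"-index-miners-refresh-on-start",
		"-index-miners-onchain-max-parallel=8",
		"-index-miners-onchain-frequency=30s",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", c)
}
```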

jsign · 2021-03-25T13:18:31Z

api/server/admin/service.go

-	minerIndex "github.com/textileio/powergate/v2/index/miner/module"
+	minerIndex "github.com/textileio/powergate/v2/index/miner/lotusidx"


Switching names since module still feels too abstract. I mention lotus in the new name since this implementation of index/miner uses Lotus as the Filecoin client.

jsign · 2021-03-25T13:19:11Z

api/server/server.go

@@ -74,6 +76,7 @@ var (
 		2: migration.V2StorageInfoDealIDs,
 		3: migration.V3StorageJobsIndexMigration,
 		4: migration.V4RecordsMigration,
+		5: migration.V5DeleteOldMinerIndex,


Simple migration to delete old Miner Index store, since the new one uses another key namespace.

jsign · 2021-03-25T13:19:45Z

api/server/server.go

-	mi, err := minerModule.New(txndstr.Wrap(ds, "index/miner"), clientBuilder, fchost, mm, conf.IndexMinersRefreshOnStart, conf.DisableIndices)
+
+	log.Info("Starting miner index...")
+	minerIdxConf := minerIndex.Config{
+		RefreshOnStart:     conf.IndexMinersRefreshOnStart,
+		Disable:            conf.DisableIndices,
+		OnChainMaxParallel: conf.IndexMinersOnChainMaxParallel,
+		OnChainFrequency:   conf.IndexMinersOnChainFrequency,
+	}


To avoid constructor parameter explosion, this introduces a config struct.

jsign · 2021-03-25T13:39:15Z

api/server/server.go

-		return nil, fmt.Errorf("opening badger datastore: %s", err)
-	}
-	return ds, nil
+	return measure.New("powergate.datastore", ds), nil


Here we now return a wrapped datastore (whether Badger or MongoDB backed) that intercepts go-datastore calls and produces counters and histograms for each operation. These are collected by OpenTelemetry and exposed through a Prometheus endpoint.
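
For readers unfamiliar with the wrapping pattern, here is a minimal stdlib-only sketch of the idea. It doesn't use go-ds-measure or the real go-datastore interfaces; `kvStore`, `memStore` and `measured` are hypothetical names, and the "metrics" are just in-memory call counts and total latencies.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// kvStore is a hypothetical, minimal stand-in for a datastore interface.
type kvStore interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, error)
}

// memStore is a trivial in-memory kvStore used as the wrapped backend.
type memStore struct{ data map[string][]byte }

func (m *memStore) Put(k string, v []byte) error { m.data[k] = v; return nil }

func (m *memStore) Get(k string) ([]byte, error) {
	v, ok := m.data[k]
	if !ok {
		return nil, fmt.Errorf("key %q not found", k)
	}
	return v, nil
}

// measured wraps any kvStore and records, per operation, how many calls
// were made and how much time they took in total.
type measured struct {
	inner kvStore
	mu    sync.Mutex
	calls map[string]int
	total map[string]time.Duration
}

func newMeasured(inner kvStore) *measured {
	return &measured{inner: inner, calls: map[string]int{}, total: map[string]time.Duration{}}
}

// record is deferred by each operation; start is captured when defer is evaluated.
func (m *measured) record(op string, start time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.calls[op]++
	m.total[op] += time.Since(start)
}

func (m *measured) Put(k string, v []byte) error {
	defer m.record("put", time.Now())
	return m.inner.Put(k, v)
}

func (m *measured) Get(k string) ([]byte, error) {
	defer m.record("get", time.Now())
	return m.inner.Get(k)
}

// Calls returns how many times op was invoked, including failed calls.
func (m *measured) Calls(op string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.calls[op]
}

func main() {
	ds := newMeasured(&memStore{data: map[string][]byte{}})
	_ = ds.Put("/a", []byte("1"))
	_, _ = ds.Get("/a")
	_, _ = ds.Get("/missing")
	fmt.Println("put calls:", ds.Calls("put"), "get calls:", ds.Calls("get"))
}
```

Because `measured` satisfies the same interface as the store it wraps, callers don't change; that's the property that lets one wrapper cover both backends.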

jsign · 2021-03-25T14:03:18Z

index/miner/lotusidx/store/store.go

+}
+
+// SaveOnChain creates/updates on-chain information of miners.
+func (s *Store) SaveOnChain(ctx context.Context, index miner.ChainIndex) error {


The miner index has a lot of miners. I used the datastore.Batching interface to persist things in batches, since single updates were too slow.

jsign · 2021-03-25T14:03:38Z

index/miner/lotusidx/store/store.go

+		if i%1000 == 0 {
+			if err := b.Commit(); err != nil {
+				return fmt.Errorf("committing batch: %s", err)
+			}
+			b, err = s.ds.Batch()
+			if err != nil {
+				return fmt.Errorf("creating batch: %s", err)
+			}
+		}


Every 1000 miner index entries, we commit the batch and start a new one.
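
A minimal sketch of the same commit-every-N pattern, written against a hypothetical in-memory batch type rather than the real go-datastore API. This variant commits after each full batch of puts and then commits whatever is left at the end; it isn't a copy of the store code in this PR.

```go
package main

import "fmt"

// countingStore is a hypothetical store that records the size of every
// committed batch, so the batching behavior is easy to inspect.
type countingStore struct{ committed []int }

// batch accumulates puts until Commit is called.
type batch struct {
	store *countingStore
	n     int
}

func (s *countingStore) Batch() *batch { return &batch{store: s} }

func (b *batch) Put(key string) { b.n++ }

func (b *batch) Commit() error {
	b.store.committed = append(b.store.committed, b.n)
	return nil
}

// saveInBatches writes keys in batches of batchSize entries and commits
// any trailing partial batch at the end.
func saveInBatches(s *countingStore, keys []string, batchSize int) error {
	if batchSize < 1 {
		return fmt.Errorf("batch size must be >= 1, got %d", batchSize)
	}
	b := s.Batch()
	for i, k := range keys {
		b.Put(k)
		if (i+1)%batchSize == 0 {
			if err := b.Commit(); err != nil {
				return fmt.Errorf("committing batch: %w", err)
			}
			b = s.Batch()
		}
	}
	// Commit the remainder, if the last batch was not a full one.
	if b.n > 0 {
		if err := b.Commit(); err != nil {
			return fmt.Errorf("committing final batch: %w", err)
		}
	}
	return nil
}

func main() {
	keys := make([]string, 2500)
	for i := range keys {
		keys[i] = fmt.Sprintf("/miner/%d", i)
	}
	s := &countingStore{}
	if err := saveInBatches(s, keys, 1000); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s.committed)
}
```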

jsign · 2021-03-25T14:05:10Z

lotus/metrics.go

@@ -7,19 +7,11 @@ import (
 	"time"


Migrating some old metrics about Lotus health from OpenCensus to OpenTelemetry, plus adding others to track sync lag. (See screenshot from Grafana dashboard in Lotus section).

jsign · 2021-03-25T14:06:15Z

migration/migration5.go

+	Run: func(ds datastoreReaderWriter) error {
+		q := query.Query{Prefix: "/index/miner/chainstore"}
+		res, err := ds.Query(q)
+		if err != nil {
+			return fmt.Errorf("querying records: %s", err)
+		}
+		defer func() { _ = res.Close() }()
+
+		var count int
+		for v := range res.Next() {
+			if err := ds.Delete(datastore.NewKey(v.Key)); err != nil {
+				return fmt.Errorf("deleting miner chainstore key: %s", err)
+			}
+			count++
+		}
+		log.Infof("deleted %d chainstore keys", count)
+
+		return nil
+	},


The new miner-index store saves its data in a new namespace. The only thing this migration does is remove the old persisted miner-index data. If it weren't removed, it would stay there forever. It isn't really that big, but I'm doing it to keep things clean.
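
To show the effect of the migration in isolation, here is a minimal sketch that uses a plain map instead of a datastore. The `/index/miner/chainstore` prefix comes from the diff above; the key under `/index/miner/new` is a made-up placeholder for the new namespace, and plain string prefix matching is a simplification of a datastore prefix query.

```go
package main

import (
	"fmt"
	"strings"
)

// deleteByPrefix removes every key that starts with prefix and returns how
// many keys were deleted. Deleting from a map while ranging over it is safe in Go.
func deleteByPrefix(ds map[string][]byte, prefix string) int {
	count := 0
	for k := range ds {
		if strings.HasPrefix(k, prefix) {
			delete(ds, k)
			count++
		}
	}
	return count
}

func main() {
	ds := map[string][]byte{
		"/index/miner/chainstore/1": []byte("old snapshot"),
		"/index/miner/chainstore/2": []byte("old snapshot"),
		"/index/miner/new/f01234":   []byte("per-miner entry"), // hypothetical new namespace
	}
	n := deleteByPrefix(ds, "/index/miner/chainstore")
	fmt.Printf("deleted %d chainstore keys, %d keys left\n", n, len(ds))
}
```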

jsign · 2021-03-25T14:07:10Z

wallet/lotuswallet/wallet.go

@@ -18,6 +18,7 @@ import (
 	logger "github.com/ipfs/go-log/v2"


Added some metrics in the Wallet Module to have some counter of how many new addresses and transfers are happening. This would show interesting info for Powergate used by Hub, since we create a new wallet-address per new Account/User. It will be cool to see.

Yea very interesting.

eightysteele · 2021-03-25T16:26:09Z

Any go-datastore implementation is wrapped to a customized go-ds-measure to publish Prometheus metrics of Get, Put, Query, Commit, and other usual operations of go-datastore. We forked the original repo to add Txn based metrics.

So good!

Curious if any of the refactoring here makes it easier to eventually add in some rolling averages stuff (textileio/textile#519).

jsign · 2021-03-25T16:40:45Z

Any go-datastore implementation is wrapped to a customized go-ds-measure to publish Prometheus metrics of Get, Put, Query, Commit, and other usual operations of go-datastore. We forked the original repo to add Txn based metrics.

So good!

Curious if any of the refactoring here makes it easier to eventually add in some rolling averages stuff (textileio/textile#519).

The Hub Miner Index is built by the mindexd daemon. This was done to let the Hub decide its own path on what the Hub Miner Index should look like. (It also gathers deal-making information from all our hosted powergates, so it has more information than any single powergate. But that fact is unrelated to this feature).

That daemon already queries Powergate at regular intervals about the prices offered by miners, so it can do further aggregations (be it rolling averages or other things) on its own. It would need to keep a short history of the last ask prices, and do more calculation on tops or similar.
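
Just to illustrate what "keep a short history of the last ask prices" could look like, here is a minimal fixed-window rolling average. Nothing here comes from mindexd; `rollingAvg` and the window size are hypothetical.

```go
package main

import "fmt"

// rollingAvg keeps the last len(window) samples in a ring buffer and
// maintains their running sum so the average is O(1) per update.
type rollingAvg struct {
	window []float64
	next   int // index where the next sample is written
	filled int // number of valid samples currently stored
	sum    float64
}

// newRollingAvg creates a window holding at most size samples (minimum 1).
func newRollingAvg(size int) *rollingAvg {
	if size < 1 {
		size = 1
	}
	return &rollingAvg{window: make([]float64, size)}
}

// Add records a new sample and returns the average over the current window.
func (r *rollingAvg) Add(v float64) float64 {
	if r.filled == len(r.window) {
		// Window is full: drop the oldest sample before overwriting it.
		r.sum -= r.window[r.next]
	} else {
		r.filled++
	}
	r.window[r.next] = v
	r.sum += v
	r.next = (r.next + 1) % len(r.window)
	return r.sum / float64(r.filled)
}

func main() {
	avg := newRollingAvg(3)
	for _, ask := range []float64{10, 20, 30, 40} {
		fmt.Println(avg.Add(ask))
	}
}
```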

eightysteele · 2021-03-25T16:42:34Z

Gotcha, that makes sense 👍🏻 (thanks!)

asutula

Looks great. No issues at all, just some general questions for my own curiosity.

asutula · 2021-03-29T14:59:53Z

deals/module/deals.go

@@ -105,6 +106,16 @@ func (m *Module) Watch(ctx context.Context, proposal cid.Cid) (<-chan deals.Stor
 	go func() {
 		defer close(updates)

+		watcherUpdates := make(chan struct{}, 20)


Very cool. Glad we're going to try the push based mechanism, and will either have success or file some new lotus issues. Good outcome either way.

asutula · 2021-03-29T15:07:00Z

deals/module/retrieve.go

@@ -103,6 +103,9 @@ func (m *Module) retrieve(ctx context.Context, lapi *apistruct.FullNodeStruct, l
 	go func() {
 		defer lapiCls()
 		defer close(out)
+		m.metricRetrievalTracking.Add(ctx, 1)
+		defer m.metricRetrievalTracking.Add(ctx, -1)
+


I like how easy this api is to use. In the new services I've been creating, I've exposed this sort of metrics data through the api thinking that some external monitoring would poll the api. Is that even possible? Either way, this seems like a nicer solution.

Yeah, that could be possible. I'd vote to let the daemon itself publish the metrics instead of relying on an external component, just to make metrics a baked-in feature of the daemon.
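
The increment-then-deferred-decrement pattern from the diff above can be sketched without any metrics library. Here `tracker` is a hypothetical stand-in for an up/down counter instrument, implemented with `sync/atomic`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tracker is a hypothetical up/down counter for in-progress operations.
type tracker struct{ active int64 }

// Active returns how many operations are currently in progress.
func (t *tracker) Active() int64 { return atomic.LoadInt64(&t.active) }

// track runs work while counting it as in progress. The deferred decrement
// runs even if work panics, so the counter cannot leak upwards.
func (t *tracker) track(work func()) {
	atomic.AddInt64(&t.active, 1)
	defer atomic.AddInt64(&t.active, -1)
	work()
}

func main() {
	tr := &tracker{}
	tr.track(func() {
		fmt.Println("during:", tr.Active())
	})
	fmt.Println("after:", tr.Active())
}
```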

asutula · 2021-03-29T15:21:05Z

index/miner/lotusidx/metrics.go

+	mi.metricLock.Lock()
+	defer mi.metricLock.Unlock()
+	result.Observe(mi.onchainProgress, onchainSubindex)
+	result.Observe(mi.metaProgress, metaSubindex)


Awesome you can observe like this. Seems like magic, I guess it's probably simple since it's Go.

But I am really curious: when it comes to actually being notified that the observed value changed... how does that work? You update the observed value directly, not via some func wrapper, right?

This metric is defined as:

_ = metric.Must(meter).NewFloat64ValueObserver("powergate.index.ask.progress", ai.progressValueObserver, metric.WithDescription("Ask index refresh progress"), metric.WithUnit(unit.Dimensionless))

So what happens with the NewXXXXObserver style of metrics is that the OpenTelemetry library calls the callback (progressValueObserver) at defined intervals.
This has the benefit of not having to record every change of a quickly changing metric, and it avoids registering a ton of observations within the same timestamp.

In this case, since the update progress is a gauge, I let this metric be updated on every collection interval.

There are quite a few things to learn and keep improving. I think this is the best reference to understand how the different measurements work. I still need to give it a full read, but that's the best source of documentation for the metrics API.
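
To make the callback model a bit more tangible, here is a stripped-down sketch that uses no OpenTelemetry code at all. `observerRegistry`, `Register` and `Collect` are hypothetical names; the point is only that the value can change many times while the callback runs once per collection.

```go
package main

import (
	"fmt"
	"sync"
)

// observerRegistry is a hypothetical model of callback-based instruments:
// nothing is recorded when a value changes; callbacks run only on Collect.
type observerRegistry struct {
	mu        sync.Mutex
	callbacks map[string]func() float64
}

func newObserverRegistry() *observerRegistry {
	return &observerRegistry{callbacks: map[string]func() float64{}}
}

// Register associates a metric name with a callback that reports its current value.
func (r *observerRegistry) Register(name string, cb func() float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.callbacks[name] = cb
}

// Collect simulates one collection interval: it invokes every callback
// exactly once and returns the observed values.
func (r *observerRegistry) Collect() map[string]float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[string]float64, len(r.callbacks))
	for name, cb := range r.callbacks {
		out[name] = cb()
	}
	return out
}

// progress is the plain value the application updates directly, guarded by a lock.
type progress struct {
	mu    sync.Mutex
	value float64
	reads int
}

func (p *progress) Set(v float64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.value = v
}

// Observe is the callback handed to the registry; it also counts invocations.
func (p *progress) Observe() float64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.reads++
	return p.value
}

func main() {
	reg := newObserverRegistry()
	p := &progress{}
	reg.Register("index.progress", p.Observe)

	// Many quick updates between two collections...
	for _, v := range []float64{0.25, 0.5, 0.9} {
		p.Set(v)
	}
	// ...but only the latest value is observed, with a single callback call.
	fmt.Println(reg.Collect()["index.progress"], p.reads)
}
```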

asutula · 2021-03-29T16:01:32Z

wallet/lotuswallet/wallet.go

@@ -18,6 +18,7 @@ import (
 	logger "github.com/ipfs/go-log/v2"


Yea very interesting.

jsign added 19 commits March 25, 2021 09:13

module renaming

e3aa6e2

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

redo logic

ffb6aec

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

add migration to delete old keys

88bf959

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

opentelemetry kickoff

e59033d

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

add new flags

6acb832

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

more metrics & migration & ds-measure

dbfedbf

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

default histogram buckets & sort gateway

c1eb7b1

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

deal watcher

13bb192

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

.

9b67321

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

move to sjstore and add runtime

11ae43d

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

dw metrics

93d2c54

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

add record metrics

4ebf8bd

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

fix compiling

20a7522

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

log messages & fix counter

b9cf6c2

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

fix cls call

6839da3

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

rebase fixes

58cbc6e

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

change dependencies

d93168a

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

fix metrics initialization

d3519c3

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

update dependabot dependencies

cc8e2e7

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

jsign self-assigned this Mar 25, 2021

jsign force-pushed the jsign/chgmi branch from 1a4cadf to cc8e2e7 Compare March 25, 2021 13:16

jsign added feature rd-minor labels Mar 25, 2021

github-actions bot reviewed Mar 25, 2021

jsign added 3 commits March 25, 2021 10:37

lints

8e3db65

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

revert test config

cb710b8

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

rever other test config

3eb5f34

Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>

jsign changed the title ~~Opentelemetry application metrics & MongoDB fix for documents hard-limit & new deal-watcher~~ Opentelemetry application metrics & Miner Index fix for document max-size limit & new deal-watcher Mar 25, 2021

jsign changed the title ~~Opentelemetry application metrics & Miner Index fix for document max-size limit & new deal-watcher~~ OpenTelemetry metrics & Miner Index fix for document max-size limit & new deal-watcher Mar 25, 2021

jsign marked this pull request as ready for review March 25, 2021 14:38

jsign requested a review from asutula March 25, 2021 14:38

asutula approved these changes Mar 29, 2021

jsign merged commit 813687f into master Mar 29, 2021

jsign deleted the jsign/chgmi branch March 29, 2021 16:36
