Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OTel metrics #98

Merged
merged 16 commits into from
Aug 30, 2023
Merged

Add OTel metrics #98

merged 16 commits into from
Aug 30, 2023

Conversation

siburu
Copy link
Contributor

@siburu siburu commented Aug 21, 2023

No description provided.

@siburu siburu marked this pull request as ready for review August 21, 2023 15:11
@siburu siburu requested a review from a team as a code owner August 21, 2023 15:11
@siburu siburu requested a review from bluele August 21, 2023 15:11
core/metric.go Outdated
name := fmt.Sprintf("%s.processed_block_height", namespaceRoot)
callback := func(ctx context.Context, observer api.Int64Observer) error {
observer.Observe(metricSrcProcessedBlockHeight.Load(), api.WithAttributes(attribute.String("chain", "src")))
observer.Observe(metricDstProcessedBlockHeight.Load(), api.WithAttributes(attribute.String("chain", "dst")))
Copy link

@mizuochik mizuochik Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding attributes expressing more detailed context?

Being ambiguous with chain:src and chain:dst only. A relayer process treats multiple src/dest pairs.

We may add attributes like chain_id:xxx and direction:src|dest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core/metric.go Outdated

// create the instrument "relayer.processed_block_height"
name := fmt.Sprintf("%s.processed_block_height", namespaceRoot)
callback := func(ctx context.Context, observer api.Int64Observer) error {
Copy link

@mizuochik mizuochik Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: How about writing callbacks more directly as follows?

  1. Define func Observe(ctx context.Context, observer api.Observer) on SyncHeaders
  2. Make InitializeMetrics to take a SyncHeaders value and register SyncHeaders.Observe as callback

The advantage is avoiding global atomic variables metricXXX and reducing codes treat them.

Copy link
Contributor Author

@siburu siburu Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your good suggestion! However there is a reason why I can't adopt this soon.

An instance of SyncHeaders is created, used, and dropped for each relay cycle.
Therefore it is meaningless to observer SyncHeaders to obtain metrics.
Not only SyncHeaders but all other states in the relayer are also short-lived and not held longer than one relay cycle.
Therefore we have to add some "global variables" to store metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to say that we basically want to keep the core structure of the relayer stateless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed my big mistake ... when I tried to move metric.go to another directory.
metric.go can be referred to by any other components in the relayer.
Therefore, to avoid cyclic reference, metric.go itself should not depends on any types or functions in the relayer.

Copy link
Contributor Author

@siburu siburu Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct structure seems the following:

  • The metrics subsystem (metric.go) keeps:
    • The instance of MeterProvider
    • The instance of Meter
    • The instruments like Counters and Gauges
      • Since we want to easily list up all metrics exported by the relayer, I put them together.
  • The other components (e.g. relayer core, tendermint module, mock prover) keeps:
    • Original data needed to calculate final metrics values
    • Logics to calculate final metrics values from original data

Copy link

@mizuochik mizuochik Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining! I have a question and a comment. (These aren't fixing requests. I agree the direction implementing core logics stateless.)

Question

Are you saying that SyncHeader is not ALWAYS long-lived?

SyncHeader seems long-lived structure when a relayer starts in service mode.

https://github.com/siburu/yui-relayer/blob/e6c5f9eca4fbde03c04218149776f068e948bb1d/core/service.go#L13
https://github.com/siburu/yui-relayer/blob/e6c5f9eca4fbde03c04218149776f068e948bb1d/core/naive-strategy.go#L65

It's OK that metrics are availbable only running in service mode.

Comment

We can also avoid cyclic reference by defining an interface like type SubjectI interface { Observe() } at metrics.go. (like here)
Other types depend on SubjectI. metrics.go will keep independence from other go files/packages.

Copy link
Contributor Author

@siburu siburu Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mizuochik
Right. I was wrong about the lifespan of SyncHeader.

Please forget the 3 comments that start from #98 (comment).
My current position is #98 (comment).

I think all the gauges that I added are more suitable for synchronous gauge, but the current release of OTel-go doesn't support synchronous gauge yet.
Therefore I defined our own implementation of synchronous gauge in metrics/sync_gauge.go.
As a result, I didn't need to implement Observe in SyncHeader or NaiveStrategy.

I believe synchronous gauge will be supported by OTel-go in the near future.
open-telemetry/opentelemetry-specification#3540
At that time, we can easily replace metrics/sync_gauge.go with the official implementation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I agreed!

SyncGauge seems suitable to me as well.

core/metric.go Outdated

// create the instrument "relayer.backlog_size"
name = fmt.Sprintf("%s.backlog_size", namespaceRoot)
callback = func(ctx context.Context, observer api.Int64Observer) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Same as above.

We may move this callback and the backlog variables to NaiveStrategy's method and fields.

core/metric.go Outdated
name = fmt.Sprintf("%s.backlog_size", namespaceRoot)
callback = func(ctx context.Context, observer api.Int64Observer) error {
observer.Observe(metricSrcBacklogSize.Load(), api.WithAttributes(attribute.String("chain", "src")))
observer.Observe(metricDstBacklogSize.Load(), api.WithAttributes(attribute.String("chain", "dst")))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as processed_block_height.

core/metric.go Outdated

// create the instrument "relayer.backlog_oldest_height"
name = fmt.Sprintf("%s.backlog_oldest_height", namespaceRoot)
callback = func(ctx context.Context, observer api.Int64Observer) error {
Copy link

@mizuochik mizuochik Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Same as above.

This callback also able to be move to NaiveStrategy.

In summary, observable callbacks may be implemented as follows.

metrics.go
//
// Measurement subject
//
type SubjectI struct {
  Observe(ctx context.Context, observer metricapi.Observer) error
}

func InitializeMetrics(....., subjects SubjectI...) error {  // Taking measurement subjects (syncHeaders, naiveStrategy, etc...)
  // Initialize metric providers, instruments, etc...
  :

  // Register measurement subjects (syncHeaders, naiveStrategy, etc...)
  for _, subject := range subjects {
    if _, err := meter.RegisterCallback(
        subject.Observe,
        ProcessedBlockHeightGauge,
        BacklogSizeGauge,        
        BacklogOldestHeightGauge,
        .....  // rest observable instruments
      ); err != nil {
      panic(err)
    }
  }
  :
}
headers.go
func (sh *syncHeaders) Observe(ctx context.Context, observer api.Observer) error {
  :
  observer.Observe(ProcessedBlockHeightGauge, .....)
  :
}
naive-strategy.go
type NaiveStrategy struct {
    :
    // backlogs may have to be keep here
    srcBacklog  PacketInfoList
    destBacklog PacketInfoList
}

:

// The below callback taking generic `api.Observer` (not like `api.Int64Observer`)
// handles *multiple* observable gauges at once.
func (st *NaiveStrategy) Observe(ctx context.Context, observer api.Observer) error {
  :
  observer.Observe(BacklogSizeGauge, ...)
  :
  observer.Observe(BacklogOldestHeightGauge, ...)
  :
}

func (st NaiveStrategy) UnrelayedPackets(src, dst *ProvableChain, sh SyncHeaders, includeRelayedButUnfinalized bool) (*RelayPackets, error) {
  :
  ReceivePacketsFinalizedCounter.Add(1)  // This counter is non-observable, so increment in main relayer logics directly
  :
}

core/metric.go Outdated
instProcessedBlockHeight api.Int64ObservableGauge
instBacklogSize api.Int64ObservableGauge
instBacklogOldestHeight api.Int64ObservableGauge
instReceivePacketsFinalized api.Int64Counter
Copy link

@mizuochik mizuochik Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Naming xxxGauge and xxxCounter may be better.

Gauge and Counter are domain-specific terms around observability, so better express the intent of variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core/metric.go Outdated
instReceivePacketsFinalized.Add(ctx, 1, api.WithAttributes(attribute.String("chain", "dst")))
}
}
metricDstBacklog = newDstBacklog

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any QA or tests for the above logic?

core/metric.go Outdated
name = fmt.Sprintf("%s.backlog_oldest_height", namespaceRoot)
callback = func(ctx context.Context, observer api.Int64Observer) error {
observer.Observe(metricSrcBacklogOldestHeight.Load(), api.WithAttributes(attribute.String("chain", "src")))
observer.Observe(metricDstBacklogOldestHeight.Load(), api.WithAttributes(attribute.String("chain", "dst")))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as processed_block_height.

core/metric.go Outdated
go func() {
mux := http.NewServeMux()
mux.Handle("/metrics", promhttp.Handler())
if err := http.ListenAndServe(addr, mux); err != nil && err != http.ErrServerClosed {
Copy link

@mizuochik mizuochik Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the above handler stop, the entire process never stops. Metrics will be lost even if the process is alive.

We want relayer processes stopping when metrics become unavailable.
Cloud infrastructures like K8s heal those crashed processes by restarting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
… down

Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
@siburu siburu requested a review from mizuochik August 28, 2023 03:12
@mizuochik
Copy link

@siburu Thanks for fixing. I commented.
#98 (comment)

LGTM about the others!

Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
…rometheus-addr` field from the global config

Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
Signed-off-by: Masanori Yoshida <masanori.yoshida@datachain.jp>
meterProvider *metric.MeterProvider
meter api.Meter

ProcessedBlockHeightGauge *Int64SyncGauge
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siburu
It would be helpful to write a succinct description for each metric here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I noticed that they are written in the initialization of them.

Copy link
Contributor Author

@siburu siburu Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but I should move them to here! Thank you.

Copy link
Member

@bluele bluele left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM👍

@siburu siburu merged commit 10824d7 into hyperledger-labs:main Aug 30, 2023
5 checks passed
@siburu siburu deleted the add-otel-metrics branch October 16, 2023 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants