
Fix the scraper/discover manager coordination on the Prometheus receiver #2089

Merged: bogdandrutu merged 3 commits into open-telemetry:master from rakyll:prom on Nov 13, 2020

Conversation

@rakyll (Contributor) commented Nov 9, 2020

The receiver contains various unnecessary sections. Rewriting the
receiver's Start for better maintainability.

Related to #1909.

@rakyll requested a review from a team as a code owner November 9, 2020 00:51
codecov bot commented Nov 9, 2020

Codecov Report

Merging #2089 (883685b) into master (0275e52) will decrease coverage by 0.03%.
The diff coverage is 60.00%.


@@            Coverage Diff             @@
##           master    #2089      +/-   ##
==========================================
- Coverage   92.16%   92.13%   -0.04%     
==========================================
  Files         279      279              
  Lines       16980    16971       -9     
==========================================
- Hits        15650    15636      -14     
- Misses        910      915       +5     
  Partials      420      420              
Impacted Files Coverage Δ
receiver/prometheusreceiver/internal/ocastore.go 91.30% <ø> (-0.37%) ⬇️
service/internal/resources.go 25.33% <ø> (ø)
receiver/prometheusreceiver/metrics_receiver.go 68.75% <60.00%> (-8.75%) ⬇️
receiver/prometheusreceiver/internal/logger.go 81.63% <0.00%> (-4.09%) ⬇️
...eceiver/prometheusreceiver/internal/transaction.go 93.10% <0.00%> (-2.30%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 60 to 62
discoveryCtx := context.Background()
discoveryCtx, cancel := context.WithCancel(ctx)
r.cancelFunc = cancel
Reviewer (Member):

Line 60 can be removed; the discoveryCtx is always re-written, correct? (But see the other comment about why discoveryCtx should not inherit from the context passed to Start.)

Reviewer (Member):

This is a very interesting pattern that I have observed in a lot of places (maybe you can help me understand better how we can avoid this "possible" mistake).

Extending the incoming context from the Start func is, I think, wrong when the context is saved for "async" work. In this case you pass discoveryCtx to the discovery manager and save a reference to it, then check whether the context is cancelled to know when to stop.

Here is why I think it is wrong:

  • The caller of "Start" may add a "deadline" to the context to ensure that the components start within a certain amount of time, or may define its own cancellation on the context (for whatever reason).
  • If you extend from that context, you inherit that cancellation. Taking the deadline example: the extended context will be cancelled (WithTimeout suggests calling the cancellation func when done), and the caller considers itself done when you return from Start, so the extended context discoveryCtx will be closed almost immediately after Start ends.

I usually try to tell users that when you schedule "async" work you cannot inherit from a "sync" context, because the caller may cancel that context as soon as you return (or you need a trick like the noCancellationContext type I added, where you inherit everything but the cancellation).

This is a long comment, but I hope the problem is clear. We had exactly this in #1813: during the execution of a gRPC request we scheduled async work in queue_retry and returned; when we returned, gRPC called the cancel func from its timeout, and the saved context was "cancelled", so all the subsequent export calls failed because we were passing an already-done context.
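
For reference, a minimal sketch of what such a wrapper can look like (illustrative only; the collector's actual type lives in the code linked from the comment above). It embeds the parent context so value lookups pass through, but reports no deadline, no done channel, and no error, so work scheduled with it is not stopped by the caller's cancellation.

package internal

import (
  "context"
  "time"
)

// noCancellationContext forwards value lookups to the embedded parent context
// but hides its deadline and cancellation, so "async" work scheduled with it
// is not stopped when the caller's context is cancelled.
type noCancellationContext struct {
  context.Context
}

func (noCancellationContext) Deadline() (deadline time.Time, ok bool) { return }
func (noCancellationContext) Done() <-chan struct{}                   { return nil }
func (noCancellationContext) Err() error                              { return nil }

Wrapping the Start context as noCancellationContext{Context: ctx} keeps any request-scoped values while dropping the cancellation.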

@rakyll (Contributor, author), Nov 12, 2020:

I'm not familiar with the lifecycle of the components, so I looked at the Host interface, and it recommends starting a new background context:

	// If the component needs to perform a long-running starting operation then it is recommended
	// that Start() returns quickly and the long-running operation is performed in background.
	// In that case make sure that the long-running operation does not use the context passed
	// to Start() function since that context will be cancelled soon and can abort the long-running operation.
	// Create a new context from the context.Background() for long-running operations.
	Start(ctx context.Context, host Host) error

Do we expect components to handle the cancellation/timeout signals in Start? If yes, we should mention it in the interface. If you clarify what the components should do with the context, I can try to make modifications. Unfortunately, the Prometheus APIs don't allow us to pass different contexts to discovery.NewManager and discoveryManager.Run, and that's the real problem here.

In an ideal situation, this code would be written like the following, but the Prometheus APIs don't allow it:

func (r *pReceiver) Start(ctx context.Context, host component.Host) error {
  discoveryManager := discovery.NewManager(ctx, logger)

  // ...

  discoveryRunCtx := context.Background()
  discoveryRunCtx, cancel := context.WithCancel(ctx)
  discoveryManager.Run(discoveryRunCtx)
}

Splitting the cancellation and timeout sounds like a hack to get around the limitations of the APIs. Am I right?

@rakyll (Contributor, author):

Btw, I removed the discoveryCtx := context.Background() line because it seems like ctx being cancelled is the only signal that startup timed out or that something went wrong in the starting process, and it's OK to cancel the discovery manager in both of those cases. Let me know if this looks OK.

@bogdandrutu (Member), Nov 12, 2020:

@rakyll I think code will make this simpler to see:

// In the service: try to start all components within a maximum of 30 seconds.
func (srv *Service) startAllComponents() {
  ctx, cancelFunc := context.WithTimeout(context.Background(), 30*time.Second)
  defer cancelFunc()
  for i := range srv.components {
    srv.components[i].Start(ctx, srv.host)
  }
}

func (r *pReceiver) Start(ctx context.Context, host component.Host) error {
  discoveryCtx, cancel := context.WithCancel(ctx)
  r.cancelFunc = cancel
  // Manager will run until discoveryCtx is cancelled.
  discoveryManager := discovery.NewManager(discoveryCtx, logger)
  ...
}

As you can see, discoveryCtx will be cancelled when startAllComponents ends, way before the service shuts down.

If you want, you can write a test that does exactly what I wrote above, and you will see that the manager is cancelled earlier than you expect, before Shutdown.
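
For illustration, a standalone sketch of that experiment (hypothetical names, not an actual collector test): the context derived inside Start is already done once the caller's cancel func runs.

package main

import (
  "context"
  "fmt"
  "time"
)

// start stands in for a receiver's Start that derives its long-running
// context from the caller's context (the pattern discussed above).
func start(ctx context.Context) (context.Context, context.CancelFunc) {
  return context.WithCancel(ctx)
}

func main() {
  // The "service" gives components 30 seconds to start, then releases the context.
  ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
  derived, derivedCancel := start(ctx)
  defer derivedCancel()
  cancel() // startAllComponents returns and its deferred cancelFunc runs

  select {
  case <-derived.Done():
    fmt.Println("derived context already cancelled:", derived.Err()) // prints "context canceled"
  case <-time.After(100 * time.Millisecond):
    fmt.Println("derived context still alive")
  }
}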

@bogdandrutu (Member), Nov 12, 2020:

> In an ideal situation, this code would be written like the following, but the Prometheus APIs don't allow it:
>
> func (r *pReceiver) Start(ctx context.Context, host component.Host) error {
>   discoveryManager := discovery.NewManager(ctx, logger)
>
>   // ...
>
>   discoveryRunCtx := context.Background()
>   discoveryRunCtx, cancel := context.WithCancel(ctx)
>   discoveryManager.Run(discoveryRunCtx)
> }

Yes, the context passed to Run should be decoupled from the context passed to Start. You are right.
But because you cannot pass a different context to Run, you have to do discoveryCtx, cancel := context.WithCancel(context.Background()), otherwise you have the problem I am describing.

@rakyll (Contributor, author):

Switched back to the background context.
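
A sketch of what "switching back to the background context" looks like here (illustrative field and helper names, not the exact merged diff): the discovery context is derived from context.Background() so it does not inherit the Start context's cancellation, and Shutdown cancels it explicitly.

func (r *pReceiver) Start(_ context.Context, host component.Host) error {
  // Not derived from the Start context, so it survives after Start returns.
  discoveryCtx, cancel := context.WithCancel(context.Background())
  r.cancelFunc = cancel

  discoveryManager := discovery.NewManager(discoveryCtx, r.logger)
  // ... apply the scrape/discovery configs, start the scrape manager ...
  go func() {
    if err := discoveryManager.Run(); err != nil {
      host.ReportFatalError(err)
    }
  }()
  return nil
}

func (r *pReceiver) Shutdown(context.Context) error {
  // Cancelling discoveryCtx stops the discovery manager started above.
  r.cancelFunc()
  return nil
}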

@rakyll (Contributor, author) commented Nov 12, 2020:

The conflict is fixed.

@bogdandrutu (Member):

@rakyll Please fix the lint error:

receiver/prometheusreceiver/internal/ocastore.go:62:2: field once is unused (U1000)

@rakyll (Contributor, author) commented Nov 13, 2020:

Done.

@bogdandrutu merged commit 3e3b201 into open-telemetry:master on Nov 13, 2020
@rakyll deleted the prom branch on November 13, 2020 16:43
Troels51 pushed a commit to Troels51/opentelemetry-collector that referenced this pull request Jul 5, 2024