
Fix the scraper/discover manager coordination on the Prometheus receiver #2089

Merged: bogdandrutu merged 3 commits into open-telemetry:master from rakyll:prom on Nov 13, 2020

Conversation

@rakyll (Contributor) commented Nov 9, 2020

The receiver contains various unnecessary sections. Rewriting the
receiver's Start for better maintainability.

Related to #1909.

@rakyll requested a review from a team as a code owner November 9, 2020 00:51
codecov bot commented Nov 9, 2020

Codecov Report

Merging #2089 (883685b) into master (0275e52) will decrease coverage by 0.03%.
The diff coverage is 60.00%.


@@            Coverage Diff             @@
##           master    #2089      +/-   ##
==========================================
- Coverage   92.16%   92.13%   -0.04%     
==========================================
  Files         279      279              
  Lines       16980    16971       -9     
==========================================
- Hits        15650    15636      -14     
- Misses        910      915       +5     
  Partials      420      420              
Impacted Files Coverage Δ
receiver/prometheusreceiver/internal/ocastore.go 91.30% <ø> (-0.37%) ⬇️
service/internal/resources.go 25.33% <ø> (ø)
receiver/prometheusreceiver/metrics_receiver.go 68.75% <60.00%> (-8.75%) ⬇️
receiver/prometheusreceiver/internal/logger.go 81.63% <0.00%> (-4.09%) ⬇️
...eceiver/prometheusreceiver/internal/transaction.go 93.10% <0.00%> (-2.30%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 60 to 62
discoveryCtx := context.Background()
discoveryCtx, cancel := context.WithCancel(ctx)
r.cancelFunc = cancel
Reviewer (Member):

Line 60 can be removed; the discoveryCtx is always re-written, correct? (But see the other comment about why discoveryCtx should not inherit from the context passed to Start.)

Reviewer (Member):

This is a very interesting pattern that I have observed in a lot of places (maybe you can help me understand better how we can avoid this "possible" mistake).

Extending the incoming context from the Start func is, I think, wrong when the context is saved for "async" work. In this case you pass discoveryCtx to the discovery manager and save a reference to it, then check whether the context is cancelled to know when to stop.

Here is why I think it is wrong:

  • The caller of "Start" may add a "deadline" to the context to ensure that the components start within a certain amount of time, or may define its own cancellation on the context (for whatever reason).
  • If you extend from that context, you inherit that cancellation. Taking the deadline example: the extended context will be cancelled (WithTimeout suggests calling the cancellation func when done), and the caller considers itself done when you return from Start, so the extended context discoveryCtx will be closed almost immediately after Start ends.

I usually try to tell users that when you schedule "async" work you cannot inherit from a "sync" context, because the caller may cancel that context as soon as you return (or you need a trick like the noCancellationContext type I added, where you inherit everything but the cancellation).

This is a long comment, but I hope the problem is clear. We had exactly this in #1813: during the execution of a gRPC request we scheduled async work in queue_retry and returned; when we returned, gRPC called the cancel func from its timeout, and the saved context was "cancelled", so all the subsequent export calls failed because we were passing an already-done context.
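
For reference, a minimal sketch of what such a wrapper can look like (illustrative only; the collector's actual type lives in the code linked from the comment above). It embeds the parent context so value lookups pass through, but reports no deadline, no done channel, and no error, so work scheduled with it is not stopped by the caller's cancellation.

package internal

import (
  "context"
  "time"
)

// noCancellationContext forwards value lookups to the embedded parent context
// but hides its deadline and cancellation, so "async" work scheduled with it
// is not stopped when the caller's context is cancelled.
type noCancellationContext struct {
  context.Context
}

func (noCancellationContext) Deadline() (deadline time.Time, ok bool) { return }
func (noCancellationContext) Done() <-chan struct{}                   { return nil }
func (noCancellationContext) Err() error                              { return nil }

Wrapping the Start context as noCancellationContext{Context: ctx} keeps any request-scoped values while dropping the cancellation.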

@rakyll (Contributor, author), Nov 12, 2020:

I'm not familiar with the lifecycle of the components, so I looked at the Host interface, and it recommends starting a new background context:

	// If the component needs to perform a long-running starting operation then it is recommended
	// that Start() returns quickly and the long-running operation is performed in background.
	// In that case make sure that the long-running operation does not use the context passed
	// to Start() function since that context will be cancelled soon and can abort the long-running operation.
	// Create a new context from the context.Background() for long-running operations.
	Start(ctx context.Context, host Host) error

Do we expect components to handle the cancellation/timeout signals in Start? If yes, we should mention it in the interface. If you clarify what the components should do with the context, I can try to make modifications. Unfortunately, the Prometheus APIs don't allow us to pass different contexts to discovery.NewManager and discoveryManager.Run, and that's the real problem here.

In an ideal situation, this code would be written like the following, but the Prometheus APIs don't allow it:

func (r *pReceiver) Start(ctx context.Context, host component.Host) error {
  discoveryManager := discovery.NewManager(ctx, logger)

  // ...

  discoveryRunCtx := context.Background()
  discoveryRunCtx, cancel := context.WithCancel(ctx)
  discoveryManager.Run(discoveryRunCtx)
}

Splitting the cancellation and timeout sounds like a hack to get around the limitations of the APIs. Am I right?

@rakyll (Contributor, author):

Btw, I removed the discoveryCtx := context.Background() line because it seems like ctx being cancelled is the only signal that startup timed out or that something went wrong in the starting process, and it's OK to cancel the discovery manager in both of those cases. Let me know if this looks OK.

@bogdandrutu (Member), Nov 12, 2020:

@rakyll I think code will make this simpler to see:

// In the service: try to start all components within a maximum of 30 seconds.
func (srv *Service) startAllComponents() {
  ctx, cancelFunc := context.WithTimeout(context.Background(), 30*time.Second)
  defer cancelFunc()
  for i := range srv.components {
    srv.components[i].Start(ctx, srv.host)
  }
}

func (r *pReceiver) Start(ctx context.Context, host component.Host) error {
  discoveryCtx, cancel := context.WithCancel(ctx)
  r.cancelFunc = cancel
  // Manager will run until discoveryCtx is cancelled.
  discoveryManager := discovery.NewManager(discoveryCtx, logger)
  ...
}

As you can see, discoveryCtx will be cancelled when startAllComponents ends, way before the service shuts down.

If you want, you can write a test that does exactly what I wrote above, and you will see that the manager is cancelled earlier than you expect, before Shutdown.
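
For illustration, a standalone sketch of that experiment (hypothetical names, not an actual collector test): the context derived inside Start is already done once the caller's cancel func runs.

package main

import (
  "context"
  "fmt"
  "time"
)

// start stands in for a receiver's Start that derives its long-running
// context from the caller's context (the pattern discussed above).
func start(ctx context.Context) (context.Context, context.CancelFunc) {
  return context.WithCancel(ctx)
}

func main() {
  // The "service" gives components 30 seconds to start, then releases the context.
  ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
  derived, derivedCancel := start(ctx)
  defer derivedCancel()
  cancel() // startAllComponents returns and its deferred cancelFunc runs

  select {
  case <-derived.Done():
    fmt.Println("derived context already cancelled:", derived.Err()) // prints "context canceled"
  case <-time.After(100 * time.Millisecond):
    fmt.Println("derived context still alive")
  }
}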

@bogdandrutu (Member), Nov 12, 2020:

> In an ideal situation, this code would be written like the following, but the Prometheus APIs don't allow it:
>
> func (r *pReceiver) Start(ctx context.Context, host component.Host) error {
>   discoveryManager := discovery.NewManager(ctx, logger)
>
>   // ...
>
>   discoveryRunCtx := context.Background()
>   discoveryRunCtx, cancel := context.WithCancel(ctx)
>   discoveryManager.Run(discoveryRunCtx)
> }

Yes, the context passed to Run should be decoupled from the context passed to Start. You are right.
But because you cannot pass a different context to Run, you have to do discoveryCtx, cancel := context.WithCancel(context.Background()), otherwise you have the problem I am describing.

@rakyll (Contributor, author):

Switched back to the background context.
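
A sketch of what "switching back to the background context" looks like here (illustrative field and helper names, not the exact merged diff): the discovery context is derived from context.Background() so it does not inherit the Start context's cancellation, and Shutdown cancels it explicitly.

func (r *pReceiver) Start(_ context.Context, host component.Host) error {
  // Not derived from the Start context, so it survives after Start returns.
  discoveryCtx, cancel := context.WithCancel(context.Background())
  r.cancelFunc = cancel

  discoveryManager := discovery.NewManager(discoveryCtx, r.logger)
  // ... apply the scrape/discovery configs, start the scrape manager ...
  go func() {
    if err := discoveryManager.Run(); err != nil {
      host.ReportFatalError(err)
    }
  }()
  return nil
}

func (r *pReceiver) Shutdown(context.Context) error {
  // Cancelling discoveryCtx stops the discovery manager started above.
  r.cancelFunc()
  return nil
}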

@rakyll (Contributor, author) commented Nov 12, 2020:

The conflict is fixed.

@bogdandrutu (Member):

@rakyll Please fix the lint error:

receiver/prometheusreceiver/internal/ocastore.go:62:2: field once is unused (U1000)

@rakyll (Contributor, author) commented Nov 13, 2020:

Done.

@bogdandrutu merged commit 3e3b201 into open-telemetry:master on Nov 13, 2020
@rakyll deleted the prom branch on November 13, 2020 16:43
Troels51 pushed a commit to Troels51/opentelemetry-collector that referenced this pull request Jul 5, 2024