Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Exclude basic-authorship-proposer from the continuous tasks alert #7484

Merged
merged 3 commits into from
Nov 4, 2020

Conversation

tomaka
Copy link
Contributor

@tomaka tomaka commented Nov 3, 2020

#7328 introduced a new task named basic-authorship-proposer.
This task is spawned whenever a slot is ready for the node to create a slot, and ends once the block has been created.

Unfortunately in #7250 we made the assumption that a task that starts once, then ends once, then doesn't start again for 10 minutes, was actually probably supposed either to not end or to start again, and thus we raise an alert.

This means that if more than 10 minutes pass between the first time a node has a slot available to it and the second time a node has a slot available to it, an alert will be triggered.

In this PR I propose to special-case basic-authorship-proposer as a task that doesn't trigger this alert.

@tomaka tomaka added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Nov 3, 2020
@tomaka tomaka requested a review from mxinden November 3, 2020 17:28
Copy link
Contributor

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favor of excluding specific alerts instead of including specific ones which I initially proposed in #7250.

The worst that can happen when excluding tasks it that one gets a false positive alert. The worst that can happen when including specifc tasks is a missed alert. I favor the former.

@mxinden
Copy link
Contributor

mxinden commented Nov 3, 2020

From a higher level perspective I see this alert as a good hack, but not as a long term (/perfect) solution. In my eyes, when a long running essential task stops, the whole node should gracefully shutdown. That would eventually trigger the standard up Prometheus alert. What do you think?

@tomaka
Copy link
Contributor Author

tomaka commented Nov 3, 2020

In my eyes, when a long running essential task stops, the whole node should gracefully shutdown.

We do indeed have this concept of "essential task" in the code already, but it's not actually used except by GrandPa I believe.

@tomaka
Copy link
Contributor Author

tomaka commented Nov 3, 2020

bot merge

@ghost
Copy link

ghost commented Nov 3, 2020

Waiting for commit status.

@ghost
Copy link

ghost commented Nov 3, 2020

Checks failed; merge aborted.

@tomaka
Copy link
Contributor Author

tomaka commented Nov 4, 2020

bot merge

@ghost
Copy link

ghost commented Nov 4, 2020

Waiting for commit status.

@ghost
Copy link

ghost commented Nov 4, 2020

Checks failed; merge aborted.

@tomaka tomaka merged commit cae2d65 into paritytech:master Nov 4, 2020
@tomaka tomaka deleted the fix-alert-proposer branch November 4, 2020 13:01
@tomaka
Copy link
Contributor Author

tomaka commented Nov 4, 2020

CI was failing because of an ICE during the Rust code docs generation. Clearly not related to this PR.

@mxinden
Copy link
Contributor

mxinden commented Nov 4, 2020

I wonder whether CI still deploys the altered alerting rule to our Prometheus server, even on test failures. Let's check back in a couple of minutes.

@mxinden
Copy link
Contributor

mxinden commented Nov 4, 2020

For the record, the update is now deployed on our Prometheus server.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants