Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not panic on AzureRM Errors #44

Open
jkroepke opened this issue Jun 26, 2023 · 2 comments
Open

Do not panic on AzureRM Errors #44

jkroepke opened this issue Jun 26, 2023 · 2 comments

Comments

@jkroepke
Copy link
Contributor

jkroepke commented Jun 26, 2023

Hi,

from time to time we are receiving error from Azure for cost management related errors.

Some of them are persistent and after five retries, the exporter will be panic and exited.

I would like to ask, if the behavior can be changed from panic level to error level. I do not need any benefit of letting the exporter terminate.

Example panic trace after 5 retries:

goroutine 1265 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0xc0001eac00?, {0x0?, 0x0?, 0xc00046e5c0?})
	/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0000fcd00, {0x0, 0x0, 0x0})
	/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
go.uber.org/zap.(*SugaredLogger).log(0xc0000b8258, 0x4, {0x0?, 0xc0000d1300?}, {0xc000265350?, 0xc0001d7600?, 0xc000375370?}, {0x0, 0x0, 0x0})
	/go/pkg/mod/go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
go.uber.org/zap.(*SugaredLogger).Panic(...)
	/go/pkg/mod/go.uber.org/zap@v1.24.0/sugar.go:153
main.(*MetricsCollectorAzureRmCosts).sendCostQuery(0x23?, {0x17649b0, 0xc000048048}, 0xc0000b8258, {0xc0000d1300, 0x33}, {0xc0001d7600, 0xc000375370, 0xc000375230, 0x0}, ...)
	/go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:495 +0x451
main.(*MetricsCollectorAzureRmCosts).collectCostManagementMetrics(0xc0001d6480, 0x11?, 0xc000118c40, {0xc0000d1300, 0x33}, {0x15d1413, 0xa}, 0xc000265ef0, {0xc00033b090, 0xb}, ...)
	/go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:370 +0x868
main.(*MetricsCollectorAzureRmCosts).collectRunCostQuery.func1(0xc0002bd9e0, 0xc00042fb48?)
	/go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:225 +0x1db
github.com/webdevops/go-common/azuresdk/armclient.(*SubscriptionsIterator).ForEach(0xc0000b8220?, 0xc00042fd20?, 0xc000265d30)
	/go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230513212717-8a2d16f8bb01/azuresdk/armclient/iterator.subscriptions.go:65 +0x1e2
main.(*MetricsCollectorAzureRmCosts).collectRunCostQuery(0xc0001d6480, 0xc000265ef0, 0xc0001c5e10?)
	/go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:223 +0x2ed
main.(*MetricsCollectorAzureRmCosts).Collect(0xc0001d6480, 0x1753340?)
	/go/src/github.com/webdevops/azure-resourcemanager-exporter/metrics_azurerm_costs.go:183 +0xe5
github.com/webdevops/go-common/prometheus/collector.(*Collector).collectRun.func1()
	/go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230513212717-8a2d16f8bb01/prometheus/collector/collector.go:380 +0x98
created by github.com/webdevops/go-common/prometheus/collector.(*Collector).collectRun
	/go/pkg/mod/github.com/webdevops/go-common@v0.0.0-20230513212717-8a2d16f8bb01/prometheus/collector/collector.go:351 +0x13b
@tesharp
Copy link

tesharp commented Sep 19, 2023

Seeing the same. Creates gaps in all metrics gathered. Would prefer if it just wrote an error

@HelenaSeidel
Copy link

HelenaSeidel commented Nov 23, 2023

The underlying issue is that panic() is used to handle errors, further up in the callstack it is attempted to "catch" those with recover() which does not work

Recover is a built-in function that regains control of a panicking goroutine. Recover is only useful inside deferred functions. During normal execution, a call to recover will return nil and have no other effect. If the current goroutine is panicking, a call to recover will capture the value given to panic and resume normal execution.
further details

This is basically what is being done here and putting it into the playground quickly shows that it fails.

package main

import (
	"fmt"
)

func foo() {
	panic("foobar")
}

func main() {
	foo()
	recover := recover()
	fmt.Println(recover)
}

The really BIG problem is that this incorrect treatment of errors is used for the back-off-retry mechanic in github.com/webdevops/go-common/prometheus/collector IMO this entire behavior should be broken

Currently I am working on putting a bandaid at least onto the cost collector to get rid of this issue. However an entire rework of error handling in (at least) this project, the azure-resourcegraph-exporter and go-common is required

Furthermore I don't understand why the prometheus go client library is not used more extensively here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants