Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downtime: enforce positive duration #9773

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Al2Klimov
Copy link
Member

A negative duration would crash Icinga DB and doesn't make sense. Existing Downtimes with a so small duration disappear immediately - can't be broken.

refs #9774

@julianbrost Shall we v2.14ify it as a "breaking" change?

Before

curl -fvku root:123456 -X POST -H 'Accept: application/json' 'https://127.0.0.1:5665/v1/actions/schedule-downtime?type=Host&filter=true&start_time=14000&end_time=2000000000&author=ak&comment=-&fixed=0&duration=-14'

Icinga accepted Downtime with negative duration, but Icinga DB crashed.

2023-05-25T15:24:04.353+0200	FATAL	icingadb	strconv.ParseUint: parsing "-14000": invalid syntax
can't parse flexible_duration into the uint64 DowntimeHistory#FlexibleDuration: -14000
github.com/icinga/icingadb/pkg/structify.structifyMapByTree
	/Users/aklimov/NET/WS/icingadb/pkg/structify/structify.go:97
github.com/icinga/icingadb/pkg/structify.MakeMapStructifier.func1
	/Users/aklimov/NET/WS/icingadb/pkg/structify/structify.go:42
github.com/icinga/icingadb/pkg/icingadb/history.writeOneEntityStage.func1
	/Users/aklimov/NET/WS/icingadb/pkg/icingadb/history/sync.go:182
github.com/icinga/icingadb/pkg/icingadb/history.writeMultiEntityStage.func1.1
	/Users/aklimov/NET/WS/icingadb/pkg/icingadb/history/sync.go:219
golang.org/x/sync/errgroup.(*Group).Go.func1
	/Users/aklimov/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
runtime.goexit
	/usr/local/Cellar/go/1.20.4/libexec/src/runtime/asm_amd64.s:1598
can't structify map map[string]interface {}{"author":"ak", "comment":"-", "downtime_id":"53ca5045a24a3a423b50fae26d754390ffb833e7", "end_time":"1685020998143", "endpoint_id":"961a69437b549424e354d153055be28784709c21", "entry_time":"1685021012143", "environment_id":"0025bd7d33c169a93c04a5c91d1de041c722cbbf", "event_id":"e68edda979c22a639701e65d3899830ba66e693d", "event_type":"downtime_start", "flexible_duration":"-14000", "has_been_cancelled":"0", "host_id":"961a69437b549424e354d153055be28784709c21", "is_flexible":"1", "object_type":"host", "scheduled_end_time":"2000000000000", "scheduled_start_time":"14000000", "start_time":"1685021012143", "trigger_time":"1685021012143"} by tree []structify.structBranch{structify.structBranch{field:0, leaf:"", subTree:[]structify.structBranch{structify.structBranch{field:0, leaf:"downtime_id", subTree:[]structify.structBranch(nil)}}}, structify.structBranch{field:1, leaf:"", subTree:[]structify.structBranch{structify.structBranch{field:0, leaf:"environment_id", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:1, leaf:"endpoint_id", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:2, leaf:"object_type", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:3, leaf:"host_id", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:4, leaf:"service_id", subTree:[]structify.structBranch(nil)}}}, structify.structBranch{field:2, leaf:"", subTree:[]structify.structBranch{structify.structBranch{field:0, leaf:"cancelled_by", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:1, leaf:"has_been_cancelled", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:2, leaf:"cancel_time", subTree:[]structify.structBranch(nil)}}}, structify.structBranch{field:3, leaf:"triggered_by_id", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:4, leaf:"parent_id", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:5, leaf:"entry_time", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:6, leaf:"author", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:7, leaf:"comment", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:8, leaf:"is_flexible", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:9, leaf:"flexible_duration", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:10, leaf:"scheduled_start_time", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:11, leaf:"scheduled_end_time", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:12, leaf:"start_time", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:13, leaf:"end_time", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:14, leaf:"scheduled_by", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:15, leaf:"trigger_time", subTree:[]structify.structBranch(nil)}}
github.com/icinga/icingadb/pkg/structify.MakeMapStructifier.func1
	/Users/aklimov/NET/WS/icingadb/pkg/structify/structify.go:42
github.com/icinga/icingadb/pkg/icingadb/history.writeOneEntityStage.func1
	/Users/aklimov/NET/WS/icingadb/pkg/icingadb/history/sync.go:182
github.com/icinga/icingadb/pkg/icingadb/history.writeMultiEntityStage.func1.1
	/Users/aklimov/NET/WS/icingadb/pkg/icingadb/history/sync.go:219
golang.org/x/sync/errgroup.(*Group).Go.func1
	/Users/aklimov/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
runtime.goexit
	/usr/local/Cellar/go/1.20.4/libexec/src/runtime/asm_amd64.s:1598
can't structify values map[string]interface {}{"author":"ak", "comment":"-", "downtime_id":"53ca5045a24a3a423b50fae26d754390ffb833e7", "end_time":"1685020998143", "endpoint_id":"961a69437b549424e354d153055be28784709c21", "entry_time":"1685021012143", "environment_id":"0025bd7d33c169a93c04a5c91d1de041c722cbbf", "event_id":"e68edda979c22a639701e65d3899830ba66e693d", "event_type":"downtime_start", "flexible_duration":"-14000", "has_been_cancelled":"0", "host_id":"961a69437b549424e354d153055be28784709c21", "is_flexible":"1", "object_type":"host", "scheduled_end_time":"2000000000000", "scheduled_start_time":"14000000", "start_time":"1685021012143", "trigger_time":"1685021012143"}
github.com/icinga/icingadb/pkg/icingadb/history.writeOneEntityStage.func1
	/Users/aklimov/NET/WS/icingadb/pkg/icingadb/history/sync.go:184
github.com/icinga/icingadb/pkg/icingadb/history.writeMultiEntityStage.func1.1
	/Users/aklimov/NET/WS/icingadb/pkg/icingadb/history/sync.go:219
golang.org/x/sync/errgroup.(*Group).Go.func1
	/Users/aklimov/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57
runtime.goexit
	/usr/local/Cellar/go/1.20.4/libexec/src/runtime/asm_amd64.s:1598

After

Icinga prevents such:

> POST /v1/actions/schedule-downtime?type=Host&filter=true&start_time=14000&end_time=2000000000&author=ak&comment=-&fixed=0&duration=-14 HTTP/1.1
> Host: 127.0.0.1:5665
> Authorization: Basic cm9vdDoxMjM0NTY=
> User-Agent: curl/7.86.0
> Accept: application/json
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Server: Icinga/v2.13.0-720-gc7b5b2138
< Content-Type: application/json
< Content-Length: 108

@cla-bot cla-bot bot added the cla/signed label May 25, 2023
@icinga-probot icinga-probot bot added area/documentation End-user or developer help enhancement New feature or request labels May 25, 2023
@julianbrost
Copy link
Contributor

@julianbrost Shall we v2.14ify it as a "breaking" change?

What level of breaking are we talking about here? What happens with existing downtimes? Will they prevent Icinga from starting and require manual cleanup? Or will the magically disappear due to ignore_on_error?

Anyways, a flexible downtime with duration<=0 didn't do anything useful, did it? At best it would be treated as ending immediately and have little effect, at worst, it would even trigger some bugs I guess.

What about fixed downtimes? Could you create those in a way that persisted a negative duration? As fixed downtimes basically ignore the duration, these should work fine but this change would break them.

< HTTP/1.1 500 Internal Server Error

When trying to schedule such a downtime, this should result in some 4xx error (probably bad request).

Apart from that: ScheduledDowntime has the same field that should be handled in a consistent way.

A negative duration would crash Icinga DB and doesn't make sense. Existing
Downtimes with a so small duration disappear immediately - can't be broken.
@Al2Klimov
Copy link
Member Author

@julianbrost Shall we v2.14ify it as a "breaking" change?

What level of breaking are we talking about here? What happens with existing downtimes? Will they prevent Icinga from starting and require manual cleanup? Or will the magically disappear due to ignore_on_error?

Every problematic object will magically disappear in case of ignore_on_error. Otherwise yes, stuff will prevent Icinga from starting. Before this change everything created by API will disappear sooner or later permanently as Icinga can manage _api files. The only real problem are the ones in _etc the user created explicitly with a negative duration, but w/o ignore_on_error.

Anyways, a flexible downtime with duration<=0 didn't do anything useful, did it? At best it would be treated as ending immediately and have little effect, at worst, it would even trigger some bugs I guess.

Well, the only bug I know of:

What about fixed downtimes? Could you create those in a way that persisted a negative duration? As fixed downtimes basically ignore the duration, these should work fine but this change would break them.

Exactly.

< HTTP/1.1 500 Internal Server Error

When trying to schedule such a downtime, this should result in some 4xx error (probably bad request).

This is a general problem. E.g. start_time=-14000 already gets rejected this way.

@Al2Klimov Al2Klimov force-pushed the downtime-enforce-positive-duration branch from c7b5b21 to de543a8 Compare May 25, 2023 14:48
@julianbrost
Copy link
Contributor

What about fixed downtimes? Could you create those in a way that persisted a negative duration? As fixed downtimes basically ignore the duration, these should work fine but this change would break them.

Exactly.

So this change would remove downtimes that would work perfectly fine (apart from in Icinga DB) as an ignored attribute is set to a strange value? That doesn't sound good, those should survive.

@Al2Klimov
Copy link
Member Author

The user must explicitly set that strange value – what for? As you already said: it's pointless.

@Al2Klimov
Copy link
Member Author

Oh, and we have already broken less pointless things for less benefit:

@Al2Klimov Al2Klimov requested a review from julianbrost May 25, 2023 15:14
@julianbrost
Copy link
Contributor

for less benefit

Well that was an edge-case that broke the IDO, so it actually sounds quite similar to this issue here.

For that issue, I considered it quite unlikely to affect any real config as it required group names that can be represented as an integer (or another non-string type) and a real-world group name should probably contain some textual name.

Now for this issue, we have to guess how likely this happens in practice. In #9774, you can see that scheduled_start_time is also after scheduled_start_time, so that looks more like an accident during testing/developing some automation. Imagine some automation creating fixed downtimes with duration=-1, if you're still using the IDO, this would probably work perfectly fine but this PR would drop all your existing downtimes. It's for us to decide if we consider this acceptable or to be a problem. Part of answering that question should be whether implementing a more graceful fix would require an unreasonable amount of additional work.

@Al2Klimov
Copy link
Member Author

ther implementing a more graceful fix would require an unreasonable amount of additional work.

Our options are:

  • In Icinga DB make one struct field and its DB column signed
  • In Icinga do max(duration,0) stuff while dumping downtimes

I, however, prefer this solution. For me negative durations aren’t real-world config either. Such automation errors shall be 500/400-ed at early stage as I do here IMAO.

@Al2Klimov Al2Klimov added this to the 2.14.0 milestone May 26, 2023
@julianbrost
Copy link
Contributor

I, however, prefer this solution. For me negative durations aren’t real-world config either.

Shouldn't be real-world config but you wouldn't have noticed if you did this accidentally, because well, until now it's accepted.

Such automation errors shall be 500/400-ed at early stage as I do here IMAO.

There's definitely no reason to store duration < 0 for a fixed downtime. But that would probably also be enough: don't store it, i.e. just ignore the duration parameter for a fixed downtime (similar to how an extra parameter asdf is ignored) and check duration > 0 for new flexible downtimes.

The thing with enforcing this for all downtimes is that this will just delete existing downtimes. For flexible ones that might be acceptable because they wouldn't have done anything useful anyways as they'd end immediately. But fixed ones would have worked perfectly fine (apart from syncing to Icinga DB). So if someone had such existing downtimes, an upgrade would just delete them, the user wouldn't even see an error and have a chance to fix this and if they were important, they would have to be restored from a backup (if it exists 🙈).

We don't have a way to tell if this would affect someone, so I'd opt for the safe path, i.e. validate new downtimes where they are created and fix existing existing ones automatically (maybe even in {Get,Set}Duration() with some if (val < 0) { val = 0; } to prevent surprising results in any calculations based on the duration).

  • In Icinga DB make one struct field and its DB column signed

No, negative values are obviously unintended/broken here, so we should not propagate them (which would just result in Icinga DB Web having to deal with strange values instead).

@julianbrost
Copy link
Contributor

This PR somewhat evolved into #9775 and #9776, I'm not exactly sure what the plan with this PR is now.

@Al2Klimov Al2Klimov removed the request for review from julianbrost May 30, 2023 15:45
@Al2Klimov Al2Klimov marked this pull request as draft May 30, 2023 15:45
@Al2Klimov Al2Klimov self-assigned this May 30, 2023
@Al2Klimov Al2Klimov added core/quality Improve code, libraries, algorithms, inline docs and removed enhancement New feature or request labels May 30, 2023
@Al2Klimov Al2Klimov linked an issue May 30, 2023 that may be closed by this pull request
1 task
@Al2Klimov Al2Klimov mentioned this pull request Jun 1, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/documentation End-user or developer help cla/signed core/quality Improve code, libraries, algorithms, inline docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Config: reject Downtimes w/ negative duration or end before start
2 participants