Skip to content
This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Incorrect normalization causes panic #1811

Closed
shanson7 opened this issue Apr 30, 2020 · 7 comments · Fixed by #1849
Closed

Incorrect normalization causes panic #1811

shanson7 opened this issue Apr 30, 2020 · 7 comments · Fixed by #1849
Assignees
Labels
Milestone

Comments

@shanson7
Copy link
Collaborator

Describe the bug

Relevant stack:

/metrictank/expr/seriesaggregators.go:147 (0xbc58bd)
/metrictank/expr/func_aggregate.go:74 (0xb98a3d)
/metrictank/expr/func_aggregate.go:61 (0xb97fea)
/metrictank/expr/plan.go:309 (0xbc4dcb)
/metrictank/api/graphite.go:844 (0xc39bf4)
/metrictank/api/graphite.go:268 (0xc32d0e)

I was able to narrow the crash down to a query like sum(a, summarize(a, '10min')), where the lcm used to normalize is 600 (e.g. interval of a is a factor of 600). It seems that summarize adds in an extra datapoint at the beginning (pre-marking, as graphite does) so it has an extra datapoint after normalization.

It seems that perhaps summarize should be following almost identical logic to normalizeTo (with some exception for the alignToFrom case). This would result in postmarked result, but would be compatible for all later normalization.

@Dieterbe
Copy link
Contributor

Dieterbe commented Jun 8, 2020

@shanson7 what version was this? please always mention the version

@shanson7
Copy link
Collaborator Author

shanson7 commented Jun 8, 2020

Lately it's been a (possibly significant, based on the number of outstanding PRs) modified version of master at the time of posting. I believe this one is from this BB release branch https://github.com/bloomberg/metrictank/tree/release_20200429 which boils down to this commit 75884f2

@Dieterbe
Copy link
Contributor

Dieterbe commented Jun 8, 2020

I was able to briefly reproduce this with v0.13.1-800-g74ba401c

i launched a fresh docker-dev and ran this in grafana:

#A : stats.docker-env.response.200 (hidden)
#B: sum(#A, summarize(#A, '10min'))

stackdump below. strangely after a few minutes it started working

<html> <head><title>PANIC: runtime error: index out of range [1] with length 1</title> <meta charset="utf-8" /> <style type="text/css"> html, body { font-family: "Roboto", sans-serif; color: #333333; background-color: #ea5343; margin: 0px; } h1 { color: #d04526; background-color: #ffffff; padding: 20px; border-bottom: 1px dashed #2b3848; } pre { margin: 20px; padding: 20px; border: 2px solid #2b3848; background-color: #ffffff; white-space: pre-wrap; /* css-3 */ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ white-space: -pre-wrap; /* Opera 4-6 */ white-space: -o-pre-wrap; /* Opera 7 */ word-wrap: break-word; /* Internet Explorer 5.5+ */ } </style> </head>
<body>
 <h1>PANIC</h1> <pre style="font-weight: bold;">runtime error: index out of range [1] with length 1</pre> <pre>/usr/lib/go/src/runtime/panic.go:88 (0x4326e2) 
$GOPATH/src/github.com/grafana/metrictank/expr/seriesaggregators.go:147 (0xbaf325) 
$GOPATH/src/github.com/grafana/metrictank/expr/func_aggregate.go:74 (0xb8540d) 
$GOPATH/src/github.com/grafana/metrictank/expr/func_aggregate.go:61 (0xb84f5d) 
$GOPATH/src/github.com/grafana/metrictank/expr/plan.go:309 (0xbae7f7) 
$GOPATH/src/github.com/grafana/metrictank/api/graphite.go:851 (0xc20597) 
$GOPATH/src/github.com/grafana/metrictank/api/graphite.go:270 (0xc19773)
/usr/lib/go/src/reflect/value.go:460 (0x4b6eca)
/usr/lib/go/src/reflect/value.go:321 (0x4b63d3) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:177 (0xab1dab) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:137 (0xab1789) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xacc5e8) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:112 (0xc07691) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/raintank/gziper/gzip.go:100 (0xc07684) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:79 (0xacc490) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:157 (0xab1ad7) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:135 (0xab1878) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xacc5e8) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:112 (0xadd0f5) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/recovery.go:161 (0xadd0e8) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/logger.go:40 (0xad0053) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:157 (0xab1ad7) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:135 (0xab1878) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xacc5e8) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:112 (0xb31520) 
$GOPATH/src/github.com/grafana/metrictank/api/middleware/logger.go:45 (0xb31509)
/usr/lib/go/src/reflect/value.go:460 (0x4b6eca)
/usr/lib/go/src/reflect/value.go:321 (0x4b63d3) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:177 (0xab1dab) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:137 (0xab1789) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xacc5e8) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:112 (0xb32dab) 
$GOPATH/src/github.com/grafana/metrictank/api/middleware/tracer.go:70 (0xb32d96) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:79 (0xacc490) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:157 (0xab1ad7) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:135 (0xab1878) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xacc5e8) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:112 (0xb3248d) 
$GOPATH/src/github.com/grafana/metrictank/api/middleware/stats.go:76 (0xb32478) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:79 (0xacc490) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:157 (0xab1ad7) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:135 (0xab1878) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xacc5e8) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:112 (0xadd0f5) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/recovery.go:161 (0xadd0e8) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/logger.go:40 (0xad0053) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:157 (0xab1ad7) 
$GOPATH/src/github.com/grafana/metrictank/vendor/github.com/go-macaron/inject/inject.go:135 (0xab1878) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/context.go:121 (0xacc5e8) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/router.go:187 (0xade306) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/router.go:294 (0xad876d) 
$GOPATH/src/github.com/grafana/metrictank/vendor/gopkg.in/macaron.v1/macaron.go:220 (0xad138a)
/usr/lib/go/src/net/http/server.go:2807 (0x742792)
/usr/lib/go/src/net/http/server.go:1895 (0x73e20b)
/usr/lib/go/src/runtime/asm_amd64.s:1373 (0x467780) </pre> 
</body> </html>"
message:"Internal Server Error"

@Dieterbe
Copy link
Contributor

Yeah I'll update this soon with more info, but I think the gist of it is that summarize does not return series in canonical form, whereas basically everywhere else in the code we do, and we assume that we have canonical series everywhere.

@Dieterbe
Copy link
Contributor

Dieterbe commented Jun 10, 2020

If a client requests a cross series aggregation between a canonical series and one that is not canonical because it had alignToFrom, I think we may want to just return "http 400 bad request"

in practice (e.g. inside of sumSeries) we may not know why a series is not canonical, so perhaps we should first log all these cases, then review that indeed they're only caused by queries with summmarize(....,true) and then enable the rejections

@Dieterbe
Copy link
Contributor

Note that summarize(alignToFrom=true) does something truly exceptional in terms of the kind of series it creates (and timestamps it sets), it looks very different from any other series generated anywhere else (fetching, normalizing, and any other function call all result in canonical series)

So we could add a "plan validator": after constructing the AST we can detect if a query wants to run summarize(alignToFrom=true) and then somehow combine that result with other -canonical- series by means of aggregate(), groupByTags, sumSeries, asPercent, etc etc.
But frankly, I rather just discourage or deprecate that param, or - since it seems so rarely used - we can leave it to panic the query with a non-optimal message.

The more interesting case is alignToFrom=false.

This would result in postmarked result, but would be compatible for all later normalization.

Switching wholesale to postmarking is a non-trivial departure from how graphite does things, and wouldn't really solve the problem: it would still create an excess datapoint, except this time it would lie at the end (and have a ts > from), also resulting in a non-canonical series. And plotting it in Grafana would create a shifted graph because between graphite and metrictank, all points would be moved over by interval.

Because the current implementation returns a first point that lies before from (!) and actually is not plotted by Grafana anyway, I think we can simply stop returning that first point. Visually in Grafana, it would look exactly the same, though json api users might notice the first point being removed. But it seems much less impact this way.

@Dieterbe
Copy link
Contributor

Dieterbe commented Jun 15, 2020

To demonstrate, I ran build/mt-fakemetrics backfill --metricname fakemetrics.sine.max --value-policy daily-sine:4000,15h,1000 --mpo 1 --period 10s --speedup $((7*3600*24)) --offset $((366*24))h --kafka-mdm-addr localhost:9092 in docker/docker-dev-custom-cfg-kafka
and ran this query:

screen

summarize() returns:

metrictank_1    | (models.Series) {
metrictank_1    |  Target: (string) (len=46) "summarize(fakemetrics.sine.max, \"1min\", \"max\")",
metrictank_1    |  Tags: (map[string]string) (len=3) {
metrictank_1    |   (string) (len=9) "summarize": (string) (len=4) "1min",
metrictank_1    |   (string) (len=17) "summarizeFunction": (string) (len=3) "max",
metrictank_1    |   (string) (len=4) "name": (string) (len=20) "fakemetrics.sine.max"
metrictank_1    |  },
metrictank_1    |  Interval: (uint32) 60,
metrictank_1    |  QueryPatt: (string) (len=46) "summarize(fakemetrics.sine.max, \"1min\", \"max\")",
metrictank_1    |  QueryFrom: (uint32) 1592217212,
metrictank_1    |  QueryTo: (uint32) 1592217572,
metrictank_1    |  QueryCons: (consolidation.Consolidator) NoneConsolidator,
metrictank_1    |  Consolidator: (consolidation.Consolidator) NoneConsolidator,
metrictank_1    |  QueryMDP: (uint32) 0,
metrictank_1    |  QueryPNGroup: (models.PNGroup) 0,
metrictank_1    |  Meta: (models.SeriesMeta) (len=1 cap=1) {
metrictank_1    |   (models.SeriesMetaProperties) {
metrictank_1    |    SchemaID: (uint16) 1,
metrictank_1    |    Archive: (uint8) 0,
metrictank_1    |    ArchInterval: (uint32) 10,
metrictank_1    |    AggNumNorm: (uint32) 1,
metrictank_1    |    AggNumRC: (uint32) 0,
metrictank_1    |    ConsolidatorNormFetch: (consolidation.Consolidator) MaximumConsolidator,
metrictank_1    |    ConsolidatorRC: (consolidation.Consolidator) NoneConsolidator,
metrictank_1    |    Count: (uint32) 2
metrictank_1    |   }
metrictank_1    |  },
metrictank_1    |  Datapoints: ([]schema.Point) (len=7 cap=2000) {
metrictank_1    |   (schema.Point) {
metrictank_1    |    Val: (float64) 3455.175078391423,
metrictank_1    |    Ts: (uint32) 1592217180
metrictank_1    |   },
metrictank_1    |   (schema.Point) {
metrictank_1    |    Val: (float64) 3622.6372240166997,
metrictank_1    |    Ts: (uint32) 1592217240
metrictank_1    |   },
metrictank_1    |   (schema.Point) {
metrictank_1    |    Val: (float64) 5036.329772566606,
metrictank_1    |    Ts: (uint32) 1592217300
metrictank_1    |   },
metrictank_1    |   (schema.Point) {
metrictank_1    |    Val: (float64) 4193.290146187953,
metrictank_1    |    Ts: (uint32) 1592217360
metrictank_1    |   },
metrictank_1    |   (schema.Point) {
metrictank_1    |    Val: (float64) 5660.382271043138,
metrictank_1    |    Ts: (uint32) 1592217420
metrictank_1    |   },
metrictank_1    |   (schema.Point) {
metrictank_1    |    Val: (float64) 5059.803464757296,
metrictank_1    |    Ts: (uint32) 1592217480
metrictank_1    |   },
metrictank_1    |   (schema.Point) {
metrictank_1    |    Val: (float64) 2827.522883078393,
metrictank_1    |    Ts: (uint32) 1592217540
metrictank_1    |   }
metrictank_1    |  }
metrictank_1    | }

canonical output would have first TS 1592217240 and last TS 1592217540, so 6 points in total.
Note that the first point with TS 1592217180 is not even shown in Grafana (same when you switch to "points" viz, btw)

Dieterbe added a commit that referenced this issue Jun 15, 2020
See #1811 for more details
Before, it was common for output to contain an extraneous point that
lies before 'from' (and wouldn't be rendered by Grafana in graph
panels), which made the output invalid.

This would result in trouble when

* combining such a series with another series
  (#1811)
* or when needing normalization
  (#1845)

This effectively slightly changes the output format, but is more robust.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants