Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/Bosun: enable group by interval in the influx() query #1304

Merged
merged 1 commit into from
Sep 24, 2015
Merged

cmd/Bosun: enable group by interval in the influx() query #1304

merged 1 commit into from
Sep 24, 2015

Conversation

giganteous
Copy link
Contributor

A first throw at fixing the inconsistencies. See bug #1298. I think I've fixed the tests. Fighting with travis now.

@gbrayut
Copy link
Contributor

gbrayut commented Sep 9, 2015

If you rebase this on origin/master travis CI should start working. There is a new version of Esc as well so you will need to run go get -u github.com/mjibson/esc if you need to make any changes to the static content.


t := make(parse.Tags, len(s.Dimensions))
for _, d := range s.Dimensions {
if _, ok := d.Expr.(*influxql.Call); ok { // && call.Name == "time" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the commented out code // && call.Name == "time" {

@gbrayut
Copy link
Contributor

gbrayut commented Sep 15, 2015

I left a note on a line that has some commented out code that should be removed.

Also the docs from #1298 (comment) should be commited into the /docs/expressions.md file as part of this PR, as we usually like to have doc changes in the same PR that the code is changed.

Other than that I tested this on my local instance and it looks good. The following query based on a counter matches what I would expect to get from OpenTSDB:

influx("opentsdb", '''SELECT non_negative_derivative(value) FROM "os.cpu" WHERE host =~ /gb.*/ GROUP BY host''', "30m", "", "1s")

Still not sure exactly how downsampling or fill values work for counters in Influx. If I change 1s to something larger the graph jumps above 100%, so I assume I would need to wrap the above query in another query if I wanted to get the equivalent of a 5m-avg downsample. I looks like Influx does this using something called Continuous Queries.

@gbrayut
Copy link
Contributor

gbrayut commented Sep 15, 2015

Found a github issue about using rates in influxdb. looks like the non_negative_derivative function can take an optional argument and aggregate function. So on my test system that sends scollector data to both influxdb and opentsdb the following queries create the same graphs:

influx("opentsdb", '''SELECT non_negative_derivative(mean(value),1s) FROM "os.cpu" GROUP BY host''', "30m", "", "2m")

q("sum:2m-avg:rate{counter,,1}:os.cpu{host=*}", "30m", "")

@giganteous
Copy link
Contributor Author

Thanks! I'll rebase the work, and fix the issues.

@gbrayut
Copy link
Contributor

gbrayut commented Sep 23, 2015

The docs look good here. Could probably use another rebase to pull in the new influx settings in the conf file, but otherwise LGTM.

Let me know if you are ready for this to be merged.

@giganteous
Copy link
Contributor Author

Rebased.

I'm running our proof of concept system with the rebased code, and our version of the os.diskspace check looks to be okay.

alert os.diskspace {
    macro = host_based
    $notes = This alert triggers when there are issues detected in disk capacity. Two methods are used. The first is a traditional percentage based threshold. This alert also uses a linear regression to attempt to predict the amount of time until we run out of space. Forecasting is a hard problem, in particular to apply generically so there is a lot of room for improvement here. But this is a start
    template = diskspace
    $db = "opentsdb"

    ##Forecast Section
    #Downsampling avg on opentsdb side will save the linear regression a lot of work
    $days_to_zero = (forecastlr(influx($db, '''SELECT min(value) FROM "os.disk.fs.percent_free" GROUP BY disk, host''', "7d", "", "6h"), 0) / 60 / 60 / 24)
    #Threshold can be higher here once we support string lookups in lookup tables https://github.com/bosun-monitor/bosun/issues/268
    $warn_days = $days_to_zero > 0 && $days_to_zero < 7
    $crit_days =   $days_to_zero > 0 && $days_to_zero < 1

    ##Percent Free Section
    $pf_time = "5m"
    $percent_free = avg(influx($db, '''SELECT min(value) FROM "os.disk.fs.percent_free" GROUP BY disk, host''', "5m", "", ""))
    $used = avg(influx($db, '''SELECT max(value) FROM "os.disk.fs.percent_used" GROUP BY disk, host''', "5m", "", ""))
    $total = avg(influx($db, '''SELECT min(value) FROM "os.disk.fs.space_total" GROUP BY disk, host''', "5m", "", ""))

    $warn_percent = $percent_free <  lookup("disk_space", "warn_percent_free")
    #Linux stops root from writing at less than 5%
    $crit_percent = $percent_free <  lookup("disk_space", "crit_percent_free")
    #For graph (long time)
    $percent_free_graph = influx($db, '''SELECT min(value) FROM "os.disk.fs.percent_free" GROUP BY disk, host''', "4d", "", "1h")

    ##Main Logic
    warn = $warn_percent || $warn_days
    crit = $crit_percent || $crit_days

    ##Options
    ignoreUnknown = true
    #This is needed because disks go away when the forecast doesn't
    unjoinedOk = true

}

@gbrayut
Copy link
Contributor

gbrayut commented Sep 24, 2015

LGTM

Thanks for working on this!

gbrayut added a commit that referenced this pull request Sep 24, 2015
cmd/Bosun: enable group by interval in the influx() query
@gbrayut gbrayut merged commit e0c6b4c into bosun-monitor:master Sep 24, 2015
@giganteous giganteous deleted the influxql-enable-groupbyinterval branch September 30, 2015 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants