Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: track stats of end-to-end message processing time #280

Merged
merged 1 commit into from
Oct 29, 2013

Conversation

mynameisfiber
Copy link
Contributor

As per #268, this pull request adds the ability to track statistics about the time it takes for messages to get .finished(). This is done by maintaining a probabilistic percentile calculation over each channel (using the perks package) and merging them when topic quantiles are requested.

This data is surfaced in:

  • The /stats call
  • statsd
  • admin interface

Still to do:

  • Before/after timings
  • Docs

Sample data from /stats call:

$ curl "localhost:4151/stats"
nsqd v0.2.23 (built w/go1.1.2)

[test           ] depth: 0     be-depth: 0     msgs: 591      e2e%: 13.5s, 13.5s, 683.9us, 375.1us, 269.6us
    [test_chan                ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 591      e2e%: 13.5s, 13.5s, 1.1ms, 394.7us, 268.8us
        [V2 muon:55712           ] state: 3 inflt: 0    rdy: 62   fin: 591      re-q: 0        msgs: 591      connected: 7m18s

@davemarchevsky
Copy link

You're a machine @mynameisfiber

@mynameisfiber
Copy link
Contributor Author

@davemarchevsky just wait until the comments start flowing in

@mreiferson
Copy link
Member

so we meet again trebek

@@ -348,13 +357,34 @@ func (c *Channel) TouchMessage(clientID int64, id nsq.MessageID) error {
return nil
}

func (c *Channel) processMessagePercentile() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want to add another goroutine to the channel to manage this data.

There are a couple reasons...

  1. as written there's a race between some other goroutine that calls Exit() (which closes c.messagePercentilesChan) and any other goroutine that calls FinishMessage() (which sends on c.messagePercentilesChan). We've handled these cases in a variety of ways, but in short, its a pain in the ass.
  2. the overhead. I really want to minimize overhead from this on the fast path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine... we can do it another way also that maintains the state of the last reset and just does a .Reset() on the quantile if it's too old.

I had originally done this but opted out because if a channel hasn't had a finished message in a while, the data will be old and un-reset. There are ways around this though, by having a custom interface into the quantile object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to literally inline it in FinishMessage()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly... originally it did:

    c.removeFromInFlightPQ(item)

    msgStartTime := item.Value.(*inFlightMessage).msg.Timestamp
        nowNano := time.Now().UnixNano()
        if msgStartTime > c.lastPercentileResetTime + timeToReset {
            c.messagePercentilesStream.Reset()
            c.lastPercentileResetTime = nowNano
        }
        c.messagePercentilesStream.Insert(nowNano - msgStartTime)

    return nil

@mreiferson
Copy link
Member

by the way I laughed at the soccer in the face kid basically all day today

@mreiferson
Copy link
Member

but this one is great too

d445c38e-3362-11e3-9ba3-924cc7ee57a9

@mynameisfiber
Copy link
Contributor Author

hahaha, that's great. i've got a couple more gif's up my sleeve depending on how this pull request goes ;)

@mreiferson
Copy link
Member

This is the 25th comment and it's going well so far... I estimate we'll be in the 50ish range before we've merged.

@mreiferson
Copy link
Member

If @jehiah gets involved all bets are off.

@mreiferson
Copy link
Member

If @ploxiln gets involved I'm leaving.

@mynameisfiber
Copy link
Contributor Author

@mreiferson if this gets to 100 comments I'm leaving.... I think that encompasses either @jehiah or @ploxiln joining.

@ploxiln
Copy link
Member

ploxiln commented Oct 22, 2013

does chiming in to say "hah don't worry about that" count as "getting involved"

@mreiferson
Copy link
Member

FYI, if you want to compare before/after performance you can use bench_reader/bench_writer:

# setup
$ nsqd --mem-queue-size=1000000
$ bench_writer
# this creates the channel so the topic drain into the channel isn't part of the benchmark
$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
# this is the one that is interesting
$ bench_reader

@mynameisfiber
Copy link
Contributor Author

@mreiferson some timings --

master:

(origin/master)$ bench_writer
2013/10/22 15:08:51 duration: 5.043290723s - 37.820mb/s - 198283.235ops/s - 5.043us/op
(origin/master)$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
{"status_code":200,"status_txt":"OK","data":null}
(origin/master)$ bench_reader
2013/10/22 15:09:06 duration: 7.055659747s - 27.033mb/s - 141730.190ops/s - 7.056us/op

this branch:

(micha/master)$ bench_writer
2013/10/22 15:06:24 duration: 3.298285488s - 57.828mb/s - 303187.824ops/s - 3.298us/op
(micha/master)$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
{"status_code":200,"status_txt":"OK","data":null}
(micha/master)$ bench_reader
2013/10/22 15:06:45 duration: 7.20925809s - 26.457mb/s - 138710.529ops/s - 7.209us/op

@mynameisfiber
Copy link
Contributor Author

That being said, I've run the benchmarks several times and the values for both branches vary within the same range... I'd say that the average timings for this branch vs master are much closer than their variances.

@mreiferson
Copy link
Member

That's good news. Does that afford us any opportunities for smoothing over the issues relating to how we're resetting?

@mreiferson
Copy link
Member

can you paste the /stats output after a test run on your branch (curious how it looks with some real data)

statsdMemStats = flag.Bool("statsd-mem-stats", true, "toggle sending memory and GC stats to statsd")
statsdPrefix = flag.String("statsd-prefix", "nsq.%s", "prefix used for keys sent to statsd (%s for host replacement)")
messagePercentiles = util.FloatArray{}
messagePercentileResetTime = flag.Float64("message-percentile-reset-time", 600, "time in seconds between message percentile data flushes (default: 600)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be a flag.Duration (which is really convenient)

@mynameisfiber
Copy link
Contributor Author

$ curl "localhost:4151/stats"
nsqd v0.2.23-alpha (built w/go1.1.2)

[cluster_test1382469115] depth: 0     be-depth: 0     msgs: 0        percentiles: 0, 0, 0, 0, 0
    [ch                       ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 0        percentiles: 0, 0, 0, 0, 0

[sub_bench      ] depth: 0     be-depth: 0     msgs: 1000000  percentiles: 1m34.056548253s, 1m34.056548253s, 23.635964732s, 23.256514065s, 19.914634067s
    [ch                       ] depth: 800   be-depth: 0     inflt: 160  def: 0    re-q: 0     timeout: 0     msgs: 1000000  percentiles: 1m34.056548253s, 1m34.056548253s, 23.49301998s, 23.124245376s, 19.833569623s

[test           ] depth: 0     be-depth: 0     msgs: 0        percentiles: 0, 0, 0, 0, 0
    [test_chan                ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 0        percentiles: 0, 0, 0, 0, 0

As for the smoothing, we can always roll between two quantiles and then merge them when results are requested. That would make the queries for the percentiles a bit more expensive, but since that's not per message I'm alright with that.

Thoughts?

@mreiferson
Copy link
Member

thats still pretty unwieldy...

I suppose all you really want is secs, ms, us so instead of:

percentiles: 1m34.056548253s, 1m34.056548253s, 23.49301998s, 23.124245376s, 19.833569623s

we would have

percentiles: 94.1s, 94.1s, 23.5s, 23.1s, 19.8s

and if we had a unit smaller than a second...

percentiles: 5s, 4s, 2s, 500ms, 885us

Maybe we roll our own?

@mynameisfiber
Copy link
Contributor Author

Benchmarks with 2 rolling quantiles:

(micha/master)$ bench_writer
2013/10/22 16:53:00 duration: 8.926523399s - 21.367mb/s - 112025.696ops/s - 8.927us/op
(micha/master))$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
{"status_code":200,"status_txt":"OK","data":null}
(micha/master)$ bench_reader
2013/10/22 16:53:33 duration: 20.909254323s - 9.122mb/s - 47825.713ops/s - 20.909us/op

(origin/master)$ bench_writer
2013/10/22 16:55:47 duration: 8.712874453s - 21.891mb/s - 114772.686ops/s - 8.713us/op
(origin/master)$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
{"status_code":200,"status_txt":"OK","data":null}
(origin/master)$ bench_reader
2013/10/22 16:56:08 duration: 14.900942757s - 12.800mb/s - 67109.848ops/s - 14.901us/op

The benchmarks are being quite variable... I think my laptop is a bit overloaded at the moment.

@mreiferson
Copy link
Member

this is what I get:

# before
2013/10/22 17:15:26 duration: 629.3911ms - 30.305mb/s - 158883.721ops/s - 6.294us/op

# after
2013/10/22 17:12:54 duration: 700.623892ms - 27.224mb/s - 142729.931ops/s - 7.006us/op

I should really write a little bench.sh script, you might be neglecting to rm -f *.dat (and restart nsqd) before you run a bench iteration...

@mynameisfiber
Copy link
Contributor Author

New timing format:

$ curl "localhost:4151/stats"
nsqd v0.2.23-alpha (built w/go1.1.2)

[sub_bench      ] depth: 0     be-depth: 0     msgs: 1852000  percentiles: 91.0s, 91.0s, 90.4s, 89.2s, 35.8s
    [ch                       ] depth: 0     be-depth: 0     inflt: 210  def: 0    re-q: 0     timeout: 0     msgs: 1852000  percentiles: 91.0s, 91.0s, 90.4s, 89.2s, 35.6s

The last question remaining is how should we display this in the nsqadmin page? Since we are taking data from multiple nsqd's, I'm not sure what the best way to aggregate the data is. I'm thinking for each percentile we show the min/max value. Thoughts?

statsdAddress string
statsdPrefix string
statsdInterval time.Duration
e2eProcessingLatencyWindowTime time.Duration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline before for diff and group with var below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k

@mreiferson
Copy link
Member

could you elaborate re: the issues with adding graphs?

@mynameisfiber
Copy link
Contributor Author

@mreiferson the main thing is just UI.... we are already inserting a lot of data so displaying it inline in the e2e cell would cramp things up. I could add another column, but then we probably go off screen. Any thoughts?

@mreiferson
Copy link
Member

I expect this to be be "VIP metrics", I'd want them to be easily accessible...

screenshot 2013-10-24 15 15 18

and at the top of the topic and channel pages (showing aggregate across all producers) for each quantile

@mynameisfiber
Copy link
Contributor Author

@mreiferson and should the graph show all available percentiles on one plot? multiple sparklines? toggle between them? here is the issue.

@mreiferson
Copy link
Member

yea, I think for sparkline sized graphs... maybe just the aggregate textual values are better...

}
*a = append(*a, v)
}
sort.Sort(sort.Reverse(sort.Float64Slice(*a)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort.Reverse doesn't work on go1.0.3 😦

@ploxiln
Copy link
Member

ploxiln commented Oct 25, 2013

we've got an epic going on over here

@mynameisfiber
Copy link
Contributor Author

New look of nsqadmin with the graphs... thoughts?

nsqadmin

@mreiferson
Copy link
Member

@mynameisfiber I played around with the table layout, I think it's an improvement:

screenshot 2013-10-28 22 18 16

I had to drop the sparklines, though... Lemme know what you think

Pickup the commit at mreiferson@d24d773

@mynameisfiber
Copy link
Contributor Author

That looks awesome! Why do you need to drop the sparklines? We could still have one underneith with for the highest percentile and rowspan set to span the entire "e2e processing latency" section. Now sure how easy that'll be with go/html/template though.

airsnakes++

@mynameisfiber
Copy link
Contributor Author

how i feel about this probabilistic approach working:

image

@mynameisfiber
Copy link
Contributor Author

For the docs:

The end to end message latency is calculated by using the probabilistic
percentile technique described in ["Effective Computation of Biased Quantiles
over Data Streams"](http://www.cs.rutgers.edu/~muthu/bquant.pdf).  We use
bmizerany's implemintation of this algorithm in the
[perks](http://godoc.org/github.com/bmizerany/perks) library.

In order to only show quantile information for the past 10minutes, we
internally have two quantiles that each store 5 minutes worth of latency data
for each channel.  Once a quantile has been used for 5 minutes, we clear the
other quantile and start inserting new data into it.  Since quantiles can be
merged, this means we can have a coarse rolling window.

In addition, all channel quantiles are merged to get topic latency information.
This can be done if the channels are on the same nsqd host, however when data
is being accumulated through lookupd we take averages over the quantile value
for each nsqd host.  In order to maintain some statistics regarding the
distribution of the latencies over multiple hosts, we provide the min/max
values in addition to the average.

@mynameisfiber
Copy link
Contributor Author

Final state of nsqadmin:

nsqadmin

Mouseovers on sparklines review preview of full graph

This commit adds new statistics about how long individual messages take
from being published to being finished.  We maintain several percentile
values, the exactly quantiles which can be specified as runtime
parameters to nsqd.  In addition, this information is displayed quite
shnazily in nsqadmin.
@mreiferson
Copy link
Member

👍 💯 🎆

mreiferson added a commit that referenced this pull request Oct 29, 2013
nsqd: track stats of end-to-end message processing time
@mreiferson mreiferson merged commit fa88eb6 into nsqio:master Oct 29, 2013
@mynameisfiber
Copy link
Contributor Author

How I thought this pull request would go:

How it actually went:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants