Conversation
cassGetWaitDuration.Value(time.Since(crr.timestamp))
waitDuration := time.Since(crr.timestamp)
cassGetWaitDuration.Value(waitDuration)
if waitDuration > c.omitReadTimeout {
this only drops readRequests that have been in the queue too long. The requests might already have spent many minutes waiting to be added to c.readQueue due to it being full
so to fix that problem I'll also need to do what you mentioned earlier: the non-blocking add to the queue. I thought that was out of scope for this PR, but I can add it too
hmm, maybe not. But if the waitDuration is timed from when the search is started until when it is pulled from the queue, then why were we seeing waitDurations of a few seconds while queries were still queued and running for 10+ minutes after requests stopped being sent to a cluster?
I think you are right that some of the reads were waiting to be put into this queue. But my hope/guess was that this "backpressure" problem would get resolved if we omit the old reads as we consume them here, and hence clear the queue faster.
pushed another commit for that e5b7653
mdata/store_cassandra.go
Outdated
case c.readQueue <- crrs[i]:
default:
	cassReadQueueFull.Inc()
	numQueries--
this will cause the range loop on the channel at line #459 to never exit. As no result will be pushed down the channel for this readRequest, seen will never == numQueries and so the channel won't be closed
Actually that's why numQueries is decreased, so that the loop on #459 does not wait for this result
notes so far. need to look at this more.
cmd/mt-store-cat/main.go
Outdated
groupTTL = flag.String("groupTTL", "d", "group chunks in TTL buckets based on s (second. means unbucketed), m (minute), h (hour) or d (day). only for chunk-summary format")
windowFactor = flag.Int("window-factor", 20, "the window factor to be used when creating the metric table schema")
timeZoneStr = flag.String("time-zone", "local", "time-zone to use for interpreting from/to when needed. (check your config)")
cassandraOmitReadTimeout = flag.Int("cassandra-omit-read-timeout", 60, "if a read is older than this, it will directly be omitted without executing")
should go in the section above, "flags from metrictank.go, Cassandra", not "our own flags"
mdata/store_cassandra.go
Outdated
@@ -36,6 +36,7 @@ const Table_name_format = `metric_%d`
var (
	errChunkTooSmall = errors.New("unpossibly small chunk in cassandra")
	errStartBeforeEnd = errors.New("start must be before end.")
	errReadQueueFull = errors.New("the cassandra read queue is full")
it's already implied that we're in code related to cassandra, so we can just say "read queue is full". The top error mentions cassandra specifically because that's cassandra the database; all the other errors are about MT's cassandra code base
mdata/store_cassandra.go
Outdated
// reads that were already too old to be executed
cassOmitOldRead = stats.NewCounter32("store.cassandra.omitted_old_reads")
// reads that could not be pushed into the queue because it was full
cassReadQueueFull = stats.NewCounter32("store.cassandra.read_queue_full")
so both metrics drop a read, but for a different reason. The read_queue_full name is not very clear about what happened as a result of the queue being full, so I suggest something like:
store.cassandra.omit_read.too_old
store.cassandra.omit_read.queue_full
default:
	cassReadQueueFull.Inc()
	return nil, errReadQueueFull
}
this is a pretty drastic change in behavior; I have to think a bit more about it.
But either way, if we're going to do this, then before putting the first crr in, you should check whether cap(queue) - len(queue) < numQueries. If so, we shouldn't bother putting any crr into the readQueue, because it is extremely likely that we will hit the errReadQueueFull condition anyway, and it would be a waste, when we can predict that will happen, to still put all those CRRs in for a request that won't be served
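The suggested pre-check could look roughly like this. hasRoom is a hypothetical helper for illustration, and note that len/cap on a channel only give a snapshot under concurrency, so this is a fast heuristic rather than a guarantee:

```go
package main

import "fmt"

// hasRoom reports whether a buffered channel can absorb n more sends
// right now. len and cap are safe to call on a channel, but the answer
// is only a snapshot: other goroutines may fill or drain the queue
// immediately afterwards, so this is a cheap early-bailout heuristic.
func hasRoom(queue chan int, n int) bool {
	return cap(queue)-len(queue) >= n
}

func main() {
	q := make(chan int, 100)
	for i := 0; i < 95; i++ {
		q <- i // occupy 95 of the 100 slots
	}
	fmt.Println(hasRoom(q, 5)) // true: exactly 5 free slots remain
	fmt.Println(hasRoom(q, 6)) // false: 6 sends cannot all fit
}
```

If the check fails, the whole request can be rejected up front instead of enqueueing some CRRs that will almost certainly end with errReadQueueFull anyway.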
switching from block-on-full to drop-on-full has some ramifications. Consider the setting cassandra-read-queue-size, currently defaulting to 100 and documented as "max number of outstanding reads before blocking. value doesn't matter much".
This value will need to be increased significantly, and clearly the consequences of this setting are now different, so the value needs to be documented differently.
Also, when this is merged, the next MT release will be a major version bump because of this change (not a problem, just an FYI)
so the value of read queue size now becomes critically important. Here's how I think we can figure out a sane setting:
- for any query, for any target retrieved, 1 CRR is issued per month of data crossed, irrespective of which aggregation we're using.
- the read queue should allow buffering a temporary peak in workload, as long as that workload can be cleared out of the queue in, say, 10 seconds.
- thus, at any given time, we need to be able to put 10 seconds worth of the max-handleable workload in the queue.
- it's better to over-estimate what the max-handleable workload is (because we'll just drop reads from the queue for being too old) than to under-estimate it (which would result in a smaller queue that fills up too quickly, causing render requests to error when they could have been processed fine).
- so let's aim for a rather high estimate, based on some numbers we've seen in the wild. Let's say a heavy query involves 20k targets and each target reads 2y = 24 months worth of data, and let's say we (aim to) support 200 such requests per second. That's 96M CRRs per second, or 960M for 10 seconds worth.
- for this kind of workload, you'd use >= 8 MT instances, so each instance would need a read queue that can hold 960M/8 = 120M CRRs.
- the queue is currently a buffered chan of pointers, so 120M * 8B per pointer = 960MB of RAM worth of read queue.

So, this is one way of calculating how much resources we need for a read queue to accommodate legitimate CRRs (CRRs that we don't want dropped unless we really can't handle the load). There are other ways to compute this, of course. For example, we also see in our dashboards that some clusters do up to 200k cassandra gets/s (CRRs/sec), and this with healthy response times. So we have to ask: how much higher should the volume be before we can confidently drop reads before even giving them a chance to go through the queue (and be dropped if it turns out they're too old)? Let's say a workload 20x that. That would be 4M CRRs/sec, or 40M over 10 seconds, or 40M/8 = 5M per node. With this model, we're estimating 24x less than with the previous method. So we can meet somewhere halfway between the two methods and suggest a 10M read queue per node.

Here's something interesting though: looking at our dashboards, we have chunks of about 40B average size, and the average search results in less than 1 chunk per read; that means that per CRR we get no more than 40B of data. The ChunkReadRequest type is 80 bytes big (see the sizeof utility for golang), and a pointer to it is 64 bits, so each pending CRR costs us 88B of RAM, more than twice the amount of data we will actually retrieve!?
This is a reason why i'm not a huge fan of drop-on-full-queue.
I don't see why the behavior should change from blocking to drop-on-full.
I see there was some discussion in #685 (review) but I don't get it. The code is pretty clear that the timestamps are set when CRRs are created, before they are put into the queue, and we look at the age after pulling from the queue, so the age includes the time spent waiting to get into the queue
@Dieterbe your math on queue size is way off. You are dreaming to think you can query cassandra at 96 million requests per second.
The entire purpose of this PR is to prevent MT instances from trying to send too many queries to cassandra, so base the queue sizes on the read performance of cassandra.
We really don't want MT instances querying at more than about 5k/second per MT node. So a single query split across 4 shards would be able to query at 20k/second.
So the queue should not be more than about 200k (10 seconds at 4x the average rate).
Increasing it any more than that will just lead to overloading cassandra.
done, i've set it to 200000
FTR: after the three of us discussed this, we decided it's better to provide consistent performance and sometimes drop reads that could otherwise have been served if no other tenants were exerting much load
mdata/store_cassandra.go
Outdated
}
outcomes := make([]outcome, 0, numQueries)
seen := 0
for o := range results {
	if o.omitted {
		return nil, errReadQueueFull
this condition is triggered by an omit due to tooOld, so why return errReadQueueFull?
if it didn't return an error, we might end up with an incomplete result, which in many cases is worse than an error
my suggestion is to return an error that is correct (e.g. errReadTooOld or something)
@@ -362,9 +370,15 @@ func (o asc) Less(i, j int) bool { return o[i].sortKey < o[j].sortKey }

func (c *CassandraStore) processReadQueue() {
	for crr := range c.readQueue {
		cassGetWaitDuration.Value(time.Since(crr.timestamp))
		waitDuration := time.Since(crr.timestamp)
I think we should set the crr timestamps more consistently (wrt all other crrs from the same SearchTable call). Concretely (I can't comment on the particular code, because you didn't modify it): instead of calling time.Now() in each query() call in SearchTable, we can just reuse pre, since the time between each query call is negligible anyway.
This will help avoid a class of race conditions that is pretty rare, I think, but might happen, where we omit certain CRRs here for being too old while processing others that were just under the threshold. If we omit one, we might as well omit all the others from the same request, and by using the same timestamp in all of them, we can guarantee that all CRRs after an omitted one will also be omitted.
(Also, it would perform a bit better by not getting the time over and over.)
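The suggestion might look roughly like the following. buildCRRs is a hypothetical reduction of SearchTable's query loop, taking the time once up front (the role `pre` plays in the real code) and stamping every CRR of the call with it:

```go
package main

import (
	"fmt"
	"time"
)

// ChunkReadRequest is a simplified stand-in for the real type;
// only the timestamp matters for this sketch.
type ChunkReadRequest struct {
	timestamp time.Time
}

// buildCRRs stamps every request of one SearchTable call with the same
// time, taken once up front. With a shared timestamp, if one CRR is
// omitted for being too old, every later CRR from the same search is
// guaranteed to cross the threshold too, avoiding mixed omit/serve races.
func buildCRRs(n int) []ChunkReadRequest {
	pre := time.Now() // one clock read per call, instead of one per query()
	crrs := make([]ChunkReadRequest, n)
	for i := range crrs {
		crrs[i] = ChunkReadRequest{timestamp: pre}
	}
	return crrs
}

func main() {
	crrs := buildCRRs(3)
	// all CRRs of the call share one timestamp
	fmt.Println(crrs[0].timestamp.Equal(crrs[1].timestamp)) // true
	fmt.Println(crrs[0].timestamp.Equal(crrs[2].timestamp)) // true
}
```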
cmd/mt-split-metrics-by-ttl/main.go
Outdated
@@ -32,7 +32,8 @@ var (
cassandraUsername = flag.String("cassandra-username", "cassandra", "username for authentication")
cassandraPassword = flag.String("cassandra-password", "cassandra", "password for authentication")
windowFactor = flag.Int("window-factor", 20, "the window factor to be used when creating the metric table schema")
cassandraOmitReadTimeout = flag.Int("cassandra-omit-read-timeout", 60, "if a read is older than this, it will directly be omitted without executing")
this tool doesn't do any reads, so having this option here is rather confusing. We can just hardcode a number and comment that it's not relevant because we don't do reads.
actually, the same goes for some other options like cassandraReadConcurrency. We can just hardcode such values and comment why.
I updated everything according to the comments.
looks good now mauro. yes please update docs and config files, then we can merge.