Conversation
cassGetWaitDuration.Value(time.Since(crr.timestamp))
waitDuration := time.Since(crr.timestamp)
cassGetWaitDuration.Value(waitDuration)
if waitDuration > c.omitReadTimeout {
this only drops readRequests that have been in the queue too long. The requests might already have spent many minutes waiting to be added to c.readQueue due to it being full
so to fix that problem I'll also need to do what you mentioned earlier: the non-blocking add to the queue. I thought that was out of scope for this PR, but I can add it too
hmm, maybe not. But if the waitDuration is timed from when the search is started until when it is pulled from the queue, then why were we seeing waitDurations of a few seconds while queries were still queued and running for 10+ minutes after requests stopped being sent to a cluster?
I think you are right that some of the reads were waiting to be put into this queue. But my hope/guess was that this "backpressure" problem would get resolved if we omit the old reads as we consume them here, and hence clear the queue faster.
pushed another commit for that e5b7653
mdata/store_cassandra.go
Outdated
case c.readQueue <- crrs[i]:
default:
	cassReadQueueFull.Inc()
	numQueries--
this will cause the range loop on the channel at line #459 to never exit. As no result will be pushed down the channel for this readRequest, seen will never == numQueries and so the channel won't be closed
Actually that's why numQueries is decreased, so that the loop on #459 does not wait for this result
notes so far. need to look at this more.
cmd/mt-store-cat/main.go
Outdated
groupTTL = flag.String("groupTTL", "d", "group chunks in TTL buckets based on s (second. means unbucketed), m (minute), h (hour) or d (day). only for chunk-summary format")
windowFactor = flag.Int("window-factor", 20, "the window factor to be used when creating the metric table schema")
timeZoneStr = flag.String("time-zone", "local", "time-zone to use for interpreting from/to when needed. (check your config)")
cassandraOmitReadTimeout = flag.Int("cassandra-omit-read-timeout", 60, "if a read is older than this, it will directly be omitted without executing")
should go in the section above, "flags from metrictank.go, Cassandra", not "our own flags"
mdata/store_cassandra.go
Outdated
@@ -36,6 +36,7 @@ const Table_name_format = `metric_%d`
var (
	errChunkTooSmall = errors.New("unpossibly small chunk in cassandra")
	errStartBeforeEnd = errors.New("start must be before end.")
	errReadQueueFull = errors.New("the cassandra read queue is full")
it's already implied that we're in code related to cassandra, so we can just say "read queue is full". The top error mentions cassandra specifically because that's cassandra the database; all the other errors are about MT's cassandra code base
mdata/store_cassandra.go
Outdated
// reads that were already too old to be executed
cassOmitOldRead = stats.NewCounter32("store.cassandra.omitted_old_reads")
// reads that could not be pushed into the queue because it was full
cassReadQueueFull = stats.NewCounter32("store.cassandra.read_queue_full")
so both metrics drop a read, but for a different reason. The read_queue_full name is not very clear about what happened as a result of the queue being full, so I suggest something like:
store.cassandra.omit_read.too_old
store.cassandra.omit_read.queue_full
default:
	cassReadQueueFull.Inc()
	return nil, errReadQueueFull
}
this is a pretty drastic change in behavior; I have to think a bit more about it.
But either way, if we're going to do this, then before putting the first crr in, you should check whether cap(queue) - len(queue) < numQueries. If so, we shouldn't bother putting any crr into the readQueue, because it is extremely likely that we will hit the errReadQueueFull condition anyway, and it would be a waste, when we can predict that will happen, to still put all those CRRs in for a request that won't be served
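The suggested pre-check could look roughly like this. hasRoom is a hypothetical helper for illustration, and note that len/cap on a channel only give a snapshot under concurrency, so this is a fast heuristic rather than a guarantee:

```go
package main

import "fmt"

// hasRoom reports whether a buffered channel can absorb n more sends
// right now. len and cap are safe to call on a channel, but the answer
// is only a snapshot: other goroutines may fill or drain the queue
// immediately afterwards, so this is a cheap early-bailout heuristic.
func hasRoom(queue chan int, n int) bool {
	return cap(queue)-len(queue) >= n
}

func main() {
	q := make(chan int, 100)
	for i := 0; i < 95; i++ {
		q <- i // occupy 95 of the 100 slots
	}
	fmt.Println(hasRoom(q, 5)) // true: exactly 5 free slots remain
	fmt.Println(hasRoom(q, 6)) // false: 6 sends cannot all fit
}
```

If the check fails, the whole request can be rejected up front instead of enqueueing some CRRs that will almost certainly end with errReadQueueFull anyway.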
switching from block-on-full to drop-on-full has some ramifications. Consider the setting cassandra-read-queue-size, currently defaulting to 100 and documented as "max number of outstanding reads before blocking. value doesn't matter much".
This value will need to be increased significantly, and clearly the consequences of this setting are now different, so the value needs to be documented differently.
Also, when this is merged, the next MT release will be a major version bump because of this change (not a problem, just an FYI)
so the value of read queue size now becomes critically important. Here's how I think we can figure out a sane setting:
- for any query, for any target retrieved, 1 CRR is issued per month of data crossed, irrespective of which aggregation we're using.
- the read queue should allow buffering a temporary peak in workload, as long as that workload can be cleared out of the queue in, say, 10 seconds.
- thus, at any given time, we need to be able to put 10 seconds worth of the max-handleable workload in the queue.
- it's better to over-estimate what the max-handleable workload is (because we'll just drop reads from the queue for being too old) than to under-estimate it (which would result in a smaller queue that fills up too quickly, causing render requests to error when they could have been processed fine).
- so let's aim for a rather high estimate, based on some numbers we've seen in the wild. Let's say a heavy query involves 20k targets and each target reads 2y = 24 months worth of data, and let's say we (aim to) support 200 such requests per second. That's 96M CRRs per second, or 960M for 10 seconds worth.
- for this kind of workload, you'd use >= 8 MT instances, so each instance would need a read queue that can hold 960M/8 = 120M CRRs.
- the queue is currently a buffered chan of pointers, so 120M * 8B per pointer = 960MB of RAM worth of read queue.

So, this is one way of calculating how much resources we need for a read queue to accommodate legitimate CRRs (CRRs that we don't want dropped unless we really can't handle the load). There are other ways to compute this, of course. For example, we also see in our dashboards that some clusters do up to 200k cassandra gets/s (CRRs/sec), and this with healthy response times. So we have to ask: how much higher should the volume be before we can confidently drop reads before even giving them a chance to go through the queue (and be dropped if it turns out they're too old)? Let's say a workload 20x that. That would be 4M CRRs/sec, or 40M over 10 seconds, or 40M/8 = 5M per node. With this model, we're estimating 24x less than with the previous method. So we can meet somewhere halfway between the two methods and suggest a 10M read queue per node.

Here's something interesting though: looking at our dashboards, we have chunks of about 40B average size, and the average search results in less than 1 chunk per read; that means that per CRR we get no more than 40B of data. The ChunkReadRequest type is 80 bytes big (see the sizeof utility for golang), and a pointer to it is 64 bits, so each pending CRR costs us 88B of RAM, more than twice the amount of data we will actually retrieve!?
This is a reason why i'm not a huge fan of drop-on-full-queue.
I don't see why the behavior should change from blocking to drop-on-full.
I see there was some discussion in #685 (review) but I don't get it. The code is pretty clear that the timestamps are set when CRRs are created, before they are put into the queue, and we look at the age after pulling from the queue, so the age includes the time spent waiting to get into the queue
@Dieterbe your math on queue size is way off. You are dreaming to think you can query cassandra at 96 million requests per second.
The entire purpose of this PR is to prevent MT instances from trying to send too many queries to cassandra, so base the queue sizes on the read performance of cassandra.
We really don't want MT instances querying at more than about 5k/second per MT node. So a single query split across 4 shards would be able to query at 20k/second.
So the queue should not be more than about 200k (10 seconds at 4x the average rate).
Increasing it any more than that will just lead to overloading cassandra.
done, i've set it to 200000
FTR: after the three of us discussed this, we decided it's better to provide consistent performance and sometimes drop reads that could otherwise have been served if no other tenants were exerting much load
mdata/store_cassandra.go
Outdated
}
outcomes := make([]outcome, 0, numQueries)
seen := 0
for o := range results {
	if o.omitted {
		return nil, errReadQueueFull
this condition is triggered by an omit due to tooOld, so why return errReadQueueFull?
if it didn't return an error, we might end up with an incomplete result, which in many cases is worse than an error
my suggestion is to return an error that is correct (e.g. errReadTooOld or something)
@@ -362,9 +370,15 @@ func (o asc) Less(i, j int) bool { return o[i].sortKey < o[j].sortKey }

func (c *CassandraStore) processReadQueue() {
	for crr := range c.readQueue {
		cassGetWaitDuration.Value(time.Since(crr.timestamp))
		waitDuration := time.Since(crr.timestamp)
I think we should set the crr timestamps more consistently (wrt all other crrs from the same SearchTable call). Concretely (I can't comment on the particular code, because you didn't modify it): instead of calling time.Now() in each query() call in SearchTable, we can just reuse pre, since the time between each query call is negligible anyway.
This will help avoid a class of race conditions that is pretty rare, I think, but might happen, where we omit certain CRRs here for being too old while processing others that were just under the threshold. If we omit one, we might as well omit all the others from the same request, and by using the same timestamp in all of them, we can guarantee that all CRRs after an omitted one will also be omitted.
(Also, it would perform a bit better by not getting the time over and over.)
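The suggestion might look roughly like the following. buildCRRs is a hypothetical reduction of SearchTable's query loop, taking the time once up front (the role `pre` plays in the real code) and stamping every CRR of the call with it:

```go
package main

import (
	"fmt"
	"time"
)

// ChunkReadRequest is a simplified stand-in for the real type;
// only the timestamp matters for this sketch.
type ChunkReadRequest struct {
	timestamp time.Time
}

// buildCRRs stamps every request of one SearchTable call with the same
// time, taken once up front. With a shared timestamp, if one CRR is
// omitted for being too old, every later CRR from the same search is
// guaranteed to cross the threshold too, avoiding mixed omit/serve races.
func buildCRRs(n int) []ChunkReadRequest {
	pre := time.Now() // one clock read per call, instead of one per query()
	crrs := make([]ChunkReadRequest, n)
	for i := range crrs {
		crrs[i] = ChunkReadRequest{timestamp: pre}
	}
	return crrs
}

func main() {
	crrs := buildCRRs(3)
	// all CRRs of the call share one timestamp
	fmt.Println(crrs[0].timestamp.Equal(crrs[1].timestamp)) // true
	fmt.Println(crrs[0].timestamp.Equal(crrs[2].timestamp)) // true
}
```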
cmd/mt-split-metrics-by-ttl/main.go
Outdated
@@ -32,7 +32,8 @@ var (
cassandraUsername = flag.String("cassandra-username", "cassandra", "username for authentication")
cassandraPassword = flag.String("cassandra-password", "cassandra", "password for authentication")
windowFactor = flag.Int("window-factor", 20, "the window factor to be used when creating the metric table schema")
cassandraOmitReadTimeout = flag.Int("cassandra-omit-read-timeout", 60, "if a read is older than this, it will directly be omitted without executing")
this tool doesn't do any reads, so having this option here is rather confusing. We can just hardcode a number and comment that it's not relevant because we don't do reads.
actually, the same goes for some other options like cassandraReadConcurrency. We can just hardcode such values and comment why.
I updated everything according to the comments.
looks good now mauro. yes please update docs and config files, then we can merge.