
omit reads older than X seconds #685

Merged
replay merged 7 commits into master from omit_old_reads on Aug 14, 2017

Conversation

@replay (Contributor) commented Jul 10, 2017

No description provided.

-		cassGetWaitDuration.Value(time.Since(crr.timestamp))
+		waitDuration := time.Since(crr.timestamp)
+		cassGetWaitDuration.Value(waitDuration)
+		if waitDuration > c.omitReadTimeout {
Member:

this only drops requests that have been in the queue too long. The requests might have spent many minutes waiting to be added to c.readQueue due to it being full

Contributor Author:

so to fix that problem I'll also need to do what you mentioned earlier with the non-blocking adding to the queue. I thought that was out of the scope of this PR, but I can add it too

Member:

hmm, maybe not. But if the waitDuration is timed from when the search is started until when it is pulled from the queue, then why were we seeing waitDurations of a few seconds while queries were still queued and running for 10+ minutes after requests stopped being sent to a cluster?

Contributor Author:

I think you are right that some of the reads were waiting to be put into this queue. But my hope/guess was that this "backpressure" problem would get resolved if we omit the old reads that we're consuming here, and hence clear the queue faster.

Contributor Author:

pushed another commit for that e5b7653
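Pulling this thread together, the consumer-side check under discussion would look roughly like the following. This is a minimal sketch using the names from the hunk above; the outcome value and the out field used to signal the omission back to the caller are assumptions, not the merged code.

```go
func (c *CassandraStore) processReadQueue() {
	for crr := range c.readQueue {
		// crr.timestamp is set when the CRR is created, before it is
		// pushed into c.readQueue, so waitDuration covers both the wait
		// to get into the queue and the time spent inside it.
		waitDuration := time.Since(crr.timestamp)
		cassGetWaitDuration.Value(waitDuration)
		if waitDuration > c.omitReadTimeout {
			cassOmitOldRead.Inc()
			crr.out <- outcome{omitted: true} // assumed signalling field
			continue
		}
		// ...execute the cassandra query as before...
	}
}
```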

		case c.readQueue <- crrs[i]:
		default:
			cassReadQueueFull.Inc()
			numQueries--
Member:

this will cause the range loop on the channel at line #459 to never exit: as no result will be pushed down the channel for this readRequest, seen will never == numQueries and so the channel won't be closed

Contributor Author:

Actually, that's why numQueries is decreased: so that the loop on #459 does not wait for this result
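For reference, the enqueue shape being discussed, as a hedged sketch: the select/default lines mirror the hunk above, while the surrounding loop is an assumption.

```go
for i := range crrs {
	select {
	case c.readQueue <- crrs[i]:
		// queued normally
	default:
		// queue full: drop this read rather than block, and shrink
		// numQueries so the collector that waits for seen == numQueries
		// still terminates and closes the results channel.
		cassReadQueueFull.Inc()
		numQueries--
	}
}
```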

@Dieterbe (Contributor) left a comment:

notes so far. need to look at this more.

	groupTTL                 = flag.String("groupTTL", "d", "group chunks in TTL buckets based on s (second. means unbucketed), m (minute), h (hour) or d (day). only for chunk-summary format")
	windowFactor             = flag.Int("window-factor", 20, "the window factor be used when creating the metric table schema")
	timeZoneStr              = flag.String("time-zone", "local", "time-zone to use for interpreting from/to when needed. (check your config)")
	cassandraOmitReadTimeout = flag.Int("cassandra-omit-read-timeout", 60, "if a read is older than this, it will directly be omitted without executing")
Contributor:

should go in the section above ("flags from metrictank.go, Cassandra"), not in "our own flags"

@@ -36,6 +36,7 @@ const Table_name_format = `metric_%d`
 var (
 	errChunkTooSmall  = errors.New("unpossibly small chunk in cassandra")
 	errStartBeforeEnd = errors.New("start must be before end.")
+	errReadQueueFull  = errors.New("the cassandra read queue is full")
Contributor:

it's already implied that we're in the code related to cassandra, so we can just say "read queue is full". the top error talks about cassandra specifically because that's cassandra the database; all the other errors are about MT's cassandra code base.

+	// reads that were already too old to be executed
+	cassOmitOldRead = stats.NewCounter32("store.cassandra.omitted_old_reads")
+	// reads that could not be pushed into the queue because it was full
+	cassReadQueueFull = stats.NewCounter32("store.cassandra.read_queue_full")
Contributor:

so both metrics count a dropped read, but for a different reason. read_queue_full is not very clear that the drop happened because the queue was full. so I suggest something like:

store.cassandra.omit_read.too_old
store.cassandra.omit_read.queue_full

	default:
		cassReadQueueFull.Inc()
		return nil, errReadQueueFull
	}
Contributor:

this is a pretty drastic change in behavior. I have to think a bit more about this.
but either way, if we're going to do this, then before putting the first crr in, you should check whether cap(queue) - len(queue) < numQueries. if so, we shouldn't bother putting any crr into the readqueue, because it is extremely likely that we would hit the errReadQueueFull condition anyway, and if we can predict that, it would be a waste to still put all those CRR's in for a request that won't be served.
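A sketch of that suggested pre-check, using the names from this discussion rather than the merged code. Note that len() on a channel is only a best-effort snapshot while consumers are draining it concurrently.

```go
// before enqueueing any CRR for this request:
if cap(c.readQueue)-len(c.readQueue) < numQueries {
	// not enough free slots: at least one send would almost certainly
	// hit the default branch anyway, so fail fast instead of half-filling
	// the queue with CRR's for a request that won't be served.
	cassReadQueueFull.Inc()
	return nil, errReadQueueFull
}
```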

@Dieterbe (Contributor) Jul 26, 2017:

switching from block-on-full to drop-on-full has some ramifications. consider the setting cassandra-read-queue-size, currently defaulting to 100 and documented as "max number of outstanding reads before blocking. value doesn't matter much".

this value will need to be increased significantly, and clearly the consequences of this setting are now different, so the value needs to be documented differently.
also, when this is merged, the next MT release will be a major version bump because of this change (not a problem, just an FYI)

Contributor:

so the value of the read queue size now becomes critically important. here's how I think we can figure out a sane setting:

  • for any query, for any target retrieved, 1 CRR is issued per month of data crossed, irrespective of which aggregation we're using.
  • the read queue should be able to buffer a temporary peak in workload, as long as that workload can be cleared out of the queue in, say, 10 seconds.
  • thus, at any given time, we need to be able to put 10 seconds worth of the max-handleable workload in the queue.
  • it's better to over-estimate what the max-handleable workload is (because we'll just drop reads from the queue for being too old) than to under-estimate (which would result in a smaller queue that fills up too quickly, making render requests error where they could've been processed fine).
  • so let's aim for a rather high estimate, based on some numbers we've seen in the wild.
    Let's say a heavy query involves 20k targets and each target reads 2y = 24 months worth of data, and say we (aim to) support 200 such requests per second. That's 96M CRR's per second, or 960M for 10 seconds worth.
  • for this kind of workload, you'd use >= 8 MT instances.
  • so each instance would need a read queue that can hold 960M/8 = 120M CRR's.
  • the queue is currently a buffered chan of pointers, so at 120M * 8B per pointer, that's 960MB of RAM worth of read queue.

So, this is one way of calculating how much resources we need for a read queue to accommodate legitimate CRR's (CRR's that we don't want dropped unless we really can't handle the load).

There's other ways to compute this, of course.
For example, we also see in our dashboards that some clusters do up to 200k cassandra gets/s (CRR's/sec), and that with healthy response times.
So we have to ask: how much higher should the volume be before we can confidently drop reads without even giving them a chance to go through the queue (and be dropped there if it turns out they're too old)? Let's say a workload 20x that: 4M CRR's/sec, or 40M over 10 seconds, which is 40M/8 = 5M per node. With this model, we're estimating 24x less than with the previous method.

So we can meet somewhere halfway between the two methods and suggest a 10M read queue per node.

Here's something interesting though: looking at our dashboards, we have chunks of about 40B average size, and the average search results in less than 1 chunk per read; that means that per CRR we get no more than 40B of data.
The ChunkReadRequest type is 80 bytes big (see the sizeof utility for golang), and a pointer to it is 64 bits, so each pending CRR costs us 88B of RAM, more than twice the amount of data we will actually retrieve!?

This is a reason why I'm not a huge fan of drop-on-full-queue.
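For clarity, the two sizing estimates above spelled out as plain arithmetic (same numbers as in the comment, no new data):

```go
package main

import "fmt"

func main() {
	const bufferSeconds = 10 // workload peak the queue should absorb
	const instances = 8      // MT nodes sharing the load
	const ptrSize = 8        // bytes per *ChunkReadRequest in the chan

	// method 1: worst-case workload.
	// 20k targets/query * 24 months (2y) = 480k CRR's per query,
	// at 200 queries/s that's 96M CRR's/s.
	perSec := 20000 * 24 * 200
	perNode := perSec * bufferSeconds / instances
	fmt.Printf("method 1: %d CRR's/node, ~%d MB of pointers\n",
		perNode, perNode*ptrSize/1000000) // 120M CRR's, ~960 MB

	// method 2: 20x an observed healthy rate of 200k gets/s.
	perSec2 := 200000 * 20
	perNode2 := perSec2 * bufferSeconds / instances
	fmt.Printf("method 2: %d CRR's/node\n", perNode2) // 5M CRR's
}
```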

Contributor:

I don't see why the behavior should change from blocking to drop-on-full.
I see there was some discussion in #685 (review) but I don't get it. the code is pretty clear that the timestamps are set when CRR's are created, before they are put into the queue, and we look at the age after pulling from the queue. so the age includes the time spent waiting to get into the queue

@woodsaj (Member) Aug 2, 2017:

@Dieterbe your math on queue size is way off. You are dreaming if you think you can query cassandra at 96 million requests per second.

The entire purpose of this PR is to prevent MT instances from trying to send too many queries to cassandra. So base the queue sizes on the read performance of cassandra.

We really don't want MT instances querying more than about 5k/second per MT node. So a single query split across 4 shards would be able to query at 20k/second.

So the queue should not be more than about 200k (10 seconds at 4x the average rate).
Increasing it any more than that will just lead to overloading cassandra.

Contributor Author:

done, I've set it to 200000

Contributor:

FTR: after the three of us discussed this, we decided it's better to provide consistent performance and sometimes drop reads that could otherwise have been served if no other tenants were exerting much load

	}
	outcomes := make([]outcome, 0, numQueries)

	seen := 0
	for o := range results {
		if o.omitted {
			return nil, errReadQueueFull
Contributor:

this condition is triggered by an omit due to tooOld, so why return errReadQueueFull?

@replay (Contributor, Author) Aug 2, 2017:

if it didn't return an error, we might end up with an incomplete result, which in many cases is worse than an error

Contributor:

my suggestion is to return an error that is correct (e.g. errReadTooOld or something)
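A minimal sketch of that suggestion. errReadTooOld is just the name proposed here, and the outcome type and collect function are simplified stand-ins for the real code:

```go
import "errors"

// errReadTooOld is the name proposed in this thread.
var errReadTooOld = errors.New("the read is too old to be executed")

// outcome is a simplified stand-in for the real type.
type outcome struct{ omitted bool }

func collect(results chan outcome, numQueries int) ([]outcome, error) {
	outcomes := make([]outcome, 0, numQueries)
	for seen := 0; seen < numQueries; seen++ {
		o := <-results
		if o.omitted {
			// still fail the whole request (a partial result is worse
			// than an error), but name the actual cause.
			return nil, errReadTooOld
		}
		outcomes = append(outcomes, o)
	}
	return outcomes, nil
}
```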

@@ -362,9 +370,15 @@ func (o asc) Less(i, j int) bool { return o[i].sortKey < o[j].sortKey }

 func (c *CassandraStore) processReadQueue() {
 	for crr := range c.readQueue {
-		cassGetWaitDuration.Value(time.Since(crr.timestamp))
+		waitDuration := time.Since(crr.timestamp)
@Dieterbe (Contributor) Jul 26, 2017:

I think we should set the crr's timestamps more consistently (wrt all other crr's from the same SearchTable call). concretely (I can't comment on the particular code, because you didn't modify it): instead of calling time.Now() in each query() call in SearchTable, we can just reuse pre, since the time between each query call is negligible anyway.
this will help avoid a class of race conditions that is pretty rare, I think, but might happen: where we omit certain CRR's here for being too old, while processing others that were just under the threshold. if we omit one, we might as well omit all others from the same request, and by using the same timestamp in all of them, we can guarantee that all CRR's after an omitted one will also be omitted.

(also, it would perform a bit better by not getting the time over and over)
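Sketched with assumed, simplified shapes (the real SearchTable and ChunkReadRequest carry more fields; pre is the variable mentioned above):

```go
pre := time.Now() // taken once, at the start of the SearchTable call

for _, q := range queries { // assumed loop over the per-month queries
	_ = q
	crr := &ChunkReadRequest{
		// ...query details elided...
		timestamp: pre, // shared by all CRR's of this request
	}
	// because every CRR of the request carries the same timestamp, once
	// one of them is omitted as too old, every CRR queued after it is
	// guaranteed to be omitted too: no partial mix of omitted and
	// executed reads within a single request.
	c.readQueue <- crr
}
```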

@@ -32,7 +32,8 @@ var (
 	cassandraUsername = flag.String("cassandra-username", "cassandra", "username for authentication")
 	cassandraPassword = flag.String("cassandra-password", "cassandra", "password for authentication")

-	windowFactor = flag.Int("window-factor", 20, "the window factor be used when creating the metric table schema")
+	windowFactor             = flag.Int("window-factor", 20, "the window factor be used when creating the metric table schema")
+	cassandraOmitReadTimeout = flag.Int("cassandra-omit-read-timeout", 60, "if a read is older than this, it will directly be omitted without executing")
Contributor:

this tool doesn't do any reads, so having this option here is rather confusing. we can just hardcode a number and comment that it's not relevant because we don't do reads.

Contributor:

actually, same for some other options like cassandraReadConcurrency. we can just hardcode such values and comment why
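i.e. something along these lines in the tool, as a sketch with placeholder values:

```go
// this tool never issues reads against cassandra, so read tuning is
// irrelevant; hardcode harmless values instead of exposing confusing
// flags. the values below are placeholders.
const (
	cassandraOmitReadTimeout = 60 // unused: no reads are issued
	cassandraReadConcurrency = 1  // unused: no reads are issued
)
```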

@replay (Contributor, Author) commented Aug 10, 2017

I updated everything according to the comments.
I'll update the docs regarding the read queue size once the code is approved.

@Dieterbe (Contributor):
looks good now Mauro. yes, please update docs and config files, then we can merge.

@replay merged commit 71f7f8f into master on Aug 14, 2017
@Dieterbe deleted the omit_old_reads branch on September 18, 2018