distribute metrics across tables of varying compaction windows #484
Conversation
2f67b48 to 7232df1
mdata/store_cassandra.go
Outdated
table_name := fmt.Sprintf("metric_%d", window)

// table isn't known to exist, ensure it exists
err := c.session.Query(fmt.Sprintf(table_schema, keyspace, table_name, window)).Exec()
The window size and the table suffix need to be different.
The general rule of thumb with Cassandra is to try and limit tables to tens of SSTables. I think setting the window so that we get no more than 20 SSTables is a good fit for us.
window_size := (window/20) + 1
Once the data is compacted into our "window_size" sstable, it is never compacted again (this means tombstoned data is not deleted). So by having up to 20 sstables, we only incur a 5-10% overhead before data past its ttl is deleted, which is nice.
Actually we can only guess how large the time range of most queries is; it doesn't necessarily need to be tied to the TTL. So I was also considering adding an option like `compaction-window-factor`, and then I'd just do:

var window int = 1
for tmp := ttl / (24 * 60 * 60); tmp > 1; tmp = tmp / 2 {
    window *= 2
}
window = window / compactionWindowFactor

Because in our case one ratio between TTL and compaction window might make sense, but if somebody is mostly interested in archiving forever while they always just query the last day, then our assumption might be suboptimal for them.
In your comment you mean this, right? window_size := (ttl/20) +1
@woodsaj what do you think about adding the compaction-window-factor option?
why does it have to be a loop? can't you make it a formula? that's easier to reason about than stepping through it in your head.
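For illustration, the loop above could be written as a closed formula roughly like this (a sketch only; it assumes `ttl` is a number of seconds covering at least one day, and `compactionWindowFactor` comes from the snippet above):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	var ttl uint32 = 35 * 24 * 60 * 60 // e.g. 35 days, in seconds
	compactionWindowFactor := 1

	days := float64(ttl) / (24 * 60 * 60)
	// largest power of 2 that is <= the number of days; same result as the loop above
	window := int(math.Exp2(math.Floor(math.Log2(days)))) / compactionWindowFactor

	fmt.Println(window) // 32 for a 35-day ttl and factor 1
}
```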
> in the table in your comment there are multiple tables of the window size 1day. Wouldn't it be better to put them together and use the window size as suffix of the table name?

I don't think so. We wouldn't want 31-day ttls and 2-day ttls in the same sstable, as it impacts delete performance.
However, for these smaller TTLs we should probably use a window size of less than a day.
Just to be sure I understand, by delete performance you mean stuff that has been deleted would be kept around unnecessarily for too long, right?
Ok, I can change everything to use hours instead of days.
with TWCS Cassandra only deletes data when all data in an SSTable has reached its TTL.
http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html has a really great write-up of TWCS, i suggest you read it.
so the proposed `compaction-window-factor` is an operator-tunable knob so that the operator can make a tradeoff between ~20 sstables and "many more" sstables, if their queries are mostly for small ranges. The assumption here is that it's inefficient or suboptimal to read small subsets of data out of a single large sstable.
But is this assumption really true?
I just spent some time trying to get a better understanding of the impact of such a setup.
The recommendation to have fewer, larger sstables seems almost universal, though I still haven't gotten to the bottom of it.
I found the "memtable" and "sstable" sections of http://manuel.kiessling.net/2016/07/11/how-cassandras-inner-workings-relate-to-performance/ quite interesting, in particular:

> As said, the structure of an SSTable is optimized for efficient reads from disk – entries in one SSTable are sorted by partition key, and an index of all partition keys is loaded into memory when an SSTable is opened. Looking up a row therefore is only one disk seek, with further sequential reads for retrieving the actual row data – thus, no expensive random I/O needs to be performed.

So the larger the sstable, the larger the index it has to load into memory (are these cached by cassandra?), and it can find the row for the given partition (metric key) very fast.
However, since we use wide rows it still has to seek to the right ranges of columns that fall within the requested time range. I'm not sure exactly how that works; we use `ts` as clustering key.
Anyway this needs some more looking into, but I would be cautious about introducing a config option if none of us anticipates using it.
That's a really interesting discussion. I think for me the reason why I decided to add the option was simple:
I'm pretty sure we will never need to change it to another value than the default 20 in our use case, but since this is available online, other people who have different use cases might use it. Can we be sure that there is not a single possible use case in which it makes sense to tune that value? I don't think so. So I added it because it's not much additional overhead anyway.
c893994 to 06cd549
api/dataprocessor.go
Outdated
@@ -542,6 +543,7 @@ type requestContext struct {
	To     uint32 // may be different than user request, see below
	Key    string // key to query
	AggKey string // aggkey to query (if needed)
	Ttl    uint32
mdata/ifaces.go
Outdated
@@ -13,4 +13,5 @@ type Metric interface {
	Add(ts uint32, val float64)
	Get(from, to uint32) (uint32, []chunk.Iter)
	GetAggregated(consolidator consolidation.Consolidator, aggSpan, from, to uint32) (uint32, []chunk.Iter)
	GetTtl() uint32
mdata/store_cassandra.go
Outdated
* | 512 - 1023         | metrics_512        | 26                |
* | 1024 - 2047        | metrics_1024       | 52                |
* |--------------------|--------------------|-------------------|
*
this table can be better by:
- putting the unit in the TTL range and window size columns (I think both are in hours). in fact for the larger hours, can we also put the number in days (1024 is ~43 days, 2047 is ~85)
- adding a `number of sstables` range column, e.g. for TTL range `64 - 127` this would contain `16 - 55`
- going up to about 5 years (it will be common i suspect to have ttls of a few years, at least much more than 2047 hours. should be only about 5 new rows in the table to add)
note also that all ttl's (see the `ttl` option and the ttl within `agg-settings`) can currently be any non-zero number of seconds.
either we clarify this table to show what happens when your ttl is for example 3.5 hours (we simply round down before looking at the range) or we could just add a restriction that it has to be an even number of hours.
instead of talking about what happens if the ttl is 3.5 hours i could simply make `ttlUnits` return a `float64` instead of a `uint32`. right now the returned value is converted into a `float64` right after it is returned anyway, so then no clarification should be necessary.
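Something along these lines, a minimal sketch of a float64-returning `ttlUnits` (the actual function in the PR may differ):

```go
// ttlUnits converts a TTL in seconds into the unit used for the compaction
// window (hours), keeping the fractional part so e.g. a 3.5h TTL stays 3.5.
func ttlUnits(ttlSeconds uint32) float64 {
	return float64(ttlSeconds) / (60 * 60)
}
```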
mdata/store_cassandra.go
Outdated
// calculate the window size by finding the largest power of 2 that's smaller than ttl
window := uint32(math.Exp2(math.Floor(math.Log2(float64(ttlUnits(ttl))))))
tableName := fmt.Sprintf(nameFormat, window)
result := tables[ttl]
no need to lookup from the table. just add the (overwrite any pre-existing) entry?
Yeah I just did that because the alternative would be pretty clunky, because that's an anonymous struct:
tables[ttl] = struct {
	table  string
	window uint32
}{
	table:  tableName,
	window: window/uint32(windowFactor) + 1,
}
I guess I could also just define a type for that.
Yeah I guess that's nicer indeed: 96f3bf6
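A named type would look roughly like this (a sketch only; the actual type and field names in 96f3bf6 may differ):

```go
// ttlTable describes which table serves a given TTL and what its compaction window is
type ttlTable struct {
	Table  string // name of the cassandra table serving this TTL
	Window uint32 // compaction window of that table, in hours
}

// same logic as the anonymous-struct version above, just with the named type
tables[ttl] = ttlTable{
	Table:  tableName,
	Window: window/uint32(windowFactor) + 1,
}
```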
mdata/store_cassandra.go
Outdated
*
*/

// calculate the window size by finding the largest power of 2 that's smaller than ttl
here you're talking about "window size" but below you're calling it "window"
mdata/store_cassandra.go
Outdated
tableName := fmt.Sprintf(nameFormat, window)
result := tables[ttl]
result.table = tableName
result.window = window/uint32(windowFactor) + 1
this is confusing. higher up `window` meant the ttl rounded down to a power of 2 (e.g. 1024), here you're using it as the window size. you're using "window" twice with different meanings.
my suggestion: use `roundedTTL` instead of `window` above, and just `window` to talk about the compaction window.
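A sketch of how that renaming could read (`tableFor` is a hypothetical helper just to make the fragment self-contained, and `ttlUnits` is assumed to return the TTL as a float64 number of hours as discussed above):

```go
func tableFor(ttl uint32, nameFormat string, windowFactor uint32) (string, uint32) {
	// roundedTTL: the ttl in hours, rounded down to the nearest power of 2; used as the table suffix
	roundedTTL := uint32(math.Exp2(math.Floor(math.Log2(ttlUnits(ttl)))))
	tableName := fmt.Sprintf(nameFormat, roundedTTL)
	// window: the compaction window of that table, in hours
	window := roundedTTL/windowFactor + 1
	return tableName, window
}
```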
metrictank.go
Outdated
@@ -237,6 +238,8 @@ func main() {
	set := strings.Split(*aggSettings, ",")
	finalSettings := make([]mdata.AggSetting, 0)
	highestChunkSpan := chunkSpan
	ttls := make([]uint32, 1)
	ttls[0] = ttl
both of these lines can be replaced by ttls := []uint32{ttl}
see comments
3dd9ffa to d5f75e8
I think this is ready for review again once you have some time
api/dataprocessor.go
Outdated
@@ -466,7 +467,7 @@ func (s *Server) getSeriesCachedStore(ctx *requestContext, until uint32) []chunk
	// the request cannot completely be served from cache, it will require cassandra involvement
	if !cacheRes.Complete {
		if cacheRes.From != cacheRes.Until {
			storeIterGens, err := s.BackendStore.Search(key, cacheRes.From, cacheRes.Until)
			storeIterGens, err := s.BackendStore.Search(key, ctx.TTL, cacheRes.From, cacheRes.Until)
when the TTL was assigned to the context, it was that of the raw metric. but we may be requesting a rolled up series which may have a different ttl.
Right, good catch. I think that means I'll need to modify `models.Req` and add the `TTL` there. Then that `TTL` attribute can get set in `alignRequests()` when the aggregation is selected.
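For illustration only, the shape of that change might look like this (the real `models.Req` has many more fields; this sketch just shows where a TTL field would fit):

```go
// Req describes a render request for one series (heavily trimmed for illustration)
type Req struct {
	Key      string // metric or aggregation key to query
	From, To uint32
	TTL      uint32 // TTL of the archive (raw or rollup) chosen in alignRequests(), so the store can pick the right table
}
```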
I don't like introducing new dependencies. But I think before doing that I should wait for #509 to get merged, because of the modifications it's making in `alignRequests()` which would be useful here.
cmd/mt-split-metrics-by-ttl/main.go
Outdated
func main() {
	flag.Usage = func() {
		fmt.Fprintln(os.Stderr, "mt-create-metrics-by-ttl [flags] ttl [ttl...]")
split not create
cmd/mt-split-metrics-by-ttl/main.go
Outdated
tables := store.GetTableNames()

// create directory/link structure that we need to define the future table names
err = os.Mkdir(fmt.Sprintf("%s/%s", tmpDir, *cassandraKeyspace), 0700)
nice
cmd/mt-split-metrics-by-ttl/main.go
Outdated
	}
}

fmt.Println(fmt.Sprintf("The following tables have been created: %s", strings.Join(tables, ", ")))
can just call fmt.Printf
cmd/mt-split-metrics-by-ttl/main.go
Outdated
fmt.Println("- Stop metrictanks to stop writes") | ||
fmt.Println(fmt.Sprintf("- Use `nodetool snapshot --table metric %s` on all cluster nodes to create snapshot of metric table", *cassandraKeyspace)) | ||
fmt.Println(" https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSnapShot.html") | ||
fmt.Println(fmt.Sprintf("- Move snapshot files to directory %s", snapshotDir)) |
we're asking people to log into all cassandra nodes and create snapshots, and then copy those snapshots over the network to the machine we ran this program on? that seems very inefficient, and this machine may not even have enough space. am i missing something? can't we call sstableloader from the machines that hold the snapshots? that would probably keep the data local to the node too. if you move the snapshots to another node and then load them there, that node will probably have to send the data to the other cluster nodes (including the one it came from) anyway; since the partitioning will be the same i think
No, the instructions are actually for a single-node cassandra. In a cluster this directory structure would be created/copied to every node in the cluster and then these steps are done on each of the nodes. I guess the output should mention that, but in the end it will always require quite some manual interaction by the user.
the instructions here say to run nodetool snapshot on all cluster nodes and move the files to the directory on the system where this tool was run. (at least that's how it comes across)
to clarify it to users, perhaps the help message of this tool should say "this tool, and all of its instructions, should be run on all cluster nodes" and then don't instruct people here to go run commands on other nodes.
I updated it. Actually not everything the tool does needs to be run on each node, because the table schema only needs to be created once. But the message is now better at explaining what needs to be done on each node.
@@ -204,12 +205,20 @@ func main() {
3 should be 4 now
metrictank has a configurable windowfactor, but mt-store-cat and mt-split-metrics-by-ttl have it hardcoded to 20? I'm fine with keeping it configurable but then it should also be configurable in the tools.
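For the tools that could be as simple as exposing the same knob, e.g. something along these lines (the flag name, default, and wording are assumptions mirroring metrictank's setting):

```go
package main

import "flag"

// hypothetical flag for mt-store-cat / mt-split-metrics-by-ttl, mirroring metrictank's window factor:
// the TTL (rounded down to a power of 2 hours) is divided by this to size the compaction window
var windowFactor = flag.Int("window-factor", 20, "number of compaction windows to aim for per table")
```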
mdata/store_cassandra.go
Outdated
* generated with: https://gist.github.com/replay/69ad7cfd523edfa552cd12851fa74c58
*
* +-----------------------------+---------------+-------------+----------+
* | hours / days                | table_name    | window_size | sstables |
window_size should mention what unit it is in.
mdata/store_cassandra.go
Outdated
* +-----------------------------+---------------+-------------+----------+
* | hours / days                | table_name    | window_size | sstables |
* +-----------------------------+---------------+-------------+----------+
* | 1 / <1                      | metrics_1     | 1           | 1 - 1    |
I still find this confusing because we don't say anything about various values in between. I suggest we use a range syntax like so:
ttl                   table_name
------------------------------
hours < 1             metrics_0
1 <= hours < 2        metrics_1
2 <= hours < 4        metrics_2
4 <= hours < 8        metrics_4

(the amount of days can't be specified this way, it's ok for that to be an approximation. maybe just turn that into a `days (approx)` column)
I'm not sure what you mean by the values in between. I don't think `1 <= hours < 2` is easier to understand than `1 - 2` if the column is named `ttl hours`. Do you mean the values between the limits of the different rows? Actually I think an approximation is good enough in this case, otherwise it just gets unnecessarily complicated. How about this:
mst@mst-nb1:~$ python ./generate_window_size_comment_table.py
+-------------------+---------------+-------------+----------+
| TTL hours (floor) | table_name | window_size | sstables |
+-------------------+---------------+-------------+----------+
| 1 | metrics_1 | 1 | 1 - 1 |
| 2 - 3 | metrics_2 | 1 | 2 - 3 |
| 4 - 7 | metrics_4 | 1 | 4 - 7 |
| 8 - 15 | metrics_8 | 1 | 8 - 15 |
| 16 - 31 | metrics_16 | 1 | 16 - 31 |
| 32 - 63 | metrics_32 | 2 | 16 - 31 |
| 64 - 127 | metrics_64 | 4 | 16 - 31 |
| 128 - 255 | metrics_128 | 7 | 18 - 36 |
| 256 - 511 | metrics_256 | 13 | 19 - 39 |
| 512 - 1023 | metrics_512 | 26 | 19 - 39 |
| 1024 - 2047 | metrics_1024 | 52 | 19 - 39 |
| 2048 - 4095 | metrics_2048 | 103 | 19 - 39 |
| 4096 - 8191 | metrics_4096 | 205 | 19 - 39 |
| 8192 - 16383 | metrics_8192 | 410 | 19 - 39 |
| 16384 - 32767 | metrics_16384 | 820 | 19 - 39 |
| 32768 - 65535 | metrics_32768 | 1639 | 19 - 39 |
+-------------------+---------------+-------------+----------+
I removed the days again because if the days are also floored it gets confusing: sometimes there are overlaps between the rows and sometimes there aren't. I'm sure everybody knows a day has 24h.
> Do you mean the values between the limits of the different rows?
yes. I think your suggested table is basically the same as my suggestion, but I still see two slight advantages to my proposal:
- seeing `floor` in the header and figuring out how that relates to the values shown (take your ttl, floor it, only then look at the ranges in the table) is not immediately obvious. though you could clarify that with a comment instructing people how to use the table, I think my version is clearer more quickly.
- my values are simply all the powers of 2, and they relate directly to the table names in the 2nd column. Values like 31 and 1023 look slightly weird.
> Actually I think an approximation is good enough in this case, otherwise it just gets unnecessarily complicated
it seems trivial to me to do the right thing here, no need for approximations. your version isn't an approximation either (which is good) so I don't understand what you're getting at here.
Ok, nevermind. I just thought the `x <= hours < y` notation is confusing.
How about this:
+------------------------+---------------+---------------------+----------+
| TTL hours | table_name | window_size (hours) | sstables |
+------------------------+---------------+---------------------+----------+
| 1 <= hours < 2 | metrics_1 | 1 | 1 - 3 |
| 2 <= hours < 4 | metrics_2 | 1 | 2 - 5 |
| 4 <= hours < 8 | metrics_4 | 1 | 4 - 9 |
| 8 <= hours < 16 | metrics_8 | 1 | 8 - 17 |
| 16 <= hours < 32 | metrics_16 | 1 | 16 - 33 |
| 32 <= hours < 64 | metrics_32 | 2 | 16 - 33 |
| 64 <= hours < 128 | metrics_64 | 4 | 16 - 33 |
| 128 <= hours < 256 | metrics_128 | 7 | 19 - 38 |
| 256 <= hours < 512 | metrics_256 | 13 | 20 - 41 |
| 512 <= hours < 1024 | metrics_512 | 26 | 20 - 41 |
| 1024 <= hours < 2048 | metrics_1024 | 52 | 20 - 41 |
| 2048 <= hours < 4096 | metrics_2048 | 103 | 20 - 41 |
| 4096 <= hours < 8192 | metrics_4096 | 205 | 20 - 41 |
| 8192 <= hours < 16384 | metrics_8192 | 410 | 20 - 41 |
| 16384 <= hours < 32768 | metrics_16384 | 820 | 20 - 41 |
| 32768 <= hours < 65536 | metrics_32768 | 1639 | 20 - 41 |
+------------------------+---------------+---------------------+----------+
looks 👍 but I think the sstable counts are still off.
for example `65535 / 1639 = 39.9847`, so how could it be 41 sstables, should be 40 I think?
if the time span during which a datapoint needs to be kept can be 39.9847 tables, and the `ts` of a datapoint is in the middle of a table, then the end of life of that datapoint will be in the middle of the 41st table. So that comes from the fact that the `ts` doesn't need to be aligned with the table wrapping times.
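In other words, the worst case is one extra window because writes are not aligned to window boundaries. A sketch of that bound (the helper name is hypothetical; the numbers come from the table above):

```go
package main

import (
	"fmt"
	"math"
)

// maxSSTables: worst-case number of sstables a datapoint's lifetime can straddle:
// ceil(ttl/window) windows to cover the TTL itself, plus one extra because the
// write can land anywhere inside a window rather than at its start.
func maxSSTables(ttlHours, windowHours float64) int {
	return int(math.Ceil(ttlHours/windowHours)) + 1
}

func main() {
	fmt.Println(maxSSTables(65535, 1639)) // 41, matching the 20 - 41 row above
}
```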
that makes a lot of sense. but i see that your latest push has the upper limit at (the incorrect) 40 now ? :/ and also the table is in the old format again?
ah right, i think i only updated the script to generate the table and didn't copy paste the output yet. updating now
@@ -301,7 +304,7 @@ func main() {
	/***********************************
		Initialize our backendStore
	***********************************/
	store, err := mdata.NewCassandraStore(*cassandraAddrs, *cassandraKeyspace, *cassandraConsistency, *cassandraCaPath, *cassandraUsername, *cassandraPassword, *cassandraHostSelectionPolicy, *cassandraTimeout, *cassandraReadConcurrency, *cassandraWriteConcurrency, *cassandraReadQueueSize, *cassandraWriteQueueSize, *cassandraRetries, *cqlProtocolVersion, *cassandraSSL, *cassandraAuth, *cassandraHostVerification)
	store, err := mdata.NewCassandraStore(*cassandraAddrs, *cassandraKeyspace, *cassandraConsistency, *cassandraCaPath, *cassandraUsername, *cassandraPassword, *cassandraHostSelectionPolicy, *cassandraTimeout, *cassandraReadConcurrency, *cassandraWriteConcurrency, *cassandraReadQueueSize, *cassandraWriteQueueSize, *cassandraRetries, *cqlProtocolVersion, *cassandraWindowFactor, *cassandraSSL, *cassandraAuth, *cassandraHostVerification, ttls)
the ttl values we're passing to Cassandra here are a number of seconds.
cassandra expects a number of hours.
I think we should do one of these two:
A) document that CassandraStore needs ttl in hours, and adjust the invocation here
B) make cassandraStore accept TTL as number of seconds, and adjust the other invocations.
I prefer B because a number of hours is awkward (at the very least surprising) compared to all the other internal api's which deal with numbers of seconds.
I'm trying to understand why you're saying that cassandra expects the number to be in hours.
When `getTTLTables()` generates the table map it converts those ttls from seconds to hours.
In the `INSERT` statement, according to the docs, cassandra should always expect the `TTL` to be in seconds no matter what the `compaction_window_unit` is.
So where does cassandra expect hours?
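For example, a write along these lines (illustrative gocql, not necessarily the PR's exact code; `session`, `tableName`, `ttlSeconds`, `key`, `ts` and `data` are assumed to be in scope) passes the TTL to Cassandra as seconds regardless of the table's `compaction_window_unit`:

```go
// the TTL in the INSERT is a number of seconds; only the table's compaction window is expressed in hours
query := fmt.Sprintf("INSERT INTO %s (key, ts, data) VALUES (?, ?, ?) USING TTL %d", tableName, ttlSeconds)
err := session.Query(query, key, ts, data).Exec()
```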
sorry, I meant that `NewCassandraStore` takes the ttl argument as a number of hours, which is what I think we should change to seconds.
I don't think that's the case.
The `ttls` variable in `NewCassandraStore` only gets used once, as a parameter to `GetTTLTables()`, which returns a map of the type `ttlTables` where the tables are indexed by ttl in seconds.
`GetTTLTables()` only temporarily converts the ttl by using `ttlUnits()` to calculate the right `windowSize`, because `compaction_window_unit` is `HOURS`, but apart from that it's all in seconds.
oh, i get where the confusion comes from... the `mt-split-metrics-by-ttl` wrongly treats the TTL as hours, i'll fix that
This should clear up the mess: da4a73e
cmd/mt-split-metrics-by-ttl/main.go
Outdated
if err != nil {
	panic(fmt.Sprintf("Failed to create directory: %s", err))
}
namedTableLinks := make([]string, 0)
can make a correctly sized slice and then just use `[i]` syntax to set elements instead of append ;)
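i.e. something like this (the path construction is hypothetical, just to show the pattern):

```go
// allocate the exact size up front instead of growing with append
namedTableLinks := make([]string, len(tables))
for i, table := range tables {
	// hypothetical path construction: one link per table under the temp keyspace dir
	namedTableLinks[i] = fmt.Sprintf("%s/%s/%s", tmpDir, *cassandraKeyspace, table)
}
```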
right! thx
cmd/mt-split-metrics-by-ttl/main.go
Outdated
fmt.Printf("The following tables have been created: %s\n", strings.Join(tables, ", ")) | ||
fmt.Println("Now continue with the following steps on each cassandra node. In the case") | ||
fmt.Printf("of a cluster setup you may want to copy the directory structure in %s\n", tmpDir) | ||
fmt.Println("to each of the nodes.") |
we tell people to do the following steps and also to copy the dir structure. but we're not clear on the order of what should happen first (I think first copy dir structure, then do next steps?)
also we say "do this on each cassandra node" .. and then "in the case of a cluster". isn't having multiple nodes and a cluster the same thing?
So I think we could simply say:
On each of your cassandra nodes:
- recreate this directory structure (by copying it, or running this tool)
- execute these steps
Ok, how about this:
mst@ubuntu:~/go/src/github.com/raintank/metrictank/cmd/mt-split-metrics-by-ttl$ ./mt-split-metrics-by-ttl 36000
The following tables have been created: metric_8
Now continue with the following steps on your cassandra node(s):
- Recreate (or copy) this directory structure on each cassandra node:
/tmp/metrictank859670692
- Stop metrictanks to stop writes
- Use `nodetool snapshot --table metric metrictank` on all cluster nodes to create
a snapshot of the metric table
https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSnapShot.html
- Move snapshot files to directory /tmp/metrictank859670692/snapshot
- Restore by executing:
sstableloader -d localhost /tmp/metrictank859670692/metrictank/metric_8
do we need the "Move snapshot files to directory /tmp/metrictank859670692/snapshot" step, can't we just tell them to cd into that dir before running the nodetool snapshot?
also to stop writes, you could stop metrictank or set it to secondary status, see https://github.com/raintank/metrictank/blob/master/docs/http-api.md#set-cluster-status
`nodetool snapshot` won't store the snapshot in the cwd, it puts them in the data directory of cassandra.
Good point with making it secondary
Latest version:
The following tables have been created: metric_8
Now continue with the following steps on your cassandra node(s):
- Recreate (or copy) this directory structure on each cassandra node:
/tmp/metrictank185098310
- Stop writes by stopping the metrictanks or set their cluster status to secondary
- Use `nodetool snapshot --table metric metrictank` on all cluster nodes to create
a snapshot of the metric table
https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSnapShot.html
- Move snapshot files to directory /tmp/metrictank185098310/snapshot
- Load the data by executing:
sstableloader -d localhost /tmp/metrictank185098310/metrictank/metric_8
cmd/mt-split-metrics-by-ttl/main.go
Outdated
fmt.Println(" a snapshot of the metric table") | ||
fmt.Println(" https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSnapShot.html") | ||
fmt.Println(fmt.Sprintf("- Move snapshot files to directory %s", snapshotDir)) | ||
fmt.Println("- Restore by executing:") |
"restore" isn't the most fitting word here. Perhaps "Execute this to assure all new ttl tables will have the data for metrics with the right ttl (and also data for other series which won't be used, but that's ok)"
How about "Load the data" instead of "restore"? That's short and reflects what's happening, if they want to know more details there is already a link to the docs
k
@@ -611,7 +611,10 @@ func TestAlignRequests(t *testing.T) {
		reqRaw("a", 0, 30, 800, 10, consolidation.Avg),
		reqRaw("b", 0, 30, 800, 60, consolidation.Avg),
	},
	[]mdata.AggSetting{},
	mdata.AggSettings{
		35 * 24 * 60 * 60,
right now all these 35d ttl settings are basically ignored right? like if we changed them, the tests would still succeed? is this the case for all tests where this change is made?
I guess this puts us on track if we later want to make alignRequests smarter and take ttl into account. is that the purpose of these changes or do you have something else in mind? I'm having some trouble understanding what the goal is here.
The reason I changed the tests is because `alignRequests` now expects the second parameter to be of type `AggSettings`, and that type requires a number to be there as the default TTL. You're right that in the case of these tests that number does not matter at all, so I've just put the default value there.
mdata/aggmetrics.go
Outdated
@@ -15,7 +15,7 @@ type AggMetrics struct {
	Metrics map[string]*AggMetric
	chunkSpan uint32
	numChunks uint32
	aggSettings []AggSetting // for now we apply the same settings to all AggMetrics. later we may want to have different settings.
	aggSettings AggSettings // for now we apply the same settings to all AggMetrics. later we may want to have different settings.
	chunkMaxStale uint32
	metricMaxStale uint32
	ttl uint32
i believe this field is now not needed anymore
right! missed that
mdata/aggregator.go
Outdated
	Ready bool // ready for reads?
}

type AggSettings struct {
	TTL uint32 // default TTL for raw data
this field should be `RawTTL` or something because it easily gets confusing otherwise. btw why does it say "default TTL"? this value is whatever the user asked for, not any default.
ah, i meant default in the sense that it isn't a rollup or any aggregation, but just the normal raw data
Will update to RawTTL
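A sketch of the rename (the second field is an assumption about what else `AggSettings` holds):

```go
type AggSettings struct {
	RawTTL uint32       // TTL for raw (non-aggregated) data
	Aggs   []AggSetting // per-rollup settings, each with its own TTL
}
```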
mdata/store_cassandra_test.go
Outdated
{oneYear, 20, "metric_%d", "metric_8192", 410}, | ||
{oneDay, 50, "metric_%d", "metric_16", 1}, | ||
{oneMonth, 50, "metric_%d", "metric_512", 11}, | ||
{oneYear, 50, "metric_%d", "metric_8192", 164}, |
can you make this a bit more extensive and test some edge cases (like TTL of 0s, 10s, 1min, maxuint32, and on and around a boundary, for example 1024 hours minus a second, 1024 hours minus 2 seconds, 1024 hours, 1024 hours and a second)
Done, added some more edge cases
I should update the table in the comments to add a case for TTLs that are 0-1 hours.
done: 158736f#diff-f63a36d3200e1e4aa79054431d8b5c52R110
158736f to b605d91
mdata/store_cassandra.go
Outdated
	key ascii,
	ts int,
	data blob,
	PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
	AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1' }
	AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'HOURS', 'compaction_window_size': '%d' }
	AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}`
Let's change this to the correct format for Cassandra 3.x:
AND compression = { 'class' : 'LZ4Compressor' }
Done
mdata/store_cassandra.go
Outdated
	key ascii,
	ts int,
	data blob,
	PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
	AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1' }
	AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'HOURS', 'compaction_window_size': '%d' }
Let's update to use the Cassandra 3.x format for the class:
AND compaction = {'class': 'TimeWindowCompactionStrategy',
updated it. I tested how it looks when the table gets created, and cassandra resolves the class path when it creates the table, so they end up looking pretty much the same:
cqlsh:metrictank> CREATE TABLE metrictank.metric_5122 (
... key ascii,
... ts int,
... data blob,
... PRIMARY KEY (key, ts)
... ) WITH CLUSTERING ORDER BY (ts DESC)
... AND bloom_filter_fp_chance = 0.01
... AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
... AND comment = ''
... AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '26', 'compaction_window_unit': 'HOURS', 'max_threshold': '32', 'min_threshold': '4'}
... AND compression = {'chunk_length_in_kb': '64', 'class': 'LZ4Compressor'}
... AND crc_check_chance = 1.0
... AND dclocal_read_repair_chance = 0.1
... AND default_time_to_live = 0
... AND gc_grace_seconds = 864000
... AND max_index_interval = 2048
... AND memtable_flush_period_in_ms = 0
... AND min_index_interval = 128
... AND read_repair_chance = 0.0
... AND speculative_retry = '99PERCENTILE';
cqlsh:metrictank> describe table metric_5122
CREATE TABLE metrictank.metric_5122 (
key ascii,
ts int,
data blob,
PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '26', 'compaction_window_unit': 'HOURS', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
mdata/store_cassandra.go
Outdated
	AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1' }
	AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}`
	AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'HOURS', 'compaction_window_size': '%d' }
	AND compression = { 'sstable_compression' : 'LZ4Compressor' }`
this should be
AND compression = { 'class' : 'LZ4Compressor' }
mdata/store_cassandra.go
Outdated
	key ascii,
	ts int,
	data blob,
	PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
	AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1' }
	AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}`
	AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'HOURS', 'compaction_window_size': '%d' }
{'class': 'TimeWindowCompactionStrategy'
why is this better? some sort of protection against namespace refactors?
The full class path should only be used for custom classes. The built-in ones should use the short name. I don't really know why; it is just how it is presented in the latest docs.
We want tables to have varying compaction window sizes depending on a metric's TTL. At the same time we want to be able to group metrics with similar TTLs. This is a middle way between the two. issue #382
fixes #382