This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

distribute metrics across tables of varying compaction windows #484

Merged
merged 4 commits on Feb 1, 2017

Conversation

Contributor
@replay replay commented Jan 19, 2017

fixes #382

table_name := fmt.Sprintf("metric_%d", window)

// table isn't known to exist, ensure it exists
err := c.session.Query(fmt.Sprintf(table_schema, keyspace, table_name, window)).Exec()
Member

The window size and the table suffix need to be different.

The general rule of thumb with Cassandra is to try to limit tables to tens of SSTables. I think setting the window appropriately to get no more than 20 SSTables is a good fit for us.

window_size := (window/20) + 1

Once the data is compacted into our "window_size" sstable, it is never compacted again (this means tombstoned data is not deleted). So by having up to 20 SSTables, we only incur a 5-10% overhead before data past its TTL is deleted, which is nice.
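For illustration, a minimal sketch of that rule of thumb (hypothetical helper name; in the PR the TTL is first rounded down to a power of two, and the factor 20 later becomes a configurable window factor):

    // derive a compaction window (in hours) from a TTL (in hours) so that a
    // table ends up holding on the order of 20 SSTables; integer division
    // rounds down, and the +1 keeps the window at least 1 hour
    func windowSizeHours(ttlHours uint32) uint32 {
            return ttlHours/20 + 1
    }

For example windowSizeHours(512) == 26, which matches the metric_512 row in the table discussed further down.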

Contributor Author
@replay replay Jan 19, 2017

Actually we can only guess how large the time range of most queries is; it doesn't necessarily need to be tied to the TTL. So I was also considering adding an option like compaction-window-factor, and then I'd just:

var window int = 1
for tmp := ttl / (24 * 60 * 60); tmp > 1; tmp = tmp / 2 {
    window *= 2
}
window = window / compactionWindowFactor

Because in our case one ratio between TTL and compaction window might make sense, but if somebody mostly wants to archive forever while only ever querying the last day, then our assumption might be suboptimal for them.

Contributor Author

In your comment you mean this, right? window_size := (ttl/20) +1

Contributor Author

@woodsaj what do you think about adding the compaction window factor option?

Contributor

Why does it have to be a loop? Can't you make it a formula? That's easier to reason about than stepping through it in your head.
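(A closed-form version, essentially what a later revision of this PR uses: take the largest power of two that does not exceed the TTL in hours.)

    import "math"

    // largest power of 2 <= ttlHours; assumes ttlHours >= 1
    func roundDownToPowerOfTwo(ttlHours uint32) uint32 {
            return uint32(math.Exp2(math.Floor(math.Log2(float64(ttlHours)))))
    }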

Member

in the table in your comment there are multiple tables of the window size 1 day. Wouldn't it be better to put them together and use the window size as suffix of the table name?

I don't think so. We wouldn't want 31-day TTLs and 2-day TTLs in the same sstable, as it impacts delete performance.

However, for these smaller TTLs we should probably use a window size of less than a day.

Contributor Author
@replay replay Jan 20, 2017

Just to be sure I understand: by delete performance you mean that data which has been deleted would be kept around unnecessarily long, right?
Ok, I can change everything to use hours instead of days.

Member

With TWCS Cassandra only deletes data when all data in an SSTable has reached its TTL.

http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html has a really great write-up of TWCS, I suggest you read it.

Contributor

So the proposed compaction-window-factor is an operator-tunable knob that lets the operator trade off between ~20 SSTables and "many more" SSTables, if their queries are mostly for small ranges. The assumption here is that it's inefficient or suboptimal to read small subsets of data out of a single large sstable.
But is this assumption really true?
I just spent some time trying to get a better understanding of the impact of such a setup.
The recommendation to have fewer, larger sstables seems almost universal, though I still haven't gotten to the bottom of it.
I found the "memtable" and "sstable" sections of
http://manuel.kiessling.net/2016/07/11/how-cassandras-inner-workings-relate-to-performance/ quite interesting. In particular:

As said, the structure of an SSTable is optimized for efficient reads from disk – entries in one SSTable are sorted by partition key, and an index of all partition keys is loaded into memory when an SSTable is opened. Looking up a row therefore is only one disk seek, with further sequential reads for retrieving the actual row data – thus, no expensive random I/O needs to be performed.

So the larger the sstable, the larger the index it has to load into memory (are these cached by Cassandra?), and it can find the row for the given partition (metric key) very fast.
However, since we use wide rows it still has to seek to the right ranges of columns that fall within the requested time range. I'm not sure exactly how that works. We use ts as the clustering key.

Anyway, this needs some more looking into, but I would be cautious about introducing a config option if none of us anticipates using it.

Contributor Author

That's a really interesting discussion. For me the reason I decided to add the option was simple:

I'm pretty sure we will never need to change it to anything other than the default 20 in our use case, but since this is available online, other people with different use cases might use it. Can we be sure that there is not a single possible use case in which it makes sense to tune that value? I don't think so. So I added it because it's not much additional overhead anyway.

@@ -542,6 +543,7 @@ type requestContext struct {
To uint32 // may be different than user request, see below
Key string // key to query
AggKey string // aggkey to query (if needed)
Ttl uint32

mdata/ifaces.go Outdated
@@ -13,4 +13,5 @@ type Metric interface {
Add(ts uint32, val float64)
Get(from, to uint32) (uint32, []chunk.Iter)
GetAggregated(consolidator consolidation.Consolidator, aggSpan, from, to uint32) (uint32, []chunk.Iter)
GetTtl() uint32

* | 512 - 1023 | metrics_512 | 26 |
* | 1024 - 2047 | metrics_1024 | 52 |
* |--------------------|--------------------|-------------------|
*
Contributor

this table can be better by:

  • putting the unit in the TTL range and window size columns (I think both are in hours). In fact, for the larger values, can we also put the number in days (1024 hours is ~43 days, 2047 is ~85)?
  • adding a column with the range of the number of sstables, e.g. for TTL range 64 - 127 this would contain 16 - 55
  • going up to about 5 years (I suspect it will be common to have TTLs of a few years, at least much more than 2047 hours; that should only be about 5 new rows to add to the table)

Contributor
@Dieterbe Dieterbe Jan 23, 2017

Note also that all TTLs (see the ttl option and the ttl within agg-settings) can currently be any non-zero number of seconds.
Either we clarify this table to show what happens when your ttl is, for example, 3.5 hours (we simply round down before looking at the range), or we could just add a restriction that it has to be a whole number of hours.

Contributor Author

Instead of talking about what happens if the TTL is 3.5 hours, I could simply make ttlUnits return a float64 instead of a uint32. Right now the returned value is converted into a float64 right after it is returned anyway, so then no clarification should be necessary.
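A minimal sketch of that change (assuming ttlUnits simply converts a TTL in seconds to hours, which is the unit the table naming is based on):

    // ttlUnits converts a TTL in seconds to hours; returning float64 avoids
    // silently rounding TTLs like 3.5 hours down to a whole number of hours
    func ttlUnits(ttlSeconds uint32) float64 {
            return float64(ttlSeconds) / (60 * 60)
    }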

// calculate the window size by finding the largest power of 2 that's smaller than ttl
window := uint32(math.Exp2(math.Floor(math.Log2(float64(ttlUnits(ttl))))))
tableName := fmt.Sprintf(nameFormat, window)
result := tables[ttl]
Contributor

No need to look it up from the table. Just add the entry (overwriting any pre-existing one)?

Contributor Author
@replay replay Jan 23, 2017

Yeah, I just did that because the alternative would be pretty clunky, given that it's an anonymous struct:

                tables[ttl] = struct {
                        table  string
                        window uint32
                }{
                        table:  tableName,
                        window: window/uint32(windowFactor) + 1,
                }

I guess I could also just define a type for that.
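A sketch of what a named type could look like (the name ttlTable is illustrative; the actual definition landed in 96f3bf6):

    type ttlTable struct {
            table  string
            window uint32
    }

    tables[ttl] = ttlTable{
            table:  tableName,
            window: window/uint32(windowFactor) + 1,
    }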

Contributor Author

Yeah I guess that's nicer indeed: 96f3bf6

*
*/

// calculate the window size by finding the largest power of 2 that's smaller than ttl
Contributor
@Dieterbe Dieterbe Jan 23, 2017

Here you're talking about "window size" but below you're calling it "window".

tableName := fmt.Sprintf(nameFormat, window)
result := tables[ttl]
result.table = tableName
result.window = window/uint32(windowFactor) + 1
Contributor

This is confusing. Higher up, window meant the TTL rounded down to a power of 2 (e.g. 1024); here you're using it as the window size. You're using "window" twice with different meanings.
My suggestion: use roundedTTL instead of window above, and use 'window' only to talk about the compaction window.
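A sketch of the snippet above with that naming applied (same logic, only the variable names change):

    // roundedTTL: the TTL in hours, rounded down to a power of two
    roundedTTL := uint32(math.Exp2(math.Floor(math.Log2(float64(ttlUnits(ttl))))))
    tableName := fmt.Sprintf(nameFormat, roundedTTL)
    result := tables[ttl]
    result.table = tableName
    // window: the compaction window in hours, derived from roundedTTL
    result.window = roundedTTL/uint32(windowFactor) + 1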

metrictank.go Outdated
@@ -237,6 +238,8 @@ func main() {
set := strings.Split(*aggSettings, ",")
finalSettings := make([]mdata.AggSetting, 0)
highestChunkSpan := chunkSpan
ttls := make([]uint32, 1)
ttls[0] = ttl
Contributor

both of these lines can be replaced by ttls := []uint32{ttl}

Contributor
@Dieterbe Dieterbe left a comment

see comments

@replay replay changed the title [WIP] distribute metrics across tables of varying compaction windows distribute metrics across tables of varying compaction windows Jan 25, 2017
Contributor Author
@replay replay commented Jan 25, 2017

I think this is ready for review again once you have some time

@@ -466,7 +467,7 @@ func (s *Server) getSeriesCachedStore(ctx *requestContext, until uint32) []chunk
// the request cannot completely be served from cache, it will require cassandra involvement
if !cacheRes.Complete {
if cacheRes.From != cacheRes.Until {
storeIterGens, err := s.BackendStore.Search(key, cacheRes.From, cacheRes.Until)
storeIterGens, err := s.BackendStore.Search(key, ctx.TTL, cacheRes.From, cacheRes.Until)
Contributor

When the TTL was assigned to the context, it was that of the raw metric. But we may be requesting a rolled-up series, which may have a different TTL.

Contributor Author

Right, good catch. I think that means I'll need to modify models.Req and add the TTL there. Then that TTL attribute can get set in alignRequests() when the aggregation is selected.
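For illustration only (not the actual diff), the idea is roughly:

    // models.Req (sketch): carry the TTL of whichever archive alignRequests
    // selects, so the store can later pick the matching table
    type Req struct {
            // ... existing fields ...
            TTL uint32 // TTL in seconds of the archive chosen for this request
    }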

Contributor Author

I don't like introducing new dependencies. But I think before doing that I should wait for #509 to get merged, because the modifications it makes in alignRequests() would be useful here.


func main() {
flag.Usage = func() {
fmt.Fprintln(os.Stderr, "mt-create-metrics-by-ttl [flags] ttl [ttl...]")
Contributor

split not create

tables := store.GetTableNames()

// create directory/link structure that we need to define the future table names
err = os.Mkdir(fmt.Sprintf("%s/%s", tmpDir, *cassandraKeyspace), 0700)

Contributor Author

nice

}
}

fmt.Println(fmt.Sprintf("The following tables have been created: %s", strings.Join(tables, ", ")))
Contributor

can just call fmt.Printf

fmt.Println("- Stop metrictanks to stop writes")
fmt.Println(fmt.Sprintf("- Use `nodetool snapshot --table metric %s` on all cluster nodes to create snapshot of metric table", *cassandraKeyspace))
fmt.Println(" https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSnapShot.html")
fmt.Println(fmt.Sprintf("- Move snapshot files to directory %s", snapshotDir))
Contributor
@Dieterbe Dieterbe Jan 30, 2017

We're asking people to log into all cassandra nodes and create snapshots, and then copy those snapshots over the network to the machine we ran this program on? That seems very inefficient, and this machine may not even have enough space. Am I missing something? Can't we call sstableloader from the machines that hold the snapshots? That would probably keep the data local to the node too. If you move the snapshots to another node and then load them there, that node will probably have to send the data to the other cluster nodes (including the one it came from) anyway, since the partitioning will be the same, I think.

Contributor Author

No, the instructions are actually for a single-node Cassandra setup. In a cluster this directory structure would be created on (or copied to) every node, and then these steps are done on each of the nodes. I guess the output should mention that, but in the end it will always require quite some manual interaction by the user.

Contributor

The instructions here say to run nodetool snapshot on all cluster nodes and move the files to the directory on the system where this tool was run (at least that's how it comes across).

To clarify it for users, perhaps the help message of this tool should say "this tool, and all of its instructions, should be run on all cluster nodes", and then not instruct people here to go run commands on other nodes.

Contributor Author

I updated it. Actually not everything the tool does needs to be run on each node, because the table schema only needs to be created once. But the message is now better at explaining what needs to be done on each node.

@@ -204,12 +205,20 @@ func main() {

Contributor

3 should be 4 now

Contributor
@Dieterbe

metrictank has a configurable window factor, but mt-store-cat and mt-split-metrics-by-ttl have it hardcoded to 20? I'm fine with keeping it configurable, but then it should also be configurable in the tools.

* generated with: https://gist.github.com/replay/69ad7cfd523edfa552cd12851fa74c58
*
* +-----------------------------+---------------+-------------+----------+
* | hours / days | table_name | window_size | sstables |
Contributor

window_size should mention what unit it is in.

* +-----------------------------+---------------+-------------+----------+
* | hours / days | table_name | window_size | sstables |
* +-----------------------------+---------------+-------------+----------+
* | 1 / <1 | metrics_1 | 1 | 1 - 1 |
Contributor
@Dieterbe Dieterbe Jan 30, 2017

I still find this confusing because we don't say anything about various values in between. I suggest we use a range syntax like so:

ttl              table_name
------------------------------
hours < 1        metrics_0
1 <= hours < 2   metrics_1
2 <= hours < 4   metrics_2
4 <= hours < 8   metrics_4

(the number of days can't be specified this way; it's ok for that to be an approximation. Maybe just turn that into a "days (approx)" column)

Contributor Author
@replay replay Jan 30, 2017

I'm not sure what you mean by the values in between. I don't think 1 <= hours < 2 is easier to understand than 1 - 2 if the column is named ttl hours. Do you mean the values between the limits of the different rows? Actually I think an approximation is good enough in this case, otherwise it just gets unnecessarily complicated. How about this:

mst@mst-nb1:~$ python ./generate_window_size_comment_table.py 
+-------------------+---------------+-------------+----------+
| TTL hours (floor) |    table_name | window_size | sstables |
+-------------------+---------------+-------------+----------+
|                 1 |     metrics_1 |           1 |    1 - 1 |
|             2 - 3 |     metrics_2 |           1 |    2 - 3 |
|             4 - 7 |     metrics_4 |           1 |    4 - 7 |
|            8 - 15 |     metrics_8 |           1 |   8 - 15 |
|           16 - 31 |    metrics_16 |           1 |  16 - 31 |
|           32 - 63 |    metrics_32 |           2 |  16 - 31 |
|          64 - 127 |    metrics_64 |           4 |  16 - 31 |
|         128 - 255 |   metrics_128 |           7 |  18 - 36 |
|         256 - 511 |   metrics_256 |          13 |  19 - 39 |
|        512 - 1023 |   metrics_512 |          26 |  19 - 39 |
|       1024 - 2047 |  metrics_1024 |          52 |  19 - 39 |
|       2048 - 4095 |  metrics_2048 |         103 |  19 - 39 |
|       4096 - 8191 |  metrics_4096 |         205 |  19 - 39 |
|      8192 - 16383 |  metrics_8192 |         410 |  19 - 39 |
|     16384 - 32767 | metrics_16384 |         820 |  19 - 39 |
|     32768 - 65535 | metrics_32768 |        1639 |  19 - 39 |
+-------------------+---------------+-------------+----------+

I removed the days again because, if the days are also floored, it gets confusing: sometimes there are overlaps between the rows and sometimes there aren't. I'm sure everybody knows a day has 24h.

Contributor
@Dieterbe Dieterbe Jan 30, 2017

Do you mean the values between the limits of the different rows?

Yes. I think your suggested table is basically the same as my suggestion, but I still see two slight advantages to my proposal:

  • seeing "floor" in the header and figuring out how that relates to the values shown (take your ttl, floor it, only then look at the ranges in the table) is not immediately obvious. You could clarify that with a comment instructing people how to use the table, but I think my version is clearer more quickly.
  • my values are simply all the powers of 2, and they relate directly to the table names in the 2nd column. Values like 31 and 1023 look slightly weird.

Actually I think an approximation is good enough in this case, otherwise it just gets unnecessarily complicated

It seems trivial to me to do the right thing here, no need for approximations. Your version isn't an approximation either (which is good), so I don't understand what you're getting at here.

Contributor Author
@replay replay Jan 30, 2017

Ok, never mind. I just thought the x <= hours < y notation is confusing.
How about this:

+------------------------+---------------+---------------------+----------+
|              TTL hours |    table_name | window_size (hours) | sstables |
+------------------------+---------------+---------------------+----------+
|         1 <= hours < 2 |     metrics_1 |                   1 |    1 - 3 |
|         2 <= hours < 4 |     metrics_2 |                   1 |    2 - 5 |
|         4 <= hours < 8 |     metrics_4 |                   1 |    4 - 9 |
|        8 <= hours < 16 |     metrics_8 |                   1 |   8 - 17 |
|       16 <= hours < 32 |    metrics_16 |                   1 |  16 - 33 |
|       32 <= hours < 64 |    metrics_32 |                   2 |  16 - 33 |
|      64 <= hours < 128 |    metrics_64 |                   4 |  16 - 33 |
|     128 <= hours < 256 |   metrics_128 |                   7 |  19 - 38 |
|     256 <= hours < 512 |   metrics_256 |                  13 |  20 - 41 |
|    512 <= hours < 1024 |   metrics_512 |                  26 |  20 - 41 |
|   1024 <= hours < 2048 |  metrics_1024 |                  52 |  20 - 41 |
|   2048 <= hours < 4096 |  metrics_2048 |                 103 |  20 - 41 |
|   4096 <= hours < 8192 |  metrics_4096 |                 205 |  20 - 41 |
|  8192 <= hours < 16384 |  metrics_8192 |                 410 |  20 - 41 |
| 16384 <= hours < 32768 | metrics_16384 |                 820 |  20 - 41 |
| 32768 <= hours < 65536 | metrics_32768 |                1639 |  20 - 41 |
+------------------------+---------------+---------------------+----------+

Contributor

Looks 👍 but I think the sstable counts are still off.
For example 65535 / 1639 = 39.9847, so how could it be 41 sstables? It should be 40, I think.

Contributor Author
@replay replay Jan 30, 2017

If the time span during which a datapoint needs to be kept can be 39.9847 tables long, and the ts of the datapoint is in the middle of a table, then the end of life of that datapoint will be in the middle of the 41st table. So that comes from the fact that the ts doesn't need to be aligned with the table wrapping times.
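Put as a formula (my phrasing, not code from the PR): because writes are not aligned to window boundaries, the worst case is one extra window on top of the ceiling of ttl/window.

    import "math"

    // maximum number of sstables that can hold still-live data for a given
    // ttl and compaction window (both in hours) when writes are not aligned
    // to window boundaries
    func maxSSTables(ttlHours, windowHours float64) int {
            return int(math.Ceil(ttlHours/windowHours)) + 1
    }

    // e.g. maxSSTables(65535, 1639) == 41, even though 65535/1639 ≈ 39.98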

Contributor

That makes a lot of sense. But I see that your latest push has the upper limit at (the incorrect) 40 now? :/ And also the table is in the old format again?

Contributor Author

Ah right, I think I only updated the script that generates the table and didn't copy-paste the output yet. Updating now.

@@ -301,7 +304,7 @@ func main() {
/***********************************
Initialize our backendStore
***********************************/
store, err := mdata.NewCassandraStore(*cassandraAddrs, *cassandraKeyspace, *cassandraConsistency, *cassandraCaPath, *cassandraUsername, *cassandraPassword, *cassandraHostSelectionPolicy, *cassandraTimeout, *cassandraReadConcurrency, *cassandraWriteConcurrency, *cassandraReadQueueSize, *cassandraWriteQueueSize, *cassandraRetries, *cqlProtocolVersion, *cassandraSSL, *cassandraAuth, *cassandraHostVerification)
store, err := mdata.NewCassandraStore(*cassandraAddrs, *cassandraKeyspace, *cassandraConsistency, *cassandraCaPath, *cassandraUsername, *cassandraPassword, *cassandraHostSelectionPolicy, *cassandraTimeout, *cassandraReadConcurrency, *cassandraWriteConcurrency, *cassandraReadQueueSize, *cassandraWriteQueueSize, *cassandraRetries, *cqlProtocolVersion, *cassandraWindowFactor, *cassandraSSL, *cassandraAuth, *cassandraHostVerification, ttls)
Contributor
@Dieterbe Dieterbe Feb 1, 2017

The ttl values we're passing to Cassandra here are a number of seconds.
Cassandra expects a number of hours.

I think we should do one of these two:
A) document that CassandraStore needs the ttl in hours, and adjust the invocation here
B) make CassandraStore accept the TTL as a number of seconds, and adjust the other invocations.

I prefer B because a number of hours is awkward (at the very least surprising) compared to all the other internal APIs, which deal with numbers of seconds.

Contributor Author
@replay replay Feb 1, 2017

I'm trying to understand why you're saying that Cassandra expects the number to be in hours.

When GetTTLTables() generates the table map, it converts those TTLs from seconds to hours.
In the INSERT statement, according to the docs, the TTL should always be in seconds, no matter what the compaction_window_unit is.
So where does Cassandra expect hours?

Contributor

Sorry, I meant that NewCassandraStore takes the ttl argument as a number of hours, which is what I think we should change to seconds.

Contributor Author

I don't think that's the case.
The ttls variable in NewCassandraStore only gets used once, as a parameter to GetTTLTables(), which returns a map of the type ttlTables where the tables are indexed by TTL in seconds.
GetTTLTables() only temporarily converts the TTL via ttlUnits() to calculate the right windowSize, because compaction_window_unit is HOURS, but apart from that it's all in seconds.

Contributor Author

Oh, I get where the confusion comes from... mt-split-metrics-by-ttl wrongly treats the TTL as hours, I'll fix that.

Contributor Author

This should clear up the mess: da4a73e

if err != nil {
panic(fmt.Sprintf("Failed to create directory: %s", err))
}
namedTableLinks := make([]string, 0)
Contributor

Can make a correctly sized slice and then just use [i] syntax to set elements instead of append ;)
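A sketch of the suggested change (the element expression here is illustrative, not the actual code):

    // size the slice up front and assign by index instead of appending
    namedTableLinks := make([]string, len(tables))
    for i, table := range tables {
            namedTableLinks[i] = fmt.Sprintf("%s/%s/%s", tmpDir, *cassandraKeyspace, table)
    }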

Contributor Author

right! thx

fmt.Printf("The following tables have been created: %s\n", strings.Join(tables, ", "))
fmt.Println("Now continue with the following steps on each cassandra node. In the case")
fmt.Printf("of a cluster setup you may want to copy the directory structure in %s\n", tmpDir)
fmt.Println("to each of the nodes.")
Contributor
@Dieterbe Dieterbe Feb 1, 2017

We tell people to do the following steps and also to copy the dir structure, but we're not clear on what should happen first (I think: first copy the dir structure, then do the next steps?).

Contributor

Also, we say "do this on each cassandra node" and then "in the case of a cluster". Isn't having multiple nodes and a cluster the same thing?
So I think we could simply say:

On each of your cassandra nodes:

  • recreate this directory structure (by copying it, or running this tool)
  • execute these steps

Contributor Author
@replay replay Feb 1, 2017

Ok, how about this:

mst@ubuntu:~/go/src/github.com/raintank/metrictank/cmd/mt-split-metrics-by-ttl$ ./mt-split-metrics-by-ttl 36000
The following tables have been created: metric_8
Now continue with the following steps on your cassandra node(s):
- Recreate (or copy) this directory structure on each cassandra node: 
  /tmp/metrictank859670692
- Stop metrictanks to stop writes
- Use `nodetool snapshot --table metric metrictank` on all cluster nodes to create
  a snapshot of the metric table
  https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSnapShot.html
- Move snapshot files to directory /tmp/metrictank859670692/snapshot
- Restore by executing:
  sstableloader -d localhost /tmp/metrictank859670692/metrictank/metric_8

Contributor

Do we need the "Move snapshot files to directory /tmp/metrictank859670692/snapshot" step? Can't we just tell them to cd into that dir before running nodetool snapshot?

Also, to stop writes you could stop metrictank or set it to secondary status, see https://github.com/raintank/metrictank/blob/master/docs/http-api.md#set-cluster-status

Contributor Author
@replay replay Feb 1, 2017

nodetool snapshot won't store the snapshot in the cwd, it puts the files in the data directory of Cassandra.
Good point about making it secondary.

Contributor Author

Latest version:

The following tables have been created: metric_8
Now continue with the following steps on your cassandra node(s):
- Recreate (or copy) this directory structure on each cassandra node:
  /tmp/metrictank185098310
- Stop writes by stopping the metrictanks or set their cluster status to secondary
- Use `nodetool snapshot --table metric metrictank` on all cluster nodes to create
  a snapshot of the metric table
  https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSnapShot.html
- Move snapshot files to directory /tmp/metrictank185098310/snapshot
- Load the data by executing:
  sstableloader -d localhost /tmp/metrictank185098310/metrictank/metric_8

fmt.Println(" a snapshot of the metric table")
fmt.Println(" https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSnapShot.html")
fmt.Println(fmt.Sprintf("- Move snapshot files to directory %s", snapshotDir))
fmt.Println("- Restore by executing:")
Contributor

"restore" isn't the most fitting word here. Perhaps "Execute this to assure all new ttl tables will have the data for metrics with the right ttl (and also data for other series which won't be used, but that's ok)"

Contributor Author
@replay replay Feb 1, 2017

How about "Load the data" instead of "restore"? That's short and reflects what's happening, if they want to know more details there is already a link to the docs

Contributor

k

@@ -611,7 +611,10 @@ func TestAlignRequests(t *testing.T) {
reqRaw("a", 0, 30, 800, 10, consolidation.Avg),
reqRaw("b", 0, 30, 800, 60, consolidation.Avg),
},
[]mdata.AggSetting{},
mdata.AggSettings{
35 * 24 * 60 * 60,
Contributor

Right now all these 35d TTL settings are basically ignored, right? As in, if we changed them, the tests would still succeed? Is this the case for all tests where this change is made?
I guess this puts us on track if we later want to make alignRequests smarter and take the TTL into account. Is that the purpose of these changes, or do you have something else in mind? I'm having some trouble understanding what the goal is here.

Contributor Author

The reason I changed the tests is that alignRequests now expects the second parameter to be of type AggSettings, and that type requires a number to be there as the default TTL. You're right that in the case of these tests that number does not matter at all, so I've just put the default value there.

@@ -15,7 +15,7 @@ type AggMetrics struct {
Metrics map[string]*AggMetric
chunkSpan uint32
numChunks uint32
aggSettings []AggSetting // for now we apply the same settings to all AggMetrics. later we may want to have different settings.
aggSettings AggSettings // for now we apply the same settings to all AggMetrics. later we may want to have different settings.
chunkMaxStale uint32
metricMaxStale uint32
ttl uint32
Contributor

I believe this field is not needed anymore.

Contributor Author

right! missed that

Ready bool // ready for reads?
}

type AggSettings struct {
TTL uint32 // default TTL for raw data
Contributor

This field should be RawTTL or something, because it easily gets confusing otherwise. Btw, why does it say "default TTL"? This value is whatever the user asked for, not any default.

Contributor Author

Ah, I meant default in the sense that it isn't a rollup or any aggregation, but just the normal raw data.

Contributor Author

Will update to RawTTL
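Sketch of the rename (other fields of AggSettings omitted):

    type AggSettings struct {
            RawTTL uint32 // TTL for raw (non-rollup) data, in seconds
            // ...
    }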

{oneYear, 20, "metric_%d", "metric_8192", 410},
{oneDay, 50, "metric_%d", "metric_16", 1},
{oneMonth, 50, "metric_%d", "metric_512", 11},
{oneYear, 50, "metric_%d", "metric_8192", 164},
Contributor

Can you make this a bit more extensive and test some edge cases (like a TTL of 0s, 10s, 1min, maxuint32, and on and around a boundary, for example 1024 hours minus a second, 1024 hours minus 2 seconds, 1024 hours, 1024 hours and a second)?
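For illustration, boundary rows in the existing {ttl, windowFactor, nameFormat, table, window} layout could look like this (hypothetical rows, with expected values computed from the window formula discussed above, not copied from the actual test):

    {3600, 20, "metric_%d", "metric_1", 1},              // exactly 1 hour
    {1024*3600 - 1, 20, "metric_%d", "metric_512", 26},  // just under 1024 hours
    {1024 * 3600, 20, "metric_%d", "metric_1024", 52},   // exactly 1024 hours
    {1024*3600 + 1, 20, "metric_%d", "metric_1024", 52}, // just over 1024 hours
    {4294967295, 20, "metric_%d", "metric_1048576", 52429}, // maxuint32 seconds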

Contributor Author

Done, added some more edge cases

Contributor Author
@replay replay Feb 1, 2017

I should update the table in the comments to add a case for TTLs that are 0-1 hours.
Done: 158736f#diff-f63a36d3200e1e4aa79054431d8b5c52R110

key ascii,
ts int,
data blob,
PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1' }
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'HOURS', 'compaction_window_size': '%d' }
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}`
Member
@woodsaj woodsaj Feb 1, 2017

Let's change this to the correct format for Cassandra 3.x:

AND compression = { 'class' : 'LZ4Compressor' }

Contributor Author

Done

key ascii,
ts int,
data blob,
PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1' }
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'HOURS', 'compaction_window_size': '%d' }
Member

Let's update to use the Cassandra 3.x format for the class:

AND compaction = {'class': 'TimeWindowCompactionStrategy',

Contributor Author

Updated it. I tested how it looks when the table gets created, and Cassandra resolves the class path when it creates the table, so they end up looking pretty much the same:

cqlsh:metrictank> CREATE TABLE metrictank.metric_5122 (
              ...     key ascii,
              ...     ts int,
              ...     data blob,
              ...     PRIMARY KEY (key, ts)
              ... ) WITH CLUSTERING ORDER BY (ts DESC)
              ...     AND bloom_filter_fp_chance = 0.01
              ...     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
              ...     AND comment = ''
              ...     AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '26', 'compaction_window_unit': 'HOURS', 'max_threshold': '32', 'min_threshold': '4'}
              ...     AND compression = {'chunk_length_in_kb': '64', 'class': 'LZ4Compressor'}
              ...     AND crc_check_chance = 1.0
              ...     AND dclocal_read_repair_chance = 0.1
              ...     AND default_time_to_live = 0
              ...     AND gc_grace_seconds = 864000
              ...     AND max_index_interval = 2048
              ...     AND memtable_flush_period_in_ms = 0
              ...     AND min_index_interval = 128
              ...     AND read_repair_chance = 0.0
              ...     AND speculative_retry = '99PERCENTILE';
cqlsh:metrictank> describe table metric_5122

CREATE TABLE metrictank.metric_5122 (
    key ascii,
    ts int,
    data blob,
    PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '26', 'compaction_window_unit': 'HOURS', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1' }
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}`
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'HOURS', 'compaction_window_size': '%d' }
AND compression = { 'sstable_compression' : 'LZ4Compressor' }`
Member

This should be

 AND compression = { 'class' : 'LZ4Compressor' }

key ascii,
ts int,
data blob,
PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1' }
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}`
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'HOURS', 'compaction_window_size': '%d' }
Member

{'class': 'TimeWindowCompactionStrategy'

Contributor

Why is this better? Some sort of protection against namespace refactors?

Member

The full class path should only be used for custom classes; the built-in ones should use the short name. I don't really know why, it's just how it is presented in the latest docs.

https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsConfigCompress.html?hl=compression

We want tables to have varying compaction window sizes depending on a
metric's TTL. At the same time we want to be able to group those with
similar TTLs. This is some kind of middle way.

issue #382
Development

Successfully merging this pull request may close these issues.

use separate cassandra tables for raw and rollup data
3 participants