This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Clustering #400

Merged
merged 15 commits
Dec 13, 2016

Conversation

@woodsaj woodsaj (Member) commented Nov 17, 2016

  • keep track of nodes in the cluster
  • assign "partitions" to each node
  • include partition info in the metric index for each metric
  • update MetricIndex plugins to be partition aware
    • this means the data stored in the backend stores will change so we will need to provide some migration tools.
  • make carbon and kafkamdm input plugins partition aware
  • when calling graphite's /render api fan out the request to
    one node for each partition
  • when calling graphite's /metrics/find api fan out the requests
    to one node for each partition
  • when calling graphite's /metrics/index.json api fan out the requests
    to one node for each partition
  • add internal "cluster" api's to support the clustered graphite api
    requests

@Dieterbe Dieterbe (Contributor) left a comment

I have a lot more reviewing to do, I only understand a part of the code so far.
At first glance this looks like a solid piece of code.
Here's some feedback to get started.

Also, any thoughts on adding an idx table in cassandra per partition? That way we don't have to specify the partition for every single metricdef, and we don't need an index either.

"path": "github.com/tinylib/msgp/msgp",
"revision": "0cea1fa86e8403be1284013014f87ab942056de8",
"revisionTime": "2015-10-23T22:38:53Z"
},
Contributor

why? (don't just tell me, put it in the commit message)
it seems like we still import github.com/tinylib/msgp/msgp

Timeout: time.Second,
}

func NodeStateFromString(s string) NodeState {
Contributor

replaced by stringer output?

Contributor

my bad this is string to state

primaryNode = flag.Bool("primary-node", false, "the primary node writes data to cassandra. There should only be 1 primary node per cluster of nodes.")
peersStr = flag.String("peers", "", "http/s addresses of other nodes, comma separated. use this if you shard your data and want to query other instances")
probeIntervalStr = flag.String("probe-interval", "2s", "Interval to probe peer nodes")
Contributor

i was gonna say https://golang.org/pkg/flag/#Duration but then i realized we lose access to a validation error, which is so weird. so flag.String it is..
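For reference, a minimal self-contained sketch of how the string flag still yields a usable validation error by parsing it ourselves (illustrative, not the PR's code):

```go
package main

import (
	"flag"
	"log"
	"time"
)

var probeIntervalStr = flag.String("probe-interval", "2s", "Interval to probe peer nodes")

func main() {
	flag.Parse()
	// parsing the string ourselves keeps the validation error in our hands,
	// instead of the flag package deciding how to report it
	probeInterval, err := time.ParseDuration(*probeIntervalStr)
	if err != nil {
		log.Fatalf("invalid probe-interval %q: %s", *probeIntervalStr, err)
	}
	log.Printf("probing peers every %s", probeInterval)
}
```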

primaryNode = flag.Bool("primary-node", false, "the primary node writes data to cassandra. There should only be 1 primary node per cluster of nodes.")
peersStr = flag.String("peers", "", "http/s addresses of other nodes, comma separated. use this if you shard your data and want to query other instances")
probeIntervalStr = flag.String("probe-interval", "2s", "Interval to probe peer nodes")
mode = flag.String("mode", "single", "Operating mode of cluster. (single|multi)")
Contributor

note to self: can this be derived from len(peers) ?

Member Author

The "mode" flag is only used when deciding on whether other nodes need to be used to handle graphite queries. I wasnt sure if there was a use case where you would want to know about other nodes in the cluster but not co-ordinate queries with them. The example i had in mind is if we later support automatic primary failover for a cluster of nodes that all handle all partitions.

Contributor

fair

if !cluster.ValidMode(cluster.ModeType(*mode)) {
log.Fatal(4, "invalid cluster operating mode")
}

Contributor

wouldn't it make more sense to have the cluster package declare the flags, validate them, parse the peer list etc ?
I like the approach we're taking more and more where main hooks into package callbacks to take care of that stuff.

Member Author

yeah, that sounds reasonable.


// return the list of peers to broadcast requests to
// Only 1 peer per partition is returned
func PeersForQuery() []*Node {
@Dieterbe Dieterbe (Contributor) Nov 18, 2016

worth pointing out that in half of the code (peers, AddPeer) we consider peers as "the other nodes" (not ThisNode).
but anything that uses PeersForQuery will get a peers list that includes ThisNode. not a bad thing, i can see how it makes sense, just worth documenting as it was a surprise to me

Member Author

I originally had PeersForQuery exclude ThisNode, but I then added it back in to allow future features that provide better scheduling.

For example, in future we may want to extend the "state" of a node to differentiate between ready to handle queries and ready to serve data (i.e. due to just having started up and not having data in memory). We could then prefer other nodes that have been online longer over a node that has just started up.

Member Author

also, by returning all nodes that should be used for handling the query we can apply the "mode" enforcement easily. If "mode" is set to single, we just return ThisNode.

Contributor

yeah it makes sense, i would just document it for clarity

case <-ticker.C:
mu.Lock()
for _, peer := range peers {
go peer.Probe()
Contributor

if the peer is slow, the Probe calls will just pile up, each waiting to acquire the lock and probe the node right after it just finished probing.
this can also happen if the ticker ticks right after AddPeer() which also calls go peer.Probe()

I think we need something like https://github.com/golang/groupcache/blob/master/singleflight/singleflight.go

Member Author

The HTTP call is made without the lock being held. So the easiest solution here is to just track if a probe is currently underway and if it is, immediately return.
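A self-contained sketch of that guard (the real change lives in cluster/node.go; the stand-in type here only has what the sketch needs):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Node is a trimmed-down stand-in for cluster.Node; only the fields needed to
// illustrate the probe guard are shown.
type Node struct {
	sync.Mutex
	probing bool
}

// Probe checks the peer's status. If a probe is already underway we return
// immediately, so a slow peer can't cause probe goroutines to pile up
// (every ticker tick launches `go peer.Probe()`).
func (n *Node) Probe() {
	n.Lock()
	if n.probing {
		n.Unlock()
		return
	}
	n.probing = true
	n.Unlock()
	defer func() {
		n.Lock()
		n.probing = false // reset so the next tick can probe again
		n.Unlock()
	}()

	// the actual HTTP status request happens here, without holding the lock
	time.Sleep(50 * time.Millisecond)
}

func main() {
	n := &Node{}
	for i := 0; i < 3; i++ {
		go n.Probe() // concurrent calls: only one probe actually runs
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("done")
}
```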

Member Author

Contributor

hm this link seems off. also not seeing any change wrt Probe() in the new version.

Member Author

ohh, i clobbered the change when i reverted a commit.

https://github.com/raintank/metrictank/blob/clustering/cluster/node.go#L133-L139

Contributor

nice solution. might want to set n.probing = false when done ;)

Member Author

thanks, added it.

// we only need to query 1 node for each partition.
// we are just using the first node for each partition, which will be
// ThisNode if ThisNode owns the partition. Later we can add more
// complex scheduling logic to distribute queries across the cluster
Contributor

this seems a tad too basic, because:

  1. we consistently select the first one, always
  2. a peer may be selected multiple times because it hosts multiple partitions

this seems like we can get very easily into situations where traffic is severely imbalanced, with some nodes receiving a multiple of what is normal and others receiving 0 requests. or am i missing something?

Member Author

yes this is simple, but it gets the job done. It is by no means the best solution, but I think it is good enough for now.

Assume we have 4 partitions and 4 nodes

nodeA  nodeB  nodeC  nodeD
0,1    0,1    2,3    2,3

if a query is sent to nodeA, it will select nodeA for partitions 0,1 and either nodeC or nodeD for partitions 2,3. So every query will only map to 2 nodes: the local node and 1 other.

The only problem that I see is that both nodeA and nodeB could consistently send queries to nodeC and never nodeD.
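For illustration, a self-contained sketch of the per-partition selection (stand-in types, not the PR's code). The simple scheme always takes the first candidate; picking randomly among the candidates, as done here, is one way to spread the remote queries, which is what the balancing discussion below is about:

```go
package main

import (
	"fmt"
	"math/rand"
)

// node stands in for cluster.Node; only a name is needed for this sketch.
type node struct{ name string }

// pick exactly one node per partition, preferring the local node for the
// partitions it owns.
func peersForQuery(local *node, byPartition map[int32][]*node) []*node {
	selected := make(map[*node]struct{})
	for _, candidates := range byPartition {
		if len(candidates) == 0 {
			continue
		}
		pick := candidates[rand.Intn(len(candidates))]
		for _, c := range candidates {
			if c == local {
				pick = c // always handle our own partitions locally
				break
			}
		}
		selected[pick] = struct{}{}
	}
	out := make([]*node, 0, len(selected))
	for n := range selected {
		out = append(out, n)
	}
	return out
}

func main() {
	a, b, c, d := &node{"A"}, &node{"B"}, &node{"C"}, &node{"D"}
	byPartition := map[int32][]*node{
		0: {a, b}, 1: {a, b},
		2: {c, d}, 3: {c, d},
	}
	for _, n := range peersForQuery(a, byPartition) {
		fmt.Println(n.name) // A, plus one of C or D
	}
}
```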

Member Author

OK, I have refactored this to ensure balanced distribution between nodes. Also included a unit test to make sure.

Contributor

Isn't it better to shuffle the partitions across nodes though?
To follow your example with parts=4 and nodes=4, I would rather set it up like so:

nodeA  nodeB  nodeC  nodeD
0,1    0,2    1,3    2,3

this way, when a node crashes, the extra workload gets split up over 2 nodes, as opposed to moving everything over to 1 node which would double its workload and seems more likely to crash.
the more partitions we have, the more we can evenly distribute it too. if we have 6 partitions and 4 nodes we can make it so that on a node failure, the excess load gets split 33% each to the other nodes.

BTW I like the new approach!

Contributor

I still have to review the changes in the api package, but one important concern that comes up with my partition assignment approach (which is not an issue with yours) is that when we query multiple nodes and there is some overlap in between, they might do duplicate work, unless we specify in the request which partitions we want to use the node for.

for example: let's say ThisNode is A, and we query data for all 4 partitions.
we'll pick A for 0 and 1. let's say we select B for 2 and D for 3. if we send the request without further specification, then D will respond with data for both part 2 and 3 because it has data for it, even though we selected B to do the work for partition 2.

@Dieterbe Dieterbe (Contributor) Nov 22, 2016

another problem with my approach is assigning primaries becomes hard.
with my proposal above, we could make A and D primaries, so all partitions are getting saved, and only one time.
but if A crashes and we want to establish a new primary for part 0 and 1, we would have to select B and C, but those would also save data that D is already saving. we could de-promote D and promote B and C. which is a little weird but should work though.

note if then another node fails and we only have 2/4 nodes we can't establish a proper primary setup where we save all shards once. with your approach if we have 2/4 nodes, i think we would be either more lucky (e.g. B and D crash in your example) or more unlucky (A and B crash)

I have to do the exercises, but I think with more shards, my approach would probably lend itself worse to maintaining proper primary assignment, unless we implement primary status on a per-partition level instead of per-node. so maybe for now we should stick to your way. and do the advanced stuff later as needed

Member Author

I think we just need to have the documentation explain that partitions should be split into "groups" and each node handles 1 group. Though your layout provides better request distribution when a node fails, it adds significant complexity.

I think we should try and move towards supporting your layout, but that can be done as a future improvement.

def := schema.MetricDefinitionFromMetricData(data)
def.Partition = partition
c.MemoryIdx.AddDef(def)
c.writeQueue <- writeReq{recvTime: time.Now(), def: def}
@Dieterbe Dieterbe (Contributor) Nov 18, 2016

what's the goal of this piece?
a metric was reassigned to a different partition, probably because the number of partitions changed? is there another reason? can you comment the intention here.

BTW this will properly overwrite the old entry with the previous partition, in cassandra right?

Member Author

Yes, this will handle changes in the partition of a metric. But primarily this is just to persist the change to the "lastUpdate" time.

Previously, we just sent "existing" to the write queue. But as a result, if the data in cassandra becomes corrupt for any reason it can never be fixed, due to metrictank loading the data from cassandra and only ever writing that same data back to cassandra with the updated lastUpdate field.
Sending a fresh metricDef built from the metricData is a much more resilient approach and has a negligible performance penalty.

Contributor

ok. and why do we now skip updates if the partition doesn't change? this means that if a metric never changes partition, we will never update it in cassandra. that seems wrong (same for EsIdx)

Member Author

yeah, that should be "if existing.Partition == partition" so that if the partition hasn't changed we update only every ~updateInterval. But if the partition has changed we treat it like a new metric, which will update it immediately.
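A small self-contained sketch of the intended rule (all names here are illustrative, not the index code itself):

```go
package main

import (
	"fmt"
	"time"
)

// shouldPersist captures the rule above: persist immediately when a metric is
// new or its partition changed; otherwise only re-persist once the update
// interval has elapsed.
func shouldPersist(exists bool, existingPartition, partition int32,
	lastUpdate time.Time, updateInterval time.Duration) bool {
	if !exists || existingPartition != partition {
		return true // new metric or partition reassignment: write right away
	}
	return time.Since(lastUpdate) >= updateInterval
}

func main() {
	fmt.Println(shouldPersist(true, 3, 3, time.Now(), time.Hour))                   // false: recent, unchanged
	fmt.Println(shouldPersist(true, 3, 5, time.Now(), time.Hour))                   // true: partition changed
	fmt.Println(shouldPersist(true, 3, 3, time.Now().Add(-2*time.Hour), time.Hour)) // true: interval elapsed
}
```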

Contributor

this fix still has to be applied to ES index

} else {
missing := diffPartitions(partitions, availParts)
if len(missing) > 0 {
log.Fatal(4, "kafka-mdm: configured partitions not in list of available partitions.")
Contributor

can't we be more specific and just print which are missing?
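A self-contained sketch of the more specific message (missingPartitions plays the role of diffPartitions; names are illustrative):

```go
package main

import "log"

// missingPartitions returns configured partitions that are not available.
func missingPartitions(configured, available []int32) []int32 {
	avail := make(map[int32]struct{}, len(available))
	for _, p := range available {
		avail[p] = struct{}{}
	}
	var missing []int32
	for _, p := range configured {
		if _, ok := avail[p]; !ok {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	configured := []int32{0, 1, 2, 7}
	available := []int32{0, 1, 2, 3}
	if missing := missingPartitions(configured, available); len(missing) > 0 {
		// naming the offending partitions makes the failure actionable
		log.Fatalf("kafka-mdm: configured partitions %v not in list of available partitions %v", missing, available)
	}
}
```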

@woodsaj woodsaj force-pushed the clustering branch 3 times, most recently from 3dc473a to 47d5022 Compare November 20, 2016 00:34
@@ -103,7 +103,8 @@ func (u *Usage) Report() {

m := metrics.GetOrCreate(met.Id)
m.Add(uint32(met.Time), met.Value)
metricIndex.Add(met)
//TODO: how to set the partition of the metric? We probably just need to publish the metric to our Input Plugin
Contributor

the bigger problem here is that the usage stats from different metrictank instances will conflict with each other. maybe the better approach now is to have a separate service for usage tracking, a service that consumes from all partitions.
we could also have usage metrics per partition but that's still weird due to multiple instances consuming the same partitions

Member Author

yeah, I don't have a good answer for what we should do here yet.

@@ -176,9 +184,19 @@ func main() {

conf.ParseAll()

// if the user just wants the version, give it and exit
@Dieterbe Dieterbe (Contributor) Nov 21, 2016

i would even move this before config parsing.
if the only thing you want is the version, it's annoying that it aborts on invalid configs.
and no sane person would (or should) set show-version = true in the config file. (e.g. it's reasonable to ignore a show-version = true in the config)

mdata.InitPersistNotifier(stats, handlers...)

/***********************************
Initialize our MemoryIdx
Contributor

this should just be "initialize our index". this could be too easily confused for the memory idx implementation

if err != nil {
log.Fatal(4, "kafka-mdm failed to create client. %s", err)
}

Contributor

i think we should put a defer client.Close() here, since we only use this client in this function. in New() we create a new one, which btw should probably also be closed in KafkaMdm.Close()

@@ -75,18 +117,150 @@ func (n *Node) SetState(s NodeState) {
n.Unlock()
}

// provide thread safe json Marshaling
func (n *Node) MarshalJSON() ([]byte, error) {
func (n *Node) SetPartitions(part []int32) {
Contributor

this should never be called once the initial configuration has been established by input plugins, right?
might want to put that in a comment

Member Author

This could be called at any time, but it would only make sense for the input plugin to call it, as that is what controls the partitions that are being ingested. For example, if the kafkamdm input plugin is configured to consume from "*" and we update the kafka topic to use more partitions, it should call SetPartitions()

@Dieterbe Dieterbe (Contributor) Nov 22, 2016

oh wow I hadn't thought about on-the-fly adjustments like that. sounds good.
just so we're clear though:

  • carbon input partition is set by the operator in the config, and is just something that represents how they manually shard their data, based on their relaying setup and how they route carbon traffic. until we have config reloading or an api for this, it can only be changed by restarting MT
  • the mdm input does not seem to have a provision either to dynamically become aware of more partitions and call SetPartitions accordingly. so we also only call it at startup. is this something that still has to be implemented? (perhaps as a future improvement)

Member Author

Yes, currently SetPartitions() is only ever called once. But the point is that it is only called once due to limitations in the input plugins. There is no reason to enforce that the partitions can only be set once.


"time"
)

var ThisNode *Node
type ModeType string
Contributor

why not just type Mode string?

Member Author

"Mode" is the current variable that stores the ModeType

var (
	ThisNode *Node
	Mode     ModeType

Contributor

k

peers = make([]*Node, 0)
)

func Init(name, version string, primary bool, started time.Time, interval time.Duration, m ModeType) {
Contributor

whatInterval ? maybe probeInterval

response.Write(ctx, response.NewJson(200, status, ""))
}

var NotFoundErr = errors.New("not found")
Contributor

declarations for global variables should be after the imports (imho or best practice not sure)

r.Get("/", s.getClusterStatus)
r.Combo("/getdata", bind(models.GetData{})).Get(s.getData).Post(s.getData)
r.Combo("/index/find", bind(models.IndexFind{})).Get(s.indexFind).Post(s.indexFind)
r.Combo("/index/get", bind(models.IndexGet{})).Get(s.indexGet).Post(s.indexGet)
Contributor

i don't think we call /index/get anywhere. can be removed?

Member Author

I think we should still keep it as it might be useful for fetching metricDefs by Id

Contributor

do you mean manually, e.g. when troubleshooting?

or as part of cluster operation? because currently the implementation is in api/cluster.go and the http api endpoint is under /cluster, which is normally only for internal cluster operations

(note that the implementation only looks in the local index)

Member Author

yes, manually for troubleshooting.

Contributor

in that case i don't think this line should be with all the internal clustering stuff, and the implementation code should probably also be moved away from the clustering handlers

Member Author

where do you suggest it get moved to?

@Dieterbe Dieterbe (Contributor) Dec 12, 2016

maybe under /debug/index/get and in api/debug.go
I don't care so much but it does seem out of place with all the cluster stuff, since it's not used by any clustering.

We could also consider stripping the /cluster/ prefix from all http paths used by the clustering.
and just putting them at /index/list and so on. (and this one could simply be /index/get) The notion of who we expect to call a certain endpoint should perhaps not dictate the http path hierarchy

Member Author

Yeah, i think removing the "cluster" prefix makes the most sense.

@@ -15,6 +16,7 @@ type Req struct {
MaxPoints uint32 `json:"maxPoints"`
RawInterval uint32 `json:"rawInterval"` // the interval of the raw metric before any consolidation
Consolidator consolidation.Consolidator `json:"consolidator"`
Node *cluster.Node `json:"node"`
@Dieterbe Dieterbe (Contributor) Nov 22, 2016

we don't really need to encode the Node into json. the only place we decode incoming Req json is /cluster/getdata which uses getTargetsLocal which doesn't care about this attribute.

I guess it could be useful maybe to see the node name when sniffing traffic for troubleshooting (though you could also look at the dest ip), but then again all Node attributes are private so it would just be {} in json.

we should probably use json:"-" here

Member Author

yes this should be json:"-"
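A tiny self-contained sketch of the effect of the `json:"-"` tag (stand-in types, not the real models.Req):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// node stands in for cluster.Node (all its fields are unexported, so it would
// marshal as "{}" anyway).
type node struct{ name string }

// req is a trimmed stand-in for models.Req, showing the json:"-" tag that
// keeps the Node out of the wire format.
type req struct {
	MaxPoints uint32 `json:"maxPoints"`
	Node      *node  `json:"-"`
}

func main() {
	b, _ := json.Marshal(req{MaxPoints: 800, Node: &node{"peer-1"}})
	fmt.Println(string(b)) // {"maxPoints":800}
}
```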

localReqs := make([]models.Req, 0)
remoteReqs := make(map[*cluster.Node][]models.Req)
for _, req := range reqs {
if req.Node == nil || req.Node == cluster.ThisNode {
Contributor

if req.Node were nil, that would be a nasty bug in our code (correct me if i'm wrong).
i would rather crash MT with a nil pointer so we can fix the code, rather than silently ignoring it and postponing when we become aware of the bug (which would be harder to track down because it'll be less obvious)

note also that in the only place we call this function (api/graphite.go) a few lines before we call getTargets(), we call req.Node.GetName()

@Dieterbe Dieterbe (Contributor) Nov 22, 2016

EDIT: never mind, not true
also I think we can simplify this function: instead of first splitting up the requests into two categories and then doing very similar logic for each category, we can just iterate over all requests, do the wg.Add for each, and launch the goroutine where we check the err and series results, etc., calling getTargetsLocal vs getTargetsRemote based on whether it's a local or remote node
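A simplified, self-contained sketch of that flattened fan-out (stand-in types; getLocal/getRemote play the roles of getTargetsLocal/getTargetsRemote):

```go
package main

import (
	"fmt"
	"sync"
)

// req and series stand in for models.Req and models.Series in this sketch.
type req struct {
	target string
	local  bool // true when req.Node == cluster.ThisNode
}
type series struct{ name string }

func getLocal(r req) ([]series, error)  { return []series{{r.target + " (local)"}}, nil }
func getRemote(r req) ([]series, error) { return []series{{r.target + " (remote)"}}, nil }

// fan out one goroutine per request, dispatching to the local or remote path
// inside the goroutine instead of pre-splitting the requests into two groups.
func getTargets(reqs []req) ([]series, []error) {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		out  []series
		errs []error
	)
	for _, r := range reqs {
		wg.Add(1)
		go func(r req) {
			defer wg.Done()
			get := getRemote
			if r.local {
				get = getLocal
			}
			s, err := get(r)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = append(errs, err)
				return
			}
			out = append(out, s...)
		}(r)
	}
	wg.Wait()
	return out, errs
}

func main() {
	res, errs := getTargets([]req{{"a.b.c", true}, {"d.e.f", false}})
	fmt.Println(res, errs)
}
```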


}

rSeries, err := s.findSeries(ctx.OrgId, patterns, int64(fromUnix))
Contributor

what does rSeries mean? why not series ?

@@ -208,10 +315,25 @@ func (s *Server) metricsFind(ctx *middleware.Context, request models.GraphiteFin
if request.From != 0 {
request.From -= 86400
}
nodes, err := s.MetricIndex.Find(ctx.OrgId, request.Query, request.From)
nodes := make([]idx.Node, 0)
rSeries, err := s.findSeries(ctx.OrgId, []string{request.Query}, request.From)
Contributor

what does rSeries mean? why not series ?

msg := fmt.Sprintf("%d errors: ", len(errors))
for _, e := range errors {
msg += e.Error() + ", "
}
Contributor

instead of this loop you can just print errors as "%q"; that will show it as a list, and in case an error message contains a "," it will be less confusing. see https://play.golang.org/p/bjBWpMQ_R5
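For illustration (same idea as the playground link; the slice is named errs here to avoid shadowing the errors package):

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	errs := []error{
		errors.New("dial tcp: i/o timeout"),
		errors.New("node B returned 500, backend error"),
	}
	// %q renders the slice as a quoted list, so commas inside a message
	// can't be confused with the separator between errors
	msg := fmt.Sprintf("%d errors: %q", len(errs), errs)
	fmt.Println(msg)
	// 2 errors: ["dial tcp: i/o timeout" "node B returned 500, backend error"]
}
```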

parsedTargets := make(map[string]string)

// consolidatorForPattern[<pattern>]<consolidateBy>
consolidatorForPattern := make(map[string]string)
Contributor

note that in graphite it's legal to do a query such as:

render?target=consolidateBy(foo.qps, "min")
&target=consolidateBy(foo.qps, "max")
&target=consolidateBy(foo.qps, "avg")

I think this code wouldn't work with that.

Member Author

that is true, but this code works the same as what was here before (just some of the variable names have been changed)

I guess we need to make this map[string][]string?

Member Author

I am not sure we should fix this in this PR. It is going to require a lot of refactoring of how we parse targets. I think it is best if we address this after this PR has been merged.

Contributor

yeah that's fine by me

@@ -177,7 +284,7 @@ func (s *Server) renderMetrics(ctx *middleware.Context, request models.GraphiteR

Contributor

there is a new implication here: aggsettings must be consistent across all peers. seems obvious, but we should document it in the config files.

Member Author

added to docs.

}
seenPaths := make(map[string]struct{})
// different nodes may have overlapping data in their index.
// maybe because they loaded the entire index from a persistent store,
Contributor

this line is no longer applicable

msg := fmt.Sprintf("%d errors: ", len(errors))
for _, e := range errors {
msg += e.Error() + ", "
}
Contributor

also here can just use '%q'

}

// different nodes may have overlapping data in their index.
// maybe because they loaded the entire index from a persistent store,
Contributor

this line no longer applies.

also this comment corresponds to code that is a bit higher up


rSeries, err := s.findSeries(ctx.OrgId, patterns, int64(fromUnix))
if err != nil {
response.Write(ctx, response.NewError(http.StatusInternalServerError, err))
Contributor

note that err here can contain low-level errors revealing details about our infrastructure (could be a dial timeout to 10.240.x.x, network unreachable, etc).
I think those should be abstracted away from the user, and we should just tell the user we had an internal problem with the cluster, or something
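A minimal sketch of that idea: keep the detailed error in the logs and hand the user a generic message (names are illustrative, not the PR's response helpers):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// userFacingError logs the detailed, infrastructure-revealing error and
// returns only a generic message plus a 500 for the HTTP response.
func userFacingError(internal error) (code int, msg string) {
	log.Printf("cluster query failed: %s", internal) // full detail stays server-side
	return 500, "internal cluster error"
}

func main() {
	code, msg := userFacingError(errors.New("dial tcp 10.240.0.12:6060: network is unreachable"))
	fmt.Println(code, msg)
}
```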

if err != nil {
response.Write(ctx, response.NewError(http.StatusBadRequest, err))
return
response.Write(ctx, response.NewError(http.StatusInternalServerError, err))
Contributor

same note here about not revealing low-level details


if err != nil {
log.Error(3, "HTTP IndexJson() %s", err.Error())
response.Write(ctx, response.NewError(http.StatusInternalServerError, err))
@Dieterbe Dieterbe (Contributor) Nov 22, 2016

same note here about not revealing low level details to user

@Dieterbe Dieterbe (Contributor) commented Nov 22, 2016

let's say we have a hypothetical org that does not use the new clustering (e.g. single node, or multiple nodes that all get all data), and then we decide to split up data into shards, do you foresee any issues?
I think:

  • the MT's will update the partition info in their memory+cass/es idx, that part will work
  • some data will not load during the transition, because the nodes that now own the metricdefs will only have very recent points in RAM and what is in cassandra is only the last saved chunk (this is why our current MT has a warmup feature, but I'm not sure how a warmup would look with the clustering). one way to alleviate this is to postpone partition reassignments until right after a chunk save interval
  • there would be some data loss: let's say chunks save every hour (raw + aggregated), and at 10:02 we change our ingestion so that partition reassignments for metricdefs take effect. i'm not quite clear yet on how primary assignment will work, but either the data from 10:00-10:02 will be lost, or the data from 10:02-11:00; it depends on who gets to write the last chunk and if there's any overwriting.

@Dieterbe Dieterbe (Contributor) commented Nov 22, 2016

to extend a bit on my earlier question: the approach used in this PR uses:

  • a partition field in the metricdef. so memory, cass and ES will all store the partition for every single metricdef.
  • a partition field in cassandra and ES, to track partition for each def
  • cassandra has an index on the partition field, and ES an analyzer on it.

this seems pretty redundant & high-overhead:

  • cassandra and ES store the partition twice (in the metricdef, and as a separate field)
  • the index/analyzer has a cost.

since partition is low cardinality (with >10k or more metricdefs for every partition), it seems to me we can simply create separate tables in cassandra, or separate "indices" in elasticsearch for every partition. this way we don't have to include the partition in every metricdef (when stored), we don't need an extra field for every def, and we don't need the cost of the index/analyzer.

in memory, we may want to do something similar: basically just shard the in-memory structures by partition, instead of storing the partition for every single def.

(note: there is a catch here, a partition reassignment will also involve a delete from the old shard/table/index)

@Dieterbe Dieterbe (Contributor)

todo: (can be done after everything else stabilized) update {sample,docker,package} config, docs/config.md, README.md (limitations), docs/clustering.md, docs/http-api.md, docs/roadmap.md

@woodsaj woodsaj (Member Author) commented Nov 22, 2016

We need to store the partition in ES and Cassandra in an indexed field so we can query using it.
For ES, there is no further optimization we can do, but we can optimize a little better for cassandra. I do not like the idea of separate indexes in ES or tables in Cassandra for each partition. This will have a significant performance penalty. In Cassandra for example, writes are buffered in the memtables before being periodically flushed to SStables. If you have 64 partitions in 1 table you will likely see 1 write every flush interval. If you have 64 tables you will need 64 writes.

We need to know the partition every time we try to re-index a metric, as we need to know if the partition has changed.
The way the ESIndex works (specifically how it handles retries), we need the partition in the metricDef. It would be a significant amount of refactoring to change this behavior.

Keeping the partition in the metricDef makes handling things much easier. Given that the partition is an int32, storing it only costs us 4MB of RAM for every 1 million metricDefs. So it is my opinion that with the additional complexity needed to avoid storing the partition you are going to be worse off. And even if you are not worse off, I still don't think it is worth the effort.

For the cassandra index I have already considered moving from the whole metricDef in a msgp-encoded blob to columns for each field of the metricDef. This might be another reason to do that.

@woodsaj woodsaj (Member Author) commented Nov 22, 2016

> let's say we have a hypothetical org that does not use the new clustering (e.g. single node, or multiple nodes that all get all data), and then we decide to split up data into shards, do you foresee any issues?
I think the MT's will update the partition info in their memory+cass/es idx, and everything should just work.

If you are using kafkamdm and already using multiple partitions, then everything will just work nicely as the metricDefs will already include the correct partition.

If you change the number of partitions or change the hashing used, then the cluster might have a few issues, as we don't have great logic for dealing with multiple nodes believing they have a metric. But this would pass once the node that no longer sees the metric marks it as stale and deletes it from its memoryIdx.


func ConfigSetup() {
clusterCfg := flag.NewFlagSet("cluster", flag.ExitOnError)
clusterCfg.BoolVar(&primary, "primary-node", false, "the primary node writes data to cassandra. There should only be 1 primary node per cluster of nodes.")
Contributor

only one "per shardgroup"

@Dieterbe Dieterbe (Contributor) commented Nov 23, 2016

I can confirm again I got the following error:

2016/11/23 09:27:34 [metrictank.go:334 main()] [E] failed to initialize cassandra. java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 103be8f0-b15f-11e6-92df-7964cb8cd63c; expected 103902c0-b15f-11e6-92df-7964cb8cd63c)

and I can also confirm that making other MT instances wait for 1 of them (who does the initialisation) to start their api server is a good solution, better than 283c85f and 28c3f93 from #320

so this should be documented in our cluster documentation

@Dieterbe Dieterbe (Contributor) commented Dec 9, 2016

docs/cassandra.md also needs updating

@woodsaj woodsaj force-pushed the clustering branch 2 times, most recently from 42f4597 to b64f47f Compare December 12, 2016 03:12
func (s *Server) indexDelete(ctx *middleware.Context, req models.IndexDelete) {
defs, err := s.MetricIndex.Delete(req.OrgId, req.Query)
if err != nil {
// errors can only be caused by bad request.
Contributor

another case of callee implementation leaking into caller, where I think more specific, detailed errors should flow from the code that actually generates them.

Member Author

I agree with this sentiment. But setting the response status code based on what we know (or think) the returned errors can be is used a lot throughout the code base, e.g. this example from before the api refactoring: https://github.com/raintank/metrictank/blob/0.5.8/http.go#L294

so fixing it is outside the scope of this PR.

woodsaj and others added 7 commits December 12, 2016 20:35
- keep track of nodes in the cluster
- assign "partitions" to each node
- include partition info in the metric index for each metric
- update MetricIndex plugins to be partition aware
  - this means the data stored in the backend stores will change
    we will need to provide some migration tools.
- make carbon and kafkamdm input plugins partition aware
- when calling graphite's /render api fan out the request to
  one node for each partition
- when calling graphite's /metrics/find api fan out the requests
  to one node for each partition
- when calling graphite's /metrics/index.json api fan out the requests
  to one node for each partition
- add internal "cluster" api's to support the clustered graphite api
  requests
default primary to true so out of the box experience is nicer
that's how it is in docker and package configs also btw.
instead of encoding the metricDef to msgpack then storing as a
blob in Cassandra, lets just store each field as a column in cassandra.
This allows data to be encoded in cassandra's native format and also
makes troubleshooting easier as the metricDefs can be viewed/searched
directly in Cassandra.

This new branch shows a decrease in performance for writes primarily due to
cassandra overheads handling the secondary index on the partition id.
Reads also show a slight drop in throughput, which is again due to cassandra.
Metrictank actually sees a drop in allocations and overall bytes per op.

go test -run=None -v -bench=. -benchtime=10s
MASTER:
BenchmarkIndexing-4	  200000	     84400 ns/op	    2738 B/op	      56 allocs/op
BenchmarkLoad-4    	 1000000	     15864 ns/op	    2098 B/op	      22 allocs/op
------------------------
THIS BRANCH
BenchmarkIndexing-4	  200000	    106320 ns/op	    2973 B/op	      58 allocs/op
BenchmarkLoad-4    	  500000	     37836 ns/op	    1784 B/op	      15 allocs/op
@Dieterbe Dieterbe (Contributor) commented Dec 12, 2016

Seems like there are 6 things left to do:

Anthony Woods added 3 commits December 12, 2016 23:12
if errors implement the response.Error interface then we use the e.Code()
in the API response, else we assume an internal server error and set the
code to 500.
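A self-contained sketch of that pattern (the interface and type names here are illustrative, not the actual response package):

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// coder mirrors the idea in the commit message above: errors that know their
// own HTTP status code.
type coder interface {
	error
	Code() int
}

type notFoundErr struct{ msg string }

func (e notFoundErr) Error() string { return e.msg }
func (e notFoundErr) Code() int     { return http.StatusNotFound }

// statusFor uses e.Code() when the error carries one, else falls back to 500.
func statusFor(err error) int {
	if c, ok := err.(coder); ok {
		return c.Code()
	}
	return http.StatusInternalServerError
}

func main() {
	fmt.Println(statusFor(notFoundErr{"metric not found"})) // 404
	fmt.Println(statusFor(errors.New("cassandra timeout"))) // 500
}
```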
Anthony Woods added 2 commits December 13, 2016 00:41
@woodsaj woodsaj (Member Author) commented Dec 12, 2016

@Dieterbe the only outstanding item is to move the /cluster/index/get endpoint somewhere

@woodsaj woodsaj (Member Author) commented Dec 13, 2016

@Dieterbe LGTM
