reworking routing for performance and being more powerful #23

Dieterbe · 2014-09-29T01:57:39Z

Hi everybody.
I'd like to hear your thoughts on this. Points 1 and 2 are fairly obvious, but 3 could really use input on how you are using, or want to use, the relay.

1. regex has a lot of overhead. I don't think much can be done about this, so we should try to avoid regex matching when we can. I'm thinking of introducing "contains" and "starts_with" options for string checks in addition to regex, so that in many cases we could use those instead of regex, which will perform much better.
2. specifically, if the regex pattern is "", we shouldn't even do the regex Match() like we currently do.
3. I've noticed that first_only, while useful for some setups (see E below), prevents us from "also sending traffic elsewhere". so i started thinking about getting rid of first_only and making the routing/dispatching more powerful. that said, it's a neat mechanism because you can make routing decisions with just a bool check, no pattern checking needed.
   so, taking a step back, i'd like to collect use cases of how people want to use carbon-relay-ng, especially since i know some people are interested in (and working on) sharding/consistent hashing and round robin, and i've been pondering how to keep routing/dispatching performant, yet viable for various/all use cases (ideally without running multiple carbon-relay-ng instances).
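A minimal Go sketch of points 1 and 2, for illustration only. The `Matcher` type, its fields, and `NewMatcher` are hypothetical names made up for this example, not taken from the carbon-relay-ng code base. It runs the cheap prefix and substring checks first and only touches the regex engine when a non-empty pattern was configured:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Matcher is a hypothetical type for this sketch; the field names are
// assumptions, not carbon-relay-ng's actual API.
type Matcher struct {
	Prefix string         // empty means "no prefix check"
	Sub    string         // empty means "no substring check"
	Regex  *regexp.Regexp // nil means "no regex check"
}

// NewMatcher compiles the regex only when a pattern is given, so an empty
// pattern never reaches the regex engine.
func NewMatcher(prefix, sub, pattern string) (*Matcher, error) {
	m := &Matcher{Prefix: prefix, Sub: sub}
	if pattern != "" {
		re, err := regexp.Compile(pattern)
		if err != nil {
			return nil, err
		}
		m.Regex = re
	}
	return m, nil
}

// Match runs the cheap string checks first and the regex last.
// All configured checks must pass; a Matcher with nothing configured
// matches every line.
func (m *Matcher) Match(line string) bool {
	if m.Prefix != "" && !strings.HasPrefix(line, m.Prefix) {
		return false
	}
	if m.Sub != "" && !strings.Contains(line, m.Sub) {
		return false
	}
	if m.Regex != nil && !m.Regex.MatchString(line) {
		return false
	}
	return true
}

func main() {
	m, err := NewMatcher("servers.", ".cpu.", "")
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Match("servers.web1.cpu.user 0.5 1411955859"))
	fmt.Println(m.Match("apps.web1.cpu.user 0.5 1411955859"))
}
```

Whether the checks should be combined with AND (as here) or OR is a design choice this thread does not settle; AND is assumed only to keep the sketch short.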

I see the following use cases

A consistent hashing / round robin to a pool of servers, for load balancing and HA (each pool is just one route)
B mirroring the same load to multiple machines, possibly only a subset of the traffic selected by a regex pattern (for example to test influxdb next to graphite)
C sending a subset of metrics to anomaly detection
D sending all, or some, of the metrics to carbon-tagger
E poor man's load balancing: sending certain metrics to server A, some others to B and maybe some others to C, using regexes to control what goes where (maybe some storage is faster than others), and using first_only to make sure we don't store metrics on more than one system.

am i missing a case? first_only is really only used for E, but it doesn't really work anymore as soon as you want to send some metrics also elsewhere like in B, C or D.

What I'm thinking is, to keep the system powerful and still efficient, without overcomplicating things too much, we should implement E as a multi-endpoint route, just like we are approaching round robin and hashing/sharding. the key is that these are multiple endpoints within one route, instead of multiple routes at the "global level".
this way, at the global routing/dispatching level, we check the pattern (and string check) of every route, and if it matches, the route gets the metric. but a route can be a special one with multiple endpoints, and send traffic to those endpoints using a dedicated mechanism:

- round robin
- sharded
- further pattern matching within the route, with first_only
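The global level described above can be sketched in Go roughly as follows. This is an illustration under assumptions: the `Route` interface, `Table`, and `substringRoute` are hypothetical names, and a real route would write to TCP endpoints rather than append to a slice. The point is only that the table hands each line to every matching route, with no first_only check at this level:

```go
package main

import (
	"fmt"
	"strings"
)

// Route is a hypothetical interface for this sketch.
type Route interface {
	Match(line string) bool
	Dispatch(line string)
}

// substringRoute accepts lines containing sub (or every line when sub is
// empty) and records what it received instead of sending it anywhere.
type substringRoute struct {
	name     string
	sub      string
	received []string
}

func (r *substringRoute) Match(line string) bool {
	return r.sub == "" || strings.Contains(line, r.sub)
}

func (r *substringRoute) Dispatch(line string) {
	r.received = append(r.received, line)
}

// Table is the global level: every matching route gets the line.
type Table struct {
	routes []Route
}

// Dispatch hands line to all matching routes and returns how many matched.
func (t *Table) Dispatch(line string) int {
	n := 0
	for _, r := range t.routes {
		if r.Match(line) {
			r.Dispatch(line)
			n++
		}
	}
	return n
}

func main() {
	graphite := &substringRoute{name: "graphite"}                 // takes everything
	anomaly := &substringRoute{name: "anomaly", sub: ".latency."} // takes a subset
	table := &Table{routes: []Route{graphite, anomaly}}
	fmt.Println(table.Dispatch("app.web1.latency.p99 12 1411955859"))
	fmt.Println(table.Dispatch("app.web1.cpu.user 0.5 1411955859"))
}
```

How a route then spreads traffic over its own endpoints (round robin, sharding, first match) stays hidden behind `Dispatch`, which is what keeps use cases B, C and D from interfering with E.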

thoughts?
cc @robinbowes @rcrowley @pauloconnor @willowpet @pwielgolaski @shiaho @vladimir-smirnov-sociomantic @prune998 @curtisgithub

vladimir-smirnov-sociomantic · 2014-09-29T18:14:42Z

Maybe you can get rid of use case E by adding a "weight" to each endpoint. Anyway, balancing with regexps is very hard and inefficient, and I doubt that anybody will use it if they have A + weights.
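One possible reading of "A + weights", sketched in Go. This is an assumption about what weighted balancing could look like, not something the thread specifies: `WeightedEndpoint` and `WeightedPool` are hypothetical names, and the pool simply repeats each address in its rotation as many times as its weight.

```go
package main

import "fmt"

// WeightedEndpoint is a hypothetical destination with an integer weight.
type WeightedEndpoint struct {
	Addr   string
	Weight int
}

// WeightedPool cycles through endpoints in proportion to their weights.
// It is not safe for concurrent use.
type WeightedPool struct {
	order []string // each address repeated Weight times
	next  int
}

// NewWeightedPool expands the weights into a fixed rotation order.
func NewWeightedPool(eps []WeightedEndpoint) *WeightedPool {
	p := &WeightedPool{}
	for _, e := range eps {
		for i := 0; i < e.Weight; i++ {
			p.order = append(p.order, e.Addr)
		}
	}
	return p
}

// Next returns the address that should receive the next metric,
// or "" when the pool has no endpoints with a positive weight.
func (p *WeightedPool) Next() string {
	if len(p.order) == 0 {
		return ""
	}
	addr := p.order[p.next%len(p.order)]
	p.next++
	return addr
}

func main() {
	pool := NewWeightedPool([]WeightedEndpoint{
		{Addr: "fast:2003", Weight: 2},
		{Addr: "slow:2003", Weight: 1},
	})
	counts := map[string]int{}
	for i := 0; i < 6; i++ {
		counts[pool.Next()]++
	}
	fmt.Println(counts["fast:2003"], counts["slow:2003"])
}
```

Note that plain rotation like this sends consecutive metrics to the same endpoint in bursts and does not keep a given metric on one server; a sharded pool would need weights applied to the hash space instead.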

prune998 · 2014-09-29T18:18:37Z

I'm using carbon-relay-ng to feed old graphite-compatible apps into an influxdb server, so I'm not using any of the functionality you're talking about (right now). I don't plan to use it either, as I'll remove everything graphite as soon as all my clients have been migrated to native influxdb…

Thanks anyway.
Prune


Dieterbe · 2014-09-29T18:26:56Z

@vladimir-smirnov-sociomantic yeah, frankly i'm not so sure about E. I just added it because i assume that's why @rcrowley added the "first_only" parameter. i'd love to know @rcrowley's background/context so we can maybe get rid of the first_only stuff.

Dieterbe · 2014-09-29T18:29:50Z

looking at the initial commit, it seems @rcrowley did this so you could send all your staging and prod metrics to the same relay, but have it forward them to separate servers, like so:

+ carbon-relay-ng -f \\.staging\\.=1.2.3.4:2003 \\.production\\.=5.6.7.8:2003
+
+Note the use of `-f` to relay data only to the first matching route.

so that's basically E. it probably makes sense to keep support for this, but as a pool of endpoints within 1 route.

chjohnst · 2014-10-10T21:09:10Z

I am very interested in having load balancing similar to the C carbon relay floating around out there. I have several cyanite instances running (4 currently, but I expect that to double) and I am taking in steady streams of data from several applications in the company and from statsd, so being able to load balance that traffic (round robin) is very useful as a scale-out for me.

Dieterbe · 2014-10-11T17:23:06Z

ok so the more I think about it, the more the proposed design makes sense to me.
let me explain it a little better.

note: when I say "match" i mean you can match on one or more of: prefix_string, substring, or regex (you can save lots of performance by bypassing regex checks)
there's basically two levels of matching:

at the global level, we basically have a list of routes, and we send every incoming metric to all matching routes. there's no need to check for first_only flags here (see further down), and these matches will usually be very cheap because a route will often just take in all metrics, or use a simple substring.
this level is pretty similar to what we have today.
but each route can contain 1 or more tcp endpoints and the route is of a specific type.
the route_type describes a specific behavior that controls how to route the metrics that come into the route amongst the endpoints within a route. the types are as follows:
- send_all: send all metrics to all the defined endpoints (possibly, and probably commonly only 1 endpoint).
- send_first_match: send the metrics to the first endpoint that matches it. (i.e. this covers the E use case, but conveniently tucked away in a route_type and without needing a first_only variable)
- consistent hashing: the route is a CH pool
- round robin: the route is a RR pool.

the key benefit here is that we can combine various use cases without the route matching and expected behaviors getting in each other's ways.
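The four route types listed above could be sketched in Go like this. Everything here is illustrative: `Endpoint`, `Route`, `RouteType` and `Pick` are hypothetical names, endpoint matching is reduced to a substring check, and the hashing case uses a plain hash-modulo over the metric name as a stand-in, which is not real consistent hashing (adding or removing an endpoint would remap most keys).

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// Endpoint is a hypothetical destination. Sub is an optional substring
// match that is only consulted by SendFirstMatch.
type Endpoint struct {
	Addr string
	Sub  string
}

type RouteType int

const (
	SendAll RouteType = iota
	SendFirstMatch
	RoundRobin
	HashMod // simplified stand-in for a consistent hashing pool
)

// Route holds the endpoints and the behavior used to choose among them.
// It is not safe for concurrent use.
type Route struct {
	Type      RouteType
	Endpoints []Endpoint
	next      int // round robin position
}

// Pick returns the addresses that should receive line.
func (r *Route) Pick(line string) []string {
	if len(r.Endpoints) == 0 {
		return nil
	}
	switch r.Type {
	case SendAll:
		out := make([]string, 0, len(r.Endpoints))
		for _, e := range r.Endpoints {
			out = append(out, e.Addr)
		}
		return out
	case SendFirstMatch:
		for _, e := range r.Endpoints {
			if e.Sub == "" || strings.Contains(line, e.Sub) {
				return []string{e.Addr}
			}
		}
		return nil
	case RoundRobin:
		e := r.Endpoints[r.next%len(r.Endpoints)]
		r.next++
		return []string{e.Addr}
	case HashMod:
		// Hash only the metric name so all points of a series land together.
		key := line
		if i := strings.IndexByte(line, ' '); i >= 0 {
			key = line[:i]
		}
		h := fnv.New32a()
		h.Write([]byte(key))
		idx := h.Sum32() % uint32(len(r.Endpoints))
		return []string{r.Endpoints[idx].Addr}
	}
	return nil
}

func main() {
	// The staging/production split from the initial commit, as one route.
	r := &Route{Type: SendFirstMatch, Endpoints: []Endpoint{
		{Addr: "1.2.3.4:2003", Sub: ".staging."},
		{Addr: "5.6.7.8:2003", Sub: ".production."},
	}}
	fmt.Println(r.Pick("app.staging.web1.cpu 0.5 1411955859"))
	fmt.Println(r.Pick("app.production.web1.cpu 0.5 1411955859"))
}
```

Because the first-match logic lives inside one route, another route at the global level (say, a mirror to carbon-tagger) still sees every metric.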

thoughts/feedback ?

Dieterbe · 2014-10-13T20:45:53Z

i think i'm going to start working on this in a `next` branch, which will contain the next-gen routing and a new admin/telnet UI to match the new structure.

chjohnst · 2014-10-14T16:56:07Z

+1

Dieterbe · 2014-10-16T15:39:54Z

so I've been doing some work and pushed it to https://github.com/graphite-ng/carbon-relay-ng/tree/next

I would like fellow developers to look at https://github.com/graphite-ng/carbon-relay-ng/blob/next/HACKING.txt which lists the main todo's.
in particular, we need to adjust the http (and telnet) interface, and depending on how the http interface will shape up we can continue working out the api of the table/route/destination interfaces.

Dieterbe · 2014-10-20T23:14:58Z

i've pushed a bunch of work and did some very basic testing.
it works: it initializes, the routing seems to work, stats are now exposed via expvar instead of statsd (so we can also show them in the admin/telnet interfaces). needs more testing of course, and the admin ui's need to be updated.

vladimir-smirnov-sociomantic · 2014-10-21T09:07:50Z

Can you give an example of new config? I'll try to test it.

Dieterbe · 2014-10-21T12:18:43Z

i just run it with the included config for testing: `go build && ./carbon-relay-ng carbon-relay-ng.ini` (i made one change, to have a local spool_dir)

Dieterbe · 2014-10-22T22:59:50Z

pushed a bunch of updates again. it needs some more work around resending lines that were sent when a connection broke (there's also some test cases for this, that "almost" work), but for the most part it seems to work fine. i just ran it in our pipeline and it held up fine :) (without using the admin interfaces).

Dieterbe · 2014-10-27T15:33:39Z

it's been a month, progress has been going well enough. i merged next into master.
as documented in the new readme (https://github.com/graphite-ng/carbon-relay-ng#releases--versions) for now you need to check out v0.5 to get the last stable version with http admin ui, while we finish the new http admin ui for the new version

Dieterbe · 2014-11-25T20:22:43Z

the new relay (in master) is ready for more testing. the web and tcp admin interfaces are a WIP but you can run the relay from a config file and test.
for me it's been working well. better than the old version :) the hardest part was tuning the internal performance variables, see the included perf-tuning.md file for more info. curious for feedback (please open tickets for new issues) thanks.

randallt · 2018-01-15T21:38:04Z

The main readme states that round robin of a list of destinations in a route is not implemented. Is that accurate? Did this issue not include round-robin support?

Dieterbe · 2019-05-28T15:56:11Z

it may not have included it initially but the relay has supported it for a while now

Dieterbe mentioned this issue Sep 29, 2014

Routes list is not in any order #22

Closed

shiaho mentioned this issue Oct 10, 2014

First_only not working properly #27

Closed

Dieterbe closed this as completed Oct 27, 2014

Dieterbe mentioned this issue Aug 12, 2021

Plans on RR (round robin) implementation? #474

Open
