reworking routing for performance and being more powerful #23

Closed
Dieterbe opened this issue Sep 29, 2014 · 17 comments

@Dieterbe
Contributor

Hi everybody.
I'd like to hear your thoughts on this. 1 and 2 are fairly obvious, but 3 could really use input on how you are using, or want to use, the relay.

  1. regex has a lot of overhead. I don't think much can be done about this, so we should avoid regex matching when we can. I'm thinking of introducing "contains" and "starts_with" options for string checks in addition to regex, so that in many cases we could use those instead of regex and get much better performance.

  2. specifically, if the regex pattern is "", we shouldn't even do the regex Match() like we currently do.

  3. I've noticed that first_only, while useful for some setups (see E below), prevents us from "also sending traffic elsewhere". so i started thinking about getting rid of first_only and making the routing/dispatching more powerful. that said, it's a neat mechanism because you can make routing decisions with just a bool check, no pattern checking needed.
    so, taking a step back, i'd like to collect use cases of how people want to use carbon-relay-ng, especially since i know some people are interested in (and working on) sharding/consistent hashing and round robin, and i've been pondering how to keep routing/dispatching performant, yet viable for various/all use cases (ideally without running multiple carbon-relay-ng instances).
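Points 1 and 2 can be illustrated with a minimal Go sketch. The `Matcher` type, its fields, and `NewMatcher` are hypothetical names chosen for this example, not the relay's actual API. The idea is to run the cheap prefix and substring checks first, and to never touch the regex engine when the pattern is empty.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Matcher is a hypothetical type (not carbon-relay-ng's real API) that
// combines a prefix check, a substring check and an optional regex.
type Matcher struct {
	Prefix string         // empty means "no prefix check"
	Sub    string         // empty means "no substring check"
	Regex  *regexp.Regexp // nil means "no regex check"
}

// NewMatcher compiles the regex only when a pattern is given, so an
// empty pattern never reaches the regex engine.
func NewMatcher(prefix, sub, pattern string) (Matcher, error) {
	m := Matcher{Prefix: prefix, Sub: sub}
	if pattern != "" {
		re, err := regexp.Compile(pattern)
		if err != nil {
			return Matcher{}, err
		}
		m.Regex = re
	}
	return m, nil
}

// Match runs the cheap string checks first and the regex last.
func (m Matcher) Match(metric string) bool {
	if m.Prefix != "" && !strings.HasPrefix(metric, m.Prefix) {
		return false
	}
	if m.Sub != "" && !strings.Contains(metric, m.Sub) {
		return false
	}
	if m.Regex != nil && !m.Regex.MatchString(metric) {
		return false
	}
	return true
}

func main() {
	all, _ := NewMatcher("", "", "")
	prod, _ := NewMatcher("", ".production.", "")
	line := "app.staging.cpu 1 1412000000"
	fmt.Println(all.Match(line), prod.Match(line))
}
```

A matcher with all three settings empty accepts every metric without doing any work, which is the common "route takes everything" case.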

I see the following use cases

A consistent hashing / round robin to a pool of servers, for loadbalancing and HA (each pool is just one route)
B mirroring the same load to multiple machines, possibly a subset of traffic using regex pattern (for example to test influxdb next to graphite)
C sending a subset of metrics to anomaly detection
D sending all, or some, of the metrics to carbon-tagger
E poor man's load balancing: sending certain metrics to server A, some others to B and maybe some others to C, using regexes to control what goes where (maybe some storage is faster than others), and using first_only to make sure we don't store metrics on more than one system.

am i missing a case? first_only is really only used for E, but it doesn't really work anymore as soon as you want to send some metrics also elsewhere like in B, C or D.

What I'm thinking is, to keep the system powerful and still efficient, without overcomplicating things too much, we should implement E as a multi-endpoint route, just like we are approaching round robin and hashing/sharding. the key is that these are multiple endpoints within one route, instead of multiple routes at the "global level".
this way, on the global routing/dispatching level, we check the pattern (and string check) of every route, and if it matches, the route gets it. but a route can be a special one with multiple endpoints,
and send traffic to those endpoints using a dedicated mechanism:

  • round robin
  • sharded
  • further pattern matching within the route, with first_only

thoughts?
cc @robinbowes @rcrowley @pauloconnor @willowpet @pwielgolaski @shiaho @vladimir-smirnov-sociomantic @prune998 @curtisgithub

@vladimir-smirnov-sociomantic

Maybe you can get rid of use case E by adding a "weight" to each endpoint. Anyway, balancing with regexps is very hard and inefficient, and I doubt that anybody will use it if they have A + weights.

@prune998

I’m using carbon-relay-ng to feed old graphite-compatible apps into an influxdb server, so I’m not using any of the functionality you’re talking about (right now). I don’t plan to use it either, as I’ll remove everything graphite as soon as all my clients have been migrated to native influxdb…

Thanks anyway.
Prune


@Dieterbe
Contributor Author

@vladimir-smirnov-sociomantic yeah frankly i'm not so sure about E. I just added it because i assume that's why @rcrowley added the "first_only" parameter. i'd love to know @rcrowley's background/context so we can maybe get rid of the first_only stuff.

@Dieterbe
Contributor Author

looking at the initial commit, it seems @rcrowley did this so you could send all your staging and prod metrics to the same relay, but have it forward them to separate servers, like so:

+ carbon-relay-ng -f \\.staging\\.=1.2.3.4:2003 \\.production\\.=5.6.7.8:2003
+
+Note the use of `-f` to relay data only to the first matching route.

so that's basically E, it probably makes sense to keep support for this, but as a pool of endpoints within 1 route.

@chjohnst

I am very interested in having load balancing similar to the C carbon relay floating around out there. I have several cyanite instances running (4 currently, but expecting that to double) and I am taking in steady streams of data from several applications in the company and from statsd, so being able to load balance that traffic (round robin) is very useful as a scale-out for me.

@Dieterbe
Contributor Author

ok so the more I think about it, the more the proposed design makes sense to me.
let me explain it a little better.

note: when I say "match" i mean you can match on one or more of: prefix_string, substring, or regex (you can save lots of performance by bypassing regex checks)
there's basically two levels of matching:

  • in the global level, we basically have a list of routes, and we send every incoming metric to all matching routes. there's no need to check for first_only flags here (see further down), and these matches will very commonly be very cheap because a route will often just take in all metrics, or use a simple substring.
    this level is pretty similar to what we have today.
  • but each route can contain 1 or more tcp endpoints and the route is of a specific type.
    the route_type describes a specific behavior that controls how to route the metrics that come into the route amongst the endpoints within a route. the types are as follows:
    • send_all: send all metrics to all the defined endpoints (possibly, and probably commonly only 1 endpoint).
    • send_first_match: send the metrics to the first endpoint that matches it. (i.e. this covers the E use case, but conveniently tucked away in a route_type and without needing a first_only variable)
    • consistent hashing: the route is a CH pool
    • round robin: the route is a RR pool.

the key benefit here is that we can combine various use cases without the route matching and expected behaviors getting in each other's ways.
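The two levels described above can be sketched as a small dispatch function. Everything here is an assumption made for illustration: the `Route`, `Dest` and `RouteType` names are hypothetical, matching is reduced to a substring check, only send_all and send_first_match are shown, and `mirror.example:2003` is a made-up address.

```go
package main

import (
	"fmt"
	"strings"
)

// RouteType selects how a route spreads metrics over its destinations.
type RouteType int

const (
	SendAll        RouteType = iota // every matching destination gets the metric
	SendFirstMatch                  // only the first matching destination gets it
)

// Dest is one endpoint inside a route; an empty Sub matches everything.
type Dest struct {
	Sub  string
	Addr string
}

// Route is one entry at the global level.
type Route struct {
	Sub   string
	Type  RouteType
	Dests []Dest
}

func matches(sub, metric string) bool {
	return sub == "" || strings.Contains(metric, sub)
}

// Dispatch offers the metric to every matching route (global level) and
// lets each route pick destinations according to its type (route level).
func Dispatch(table []Route, metric string) []string {
	var out []string
	for _, r := range table {
		if !matches(r.Sub, metric) {
			continue
		}
		for _, d := range r.Dests {
			if !matches(d.Sub, metric) {
				continue
			}
			out = append(out, d.Addr)
			if r.Type == SendFirstMatch {
				break
			}
		}
	}
	return out
}

func main() {
	table := []Route{
		{Type: SendFirstMatch, Dests: []Dest{
			{".staging.", "1.2.3.4:2003"},
			{".production.", "5.6.7.8:2003"},
		}},
		{Type: SendAll, Dests: []Dest{{"", "mirror.example:2003"}}},
	}
	fmt.Println(Dispatch(table, "app.staging.cpu 1 1412000000"))
}
```

Here the staging metric is stored on exactly one server by the first route and still mirrored by the second, which is the combination that a global first_only flag made impossible.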

thoughts/feedback ?

@Dieterbe
Contributor Author

think i'm going to start working on this in a next branch, which will contain the next gen routing, and a new admin/telnet UI to match the new structure.

@chjohnst

+1

@Dieterbe
Contributor Author

so I've been doing some work and pushed it to https://github.com/graphite-ng/carbon-relay-ng/tree/next

I would like fellow developers to look at https://github.com/graphite-ng/carbon-relay-ng/blob/next/HACKING.txt which lists the main todo's.
in particular, we need to adjust the http (and telnet) interface, and depending on how the http interface will shape up we can continue working out the api of the table/route/destination interfaces.

@Dieterbe
Contributor Author

i've pushed a bunch of work and did some very basic testing.
it works: it initializes, the routing seems to work, stats are now exposed via expvar instead of statsd (so we can also show them in the admin/telnet interfaces). needs more testing of course, and the admin ui's need to be updated.

@vladimir-smirnov-sociomantic

Can you give an example of new config? I'll try to test it.

@Dieterbe
Contributor Author

i just run it with the included config for testing: `go build && ./carbon-relay-ng carbon-relay-ng.ini` (i made one change, to have a local spool_dir)

@Dieterbe
Contributor Author

pushed a bunch of updates again. it needs some more work around resending lines that were sent when a connection broke (there's also some test cases for this, that "almost" work), but for the most part it seems to work fine. i just ran it in our pipeline and it held up fine :) (without using the admin interfaces).

@Dieterbe
Contributor Author

it's been a month, progress has been going well enough. i merged next into master.
as documented in the new readme (https://github.com/graphite-ng/carbon-relay-ng#releases--versions) for now you need to check out v0.5 to get the last stable version with http admin ui, while we finish the new http admin ui for the new version

@Dieterbe
Contributor Author

the new relay (in master) is ready for more testing. the web and tcp admin interfaces are a WIP but you can run the relay from a config file and test.
for me it's been working well. better than the old version :) the hardest part was tuning the internal performance variables, see the included perf-tuning.md file for more info. curious for feedback (please open tickets for new issues) thanks.

@randallt

The main readme states that round robin of a list of destinations in a route is not implemented. Is that accurate? Did this issue not include round-robin support?

@Dieterbe
Contributor Author

it may not have included it initially but the relay has supported it for a while now
