This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

[QUESTION] How to get Cherami to Spread the Load #284

Closed
djoodle opened this issue Aug 30, 2017 · 13 comments

Comments

@djoodle

djoodle commented Aug 30, 2017

I'm running a Cherami cluster using docker containers. I have:

  • 3x Frontend Hosts (0.5GB mem)
  • 3x Controllers (0.5GB mem)
  • 3x Inputhosts (2GB mem)
  • 3x Outputhosts (2GB mem)
  • 5x Storage hosts (4GB mem)

And Cassandra with 5 nodes and an RF of 3.
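
(For reference, the keyspace replication is declared roughly as in the sketch below; SimpleStrategy is an assumption on my part, and 'cherami'/'cassandra' match the MetadataConfig I've posted further down.)

# sketch only: how an RF=3 keyspace for the Cherami metadata would be declared (strategy is an assumption)
cqlsh cassandra -e "
  CREATE KEYSPACE IF NOT EXISTS cherami
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"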

However, when the system comes under heavy load, most of the boxes sit idle while a single storage host climbs to its memory limit and is then killed (OOM).

Is there a way to tell Cherami to spread the load across the other storage hosts? It seems all the extents are placed on a single host, and even when that host starts slowing down and falling behind, the load is never shared.

Cassandra is aware of all the nodes, and each node appears to be healthy.

Many thanks!

@doodle-tnw

For reference, my current configuration is:

DefaultServiceConfig:
  ListenAddress: "XXXX"
  RingHosts: "frontend-0.cherami:4922,frontend-1.cherami:4922,frontend-2.cherami:4922"
  EnableLimits: true

# ServiceConfig overrides default config with service-specific config such as ports
ServiceConfig:
  cherami-inputhost:
    Port: 4240
    WebsocketPort: 6189
    tchannel:
      port: 4240
      disableHyperbahn: true
      logLevel: warn
      disableLogging: true
  cherami-storehost:
    Port: 4253
    WebsocketPort: 6191
    tchannel:
      port: 4253
      disableHyperbahn: true
      logLevel: warn
      disableLogging: true
  cherami-outputhost:
    Port: 4254
    WebsocketPort: 6190
    tchannel:
      port: 4254
      disableHyperbahn: true
      logLevel: warn
      disableLogging: true
  cherami-frontendhost:
    Port: 4922
    tchannel:
      port: 4922
      disableHyperbahn: true
      logLevel: warn
      disableLogging: false
  cherami-controllerhost:
    StorePlacementConfig:
      MinFreeDiskSpaceBytes: 10000
    Port: 5425
    tchannel:
      port: 5425
      disableHyperbahn: true
      logLevel: warn
      disableLogging: false
  cherami-replicator:
    Port: 6280
    WebsocketPort: 6310
    tchannel:
      port: 6280
      disableHyperbahn: true
      logLevel: warn
      disableLogging: false

DefaultDestinationConfig:
  Replicas: 1

MetadataConfig:
  CassandraHosts: "cassandra"
  Keyspace: "cherami"
  Consistency: "one"
  ClusterName: "test-cluster"
  NumConns: 1

StorageConfig:
  BaseDir: /var/lib/cherami-store/data
  HostUUID: "XXX"

logging:
  level: info
  stdout: true

The environment peaks at maybe 4,000 messages a second and averages a couple of hundred. We see Cherami back up under the load at every peak, and when this happens one of the storehosts will usually have hit its 4GB cap while the others sit at around 100MB. The same happens with the outputhosts, except the affected one hits its 2GB cap while the others sit firmly at 20MB.

Any help from the active devs @datoug / @kobeyang / @kirg would be most appreciated, as this is the final stumbling block for us with Cherami.

@datoug
Contributor

datoug commented Aug 30, 2017

@danudell-trustnetworks, @djoodle I suspect this is caused by a bug. I sent out a PR for this:
#285

Could you try it out? You can use the 'distance' branch.
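
Roughly, picking up that branch and rebuilding looks something like the sketch below (the image name/tag and the build path are just placeholders for whatever your docker setup uses):

# sketch: check out the 'distance' branch and rebuild your images from it
git clone https://github.com/uber/cherami-server.git
cd cherami-server
git checkout distance
# rebuild/redeploy your cherami containers from this checkout;
# the image tag and Dockerfile location below are placeholders
docker build -t cherami-server:distance .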

@doodle-tnw

Hi @datoug, thanks for getting back to me quickly.

So now no data is flowing through Cherami; however, I'm wondering if this is due to other updates since I last pulled the image.

time="2017-08-31T11:33:29Z" level=error msg="ListConsumerGroupsByDstID failed" ctrlID=48fd47d2 deploymentName= destID=f337b32e-4db1-4a8a-bdc0-2ccc33fbad51 err="InternalServiceError({Message:consumer_group of type frozen<consumer_group> has no field options})" module=extentMon

I see this error on the controllers. I cannot list consumer groups via the CLI tool, only destinations.

@doodle-tnw

Actually, you can ignore that; it seems the upgrade wasn't very clean. I'll run it this evening and let you know how it goes.

@datoug
Contributor

datoug commented Aug 31, 2017

Yeah, from the error it seems your cherami-thrift is not up to date.
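
If you're building from source, refreshing the vendored cherami-thrift should take care of that; a rough sketch, assuming the dependencies are still managed with glide (adjust to your build setup):

cd cherami-server
# refresh vendored deps, including cherami-thrift (assumes glide-managed vendoring)
glide install
# or, to pull newer pinned versions:
glide update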

@doodle-tnw

OK, so now that I've redeployed, the resource profile looks much more sensible: each host is using about the same amount of memory. The system is quiet overnight, so the real test will be tomorrow morning when it comes under load. I'll let you know then and you can close this.

Thanks again for all the help @datoug !

@doodle-tnw

Hi @datoug, sorry for being slow to get back to you. The fix seems to have worked well; the load is far more evenly balanced across the cluster.

Would you guys be open to pull requests around things like example k8s configs or documentation additions?

@doodle-tnw

@datoug is this fix being re-created or dropped? I just noticed it's been reverted out of master.

@datoug
Contributor

datoug commented Sep 5, 2017

@danudell-trustnetworks Yes, PRs are welcome.

My patch had some test issues. I'm investigating and will re-merge it once they're resolved.

@doodle-tnw

@datoug Cheers, we'll stick with the branch for now until you can sort out the test failures.

@datoug
Contributor

datoug commented Sep 18, 2017

@danudell-trustnetworks on it now, will update you today or tomorrow.

@datoug
Contributor

datoug commented Sep 19, 2017

@danudell-trustnetworks I have landed #294; could you try the latest master branch?
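
For completeness, switching back from the 'distance' branch is just the usual (the rebuild/redeploy step is the same placeholder as before):

cd cherami-server
git checkout master
git pull origin master
# then rebuild and redeploy the cherami services as before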

@doodle-tnw

Yeah, that all works, cheers! You can close this one.

@datoug datoug closed this as completed Sep 20, 2017