This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

[QUESTION] How to get Cherami to Spread the Load #284

Closed
djoodle opened this issue Aug 30, 2017 · 13 comments

Comments

@djoodle

djoodle commented Aug 30, 2017

I'm running a Cherami cluster using docker containers. I have:

  • 3x Frontend Hosts (0.5GB mem)
  • 3x Controllers (0.5GB mem)
  • 3x Inputhosts (2GB mem)
  • 3x Outputhosts (2GB mem)
  • 5x Storage hosts (4GB mem)

And Cassandra with 5 nodes and an RF of 3.
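
(For reference, the keyspace replication is declared roughly as in the sketch below; SimpleStrategy is an assumption on my part, and 'cherami'/'cassandra' match the MetadataConfig I've posted further down.)

# sketch only: how an RF=3 keyspace for the Cherami metadata would be declared (strategy is an assumption)
cqlsh cassandra -e "
  CREATE KEYSPACE IF NOT EXISTS cherami
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"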

However, when the system comes under heavy load, most of the boxes sit idle while a single storage host climbs to its memory limit and is then killed (OOM).

Is there a way to tell Cherami to spread the load across the other storage hosts? It seems all the extents are placed on a single host, and even when that host starts slowing down and falling behind, the load is never shared.

Cassandra is aware of all the nodes, and each node appears to be healthy.

Many thanks!

@doodle-tnw

For reference, my current configuration is:

DefaultServiceConfig:
  ListenAddress: "XXXX"
  RingHosts: "frontend-0.cherami:4922,frontend-1.cherami:4922,frontend-2.cherami:4922"
  EnableLimits: true

# ServiceConfig overrides default config with service-specific config such as ports
ServiceConfig:
  cherami-inputhost:
    Port: 4240
    WebsocketPort: 6189
    tchannel:
      port: 4240
      disableHyperbahn: true
      logLevel: warn
      disableLogging: true
  cherami-storehost:
    Port: 4253
    WebsocketPort: 6191
    tchannel:
      port: 4253
      disableHyperbahn: true
      logLevel: warn
      disableLogging: true
  cherami-outputhost:
    Port: 4254
    WebsocketPort: 6190
    tchannel:
      port: 4254
      disableHyperbahn: true
      logLevel: warn
      disableLogging: true
  cherami-frontendhost:
    Port: 4922
    tchannel:
      port: 4922
      disableHyperbahn: true
      logLevel: warn
      disableLogging: false
  cherami-controllerhost:
    StorePlacementConfig:
      MinFreeDiskSpaceBytes: 10000
    Port: 5425
    tchannel:
      port: 5425
      disableHyperbahn: true
      logLevel: warn
      disableLogging: false
  cherami-replicator:
    Port: 6280
    WebsocketPort: 6310
    tchannel:
      port: 6280
      disableHyperbahn: true
      logLevel: warn
      disableLogging: false

DefaultDestinationConfig:
  Replicas: 1

MetadataConfig:
  CassandraHosts: "cassandra"
  Keyspace: "cherami"
  Consistency: "one"
  ClusterName: "test-cluster"
  NumConns: 1

StorageConfig:
  BaseDir: /var/lib/cherami-store/data
  HostUUID: "XXX"

logging:
  level: info
  stdout: true

The environment peaks at maybe 4,000 messages a second and averages a couple of hundred. We see Cherami back up under the load at every peak, and when this happens one of the storehosts will usually have hit its 4GB cap while the others sit at around 100MB. The same happens with the outputhosts, except the affected one hits its 2GB cap while the others sit firmly at 20MB.

Any help from the active devs @datoug / @kobeyang / @kirg would be most appreciated, as this is the final stumbling block for us with Cherami.

@datoug
Contributor

datoug commented Aug 30, 2017

@danudell-trustnetworks, @djoodle I suspect this is caused by a bug. I sent out a PR for this:
#285

Could you try it out? You can use the 'distance' branch.
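
Roughly, picking up that branch and rebuilding looks something like the sketch below (the image name/tag and the build path are just placeholders for whatever your docker setup uses):

# sketch: check out the 'distance' branch and rebuild your images from it
git clone https://github.com/uber/cherami-server.git
cd cherami-server
git checkout distance
# rebuild/redeploy your cherami containers from this checkout;
# the image tag and Dockerfile location below are placeholders
docker build -t cherami-server:distance .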

@doodle-tnw

Hi @datoug, thanks for getting back to me quickly.

So now no data is flowing through Cherami; however, I'm wondering if this is due to other updates since I last pulled the image.

time="2017-08-31T11:33:29Z" level=error msg="ListConsumerGroupsByDstID failed" ctrlID=48fd47d2 deploymentName= destID=f337b32e-4db1-4a8a-bdc0-2ccc33fbad51 err="InternalServiceError({Message:consumer_group of type frozen<consumer_group> has no field options})" module=extentMon

I see this error on the controllers. I cannot list consumer groups via the CLI tool, only destinations.

@doodle-tnw

Actually, you can ignore that; it seems the upgrade wasn't very clean. I'll run it this evening and let you know how it goes.

@datoug
Contributor

datoug commented Aug 31, 2017

Yeah, from the error it seems your cherami-thrift is not up to date.
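
If you're building from source, refreshing the vendored cherami-thrift should take care of that; a rough sketch, assuming the dependencies are still managed with glide (adjust to your build setup):

cd cherami-server
# refresh vendored deps, including cherami-thrift (assumes glide-managed vendoring)
glide install
# or, to pull newer pinned versions:
glide update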

@doodle-tnw

OK, so now that I've redeployed, the resource profile looks much more sensible: each host is using about the same amount of memory. The system is quiet overnight, so the real test will be tomorrow morning when it comes under load. I'll let you know then and you can close this.

Thanks again for all the help @datoug !

@doodle-tnw

Hi @datoug, sorry for being slow to get back to you. The fix seems to have worked well; the load is far more evenly balanced across the cluster.

Would you guys be open to pull requests around things like example k8s configs or documentation additions?

@doodle-tnw

@datoug is this fix being re-created or dropped? I just noticed it's been reverted out of master.

@datoug
Contributor

datoug commented Sep 5, 2017

@danudell-trustnetworks Yes, PRs are welcome.

My patch had some test issues. I'm investigating and will re-merge it once they're resolved.

@doodle-tnw

@datoug Cheers, we'll stick with the branch for now until you can sort out the test failures.

@datoug
Contributor

datoug commented Sep 18, 2017

@danudell-trustnetworks on it now, will update you today or tomorrow.

@datoug
Contributor

datoug commented Sep 19, 2017

@danudell-trustnetworks I have landed #294; could you try the latest master branch?
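
For completeness, switching back from the 'distance' branch is just the usual (the rebuild/redeploy step is the same placeholder as before):

cd cherami-server
git checkout master
git pull origin master
# then rebuild and redeploy the cherami services as before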

@doodle-tnw

Yeah, that all works, cheers! You can close this one.

@datoug datoug closed this as completed Sep 20, 2017