Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bound the number of get/put FSMs to prevent overload #544

Merged
merged 4 commits into from
May 11, 2013

Conversation

jtuple
Copy link
Contributor

@jtuple jtuple commented May 4, 2013

This pull-request implements FSM overload protection that prevents Riak from spawning an unbounded number of get/put FSMs, eventually reaching the maximum number of allowed processes and crashing.

The limit is configured by setting the riak_kv/fsm_limit application variable. Setting to undefined disables overload protection, falling back to the prior code path. This value cannot be changed at runtime, a node restart is necessary for a change to take effect.

The limiting is implemented using the sidejob library. There are two sidejob_supervisors, one for get FSMs and one for put FSMs. Each supervisor independently enforces the FSM limit, thus there can be at most (limit) get FSMs and (limit) put FSMs.

The response {error, overload} is returned whenever the limit is reached.

This pull-request requires the related pull-request: basho/sidejob#1

I will update this pull-request with benchmark results within the next day or so. As well as a link to the related riak_test, which tests both this change as well as the the new vnode overload protection.

jtuple added 3 commits May 3, 2013 13:06
This change prevents Riak from spawning an unbounded number of get/put
FSMs during overload scenarios, eventually reaching the maximum number
of allowed processes and crashing.

The limit is configured by setting the riak_kv/fsm_limit application
variable. Setting to undefined disables overload protection, falling
back to the prior code path. This value cannot be changed at runtime,
a node restart is necessary for a change to take effect.

The limiting is implemented using the sidejob library. There are two
sidejob_supervisors, one for get FSMs and one for put FSMs. Each
supervisor independently enforces the FSM limit, thus there can be
at most (limit) get FSMs and (limit) put FSMs.

The response `{error, overload}` is returned whenever the limit is
reached.
Conflicts:
	rebar.config
	src/riak_kv.app.src
	src/riak_kv_util.erl
The get/put monitors are no longer needed when using sidejob_supervisors
to manage get/put FSMs. This commit disables the use of get/put monitors
when overload protection is enabled, as well as changes the existing
`puts_active` and `gets_active` API to use either the get/put monitor
data or sidejob_supervisor data depending on which option is being used.
@ghost ghost assigned jtuple and engelsanchez May 7, 2013
{error, overload} ->
riak_kv_util:overload_reply(From);
{ok, _Pid} ->
ok
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to use the return value of riak_kv_get_fsm:start_link/4 anywhere, but it might be a good idea for sanity to have it return consistently. If sidejob is not enabled (first case), it returns {ok, Pid}, etc from gen_fsm:start_link, but if it is, we return ok, or the {ReqId, {error, overload}} from overload_reply.

@engelsanchez
Copy link
Contributor

I might jut being dumb here, but from reading the sidejob code and the resource name here, I was expecting there would be a riak_kv_fsm_{put,get}_sj module so that code like this https://github.com/basho/sidejob/pull/1/files#L12R99 calls functions width(), limit(), etc on it.

@jtuple
Copy link
Contributor Author

jtuple commented May 8, 2013

The resource information modules are generated, compiled, and loaded at runtime when a resource is created. Think mochiglobal. This is done by sidejob_config.erl when called here.

@engelsanchez
Copy link
Contributor

Got it, thanks.

@engelsanchez
Copy link
Contributor

I've been running some of the client tests on a Riak containing all overload protection changes merged, no issues so far (Java, Python, Ruby client is not happy but also fails on master). Running under basho_bench with a small ring, 2 schedulers and 300 concurrent workers demonstrates the desired behavior: the mailbox sizes stay under the desired size and overload responses are returned (also timeouts, etc).

@engelsanchez
Copy link
Contributor

The only comment on this PR is not a practical problem. This code and the sidejob library have been reviewed, discussed and tested manually without finding any problems. We should follow up by beefing up the automated tests, which is a really hard problem. But no real problems have been found and overload protection has been observed to work well so far.

👍 💃 ⛵

@jtuple jtuple merged commit 1ebdfaf into 1.3 May 11, 2013
@seancribbs seancribbs deleted the jdb-fsm-overload-1.3 branch April 1, 2015 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants