-
Notifications
You must be signed in to change notification settings - Fork 233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bound the number of get/put FSMs to prevent overload #544
Conversation
This change prevents Riak from spawning an unbounded number of get/put FSMs during overload scenarios, eventually reaching the maximum number of allowed processes and crashing. The limit is configured by setting the riak_kv/fsm_limit application variable. Setting to undefined disables overload protection, falling back to the prior code path. This value cannot be changed at runtime, a node restart is necessary for a change to take effect. The limiting is implemented using the sidejob library. There are two sidejob_supervisors, one for get FSMs and one for put FSMs. Each supervisor independently enforces the FSM limit, thus there can be at most (limit) get FSMs and (limit) put FSMs. The response `{error, overload}` is returned whenever the limit is reached.
Conflicts: rebar.config src/riak_kv.app.src src/riak_kv_util.erl
The get/put monitors are no longer needed when using sidejob_supervisors to manage get/put FSMs. This commit disables the use of get/put monitors when overload protection is enabled, as well as changes the existing `puts_active` and `gets_active` API to use either the get/put monitor data or sidejob_supervisor data depending on which option is being used.
{error, overload} -> | ||
riak_kv_util:overload_reply(From); | ||
{ok, _Pid} -> | ||
ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't seem to use the return value of riak_kv_get_fsm:start_link/4 anywhere, but it might be a good idea for sanity to have it return consistently. If sidejob is not enabled (first case), it returns {ok, Pid}, etc from gen_fsm:start_link, but if it is, we return ok, or the {ReqId, {error, overload}} from overload_reply.
I might jut being dumb here, but from reading the sidejob code and the resource name here, I was expecting there would be a riak_kv_fsm_{put,get}_sj module so that code like this https://github.com/basho/sidejob/pull/1/files#L12R99 calls functions width(), limit(), etc on it. |
The resource information modules are generated, compiled, and loaded at runtime when a resource is created. Think |
Got it, thanks. |
I've been running some of the client tests on a Riak containing all overload protection changes merged, no issues so far (Java, Python, Ruby client is not happy but also fails on master). Running under basho_bench with a small ring, 2 schedulers and 300 concurrent workers demonstrates the desired behavior: the mailbox sizes stay under the desired size and overload responses are returned (also timeouts, etc). |
The only comment on this PR is not a practical problem. This code and the sidejob library have been reviewed, discussed and tested manually without finding any problems. We should follow up by beefing up the automated tests, which is a really hard problem. But no real problems have been found and overload protection has been observed to work well so far. 👍 💃 ⛵ |
This pull-request implements FSM overload protection that prevents Riak from spawning an unbounded number of get/put FSMs, eventually reaching the maximum number of allowed processes and crashing.
The limit is configured by setting the
riak_kv/fsm_limit
application variable. Setting toundefined
disables overload protection, falling back to the prior code path. This value cannot be changed at runtime, a node restart is necessary for a change to take effect.The limiting is implemented using the sidejob library. There are two sidejob_supervisors, one for get FSMs and one for put FSMs. Each supervisor independently enforces the FSM limit, thus there can be at most (limit) get FSMs and (limit) put FSMs.
The response
{error, overload}
is returned whenever the limit is reached.This pull-request requires the related pull-request: basho/sidejob#1
I will update this pull-request with benchmark results within the next day or so. As well as a link to the related riak_test, which tests both this change as well as the the new vnode overload protection.