Initial implementation/release #1
Conversation
```erlang
[{usage, Usage},
 {rejected, Rejected},
 {in_rate, In},
 {out_out, Out},
```
Did you maybe mean out_rate here?
```erlang
case Value >= Limit of
    true ->
        ets:insert(ETS, {{full, Worker}, 1}),
        false;
```
Here, the worker becomes full after incrementing. But it wasn't full to begin with, so this request should return true anyway. This guy just took the last slot. The next request should return false. I totally found that by myself. It's not like it was Quickcheck and I had nothing to do with it. Not at all. :)
For any other readers: discussed with Joe, and there is a race condition here, but we believe the only effect would be to occasionally go over the limit with enough schedulers. Clients check and increment the counter here; the worker updates it to its actual mailbox size when processing a message. Messages not yet placed in the mailbox are unaccounted for.
I verified the comments were addressed. Joe and I know there are race conditions that are very hard to test strictly around the actual number of resources used vs. the desired thresholds, but we haven't found any actual problem beyond the thresholds not being exact numbers. The limits should never be set too low for this reason. The code performs overload protection well in manual tests and such. We should follow up with more tests and smarter ways to test around the fuzziness here, but as far as I'm concerned this does the job.
This is the initial release of sidejob.
Note: this library was originally written to support process bounding in Riak using the `sidejob_supervisor` behavior. In Riak, this is used to limit the number of concurrent get/put FSMs that can be active, failing client requests with `{error, overload}` if the limit is ever hit. The purpose is to provide a fail-safe mechanism during extreme overload scenarios.

sidejob is an Erlang library that implements a parallel, capacity-limited request pool. In sidejob, these pools are called resources. A resource is managed by multiple `gen_server`-like processes which can be sent calls and casts using `sidejob:call` or `sidejob:cast` respectively.

A resource has a fixed capacity. This capacity is split across all the workers, with each worker having a worker capacity of `resource capacity / num_workers`.

When sending a call/cast, sidejob dispatches the request to an available worker, where "available" means the worker has not reached its designated limit. Each worker maintains a usage count in a per-resource public ETS table. The process that tries to send a request reads and updates slots in this table to determine available workers.
This entire approach is implemented in a scalable manner. When a process tries to send a sidejob request, sidejob determines the Erlang scheduler id the process is running on and uses that to pick a particular worker in the pool to try first. If that worker is at its limit, the next worker in order is selected, until all workers have been tried. Since multiple concurrent processes in Erlang will be running on different schedulers, and therefore start at different offsets in the worker list, multiple concurrent requests can be attempted with little lock contention. Specifically, different processes will be touching different slots in the ETS table and hitting different ETS segment locks.
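The scheduler-keyed probing described above could be sketched roughly as follows. This is illustrative only: the function names are invented, and sidejob's actual dispatch logic (which also caches a per-worker `full` flag, as seen in the diff above) lives in generated module code.

```erlang
%% Illustrative sketch of scheduler-keyed worker selection.
pick_worker(ETS, NumWorkers, Limit) ->
    %% Start probing at an offset derived from the current scheduler,
    %% so concurrent senders touch different ETS slots.
    Start = erlang:system_info(scheduler_id) rem NumWorkers,
    try_workers(ETS, Start, NumWorkers, NumWorkers, Limit).

try_workers(_ETS, _Idx, 0, _Num, _Limit) ->
    {error, overload};                        % every worker was full
try_workers(ETS, Idx, Left, Num, Limit) ->
    Usage = ets:update_counter(ETS, Idx, 1), % optimistically claim a slot
    case Usage =< Limit of
        true ->
            {ok, Idx};                        % this worker has room
        false ->
            ets:update_counter(ETS, Idx, -1), % release; try the next worker
            try_workers(ETS, (Idx + 1) rem Num, Left - 1, Num, Limit)
    end.
```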
For a normal sidejob worker, the limit corresponds to the size of the worker's mailbox. Before sending a request to a worker, the sender increments the relevant usage value; after receiving the message, the worker decrements it. Thus, the total number of messages that can be sent to a set of sidejob workers is limited; in other words, a bounded process mailbox.
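The sender-increment / worker-decrement pairing can be pictured as below. Again a sketch under assumptions: the helper names and state shape are invented, and (per the review discussion above) the real worker resynchronizes the counter against its actual mailbox size rather than simply decrementing.

```erlang
%% Sender side (illustrative): the message is counted before it is sent,
%% so the usage value bounds the worker's mailbox from the outside.
send_request(ETS, Worker, Pid, Msg) ->
    ets:update_counter(ETS, Worker, 1),  % usage grows before the send
    Pid ! Msg,
    ok.

%% Worker side (illustrative): the slot is released once the message
%% has been taken out of the mailbox and handled.
handle_info(Msg, State = #{ets := ETS, id := Worker}) ->
    do_handle(Msg),
    ets:update_counter(ETS, Worker, -1), % usage shrinks after receipt
    {noreply, State}.
```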
However, sidejob workers can also implement custom usage strategies. For example, sidejob comes with the `sidejob_supervisor` worker that implements a parallel, capacity-limited supervisor for dynamic, transient children. In this case, the capacity being managed is the number of spawned children; trying to spawn additional children results in the standard `overload` response from sidejob.

In addition to providing a capacity limit, the `sidejob_supervisor` behavior is more scalable than a single OTP supervisor when multiple processes are constantly attempting to start new children via that supervisor, because there are multiple parallel workers rather than a single `gen_server` process. For example, Riak moved away from using supervisors to manage its get and put FSMs because the supervisor ended up being a bottleneck. Unfortunately, not using a supervisor made it hard to track the number of spawned children, return a list of child pids, etc. By moving to `sidejob_supervisor` for get/put FSM management, Riak can now easily track FSM pids without the scalability problems, in addition to having the ability to bound process growth.
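A hedged sketch of the supervisor variant in use (the `start_child` signature and `fsm_pool`/`my_fsm` names are assumptions for illustration; consult `sidejob_supervisor` itself for the real interface):

```erlang
%% Illustrative only: a supervisor-style resource bounded to 50
%% concurrent children, with the overload response surfaced to callers.
{ok, _} = sidejob:new_resource(fsm_pool, sidejob_supervisor, 50),

case sidejob_supervisor:start_child(fsm_pool, my_fsm, start_link, [Args]) of
    {error, overload} ->
        %% capacity reached: the child is never spawned, the caller
        %% gets a fast failure instead of unbounded process growth
        {error, overload};
    {ok, Pid} ->
        {ok, Pid}  % child counted against the pool until it exits
end.
```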