Mike Perham edited this page Mar 2, 2021 · 23 revisions

Faktory is a background job server. Faktory workers are necessary to execute those jobs.

Lifecycle

A Faktory worker process uses one or more threads to execute jobs. The worker lifecycle has four steps:

  • Connect
  • Fetch
  • Execute
  • Report Result

Network Connection

Each producer and consumer process opens one or more TCP connections to Faktory. These connections are designed to be long-lasting.

The Faktory protocol is line-oriented. Most messages from the client worker to Faktory follow the general format: VERB {JSON}.

All messages from Faktory follow the Redis protocol format. For example, when sending an OK response, the server actually writes the bytes +OK\r\n.
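As a minimal sketch of the framing described above, the following Python encodes a client message as VERB {JSON} and decodes one Redis-protocol reply. The function names encode_command and read_reply are hypothetical, and the trailing CRLF on client messages is an assumption based on the line-oriented format; only the simple string, error, and bulk string reply types are handled.

```python
import io
import json


def encode_command(verb, payload=None):
    """Format a client message as `VERB {JSON}`.

    Assumption: client lines end with CRLF, mirroring the server's replies.
    """
    if payload is None:
        line = verb
    else:
        line = f"{verb} {json.dumps(payload, separators=(',', ':'))}"
    return line.encode("utf-8") + b"\r\n"


def read_reply(stream):
    """Read one Redis-protocol reply from a binary file-like object.

    Returns a str for simple strings, bytes (or None) for bulk strings,
    and raises RuntimeError for error replies.
    """
    line = stream.readline()
    if not line.endswith(b"\r\n"):
        raise ConnectionError("connection closed mid-reply")
    kind, rest = line[:1], line[1:-2]
    if kind == b"+":                    # simple string, e.g. +OK
        return rest.decode("utf-8")
    if kind == b"-":                    # error reply
        raise RuntimeError(rest.decode("utf-8"))
    if kind == b"$":                    # bulk string: $<len>\r\n<data>\r\n
        length = int(rest)
        if length < 0:                  # $-1 is the null bulk string
            return None
        data = stream.read(length + 2)  # payload plus its trailing CRLF
        return data[:-2]
    raise ValueError(f"unexpected reply type: {kind!r}")
```

With a real connection, the stream could be the object returned by socket.makefile("rb"); an in-memory io.BytesIO works the same way for experiments.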

Initial Handshake

On initial connection, Faktory immediately sends a HI message to the client with a JSON hash. The "v" attribute is the version of the protocol the server expects and is a monotonically increasing number. The version will be bumped any time there is a change in protocol, even minor. After Faktory 1.0 is released, any breaking protocol change will be denoted with a major version bump in Faktory itself. If the server protocol version is larger than a client expects, the client should print a message recommending a client upgrade.

# no password
HI {"v":2}
# password required
HI {"v":2,"s":"123456789abc","i":1735}

If password authentication is required, as in the second HI above, the hash will include nonce and iterations attributes (the "s" and "i" attributes, respectively).

The client must send a HELLO response. When the HI includes a nonce, the HELLO must include a pwdhash parameter, where pwdhash is calculated like so:

data = password+nonce
for i=0; i<iterations; i++ {
  data = sha256(data)
}
hex(data)
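The pseudocode above can be sketched in Python as follows. This assumes the password and nonce are concatenated as UTF-8 text and that each round hashes the raw digest bytes of the previous round (not its hex string); the function name pwdhash is chosen here to match the HELLO attribute.

```python
import hashlib


def pwdhash(password, nonce, iterations):
    """Apply SHA-256 `iterations` times to password+nonce, return hex."""
    data = (password + nonce).encode("utf-8")  # assumed UTF-8 encoding
    for _ in range(iterations):
        data = hashlib.sha256(data).digest()   # feed raw bytes to next round
    return data.hex()
```

For the password-protected HI shown earlier, the call would be pwdhash(password, "123456789abc", 1735).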

A resulting HELLO for a worker process might look like this:

HELLO {
 "hostname":"MikeBookPro.local",
 "wid":"4qpc2443vpvai",
 "pid":2676,
 "labels":["golang"],
 "pwdhash":"1e440e3f3d2db545e9129bb4b63121b6b09d594dae4344d1d2a309af0e2acac1",
 "v":2
}
> OK

If the process is a producer (only pushing jobs, not executing them), then the HELLO can be as simple as HELLO {"v":2} when no password is required.

If successful, the server responds with "OK" and the connection can now use the full Faktory command set.

Definitions:

  • wid - worker id, a unique random string for every worker process
  • hostname/pid - specifics about the machine and process for this worker
  • labels - application-specific labels, shown in the Web UI
  • pwdhash - used to authenticate each connection
  • v - the protocol version the client expects

Hostname and PID are informational, for debugging use in the Web UI, but not useful in all environments (e.g. Heroku, containers).
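To tie the handshake together, here is a rough Python sketch that parses a HI line, warns when the server's protocol version is newer than the client's, and builds a worker HELLO. The helper names (parse_hi, build_hello), the use of secrets.token_hex for the wid, and the wording of the warning are all assumptions for illustration; the pwdhash value is expected to be computed separately.

```python
import json
import os
import secrets
import socket
import sys

CLIENT_PROTOCOL_VERSION = 2  # the version this hypothetical client speaks


def parse_hi(line):
    """Parse a `HI {...}` line into a dict."""
    verb, _, body = line.partition(" ")
    if verb != "HI":
        raise ValueError(f"expected HI, got {verb!r}")
    return json.loads(body)


def build_hello(hi, labels=(), pwdhash=None):
    """Build the HELLO message for a worker process."""
    if hi["v"] > CLIENT_PROTOCOL_VERSION:
        print("warning: server protocol is newer than this client; "
              "consider upgrading the client", file=sys.stderr)
    hello = {
        "hostname": socket.gethostname(),
        "wid": secrets.token_hex(8),   # unique random string per process
        "pid": os.getpid(),
        "labels": list(labels),
        "v": CLIENT_PROTOCOL_VERSION,
    }
    if "s" in hi:                      # nonce present: password required
        if pwdhash is None:
            raise ValueError("server requires a password")
        hello["pwdhash"] = pwdhash
    return "HELLO " + json.dumps(hello)
```

A worker should generate its wid once per process and reuse it for later BEAT messages, so in practice the value would be stored rather than created inside the builder.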

Heartbeat

Workers must send a BEAT every N seconds as proof of liveness. I recommend every 10 or 15 seconds. After 60 seconds without a BEAT, Faktory will remove the worker from the Busy page.

BEAT {"wid":"4qpc2443vpvai","rss_kb":1234567}

rss_kb is the worker's process memory size in KB and allows Faktory to show per-process memory usage on the Busy page. It is optional.

The response can be OK or a JSON hash with further data for the worker:

{"state":"quiet"}

The state can be either quiet or terminate. See Deployment below.
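A small Python sketch of the heartbeat exchange might look like this. The function names are hypothetical, and the reply is assumed to have already been decoded from the wire into either the text OK or a JSON document.

```python
import json


def beat_command(wid, rss_kb=None):
    """Build a BEAT message; rss_kb is optional and omitted when unknown."""
    payload = {"wid": wid}
    if rss_kb is not None:
        payload["rss_kb"] = rss_kb
    return "BEAT " + json.dumps(payload, separators=(",", ":"))


def parse_beat_reply(reply):
    """Return the requested state ('quiet' or 'terminate'), or None for OK."""
    if isinstance(reply, bytes):
        reply = reply.decode("utf-8")
    if reply == "OK":
        return None
    return json.loads(reply).get("state")
```

A worker would call parse_beat_reply after every BEAT and act on any non-None state.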

Fetching Jobs

A worker can request a job from a list of queues with the fetch command:

FETCH critical default low

The return value will be nil or the JSON for the job payload.

Note: the queues are checked in the order given. If all queues are empty, FETCH will block for 2 seconds, waiting for a job from the first queue. This short blocking period serves several purposes:

  • Workers don't poll Faktory with thousands of queue checks per second.
  • Jobs are dispatched to a waiting worker almost instantly, within microseconds of being enqueued.
  • Since the blocking is relatively short, we don't need to worry about TCP keepalives or network stability.
  • Workers can randomize their queue ordering on each FETCH to counteract queue starvation.
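The last point can be sketched in Python as below: the worker shuffles a copy of its queue list before each FETCH so that no queue is permanently checked last. The function name fetch_command and its shuffle flag are hypothetical; a uniform shuffle is only one possible policy, and a worker that wants strict priorities would leave the order alone.

```python
import random


def fetch_command(queues, shuffle=False, rng=random):
    """Build a FETCH message; optionally randomize the queue order."""
    names = list(queues)       # copy so the caller's list is untouched
    if not names:
        raise ValueError("at least one queue is required")
    if shuffle:
        rng.shuffle(names)     # new ordering on each call
    return "FETCH " + " ".join(names)
```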

Executing Jobs

FETCH reserves the job for the worker for N seconds (default of 1800). The worker must send an ACK or FAIL for the job's JID within that time or the job will be released for re-execution. You can adjust this timeout by setting the job's reserve_for element:

# ruby
faktory_options reserve_for: 1.hour
# raw JSON
"reserve_for": 3600,

Report Result

The result of a job execution is either success (ACK) or failure (FAIL).

ACK {"jid":"8712638abd2"}
> OK
FAIL {"jid":"8712638abd2", "errtype":"RuntimeError", "message":"Invalid argument", "backtrace":["line1","line2"]}
> OK

FAIL should include error data about the failure if possible for display in the Web UI. Keep in mind that error messages and backtraces can be quite large in many cases. Out of the box, Faktory limits error messages to 1000 bytes and backtraces to the first 30 lines.
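As an illustration, a Python worker could build its ACK and FAIL messages from a caught exception like this. The function names are hypothetical, and trimming on the client side is an optional choice made here to keep messages small; the limits simply mirror the server defaults mentioned above.

```python
import json
import traceback

MAX_MESSAGE_BYTES = 1000    # mirrors the server-side default above
MAX_BACKTRACE_LINES = 30    # mirrors the server-side default above


def ack_command(jid):
    """Build the ACK message for a successfully executed job."""
    return "ACK " + json.dumps({"jid": jid})


def fail_command(jid, exc):
    """Build the FAIL message for a job that raised `exc`."""
    raw = str(exc).encode("utf-8")[:MAX_MESSAGE_BYTES]
    message = raw.decode("utf-8", errors="ignore")  # drop a split character
    frames = traceback.format_tb(exc.__traceback__)
    backtrace = [frame.rstrip() for frame in frames][:MAX_BACKTRACE_LINES]
    payload = {
        "jid": jid,
        "errtype": type(exc).__name__,
        "message": message,
        "backtrace": backtrace,
    }
    return "FAIL " + json.dumps(payload)
```

Each backtrace entry here is one formatted stack frame, which Python renders as a file/line header followed by the source line.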

Information

You can fetch a blob of stats about Faktory with the INFO command:

INFO
> {"faktory"=>
  {"default_size"=>0,
   "tasks"=>
    {"Busy"=>{"reaped"=>0, "size"=>0},
     "Dead"=>
      {"cycles"=>2, "enqueued"=>0, "size"=>0, "wall_time_sec"=>2.472e-05},
     "Retries"=>
      {"cycles"=>23, "enqueued"=>1, "size"=>1, "wall_time_sec"=>0.004135707},
     "Scheduled"=>
      {"cycles"=>23, "enqueued"=>0, "size"=>5, "wall_time_sec"=>0.002255319}},
   "total_enqueued"=>6,
   "total_failures"=>0,
   "total_processed"=>0,
   "total_queues"=>3},
 "server"=>
  {"command_count"=>2,
   "connections"=>1,
   "faktory_version"=>"0.5.0",
   "uptime"=>"12345",
   "used_memory_mb"=>"123 MB"},
 "server_utc_time"=>"10:25:39 UTC"}

Best Practices

URL Configuration

Worker processes should allow configuration of the Faktory server URL via an environment variable. It's normal for modern apps to run on Heroku or other managed environments where memcached, redis, postgresql and other daemons are provided as managed add-ons. In this case, the add-on passes its URL to the application in an ENV variable specific to the service:

REDISTOGO_URL=redis://...

Best practice is for the application developer to provide *_PROVIDER, which tells the client which ENV variable contains the server URL.

REDIS_PROVIDER=REDISTOGO_URL

We recommend the same pattern for Faktory:

FAKTORYTOGO_URL=tcp://:password@hostname:7419
FAKTORY_PROVIDER=FAKTORYTOGO_URL
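A client library could resolve the server URL along these lines. This is a sketch, not the behavior of any particular client: the FAKTORY_URL fallback variable, the tcp://localhost:7419 default, and the function names are assumptions made for the example.

```python
import os
from urllib.parse import urlsplit

DEFAULT_URL = "tcp://localhost:7419"  # assumed fallback, not from this page


def faktory_url(env=os.environ):
    """Resolve the server URL via FAKTORY_PROVIDER indirection."""
    provider = env.get("FAKTORY_PROVIDER")
    if provider:
        url = env.get(provider)
        if not url:
            raise KeyError(f"FAKTORY_PROVIDER names {provider}, which is not set")
        return url
    return env.get("FAKTORY_URL", DEFAULT_URL)  # FAKTORY_URL is an assumption


def parse_faktory_url(url):
    """Split a tcp:// URL into (host, port, password)."""
    parts = urlsplit(url)
    return parts.hostname, parts.port or 7419, parts.password
```

Passing a plain dict as env makes the lookup easy to exercise without touching the real process environment.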

Deployment

Shutdown is surprisingly hard with all the different deployment tools, environments and processes people use. Some jobs are long-running, some jobs are not idempotent. My years of experience with Sidekiq have resulted in the following best practices for worker restarts:

  • Workers should send a BEAT every 15 seconds, only stopping upon process exit.
  • The BEAT response may contain a "quiet" or "terminate" state change.
  • Upon seeing "quiet", the worker process should immediately stop further FETCH'ing.
  • Upon seeing "terminate", the worker process should wait up to N seconds for any remaining jobs to finish; I recommend 25 seconds (see below). Once that time is up, the worker should send FAIL to Faktory for any lingering jobs (so they'll restart) and exit.
  • Once a worker has been quieted, it must be terminated. You can't "unquiet" a worker. Any deploy error/rollback should account for this.

Note that Heroku allows processes up to 30 seconds to exit before a hard kill; that's why I recommend 25 seconds above.
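The restart rules above amount to a small one-way state machine, sketched here in Python. The class and method names are hypothetical, and the injectable clock exists only to make the timing easy to exercise; job tracking and the actual FAIL calls are left out.

```python
import time


class WorkerLifecycle:
    """Tracks running -> quiet -> terminating; there is no way back."""

    def __init__(self, shutdown_timeout=25.0, clock=time.monotonic):
        self.state = "running"
        self.shutdown_timeout = shutdown_timeout
        self._clock = clock
        self._deadline = None

    def on_beat_state(self, state):
        """Apply the state from a BEAT reply (None means plain OK)."""
        if state == "quiet" and self.state == "running":
            self.state = "quiet"
        elif state == "terminate" and self.state != "terminating":
            self.state = "terminating"
            self._deadline = self._clock() + self.shutdown_timeout

    def may_fetch(self):
        """Only a running worker should issue further FETCH commands."""
        return self.state == "running"

    def must_fail_lingering_jobs(self):
        """True once the grace period after 'terminate' has elapsed."""
        return self.state == "terminating" and self._clock() >= self._deadline
```

A worker loop would check may_fetch before each FETCH and poll must_fail_lingering_jobs while draining, continuing to send BEAT until the process exits.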
