moxi ideas, internals, notes...
moxi used to be called "cproxy".
TODO ideas...
- tests!
- add OS version to stats.
- histogram of request codes.
- histogram of response codes.
- stats of bandwidth
- stats of front cache bandwidth
- stats for inline latency, statistical
- key-level stats, by command, with bandwidth
- prefix stats [A-Za-z][A-Za-z0-9]+
- session stats (between session GET and SET)
- command pattern/signature stats
- between GET and SET
- cache miss latency, between GET miss and SET
- too many gettimeofday()'s?
- keepalive on the connections with a timer
- add stats to see if this is a problem?
- request retry
- revisit timeout/retry logic.
- revisit settings.verbose levels
- revisit error msg strings
- revisit DONE list, add tests as necessary
- optimization: have free item list per proxy_td
- so don't need to hit item/cache locks.
- optimization: use trond's allocator instead of malloc/free
- for multiget hashtable
- for multiget hashtable entries.
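  a rough sketch of the per-proxy_td free list idea -- hypothetical
  types and names, not moxi's actual structs:

      #include <stdlib.h>

      typedef struct item item;
      struct item {
          item *next;              /* freelist link */
          /* ... key/value fields ... */
      };

      typedef struct {
          item *free_head;         /* owned by one worker thread, so no lock */
          int   free_count;
          int   free_max;          /* cap so a thread can't hoard memory */
      } proxy_td_freelist;

      static item *td_item_alloc(proxy_td_freelist *fl) {
          if (fl->free_head != NULL) {
              item *it = fl->free_head;
              fl->free_head = it->next;
              fl->free_count--;
              return it;
          }
          return calloc(1, sizeof(item));  /* fall back to the heap */
      }

      static void td_item_free(proxy_td_freelist *fl, item *it) {
          if (fl->free_count < fl->free_max) {
              it->next = fl->free_head;    /* push back, no locks needed */
              fl->free_head = it;
              fl->free_count++;
          } else {
              free(it);
          }
      }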
- failover/drain mode
- sort of an edge case of replication
- if we know a server's going to be going down in the future
and we know where keys should end up, do "simple"
cooling/warming-up replication.
- not so simple when new server goes down.
- or when old/new servers are flapping.
- or net-split.
- or different moxi/clients see different old/new servers.
- need configuration, possibly...
- svr-111
- host = aaa
- port = 11211
- drain = 222
- svr-222
- host = aaa
- port = 11211
- doxygen for moxi
- timeouts
- request is taking longer than X ms -- must be a failure.
- need timeout for upstream conns waiting in the queue
for an available downstream -- DONE
- if no downstreams are available, then upstream conn
might wait forever, if there's no timeout.
- need timeout for downstream requests -- DONE
- on timeout, closes the downstream connection
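  a rough sketch of a per-request downstream timeout using a libevent
  timer -- hypothetical names, not the actual moxi timeout code:

      #include <sys/time.h>
      #include <event.h>

      struct downstream_req {
          struct event timeout_event;
          /* ... pointers back to the downstream/upstream conns ... */
      };

      static void downstream_timeout_cb(int fd, short which, void *arg) {
          struct downstream_req *req = arg;
          (void)fd; (void)which;
          /* treat the request as failed: close the downstream conn,
             then error or retry on the upstream. */
          (void)req;
      }

      static void downstream_start_timeout(struct event_base *base,
                                           struct downstream_req *req,
                                           int msec) {
          struct timeval tv;
          tv.tv_sec  = msec / 1000;
          tv.tv_usec = (msec % 1000) * 1000;
          evtimer_set(&req->timeout_event, downstream_timeout_cb, req);
          event_base_set(base, &req->timeout_event);
          evtimer_add(&req->timeout_event, &tv);
          /* on a normal response, evtimer_del() before reusing the conn. */
      }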
- pipelining/streaming
- instead of 100% store-and-forward
- store-and-forward is helpful for replay.
- use a tee design?
- per facebook conversation
- pipelining for GET value bodies.
- consider multiget key-deduplication case. Should work.
- consider non-uniform protocol case. Should work.
- need to keep item refcount sane.
- pipelining for SET value bodies.
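  a conceptual sketch only of the streaming idea -- real moxi is
  non-blocking/event-driven, so a blocking loop like this is just to
  show the data flow of relaying a value body in chunks instead of
  buffering the whole item first:

      #include <stddef.h>
      #include <unistd.h>

      static int relay_value(int downstream_fd, int upstream_fd, size_t nbytes) {
          char buf[8192];
          while (nbytes > 0) {
              size_t want = nbytes < sizeof(buf) ? nbytes : sizeof(buf);
              ssize_t n = read(downstream_fd, buf, want);
              if (n <= 0)
                  return -1;               /* downstream closed or error */
              if (write(upstream_fd, buf, (size_t)n) != n)
                  return -1;               /* upstream closed or error */
              nbytes -= (size_t)n;
          }
          return 0;
      }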
- handling binary protocol
- on upstream
- on downstream - DONE
- when downstreams are not uniform protocol
- some ascii, some binary
- easiest - binary upstream and binary downstream
- binary quiet commands are similar to ascii noreply?
- there are state assumptions so we can't just
jump to another downstream.
- need to associate upstream with a favored downstream
- retry SET's and DELETE's
- so that upstream can continue fast and just fire-&-forget.
- but, need to respect serializability, similar to noreply bug?
- log and replay these?
- possibly in separate datacenter?
- 1000 customers per box -- php hosting scenario.
- fall back to set-password approach.
- use unix-domain-sockets, and filesystem security/access control.
- optimization to use SETQ for downstream binary SET's
- optimization: use opaque's instead of GETK during binary multiget smush.
- issue: ascii-to-ascii uses same codepath, and ascii doesn't have opaques.
- issue: non-uniform downstream would be more complicated
if we used opaques.
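  sketch of the binary request header and how a multiget smush could
  tag each quiet GETKQ with an opaque that maps back to the waiting
  request -- illustrative only, not moxi's actual code:

      #include <stdint.h>
      #include <string.h>
      #include <arpa/inet.h>

      #define OP_GETKQ 0x0d
      #define OP_NOOP  0x0a

      typedef struct {
          uint8_t  magic;       /* 0x80 for requests */
          uint8_t  opcode;
          uint16_t keylen;
          uint8_t  extlen;
          uint8_t  datatype;
          uint16_t vbucket;
          uint32_t bodylen;
          uint32_t opaque;      /* echoed back in the response */
          uint64_t cas;
      } bin_req_header;         /* 24 bytes, no padding */

      static size_t pack_getkq(char *out, const char *key, uint16_t nkey,
                               uint32_t opaque) {
          bin_req_header h;
          memset(&h, 0, sizeof(h));
          h.magic   = 0x80;
          h.opcode  = OP_GETKQ;
          h.keylen  = htons(nkey);
          h.bodylen = htonl(nkey);
          h.opaque  = htonl(opaque);  /* e.g. an index into a request table */
          memcpy(out, &h, sizeof(h));
          memcpy(out + sizeof(h), key, nkey);
          return sizeof(h) + nkey;
      }

      /* after the last GETKQ, append a NOOP; its response marks the end
         of the multiget, since quiet misses send nothing back. */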
- authorization
- login/auth after creating downstream conn.
- PLAIN support - DONE.
- need mech support one day.
- should support more mech's later.
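  sketch of building the SASL PLAIN payload (the value of a binary
  sasl-auth request whose key is the mechanism name "PLAIN"); an empty
  authzid is assumed here, and the names are illustrative:

      #include <stddef.h>
      #include <string.h>

      static size_t sasl_plain_payload(char *out, size_t outlen,
                                       const char *user, const char *pass) {
          size_t ulen = strlen(user), plen = strlen(pass);
          size_t need = 1 + ulen + 1 + plen;    /* "\0user\0pass" */
          if (need > outlen)
              return 0;
          out[0] = '\0';                        /* empty authzid */
          memcpy(out + 1, user, ulen);
          out[1 + ulen] = '\0';
          memcpy(out + 2 + ulen, pass, plen);
          return need;
      }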
- should mcp/libconflate msgs have version numbers?
- added note to mcp issue tracker.
- message aggregation
- aka, get key squashing or key-dedupe.
- take popular get keys in wait queue and only get them once.
- think through binary get/getq,getk,getkq issues before
building this.
- because we want one codebase/logic that handles both
ascii and binary situations.
- might need to convert get/getq's into getk/getkq's
on the downstream.
- also think through protocol conversions,
with ascii on the front and binary on the back,
and vice versa.
- idea: if num(upstream_conn's) > 0 or need an inflight hashmap
- what if conn goes down during scatter/gather
==========================================================
below, "a2b" means ascii upstream to binary downstream...
scatter/gather-multiget
----------------------------------------------------------
a2a done
a2b easy, because we have all keys and can just
use getq+noop pattern. we don't need to use opaque
on the downstream conns.
a2ab need to track the allowed protocols of downstream servers.
b2a need multiple read/nreads, so we can gather all
keys first until final upstream no-op or non-quiet get,
or see an upstream conn_closing.
then we can do a downstream multiget.
b2b need multiple read/nreads, so we can gather all
keys first until final upstream no-op or non-quiet get
or see an upstream conn_closing.
or, need noreply-style soft/hard affinity between
upstream and downstream with timeouts.
b2ab
key-squashing, when upstream protocols are uniform
----------------------------------------------------------
a2a easy, with inflight hashtable
a2b easy, with inflight hashtable
a2ab
b2a ascii downstream has no opaques, so need to remember the upstream opaques and rematch them.
b2b can't squash opaques, so need to remember them and rematch.
b2ab the inflight hashtable value should have...
upstream conns, and their opaques
- key squashing inflight hashtable might have bitfield values
that tell us which upstream conns are interested in
a particular key. (NO, too tricky, just use memory
like a normal person vs over-optimizing.)
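  sketch of what an inflight (key-squashing) hashtable entry could hold:
  one downstream request per unique key, plus the upstream conns waiting
  on it and, for binary upstreams, the opaque each one used so responses
  can be rematched and fanned out -- hypothetical types, not moxi's
  actual structs:

      #include <stdbool.h>
      #include <stdint.h>

      typedef struct conn conn;                /* opaque conn type */

      typedef struct multiget_waiter {
          conn     *upstream_conn;
          uint32_t  opaque;                    /* original opaque, binary upstream */
          bool      wants_cas;                 /* e.g. ascii "gets" vs "get" */
          struct multiget_waiter *next;
      } multiget_waiter;

      typedef struct multiget_entry {
          char      *key;
          uint16_t   nkey;
          multiget_waiter *waiters;            /* all upstreams asking for this key */
          struct multiget_entry *next;         /* hash bucket chain */
      } multiget_entry;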
- better stats
- hot keys
failure modes...
- downstream conn closes or in weird state (just close).
- normally, downstream *wants* to get to released state.
- it should end up there even when its upstream closes,
or when there are swallowable bytes that can
return the downstream conn to a conn_pause state.
- goal: no deadlock
- goal: no leak
- upstream sees what?
- depends on config?
- depends on the command that was inflight?
- depends on when the downstream failure was noticed?
- while reserving?
- while writing?
- while mwriting?
- while reading?
- while nreading?
- while alloc'ing?
- borrow failure recovery modes from spymemcached
- ISSUE: we don't see downstream conn close until we actually
try to read from the downstream conn, very late in state
machine.
- see the retry logic in on_close_downstream_conn().
- upstream *wants* to get to conn_new_cmd state,
after swallowing bad downstream data and sending an ERROR.
- when to reconnect?
- backoff period?
- borrowing from libmemcached to handle connects.
- when to failover?
- when to failback?
- when to report?
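  a capped exponential backoff is one simple reconnect policy -- a
  sketch, not what moxi currently does:

      /* delay grows 100ms, 200ms, 400ms, ... capped at 30s */
      static int next_backoff_msec(int consecutive_failures) {
          int msec = 100;
          int cap  = 30 * 1000;
          while (consecutive_failures-- > 0 && msec < cap)
              msec *= 2;
          return msec < cap ? msec : cap;
      }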
known bugs
- t/getset.t fails
- during a process_update_command(), when handling
a "set" that is too big (>1MB), the key is not
deleted on downstream server.
- noreply.t fails
- happens because we're using multiple downstreams for concurrency,
which is ok when using synchronous, request/response commands.
but, this breaks the serializability assumptions of noreplies.
- the same pattern here is needed for binary getq/getkq handling.
- solution 1 <<< this path needs timeout experimentation.
- keep upstream assigned to a downstream even after
the downstream write is done and the downstream
goes into conn_pause state. The upstream goes
into conn_new_cmd state.
also, put a timer to wake up so we can release
an unused downstream.
- negatives: a bunch of upstreams whose last action
was a noreply can really reduce throughput by
keeping their downstreams reserved and unavailable
to other clients.
- might never see this, depending on the upstream
usage pattern. if the last thing upstream does
is fire off some noreply updates, it could be bad.
- also, the upstream conn will need a pointer to the
downstream it's assigned to.
- also, more complexity when downstreams and upstreams go down.
- solution 2
- never use noreply for downstreams. always use
synchronous request/reply commands.
- negatives: this is always slow, rather than
just sometimes slow.
- positives: simpler. and, upstream conns don't
need pointer to downstream.
- solution 3
- keep upstream assigned to a preferred downstream,
but in a soft fashion.
- add a waiting_for_downstream_head/tail slot to downstream.
- allow downstream to be stealable on concurrency pressure.
- could really slow down an upstream that uses noreply,
if it has to wait for a specific downstream.
- but it also keeps the other conns serviced.
- also, more complexity when downstreams and upstreams go down.
- also, need more pointers to track the assignment preference?
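  sketch of the extra bookkeeping solution 3 implies: a soft, stealable
  preference from an upstream to "its" downstream, plus a wait list on
  the downstream -- hypothetical field names:

      typedef struct conn conn;

      typedef struct downstream {
          conn *reserved_upstream;          /* NULL when paused / idle */
          conn *waiting_for_downstream_head;
          conn *waiting_for_downstream_tail;
          /* ... existing downstream fields ... */
      } downstream;

      typedef struct upstream_extra {
          struct downstream *preferred;     /* soft affinity; may be stolen
                                               under concurrency pressure */
      } upstream_extra;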
------------------------------------------------------------
- submodules, easy low-dependency build steps.
- git submodule and script magic to make the
right metaproject. this is proxybuild.
- need a faster way to test end-to-end, with mcp.
faster edit-compile-test-debug cycle
- servers owned by others.
- on a proxy config change that renames a proxy but keeps the
same port as the old proxy, we should close any old
upstream connections.
- proxy shutdown thru agent_config doesn't close the
listening port?
- test command, described by dustina.
- ping timings
- time between request and response on each downstream conn.
- initial version DONE.
- need version that can use small "ping test" buckets.
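  a sketch of the timing itself, using gettimeofday() (illustrative):

      #include <sys/time.h>

      /* record the start time just before writing the request to the
         downstream conn; call this when its response is parsed. */
      static long elapsed_usec(const struct timeval *start) {
          struct timeval now;
          gettimeofday(&now, NULL);
          return (now.tv_sec - start->tv_sec) * 1000000L +
                 (now.tv_usec - start->tv_usec);
      }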
- bucket selection
- after auth -- DONE.
- need server side of this.
------------------------------------------------------------
- FIXED: multiversioning.t fails / deadlocks
- when the cproxy is the final downstream.
- does not deadlock when a real memcached server
running in a separate process is the final downstream.
- FIXED: by putting libmemcached in noblock mode.
- DONE: upstream conn closes
- goal: any reserved downstreams do not leak
and, downstreams ideally go back into the free,
released list.
- upstream conn needs way to find its attached, reserved downstream.
- solution 1
- add an extra2 pointer to the conn?
- solution 2 << we choose this path.
- walk the reserved downstreams, comparing upstream_conn values.
- since upstream conn close is infrequent, this might be ok.
- need to see why else we'd need this.
- need to keep a reserved downstreams list? yes, we do this.
- also, need to remove the closing upstream conn from the wait list.
- done
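  a sketch of that "walk the reserved downstreams" cleanup -- hypothetical
  types, not the exact moxi code:

      typedef struct conn conn;
      typedef struct downstream downstream;
      struct downstream {
          conn       *upstream_conn;    /* who this downstream is reserved for */
          downstream *next_reserved;    /* reserved-downstreams list link */
      };

      static void on_upstream_close(downstream *reserved_head, conn *closing) {
          downstream *d;
          for (d = reserved_head; d != NULL; d = d->next_reserved) {
              if (d->upstream_conn == closing) {
                  d->upstream_conn = NULL;  /* let it drain/swallow, then
                                               return to the released list */
              }
          }
      }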
- WONTFIX: change from proxy config keyed by port to keyed by name.
- actually, config algorithm key'ed by port is right.
- DONE: need debug/log in libconflate.
- DONE: agent_stats returns current config and VERSION.
- DONE: no thread-id in agent_stats. Now aggregating results.
- DONE: optimized connection to self, where moxi is running
as a memcached server.
- DONE: release downstreams back to head of list, not to tail
- DONE: make clock timer resolution configurable
- DONE: allow apikey to not be on the cmdline, instead optionally
readable from secure 600 file
- usage:
- moxi -z ./path/to/apikeyfile