add High Availability research #685

giubacc · 2023-08-30T10:06:30Z

High Availability research

This is a first attempt to define the direction we want to take for the HA topic with s3gw.
Feedback, comments, requests, and considerations are all welcome at this stage.

Related to: https://github.com/aquarist-labs/s3gw/issues/361
Signed-off-by: Giuseppe Baccini <giuseppe.baccini@suse.com>

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
CHANGELOG.md has been updated should there be relevant changes in this PR.

m-ildefons

Nice research. I very much like that you put some thought into what kind of failure scenarios are even within our scope and that there is a comprehensive overview of the various possible configurations with the components at hand.
Here are some comments I thought of while reading; I hope you find some useful information in them.

docs/research/ha/RATIONALE.md

l-mb · 2023-08-31T15:15:32Z

A high-level comment would be to touch base with the LH team on the work on the NFS share manager and their HA plans/ideas. While they can't take advantage of an ingress, they share several similar concerns: node/pod failure detection, recovery, etc.

Perhaps there's overlap and tech we can leverage jointly.

james-munson · 2023-09-06T19:05:43Z

One item we (Longhorn) would like to know about the S3GW HA is whether it will assume the volume beneath the object store must be RWX, or whether RWO would suffice. If the gateway is active/active enough that both sides need simultaneous write access in order to transfer the work fast enough, that will make a difference.

From what I gather in the discussion here, it is unacceptable to have to start a pod on the new owner as part of the failover, but should be acceptable to defer attaching the backing volume until then. If so, then a simple RWO volume would suffice.

If not, the RWX volume would itself be layered on NFS, and any HA transfer would be gated by the NFS HA transfer, which currently requires significant time to clear locks, wait for grace periods, and all that. (I have the ticket to try to improve its performance, if possible.)

giubacc · 2023-09-07T07:41:14Z

Our current idea is to propose the "active/standby" HA model.
In Kubernetes terms, this translates to a deployment with an "immutable" replicas: 1 spec.
In case of failure, Kubernetes would start a new s3gw pod, which would take "ownership" of the LH volume without too many complications; we assume that attaching the new s3gw pod to the existing LH PVC (previously mounted by the failed pod) is an operation that cannot fail.
With this approach ("active/standby"), we would avoid keeping other "cold" instances of s3gw around; an "active/passive" approach would require us to implement potentially non-trivial complexity on top of the Kubernetes primitives.
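
As a rough illustration of the model described above, here is a minimal Python sketch that builds the two Kubernetes objects involved: a PVC with the ReadWriteOnce access mode and a Deployment pinned to replicas: 1. It is not taken from the s3gw charts. The object names, the labels, the image reference, the mount path, the volume size, and the "longhorn" storage class name are all assumptions made for the example. The Recreate strategy is also an assumption; it is one way to keep a rollout from starting a second pod while the first still holds the volume.

```python
import json

# Hypothetical label set used to tie the Deployment to its pods.
APP_LABELS = {"app": "s3gw"}


def build_pvc(name="s3gw-store", size="10Gi", storage_class="longhorn"):
    """Build a PVC manifest for the backing volume.

    ReadWriteOnce is enough here because only one s3gw pod mounts the
    volume at any time. Name, size and storage class are assumptions.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def build_deployment(name="s3gw", image="example.invalid/s3gw:latest",
                     pvc_name="s3gw-store"):
    """Build a single-replica Deployment manifest that mounts the PVC.

    The image reference and mount path are placeholders, not real values.
    """
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": APP_LABELS},
        "spec": {
            # Exactly one pod: Kubernetes schedules a replacement on failure.
            "replicas": 1,
            # Assumption: stop the old pod before starting a new one on
            # rollouts, so two pods never compete for the RWO volume.
            "strategy": {"type": "Recreate"},
            "selector": {"matchLabels": APP_LABELS},
            "template": {
                "metadata": {"labels": APP_LABELS},
                "spec": {
                    "containers": [{
                        "name": "s3gw",
                        "image": image,
                        "volumeMounts": [
                            {"name": "store", "mountPath": "/data"},
                        ],
                    }],
                    "volumes": [{
                        "name": "store",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    # Emit both objects as one JSON List document.
    manifest = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [build_pvc(), build_deployment()],
    }
    print(json.dumps(manifest, indent=2))
```

Running the script prints a JSON document that could be saved to a file and inspected, or passed to kubectl apply -f in a test cluster. Nothing in the manifest itself enforces the "immutable" part of replicas: 1; that would have to be guaranteed elsewhere, for example by not exposing the value in the chart.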

l-mb · 2023-09-07T13:41:47Z

One item we (Longhorn) would like to know about the S3GW HA is whether it will assume the volume beneath the object store must be RWX, or whether RWO would suffice. If the gateway is active/active enough that both sides need simultaneous write access in order to transfer the work fast enough, that will make a difference.

RWO is, in fact, the only supported mode.

We need features from XFS that NFS would no longer expose; and we will not support multiple s3gw instances on the same store (which conceptually also doesn't really make sense from a performance PoV, and not really from an availability point of view either, since it still all depends on a single node).

We've got no plans to support RWX.

(At that point, s3gw would be slowly implementing a distributed K/V object store as a backend, and ... that'd be called RADOS/Ceph :-D )

jecluis · 2023-10-02T14:22:15Z

@l-mb are we good to merge this?

jecluis · 2023-10-13T15:22:31Z

@giubacc there are conflicts with this PR, mind addressing them?

- add research/ha/RATIONALE.md
Related to: https://github.com/aquarist-labs/s3gw/issues/361
Signed-off-by: Giuseppe Baccini <giuseppe.baccini@suse.com>

Related to: https://github.com/aquarist-labs/s3gw/issues/361
Signed-off-by: Giuseppe Baccini <giuseppe.baccini@suse.com>

- regular-localhost-incremental-fill-5k
- regular_localhost_load_fio_64_write
- regular_localhost_zeroload_400_800Kdb
- regular_localhost_zeroload_emptydb
- segfault_localhost_zeroload_emptydb
Related to: https://github.com/aquarist-labs/s3gw/issues/361
Signed-off-by: Giuseppe Baccini <giuseppe.baccini@suse.com>

- scale_deployment_0_1-k3s3nodes-zeroload-emptydb
- s3wl-putobj-100ms-clusterip
- s3wl-putobj-100ms-ingress
Related to: https://github.com/aquarist-labs/s3gw/issues/361
Signed-off-by: Giuseppe Baccini <giuseppe.baccini@suse.com>

Related to: https://github.com/aquarist-labs/s3gw/issues/361
Signed-off-by: Giuseppe Baccini <giuseppe.baccini@suse.com>

giubacc · 2023-10-16T12:26:47Z

rebased on latest main

giubacc · 2023-10-24T15:02:26Z

@l-mb @jecluis @vmoutoussamy
Can we merge this first HA research?
I'd rather handle the ongoing medik8s activity in its dedicated LH issue.

giubacc added the kind/research, area/kubernetes, and area/rgw-sfs labels Aug 30, 2023

jhmarina mentioned this pull request Aug 30, 2023, in "Document current HA model (Epic)" #361 (open, 4 tasks)

giubacc self-assigned this Aug 30, 2023

m-ildefons reviewed Aug 30, 2023
