
ETCD-610: automated backups no config #1646

Open · wants to merge 3 commits into master from etcd-automated-backup-no-config
Conversation

Elbehery (Contributor)

This PR adds an enhancement proposal for etcd Automated Backups No Config.

Resolves https://issues.redhat.com/browse/ETCD-610

cc @openshift/openshift-team-etcd

@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Jun 17, 2024
@openshift-ci-robot

openshift-ci-robot commented Jun 17, 2024

@Elbehery: This pull request references ETCD-610 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

This PR adds an enhancement proposal for etcd Automated Backups No Config.

Resolves https://issues.redhat.com/browse/ETCD-610

cc @openshift/openshift-team-etcd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Elbehery (Contributor Author)

/assign @hasbro17
/assign @tjungblu
/assign @dusk125
/assign @soltysh

- Need to agree on a default schedule.
- Need to agree on a default retention policy.

- Several options exist for the default PVCName.
Contributor Author:

@gnufied I would need your input here please

@Elbehery force-pushed the etcd-automated-backup-no-config branch from 447cbfa to 9439c80 on June 17, 2024 at 22:05

openshift-ci bot commented Jun 17, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hasbro17. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Elbehery force-pushed the etcd-automated-backup-no-config branch 2 times, most recently from 035e9d5 to f54b7a3 on June 17, 2024 at 22:31

### User Stories
- As a cluster administrator, I want cluster backups to be taken without configuration.
- As a cluster administrator I want to schedule recurring cluster backups so that I have a recent cluster state to recover from in the event of quorum loss (i.e. losing a majority of control-plane nodes).


I think it's worth settling on a cadence we think is a good default as part of the requirements. Maybe once a day?

Contributor Author:

+1, I think once a day at midnight is sufficient for most users; midnight should be a time when the cluster is not under load.


I wouldn't assume that, but we can smear the schedule over the course of the day


- Several options exist for the default PVCName.
- Relying on `dynamic provisioning` is sufficient; however, it is not an option for `SNO` or `BM` clusters.
- Utilising the `local storage operator` is a proper solution; however, installing a whole operator is too much overhead.


why not hostPath, as we do in our e2e test?
https://github.com/openshift/cluster-etcd-operator/blob/master/test/e2e/backup_test.go#L444-L487

that should be portable?

Contributor Author:

So hostPath mounts a path from the node's file system as a volume into the pod. My concerns:

  • hostPath could have a security impact, as it exposes the node's filesystem.
  • There is no scheduling guarantee for the backup pod when using hostPath, as there is with localVolume.
  • With localVolume, the node affinity on the PV forces the backup pod to be scheduled on the specific node where the volume is attached.
  • localVolume allows using a separate disk as the PV, unlike hostPath, which mounts a folder from the node's FS.
  • localVolume is handled by the PV controller and the scheduler in a different manner; in fact, it was created to resolve issues with hostPath.

That being said, if we were to use localVolume, we need to find a solution for balancing the backups across the master nodes. Some ideas could be:

  • Create a PV for each master node using localVolume; the backup controller from CEO should take care of balancing the backups across the available and healthy volumes.
  • The controller should keep the most recent backup on a healthy node available for restoration.
  • The controller should skip an unhealthy master node when taking a backup.
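Purely as an illustration of the difference being discussed (the image and host path below are placeholders, not part of the proposal; a full local-volume PV example appears later in this thread): a hostPath volume is declared directly in the pod spec, which by itself gives no scheduling guarantee.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-backup-hostpath-example   # illustrative name
  namespace: openshift-etcd
spec:
  restartPolicy: Never
  containers:
  - name: backup
    image: registry.example.com/etcd-backup:latest   # placeholder image
    command: ["sleep", "3600"]                        # placeholder; a real pod would run the backup script
    volumeMounts:
    - name: backup-dir
      mountPath: /backup
  volumes:
  - name: backup-dir
    hostPath:
      path: /var/lib/etcd-backup       # hypothetical host directory
      type: DirectoryOrCreate
```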


can you list your concerns and this solution as an alternative here in the doc as well, please?

Contributor Author:

added these comments to the alternatives and Risks sections 👍🏽

@Elbehery force-pushed the etcd-automated-backup-no-config branch 2 times, most recently from 16f8e08 to a40a86f on June 18, 2024 at 15:55

## Design Details

- Add default values to the current `EtcdBackup` spec using an annotation.
Contributor Author:

@openshift/openshift-team-etcd I added this section after discussion on the Architecture Call

Input is from @soltysh @jsafrane

  • It is recommended to use default values on the current API.
  • Do not rely on the CVO to manage the default config.
  • Use a storage solution based on the OCP variant:
    • Dynamic provisioning for cloud-based clusters.
    • Local volume for SNO / BM.

Member:

The general flow should be as follows:

  1. All the EtcdBackup fields are defaulted where possible. That's why I'm proposing to put this functionality behind the already existing AutomatedEtcdBackup feature gate, rather than introducing a new one.
  2. The pvcName is the one that will have to be populated automatically by the CEO, unless a user sets one. It will use:
  • the available default storage class in the cluster (the storage team can point you at where to look for it);
  • if the above is not available, fall back to using a localVolume, as explained by Hemant above;
  • add a warning condition to the CEO (at least for the localVolume case), reporting that we're working off of the default etcd backup and that the cluster admin should verify the correctness of that configuration.
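For context on the first point above: the cluster's default storage class is conventionally marked with the `storageclass.kubernetes.io/is-default-class` annotation, which is what an operator would look for. A minimal illustration (the name and the AWS EBS CSI provisioner are just examples):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-csi                     # illustrative name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # marks the cluster default
provisioner: ebs.csi.aws.com        # example CSI provisioner on AWS
volumeBindingMode: WaitForFirstConsumer
```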


In the event of the node becoming inaccessible or unschedulable, the recurring backups would not be scheduled. The periodic backup config would have to be recreated or updated with a different PVC that allows for a new PV to be provisioned on a node that is healthy.

If we were to use `localVolume`, we need to find a solution for balancing the backups across the master nodes; some ideas could be:


how would we restore from a localVolume? Is the data stored somewhere in /var/?

Contributor Author:

So the localVolume is mounted wherever it is specified in the Pod. In fact, you can mount a whole external drive.

iiuc, hostPath supports only mount points from the node's FS


imagine we have to run the restore script, how would we get the local volume mounted into a recovery machine? would we create another static pod that would mount the localVolume?

I imagine the local storage plugin must be running along with the kubelet for that to work? @jsafrane / @gnufied maybe you guys can elaborate a bit how that works, I don't have much time this week to read through the CSI implementation 🐎

@gnufied (Member), Jun 20, 2024:

Pardon my ignorance for not being familiar with the etcd deployment, but even if there is a default SC available, why are we backing up just one replica of etcd? I thought we would want to back up all replicas - no? Such as - what if, in the case of a network partition or something, the replica we are backing up is behind?

So IMO it sounds like, whether we are using LocalVolume or something else, we should always be creating backups across master nodes? Is that accurate?

Member:

Also, to answer @tjungblu's question: Local Volumes are an inbuilt feature of k8s, and hence no additional plugin is necessary. Heck, a local PV can be provisioned statically and the local-storage-operator is not necessary either: https://docs.openshift.com/container-platform/4.15/storage/persistent_storage/persistent_storage_local/persistent-storage-local.html#local-create-cr-manual_persistent-storage-local


... and can be mounted in the new etcd pod and the backup will exist.

check the restore script, you will need to unpack the snapshot before you can run any pod

Contributor Author:

@gnufied, in order to make sure the backups are accessible, and that we can round-robin the backups across all master nodes that are up-to-date with the etcd cluster leader:

Is it possible to

  • Create a StatefulSet across the master nodes, where the PVC template uses the localVolume to provision PVs?
  • Since we need to take the backup according to a schedule as per EtcdBackupSpec, the issue is how to manage the backup pods within the STS using the CronJob.
  • I am not aware if this is possible, but at least we can generate an Event from the CronJob, and the STS could react to these events by taking a backup.

Contributor Author:

check the restore script, you will need to unpack the snapshot before you can run any pod

We can use an init-container to unpack the backup, and then the etcd pod can start?

Otherwise, we can start a fresh etcd pod and then run etcdctl restore?

Please correct me if I am wrong
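A rough sketch of the init-container idea, purely as an illustration (the image names, paths, and the use of `etcdctl snapshot restore` here are assumptions, not the agreed design):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-restore-example          # illustrative only
  namespace: openshift-etcd
spec:
  restartPolicy: Never
  initContainers:
  - name: unpack-snapshot
    image: quay.io/example/etcd-tools:latest      # placeholder image containing etcdctl
    command:
    - /bin/sh
    - -c
    - etcdctl snapshot restore /backup/snapshot.db --data-dir /restored/etcd-data
    volumeMounts:
    - name: backup
      mountPath: /backup
    - name: restored
      mountPath: /restored
  containers:
  - name: etcd
    image: quay.io/example/etcd:latest            # placeholder image
    command: ["etcd", "--data-dir=/restored/etcd-data"]
    volumeMounts:
    - name: restored
      mountPath: /restored
  volumes:
  - name: backup
    persistentVolumeClaim:
      claimName: etcd-backup-pvc                  # the PVC holding the snapshot
  - name: restored
    emptyDir: {}                                  # illustration only; real etcd data lives on the host
```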

Member:

  • Create StatefulSet across the master nodes, where the PVC template uses the localVolume to provision PV.

I believe you mean DaemonSet across the master nodes 😉

Contributor Author:

I actually thought about this as well, but there is no PersistentVolumeClaim template on a DaemonSet.

Having a separate PVC & PV for each master node would be ideal for spreading the backups across the master nodes.

That being said, according to the latest update, we are going to use a SideCar container and a hostPath volume.


If we were to use `localVolume`, we need to find a solution for balancing the backups across the master nodes; some ideas could be:

- create a PV for each master node using `localVolume`; the backup controller from CEO should take care of balancing the backups across the `available` and `healthy` volumes.
@tjungblu, Jun 18, 2024:

would/could a DaemonSet do that for us?

Contributor Author:

Good point.

  • We can use a DS and node affinity to run the DS pods only on master nodes. However, I am not aware of how the storage will be handled in this case.

  • Is there a way to keep the backups in a round-robin fashion among all master nodes?

  • Also, what about an STS? wdyt?


yep, STS could also work, especially with the storage. Would also be useful to have this compared to the static pod/hostpath approach

Contributor Author:

Alright, I think if we have an STS, then the Pods, PVCs, and PVs are managed together.

I think we can in this case create backups in a round-robin manner across the master nodes.

For BM, I think we can still use a local volume as the storage option, managed by the STS.

Member:

How does this round robin backup work?

Member:

I read the original enhancement - https://github.com/openshift/enhancements/blob/master/enhancements/etcd/automated-backups.md - and I understand this better now. But this still begs the question - how do you know which snapshot to restore from? What if you take a snapshot of an etcd replica which was behind and is just catching up with the other master nodes? Is that possible?

Contributor Author:

So we will make sure that the backup is taken from a member whose log is identical to the leader's. This way we make sure that we are not lagging behind.

But could this approach work?

Member:

Why do you want to handle the balancing? That's not what you should care for at all.

Contributor Author:

Yes, I agree; therefore the SideCar approach is the most appropriate.

#### Cloud based OCP

##### Pros
- Utilising `Dynamic provisioning` is the best option on cloud-based OCP.


how is the customer able to access a dynamically created PV to get their snapshots for restoration?

Contributor Author:

Gr8 question. So I have defaulted the RetentionPolicy in my sample configs to Retain.

This way the PV contents will never be erased automatically.

Now, to answer the restoration question:

  • If the cluster is still running, the PV can be attached to any node, as long as it is in the same availability zone.
  • If the cluster is completely down, the PV can still be accessed by the CU; it will never be deleted unless done manually.


it's not about the retention, it's about the access and mounting possibilities

If the cluster still running, the PV can be attached to any node, as long as it is in the same availability zone.

that's quite a bad constraint for a disaster recovery procedure, isn't it? :)

Contributor Author:

:D :D .. I know. Well, the best option IMHO is to push backups to a remote storage option; then, even if the whole cluster is down, a new installation can be restored using the remote backup.

However, in my discussion yesterday, remote storage was not an option for BM. They also recommended keeping it simple and not too complicated.

If it were up to me, I would create a sidecar which pushes the backups to remote storage. I think this option would work for any OCP variant: the master nodes don't have to be in the cloud, and the sidecar could authenticate and push backups regardless of the underlying OCP infrastructure. wdyt

@gnufied (Member), Jun 20, 2024:

If dynamically provisioned volumes are available in the cluster, it is 100% likely that the cluster has remote storage, and hence backups are available even if the cluster is down.

As for the availability zone concerns, yeah, that is why backing up into a PV is not enough. We should consider using CSI snapshots of those PVCs, so that snapshots of the backups can be available across availability zones and in case the file system on the PV gets corrupted or something.
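For illustration, a CSI snapshot of the backup PVC would be requested with a `VolumeSnapshot` object roughly like the following (the snapshot class name is an assumption; it depends on the installed CSI driver):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: etcd-backup-snapshot
  namespace: openshift-etcd
spec:
  volumeSnapshotClassName: csi-snapshot-class   # assumed; provided by the CSI driver
  source:
    persistentVolumeClaimName: etcd-backup-pvc  # the PVC holding the etcd backups
```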

Contributor Author:

I actually want to fall back to the localVolume approach, since having two separate approaches is harder to maintain.

I believe the localVolume should be sufficient. However, if we could utilize an STS across all the master nodes, that would be ideal in situations where we lose one or more master nodes.

Member:

If I were to design this, I would definitely use two different backup strategies. It is far more reliable to use dynamically provisioned PVCs for backups: they can be snapshotted and remain accessible if a node goes down.

It seems to me that any hostPath/localVolume based approach is basically inferior. If the last backup was taken on the leader and then the leader node goes down, then so does our backup (other nodes may have slightly older backups). So hostPath/localVolume requires a fundamentally different recovery mechanism.

IMO - if we look at other components that use persistent storage in OpenShift, if customers don't provide storage then no persistent configuration is created. Prometheus and the image registry both require storage, and they don't do anything automatically.

So what I am saying is: we should probably limit the scope of this KEP to environments where a StorageClass is available. If no StorageClass is configured, then no automatic backups are taken.

We should not take it upon ourselves to decide a storage strategy for the customer. cc @tjungblu

Member:

So to summarize - I would simply not bother configuring backups via localVolume/hostPath in environments where no storage is available, and solve that problem via documentation or by asking the customer to configure storage. I do not think we should be automatically configuring local storage for these clusters.

Contributor Author:

I actually agree that using dynamic provisioning and CSI snapshots is vital for the automated backup.

I also agree to limit the automated backup to clusters with dynamic provisioning only.

wdyt @tjungblu @hasbro17 @dusk125 @soltysh

@Elbehery force-pushed the etcd-automated-backup-no-config branch 2 times, most recently from 8ad5f84 to 99044c1 on June 18, 2024 at 18:00

Elbehery commented Jun 18, 2024

Also see this default config I used to test the approach:


```yaml
# StorageClass with no provisioner: PVs are created statically (local volumes).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: etcd-backup-local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: Immediate

---
# PVC that the EtcdBackup CR points at; binds to the static local PV below.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-backup-pvc
  namespace: openshift-etcd
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
  storageClassName: etcd-backup-local-storage

---
# Static local PV; Retain keeps the backup data even if the claim is deleted.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: etcd-backup-pv-fs
spec:
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: etcd-backup-local-storage
  local:
    path: /mnt
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:

---
# One-off backup request referencing the PVC above.
apiVersion: operator.openshift.io/v1alpha1
kind: EtcdBackup
metadata:
  name: etcd-single-backup
  namespace: openshift-etcd
spec:
  pvcName: etcd-backup-pvc
```




@Elbehery force-pushed the etcd-automated-backup-no-config branch from 99044c1 to 51fdbad on June 18, 2024 at 21:05
- Several options exist for the default `PVCName`.
- Relying on `dynamic provisioning` is sufficient; however, it is not an option for `SNO` or `BM` clusters.
- Utilising the `local storage operator` is a proper solution; however, installing a whole operator is too much overhead.
- The most viable solution to cover all OCP variants is to use `local volume`.
Member:

You don't necessarily have to install the local-storage-operator to use/consume local storage. For a simple one-off use case like this, it may be possible that the etcd-backup operator or something creates a static PV as documented - https://docs.openshift.com/container-platform/4.15/storage/persistent_storage/persistent_storage_local/persistent-storage-local.html#local-create-cr-manual_persistent-storage-local - and uses it to perform etcd backups.



### Workflow Description
- The user will enable the AutomatedBackupNoConfig feature gate (under discussion).
Member:

Given that AutomatedEtcdBackup is still a tech preview feature, I'd consider expanding that functionality with the default configuration, rather than adding another one. This will save a lot of problems for some of the fields in the API you're planning to provide default values for.

Contributor Author:

+1, I will fix this 👍🏽

#### Standalone Clusters
TBD
#### Single-node Deployments or MicroShift
TBD
Member:

I believe the strategy will be different for different topologies, so I'd expect an outline of how each topology will be handled wrt defaults.

Contributor Author:

So we are planning to use a SideCar container within each etcd pod manifest, and the backup will be saved to a hostPath volume.

IIUC, in this approach the strategy will be the same regardless of the topology, or?
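A minimal sketch of what such a sidecar might look like inside an etcd pod, assuming a periodic `etcdctl snapshot save` loop and a hostPath mount (images, interval, paths, and TLS handling are illustrative assumptions, not the agreed design):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-with-backup-sidecar       # illustrative; the real etcd pods are static pods rendered by CEO
  namespace: openshift-etcd
spec:
  containers:
  - name: etcd
    image: quay.io/example/etcd:latest               # placeholder image
    command: ["etcd", "--data-dir=/var/lib/etcd"]
    volumeMounts:
    - name: data
      mountPath: /var/lib/etcd
  - name: backup-sidecar
    image: quay.io/example/etcd-tools:latest         # placeholder image with etcdctl
    command:
    - /bin/sh
    - -c
    - |
      # TLS endpoint flags omitted for brevity
      while true; do
        etcdctl snapshot save "/backup/snapshot-$(date +%Y%m%d%H%M).db"
        sleep 86400                                  # assumed daily cadence
      done
    volumeMounts:
    - name: backup-dir
      mountPath: /backup
  volumes:
  - name: data
    emptyDir: {}                                     # illustration only; real etcd data is on the host
  - name: backup-dir
    hostPath:
      path: /var/lib/etcd-backup                     # hypothetical host directory
      type: DirectoryOrCreate
```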

Member:

Yes, it should, just make sure to call it out explicitly.

Member:

If you're taking the exact same approach across all supported topologies, I'd at minimum add here links to the design details, pointing out that the solution is orthogonal to topology of the cluster.






It supports all OCP variants, including `SNO` and `BM`.

However, I am strongly against using it for the following reasons
Member:

Suggested change:
- However, I am strongly against using it for the following reasons
+ However, the following reasons suggest against that solution:

Contributor Author:

This whole section has been removed, radical changes :D :D

- No scheduling guarantees for the pod when using `hostPath`, unlike with `localVolume`. The pod could be scheduled on a different node from where the hostPath volume exists.
- On the other hand, using `localVolume`, the node affinity within the PV manifest forces the backup pod to be scheduled on the specific node where the volume is attached.
- `localVolume` allows using a separate disk as the PV, unlike `hostPath`, which mounts a folder from the node's FS.
- `localVolume` is handled by the PV controller and the scheduler in a different manner; in fact, it was created to resolve issues with `hostPath`.
Member:

You forgot to add the locality of the backup. IOW, if it happens that the backup is kept on the current leader, and that node dies, we're losing that data with it.

Contributor Author:

So the current approach will take backups on all master nodes. Actually, we have another issue now: how to distinguish which backup is the most up-to-date, since we have backups from all master nodes :)

@Elbehery (Contributor Author)

/label tide/merge-method-squash

@openshift-ci bot added the tide/merge-method-squash label (denotes a PR that should be squashed by tide when it merges) on Jun 27, 2024
This enhancement builds on the previous work on [automated backup of etcd](https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html#creating-automated-etcd-backups_backup-etcd).
Therefore, it is vital to use the same feature gate for both approaches. The following issues need to be clarified before implementation:

* How to distinguish between `NoConfig` backups and backups that are triggered using the `EtcdBackup` CR.
Contributor Author:

@soltysh Would you kindly help me answer these questions?


we will call it name="default" and in the controller we'll skip the reconciliation of that CRD

Contributor Author:

So if the sidecar is enabled, do we skip the backup using the controller, or do we skip the default backups if the controller is enabled?

@Elbehery force-pushed the etcd-automated-backup-no-config branch 2 times, most recently from 2004ffe to 90ed5b2 on June 28, 2024 at 00:42
@Elbehery force-pushed the etcd-automated-backup-no-config branch 2 times, most recently from b00993a to 2c56179 on July 9, 2024 at 11:44
@soltysh (Member) left a comment:

I think the biggest feedback is the need for a clear separation between the automated backups and the user-configured automated backups. That's currently hidden in the implementation details. Having a clear boundary and an explanation of how one differs from the other, and how one impacts the other, is important.

### Goals

* Backups should be taken without configuration after cluster installation from day 1.
* Backups are saved to a default PersistentVolume, which could be overridden by the user.
Member:

I believe we discussed that users should override that value. We provide a best possible PV for the cluster, but in some cases that might mean ephemeral storage, which doesn't provide any guarantees. So it's best to make it explicit here.

Contributor Author:

So I am checking the Backup CR name, and the periodic backup controller ignores the CR with the name default.

This way we have both approaches alongside each other, and they both work independently.

### Non-Goals

* Save cluster backups to remote cloud storage (e.g. S3 Bucket).
- This could be a future enhancement or extension to the API.
Member:

As discussed in one of our calls, the ability to guarantee persistent storage for any supported cluster installation is very small. So I wouldn't even call out a future extension. It's the cluster admin's role to ensure that the PVCName is backed by solid storage.


### API Extensions

No [API](https://github.com/openshift/enhancements/blob/master/enhancements/etcd/automated-backups.md#api-extensions) changes are required, since this approach works independently with a default config.
Member:

I believe you'll need to introduce the defaults in the API, no?

Contributor Author:

So I will create a default CR in CEO; I will not use kubebuilder markers on the API:

```yaml
apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: default
  annotations:
    default: "true"
spec:
  etcd:
    schedule: "20 4 * * *"          # cron schedule: daily at 04:20
    timeZone: "UTC"
    retentionPolicy:
      retentionType: RetentionNumber
      retentionNumber:
        maxNumberOfBackups: 5       # keep at most five backups
```

Contributor:

When you create this default CR, is this going to be in all clusters? Are we happy doing this for existing clusters?

Would it be better for the installer to create this object so that it's only created on new clusters? Con: We can't manage it to change it on existing clusters (though we probably don't want to)


Since the `SideCar` is being deployed alongside each etcd cluster member, it is possible to keep backups across all master nodes.

On the other hand, the backups may **not** be up-to-date since the snapshot might be lagging behind the `WAL`. Therefore, it is recommended to use this approach alongside the Automated Backup enabled using the `EtcdBackup` CR.
Since this work will be enabled with no configuration, it is possible to define default values for the `Schedule` and `Retention` independently.
Member:

It looks like this is completely different from what I remember we talked about last time. I'd like you to change the initial part of this document to clearly explain this mechanism is orthogonal to the user-configured backups.


### Open Questions

This enhancement builds on the previous work on [automated backup of etcd](https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html#creating-automated-etcd-backups_backup-etcd).
Member:

Based on the above statements they are two separate mechanisms, are they not?

@Elbehery force-pushed the etcd-automated-backup-no-config branch from 2c56179 to 9a34c82 on July 30, 2024 at 19:48
@Elbehery force-pushed the etcd-automated-backup-no-config branch from 9a34c82 to 57ed843 on July 30, 2024 at 20:34

openshift-ci bot commented Jul 30, 2024

@Elbehery: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


## Proposal

To enable automated etcd backups of an Openshift cluster, a default backup is to be taken using a `SideCar` container within each etcd pod, with no configuration from the user.
Contributor:

Does this document describe how the existing backups are taken? Or is there something that can be read to understand how that works today?




* How to distinguish between `NoConfig` backups and backups that are triggered using the `EtcdBackup` CR.
* The `NoConfig` backups rely on an `EtcdBackup` CR with the name `default`.
* The cluster-etcd-operator reacts to the `default` CR by deploying the backup sidecar containers alongside each etcd member.
Contributor:

Does it need to be a sidecar? Is it going to run always? Could it not be a job? (I assume the existing backups happen via a job?)



* Upon enabling the `AutomatedBackup` feature gate, which approach should be used, and according to what criteria?
* As the `NoConfig` backups are orthogonal to the [automated backup of etcd](https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html#creating-automated-etcd-backups_backup-etcd), it has been decided to use the same feature gate.
Contributor:

That means you have to get both features ready and tested before either can be promoted. Strongly advise using a separate gate.



* How to disable the `NoConfig` behaviour, if a user wants to.
* This has not been decided or implemented yet.
Contributor:

If the installer created the object, and it was unmanaged, the user would just delete it


### Local Storage

Relying on local storage works for all Openshift variants (i.e. cloud-based, SNO, and BM). There are three possible approaches using local storage; each is detailed below.
Contributor:

On SNO, this doesn't seem particularly helpful. What is the recommendation going to be for SNO users?

* Pros
- It supports all Openshift variants, including `SNO` and `BM`.
* Cons
- `hostPath` could have security impact as it exposes the node's filesystem.
Contributor:

What about the other way around, are backups encrypted? What happens if someone gets hold of a backup by plucking it from a hostpath?

- It supports all Openshift variants, including `SNO` and `BM`.
* Cons
- `hostPath` could have security impact as it exposes the node's filesystem.
- No scheduling guarantees for the pod when using `hostPath`, unlike with `localVolume`. The pod could be scheduled on a different node from where the hostPath volume exists.
Contributor:

If you have a particular want, you could schedule the pods using the CEO to guarantee where they end up


As shown above, dynamic provisioning is the best storage solution to use, but it is not viable across all Openshift variants.
However, a hybrid solution is possible, in which dynamic provisioning is used on cloud-based Openshift, while local storage is utilised for BM and SNO.
Relying on the `Infrastructure` resource type, we can create the storage programmatically according to the underlying infrastructure and Openshift variant.
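For reference, the platform and topology information referred to here lives on the cluster-scoped `Infrastructure` resource; a trimmed illustration of its shape (the values shown are examples):

```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: HighlyAvailable     # SingleReplica on SNO
  infrastructureTopology: HighlyAvailable
  platformStatus:
    type: AWS                               # e.g. AWS, GCP, BareMetal, None
```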
Contributor:

So you plan to create cloud based PVs on the clouds?

### StatefulSet approach

A `StatefulSet` could be deployed among all master nodes, where each backup pod has its own `PV`. This approach has the advantage of spreading the backups among all master nodes.
The complexity comes from the fact that the backups are triggered by a `CronJob`, which spawns a `Job` to take the actual backup by deploying a Pod.
Contributor:

CEO could add the nodeName to the podSpec in the Job when it creates it and handle this
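A small sketch of that idea, assuming the CEO stamps the target node into the Job it renders (names, image, and the backup command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etcd-backup-master-0          # placeholder name
  namespace: openshift-etcd
spec:
  template:
    spec:
      nodeName: master-0              # pinned by the operator when it creates the Job
      restartPolicy: Never
      containers:
      - name: backup
        image: quay.io/example/etcd-tools:latest   # placeholder image
        command: ["cluster-backup.sh", "/backup"]  # hypothetical backup command
        volumeMounts:
        - name: backup-dir
          mountPath: /backup
      volumes:
      - name: backup-dir
        hostPath:
          path: /var/lib/etcd-backup               # hypothetical host directory
          type: DirectoryOrCreate
```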

Labels: jira/valid-reference, tide/merge-method-squash

8 participants