From eb9ee21b6560fe84d0be1b438973b8b8fc0085e4 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh Date: Mon, 28 Aug 2023 09:20:22 -0400 Subject: [PATCH 1/7] Design doc for undelete feature --- doc/design/s3-undelete.md | 69 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 69 insertions(+) create mode 100644 doc/design/s3-undelete.md diff --git a/doc/design/s3-undelete.md b/doc/design/s3-undelete.md new file mode 100644 index 000000000..65c470990 --- /dev/null +++ b/doc/design/s3-undelete.md @@ -0,0 +1,69 @@ +# **S3 Undelete** + +## **Why is “undelete” necessary?** + +The core value of the DANDI Archive comes from the data we host. The process for getting this data into DANDI often involves coordination between several people to get an extremely large volume of data annotated with useful metadata and uploaded to our system. Because of the amount of time and work involved in this process, we need to minimize the risk of accidental data loss to the greatest extent that is possible and reasonable. Additionally, we would like to implement “garbage collection” in the future, which involves programmatically clearing out stale asset blobs from S3. All of this leads to a desire to be able to “undelete” an s3 object that has been deleted. + +Our ultimate goal is to prevent data loss from application programming errors. With protection such as an undelete capability, we will be safer in implementing application features that involve intentional deletion of data. Any bugs we introduce while doing so are far less likely to destroy data that was not supposed to be deleted. + +The original GitHub issue around this feature request can be found at [https://github.com/dandi/dandi-archive/issues/524](https://github.com/dandi/dandi-archive/issues/524). Although the issue asks for a Deep Glacier storage tier, the design in this document solves the underlying problem differently (and in a more robust way). Below we address the possible usage of a Deep Glacier tiered bucket as a solution to the orthogonal problem of data ******backup******, which addresses a different problem than the undelete capability described in this document. + +## **Requirements** + +- After deletion of an asset blob, there needs to be a period of 30 days during which that blob can be restored. + +## **Proposed Solution** + +What we want can be described as a “trailing delete” mechanism. Upon deletion of an asset from the bucket, we would like the object to remain recoverable for some amount of time. S3 already supports this in the form of Bucket Versioning. + +### **S3 Bucket Versioning** + +Enabling bucket versioning will change what happens when an object in S3 is deleted. Instead of permanently deleting the object, S3 will simply place a delete marker on it. At that point, the object is hidden from view and appears to be deleted, but still exists and is recoverable. + +In addition, we can place an S3 Lifecycle policy on the bucket that automatically clears delete markers and “permanently deletes” their associated objects after some set amount of time. + +``` + # Terraform-encoded lifecycle rule. 
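 # Assumed context for this snippet (illustrative, not the actual DANDI
 # Terraform config): the rule below is written as the body of an
 # aws_s3_bucket_lifecycle_configuration resource attached to the versioned
 # bucket, e.g. bucket = aws_s3_bucket.this.id.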
+ # Based on https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex7 + rule { + id = "ExpireOldDeleteMarkers" + filter {} + + # Expire objects with delete markers after 1 day + noncurrent_version_expiration { + noncurrent_days = 1 + } + + # Also delete any delete markers associated with the expired object + expiration { + expired_object_delete_marker = true + } + + status = "Enabled" + } +``` + +This may raise an additional question - since one of the main reasons for this “undelete” functionality is to recover from accidental deletion of data, what happens if a delete marker is accidentally deleted? We can solve this by introducing a bucket policy that prevents deletion of delete markers. + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "PreventDeletionOfDeleteMarkers", + "Effect": "Deny", + "Principal": "*", + "Action": "s3:DeleteObjectVersion", + "Resource": "arn:aws:s3:::dandi-s3-experiment-bucket/*" + } + ] +} +``` + +In sum: deletion of an asset at the application level will trigger placing a delete marker on the appropriate S3 object; an S3 lifecycle rule will schedule that object for actual deletion 30 days later; an appropriate bucket policy will ensure that nobody can manually destroy data, even by accident. (There is a way to manually destroy data, but it cannot be done by accident: someone with the power to change the bucket policies would first need to remove the protective policy above, and ****then**** perform a manual delete of the appropriate objects. This affords the right level of security for our purposes: application-level errors will not be able to destroy data irrevocably.) + +# Distinction from Data Backup + +It’s important to note that the “undelete” implementation proposed above does not cover backup of the data in the bucket. While backup is out of scope for this design document, nothing proposed here *prevents* backup from being implemented, and features such as [S3 Replication](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html) may be useful for that. + +A possible backup implementation would simply allocate a new bucket in a separate S3 region set to use the Deep Glacier storage tier (to control costs), then use S3 Replication as mentioned above to maintain an object-for-object copy of the production bucket as a pure backup. This backup solution would ***not*** defend against application-level bugs deleting data, but would instead protect the production bucket against larger-scale threats such as destruction of Amazon data centers, etc. From cfa8f0a142a136b595f357c0b065cac5947861b6 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh <37340715+mvandenburgh@users.noreply.github.com> Date: Tue, 29 Aug 2023 13:13:34 -0400 Subject: [PATCH 2/7] Remove unnecessary asterisks for bolding --- doc/design/s3-undelete.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/design/s3-undelete.md b/doc/design/s3-undelete.md index 65c470990..4007d7ac2 100644 --- a/doc/design/s3-undelete.md +++ b/doc/design/s3-undelete.md @@ -6,7 +6,7 @@ The core value of the DANDI Archive comes from the data we host. The process for Our ultimate goal is to prevent data loss from application programming errors. With protection such as an undelete capability, we will be safer in implementing application features that involve intentional deletion of data. Any bugs we introduce while doing so are far less likely to destroy data that was not supposed to be deleted. 
-The original GitHub issue around this feature request can be found at [https://github.com/dandi/dandi-archive/issues/524](https://github.com/dandi/dandi-archive/issues/524). Although the issue asks for a Deep Glacier storage tier, the design in this document solves the underlying problem differently (and in a more robust way). Below we address the possible usage of a Deep Glacier tiered bucket as a solution to the orthogonal problem of data ******backup******, which addresses a different problem than the undelete capability described in this document. +The original GitHub issue around this feature request can be found at [https://github.com/dandi/dandi-archive/issues/524](https://github.com/dandi/dandi-archive/issues/524). Although the issue asks for a Deep Glacier storage tier, the design in this document solves the underlying problem differently (and in a more robust way). Below we address the possible usage of a Deep Glacier tiered bucket as a solution to the orthogonal problem of data **backup** which addresses a different problem than the undelete capability described in this document. ## **Requirements** @@ -60,7 +60,7 @@ This may raise an additional question - since one of the main reasons for this } ``` -In sum: deletion of an asset at the application level will trigger placing a delete marker on the appropriate S3 object; an S3 lifecycle rule will schedule that object for actual deletion 30 days later; an appropriate bucket policy will ensure that nobody can manually destroy data, even by accident. (There is a way to manually destroy data, but it cannot be done by accident: someone with the power to change the bucket policies would first need to remove the protective policy above, and ****then**** perform a manual delete of the appropriate objects. This affords the right level of security for our purposes: application-level errors will not be able to destroy data irrevocably.) +In sum: deletion of an asset at the application level will trigger placing a delete marker on the appropriate S3 object; an S3 lifecycle rule will schedule that object for actual deletion 30 days later; an appropriate bucket policy will ensure that nobody can manually destroy data, even by accident. (There is a way to manually destroy data, but it cannot be done by accident: someone with the power to change the bucket policies would first need to remove the protective policy above, and **then** perform a manual delete of the appropriate objects. This affords the right level of security for our purposes: application-level errors will not be able to destroy data irrevocably.) 
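As an illustration of what recovery looks like under this scheme (a hedged sketch with a placeholder key name, not the archive's actual tooling): restoring a blob within the 30-day window amounts to making its newest non-current version current again. Copying that version back over the key needs only `s3:GetObjectVersion` and `s3:PutObject`, so it should not be blocked by the deny on `s3:DeleteObjectVersion` above.

```python
import boto3


def undelete(bucket: str, key: str) -> None:
    """Make the newest surviving version of ``key`` the current version again."""
    s3 = boto3.client("s3")

    # Versions come back newest-first; keep only exact matches for this key.
    resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
    versions = [v for v in resp.get("Versions", []) if v["Key"] == key]
    if not versions:
        raise RuntimeError(f"no recoverable versions of {key}")

    # Copying a specific version onto the same key creates a fresh current
    # version, so the object becomes visible again. Objects larger than 5 GB
    # would need a multipart copy instead of copy_object.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key, "VersionId": versions[0]["VersionId"]},
    )


# Hypothetical usage against the experiment bucket named in the policy above.
undelete("dandi-s3-experiment-bucket", "blobs/example-asset-blob")
```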
# Distinction from Data Backup From 218e1b08d8b32a95f8cd52d46c97d8168b1b7a68 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh <37340715+mvandenburgh@users.noreply.github.com> Date: Tue, 5 Sep 2023 11:43:05 -0400 Subject: [PATCH 3/7] Change title to "S3 Trailing Delete" Co-authored-by: Yaroslav Halchenko --- doc/design/s3-undelete.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/s3-undelete.md b/doc/design/s3-undelete.md index 4007d7ac2..5295165ce 100644 --- a/doc/design/s3-undelete.md +++ b/doc/design/s3-undelete.md @@ -1,4 +1,4 @@ -# **S3 Undelete** +# **S3 Trailing Delete** ## **Why is “undelete” necessary?** From 193684fe9c72df7cc335bca5422f912fc7052014 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh Date: Tue, 5 Sep 2023 11:43:59 -0400 Subject: [PATCH 4/7] Update filename with new title --- doc/design/{s3-undelete.md => s3-trailing-delete.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename doc/design/{s3-undelete.md => s3-trailing-delete.md} (100%) diff --git a/doc/design/s3-undelete.md b/doc/design/s3-trailing-delete.md similarity index 100% rename from doc/design/s3-undelete.md rename to doc/design/s3-trailing-delete.md From 7ae9f9ea8ae523013522652bf33ff061ad7842e1 Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh Date: Tue, 5 Sep 2023 13:25:40 -0400 Subject: [PATCH 5/7] Update "undelete" references to "trailing delete" --- doc/design/s3-trailing-delete.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/doc/design/s3-trailing-delete.md b/doc/design/s3-trailing-delete.md index 5295165ce..53463629a 100644 --- a/doc/design/s3-trailing-delete.md +++ b/doc/design/s3-trailing-delete.md @@ -1,12 +1,12 @@ # **S3 Trailing Delete** -## **Why is “undelete” necessary?** +## **Why is "trailing delete" necessary?** -The core value of the DANDI Archive comes from the data we host. The process for getting this data into DANDI often involves coordination between several people to get an extremely large volume of data annotated with useful metadata and uploaded to our system. Because of the amount of time and work involved in this process, we need to minimize the risk of accidental data loss to the greatest extent that is possible and reasonable. Additionally, we would like to implement “garbage collection” in the future, which involves programmatically clearing out stale asset blobs from S3. All of this leads to a desire to be able to “undelete” an s3 object that has been deleted. +The core value of the DANDI Archive comes from the data we host. The process for getting this data into DANDI often involves coordination between several people to get an extremely large volume of data annotated with useful metadata and uploaded to our system. Because of the amount of time and work involved in this process, we need to minimize the risk of accidental data loss to the greatest extent that is possible and reasonable. Additionally, we would like to implement “garbage collection” in the future, which involves programmatically clearing out stale asset blobs from S3. All of this leads to a desire to be able to recover an s3 object that has been deleted. -Our ultimate goal is to prevent data loss from application programming errors. With protection such as an undelete capability, we will be safer in implementing application features that involve intentional deletion of data. Any bugs we introduce while doing so are far less likely to destroy data that was not supposed to be deleted. 
+Our ultimate goal is to prevent data loss from application programming errors. With protection such as a trailing delete capability, we will be safer in implementing application features that involve intentional deletion of data. Any bugs we introduce while doing so are far less likely to destroy data that was not supposed to be deleted. -The original GitHub issue around this feature request can be found at [https://github.com/dandi/dandi-archive/issues/524](https://github.com/dandi/dandi-archive/issues/524). Although the issue asks for a Deep Glacier storage tier, the design in this document solves the underlying problem differently (and in a more robust way). Below we address the possible usage of a Deep Glacier tiered bucket as a solution to the orthogonal problem of data **backup** which addresses a different problem than the undelete capability described in this document. +The original GitHub issue around this feature request can be found at [https://github.com/dandi/dandi-archive/issues/524](https://github.com/dandi/dandi-archive/issues/524). Although the issue asks for a Deep Glacier storage tier, the design in this document solves the underlying problem differently (and in a more robust way). Below we address the possible usage of a Deep Glacier tiered bucket as a solution to the orthogonal problem of data **backup** which addresses a different problem than the trailing delete capability described in this document. ## **Requirements** @@ -43,7 +43,7 @@ In addition, we can place an S3 Lifecycle policy on the bucket that automaticall } ``` -This may raise an additional question - since one of the main reasons for this “undelete” functionality is to recover from accidental deletion of data, what happens if a delete marker is accidentally deleted? We can solve this by introducing a bucket policy that prevents deletion of delete markers. +This may raise an additional question - since one of the main reasons for this "trailing delete" functionality is to recover from accidental deletion of data, what happens if a delete marker is accidentally deleted? We can solve this by introducing a bucket policy that prevents deletion of delete markers. ```json { @@ -64,6 +64,6 @@ In sum: deletion of an asset at the application level will trigger placing a del # Distinction from Data Backup -It’s important to note that the “undelete” implementation proposed above does not cover backup of the data in the bucket. While backup is out of scope for this design document, nothing proposed here *prevents* backup from being implemented, and features such as [S3 Replication](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html) may be useful for that. +It’s important to note that the "trailing delete" implementation proposed above does not cover backup of the data in the bucket. While backup is out of scope for this design document, nothing proposed here *prevents* backup from being implemented, and features such as [S3 Replication](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html) may be useful for that. A possible backup implementation would simply allocate a new bucket in a separate S3 region set to use the Deep Glacier storage tier (to control costs), then use S3 Replication as mentioned above to maintain an object-for-object copy of the production bucket as a pure backup. 
This backup solution would ***not*** defend against application-level bugs deleting data, but would instead protect the production bucket against larger-scale threats such as destruction of Amazon data centers, etc. From f53cc75de7e26f634a9ca5627e3c06e802d8e369 Mon Sep 17 00:00:00 2001 From: Yaroslav Halchenko Date: Thu, 7 Sep 2023 14:34:49 -0400 Subject: [PATCH 6/7] [DATALAD RUNCMD] Remove unnecessary **emphasis in section headers === Do not change lines below === { "chain": [], "cmd": "sed -e 's,# \\*\\*\\(.*\\)\\*\\*,# \\1,g' -i doc/design/s3-trailing-delete.md", "exit": 0, "extra_inputs": [], "inputs": [], "outputs": [], "pwd": "." } ^^^ Do not change lines above ^^^ --- doc/design/s3-trailing-delete.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/doc/design/s3-trailing-delete.md b/doc/design/s3-trailing-delete.md index 53463629a..676c3302b 100644 --- a/doc/design/s3-trailing-delete.md +++ b/doc/design/s3-trailing-delete.md @@ -1,6 +1,6 @@ -# **S3 Trailing Delete** +# S3 Trailing Delete -## **Why is "trailing delete" necessary?** +## Why is "trailing delete" necessary? The core value of the DANDI Archive comes from the data we host. The process for getting this data into DANDI often involves coordination between several people to get an extremely large volume of data annotated with useful metadata and uploaded to our system. Because of the amount of time and work involved in this process, we need to minimize the risk of accidental data loss to the greatest extent that is possible and reasonable. Additionally, we would like to implement “garbage collection” in the future, which involves programmatically clearing out stale asset blobs from S3. All of this leads to a desire to be able to recover an s3 object that has been deleted. @@ -8,15 +8,15 @@ Our ultimate goal is to prevent data loss from application programming errors. W The original GitHub issue around this feature request can be found at [https://github.com/dandi/dandi-archive/issues/524](https://github.com/dandi/dandi-archive/issues/524). Although the issue asks for a Deep Glacier storage tier, the design in this document solves the underlying problem differently (and in a more robust way). Below we address the possible usage of a Deep Glacier tiered bucket as a solution to the orthogonal problem of data **backup** which addresses a different problem than the trailing delete capability described in this document. -## **Requirements** +## Requirements - After deletion of an asset blob, there needs to be a period of 30 days during which that blob can be restored. -## **Proposed Solution** +## Proposed Solution What we want can be described as a “trailing delete” mechanism. Upon deletion of an asset from the bucket, we would like the object to remain recoverable for some amount of time. S3 already supports this in the form of Bucket Versioning. -### **S3 Bucket Versioning** +### S3 Bucket Versioning Enabling bucket versioning will change what happens when an object in S3 is deleted. Instead of permanently deleting the object, S3 will simply place a delete marker on it. At that point, the object is hidden from view and appears to be deleted, but still exists and is recoverable. 
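To make the delete-marker behavior concrete (a sketch with placeholder names, not DANDI code): after a plain delete on a versioned bucket, the key disappears from normal listings, but `list_object_versions` still reports both the delete marker and the original data.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "dandi-s3-experiment-bucket", "blobs/example-asset-blob"  # placeholders

s3.put_object(Bucket=bucket, Key=key, Body=b"example data")
s3.delete_object(Bucket=bucket, Key=key)  # adds a delete marker; data is retained

# The key looks gone to ordinary reads and listings...
assert "Contents" not in s3.list_objects_v2(Bucket=bucket, Prefix=key)

# ...but the original version and its delete marker both still exist.
versions = s3.list_object_versions(Bucket=bucket, Prefix=key)
assert any(v["Key"] == key for v in versions.get("Versions", []))
assert any(m["Key"] == key for m in versions.get("DeleteMarkers", []))
```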
From 713c2e00f131629e5e7f1bd82991830d9c614f4e Mon Sep 17 00:00:00 2001 From: Mike VanDenburgh <37340715+mvandenburgh@users.noreply.github.com> Date: Thu, 7 Sep 2023 16:42:40 -0400 Subject: [PATCH 7/7] Update expiration time to reflect requirements section Co-authored-by: Yaroslav Halchenko --- doc/design/s3-trailing-delete.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/design/s3-trailing-delete.md b/doc/design/s3-trailing-delete.md index 676c3302b..129eec49c 100644 --- a/doc/design/s3-trailing-delete.md +++ b/doc/design/s3-trailing-delete.md @@ -29,9 +29,9 @@ In addition, we can place an S3 Lifecycle policy on the bucket that automaticall id = "ExpireOldDeleteMarkers" filter {} - # Expire objects with delete markers after 1 day + # Expire objects with delete markers after 30 days noncurrent_version_expiration { - noncurrent_days = 1 + noncurrent_days = 30 } # Also delete any delete markers associated with the expired object
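Once the final rule above is deployed, the effective settings can be sanity-checked against the 30-day requirement with a short script. This is a hedged sketch: the bucket name is a placeholder and the rule ID is simply the one used in the Terraform example.

```python
import boto3

s3 = boto3.client("s3")
config = s3.get_bucket_lifecycle_configuration(Bucket="dandi-s3-experiment-bucket")

rule = next(r for r in config["Rules"] if r["ID"] == "ExpireOldDeleteMarkers")
assert rule["Status"] == "Enabled"
# Non-current (deleted) versions must survive for the full 30-day restore window.
assert rule["NoncurrentVersionExpiration"]["NoncurrentDays"] == 30
# Leftover delete markers should be cleaned up once their object has expired.
assert rule["Expiration"]["ExpiredObjectDeleteMarker"] is True
```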