From b3d1fca44fb37386a5f4ff796c0c137e8cecda9d Mon Sep 17 00:00:00 2001
From: Mike VanDenburgh
Date: Thu, 2 Nov 2023 23:27:50 -0400
Subject: [PATCH 1/7] Add upload/asset blob GC design doc

---
 .../garbage-collection-uploads-asset-blobs.md | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)
 create mode 100644 doc/design/garbage-collection-uploads-asset-blobs.md

diff --git a/doc/design/garbage-collection-uploads-asset-blobs.md b/doc/design/garbage-collection-uploads-asset-blobs.md
new file mode 100644
index 000000000..8cd8fc51a
--- /dev/null
+++ b/doc/design/garbage-collection-uploads-asset-blobs.md
@@ -0,0 +1,26 @@
+# Upload and Asset Blob Garbage Collection
+
+## Background
+
+Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no associated with any dandisets) is more complex and is left for a future design document.
+
+## Why do we need garbage collection?
+
+When a user creates an asset, they send a request to the API and the API returns a series of presigned URLs for the user to perform a multipart upload to. Then, an `Upload` database row is created to track the status of the upload. When the user is done uploading their data to the presigned URLs, they must “finalize” the upload by sending a request to the API to create an `AssetBlob` out of that `Upload`. Finally, they must make one more request to actually associate this new `AssetBlob` with an `Asset`.
+
+### Orphaned Uploads
+
+If the user cancels a multipart upload partway through, or completes the multipart upload to S3 but does not “finalize” the upload, then the upload becomes “orphaned”, i.e. the associated `Upload` record and S3 object remain in the database/bucket indefinitely.
+
+### Orphaned AssetBlobs
+
+In this case, assume that the user properly completes the multipart upload flow and “finalizes” the `Upload` record such that it is now an `AssetBlob`, but they do not send a request to associate the new blob with an `Asset`. That `AssetBlob` record and associated S3 object will remain in the database/bucket indefinitely.
+
+## Implementation
+
+We will introduce a new celery-beat task that runs daily. This task will
+
+- Query for and delete any uploads that are older than the multipart upload presigned URL expiration time (this is currently 7 days).
+- Query for and delete any AssetBlobs that are (1) not associated with any Assets, and (2) older than 7 days.
+
+Due to the trailing delete lifecycle rule, the actual uploaded data will remain recoverable for up to 30 days after this deletion, after which the lifecycle rule will clear it out of the bucket permanently.
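
The two queries listed under "Implementation" can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the dandi-archive implementation: the `Upload` and `AssetBlob` dataclasses, their fields, and both function names are hypothetical stand-ins for the real Django models, and the 7-day threshold for both record types is taken from the design text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: both thresholds are 7 days, as stated in the design doc.
UPLOAD_EXPIRATION = timedelta(days=7)
ASSET_BLOB_EXPIRATION = timedelta(days=7)


@dataclass
class Upload:
    """Hypothetical stand-in for the `Upload` database row."""
    upload_id: str
    created: datetime


@dataclass
class AssetBlob:
    """Hypothetical stand-in for the `AssetBlob` database row."""
    blob_id: str
    created: datetime
    asset_count: int  # number of Assets that reference this blob


def stale_uploads(uploads, now):
    """Return uploads created before the presigned URL expiration cutoff."""
    cutoff = now - UPLOAD_EXPIRATION
    return [u for u in uploads if u.created < cutoff]


def orphaned_asset_blobs(blobs, now):
    """Return blobs that no Asset references and that are past the cutoff."""
    cutoff = now - ASSET_BLOB_EXPIRATION
    return [b for b in blobs if b.asset_count == 0 and b.created < cutoff]
```

In a Django codebase the same conditions would presumably be written as queryset filters; that translation is left out here because it depends on the actual model fields.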
From 85b2c96d111b9659622f96993427f8a86b2986b3 Mon Sep 17 00:00:00 2001
From: Mike VanDenburgh <37340715+mvandenburgh@users.noreply.github.com>
Date: Fri, 3 Nov 2023 15:16:45 -0400
Subject: [PATCH 2/7] Fix typo

Co-authored-by: Yaroslav Halchenko
---
 doc/design/garbage-collection-uploads-asset-blobs.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/design/garbage-collection-uploads-asset-blobs.md b/doc/design/garbage-collection-uploads-asset-blobs.md
index 8cd8fc51a..c34b46caf 100644
--- a/doc/design/garbage-collection-uploads-asset-blobs.md
+++ b/doc/design/garbage-collection-uploads-asset-blobs.md
@@ -2,7 +2,7 @@

 ## Background

-Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no associated with any dandisets) is more complex and is left for a future design document.
+Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no longer associated with any version of a dandiset) is more complex and is left for a future design document.

 ## Why do we need garbage collection?


From 11942e10dee6a8369f9f732ff535a59f2fdbfdf6 Mon Sep 17 00:00:00 2001
From: Mike VanDenburgh
Date: Fri, 3 Nov 2023 15:19:41 -0400
Subject: [PATCH 3/7] Clarify zarr garbage collection

Add a note that this design only applies to regular assets, and not zarrs.
---
 doc/design/garbage-collection-uploads-asset-blobs.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/design/garbage-collection-uploads-asset-blobs.md b/doc/design/garbage-collection-uploads-asset-blobs.md
index c34b46caf..266b48fb6 100644
--- a/doc/design/garbage-collection-uploads-asset-blobs.md
+++ b/doc/design/garbage-collection-uploads-asset-blobs.md
@@ -2,13 +2,13 @@

 ## Background

-Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no longer associated with any version of a dandiset) is more complex and is left for a future design document.
+Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no longer associated with any version of a dandiset) is more complex and is left for a future design document. Additionally, the garbage collection process in this design document only applies to regular assets; garbage collection of Zarrs is not covered.

 ## Why do we need garbage collection?

 When a user creates an asset, they send a request to the API and the API returns a series of presigned URLs for the user to perform a multipart upload to. Then, an `Upload` database row is created to track the status of the upload. When the user is done uploading their data to the presigned URLs, they must “finalize” the upload by sending a request to the API to create an `AssetBlob` out of that `Upload`. Finally, they must make one more request to actually associate this new `AssetBlob` with an `Asset`.

-### Orphaned Uploads
+### Orphaned Asset Uploads

 If the user cancels a multipart upload partway through, or completes the multipart upload to S3 but does not “finalize” the upload, then the upload becomes “orphaned”, i.e. the associated `Upload` record and S3 object remain in the database/bucket indefinitely.

From cfff55fb123752cefb651437f9036e74f81d1057 Mon Sep 17 00:00:00 2001
From: Mike VanDenburgh
Date: Mon, 6 Nov 2023 10:11:36 -0500
Subject: [PATCH 4/7] Clarify that objects need to be cleared from both S3 and DB

---
 doc/design/garbage-collection-uploads-asset-blobs.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/doc/design/garbage-collection-uploads-asset-blobs.md b/doc/design/garbage-collection-uploads-asset-blobs.md
index 266b48fb6..f0f2a8c23 100644
--- a/doc/design/garbage-collection-uploads-asset-blobs.md
+++ b/doc/design/garbage-collection-uploads-asset-blobs.md
@@ -23,4 +23,6 @@ We will introduce a new celery-beat task that runs daily. This task will
 - Query for and delete any uploads that are older than the multipart upload presigned URL expiration time (this is currently 7 days).
 - Query for and delete any AssetBlobs that are (1) not associated with any Assets, and (2) older than 7 days.

+In both cases, we need to delete both the blob from S3 and the row from the DB in order to avoid getting into an inconsistent state.
+
 Due to the trailing delete lifecycle rule, the actual uploaded data will remain recoverable for up to 30 days after this deletion, after which the lifecycle rule will clear it out of the bucket permanently.
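
One way to keep S3 and the database consistent, as the paragraph added in this patch requires, is to remove the S3 object first and drop the row only after that succeeds, so a failed deletion leaves the row in place for the next daily run to retry. The ordering is an assumption, not something the design doc specifies, and `StorageError`, `InMemoryBucket`, and `delete_blob_and_row` are hypothetical names standing in for real S3 and ORM calls.

```python
class StorageError(Exception):
    """Raised by the stand-in bucket when a delete cannot be performed."""


class InMemoryBucket:
    """Hypothetical stand-in for an S3 bucket, tracked as a set of keys."""

    def __init__(self, keys, available=True):
        self.keys = set(keys)
        self.available = available

    def delete(self, key):
        if not self.available:
            raise StorageError(f"could not delete {key}")
        self.keys.discard(key)


def delete_blob_and_row(key, bucket, rows):
    """Delete the object, then its row; return True only if both are gone."""
    try:
        bucket.delete(key)
    except StorageError:
        # Keep the row so the orphan is still discoverable on the next run.
        return False
    rows.pop(key, None)
    return True
```

With this ordering a partial failure leaves a row whose object still exists, which a later run can clean up, rather than an object with no row pointing at it.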
From bd1dee2cd75a083c3354ac18a1a41b57a150442a Mon Sep 17 00:00:00 2001
From: Mike VanDenburgh
Date: Mon, 6 Nov 2023 10:12:55 -0500
Subject: [PATCH 5/7] Use relative links for other design docs

---
 doc/design/garbage-collection-uploads-asset-blobs.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/design/garbage-collection-uploads-asset-blobs.md b/doc/design/garbage-collection-uploads-asset-blobs.md
index f0f2a8c23..25c1f8943 100644
--- a/doc/design/garbage-collection-uploads-asset-blobs.md
+++ b/doc/design/garbage-collection-uploads-asset-blobs.md
@@ -2,7 +2,7 @@

 ## Background

-Now that the [design for S3 trailing delete](https://github.com/dandi/dandi-archive/blob/master/doc/design/s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no longer associated with any version of a dandiset) is more complex and is left for a future design document. Additionally, the garbage collection process in this design document only applies to regular assets; garbage collection of Zarrs is not covered.
+Now that the [design for S3 trailing delete](./s3-trailing-delete.md) is deployed to staging, we are ready to implement garbage collection. [This older design document](./garbage-collection-1.md#uploads) is still relevant, and summarizes the various types of garbage collection we want to implement. This document will present a design for garbage collection of uploads and asset blobs, i.e. garbage that accumulates due to improper uploads done by users. A design for garbage collection of orphaned “Assets” (i.e. files that have been properly uploaded, have metadata, etc. but are no longer associated with any version of a dandiset) is more complex and is left for a future design document. Additionally, the garbage collection process in this design document only applies to regular assets; garbage collection of Zarrs is not covered.

 ## Why do we need garbage collection?


From 5f1301a478a471e818a916cee29bda639002b5bb Mon Sep 17 00:00:00 2001
From: Mike VanDenburgh
Date: Mon, 6 Nov 2023 10:23:52 -0500
Subject: [PATCH 6/7] Add current orphaned data count

---
 doc/design/garbage-collection-uploads-asset-blobs.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/doc/design/garbage-collection-uploads-asset-blobs.md b/doc/design/garbage-collection-uploads-asset-blobs.md
index 25c1f8943..6325cca2e 100644
--- a/doc/design/garbage-collection-uploads-asset-blobs.md
+++ b/doc/design/garbage-collection-uploads-asset-blobs.md
@@ -26,3 +26,13 @@ We will introduce a new celery-beat task that runs daily. This task will
 In both cases, we need to delete both the blob from S3 and the row from the DB in order to avoid getting into an inconsistent state.

 Due to the trailing delete lifecycle rule, the actual uploaded data will remain recoverable for up to 30 days after this deletion, after which the lifecycle rule will clear it out of the bucket permanently.
+
+## Data
+
+The current amount of orphaned data in the system as of 11/6/2023 is as follows:
+
+Orphaned `Uploads`: 740
+
+Orphaned `AssetBlobs`: 5
+
+Orphaned `Assets`: 175,545

From 23c55ad6727985c98a8f7d18bfb05548c44f55c7 Mon Sep 17 00:00:00 2001
From: Mike VanDenburgh
Date: Mon, 6 Nov 2023 10:38:05 -0500
Subject: [PATCH 7/7] Add note about additional cause of orphaned asset blobs

---
 doc/design/garbage-collection-uploads-asset-blobs.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/doc/design/garbage-collection-uploads-asset-blobs.md b/doc/design/garbage-collection-uploads-asset-blobs.md
index 6325cca2e..64efd2248 100644
--- a/doc/design/garbage-collection-uploads-asset-blobs.md
+++ b/doc/design/garbage-collection-uploads-asset-blobs.md
@@ -16,6 +16,8 @@ If the user cancels a multipart upload partway through, or completes the multipa

 In this case, assume that the user properly completes the multipart upload flow and “finalizes” the `Upload` record such that it is now an `AssetBlob`, but they do not send a request to associate the new blob with an `Asset`. That `AssetBlob` record and associated S3 object will remain in the database/bucket indefinitely.

+Another potential cause of orphaned `AssetBlobs` could be `Asset` garbage collection itself. `Asset` garbage collection will be designed independently of the logic to clean up `AssetBlobs`, and running it might also result in orphaned `AssetBlobs`.
+
 ## Implementation

 We will introduce a new celery-beat task that runs daily. This task will