Support S3-compatible blob storage #1071

alxndrsn · 2024-01-13T11:38:40Z

Closes getodk/central#585

Adds:

s3_status column on blobs table
CLI tool for:
- uploading blobs to s3
- resetting failed blobs -> pending

Changes:

Every use of blob.content must now be accompanied by a conditional fetch of the data from S3.
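
A minimal sketch of what such a conditional fetch might look like. It assumes that a blob's `content` is null once its data lives only in S3, which this description does not confirm. `getContentFor` is the helper name that appears later in this review; `contentOf` and the fake `s3` object are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: resolve a blob's bytes from whichever store holds them.
// Assumption: blob.content is null/undefined once the data lives only in S3.
const contentOf = (s3, blob) =>
  (blob.content != null
    ? Promise.resolve(blob.content) // still in the database
    : s3.getContentFor(blob));      // fetch from S3 on demand

// Stand-in for the real S3 client; records which blobs were requested.
const fetched = [];
const fakeS3 = {
  getContentFor: (blob) => {
    fetched.push(blob.id);
    return Promise.resolve(Buffer.from(`s3-data-for-${blob.id}`));
  },
};

// Only the second call should touch the (fake) S3 client.
const local = contentOf(fakeS3, { id: 1, content: Buffer.from('in-db') });
const remote = contentOf(fakeS3, { id: 2, content: null });
```

Returning a promise in both branches means callers never need to know where the data came from, which is the trade-off raised in the first query below.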

Queries:

would it be preferable to make blob.content async everywhere and move the s3 check/fetch inside the getter?
- pro: neater code, less likely to have bugs where blob.content used when data is not available
- con: seems intrusive of frames
should "purging" forms in central also delete content from S3? yes

.github/workflows/soak-test.yml

ktuite

I have a lot to look at still, but I want to summarize the shape I see:

Blob touchpoints (with a little bit of code for each case because the generic blob table is joined with other types of tables):

submission attachments
encrypted submissions
form xls files
client audits (subset of submission attachments) which are processed by a worker
that briefcase file

The function blobResponse, which serves blobs, has been modified to serve from S3 if the content is not in the database.

s3_status added to blobs table (pending, in_progress, uploaded, failed)

Simple worker defined in util/s3.js that is started in runServer.js
This is where the blob uploading happens!
The worker doesn't retry any failures; that is handled by an external CLI tool.
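
The status lifecycle described above could be sketched roughly as follows. The four status values come from the PR; the set of allowed transitions, and the names `TRANSITIONS`, `canTransition` and `resetFailed`, are assumptions made for illustration and are not taken from the actual code.

```javascript
// Status values are from the PR; the allowed transitions are an assumption.
const TRANSITIONS = new Map([
  ['pending', ['in_progress']],            // worker picks the blob up
  ['in_progress', ['uploaded', 'failed']], // upload finishes or errors
  ['failed', ['pending']],                 // reset by the CLI tool, not the worker
  ['uploaded', []],                        // assumed terminal
]);

const canTransition = (from, to) => (TRANSITIONS.get(from) ?? []).includes(to);

// Hypothetical in-memory version of the CLI "reset failed -> pending" step.
const resetFailed = (blobs) =>
  blobs.map((b) => (b.s3_status === 'failed' ? { ...b, s3_status: 'pending' } : b));
```

Keeping the retry decision in the CLI step means the worker itself never moves a blob out of `failed`, which matches the division of labour described above.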

exhaustBlobs function for tests and for cli script to use.

Am I missing anything big?

I haven't yet looked at:

the details of the worker
the different blob scenarios/touchpoints
tests (changes to existing tests, or e2e tests)

lib/bin/s3.js

lib/bin/s3-create-bucket.js

lib/bin/s3.js

lib/model/migrations/20231025-01-add-blob-s3.js

ktuite · 2024-02-22T01:02:07Z

lib/util/http.js

-
-  response.set('ETag', `"${serverEtag}"`);
+const withEtag = (serverEtag, fn, always=true) => (request, response) => {
+  if (always) response.set('ETag', `"${serverEtag}"`);


What's this always flag about? I see it's false for the s3 blobs. I'm just looking for a mini explanation/reminder about this etag stuff.

I've tried to clarify by:

adding a comment about always, and

splitting the exported withEtag() function to add an additional withEtagOnMatch()

Does this help? Perhaps withEtagOrRedirect() would be a clearer name for the new function?

I mostly follow but now I'm confused by the !always line below. I saw online that a 304 Not Modified response should always send back the ETag in the header anyway.

But also in the s3 redirect case, always will be false, so:

don't set etag

check client etag

if it matches, DO set etag and return 304 not modified

Does s3 make its own etag? Will these ever match and would we ever return a 304 in place of a 307?

Does s3 make its own etag? Will these ever match and would we ever return a 304 in place of a 307?

Great question. If a client sees an odk-central URL for an s3-backed blob, odk-central needs to ensure that the final ETag provided to the user is the same regardless of the binary data source.

This means either both systems need to generate the same ETag for the same data, or odk-central needs to substitute its ETag in place of S3's.

Currently:

odk-central's nginx should be transparently following 307 redirects for s3-backed content

both S3 and odk-central-backend use MD5(blob-content) as a strong ETag

This means that the current implementation should work fine. As there are no end-to-end central tests running both odk-central-backend and nginx, there's a risk this could break in future.
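
As an illustration of the MD5-based strong ETag mentioned above, here is a small sketch using Node's built-in `crypto` module. The function name `md5Etag` and the choice of hex encoding are assumptions; the thread only says that both systems use MD5(blob-content).

```javascript
const { createHash } = require('node:crypto');

// Strong ETag as a quoted MD5 digest of the content.
// Assumption: the digest is hex-encoded, as in S3's ETags for simple uploads.
const md5Etag = (content) =>
  `"${createHash('md5').update(content).digest('hex')}"`;

const example = md5Etag(Buffer.from('hello'));
// example === '"5d41402abc4b2a76b9719d911017c592"'
```

One caveat worth keeping in mind: S3 only reports a plain MD5 digest as the ETag for objects uploaded in a single part without SSE-KMS or SSE-C encryption, and other S3-compatible stores may behave differently, so the match relies on how the blobs are uploaded.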

What's this always flag about?

I've reversed this flag, renaming it as onlyOnMatch, and updated the comments. Is the intention clearer now?
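
For readers trying to follow the flag's behaviour, here is a rough reconstruction of the flow discussed in this thread. It is a sketch based on the diff fragment and the steps listed above, not the merged code: the body after the first `response.set` call, the exact-match comparison, and the fake request/response objects are all assumptions.

```javascript
// Sketch only: reconstructed from the review discussion, not the merged code.
const withEtag = (serverEtag, fn, onlyOnMatch = false) => (request, response) => {
  const quoted = `"${serverEtag}"`;
  if (!onlyOnMatch) response.set('ETag', quoted);

  const clientEtag = request.get('If-None-Match');
  if (clientEtag === quoted) {
    // A 304 should still carry the ETag, even in the redirect case.
    if (onlyOnMatch) response.set('ETag', quoted);
    response.status(304);
    return undefined;
  }
  return fn(); // e.g. send the blob, or redirect to S3
};

// Minimal fakes standing in for Express request/response objects.
const fakeRequest = (ifNoneMatch) => ({
  get: (name) => (name === 'If-None-Match' ? ifNoneMatch : undefined),
});
const fakeResponse = () => {
  const res = { headers: {}, statusCode: 200 };
  res.set = (key, value) => { res.headers[key] = value; return res; };
  res.status = (code) => { res.statusCode = code; return res; };
  return res;
};
```

With `onlyOnMatch` set, a request whose `If-None-Match` matches gets a 304 with the ETag, while any other request falls through to `fn()` without central setting an ETag itself.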

ktuite

I'm going through this bit by bit (while bigger discussions also take place).

I think I now have a good understanding of the part of the system that uses blobs that may or may not be on S3.

Next up, I'll probably take a closer look at s3.js and how the blobs actually get to and from s3.

ktuite · 2024-03-01T22:14:49Z

lib/data/attachments.js

+        s3.getContentFor(att)
+          .then(appendContent)
+          .catch(err => { this.destroy(err); });


I see what you mean about this query:

would it be preferable to make blob.content async everywhere and move the s3 check/fetch inside the getter?

pro: neater code, less likely to have bugs where blob.content used when data is not available

con: seems intrusive of frames

(Just highlighting this one because it seemed like the smallest example.)

Maybe it's okay to leave it like this since there are only about 4 places blobs are used (unless you have an idea for some clever alternative):

exporting attachments (here)

exporting encrypted submissions that were stored as blobs

processing client audit attachments

(different mechanism, urlForBlob instead of getContentFor like the ones above) the http blob response

That works as long as each of these 4 paths is tested, since each might behave differently if there is an error fetching from S3.

Some errors I'm thinking of:

auth problem

data doesn't exist where it should

response from s3 is taking too long

data is broken in some other way?

Makefile

.github/workflows/s3-e2e.yml

lib/bin/s3-create-bucket.js

ktuite

Comments from review today (more review tomorrow!)

test/e2e/s3/ci

Makefile

lib/model/migrations/20240311-01-add-blob-s3.js

.github/workflows/s3-e2e.yml

test/unit/data/attachments.js

test/unit/data/briefcase.js

test/util/s3.js

TODO.md

test/integration/api/submissions.js

lib/resources/submissions.js

alxndrsn · 2024-08-29T07:03:43Z

lib/external/s3.js

+
+  let destroyed = false;
+
+  const inflight = new Set();


Could use WeakRefs in this set if there is concern about memory leaks.
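
To make the purpose of the `destroyed` flag and `inflight` set concrete, here is a hedged sketch of how such tracking might work. Only those two variable names come from the diff; `makeTracker`, `track`, `destroy`, and the assumption that each tracked request exposes an `abort()` method are hypothetical.

```javascript
// Hypothetical tracker built around the two variables shown in the diff.
const makeTracker = () => {
  let destroyed = false;
  const inflight = new Set();

  // Register a request; returns a function to call once it settles.
  const track = (req) => {
    if (destroyed) throw new Error('tracker destroyed');
    inflight.add(req);
    return () => inflight.delete(req);
  };

  // Abort everything still in flight and refuse further work.
  const destroy = () => {
    destroyed = true;
    for (const req of inflight) req.abort(); // assumption: requests expose abort()
    inflight.clear();
  };

  return { track, destroy, size: () => inflight.size };
};
```

A WeakRef-based variant would store `new WeakRef(req)` and skip entries whose `deref()` returns `undefined`; whether that is worth the extra complexity depends on whether entries are reliably removed when requests settle.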

lib/external/s3.js

ktuite

Left a couple small comments about the database migration and package lock, but otherwise it looks good to go!!!

lib/model/migrations/20240619-01-add-blob-s3.js

package-lock.json

alxndrsn commented Feb 2, 2024

.github/workflows/soak-test.yml

matthew-white mentioned this pull request Feb 8, 2024

eslint: resolve prefer-promise-reject-errors violations #1040

Merged

alxndrsn marked this pull request as ready for review February 19, 2024 16:51

alxndrsn changed the title ~~wip: allow blob storage in s3~~ Support blob storage in s3 Feb 19, 2024

alxndrsn requested a review from ktuite February 19, 2024 16:56

ktuite reviewed Feb 22, 2024

lognaturel changed the title ~~Support blob storage in s3~~ Make it possible to configure S3-compatible blob storage Mar 1, 2024

ktuite reviewed Mar 1, 2024

alxndrsn changed the title ~~Make it possible to configure S3-compatible blob storage~~ Support S3-compatible blob storage May 14, 2024

alxndrsn commented Jun 13, 2024

Makefile

alxndrsn commented Jun 13, 2024

.github/workflows/s3-e2e.yml

alxndrsn commented Jun 13, 2024

lib/bin/s3-create-bucket.js

ktuite reviewed Jun 13, 2024

test/e2e/s3/ci

Makefile

lib/model/migrations/20240311-01-add-blob-s3.js

alxndrsn commented Jun 14, 2024

.github/workflows/s3-e2e.yml

alxndrsn commented Jun 14, 2024

test/unit/data/attachments.js

alxndrsn commented Jun 14, 2024

test/unit/data/briefcase.js

alxndrsn commented Jun 14, 2024

test/unit/data/briefcase.js

alxndrsn commented Jun 18, 2024

test/util/s3.js

ktuite reviewed Jun 18, 2024

TODO.md

ktuite reviewed Jun 18, 2024

test/integration/api/submissions.js

alxndrsn mentioned this pull request Jun 19, 2024

resources/odata: don't crash with surprising instanceId #1158

Merged

alxndrsn added 2 commits June 21, 2024 15:59

Add TODOs

7f10c97

Add TODO

f4368ee

alxndrsn commented Jun 21, 2024

lib/resources/submissions.js

alxndrsn added 3 commits June 21, 2024 16:38

Add TODOs

a236c1f

Add TODO

ca4a05a

Add TODO

58c536f

alxndrsn added 4 commits August 27, 2024 12:25

Fix blob streaming

0114300

blobId

45da0a6

Merge branch 'master' into s3-blob-storage-wip

3cd892b

reintro worker queue

cf80876

alxndrsn commented Aug 29, 2024

lib/external/s3.js

alxndrsn commented Aug 29, 2024

lib/external/s3.js

alxndrsn added 8 commits August 29, 2024 07:11

external/s3: move require()s to top

cc3805f

migration: drop extension on unmigrate

c068356

Update comment

cd44b5e

re-order vars to make comparison of fns easier

6a7a9e0

revert whitespace change

aeaaf14

clarify test explanation

221f081

comment

d03a98f

Add option: objectPrefix

116d48b

ktuite mentioned this pull request Sep 3, 2024

Followup S3 blob work getodk/central#700

Closed

5 tasks

Update TODOs

e35646a

alxndrsn mentioned this pull request Sep 3, 2024

Support S3 blob storage getodk/central#701

Merged

alxndrsn added 5 commits September 3, 2024 17:58

Update TODOs

d569f2c

try include stack in Error

b8aae1c

remove TODO file

f9e3b69

Makefile: revert changes

f2b732d

Merge branch 'master' into s3-blob-storage-wip

7c91aba

ktuite approved these changes Sep 13, 2024

alxndrsn added 4 commits September 13, 2024 08:41

Remove unnecessary linter exception

e391c4d

Reswet package-lock

d9bb2db

update migration name

94bc690

remove down migration

ce22192

alxndrsn merged commit 3ae8a69 into getodk:master Sep 13, 2024
1 check passed

alxndrsn deleted the s3-blob-storage-wip branch October 30, 2024 11:43
