Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support S3-compatible blob storage #1071

Merged
merged 760 commits into from
Sep 13, 2024
Merged

Conversation

alxndrsn
Copy link
Contributor

@alxndrsn alxndrsn commented Jan 13, 2024

Closes getodk/central#585

Adds:

  • s3_status column on blobs table
  • CLI tool for:
    • uploading blobs to s3
    • resetting failed blobs -> pending

Changes:

  • blob.content uses must now be accompanied by a conditional fetch of the data from s3

Queries:

  • would it be preferable to make blob.content async everywhere, and moving the s3 check/fetch inside the getter?
    • pro: neater code, less likely to have bugs where blob.content used when data is not available
    • con: seems intrusive of frames
  • should "purging" forms in central also delete content from S3? yes

@alxndrsn

This comment was marked as resolved.

@alxndrsn

This comment was marked as outdated.

@alxndrsn alxndrsn marked this pull request as ready for review February 19, 2024 16:51
@alxndrsn alxndrsn changed the title wip: allow blob storage in s3 Support blob storage in s3 Feb 19, 2024
@alxndrsn alxndrsn requested a review from ktuite February 19, 2024 16:56
Copy link
Member

@ktuite ktuite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a lot to look at still, but I want to summarize the shape I see:

Blob touchpoints (with a little bit of code for each case because the generic blob table is joined with other types of tables):

  • submission attachments
  • encrypted submissions
  • form xls files
  • client audits (subset of submission attachments) which are processed by a worker
  • that briefcase file

The function blobResponse that serves blobs modified to be able to serve from s3 if not in the database.

s3_status added to blobs table (pending, in_progress, uploaded, failed)

Simple worker defined in util/s3.js that is started in runServer.js
This is where the blob uploading happens!
Worker doesn't retry any failures, that is handled by an external cli tool.

exhaustBlobs function for tests and for cli script to use.

Am I missing anything big?

I haven't yet looked at:

  • the details of the worker
  • the different blob scenarios/touchpoints
  • tests (changes to existing tests, or e2e tests)

lib/bin/s3.js Outdated Show resolved Hide resolved
lib/bin/s3-create-bucket.js Show resolved Hide resolved
lib/bin/s3.js Outdated Show resolved Hide resolved
lib/model/migrations/20231025-01-add-blob-s3.js Outdated Show resolved Hide resolved
lib/util/http.js Outdated

response.set('ETag', `"${serverEtag}"`);
const withEtag = (serverEtag, fn, always=true) => (request, response) => {
if (always) response.set('ETag', `"${serverEtag}"`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this always flag about? I see it's false for the s3 blobs. I'm just looking for a mini explanation/reminder about this etag stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to clarify by:

  1. adding a comment about always, and
  2. splitting the exported withEtag() function to add an additional withEtagOnMatch()

Does this help? Perhaps withEtagOrRedirect() would be a clearer name for the new function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly follow but now I'm confused by the !always line below. I saw online that a 304 not modified request should always be sending back the etag in the header anyway.

But also in the s3 redirect case, always will be false, so:

  • dont set etag
  • check client etag
  • if it matches, DO set etag and return 304 not modified

Does s3 make its own etag? Will these ever match and would we ever return a 304 in place of a 307?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does s3 make its own etag? Will these ever match and would we ever return a 304 in place of a 307?

Great question. If a client sees an odk-central URL for an s3-backed blob, odk-central needs to ensure that the final ETag provided to the user is the same regardless of the binary data source.

This means either both systems need to generate the same ETag for the same data, or odk-central needs to substitute its ETag in place of S3's.

Currently:

  • odk-central's nginx should be transparently following 307 redirects for s3-backed content
  • both S3 and odk-central-backend use MD5(blob-content) as a strong ETag

This means that the current implementation should work fine. As there are no end-to-end central tests running both odk-central-backend and nginx, there's a risk this could break in future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this always flag about?

I've reversed this flag, renaming it as onlyOnMatch, and updated the comments. Is the intention clearer now?

@lognaturel lognaturel changed the title Support blob storage in s3 Make it possible to configure S3-compatible blob storage Mar 1, 2024
Copy link
Member

@ktuite ktuite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going through this bit by bit (while a bigger discussions also take place).

I think I now have a good understanding of the part of the system that uses blobs that may or may not be on S3.

Next up, I'll probably take a closer look s3.js and how the blobs actually get to and from s3.

Comment on lines 49 to 51
s3.getContentFor(att)
.then(appendContent)
.catch(err => { this.destroy(err); });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean about this query:

would it be preferable to make blob.content async everywhere, and moving the s3 check/fetch inside the getter?

  • pro: neater code, less likely to have bugs where blob.content used when data is not available
  • con: seems intrusive of frames

(Just highlighting this one because it seemed like the smallest example.)

Maybe it's okay to leave it like this since there are only about 4 places blobs are used (unless you have an idea for some clever alternative):

  1. exporting attachments (here)
  2. exporting encrypting submissions that were stored as blobs
  3. processing client audit attachments
  4. (different mechanism, urlForBlob instead of getContentFor line the ones above) the http blob response

As long as each of these 4 paths is tested, since each might have a different behavior if there is an error fetching from S3.

Some errors i'm thinking of:

  • auth problem
  • data doesn't exist where it should
  • response from s3 is taking too long
  • data is broken in some other way?

@alxndrsn alxndrsn changed the title Make it possible to configure S3-compatible blob storage Support S3-compatible blob storage May 14, 2024
Makefile Outdated Show resolved Hide resolved
Copy link
Member

@ktuite ktuite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments from review today (more review tomorrow!)

test/e2e/s3/ci Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
lib/model/migrations/20240311-01-add-blob-s3.js Outdated Show resolved Hide resolved
test/unit/data/briefcase.js Outdated Show resolved Hide resolved
test/unit/data/briefcase.js Outdated Show resolved Hide resolved
test/util/s3.js Outdated Show resolved Hide resolved
ktuite

This comment was marked as resolved.

TODO.md Outdated Show resolved Hide resolved

let destroyed = false;

const inflight = new Set();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Could use weakRefs in this set if there is concern about memory leaks

lib/external/s3.js Outdated Show resolved Hide resolved
lib/external/s3.js Outdated Show resolved Hide resolved
@ktuite ktuite mentioned this pull request Sep 3, 2024
5 tasks
Copy link
Member

@ktuite ktuite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple small comments about the database migration and package lock, but otherwise it looks good to go!!!

lib/model/migrations/20240619-01-add-blob-s3.js Outdated Show resolved Hide resolved
lib/model/migrations/20240619-01-add-blob-s3.js Outdated Show resolved Hide resolved
lib/model/migrations/20240619-01-add-blob-s3.js Outdated Show resolved Hide resolved
package-lock.json Show resolved Hide resolved
@alxndrsn alxndrsn merged commit 3ae8a69 into getodk:master Sep 13, 2024
1 check passed
@alxndrsn alxndrsn deleted the s3-blob-storage-wip branch October 30, 2024 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Store blobs outside database
2 participants