Support S3-compatible blob storage #1071
Conversation
I have a lot to look at still, but I want to summarize the shape I see:
Blob touchpoints (with a little bit of code for each case because the generic blob table is joined with other types of tables):
- submission attachments
- encrypted submissions
- form xls files
- client audits (subset of submission attachments) which are processed by a worker
- that briefcase file
- The `blobResponse` function that serves blobs, modified so it can serve from S3 when the content is not in the database.
- `s3_status` column added to the `blobs` table (`pending`, `in_progress`, `uploaded`, `failed`).
- A simple worker defined in `util/s3.js` and started in `runServer.js`. This is where the blob uploading happens! The worker doesn't retry any failures; that is handled by an external CLI tool.
- An `exhaustBlobs` function for tests and for the CLI script to use.
Am I missing anything big?
I haven't yet looked at:
- the details of the worker
- the different blob scenarios/touchpoints
- tests (changes to existing tests, or e2e tests)
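To keep the shape straight in my head, here is a tiny sketch of the serving decision as I understand it. This is my own illustration, not code from the PR: `serveBlobPlan` is a hypothetical name, and the real `blobResponse` will involve streams and redirects rather than a plain object.

```js
// Sketch only: decide how a blob should be served, based on the new
// s3_status column (pending | in_progress | uploaded | failed).
const serveBlobPlan = (blob) => {
  if (blob.s3_status === 'uploaded') {
    // Content has been offloaded to S3; redirect the client there.
    return { action: 'redirect', status: 307 };
  }
  // Content is still in the database; stream it directly.
  return { action: 'stream', status: 200 };
};
```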
lib/util/http.js (outdated)

```js
response.set('ETag', `"${serverEtag}"`);
const withEtag = (serverEtag, fn, always=true) => (request, response) => {
  if (always) response.set('ETag', `"${serverEtag}"`);
```
What's this `always` flag about? I see it's false for the S3 blobs. I'm just looking for a mini explanation/reminder about this ETag stuff.
I've tried to clarify by:
- adding a comment about `always`, and
- splitting the exported `withEtag()` function to add an additional `withEtagOnMatch()`

Does this help? Perhaps `withEtagOrRedirect()` would be a clearer name for the new function?
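To make the split concrete, here is a rough sketch of how the two helpers might differ. This is my reading of the discussion, not the PR's actual implementation; the real signatures and response handling may well differ:

```js
// Sketch: always advertise the server's ETag, and short-circuit with a
// 304 when the client already holds a matching copy.
const withEtag = (serverEtag, fn) => (request, response) => {
  response.set('ETag', `"${serverEtag}"`);
  if (request.get('If-None-Match') === `"${serverEtag}"`) {
    return response.status(304).end();
  }
  return fn(request, response);
};

// Sketch: only set the ETag (and return 304) when the client's copy
// matches; otherwise fall through, e.g. to a 307 redirect to S3.
const withEtagOnMatch = (serverEtag, fn) => (request, response) => {
  if (request.get('If-None-Match') === `"${serverEtag}"`) {
    response.set('ETag', `"${serverEtag}"`);
    return response.status(304).end();
  }
  return fn(request, response);
};
```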
I mostly follow, but now I'm confused by the `!always` line below. I read that a 304 Not Modified response should always send back the ETag header anyway.
But also, in the S3 redirect case `always` will be false, so:
- don't set the ETag
- check the client's ETag
- if it matches, DO set the ETag and return 304 Not Modified

Does S3 make its own ETag? Will these ever match, and would we ever return a 304 in place of a 307?
> Does S3 make its own ETag? Will these ever match and would we ever return a 304 in place of a 307?
Great question. If a client sees an odk-central URL for an s3-backed blob, odk-central needs to ensure that the final ETag provided to the user is the same regardless of the binary data source.
This means either both systems need to generate the same ETag for the same data, or odk-central needs to substitute its ETag in place of S3's.
Currently:
- odk-central's nginx should be transparently following 307 redirects for s3-backed content
- both S3 and odk-central-backend use MD5(blob-content) as a strong ETag
This means that the current implementation should work fine. As there are no end-to-end central tests running both odk-central-backend and nginx, there's a risk this could break in future.
> What's this `always` flag about?

I've reversed this flag, renaming it to `onlyOnMatch`, and updated the comments. Is the intention clearer now?
I'm going through this bit by bit (while bigger discussions also take place).
I think I now have a good understanding of the part of the system that uses blobs that may or may not be on S3.
Next up, I'll probably take a closer look at s3.js and how the blobs actually get to and from S3.
lib/data/attachments.js (outdated)

```js
s3.getContentFor(att)
  .then(appendContent)
  .catch(err => { this.destroy(err); });
```
I see what you mean about this query: would it be preferable to make `blob.content` async everywhere, moving the S3 check/fetch inside the getter?
- pro: neater code, less likely to have bugs where `blob.content` is used when the data is not available
- con: seems intrusive to frames

(Just highlighting this one because it seemed like the smallest example.)
Maybe it's okay to leave it like this, since there are only about 4 places blobs are used (unless you have an idea for some clever alternative):
- exporting attachments (here)
- exporting encrypted submissions that were stored as blobs
- processing client audit attachments
- the HTTP blob response (different mechanism: `urlForBlob` instead of `getContentFor` like the ones above)

Each of these 4 paths should be tested, since each might have a different behavior if there is an error fetching from S3.
Some errors I'm thinking of:
- auth problem
- data doesn't exist where it should
- response from S3 is taking too long
- data is broken in some other way?
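For the "taking too long" case in particular, one common pattern is to race the fetch against a timer so a stalled S3 response fails the request instead of hanging the stream. A sketch only (`withTimeout` is a hypothetical helper, not something from this PR):

```js
// Sketch: reject a hypothetical S3 fetch if it takes longer than `ms`.
const withTimeout = (promise, ms) => {
  let timer;
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};
```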
Comments from review today (more review tomorrow!)
```js
let destroyed = false;
```

```js
const inflight = new Set();
```
- Could use `WeakRef`s in this set if there is concern about memory leaks
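For illustration, the suggestion might look roughly like this: hold each in-flight upload via a `WeakRef` so the set never keeps an abandoned upload alive, and prune entries whose referent has been garbage-collected. This is a sketch of the idea only; `track` and `liveUploads` are hypothetical names:

```js
// Sketch: track in-flight uploads weakly so the set cannot leak them.
const inflight = new Set();

const track = (upload) => {
  const ref = new WeakRef(upload);
  inflight.add(ref);
  return ref;
};

const liveUploads = () => {
  const live = [];
  for (const ref of inflight) {
    const upload = ref.deref();
    if (upload === undefined) inflight.delete(ref); // collected; prune
    else live.push(upload);
  }
  return live;
};
```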
Left a couple small comments about the database migration and package lock, but otherwise it looks good to go!!!
Closes getodk/central#585

Adds:
- `s3_status` column on the `blobs` table

Changes:
- `blob.content` uses must now be accompanied by a conditional fetch of the data from S3

Queries:
- Would it be preferable to make `blob.content` async everywhere, moving the S3 check/fetch inside the getter? This would make it less likely to have bugs where `blob.content` is used when the data is not available.