Results of storing OCFL objects in S3 and walking the list of object keys #526
Comments
May I ask in what circumstances you crawl your repository to discover objects and how frequently you need to do this?
@pwinckles That's a great question and I should have put that in as part of the ticket - apologies! We crawl the repo to build a search index from the metadata that's stored as part of each object (the metadata about the data not the OCFL metadata). In reality, not too often - really only if the search index is empty or somehow corrupted and needs to be rebuilt from scratch. The other point to consider is that the difference in time required to page through every item would probably be swamped by the time required to process any given item. So in reality, I'm not sure how much of an issue this will be but after ruminating on it for a while my gut feeling is that it probably won't be much of an issue at all.
I should add that I don't like the key file idea as it's a technical hack around a limitation imposed by the AWS API / SDK. Yet another thing to keep up to date...
Hi @marcolarosa, thanks for posting this. I think it will be interesting to see how your results scale to large stores in this single-threaded case, and also what you see over a real AWS bucket.
Merged into #372
I'm putting this here not as an issue but perhaps to inform the spec in relation to S3 as a backend. I hope it's useful. In other tickets (#522, CoEDL/ocfl-js#3) I've discussed some design ideas and library implementation details. This is about testing one specific implementation and how it performs.
For this test I'm using the AWS v3 JS SDK against an S3-compatible implementation that I think comes from Ceph (so not actually an AWS bucket). I will test a real AWS bucket in due course and report any differences here.
Question
What is it like to walk an OCFL implementation to find the objects living inside an S3 bucket? For example, when indexing an OCFL resource, without knowing anything about it, how do we locate each object and is the flat nature of S3 going to be problematic at scale?
Test description
For this test I've created 150 OCFL objects in a bucket where each object has one version and 10 files. The object ids are a hash and each file is empty because the point of this test is to walk the tree, not interact with any real data.
An object looks like:
Though in reality it's more like:
Results
OCFL_oid
/.*\/0=ocfl_object_.*$/
Explanation
AWS docs state: "The high availability engineering of Amazon S3 is focused on get, put, list, and delete operations".
The command for listing keys in a bucket is ListObjectsV2Command (https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-s3/classes/listobjectsv2command.html). By default, the maximum number of keys returned is 1000, though this can be set lower (but not higher).
One of the params of ListObjectsV2Command is Prefix; that is, you can set the prefix for keys to retrieve from the bucket. You then get a continuation token to fetch the next set of keys, so you can iterate over all of the keys in a bucket. Note that the prefix doesn't support a regex. So, in the absence of any knowledge about the objects, the only thing you can do is repeatedly pull 1000 keys, find those that match /.*\/0=ocfl_object_.*$/, and then use each match to determine the object id and figure out where the object inventory is.
If I set the command to return a maximum of 150 keys and run this against the bucket, it takes about 1.37 secs to locate the 150 objects.
Alternately, since I can set the prefix for the lookup, if I also create a sort of index key named OCFL_oid_${object id}, I can do a list command with prefix = OCFL_oid_. Then, rather than walking every single S3 key in the bucket to find the OCFL objects, I can query just the lookup keys. As you might expect, this is significantly faster. With maxKeys = 150 it takes about 0.238 secs to locate the 150 objects.
Clearly the full walk becomes more and more problematic as the number of files in an S3 bucket increases because, without a lookup of some kind, the only way to find an object is to walk every single key looking for the object handle file. Conversely, having a lookup key means there is something else to keep in sync, and it goes against the spec in its current form as well.
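For illustration, here is a minimal sketch of the two strategies compared above. It is not the code used for the timings. It assumes a hypothetical listPage function that accepts { Prefix, ContinuationToken } and resolves to an object shaped like ListObjectsV2 output (Contents, IsTruncated, NextContinuationToken). The function names are made up for this example, and the inventory is assumed to sit at inventory.json in the object root.

```javascript
// Marker regex and lookup-key prefix as described in this issue.
const OBJECT_MARKER = /.*\/0=ocfl_object_.*$/;
const LOOKUP_PREFIX = "OCFL_oid_";

// Yield every key returned by `listPage`, following continuation tokens.
// `listPage` is an assumed wrapper around a ListObjectsV2 call; it is
// injected so the paging logic does not depend on a live bucket.
async function* walkKeys(listPage, prefix) {
  let token;
  do {
    const page = await listPage({ Prefix: prefix, ContinuationToken: token });
    for (const { Key } of page.Contents ?? []) yield Key;
    token = page.IsTruncated ? page.NextContinuationToken : undefined;
  } while (token);
}

// Strategy 1: walk every key in the bucket and keep the object marker files.
// The object root is taken to be everything before the marker file name.
async function findObjectsByScan(listPage) {
  const found = [];
  for await (const key of walkKeys(listPage)) {
    if (OBJECT_MARKER.test(key)) {
      const root = key.slice(0, key.lastIndexOf("/"));
      found.push({ root, inventoryKey: `${root}/inventory.json` });
    }
  }
  return found;
}

// Strategy 2: list only the lookup keys and strip the prefix to get the ids.
async function findObjectIdsByLookup(listPage) {
  const ids = [];
  for await (const key of walkKeys(listPage, LOOKUP_PREFIX)) {
    ids.push(key.slice(LOOKUP_PREFIX.length));
  }
  return ids;
}
```

With the v3 SDK, listPage could be wired up roughly as (params) => client.send(new ListObjectsV2Command({ Bucket: "my-bucket", MaxKeys: 1000, ...params })), where client is an S3Client and the bucket name is a placeholder. The scan has to fetch every key in the bucket, while the lookup only pages through keys that start with OCFL_oid_, which is consistent with the timing difference reported above.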
Results 200 objects
Without changing anything except increasing the number of objects to 200:
Results 250 objects
Without changing anything except increasing the number of objects to 250:
Final thought
So - this isn't a solution; rather I just want to raise it as a potential issue for when an OCFL resource lives in an object store like S3. In reality, this isn't an issue for small OCFL resources but given that OCFL scales easily (we have 70TB in it so far and we haven't noticed any issues with it on a filesystem) this could easily become an issue.