Results of storing OCFL objects in S3 and walking the list of object keys #526

Closed
marcolarosa opened this issue Feb 20, 2021 · 5 comments

marcolarosa commented Feb 20, 2021

I'm putting this here not as an issue but perhaps to inform the spec in relation to S3 as a backend. I hope it's useful. In other tickets (#522, CoEDL/ocfl-js#3) I've discussed some design ideas and library implementation details. This is about testing one specific implementation and how it performs.

For this test I'm using the AWS v3 JS SDK against an S3 implementation that I believe is backed by Ceph (so not an actual AWS bucket). I will test a real AWS bucket in due course and report any differences here.

Question

What does it take to walk an OCFL storage root in an S3 bucket to find the objects it contains? For example, when indexing an OCFL resource without knowing anything about it in advance, how do we locate each object, and is the flat nature of S3 going to be problematic at scale?

Test description

For this test I've created 150 OCFL objects in a bucket, where each object has one version and 10 files. The object ids are hashes and each file is empty, because the point of this test is to walk the tree, not interact with any real data.

An object looks like:


- 0068d6f097e9f491831ad8c367f48e4d9f0e7852
  - 0=ocfl_object_1.0
  - inventory.json
  - inventory.json.sha512
  - v1
    - inventory.json
    - inventory.json.sha512 
    - content
      - 10 x files

Though in reality it's more like:

0068d6f097e9f491831ad8c367f48e4d9f0e7852/0=ocfl_object_1.0
0068d6f097e9f491831ad8c367f48e4d9f0e7852/inventory.json
0068d6f097e9f491831ad8c367f48e4d9f0e7852/inventory.json.sha512
etc
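
A minimal sketch of how these empty test objects could be written with the v3 SDK (the bucket name, region, and content file names here are placeholders, not taken from the actual test code):

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const client = new S3Client({ region: "ap-southeast-2" });

// Write the keys for one single-version test object; every body is empty
// because the test only exercises key listing, not data transfer.
async function writeTestObject(bucket, objectId) {
  const keys = [
    `${objectId}/0=ocfl_object_1.0`,
    `${objectId}/inventory.json`,
    `${objectId}/inventory.json.sha512`,
    `${objectId}/v1/inventory.json`,
    `${objectId}/v1/inventory.json.sha512`,
    ...Array.from({ length: 10 }, (_, i) => `${objectId}/v1/content/file-${i}.txt`),
  ];
  for (const key of keys) {
    await client.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: "" }));
  }
}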

Results

  • It takes about 0.24 seconds to locate the 150 objects by listing lookup keys prefixed with OCFL_oid_
  • It takes about 1.4 seconds to locate the 150 objects by walking all of the keys and finding those that match /.*\/0=ocfl_object_.*$/

Explanation

AWS docs state that "The high availability engineering of Amazon S3 is focused on get, put, list, and delete operations." The command for listing keys in a bucket is ListObjectsV2Command (https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-s3/classes/listobjectsv2command.html). By default, the maximum number of keys returned per request is 1000; this can be set lower but not higher.

One of the parameters of ListObjectsV2Command is Prefix, which restricts the listing to keys that start with a given string. Each response includes a continuation token you can pass to the next request, so you can iterate over all of the keys in a bucket. Note that Prefix doesn't support a regex. So, in the absence of any knowledge about the objects, the only thing you can do is repeatedly pull up to 1000 keys, find those that match /.*\/0=ocfl_object_.*$/, and use each match to determine the object id and figure out where the object inventory is.
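
As a rough sketch, that walk looks something like this with the v3 SDK (the bucket name and client configuration are placeholders):

import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

const client = new S3Client({ region: "ap-southeast-2" });

// Page through every key in the bucket and collect the paths that look like
// OCFL object roots, i.e. keys ending in /0=ocfl_object_*
async function findObjectsByWalking(bucket) {
  const objectRoots = [];
  let continuationToken;
  do {
    const response = await client.send(
      new ListObjectsV2Command({
        Bucket: bucket,
        MaxKeys: 1000, // the default, and also the maximum
        ContinuationToken: continuationToken,
      })
    );
    for (const { Key } of response.Contents ?? []) {
      if (/.*\/0=ocfl_object_.*$/.test(Key)) {
        // strip the NAMASTE file name to get the object root path
        objectRoots.push(Key.replace(/\/0=ocfl_object_.*$/, ""));
      }
    }
    continuationToken = response.NextContinuationToken;
  } while (continuationToken);
  return objectRoots;
}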

If I set the command to return a maximum of 150 keys and run this against the bucket, it takes about 1.37 seconds to locate the 150 objects.

Alternatively, since I can set the prefix for the lookup, if I also create a sort of index key named OCFL_oid_${object id}, I can do a list command with prefix = OCFL_oid_. Then, rather than walking every single S3 key in the bucket to find the OCFL objects, I can query just the lookup keys. As you might expect, this is significantly faster: with maxKeys = 150 it takes about 0.238 seconds to locate the 150 objects.
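
A sketch of that lookup-key variant, assuming hypothetical OCFL_oid_${object id} keys have been written at the bucket root alongside the objects:

import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

const client = new S3Client({ region: "ap-southeast-2" });

// List only the index keys by prefix and recover the object ids from them.
async function findObjectsByIndexKeys(bucket) {
  const objectIds = [];
  let continuationToken;
  do {
    const response = await client.send(
      new ListObjectsV2Command({
        Bucket: bucket,
        Prefix: "OCFL_oid_", // only the lookup keys are returned
        ContinuationToken: continuationToken,
      })
    );
    for (const { Key } of response.Contents ?? []) {
      objectIds.push(Key.slice("OCFL_oid_".length));
    }
    continuationToken = response.NextContinuationToken;
  } while (continuationToken);
  return objectIds;
}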

Clearly this becomes more and more problematic as the number of files in an S3 bucket increases, because without a lookup of some kind the only way to find an object is to walk every single key looking for the object handle file. Conversely, having a lookup key means there is something else to keep in sync, and it goes against the spec in its current form as well.

Results: 200 objects

Without changing anything except increasing the number of objects to 200:

  • index key lookup: 0.3 seconds
  • looking for object handle: 1.6 seconds

Results: 250 objects

Without changing anything except increasing the number of objects to 250:

  • index key lookup: 0.31 seconds
  • looking for object handle: 2.2 seconds

Final thought

So - this isn't a solution; rather, I just want to raise it as a potential issue for when an OCFL resource lives in an object store like S3. In practice it isn't a problem for small OCFL resources, but given that OCFL scales easily (we have 70TB in it so far and haven't noticed any issues on a filesystem), it could easily become one.

@pwinckles

May I ask in what circumstances you crawl your repository to discover objects and how frequently you need to do this?

@marcolarosa
Author

@pwinckles That's a great question and I should have put that in as part of the ticket - apologies! We crawl the repo to build a search index from the metadata that's stored as part of each object (the metadata about the data, not the OCFL metadata). In reality, not too often - really only if the search index is empty or somehow corrupted and needs to be rebuilt from scratch.

The other point to consider is that the difference in time required to page through every item would probably be swamped by the time required to process any given item. So in reality, I'm not sure how much of an issue this will be but after ruminating on it for a while my gut feeling is that it probably won't be much of an issue at all.

@marcolarosa
Author

I should add that I don't like the key file idea as it's a technical hack around a limitation imposed by the AWS API / SDK. Yet another thing to keep up to date...

@zimeon
Contributor

zimeon commented Feb 23, 2021

Hi @marcolarosa, thanks for posting this. I think it will be interesting to see how your results scale to large stores in this single-threaded case, and also what you see over a real AWS bucket.

@neilsjefferies
Member

Merged into #372
