Add option to start listing at a particular key #3970

Closed
wjones127 opened this issue Mar 28, 2023 · 5 comments · Fixed by #3973
Labels
  • enhancement: Any new improvement worthy of a entry in the changelog
  • object-store: Object Store Interface

Comments

@wjones127
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In an object store, we might have a bunch of sequential files being written:

0000001.json
0000002.json
...
0001000.json

We'd like to be able to query for all the "new" files starting at a certain point, skipping all the earlier files.

S3 has a start-after parameter we can use for this. https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_RequestParameters
TBD on other systems.
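
To illustrate what "starting at a certain point" means for the zero-padded names above, here is a minimal sketch that simulates an exclusive lower bound over a sorted key set. It is not S3 client code: `keys_after` is a hypothetical helper, the `BTreeSet` stands in for the store's lexicographically ordered listing, and the exclusive behavior is an assumption about how start-after works.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Hypothetical helper: returns the keys that sort strictly after
/// `start_after`, mimicking an exclusive lower bound on a listing that is
/// ordered lexicographically (assumed here, as a stand-in for the store).
fn keys_after<'a>(keys: &'a BTreeSet<String>, start_after: &str) -> Vec<&'a str> {
    keys.range::<str, _>((Bound::Excluded(start_after), Bound::Unbounded))
        .map(String::as_str)
        .collect()
}
```

This only skips the right files because the names are zero-padded: as plain strings, "10.json" sorts before "9.json", so unpadded sequence numbers would not line up with lexicographic order.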

Describe the solution you'd like

Not sure the best way to add the parameter. Does it belong in a new method? Should we introduce a more complex "ListCallBuilder" API?

let list_stream = object_store
     .build_list_call(Some(&prefix))
     .with_start_key("0000999.json")
     .await
     .expect("Error listing files");
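
For discussion, a builder along those lines might look roughly like the sketch below. Everything here is hypothetical: `ListCallBuilder`, `new`, and `run` are illustrative names, the listing runs synchronously against an in-memory slice rather than a real store, and the start key is treated as a full key with an exclusive bound, which is an assumption.

```rust
/// Hypothetical builder, named after the `build_list_call` idea above.
#[derive(Debug, Default, Clone)]
struct ListCallBuilder {
    prefix: Option<String>,
    start_key: Option<String>,
}

impl ListCallBuilder {
    /// Starts a listing, optionally restricted to keys under `prefix`.
    fn new(prefix: Option<&str>) -> Self {
        Self {
            prefix: prefix.map(str::to_owned),
            start_key: None,
        }
    }

    /// Only return keys that sort strictly after `key`
    /// (exclusive bound on the full key; an assumption of this sketch).
    fn with_start_key(mut self, key: &str) -> Self {
        self.start_key = Some(key.to_owned());
        self
    }

    /// Runs the listing against an in-memory slice standing in for a store,
    /// returning matching keys in lexicographic order.
    fn run<'a>(&self, keys: &'a [&'a str]) -> Vec<&'a str> {
        let mut out: Vec<&str> = keys
            .iter()
            .copied()
            .filter(|k| self.prefix.as_deref().map_or(true, |p| k.starts_with(p)))
            .filter(|k| self.start_key.as_deref().map_or(true, |s| *k > s))
            .collect();
        out.sort_unstable();
        out
    }
}
```

The appeal of a builder is that further options could be added later without changing existing call sites; the cost is a larger API surface than a single extra method.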

Describe alternatives you've considered

Not sure if there is an easier way to do it.

Additional context

@wjones127 wjones127 added the enhancement Any new improvement worthy of a entry in the changelog label Mar 28, 2023
@tustvold
Contributor

tustvold commented Mar 28, 2023

Seems like a useful feature. I'll have a think about what an API for this could look like. I wonder if we should just add a list_opts method that acts as a superset of list_with_delimiter and list 🤔, perhaps similar to the proposal in #2241.

What is support for this like in other object stores? I presume they support it, but I've learnt never to assume anything when it comes to object stores 😅

@wjones127
Member Author

It looks like support isn't that wide:

  • S3: has start-after (exclusive?)
  • GCS: has startOffset (inclusive)
  • Azure Blob Storage: not supported
  • Local filesystems: I don't see any obvious API

So this is mostly providing a useful optimization for S3 and GCS. There can be a default implementation that just throws out earlier entries. Also, for consistency between S3 and GCS, we would have to make the lower bound exclusive, since that seems to be the S3 behavior.
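
As a rough illustration of both points, the sketch below shows a fallback that lists everything and discards entries at or before the offset, plus a small adapter that turns an inclusive result into an exclusive one. It uses only the standard library; `filter_after_offset` and `make_exclusive` are hypothetical names, keys are plain strings, and this is not how the object_store crate implements it.

```rust
/// Fallback for stores with no native offset support: take the full listing
/// and drop entries whose key is not strictly greater than `offset`.
/// Errors are passed through untouched so the caller still sees them.
fn filter_after_offset<'a, I, E>(
    entries: I,
    offset: &'a str,
) -> impl Iterator<Item = Result<String, E>> + 'a
where
    I: Iterator<Item = Result<String, E>> + 'a,
    E: 'a,
{
    entries.filter(move |entry| match entry {
        Ok(key) => key.as_str() > offset,
        Err(_) => true,
    })
}

/// An inclusive lower bound (as GCS startOffset is described above) may
/// return the offset key itself; dropping that one key makes it exclusive.
fn make_exclusive(inclusive: Vec<String>, offset: &str) -> Vec<String> {
    inclusive
        .into_iter()
        .filter(|key| key.as_str() != offset)
        .collect()
}
```

The fallback still pays for a full listing, so it preserves behavior across stores but not the performance benefit; only stores with a native parameter avoid transferring the earlier entries.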

@rtyler
Contributor

rtyler commented Mar 28, 2023

Speaking selfishly, supporting S3-based optimizations goes a long way given its dominance in the market. Most data workloads I see are on AWS or GCP, so it's great that you found a compatible API in GCS @wjones127

@tustvold
Contributor

I've created #3973. If we like the interface, I can flesh it out.

@JHibbard

The Azure Data Lake Storage Gen2 REST API has endpoints for filesystem list and path list that look interesting, but the documentation is vague. ADLS-G2 is hierarchical in nature... so some offset/skipping API is likely available somewhere.

filesystem list:

  • prefix: Filters results to filesystems within the specified prefix.

path list:

  • directory: Filters results to paths within the specified directory. An error occurs if the directory does not exist.

tustvold added a commit that referenced this issue Mar 30, 2023
* Stub out ObjectStore::list_with_offset (#3970)

* Add tests and add AWS implementation

* Update localstack

* Add further implementations
@tustvold tustvold added the object-store Object Store Interface label Mar 30, 2023
@tustvold tustvold changed the title [object_store] Add option to start listing at a particular key Add option to start listing at a particular key Mar 30, 2023
wjones127 pushed a commit to delta-io/delta-rs that referenced this issue May 30, 2023
# Description
Adds the `list_with_offset` delegation method to `DeltaObjectStore`.

# Related Issue(s)
- closes #1252 

# Documentation

apache/arrow-rs#3970

Signed-off-by: Shingo OKAWA <shingo.okawa.g.h.c@gmail.com>
roeap pushed a commit to roeap/delta-rs that referenced this issue Jun 2, 2023