Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure we return a sorted listing when using a local client #344

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

nicklan
Copy link
Collaborator

@nicklan nicklan commented Sep 19, 2024

This ensures that calls to list_files are returned sorted if a LocalFileSystem ObjectStore is used.

It's not clean for a few reasons:

  1. We have to materialize the entire iter in order to sort it
  2. We can't easily detect if we're local just by using type_of or similar. We have a foreign trait object that doesn't have an as_any on it, so we can force the reference into an Any which would allow us to use is, and we can't add an implementation of something like as_any because dyn ObjectStore isn't Sized. So this resorts to formating the object at creation (since ObjectStore requires Display) and checking if it starts with LocalFileSystem....

Adds a test that when using local client things come back sorted now. Without these changes the test failed.

I think we should merge (or something like it) and then also see about adding a list_sorted and list_with_offset_sorted to ObjectStore

@nicklan nicklan marked this pull request as draft September 19, 2024 01:10
@@ -627,10 +627,6 @@ mod tests {

#[test]
fn test_scan_data() {
use tracing_subscriber::EnvFilter;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was left over that shouldn't be here, and was causing test failures because you shouldn't init in a test like this

Copy link

codecov bot commented Sep 19, 2024

Codecov Report

Attention: Patch coverage is 92.10526% with 6 lines in your changes missing coverage. Please review.

Project coverage is 77.75%. Comparing base (6b0df5d) to head (765b33c).

Files with missing lines Patch % Lines
kernel/src/engine/default/filesystem.rs 91.83% 0 Missing and 4 partials ⚠️
kernel/src/snapshot.rs 87.50% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #344      +/-   ##
==========================================
+ Coverage   77.68%   77.75%   +0.06%     
==========================================
  Files          49       50       +1     
  Lines       10084    10141      +57     
  Branches    10084    10141      +57     
==========================================
+ Hits         7834     7885      +51     
- Misses       1805     1807       +2     
- Partials      445      449       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code seems fine, if we decide that ordered listing is the way to go. Given that S3 directory buckets don't support ordered listing, I wonder if we need to seriously consider hardening kernel against disorder.

HOWEVER -- everything I can think of would require materializing the listing result, and if we do that unconditionally we penalize stores that do provide ordered listing. Also, it's probably simpler overall to ensure the listing is ordered up front, so the rest of kernel doesn't have to worry about it?

kernel/src/engine/default/filesystem.rs Outdated Show resolved Hide resolved
kernel/src/engine/default/filesystem.rs Show resolved Hide resolved
@nicklan nicklan marked this pull request as ready for review September 19, 2024 21:16
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@nicklan nicklan added the merge hold Don't allow the PR to merge label Sep 19, 2024
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm just added a question on expanding this to all non-s3-normal-bucket object_store implementations

@@ -72,7 +80,14 @@ impl<E: TaskExecutor> FileSystemClient for ObjectStoreFileSystemClient<E> {
}
});

Ok(Box::new(receiver.into_iter()))
if self.is_local {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like we might need to just detect if it "isn't S3 non-directory bucket" and then keep expanding from there? Seems safer to do a bit of extra sorting for now and just remove it as we discover different object_stores can do sorting? (or better yet we do some upstream fixes to expose a new API to communicate sorting)

Copy link

@hackintoshrao hackintoshrao Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be huge for building analytical engines, as it will leave them in a dilemma about whether to trust the iterators to return the sorted entries. Do you know if any conversations are going on upstream about this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! I added some context here #344 (comment). But I think we want to guarantee ordered from our listing.

@hackintoshrao
Copy link

hackintoshrao commented Oct 5, 2024

@scovich: Should I base the PR #322 on these changes? Are we ignoring the directory buckets listing, unsorted in S3 for now?
Can we assume that the listing iterators return sorted entries and that any need for materialization required for non-sorted stores will be handled the same way as in this PR?

@scovich
Copy link
Collaborator

scovich commented Oct 5, 2024

@nicklan -- what is left for this to be able to merge? Are we concerned about the approach? Worried it's somehow incorrect? Something else?

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Oct 10, 2024
@nicklan
Copy link
Collaborator Author

nicklan commented Oct 10, 2024

Sorry for the delay here! This this is ready now. I did some work to figure out if we need to do anything for non-local stores. As far as I can tell, aws, azure, and gcp all return sorted listings.

  • AWS: ListObjectsV2 states: "For general purpose buckets, ListObjectsV2 returns objects in lexicographical order based on their key names." (Directory buckets are out of scope for now)
  • Azure: Docs state here: "A listing operation returns an XML response that contains all or part of the requested list. The operation returns entities in alphabetical order."
  • GCP: The main doc doesn't indicate order, but this page does say: "This page shows you how to list the objects stored in your Cloud Storage buckets, which are ordered in the list lexicographically by name."

So it should just be local that we need for now. I've made a follow-up issue to check if the hdfs integration returns in order (I suspect it does) (TODO: Link issue)

@nicklan nicklan removed the merge hold Don't allow the PR to merge label Oct 10, 2024
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple questions but seems fine :) and maybe we want to add the comment that you had about each cloud's storing as an actual code comment for posterity?

}

pub fn new(store: Arc<DynObjectStore>, prefix: Path, task_executor: Arc<E>) -> Self {
// HACK to check if we're using a LocalFileSystem from ObjectStore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make an issue for follow-up to fix in the future?

commit_files
);
// We assume listing returned ordered, we want reverse order
let commit_files = commit_files.into_iter().rev().collect();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious: it says .rev() works on double-ended iterators - is that what's going on here? This still has to be O(N) right? makes me wonder if we lose that much just by sorting like we used to..?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Development

Successfully merging this pull request may close these issues.

5 participants