Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asyncio Listing and Inventory Report Integration #573

Merged
merged 18 commits into from
Aug 18, 2023

Conversation

hanseaston
Copy link
Contributor

@hanseaston hanseaston commented Aug 9, 2023

@martindurant

I've completed the integration of the inventory report service and asyncio listing logic based on our discussions.

The overview of the logic is as follows:

For listing, the user can optionally pass in a inventory_report_info argument as part of the kwargs.

If the argument is not passed in, then the listing logic is exactly the same as previously, thus it is backward compatible with existing implementation.

If the argument is passed in, the high-level is as follows:

  1. Validate whether inventory_report_info has the correct fields included to fetch the inventory report.
  2. Fetch the inventory report configuration.
  3. Parse the inventory report configuration to validate and extract the inventory report location.
  4. Does listing on the inventory report metadata.
  5. Use the metadata to fetch the latest inventory report(s).
  6. Parse the inventory report content and extract objects metadata.
  7. If the user wants to directly use the inventory report for listing, then return the objects metadata.
  8. If the user wants to retrieve the most up-to-date listing, then use the object names from the inventory report to divide up workload for each coroutine fast listing.

The listing performance is ~10x from my own experiments.

Latency for 1 - 6 should be trivial since:

  1. Fetching the inventory report is very fast, a single call.
  2. Performing listing on inventory is very fast, most likely a single call (5000 inventory reports metadata on a single page).
  3. Downloading inventory report is relatively past (inventory reports with 1M objects is ~40MB).

In addition, if the user chooses to use the inventory report content directly for listing results, the speedup can be >100x.

Comments have been added throughout the implementation, and there are comprehensive test cases covering each method in the InventoryReport class.

Would really appreciate it if any of you has any feedback, so that we can have this awesome feature integrated!

Copy link

@bernardhan33 bernardhan33 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi all, this is Bernard Han from Google GCS Team. I'm Hans' co-host this summer. Leaving some general comments on this public PR.

gcsfs/inventory_report.py Show resolved Hide resolved
gcsfs/tests/test_inventory_report.py Outdated Show resolved Hide resolved
gcsfs/inventory_report.py Outdated Show resolved Hide resolved
gcsfs/inventory_report.py Outdated Show resolved Hide resolved
gcsfs/tests/test_inventory_report.py Outdated Show resolved Hide resolved
gcsfs/inventory_report.py Outdated Show resolved Hide resolved
gcsfs/core.py Show resolved Hide resolved
gcsfs/core.py Outdated Show resolved Hide resolved
gcsfs/core.py Outdated Show resolved Hide resolved
gcsfs/core.py Outdated Show resolved Hide resolved
@bernardhan33
Copy link

bernardhan33 commented Aug 13, 2023

Also, are README changes in a separate commit or PR?

@martindurant
Copy link
Member

Everything passes, so that's a good start :)

I will start looking at this later today - there is quite a lot! Sorry for being slow getting here, got hit by covid.

@hanseaston
Copy link
Contributor Author

hanseaston commented Aug 15, 2023

Everything passes, so that's a good start :)
I will start looking at this later today - there is quite a lot! Sorry for being slow getting here, got hit by covid.

Oh no sorry to hear, covid sucks ;(

@hanseaston
Copy link
Contributor Author

Also, are README changes in a separate commit or PR?

I am thinking of adding a README change, and maybe a blog post on how to use the inventory report, once the PR is officially merged (in order to finalize all the details).

@martindurant
Copy link
Member

Yes, we certainly need prose "howto" documentation around this, in our own docs source. I would recommend any blog focus on the process of development and speedup for your example case rather that a how-to-use piece. Of course, the two can link to one-another.

I don't mind if the docs lag behind this PR, effectively making the new feature experimental. In fact, calling it an experimental feature isn't a bad idea, since we don't have a good feel of the set of use cases it might be pulled into, and there may be some that behave badly.

@hanseaston
Copy link
Contributor Author

Yes, we certainly need prose "howto" documentation around this, in our own docs source. I would recommend any blog focus on the process of development and speedup for your example case rather that a how-to-use piece. Of course, the two can link to one-another.

Sounds good, I will write up a prose (internally) and a blog (shared externally) once all the changes are finalized for this PR. Meanwhile, should I add a couple comments on top of the inventory report logic, indicating this is currently experimental?

@hanseaston
Copy link
Contributor Author

Update: I fixed the typos that Bernard has pointed out in the latest commit @Magichan33.

@martindurant
Copy link
Member

should I add a couple comments on top of the inventory report logic, indicating this is currently experimental?

Yes, I think so

@hanseaston
Copy link
Contributor Author

should I add a couple comments on top of the inventory report logic, indicating this is currently experimental?

Yes, I think so

Sounds good, just updated that.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just small things left

gcsfs/core.py Show resolved Hide resolved
gcsfs/core.py Show resolved Hide resolved
gcsfs/core.py Show resolved Hide resolved
gcsfs/core.py Show resolved Hide resolved
gcsfs/core.py Show resolved Hide resolved
gcsfs/core.py Show resolved Hide resolved
gcsfs/core.py Show resolved Hide resolved
@martindurant martindurant merged commit 1758134 into fsspec:main Aug 18, 2023
5 checks passed
@martindurant
Copy link
Member

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants