Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: much faster file name resolution using ListFiles #75

Merged
merged 2 commits into from
Aug 31, 2024
Merged

Conversation

sg-s
Copy link
Collaborator

@sg-s sg-s commented Aug 30, 2024

problem description

api.get_dataframe (and therefore DataFrame.from_deeporigin) can take extremely long because:

  1. 2 calls to list_database_rows and describe_row take place in series when they could take place in parallel (this is a problem of 700ms vs 300ms)
  2. (more serious problem). there is a place where N DescriebFile calls are made in series, where N is the number of files in the database. this means that the time it takes to fetch data grows with the number of files linearly, and the time for each file resolution is ~350ms.

proposed solution in this PR

using a new ListFiles endpoint that allows us to resolve all file IDs and names in a single API call (which typically takes 300ms no matter how many files)

screenshot

from cold start, this is how long it takes to display a dataframe. in the old version, this took ~10s

Screenshot 2024-08-30 at 8 52 19 AM

rejected alternative

using async methods. this approach was tried in this PR: #73

it was rejected because it's much slower, and complicates testing dramatically

@sg-s sg-s requested review from a team as code owners August 30, 2024 01:52
@sg-s sg-s self-assigned this Aug 30, 2024
src/data_hub/api.py Show resolved Hide resolved
src/data_hub/api.py Show resolved Hide resolved
src/data_hub/api.py Show resolved Hide resolved
src/data_hub/api.py Show resolved Hide resolved
src/data_hub/api.py Show resolved Hide resolved
pyproject.toml Show resolved Hide resolved
@sg-s sg-s enabled auto-merge (squash) August 30, 2024 12:54
Copy link
Member

@jonrkarr jonrkarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Nice job identifying a bottleneck and finding a way to speed it up.

Using async to execute many queries in parallel is a good thought. In some cases, we can get even higher performance with specialized bulk or composite methods.

src/data_hub/_api.py Show resolved Hide resolved
@sg-s sg-s merged commit a5537b2 into main Aug 31, 2024
11 checks passed
@sg-s sg-s deleted the faster-list-files branch August 31, 2024 01:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants