Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Files metadata table #614

Merged
merged 19 commits into from
Jul 4, 2024
Merged

Conversation

Gowthami03B
Copy link
Contributor

No description provided.

@Gowthami03B Gowthami03B changed the title files metadata table Add Files metadata table Apr 18, 2024
@Gowthami03B Gowthami03B mentioned this pull request Apr 18, 2024
8 tasks
Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Gowthami03B this looks great. I have one request to add the optional snapshot_id argument to support time traveling between different snapshots Iceberg metadata table.

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
tests/integration/test_inspect_table.py Outdated Show resolved Hide resolved
tests/integration/test_inspect_table.py Outdated Show resolved Hide resolved
@sungwy
Copy link
Collaborator

sungwy commented May 6, 2024

Hi @HonahX could we get your help in triggering this workflow to see if the CI succeeds?

@Gowthami03B Gowthami03B requested a review from geruh May 12, 2024 22:11
Copy link
Contributor Author

@Gowthami03B Gowthami03B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @HonahX

Can I get a re-trigger again (the checks succeeds in my PR to my fork, so I am confident)?
Also requesting re-review here , I addressed the comments, hoping we could close this :)

@kevinjqliu kevinjqliu mentioned this pull request May 14, 2024
39 tasks
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding this @Gowthami03B.

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
@Gowthami03B Gowthami03B requested a review from geruh May 22, 2024 19:20
@Gowthami03B Gowthami03B force-pushed the metadata-files-table branch from 42d091d to 334cf94 Compare May 22, 2024 21:04
@Fokko
Copy link
Contributor

Fokko commented May 29, 2024

Sorry for now following up on this @Gowthami03B Could you rebase so we can get this in? Thanks!

@Fokko
Copy link
Contributor

Fokko commented Jun 26, 2024

@Gowthami03B gentle ping, this is the last metadata table, and we would love to include this into the release! 🙌

@Gowthami03B
Copy link
Contributor Author

@Fokko @kevinjqliu @amogh-jahagirdar Can I get a re-review here please? Want to close this asap for the release timeline :)

@Gowthami03B Gowthami03B requested a review from kevinjqliu July 3, 2024 17:35
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

r? @Fokko / @HonahX

@HonahX
Copy link
Contributor

HonahX commented Jul 4, 2024

LGTM too. @Gowthami03B Thanks for working on this! Thanks everyone for reviewing. Let's get this last metadata table in!

@HonahX HonahX merged commit 254a701 into apache:main Jul 4, 2024
7 checks passed
@DieHertz
Copy link

DieHertz commented Sep 25, 2024

Hi guys, sorry if it's not the right place to ask this question.
Do you know of a viable way to speed up table.inspect.files() for large tables?
Maybe something in mind that I could implement and contribute to upstream.

I haven't profiled yet but I guess the gist of the issue is manifest.fetch_manifest_entry being called synchronously and sequentially in a loop.
Offloading this call to a thread-based executor doesn't help much, probably because of GIL, and a process-based executor is harder to implement because of unpicklable types involved.

As of now pyspark's .files metatable collection can be done considerably quicker than pyiceberg's

@kevinjqliu
Copy link
Contributor

I think there's definitely room for improvement. @DieHertz do you mind opening an issue for this?

@DieHertz
Copy link

Will do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants