feat: expose function to get table of add actions #1033

Merged
wjones127 merged 13 commits into delta-io:main from feat/add_action_table
Jan 11, 2023

Collaborator

@wjones127 wjones127 commented Dec 21, 2022

Description

Exposes a function to get a dataframe of add actions for the selected version of the table.

TODO:

  • add unit tests
  • write user guide
  • handle partition columns
  • handle stats
  • handle tags
  • add a flatten option
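
For readers unfamiliar with what "handle stats" involves: in the Delta log, each add action carries its file statistics as a JSON-encoded string in the `stats` field. The sketch below is a rough illustration of unpacking that string into the kind of nested columns listed above. It is not the implementation in this PR; `add_action_to_row` and the sample path are hypothetical, and partition values are left as the strings the log stores rather than converted to typed values.

```python
import json

# Hypothetical add action, shaped like an entry in a Delta log JSON file.
# The path is made up; "stats" is a JSON-encoded string, as in the log.
add_action = {
    "path": "x=2/part-0.parquet",
    "partitionValues": {"x": "2"},
    "size": 565,
    "modificationTime": 1672531200000,
    "dataChange": True,
    "stats": json.dumps(
        {
            "numRecords": 1,
            "minValues": {"y": 5},
            "maxValues": {"y": 5},
            "nullCount": {"y": 0},
        }
    ),
    "tags": None,
}


def add_action_to_row(action: dict) -> dict:
    """Turn one add action into a row with nested stats columns."""
    # Stats are optional in the log, so fall back to an empty object.
    stats = json.loads(action.get("stats") or "{}")
    return {
        "path": action["path"],
        "size_bytes": action["size"],
        "modification_time": action["modificationTime"],
        "data_change": action["dataChange"],
        "partition_values": action.get("partitionValues") or {},
        "num_records": stats.get("numRecords"),
        "null_count": stats.get("nullCount", {}),
        "min": stats.get("minValues", {}),
        "max": stats.get("maxValues", {}),
        "tags": action.get("tags") or {},
    }


row = add_action_to_row(add_action)
print(row["num_records"], row["min"], row["max"])
```

A file written without statistics would simply yield `None` for `num_records` and empty dicts for the other stats columns in this sketch.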

Related Issue(s)

Documentation

@chitralverma
Contributor

@wjones127 Can we add an indicator of the "number of versions available" somewhere in this metadata?

@wjones127
Collaborator Author

Example:

In [1]: from deltalake import DeltaTable, write_deltalake

In [2]: import pyarrow as pa

In [3]: data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})

In [4]: write_deltalake("tmp", data, partition_by=["x"])

In [5]: dt = DeltaTable("tmp")

In [6]: dt.get_add_actions_df()
Out[6]: 
pyarrow.RecordBatch
path: string
size_bytes: int64
modification_time: timestamp[ms]
data_change: bool
partition_values: struct<x: int64>
  child 0, x: int64
num_records: int64
null_count: struct<y: int64 not null>
  child 0, y: int64 not null
min: struct<y: int64 not null>
  child 0, y: int64 not null
max: struct<y: int64 not null>
  child 0, y: int64 not null

In [7]: dt.get_add_actions_df().to_pandas()
Out[7]: 
                                                path  size_bytes       modification_time  data_change partition_values  num_records null_count       min       max
0  x=2/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True         {'x': 2}            1   {'y': 0}  {'y': 5}  {'y': 5}
1  x=3/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True         {'x': 3}            1   {'y': 0}  {'y': 6}  {'y': 6}
2  x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True         {'x': 1}            1   {'y': 0}  {'y': 4}  {'y': 4}

In [8]: dt.get_add_actions_df(flatten=True).to_pandas()
Out[8]: 
                                                path  size_bytes       modification_time  data_change  partition.x  num_records  null_count.y  min.y  max.y
0  x=2/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True            2            1             0      5      5
1  x=3/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True            3            1             0      6      6
2  x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True            1            1             0      4      4
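
To make the difference between the two outputs concrete, here is a minimal pure-Python sketch of what `flatten=True` appears to do to a single row: each struct column is expanded into dotted column names. This is an illustration only, not the library's code. `flatten_row` is a hypothetical helper, the sample path is made up, and the renaming of `partition_values` to the `partition` prefix is an assumption read off the output above.

```python
def flatten_row(row: dict) -> dict:
    """Flatten one level of nested dicts into dotted column names."""
    # Assumption drawn from the output above: the partition_values struct
    # is shortened to the "partition" prefix; other structs keep their name.
    prefixes = {"partition_values": "partition"}
    flat = {}
    for key, value in row.items():
        if isinstance(value, dict):
            prefix = prefixes.get(key, key)
            for child, child_value in value.items():
                flat[f"{prefix}.{child}"] = child_value
        else:
            flat[key] = value
    return flat


nested = {
    "path": "x=2/part-0.parquet",  # hypothetical path
    "size_bytes": 565,
    "data_change": True,
    "partition_values": {"x": 2},
    "num_records": 1,
    "null_count": {"y": 0},
    "min": {"y": 5},
    "max": {"y": 5},
}
flat = flatten_row(nested)
print(list(flat))
```

The flat form is convenient for filtering and aggregating per column, while the nested form keeps the schema stable when tables differ in their partition and stats columns.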

@@ -70,6 +70,7 @@

#![deny(warnings)]
#![deny(missing_docs)]
#![allow(rustdoc::invalid_html_tags)]
Collaborator Author

I need this in order to build the docs for some reason. Newer lint?

@wjones127 wjones127 marked this pull request as ready for review January 5, 2023 03:08
roeap previously approved these changes Jan 5, 2023
Collaborator

@roeap roeap left a comment

Great work! Really looking forward to moving arrow deeper into our log handling :).

Left one minor naming comment that you may want to look at, otherwise LGTM!

@@ -440,3 +440,37 @@ def __stringify_partition_values(
str_value = str(value)
out.append((field, op, str_value))
return out

def get_add_actions_df(self, flatten: bool = False) -> pyarrow.RecordBatch:
Collaborator

Not super important, but when I see "df" in Python, I always think pandas DataFrame. Since we are returning a record batch, maybe a different name is more fitting for this function? Maybe get_add_action_table, like the one used internally?

Collaborator Author

I'm open to that. What do you think @MrPowers?

Collaborator Author

Another possibility is to change to return a flattened Pandas DataFrame by default, but allow returning record batch:

@overload
def get_add_actions_df(self, flatten: bool, as_pandas: Literal[True]) -> pandas.DataFrame: ...
@overload
def get_add_actions_df(self, flatten: bool, as_pandas: Literal[False]) -> pyarrow.RecordBatch: ...
def get_add_actions_df(self, flatten=True, as_pandas=True):
    ...
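
For anyone weighing this option, here is a self-contained sketch of the `typing.overload` plus `Literal` pattern the proposal relies on. It uses stand-in types so it runs without pandas or pyarrow: a list of row dicts plays the role of the DataFrame and a dict of columns plays the role of the record batch. `get_actions`, `Rows`, `Columns`, and `_DATA` are all hypothetical names, not part of deltalake.

```python
from typing import Dict, List, Literal, Union, overload

# Stand-ins for the two return types under discussion: a list of row dicts
# plays the role of pandas.DataFrame, a dict of columns that of RecordBatch.
Rows = List[Dict[str, int]]
Columns = Dict[str, List[int]]

_DATA: Columns = {"size_bytes": [565, 565], "num_records": [1, 1]}


@overload
def get_actions(as_rows: Literal[True] = ...) -> Rows: ...
@overload
def get_actions(as_rows: Literal[False]) -> Columns: ...
def get_actions(as_rows: bool = True) -> Union[Rows, Columns]:
    """Hypothetical toy function; not part of deltalake."""
    if not as_rows:
        return _DATA
    names = list(_DATA)
    # Transpose the columns into one dict per row.
    return [dict(zip(names, values)) for values in zip(*_DATA.values())]


rows = get_actions()          # a type checker infers Rows here
columns = get_actions(False)  # and Columns here
print(rows[0], columns["size_bytes"])
```

The overloads only affect static type checking; at runtime the single undecorated definition handles both cases. The trade-off is a return type that depends on a flag, which is what the chained alternative avoids.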

Collaborator

Personally, I prefer the chained style (.to_pandas()), as it is consistent with loading the table data. Then again, my personal preference is just that 😆. But @MrPowers seems to know the community quite well :).

Collaborator

I think get_add_actions and to_pandas() is fine. I'm not the best authority for the Pythonic way of doing things 😉

roeap previously approved these changes Jan 10, 2023
@MrPowers
Collaborator

I am really excited about this functionality!!!

roeap previously approved these changes Jan 11, 2023
@wjones127 wjones127 merged commit 83260a8 into delta-io:main Jan 11, 2023
@wjones127 wjones127 deleted the feat/add_action_table branch January 11, 2023 16:40
chitralverma pushed a commit to chitralverma/delta-rs that referenced this pull request Mar 17, 2023
# Description

Exposes function to get a dataframe of add actions for selected version
of the table.

TODO:

 * [x] add unit tests
 * [x] write user guide
 * [x] handle partition columns
 * [x] handle stats
 * [x] handle tags
 * [x] add a `flatten` option

# Related Issue(s)

- closes delta-io#1031

# Documentation

Development

Successfully merging this pull request may close these issues.

Expose the Delta Log in a DataFrame that's easy for analysis
4 participants