test(python): add read / write benchmarks #933

Merged: 10 commits, Jan 17, 2023

wjones127
Collaborator

Description

Considering adding continuous benchmarks to Python reader / writer.

Related Issue(s)

Documentation

roeap added a commit that referenced this pull request Nov 17, 2022
# Description

This PR builds on top of the changes to runtime handling in #933. In
my local tests this fixed #915. Additionally, I added the runtime as a
property on the fs handler to avoid re-creating it on every call. In
some non-representative tests with a large number of very small
partitions, this cut the execution time roughly in half.
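
The idea of keeping the runtime on the handler can be illustrated without PyO3 or Tokio. The following is a minimal sketch under stated assumptions: `FakeRuntime`, `Handler`, and the creation counter are hypothetical stand-ins, not delta-rs types. It only shows that a runtime stored as a field is constructed once, while the per-call variant constructs one on every call.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Counts how many times the hypothetical runtime has been constructed.
static RUNTIMES_CREATED: AtomicUsize = AtomicUsize::new(0);

// Stand-in for an expensive-to-build async runtime (assumption: not a Tokio type).
struct FakeRuntime;

impl FakeRuntime {
    fn new() -> Self {
        RUNTIMES_CREATED.fetch_add(1, Ordering::SeqCst);
        FakeRuntime
    }

    // Pretends to drive some work to completion and returns its result.
    fn block_on(&self, value: usize) -> usize {
        value
    }
}

// Hypothetical handler that keeps the runtime as a field.
struct Handler {
    rt: Arc<FakeRuntime>,
}

impl Handler {
    fn new() -> Self {
        Handler { rt: Arc::new(FakeRuntime::new()) }
    }

    // Reuses the stored runtime: nothing is constructed per call.
    fn file_size_cached(&self, size: usize) -> usize {
        self.rt.block_on(size)
    }

    // Builds a fresh runtime on every call, for comparison.
    fn file_size_fresh(&self, size: usize) -> usize {
        FakeRuntime::new().block_on(size)
    }
}

fn runtimes_created() -> usize {
    RUNTIMES_CREATED.load(Ordering::SeqCst)
}

fn main() {
    let handler = Handler::new();
    let start = runtimes_created();
    for i in 0..1000 {
        handler.file_size_cached(i);
    }
    let after_cached = runtimes_created();
    for i in 0..1000 {
        handler.file_size_fresh(i);
    }
    let after_fresh = runtimes_created();
    println!("cached path constructed {} runtimes", after_cached - start);
    println!("fresh path constructed {} runtimes", after_fresh - after_cached);
}
```

With many small partitions the handler is called many times, which is where the per-call construction cost would add up.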

cc @wjones127 

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
@github-actions github-actions bot added the python label Jan 6, 2023
@wjones127
Collaborator Author

Performance improvements in filesystems. Reads improve slightly (mean 79.7 ms to 74.1 ms), and writes are roughly 3x faster (mean 3.71 s to 1.10 s).

Before

----------------------------------------------------------------------------------- benchmark 'read': 2 tests -----------------------------------------------------------------------------------
Name (time in ms)                   Min                Max               Mean             StdDev             Median                IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_read_pyarrow     40.1554 (1.0)      41.5515 (1.0)      40.5156 (1.0)       0.3544 (1.0)      40.4168 (1.0)       0.3003 (1.0)           4;2  24.6818 (1.0)          19           1
test_benchmark_read             66.8070 (1.66)     93.4869 (2.25)     79.7360 (1.97)     10.5039 (29.64)    78.7281 (1.95)     16.3939 (54.59)         2;0  12.5414 (0.51)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------- benchmark 'write': 1 tests ----------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_benchmark_write     3.6929  3.7473  3.7126  0.0224  3.7107  0.0327       1;0  0.2694       5           1
-------------------------------------------------------------------------------------------------------------

After

---------------------------------------------------------------------------------- benchmark 'read': 2 tests ----------------------------------------------------------------------------------
Name (time in ms)                   Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_read_pyarrow     40.7163 (1.0)      42.9173 (1.0)      41.3004 (1.0)      0.5795 (1.0)      41.0563 (1.0)      0.4153 (1.0)           3;2  24.2128 (1.0)          15           1
test_benchmark_read             61.5885 (1.51)     81.7273 (1.90)     74.1336 (1.79)     7.6207 (13.15)    75.9647 (1.85)     8.2228 (19.80)         1;0  13.4892 (0.56)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------- benchmark 'write': 1 tests ----------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_benchmark_write     1.0848  1.1201  1.0992  0.0156  1.0958  0.0274       1;0  0.9098       5           1
-------------------------------------------------------------------------------------------------------------
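
For reference, the speedup figures can be reproduced from the mean columns of the tables above. This is a small sketch; `speedup` is a hypothetical helper, and the inputs are the reported means copied from the Before and After runs. Note that the read means differ by about 5.6 ms, which is less than the standard deviation of either read run (5 rounds each), so the read change may be within noise.

```rust
// Ratio of mean runtimes: values above 1.0 mean the "after" run is faster.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

fn main() {
    // Mean times copied from the benchmark tables (Before vs. After).
    let write = speedup(3.7126, 1.0992); // seconds
    let read = speedup(79.7360, 74.1336); // milliseconds
    println!("write speedup: {:.2}x", write);
    println!("read speedup:  {:.2}x", read);
}
```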

@@ -531,10 +531,10 @@ impl ObjectOutputStream {
         Err(PyNotImplementedError::new_err("'read' not implemented"))
     }

-    fn write(&mut self, data: Vec<u8>) -> PyResult<i64> {
+    fn write(&mut self, data: &PyBytes) -> PyResult<i64> {
Collaborator Author


Using Vec<u8> meant PyO3 was cloning all the input data. By using PyBytes we avoid the copy.
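
The same trade-off exists in plain Rust, independent of PyO3. The sketch below is an analogy rather than the delta-rs code: `write_owned` and `write_borrowed` are hypothetical functions. When the caller only holds a borrowed buffer, an owned `Vec<u8>` parameter forces an allocation and a full copy, while a slice parameter reads the existing bytes in place. In PyO3, `PyBytes::as_bytes` gives a comparable `&[u8]` view of the Python object's data.

```rust
// Takes ownership: a caller holding only a borrowed buffer must copy it first.
fn write_owned(sink: &mut Vec<u8>, data: Vec<u8>) -> i64 {
    let n = data.len() as i64;
    sink.extend_from_slice(&data);
    n
}

// Borrows: reads the caller's bytes directly, with no intermediate allocation.
fn write_borrowed(sink: &mut Vec<u8>, data: &[u8]) -> i64 {
    sink.extend_from_slice(data);
    data.len() as i64
}

fn main() {
    let source: &[u8] = b"hello delta";
    let mut sink = Vec::new();

    // `to_vec` allocates and copies every byte just to satisfy the signature.
    let copied = source.to_vec();
    assert_ne!(copied.as_ptr(), source.as_ptr());
    write_owned(&mut sink, copied);

    // The slice is passed through unchanged.
    write_borrowed(&mut sink, source);

    println!("wrote {} bytes in total", sink.len());
}
```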

@wjones127 wjones127 marked this pull request as ready for review January 11, 2023 03:21
@@ -229,7 +229,7 @@ impl DeltaFileSystemHandler {
         let file = self
             .rt
             .block_on(ObjectInputFile::try_new(
-                self.rt.clone(),
+                Arc::clone(&self.rt),
Collaborator


Just out of curiosity, I would have thought the statements are equivalent, since rt is inside an Arc ... ?
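
For what it is worth, the two spellings do behave the same when the receiver is an `Arc`: both increment the reference count and return a handle to the same allocation. The explicit `Arc::clone(&x)` form is a common style choice because it makes clear that no deep copy happens. A minimal sketch, where `Runtime` and `clone_both_ways` are hypothetical names used only for illustration:

```rust
use std::sync::Arc;

// Hypothetical stand-in for the runtime held behind an `Arc`.
struct Runtime {
    name: String,
}

// Returns two handles created with the two spellings under discussion.
fn clone_both_ways(rt: &Arc<Runtime>) -> (Arc<Runtime>, Arc<Runtime>) {
    let method_style = rt.clone(); // resolves to `Arc::clone` because `rt` is an `Arc`
    let explicit_style = Arc::clone(rt); // same call, spelled so the cheap clone is obvious
    (method_style, explicit_style)
}

fn main() {
    let rt = Arc::new(Runtime { name: "io".to_string() });
    let (a, b) = clone_both_ways(&rt);

    // Both handles point at the same allocation; only the reference count changed.
    assert!(Arc::ptr_eq(&a, &b));
    assert!(Arc::ptr_eq(&a, &rt));
    println!("{} has {} strong references", rt.name, Arc::strong_count(&rt));
}
```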

@roeap
Collaborator

roeap commented Jan 17, 2023

@wjones127 - our users are very much looking forward to a new python release, and I think it would be an even greater release if it contains this PR, due to the performance improvements it contains :). Is there still something you want to do under this PR, or can we merge it?

@wjones127
Collaborator Author

I think we can merge.

@wjones127 wjones127 merged commit 750f400 into delta-io:main Jan 17, 2023
@wjones127 wjones127 deleted the benchmark branch January 17, 2023 18:00
chitralverma pushed a commit to chitralverma/delta-rs that referenced this pull request Mar 17, 2023