test(python): add read / write benchmarks #933

wjones127 · 2022-11-13T19:20:34Z

Description

Considering adding continuous benchmarks to Python reader / writer.

Related Issue(s)

Documentation

@wjones127

This PR builds on top of the changes to runtime handling in #933. In my local tests this fixed #915. Additionally, I added the runtime as a property on the fs handler to avoid re-creating it on every call. In some non-representative tests with a large number of very small partitions, it cut the runtime roughly in half. cc @wjones127
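The idea of keeping the runtime as a property on the handler, rather than building one per call, can be sketched with the standard library alone. Everything here is hypothetical: `Runtime` is a stand-in that only counts how often it is constructed, and `Handler`, `get_size`, and the two counting helpers are illustrative names, not the real `DeltaFileSystemHandler` API.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an expensive async runtime.
struct Runtime;

impl Runtime {
    /// Records each construction in `created` so the cost can be observed.
    fn new(created: &mut usize) -> Self {
        *created += 1;
        Runtime
    }

    /// Pretends to drive a task to completion and returns its result.
    fn block_on(&self, value: usize) -> usize {
        value
    }
}

/// Hypothetical handler that builds the runtime once and keeps it as a field.
struct Handler {
    rt: Arc<Runtime>,
}

impl Handler {
    fn new(created: &mut usize) -> Self {
        Handler { rt: Arc::new(Runtime::new(created)) }
    }

    /// Every call reuses the stored runtime.
    fn get_size(&self, size: usize) -> usize {
        self.rt.block_on(size)
    }
}

/// Number of runtimes built when one handler serves `calls` requests.
fn runtimes_created_cached(calls: usize) -> usize {
    let mut created = 0;
    let handler = Handler::new(&mut created);
    for i in 0..calls {
        handler.get_size(i);
    }
    created
}

/// Number of runtimes built when each request constructs its own.
fn runtimes_created_per_call(calls: usize) -> usize {
    let mut created = 0;
    for i in 0..calls {
        Runtime::new(&mut created).block_on(i);
    }
    created
}

fn main() {
    println!("cached:   {}", runtimes_created_cached(1000));
    println!("per call: {}", runtimes_created_per_call(1000));
}
```

With many small partitions the number of filesystem calls grows quickly, so a fixed per-call setup cost is where one would expect the reported saving to come from.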

wjones127 · 2023-01-06T05:20:50Z

Performance improvements in filesystems. Slight improvement in read (mean 79.7 ms to 74.1 ms), and roughly 3x improvement for writing (mean 3.71 s to 1.10 s).

Before

----------------------------------------------------------------------------------- benchmark 'read': 2 tests -----------------------------------------------------------------------------------
Name (time in ms)                   Min                Max               Mean             StdDev             Median                IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_read_pyarrow     40.1554 (1.0)      41.5515 (1.0)      40.5156 (1.0)       0.3544 (1.0)      40.4168 (1.0)       0.3003 (1.0)           4;2  24.6818 (1.0)          19           1
test_benchmark_read             66.8070 (1.66)     93.4869 (2.25)     79.7360 (1.97)     10.5039 (29.64)    78.7281 (1.95)     16.3939 (54.59)         2;0  12.5414 (0.51)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------- benchmark 'write': 1 tests ----------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_benchmark_write     3.6929  3.7473  3.7126  0.0224  3.7107  0.0327       1;0  0.2694       5           1
-------------------------------------------------------------------------------------------------------------

After

---------------------------------------------------------------------------------- benchmark 'read': 2 tests ----------------------------------------------------------------------------------
Name (time in ms)                   Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_read_pyarrow     40.7163 (1.0)      42.9173 (1.0)      41.3004 (1.0)      0.5795 (1.0)      41.0563 (1.0)      0.4153 (1.0)           3;2  24.2128 (1.0)          15           1
test_benchmark_read             61.5885 (1.51)     81.7273 (1.90)     74.1336 (1.79)     7.6207 (13.15)    75.9647 (1.85)     8.2228 (19.80)         1;0  13.4892 (0.56)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------- benchmark 'write': 1 tests ----------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_benchmark_write     1.0848  1.1201  1.0992  0.0156  1.0958  0.0274       1;0  0.9098       5           1
-------------------------------------------------------------------------------------------------------------
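The "slight" and "3x" figures can be checked against the Mean columns of the tables above. A minimal sketch, where `speedup` is just an illustrative helper that divides the before mean by the after mean:

```rust
/// Ratio of the "before" mean to the "after" mean; values above 1.0 mean faster.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

fn main() {
    // Means copied from the benchmark tables (write in seconds, read in ms).
    let write = speedup(3.7126, 1.0992);
    let read = speedup(79.7360, 74.1336);

    println!("write speedup: {:.2}x", write); // prints "write speedup: 3.38x"
    println!("read speedup:  {:.2}x", read); // prints "read speedup:  1.08x"
}
```

The read benchmark ran only 5 rounds with a standard deviation of several milliseconds, so the read figure is best treated as indicative rather than precise.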

wjones127 · 2023-01-06T05:22:33Z

python/src/filesystem.rs

@@ -531,10 +531,10 @@ impl ObjectOutputStream {
        Err(PyNotImplementedError::new_err("'read' not implemented"))
    }

-    fn write(&mut self, data: Vec<u8>) -> PyResult<i64> {
+    fn write(&mut self, data: &PyBytes) -> PyResult<i64> {


Using Vec<u8> meant PyO3 was cloning all the input data. By using PyBytes we avoid the copy.
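The cost difference can be illustrated without PyO3. This is a standard-library analogy only: `write_owned`, `write_borrowed`, and `shares_buffer` are hypothetical names, and the `source` vector stands in for bytes owned by the Python side. In the real change the borrowed view would come from the `&PyBytes` argument (PyO3's `PyBytes::as_bytes` returns a `&[u8]`).

```rust
/// Takes ownership: a caller that only has borrowed bytes must copy them first.
fn write_owned(data: Vec<u8>) -> i64 {
    data.len() as i64
}

/// Borrows the bytes: no allocation or copy is needed to make the call.
fn write_borrowed(data: &[u8]) -> i64 {
    data.len() as i64
}

/// True when both slices start at the same address, i.e. no copy was made.
fn shares_buffer(a: &[u8], b: &[u8]) -> bool {
    a.as_ptr() == b.as_ptr()
}

fn main() {
    // Stands in for a buffer owned by the caller (the Python bytes object).
    let source: Vec<u8> = vec![7u8; 1024];

    // Owned parameter: the whole buffer is duplicated before the call.
    let copy = source.clone();
    assert!(!shares_buffer(&source, &copy));
    let written_owned = write_owned(copy);

    // Borrowed parameter: the callee sees the caller's memory directly.
    let view: &[u8] = &source;
    assert!(shares_buffer(&source, view));
    let written_borrowed = write_borrowed(view);

    assert_eq!(written_owned, written_borrowed);
    println!("wrote {} bytes either way", written_borrowed);
}
```

Both paths write the same bytes; the borrowed version simply skips one allocation and one full copy per `write` call, which adds up when large buffers are written repeatedly.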


roeap · 2023-01-12T10:59:11Z

python/src/filesystem.rs

@@ -229,7 +229,7 @@ impl DeltaFileSystemHandler {
        let file = self
            .rt
            .block_on(ObjectInputFile::try_new(
-                self.rt.clone(),
+                Arc::clone(&self.rt),


Just out of curiosity, I would have thought the statements are equivalent, since rt is inside an Arc ... ?
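For reference, the two spellings do behave the same when the receiver is an `Arc`: both produce a new handle to the same allocation and bump the reference count. A small sketch, with a hypothetical `Runtime` struct and `clone_both_ways` helper used only for illustration:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the shared runtime.
struct Runtime {
    name: &'static str,
}

/// Clones the handle once with method syntax and once with `Arc::clone`.
fn clone_both_ways(rt: &Arc<Runtime>) -> (Arc<Runtime>, Arc<Runtime>) {
    (rt.clone(), Arc::clone(rt))
}

fn main() {
    let rt = Arc::new(Runtime { name: "demo" });
    let (a, b) = clone_both_ways(&rt);

    // Both clones point at the same underlying Runtime as the original.
    assert!(Arc::ptr_eq(&a, &b));
    assert!(Arc::ptr_eq(&a, &rt));

    // Original handle plus two clones.
    assert_eq!(Arc::strong_count(&rt), 3);

    println!("{} handles to runtime '{}'", Arc::strong_count(&rt), a.name);
}
```

The difference is one of readability: `Arc::clone(&self.rt)` makes it explicit at the call site that only the pointer is cloned, not the value behind it.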

roeap · 2023-01-17T16:46:16Z

@wjones127 - our users are very much looking forward to a new python release, and I think it would be an even greater release if it contains this PR, due to the performance improvements it contains :). Is there still something you want to do under this PR, or can we merge it?

wjones127 · 2023-01-17T17:27:04Z

I think we can merge.


This was referenced Nov 17, 2022

Threading issues accessing ADLSGen2 table from Python #915

Closed

feat: improve write perfromance of DeltaFileSystemHandler #943

Merged

github-actions bot added the python label Jan 6, 2023

wjones127 added 5 commits January 5, 2023 20:36

test(python): add read write benchmark

5908576

refactor(python): eliminate most use of gil lock

84b4489

test(python): add workflow and todo

fd923f9

enhance benchmark

976fe9b

performance improvements

6fa6398

wjones127 force-pushed the benchmark branch from 5ccf698 to 6fa6398 Compare January 6, 2023 05:18

format

d39d016

wjones127 added 4 commits January 5, 2023 21:25

Update Cargo.toml

ef1c463

CI fixes

dd53315

Dummy benchmark to test regression message

bcc8b0a

Revert "Dummy benchmark to test regression message"

cb1d66a

This reverts commit bcc8b0a.

wjones127 marked this pull request as ready for review January 11, 2023 03:21

wjones127 requested review from fvaleye, rtyler, roeap, houqp, xianwill and mosyp as code owners January 11, 2023 03:21

roeap approved these changes Jan 12, 2023

wjones127 merged commit 750f400 into delta-io:main Jan 17, 2023

wjones127 deleted the benchmark branch January 17, 2023 18:00

wjones127 mentioned this pull request Jan 24, 2023

Deltalake read generate a massive number of read request #931

Closed
