Change iterrows method for index attribute in row data generation #1706

Mmoncadaisla · 2020-11-23T17:37:26Z

Context

This small PR from Support makes a minor change to the _compute_copy_data function, which is used by _copy_from and, in turn, by to_carto, to improve performance on large datasets (many rows and columns).

The function currently uses the pandas.DataFrame.iterrows() method, which returns both the row index and a Series containing the row's column values, but only the index is used afterward.
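As a minimal sketch of the idea (not the actual cartoframes code; `rows_via_iterrows` and `rows_via_index` are hypothetical names used only for illustration), the two loops below yield the same row data, but the second one never builds the per-row Series:

```python
import numpy as np
import pandas as pd


def rows_via_iterrows(df, columns):
    """Yield one list of cell values per row, looping with iterrows()."""
    for index, _ in df.iterrows():  # a Series is built per row and then discarded
        yield [df.at[index, column] for column in columns]


def rows_via_index(df, columns):
    """Yield the same lists, looping over df.index only."""
    for index in df.index:  # no per-row Series is constructed
        yield [df.at[index, column] for column in columns]


# Small dummy frame; assumes a unique index so df.at returns scalars.
df = pd.DataFrame([np.arange(10) for _ in range(1000)])
columns = list(df.columns)

# Both generators produce identical row data.
assert list(rows_via_iterrows(df, columns)) == list(rows_via_index(df, columns))
```

Since the loop body only needs the index to look up cells, dropping iterrows() should not change the generated data, only the per-row overhead.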

Further context can be found in this CH story.

PR changes

This PR modifies one file:

context_manager: the change to the _compute_copy_data function described above

Detected potential improvement

After a test with a 100,000 x 10 (rows x columns) dummy DataFrame, there appears to be a timing difference between the two approaches:

df = pd.DataFrame([np.arange(10) for number_rows in range(100000)])
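One way such a comparison could be reproduced is sketched below. This is an assumption about how the test might be set up, not the exact benchmark that was run; `time_row_loops` is a hypothetical helper, and the loop bodies are left empty so that only the iteration overhead is measured.

```python
import timeit

import numpy as np
import pandas as pd


def time_row_loops(n_rows=100_000, n_cols=10):
    """Return (iterrows_seconds, index_seconds) for one pass over a dummy frame."""
    df = pd.DataFrame([np.arange(n_cols) for _ in range(n_rows)])

    def with_iterrows():
        for index, _ in df.iterrows():  # builds a Series per row
            pass

    def with_index():
        for index in df.index:  # iterates over index labels only
            pass

    return (
        timeit.timeit(with_iterrows, number=1),
        timeit.timeit(with_index, number=1),
    )


iterrows_seconds, index_seconds = time_row_loops()
print(f"iterrows(): {iterrows_seconds:.3f} s, index: {index_seconds:.3f} s")
```

Absolute timings depend on the machine and the pandas version, so only the relative difference is meaningful.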

Moreover, a single to_carto test against the mmoncada account, using a dataset of 722720 rows x 172 columns, gave the following results:

A) With index instead of iterrows()

Geometry column not found in the GeoDataFrame.
Success! Data uploaded to table "test_upload_full_1" correctly
CPU times: user 18min 55s, sys: 20.2 s, total: 19min 15s
Wall time: 23min 26s
'test_upload_full_1'

B) With the current iterrows()

Geometry column not found in the GeoDataFrame.
Success! Data uploaded to table "test_upload_full_2" correctly
CPU times: user 23min 30s, sys: 50.3 s, total: 24min 21s
Wall time: 30min 27s
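For reference, the difference between the two reported wall times can be worked out as follows. This is plain arithmetic on the figures above; each variant was run only once, so the result is indicative rather than a rigorous benchmark.

```python
def to_seconds(minutes, seconds):
    """Convert a 'MMmin SSs' reading into seconds."""
    return minutes * 60 + seconds


wall_index = to_seconds(23, 26)     # run A: loop over the index
wall_iterrows = to_seconds(30, 27)  # run B: loop with iterrows()

saved = wall_iterrows - wall_index
reduction = saved / wall_iterrows

print(f"saved {saved} s ({reduction:.1%} of the iterrows() wall time)")
```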

Jesus89

Cool performance improvement. Thanks!

Jesus89 · 2020-11-25T14:06:34Z

It will be available in the next release.

Change iterrows method for index attribute in row data generation

f1d83a1

Mmoncadaisla requested a review from Jesus89 November 23, 2020 17:37

Mmoncadaisla self-assigned this Nov 23, 2020

Jesus89 approved these changes Nov 25, 2020

Jesus89 merged commit 2b2a22b into develop Nov 25, 2020

Jesus89 deleted the mmoncadaisla/compute_copy_data_performance branch November 25, 2020 14:06
