Change iterrows method for index attribute in row data generation #1706

Merged
merged 1 commit into from
Nov 25, 2020

Conversation

Mmoncadaisla
Contributor

Context

This small PR from Support makes a minor change to the _compute_copy_data function, which is used by _copy_from (and so by to_carto), to improve performance on large datasets, in terms of both rows and columns.

The function currently uses the pandas.DataFrame.iterrows() method, which yields both the row index and a Series containing the row's column values, but only the index is used afterward.
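To illustrate the point, here is a minimal sketch of the two iteration styles. It is not the actual _compute_copy_data code: the helper names rows_via_iterrows and rows_via_index, the small DataFrame, and the use of DataFrame.at for cell lookup are assumptions made for the example, and a unique index is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical small DataFrame, only for illustration.
df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5) * 2.0})


def rows_via_iterrows(frame):
    """Iterate with iterrows(); the per-row Series is built and then discarded."""
    for index, _ in frame.iterrows():
        yield [frame.at[index, column] for column in frame.columns]


def rows_via_index(frame):
    """Iterate over the index only; no per-row Series is constructed."""
    for index in frame.index:
        yield [frame.at[index, column] for column in frame.columns]


# Both generators produce the same row data.
same = list(rows_via_iterrows(df)) == list(rows_via_index(df))
print(same)
```

Since the Series returned by iterrows() is never read, iterating over df.index should yield the same rows while skipping the cost of building one Series per row.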

Further context can be found in this CH story.

PR changes

This PR contains one file modification:

  • context_manager: the change to the _compute_copy_data function described above

Detected potential improvement

A test with a 100,000 x 10 (rows x cols) dummy DataFrame suggests that there is a timing difference:

df = pd.DataFrame([np.arange(10) for number_rows in range(100000)])

[image: screenshot of the timing comparison]
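A comparison along these lines could be reproduced with a sketch like the one below. The function names loop_iterrows and loop_index are hypothetical, the loops only count rows rather than build the copy data, and the measured times will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same dummy DataFrame as in the test above: 100,000 rows x 10 columns.
df = pd.DataFrame([np.arange(10) for _ in range(100000)])


def loop_iterrows():
    """Visit every row with iterrows(), ignoring the Series it yields."""
    count = 0
    for index, _ in df.iterrows():
        count += 1
    return count


def loop_index():
    """Visit every row through the index attribute only."""
    count = 0
    for index in df.index:
        count += 1
    return count


# Time a single pass of each loop.
t_iterrows = timeit.timeit(loop_iterrows, number=1)
t_index = timeit.timeit(loop_index, number=1)
print(f"iterrows: {t_iterrows:.3f} s, index: {t_index:.3f} s")
```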

Moreover, a single to_carto test run against the mmoncada account with a dataset of 722,720 rows x 172 columns gave the following results:

A) With index instead of iterrows()

Geometry column not found in the GeoDataFrame.
Success! Data uploaded to table "test_upload_full_1" correctly
CPU times: user 18min 55s, sys: 20.2 s, total: 19min 15s
Wall time: 23min 26s
'test_upload_full_1'

B) With the current iterrows()

Geometry column not found in the GeoDataFrame.
Success! Data uploaded to table "test_upload_full_2" correctly
CPU times: user 23min 30s, sys: 50.3 s, total: 24min 21s
Wall time: 30min 27s

Mmoncadaisla requested a review from Jesus89 on November 23, 2020, 17:37
Mmoncadaisla self-assigned this on Nov 23, 2020
Member

@Jesus89 Jesus89 left a comment

Cool performance improvement. Thanks!

Jesus89 merged commit 2b2a22b into develop on Nov 25, 2020
Jesus89 deleted the mmoncadaisla/compute_copy_data_performance branch on November 25, 2020, 14:06
Jesus89 commented Nov 25, 2020

It will be available in the next release.
