Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: performance regression in assigning value to df[col_name] #46267

Closed
2 of 3 tasks
Hubedge opened this issue Mar 8, 2022 · 11 comments · Fixed by #50148
Closed
2 of 3 tasks

PERF: performance regression in assigning value to df[col_name] #46267

Hubedge opened this issue Mar 8, 2022 · 11 comments · Fixed by #50148
Labels
Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Comments

@Hubedge
Copy link

Hubedge commented Mar 8, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
from tqdm import tqdm

df = pd.DataFrame(np.random.random((5000,15000)))
df.columns = df.columns.astype("str")
col_0_np = df.iloc[:, 0].to_numpy()

for idx in tqdm(df.columns):
    df[idx] = col_0_np
  0%|          | 20/15000 [00:04<57:50,  4.32it/s] 

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.8.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : ***

pandas : 1.4.1
numpy : 1.19.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.2
setuptools : 58.0.4
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.31.1
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.3.2
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : 0.54.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.27
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None

Prior Performance

In 1.3.5, the code should complete within seconds.

@Hubedge Hubedge added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Mar 8, 2022
@Hubedge
Copy link
Author

Hubedge commented Mar 8, 2022

Works fine on both 1.4.1 and 1.3.5 with df.loc[:, idx] = col_0_np:

import numpy as np
import pandas as pd
from tqdm import tqdm

df = pd.DataFrame(np.random.random((5000,15000)))
df.columns = df.columns.astype("str")
col_0_np = df.iloc[:, 0].to_numpy()

for idx in tqdm(df.columns):
    df.loc[:, idx] = col_0_np
100%|██████████| 15000/15000 [00:04<00:00, 3246.66it/s]

@phofl
Copy link
Member

phofl commented Mar 9, 2022

The performance regression is caused by a change in behavior underneath. Before 1.4 the setitem operation wrote into the underlying numpy array in place -> if you do

na = np.random.random((2,15))
df = pd.DataFrame(na)
df.columns = df.columns.astype("str")
col_0_np = df.iloc[:, 1].to_numpy()
df[df.columns[0]] = col_0_np

na was changed after the setitem call. This makes now a copy of the underlying array, hence the slowdown. If this is not relevant for you, I would suggest using loc. I am not sure if we can speed this up easily

cc @jbrockmendel

@phofl phofl added Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves labels Mar 9, 2022
@Hubedge
Copy link
Author

Hubedge commented Mar 10, 2022

@phofl

In this case whether the original na is modified is not relevant, but I'm curious the motivation to make the change, because I would assume that people are more used to setitem than to loc, and that they assume setitem does not copy.

Can you share the link to the discussion, if any?

@phofl
Copy link
Member

phofl commented Mar 11, 2022

That might have been misleading on my side.

Setitem was supposed to make a copy while loc was supposed to operate inplace. Before 1.4 this was inconsistent based on the code paths this was running through. This should be more consistent now, hence the change above. Technically speaking, the fact that this was modifying the underlying array before 1.4 was a bug.

@Hubedge
Copy link
Author

Hubedge commented Mar 11, 2022

It's fine as long as it's by design (and it would be even better if this difference is well documented).

Let's see if @jbrockmendel has anything to comment on improving performance. Otherwise, feel free to close the issue.

@jbrockmendel
Copy link
Member

https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v1.3.0.rst#never-operate-inplace-when-setting-framekeys--values

df.loc[:, foo] = bar tries to write into the existing array(s), so should be fast (though we're still not totally consistent about this xref #45333). df[foo] = bar will never write into the existing array(s).

I think there may be a perf improvement available by changing Block.delete to not call np.delete. Not sure if that'd be relevant here.

@phofl
Copy link
Member

phofl commented Mar 11, 2022

Yeah that would help. Most of the time is spent in np.delete

@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Mar 16, 2022
@sappersapper
Copy link

sappersapper commented May 25, 2022

Another performance regression example when assigning value to df[col_names]:

import time
import pandas as pd
import numpy as np

n = 2000
columns = list(range(n))
df = pd.DataFrame(np.ones([n, n]), columns=columns)
start = time.time()
df[columns] = df[columns]
print(time.time() - start)

for pandas 1.4.2, about 7.6s
for pandas 1.3.4, about 0.2s

It is ok when assigning with loc (about 0.05s):

start = time.time()
df.loc[:, columns] = df[columns]
print(time.time() - start)

@jtilly
Copy link

jtilly commented Jun 27, 2022

I would like to point out that .assign(...) has seen a performance regression in 1.4.0 that seems to go beyond what can be explained by changes in setitem:

import pandas as pd
import numpy as np
from time import perf_counter

k = 500    # number of columns
n = 10000  # number of rows
r = 10     # number of repetitions

data = {f"col{key}": np.random.randint(0, 5, size=n) for key in range(k)}

df = pd.DataFrame(data)

print(f"{pd.__version__=}")

# use []
t0 = perf_counter()
for _ in range(r):
    for key in data:
        df[key] = data[key]
t1 = perf_counter()

print(f"[] {(t1-t0)/r:.4f}")

# use copy + .loc
t0 = perf_counter()
for _ in range(r):
    for key in data:
        df.loc[:, key] = df.loc[:, key].copy()
        df.loc[:, key] = data[key]
t1 = perf_counter()

print(f"copy + .loc {(t1-t0)/r:.4f}")

# use assign
t0 = perf_counter()
for _ in range(r):
    df = df.assign(**data)
t1 = perf_counter()

print(f"assign {(t1-t0)/r:.4f}")

Output:

pd.__version__='1.3.5'
[] 0.0203
copy + .loc 0.2112
assign 0.0329

pd.__version__='1.4.0'
[] 0.2501
copy + .loc 0.1436
assign 1.9943

pd.__version__='1.4.1'
[] 0.1976
copy + .loc 0.1464
assign 1.9205

pd.__version__='1.4.2'
[] 0.1916
copy + .loc 0.1519
assign 2.0024

pd.__version__='1.4.3'
[] 0.2191
copy + .loc 0.1347
assign 1.9111

pd.__version__='1.5.0.dev0+1030.g7d2f9b8d59'
[] 0.2060
copy + .loc 0.1269
assign 1.8481

@lbittarello
Copy link

It's also worth noting that it's difficult to use .loc when chaining operations, unlike assign, so it can be a lot clunkier.

@simonjayhawkins
Copy link
Member

removing from 1.4.x milestone

@simonjayhawkins simonjayhawkins removed this from the 1.4.4 milestone Aug 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants