PERF: performance regression in assigning value to df[col_name] #46267

Hubedge · 2022-03-08T12:55:57Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
from tqdm import tqdm

df = pd.DataFrame(np.random.random((5000,15000)))
df.columns = df.columns.astype("str")
col_0_np = df.iloc[:, 0].to_numpy()

for idx in tqdm(df.columns):
    df[idx] = col_0_np

  0%|          | 20/15000 [00:04<57:50,  4.32it/s]

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.8.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : ***

pandas : 1.4.1
numpy : 1.19.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.2
setuptools : 58.0.4
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.31.1
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.3.2
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : 0.54.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.27
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None

Prior Performance

In 1.3.5, the code should complete within seconds.

The text was updated successfully, but these errors were encountered:

Hubedge · 2022-03-08T13:00:24Z

Works fine on both 1.4.1 and 1.3.5 with df.loc[:, idx] = col_0_np:

import numpy as np
import pandas as pd
from tqdm import tqdm

df = pd.DataFrame(np.random.random((5000,15000)))
df.columns = df.columns.astype("str")
col_0_np = df.iloc[:, 0].to_numpy()

for idx in tqdm(df.columns):
    df.loc[:, idx] = col_0_np

100%|██████████| 15000/15000 [00:04<00:00, 3246.66it/s]

phofl · 2022-03-09T21:49:00Z

The performance regression is caused by a change in behavior underneath. Before 1.4 the setitem operation wrote into the underlying numpy array in place -> if you do

na = np.random.random((2,15))
df = pd.DataFrame(na)
df.columns = df.columns.astype("str")
col_0_np = df.iloc[:, 1].to_numpy()
df[df.columns[0]] = col_0_np

na was changed after the setitem call. This makes now a copy of the underlying array, hence the slowdown. If this is not relevant for you, I would suggest using loc. I am not sure if we can speed this up easily

cc @jbrockmendel

Hubedge · 2022-03-10T17:31:58Z

@phofl

In this case whether the original na is modified is not relevant, but I'm curious the motivation to make the change, because I would assume that people are more used to setitem than to loc, and that they assume setitem does not copy.

Can you share the link to the discussion, if any?

phofl · 2022-03-11T11:14:28Z

That might have been misleading on my side.

Setitem was supposed to make a copy while loc was supposed to operate inplace. Before 1.4 this was inconsistent based on the code paths this was running through. This should be more consistent now, hence the change above. Technically speaking, the fact that this was modifying the underlying array before 1.4 was a bug.

Hubedge · 2022-03-11T13:22:49Z

It's fine as long as it's by design (and it would be even better if this difference is well documented).

Let's see if @jbrockmendel has anything to comment on improving performance. Otherwise, feel free to close the issue.

jbrockmendel · 2022-03-11T21:22:46Z

https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v1.3.0.rst#never-operate-inplace-when-setting-framekeys--values

df.loc[:, foo] = bar tries to write into the existing array(s), so should be fast (though we're still not totally consistent about this xref #45333). df[foo] = bar will never write into the existing array(s).

I think there may be a perf improvement available by changing Block.delete to not call np.delete. Not sure if that'd be relevant here.

phofl · 2022-03-11T21:25:37Z

Yeah that would help. Most of the time is spent in np.delete

sappersapper · 2022-05-25T01:16:45Z

Another performance regression example when assigning value to df[col_names]:

import time
import pandas as pd
import numpy as np

n = 2000
columns = list(range(n))
df = pd.DataFrame(np.ones([n, n]), columns=columns)

start = time.time()
df[columns] = df[columns]
print(time.time() - start)

for pandas 1.4.2, about 7.6s
for pandas 1.3.4, about 0.2s

It is ok when assigning with loc (about 0.05s):

start = time.time()
df.loc[:, columns] = df[columns]
print(time.time() - start)

jtilly · 2022-06-27T19:01:15Z

I would like to point out that .assign(...) has seen a performance regression in 1.4.0 that seems to go beyond what can be explained by changes in setitem:

import pandas as pd
import numpy as np
from time import perf_counter

k = 500    # number of columns
n = 10000  # number of rows
r = 10     # number of repetitions

data = {f"col{key}": np.random.randint(0, 5, size=n) for key in range(k)}

df = pd.DataFrame(data)

print(f"{pd.__version__=}")

# use []
t0 = perf_counter()
for _ in range(r):
    for key in data:
        df[key] = data[key]
t1 = perf_counter()

print(f"[] {(t1-t0)/r:.4f}")

# use copy + .loc
t0 = perf_counter()
for _ in range(r):
    for key in data:
        df.loc[:, key] = df.loc[:, key].copy()
        df.loc[:, key] = data[key]
t1 = perf_counter()

print(f"copy + .loc {(t1-t0)/r:.4f}")

# use assign
t0 = perf_counter()
for _ in range(r):
    df = df.assign(**data)
t1 = perf_counter()

print(f"assign {(t1-t0)/r:.4f}")

Output:

pd.__version__='1.3.5'
[] 0.0203
copy + .loc 0.2112
assign 0.0329

pd.__version__='1.4.0'
[] 0.2501
copy + .loc 0.1436
assign 1.9943

pd.__version__='1.4.1'
[] 0.1976
copy + .loc 0.1464
assign 1.9205

pd.__version__='1.4.2'
[] 0.1916
copy + .loc 0.1519
assign 2.0024

pd.__version__='1.4.3'
[] 0.2191
copy + .loc 0.1347
assign 1.9111

pd.__version__='1.5.0.dev0+1030.g7d2f9b8d59'
[] 0.2060
copy + .loc 0.1269
assign 1.8481

lbittarello · 2022-07-01T13:18:26Z

It's also worth noting that it's difficult to use .loc when chaining operations, unlike assign, so it can be a lot clunkier.

simonjayhawkins · 2022-08-30T12:41:06Z

removing from 1.4.x milestone

Hubedge added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Mar 8, 2022

phofl added Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves labels Mar 9, 2022

mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Mar 16, 2022

phofl mentioned this issue Jun 16, 2022

PERF: DataFrame.update of pandas version is slower than the older version #47392

Closed

2 tasks

warm200 mentioned this issue Jun 17, 2022

PERF: Performance improvement on dataframe.update. #47407

Closed

2 tasks

simonjayhawkins added this to the 1.4.3 milestone Jun 21, 2022

simonjayhawkins mentioned this issue Jun 21, 2022

Fix segfault with printing dataframe #47097

Closed

4 tasks

simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 22, 2022

simonjayhawkins removed this from the 1.4.4 milestone Aug 30, 2022

phofl mentioned this issue Dec 9, 2022

PERF: Split blocks in blk.delete #50148

Merged

5 tasks

mroeschke closed this as completed in #50148 Jan 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: performance regression in assigning value to df[col_name] #46267

PERF: performance regression in assigning value to df[col_name] #46267

Hubedge commented Mar 8, 2022

INSTALLED VERSIONS

Hubedge commented Mar 8, 2022 •

edited

Loading

phofl commented Mar 9, 2022

Hubedge commented Mar 10, 2022 •

edited

Loading

phofl commented Mar 11, 2022

Hubedge commented Mar 11, 2022

jbrockmendel commented Mar 11, 2022

phofl commented Mar 11, 2022

sappersapper commented May 25, 2022 •

edited

Loading

jtilly commented Jun 27, 2022

lbittarello commented Jul 1, 2022

simonjayhawkins commented Aug 30, 2022

PERF: performance regression in assigning value to df[col_name] #46267

PERF: performance regression in assigning value to df[col_name] #46267

Comments

Hubedge commented Mar 8, 2022

Pandas version checks

Reproducible Example

Installed Versions

INSTALLED VERSIONS

Prior Performance

Hubedge commented Mar 8, 2022 • edited Loading

phofl commented Mar 9, 2022

Hubedge commented Mar 10, 2022 • edited Loading

phofl commented Mar 11, 2022

Hubedge commented Mar 11, 2022

jbrockmendel commented Mar 11, 2022

phofl commented Mar 11, 2022

sappersapper commented May 25, 2022 • edited Loading

jtilly commented Jun 27, 2022

lbittarello commented Jul 1, 2022

simonjayhawkins commented Aug 30, 2022

Hubedge commented Mar 8, 2022 •

edited

Loading

Hubedge commented Mar 10, 2022 •

edited

Loading

sappersapper commented May 25, 2022 •

edited

Loading