
uproot.iterate throws a pandas PerformanceWarning #1070

Closed
jannisspeer opened this issue Dec 15, 2023 · 4 comments · Fixed by #1086
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@jannisspeer

I am using the latest uproot version, 5.2.0.
When reading a ROOT file with uproot.iterate into a pandas.DataFrame, a PerformanceWarning is triggered:

PerformanceWarning: DataFrame is highly fragmented.
This is usually the result of calling `frame.insert` many times, which has poor performance. 
Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

The issue seems to be in the function _pandas_memory_efficient in the file interpretation/library.py.
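For anyone reproducing this, one way to locate where the fragmented frame is built (a debugging sketch, not part of the report) is to escalate the warning into an error so it comes with a full traceback:

import warnings

import pandas as pd

# Turn pandas' PerformanceWarning into an exception so the traceback
# points at the code that inserts columns one by one.
warnings.simplefilter("error", pd.errors.PerformanceWarning)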

@jannisspeer jannisspeer added the bug (unverified) The problem described would be a bug, but needs to be triaged label Dec 15, 2023
@jannisspeer (Author) commented Dec 15, 2023

The issue only appears in ROOT files with many branches.
Here are code snippets to reproduce the problem.

  1. Produce a ROOT file with many branches:
import numpy as np
import pandas as pd
import uproot

df_dict = {}

column_names = []

for i in range(100):
    column_names.append("column_{}".format(i))

for i in column_names:
    df_dict[i] = np.random.rand(1000)

df = pd.DataFrame.from_dict(df_dict)

with uproot.recreate("test.root", compression=uproot.ZLIB(4)) as file:
    file["tree"] = df
  2. Open the ROOT file with uproot.iterate:
import uproot

for array in uproot.iterate("./test.root:tree", step_size=100, library="pandas"):
    pass

@lobis (Collaborator) commented Dec 15, 2023

Some additional information:

  • This warning also appears for uproot v5.1.0
  • It's not related to fsspec; passing handler=uproot.source.file.MemmapSource does not make a difference (a quick check follows below)
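For reference, the second check looks roughly like this (a sketch; handler is uproot's option for selecting the Source implementation):

import uproot

# Bypass fsspec by forcing the memory-mapped file source;
# the PerformanceWarning is raised either way.
for array in uproot.iterate(
    "./test.root:tree",
    step_size=100,
    library="pandas",
    handler=uproot.source.file.MemmapSource,
):
    pass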

@jpivarski (Member)

Some history of the _pandas_memory_efficient function:

def _pandas_memory_efficient(pandas, series, names):
    # Pandas copies the data, so at least feed columns one by one
    gc.collect()
    out = None
    for name in names:
        if out is None:
            if not isinstance(series[name], pandas.core.series.Series):
                out = pandas.Series(data=series[name]).to_frame(name=name)
            else:
                out = series[name].to_frame(name=name)
        else:
            out[name] = series[name]
        del series[name]
    if out is None:
        return pandas.DataFrame(data=series, columns=names)
    else:
        return out

It was added in #281 to solve #277. It was one of many different attempts to construct a pd.DataFrame, and it's not completely satisfactory because it calls gc.collect() explicitly, which we usually shouldn't do in production code. Since we already have the data in arrays, our goal is to give them to Pandas in such a way that Pandas doesn't copy, rewrite, or iterate through them, and I'm surprised at how many attempts it has taken to try to do that. There are many ways to create a pd.DataFrame; we want to find the way that takes over our arrays so that we can just hand them off.
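To illustrate the hand-off being described (a sketch, not uproot's actual code): constructing the frame once from a dict of NumPy arrays lets pandas adopt the buffers in one step instead of inserting columns one at a time.

import numpy as np
import pandas as pd

# Hypothetical stand-in for arrays that have already been read from a file
arrays = {"column_{}".format(i): np.random.rand(1000) for i in range(100)}

# One constructor call: no per-column insert, so no fragmentation warning.
# copy=False asks pandas to reuse the existing NumPy buffers where it can.
df = pd.DataFrame(arrays, copy=False)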

Originally, that function was for a special case, but in #734, it became the only way. (Kush removed the complex code that "exploded" ragged arrays into columns with pd.MultiIndex in favor of using awkward-pandas, and now _pandas_memory_efficient is the only path that creates a pd.DataFrame, I believe.)

@ioanaif is investigating pd.DataFrame construction.

@ioanaif (Collaborator) commented Jan 18, 2024

Appending to a dataframe in pandas has O(n^2) complexity because each iteration creates a new dataframe and copies the data over. Thus the call to _pandas_memory_efficient is not needed, and indeed it raises a performance warning when many columns are involved.

With pandas 2.1.4, the best course of action is to build the dataframe directly from a dictionary of arrays and a list of column names. We already build these two in _pandas_only_series; see the sketch below.
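In code, that construction looks like this (a sketch; names and arrays stand in for what _pandas_only_series produces):

import numpy as np
import pandas as pd

# Assumed stand-ins for the outputs of _pandas_only_series
names = ["x", "y"]
arrays = {"x": np.arange(5), "y": np.random.rand(5)}

# Build the DataFrame in a single call instead of column-by-column inserts
df = pd.DataFrame(data=arrays, columns=names)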

Enlarging a dataframe through

df.loc[len(df)] = new_row

or

df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

is much slower than constructing a dataframe once.
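A rough timing sketch of that comparison (illustrative; absolute numbers vary by machine):

import timeit

import pandas as pd

rows = [{"x": i, "y": float(i)} for i in range(1000)]

def grow_with_concat():
    # Re-create and copy the frame on every iteration: O(n^2) overall
    df = pd.DataFrame([rows[0]])
    for row in rows[1:]:
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    return df

def build_once():
    # Single construction from all rows at once
    return pd.DataFrame(rows)

print(timeit.timeit(grow_with_concat, number=3))
print(timeit.timeit(build_once, number=3))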
