
uproot.iterate throws a pandas PerformanceWarning #1070

Closed
jannisspeer opened this issue Dec 15, 2023 · 4 comments · Fixed by #1086
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@jannisspeer

I am using the latest uproot version, 5.2.0.
When reading a ROOT file with uproot.iterate into a pandas.DataFrame, a PerformanceWarning is triggered:

PerformanceWarning: DataFrame is highly fragmented.
This is usually the result of calling `frame.insert` many times, which has poor performance. 
Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

The issue seems to be in the function _pandas_memory_efficient in the file interpretation/library.py.
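For anyone reproducing this, one way to locate where the fragmented frame is built (a debugging sketch, not part of the report) is to escalate the warning into an error so it comes with a full traceback:

import warnings

import pandas as pd

# Turn pandas' PerformanceWarning into an exception so the traceback
# points at the code that inserts columns one by one.
warnings.simplefilter("error", pd.errors.PerformanceWarning)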

@jannisspeer jannisspeer added the bug (unverified) The problem described would be a bug, but needs to be triaged label Dec 15, 2023
@jannisspeer (Author) commented Dec 15, 2023

The issue only appears in ROOT files with many branches.
Here are code snippets to reproduce the problem.

  1. Produce a ROOT file with many branches:
import numpy as np
import pandas as pd
import uproot

df_dict = {}

column_names = []

for i in range(100):
    column_names.append("column_{}".format(i))

for i in column_names:
    df_dict[i] = np.random.rand(1000)

df = pd.DataFrame.from_dict(df_dict)

with uproot.recreate("test.root", compression=uproot.ZLIB(4)) as file:
    file["tree"] = df
  2. Open the ROOT file with uproot.iterate:
import uproot

for array in uproot.iterate("./test.root:tree", step_size=100, library="pandas"):
    pass

@lobis (Collaborator) commented Dec 15, 2023

Some additional information:

  • This warning also appears for uproot v5.1.0
  • It's not related to fsspec; passing handler=uproot.source.file.MemmapSource does not make a difference (a quick check follows below)
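For reference, the second check looks roughly like this (a sketch; handler is uproot's option for selecting the Source implementation):

import uproot

# Bypass fsspec by forcing the memory-mapped file source;
# the PerformanceWarning is raised either way.
for array in uproot.iterate(
    "./test.root:tree",
    step_size=100,
    library="pandas",
    handler=uproot.source.file.MemmapSource,
):
    pass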

@jpivarski (Member)

Some history of the _pandas_memory_efficient function:

def _pandas_memory_efficient(pandas, series, names):
    # Pandas copies the data, so at least feed columns one by one
    gc.collect()
    out = None
    for name in names:
        if out is None:
            if not isinstance(series[name], pandas.core.series.Series):
                out = pandas.Series(data=series[name]).to_frame(name=name)
            else:
                out = series[name].to_frame(name=name)
        else:
            out[name] = series[name]
        del series[name]
    if out is None:
        return pandas.DataFrame(data=series, columns=names)
    else:
        return out

It was added in #281 to solve #277. It was one of many different attempts to construct a pd.DataFrame, and it's not completely satisfactory because it calls gc.collect() explicitly, which we usually shouldn't do in production code. Since we already have the data in arrays, our goal is to give them to Pandas in such a way that Pandas doesn't copy, rewrite, or iterate through them, and I'm surprised at how many attempts it has taken to try to do that. There are many ways to create a pd.DataFrame; we want to find the way that takes over our arrays so that we can just hand them off.
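To illustrate the hand-off being described (a sketch, not uproot's actual code): constructing the frame once from a dict of NumPy arrays lets pandas adopt the buffers in one step instead of inserting columns one at a time.

import numpy as np
import pandas as pd

# Hypothetical stand-in for arrays that have already been read from a file
arrays = {"column_{}".format(i): np.random.rand(1000) for i in range(100)}

# One constructor call: no per-column insert, so no fragmentation warning.
# copy=False asks pandas to reuse the existing NumPy buffers where it can.
df = pd.DataFrame(arrays, copy=False)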

Originally, that function was for a special case, but in #734, it became the only way. (Kush removed the complex code that "exploded" ragged arrays into columns with pd.MultiIndex in favor of using awkward-pandas, and now _pandas_memory_efficient is the only path that creates a pd.DataFrame, I believe.)

@ioanaif is investigating pd.DataFrame construction.

@ioanaif (Collaborator) commented Jan 18, 2024

Appending to a dataframe in pandas has O(n^2) complexity because each iteration creates a new dataframe and copies the data over. Thus the call to _pandas_memory_efficient is not needed, and indeed it raises a performance warning when many columns are involved.

With pandas 2.1.4, the best course of action is to build the dataframe directly from a dictionary of arrays and a list of column names. We already build these two in _pandas_only_series; see the sketch below.
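In code, that construction looks like this (a sketch; names and arrays stand in for what _pandas_only_series produces):

import numpy as np
import pandas as pd

# Assumed stand-ins for the outputs of _pandas_only_series
names = ["x", "y"]
arrays = {"x": np.arange(5), "y": np.random.rand(5)}

# Build the DataFrame in a single call instead of column-by-column inserts
df = pd.DataFrame(data=arrays, columns=names)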

Enlarging a dataframe through

df.loc[len(df)] = new_row

or

df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

is much slower than constructing a dataframe once.
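A rough timing sketch of that comparison (illustrative; absolute numbers vary by machine):

import timeit

import pandas as pd

rows = [{"x": i, "y": float(i)} for i in range(1000)]

def grow_with_concat():
    # Re-create and copy the frame on every iteration: O(n^2) overall
    df = pd.DataFrame([rows[0]])
    for row in rows[1:]:
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    return df

def build_once():
    # Single construction from all rows at once
    return pd.DataFrame(rows)

print(timeit.timeit(grow_with_concat, number=3))
print(timeit.timeit(build_once, number=3))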
